Translated from: Some important PyTorch tasks – A concise summary from a vision researcher
I used to enjoy working in Keras, but since picking up PyTorch I have been hooked. I recently came across a fairly practical PyTorch guide online, so I took some time to translate it, partly for my own learning.
Before we begin
import torch.nn as nn
import torch
from torch.autograd.variable import Variable
from torchvision import datasets, models, transforms
model = models.resnet18(pretrained=True)  # load ImageNet-pretrained weights, since Section 1 fine-tunes them
Section 1: Fine-tuning a pretrained ResNet
Let's first look at the layers of the ResNet model and then decide which ones to fine-tune. The point of starting from a pretrained model is that we want to keep the parameters of most layers fixed (note: usually only the later fully connected layers are optimized). Fine-tuning simply means taking a model pretrained on a large dataset (note: in computer vision, typically ImageNet) and continuing to train it on our target dataset. We could also skip the pretrained weights entirely, but that amounts to reinventing the wheel, as I explain below.
Suppose I want to train a model that distinguishes cars from bicycles. I could collect a large set of relevant images and train a model from scratch. However, a lot of prior work already provides good models for distinguishing dogs, cats, and people. None of these look like a car or a bicycle, but as the old saying goes, something is better than nothing. We can stand on the shoulders of those models to train our car-vs-bicycle classifier.
Why? Two reasons:
- Training is faster
- Far less data is needed
If you are interested in transfer learning, see http://cs231n.github.io/transfer-learning
Now, let's look inside the ResNet-18 model using the .children() method.
child_counter = 0
for child in model.children():
    print(" child", child_counter, "is -")
    print(child)
    child_counter += 1
Output:
child 0 is -
Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
child 1 is -
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
child 2 is -
ReLU (inplace)
child 3 is -
MaxPool2d (size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
child 4 is -
Sequential (
  (0): BasicBlock (
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  )
  (1): BasicBlock (
    (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  )
)
child 5 is -
Sequential (
  (0): BasicBlock (
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (downsample): Sequential (
      (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  (1): BasicBlock (
    (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  )
)
child 6 is -
Sequential (
  (0): BasicBlock (
    (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (downsample): Sequential (
      (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  (1): BasicBlock (
    (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
  )
)
child 7 is -
Sequential (
  (0): BasicBlock (
    (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (downsample): Sequential (
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  (1): BasicBlock (
    (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu): ReLU (inplace)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
  )
)
child 8 is -
AvgPool2d (
)
child 9 is -
Linear (512 -> 1000)
Next, we use the .parameters() method to access the parameters of each layer. Every parameter has a .requires_grad attribute that decides whether it stays fixed or gets trained (the default is True, meaning the parameter is updated on every optimization step; setting it to False freezes it).
for child in model.children():
    for param in child.parameters():
        print("This is what a parameter looks like - \n", param)
        break
    break
Output:
This is what a parameter looks like -
Parameter containing:
(0 ,0 ,.,.) =
1.8160e-02 2.1680e-02 5.6358e-02 ... -1.2987e-02 -6.1262e-02 -4.8870e-02
2.6440e-02 1.0603e-02 1.9794e-02 ... -4.2643e-02 -4.5565e-03 -4.8300e-02
9.0205e-03 1.9536e-03 1.9925e-04 ... 1.1413e-02 1.1395e-02 2.8418e-03
... ⋱ ...
-2.4830e-02 8.1022e-03 -4.9934e-02 ... 2.2573e-02 1.6346e-02 3.9666e-02
-2.3857e-02 -1.6275e-02 2.9058e-02 ... 3.0488e-02 2.0294e-02 -5.1073e-03
-1.6848e-04 5.9266e-02 -5.8456e-03 ... 1.9757e-02 -7.8441e-02 1.3667e-02
(0 ,1 ,.,.) =
-1.6319e-02 3.3193e-02 -2.2146e-04 ... 1.2571e-03 -1.3313e-02 -4.7580e-02
-4.9329e-02 3.2548e-02 5.4202e-03 ... -4.5771e-02 -2.6863e-03 -3.6992e-03
8.7714e-03 2.4772e-02 1.0026e-02 ... 1.6512e-02 -7.4382e-03 6.0990e-02
... ⋱ ...
-4.0751e-02 3.3605e-04 -2.1426e-02 ... 1.1318e-02 -1.5222e-04 -3.5020e-02
-4.1432e-02 -9.1312e-03 -1.7572e-02 ... 1.6974e-03 5.9792e-03 1.2868e-02
-4.4471e-02 -1.1013e-02 4.9902e-03 ... -2.1241e-02 2.2371e-02 -2.1672e-02
(0 ,2 ,.,.) =
1.0826e-02 -4.4230e-02 -1.5594e-02 ... -1.3197e-03 6.1211e-03 -1.6262e-02
-1.3989e-02 -3.2357e-02 2.0250e-02 ... 7.5012e-03 2.8761e-04 -2.1318e-02
-7.8574e-04 1.7702e-02 1.0301e-02 ... -2.0074e-02 4.4735e-02 1.0149e-02
... ⋱ ...
-2.4707e-02 2.3952e-03 6.5615e-04 ... 4.4371e-02 -1.0678e-02 2.3425e-02
-2.4330e-02 1.3018e-02 1.1473e-02 ... -3.6666e-03 -2.1145e-02 -1.5511e-02
-3.0876e-02 -1.6071e-02 -2.4506e-02 ... 2.7417e-03 6.2566e-03 1.6208e-02
⋮
⋮
[torch.FloatTensor of size 64x3x7x7]
Clearly, training involves a lot of computation over these parameters. Now, if we freeze the first six children (and, in the code below, also the first block of child 6), training gets a noticeable speed-up.
child_counter = 0
for child in model.children():
    if child_counter < 6:
        print("child ", child_counter, " was frozen")
        for param in child.parameters():
            param.requires_grad = False
    elif child_counter == 6:
        children_of_child_counter = 0
        for children_of_child in child.children():
            if children_of_child_counter < 1:
                for param in children_of_child.parameters():
                    param.requires_grad = False
                print('child ', children_of_child_counter, 'of child', child_counter, ' was frozen')
            else:
                print('child ', children_of_child_counter, 'of child', child_counter, ' was not frozen')
            children_of_child_counter += 1
    else:
        print("child ", child_counter, " was not frozen")
    child_counter += 1
Output:
child 0 was frozen
child 1 was frozen
child 2 was frozen
child 3 was frozen
child 4 was frozen
child 5 was frozen
child 0 of child 6 was frozen
child 1 of child 6 was not frozen
child 7 was not frozen
child 8 was not frozen
child 9 was not frozen
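To double-check which parameters will actually be updated, a small sketch like the following counts trainable versus total parameters (p.numel() just returns the number of elements in a parameter tensor):

# Sanity-check sketch: count trainable vs. total parameters after freezing
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_total = sum(p.numel() for p in model.parameters())
print("trainable parameters:", num_trainable, "out of", num_total)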
Important note
Now that part of the pretrained network is frozen, the last thing to get right is the optimizer. The optimizer is what updates the model's parameters, and we would usually write it like this:
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.1)
However, this raises an error once some of the model's parameters no longer require gradients. The correct way is to pass the optimizer only the parameters that are still trainable:
optimizer = torch.optim.RMSprop(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
Section 2: Saving and loading models
There are two ways to save a model in PyTorch. The recommended one is to save "state dictionaries", which is faster and uses less space. A state dict holds only the parameter values, not the model's structure, so you must re-create the architecture and then load the parameters into it.
# Let's assume we will save/load from a path MODEL_PATH
# Saving a Model
torch.save(model.state_dict(), MODEL_PATH)
# Loading the model.
# First create a model and define its architecture as done above in this notebook.
# If you want a custom architecture, that case is covered further below.
checkpoint = torch.load(MODEL_PATH)
model.load_state_dict(checkpoint)
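The other way is to save the entire model object; a minimal sketch, using the same MODEL_PATH, would look like this. It is slower, takes more disk space, and ties the saved file to the exact class definition, which is why state dicts are preferred:

# Alternative: serialize the whole model (architecture + parameters) in one go
torch.save(model, MODEL_PATH)
# ... and load it back without re-creating the architecture first
model = torch.load(MODEL_PATH)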
Section 3: Modifying, deleting, or adding the last layer
Unlike Keras, PyTorch does not provide a .pop() method to remove the last layer. Let's look at how to do this in PyTorch.
Modifying the last layer
# Load the model
model = models.resnet18(pretrained = False)
# Get the number of input features of the last layer;
# we need this to re-create the final layer.
num_final_in = model.fc.in_features
# The final layer of the model is model.fc so we can basically just overwrite it
# to have the output = number of classes we need. Say, 300 classes.
NUM_CLASSES = 300
model.fc = nn.Linear(num_final_in, NUM_CLASSES)
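To confirm the new head is wired up correctly, a quick forward pass on a dummy batch should produce NUM_CLASSES outputs (the 224x224 input size below is just the usual ResNet default, assumed here for illustration):

# Push a dummy image through the modified network (sketch)
dummy = Variable(torch.randn(1, 3, 224, 224))
out = model(dummy)
print(out.size())  # expected: torch.Size([1, 300])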
Deleting the last layer (usually when you need the features produced by the layer before it)
# Load the model
model = models.resnet18(pretrained = False)
We can use model.children() to get the model's layers. After converting them to a list, we can use list slicing to drop the last layer, and PyTorch's nn.Sequential() then wraps the remaining layers back into a model.
new_model = nn.Sequential(*list(model.children())[:-1])
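The truncated model can then be used as a feature extractor; a small sketch (again assuming 224x224 inputs) would be:

# Extract the 512-dimensional pooled features from the truncated ResNet-18 (sketch)
dummy = Variable(torch.randn(1, 3, 224, 224))
features = new_model(dummy)                      # shape: (1, 512, 1, 1)
features = features.view(features.size(0), -1)   # flatten to (1, 512)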
Adding layers
This is covered in the next section – creating custom models.
Section 4: Custom models – combining Sections 1-3, adding layers on top of a model
Let's define a typical custom model. As described above, part of its parameters come from a pretrained (and partly frozen) model, and part are trained from scratch. The example below should make this clear.
import torch.nn as nn
import math
import torch.utils.model_zoo as model_zoo
import torch
from torch.autograd.variable import Variable
from torchvision import datasets, models, transforms
# New models are defined as classes.
# Then, when we want to create a model,
# we create an object instantiating this class.
class Resnet_Added_Layers_Half_Frozen(nn.Module):
    def __init__(self, LOAD_VIS_URL=None):
        super(Resnet_Added_Layers_Half_Frozen, self).__init__()

        # Start with the resnet model and swap out the final layer,
        # because that's the model we had defined above.
        model = models.resnet18(pretrained=False)
        num_final_in = model.fc.in_features
        model.fc = nn.Linear(num_final_in, 300)

        # Now that the architecture is defined the same as above,
        # let's load the model we would have trained above.
        checkpoint = torch.load(MODEL_PATH)
        model.load_state_dict(checkpoint)

        # Let's freeze the same layers as above.
        # Same code as above, without the print statements.
        child_counter = 0
        for child in model.children():
            if child_counter < 6:
                for param in child.parameters():
                    param.requires_grad = False
            elif child_counter == 6:
                children_of_child_counter = 0
                for children_of_child in child.children():
                    if children_of_child_counter < 1:
                        for param in children_of_child.parameters():
                            param.requires_grad = False
                    children_of_child_counter += 1
            child_counter += 1
        # Now, let's define new layers that we want to add on top.
        # Basically, these are just objects we define here.
        # The "adding on top" is defined by the forward() function,
        # which decides the flow of the input data through the model.
        # NOTE - Even the above model needs to be assigned to self.
        # We drop the final fc layer (as in Section 3), so that the
        # 512-dimensional pooled features feed into the new layers below.
        self.vismodel = nn.Sequential(*list(model.children())[:-1])
        self.projective = nn.Linear(512, 400)
        self.nonlinearity = nn.ReLU(inplace=True)
        self.projective2 = nn.Linear(400, 300)

    # The forward function defines the flow of the input data
    # and thus decides which layer/chunk goes on top of what.
    def forward(self, x):
        x = self.vismodel(x)
        x = torch.squeeze(x)
        x = self.projective(x)
        x = self.nonlinearity(x)
        x = self.projective2(x)
        return x
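A short usage sketch follows. It assumes the checkpoint saved at MODEL_PATH in Section 2 exists, and uses a batch size greater than 1 so that torch.squeeze does not collapse the batch dimension; the optimizer is built with the filter trick from Section 1 so only the unfrozen parameters are updated:

# Instantiate the custom model and train only the unfrozen parameters (sketch)
model = Resnet_Added_Layers_Half_Frozen()
optimizer = torch.optim.RMSprop(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
dummy = Variable(torch.randn(4, 3, 224, 224))
out = model(dummy)
print(out.size())  # expected: torch.Size([4, 300])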
Section 5: Custom loss functions
A loss function quantifies how far our current model is from the model we would like to have. Sometimes we need to define our own loss function; here is how.
- Define it as a class that inherits from torch.nn.Module, just like a model.
- Use view() to reshape the inputs.
- Use unsqueeze() to add a dimension to a tensor.
- The value returned by the loss function must be a scalar, not a vector or tensor.
- The return value must be a Variable, so it can be used to update the parameters. The easiest way to ensure this is to make sure both x and y are Variables.
Here is an example called Regress_Loss. It takes two inputs x and y of different shapes, reshapes x to match y, and returns the L2 distance between them as the loss. Once you understand this example, defining other loss functions is straightforward.
For example: x has shape (5, 10) and y has shape (5, 5, 10), so we add a dimension to x to match y; (x - y) then has shape (5, 5, 10). Summing the squared differences over all dimensions gives a single scalar loss.
class Regress_Loss(torch.nn.Module):
    def __init__(self):
        super(Regress_Loss, self).__init__()

    def forward(self, x, y):
        # x: (batch, dim), y: (batch, n, dim)
        y_shape = y.size()[1]
        # add a dimension to x and repeat it so that it matches y's shape
        x_added_dim = x.unsqueeze(1)
        x_stacked_along_dimension1 = x_added_dim.repeat(1, y_shape, 1)
        # squared L2 distance along the feature dimension, then summed to a scalar
        diff = torch.sum((y - x_stacked_along_dimension1) ** 2, 2)
        totloss = torch.sum(torch.sum(torch.sum(diff)))
        return totloss
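Using it then looks like any other PyTorch loss; a quick sketch with the shapes from the example above:

# Dummy inputs matching the shapes discussed above (sketch)
loss_fn = Regress_Loss()
x = Variable(torch.randn(5, 10))
y = Variable(torch.randn(5, 5, 10))
loss = loss_fn(x, y)
print(loss)  # a single scalar value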