Yesterday I was training an LSTM model: everything looked fine on the training set, but the validation results were way off. The culprit turned out to be using model.eval() together with with torch.no_grad(); removing model.eval() made it work, which left me completely baffled. I clearly hadn't sorted out how a PyTorch training script should be structured and kept stepping on rakes (silly me), so first, here is the code structure that finally runs:
import torch
import torch.optim as optim
import torch.nn as nn

# from my code
from model import Mymodel
from dataloader import MydataLoader
from args import get_args

args = get_args()

# load data
train_loader = MydataLoader(args.train_file, args.gpu)
valid_loader = MydataLoader(args.valid_file, args.gpu)

model = Mymodel(args)  # args doubles as the model config here
if torch.cuda.is_available():
    model.cuda()

# show model parameters
for name, param in model.named_parameters():
    print(name, param.size())

criterion = nn.MarginRankingLoss(args.loss_margin)  # max-margin ranking loss
optimizer = optim.Adam(model.parameters(), lr=args.lr)

iterations = 0
best_dev_acc = 0
early_stop = False

for epoch in range(1, args.epochs + 1):
    if early_stop:
        print("Early stopping. Epoch: {}, Best Dev. Acc: {}".format(epoch, best_dev_acc))
        break
    n_correct, n_total = 0, 0
    losses = []
    model.train()  # training mode: Dropout and BatchNorm behave as in training
    for batch_idx, batch in enumerate(train_loader.next_batch()):
        iterations += 1
        ques, rels, neg_rels = batch
        neg_size = neg_rels.size(1)
        model.zero_grad()
        # optimizer.zero_grad()  # equivalent here, see question 2 below
        pos_score, neg_score = model(ques, rels, neg_rels, is_train=True)
        n_correct += (torch.sum(torch.gt(pos_score, neg_score), 1).data == neg_size).sum().item()
        n_total += len(ques)
        train_acc = 100. * n_correct / n_total
        ones = torch.ones(neg_score.size(0), neg_score.size(1)).cuda(args.gpu)
        loss = criterion(pos_score, neg_score, ones)
        losses.append(loss.item())
        loss.backward()
        # clip the gradient
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip_gradient)
        optimizer.step()
        if iterations % args.dev_every == 0:
            # model.eval()  # <- the line that broke my validation, see question 3
            with torch.no_grad():
                dev_acc = 0
                n_dev_correct = 0
                n_dev_total = 0
                for valid_batch_idx, valid_batch in enumerate(valid_loader.next_batch()):
                    val_ques, val_rels, val_neg_rels = valid_batch
                    val_neg_size = val_neg_rels.size(1)
                    val_ps, val_ns = model(val_ques, val_rels, val_neg_rels, is_train=True)
                    n_dev_correct += (torch.sum(torch.gt(val_ps, val_ns), 1).data == val_neg_size).sum().item()
                    n_dev_total += len(val_ques)
                print("n_dev_correct, n_dev_total:", n_dev_correct, n_dev_total)
                dev_acc = 100. * n_dev_correct / n_dev_total
                # (early-stopping bookkeeping on dev_acc / best_dev_acc omitted here)
The things I never quite got straight are these three questions:
1. Where exactly do model.train() and model.eval() go?
2. What is the difference between optimizer.zero_grad() and model.zero_grad()?
3. If I already use with torch.no_grad(), do I still need model.eval()?
Question 1: I don't run validation once per epoch, but every fixed number of iterations (args.dev_every):

for epoch in range(1, args.epochs + 1):
    n_correct, n_total = 0, 0
    losses = []
    model.train()
    for batch_idx, batch in enumerate(train_loader.next_batch()):
        # some people prefer to call model.train() here instead; that works too
        # model.train()
        ...
        # this model.eval() is what broke my validation (see question 3 for why)
        # model.eval()
        if iterations % args.dev_every == 0:
            with torch.no_grad():
                ...  # dev loop as in the full code above
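For reference, the standard pattern (which makes sense once the model itself is correct; see the update at the end) is a small helper that switches to eval mode, scores under no_grad, and switches back to train mode afterwards. A minimal sketch, assuming the same Mymodel / MydataLoader interfaces as in the code above:

def evaluate(model, valid_loader):
    model.eval()                      # Dropout off, BatchNorm uses running stats
    n_dev_correct, n_dev_total = 0, 0
    with torch.no_grad():             # no autograd graph needed just for scoring
        for valid_batch in valid_loader.next_batch():
            val_ques, val_rels, val_neg_rels = valid_batch
            val_neg_size = val_neg_rels.size(1)
            val_ps, val_ns = model(val_ques, val_rels, val_neg_rels, is_train=True)
            n_dev_correct += (torch.sum(torch.gt(val_ps, val_ns), 1) == val_neg_size).sum().item()
            n_dev_total += len(val_ques)
    model.train()                     # back to training mode for the next batch
    return 100. * n_dev_correct / n_dev_total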
Question 2: quoting an explanation I found online:

If optimizer = optim.Optimizer(net.parameters()), they are the same.
There might be use cases where you would like to use different optimizers for different parts of your model. In such a case, model.zero_grad() would clear all parameters of the model, while the optimizerX.zero_grad() call will just clean the gradients of the parameters that were passed to it.
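To make the quoted explanation concrete, here is a self-contained toy check (the two-layer net and the names opt_a / opt_b are made up for illustration): each optimizer's zero_grad() only clears the gradients of the parameters it was constructed with, while net.zero_grad() clears them all.

import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 1))
opt_a = optim.Adam(net[0].parameters(), lr=1e-3)   # optimizes the first layer only
opt_b = optim.Adam(net[1].parameters(), lr=1e-3)   # optimizes the second layer only

def cleared(p):
    # depending on the PyTorch version, zero_grad() sets .grad to None or to zeros
    return p.grad is None or p.grad.abs().sum().item() == 0.0

net(torch.randn(2, 4)).sum().backward()            # populate .grad on both layers
opt_a.zero_grad()                                  # clears only the first layer
print(cleared(net[0].weight), cleared(net[1].weight))   # True False

net(torch.randn(2, 4)).sum().backward()            # repopulate the gradients
net.zero_grad()                                    # clears every parameter of net
print(cleared(net[0].weight), cleared(net[1].weight))   # True True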
Question 3:

They do different things, and have different scopes.
with torch.no_grad disables tracking of gradients in autograd.
model.eval() changes the forward() behaviour of the module it is called upon, e.g. it disables dropout and has batch norm use the entire population statistics.
with torch.no_grad
The torch.autograd.no_grad documentation says:
Context-manager that disabled [sic] gradient calculation.
Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True. In this mode, the result of every computation will have requires_grad=False, even when the inputs have requires_grad=True.
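A quick check of the quoted behaviour:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
print(y.requires_grad)        # True: the multiplication is tracked by autograd

with torch.no_grad():
    z = x * 2
print(z.requires_grad)        # False: no graph was built, z.backward() would fail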
model.eval()
The nn.Module.eval documentation says:
Sets the module in evaluation mode.
This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
Reading this far, it finally clicked: my model uses Dropout and BatchNorm, and calling model.eval() switches both of them into inference behaviour. Still, that alone shouldn't make the results differ this drastically, so I need to take a closer look at the model itself.
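The effect is easy to reproduce in isolation with a toy Dropout and BatchNorm layer (this is not my actual model, just an illustration of what eval() changes):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

# training mode (the default): Dropout zeroes roughly half the entries and
# scales the survivors by 2; BatchNorm normalizes with this batch's statistics
print(drop(x)[0])
print(bn(x).mean().item(), bn(x).std().item())   # close to 0 and 1

# evaluation mode: Dropout becomes the identity; BatchNorm switches to the
# running statistics it accumulated during training
drop.eval(); bn.eval()
print(torch.equal(drop(x), x))                   # True
print(bn(x).mean().item(), bn(x).std().item())   # no longer forced to 0 / 1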
---- Update ----
Found the model bug! My loss is a max-margin ranking loss, i.e. the training objective is to push the positive sample's score as far above the negative samples' scores as possible. In my model the positive and negative samples did not share an LSTM encoder: the positive relation went through one LSTM and the negative relations through another. The easiest thing for the model to learn was therefore to make the positive-side LSTM produce higher scores than the negative-side LSTM regardless of the input, which is obviously wrong. Sharing a single LSTM encoder for both sides fixed it.
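For anyone who hits the same trap, here is a minimal sketch of the shared-encoder idea (class and argument names are made up, and my is_train flag is dropped for brevity): a single LSTM encodes the question, the positive relation, and the negative relations, so a large margin can only come from the question actually matching the positive relation, not from one encoder's weights drifting above the other's.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderRanker(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # a single LSTM reused for questions, positive and negative relations
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def encode(self, tokens):                       # tokens: (batch, seq_len)
        _, (h, _) = self.encoder(self.embed(tokens))
        return h[-1]                                # last layer's final hidden state

    def forward(self, ques, rels, neg_rels):
        q = self.encode(ques)                                    # (B, H)
        p = self.encode(rels)                                    # (B, H)
        B, K, L = neg_rels.size()
        n = self.encode(neg_rels.view(B * K, L)).view(B, K, -1)  # (B, K, H)
        pos_score = F.cosine_similarity(q, p).unsqueeze(1).expand(-1, K)  # (B, K)
        neg_score = F.cosine_similarity(q.unsqueeze(1), n, dim=2)         # (B, K)
        return pos_score, neg_score

With this, pos_score and neg_score come out in the shapes that the MarginRankingLoss call in the training loop above expects.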