torch.optim
torch.optim is a package implementing various optimization algorithms. Most commonly used methods are already supported, and the interfaces are general enough that more sophisticated algorithms can easily be integrated in the future.
==============================
Contents overview:
==============================
1. How to use an optimizer
2. The available optimization algorithms
3. The available lr_scheduler classes
==============================
How to use an optimizer
To use torch.optim you have to construct an optimizer object, which will hold the current state and will update the parameters based on the computed gradients.
Constructing it
★★★ To construct an optimizer you have to give it an iterable containing the parameters to optimize (these must be Variables); you can then specify optimizer-specific options such as the learning rate, weight decay, and so on.
Note
★★★ If you need to move a model to the GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects from those before the call. In general, make sure that the optimized parameters live in a consistent location when optimizers are constructed and used.
Example:
optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr = 0.0001)
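Below is a slightly fuller, self-contained sketch of the same pattern, assuming a small made-up model (the layer sizes are arbitrary). It also follows the Note above by moving the model to the GPU, if one is available, before the optimizer is constructed:
import torch
import torch.nn as nn
import torch.optim as optim

# A tiny made-up model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# Move the model first, then build the optimizer, so that the optimizer
# holds references to the parameters that will actually be trained.
if torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)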
Per-parameter options
Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass an iterable of dicts. The example below makes this clearer.
Note
You can still pass options as keyword arguments. They will be used as defaults, in the groups that didn’t override them. This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.
For example, this is very useful when one wants to specify per-layer learning rates:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
This means that model.base's parameters will use the default learning rate of 1e-2, model.classifier's parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
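The following self-contained sketch fills in a made-up ToyNet with base and classifier submodules (invented here purely for illustration) and prints each param group's options, showing which values come from the group dict and which fall back to the keyword-argument defaults:
import torch.nn as nn
import torch.optim as optim

class ToyNet(nn.Module):
    def __init__(self):
        super(ToyNet, self).__init__()
        self.base = nn.Linear(10, 10)       # will use the default lr of 1e-2
        self.classifier = nn.Linear(10, 2)  # overrides lr to 1e-3

    def forward(self, x):
        return self.classifier(self.base(x))

model = ToyNet()
optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])  # -> 0 0.01 0.9 and 1 0.001 0.9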
Taking an optimization step
All optimizers implement a step() method that updates the parameters. It can be used in two ways:
optimizer.step()
This is a simplified version supported by most optimizers. It can be called once the gradients have been computed, e.g. by backward().
Example:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure)
Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function (i.e. the loss) multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.
Example:
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
★★★ closure() uses the input and target of the enclosing loop iteration to compute the loss. In effect, each batch defines a different loss function, and the optimizer updates the parameters with respect to the loss of the current batch; the LBFGS sketch below makes this concrete.
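For concreteness, here is a small self-contained sketch using LBFGS on a made-up least-squares problem (the data, model and iteration count are invented for illustration); the closure follows exactly the pattern described above:
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.randn(64, 3)                       # made-up inputs
y = x @ torch.tensor([[1.0], [2.0], [3.0]])  # made-up targets

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = optim.LBFGS(model.parameters(), lr=1.0)

for _ in range(5):
    def closure():
        optimizer.zero_grad()            # clear old gradients
        loss = loss_fn(model(x), y)      # recompute the loss
        loss.backward()                  # recompute gradients
        return loss                      # step() may call this several times
    optimizer.step(closure)

print(loss_fn(model(x), y).item())       # should be close to zero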
Algorithms
class torch.optim.Optimizer(params, defaults)
source
Base class for all optimizers.
Warning
Parameters need to be specified as collections that have a deterministic ordering that is consistent between runs. Examples of objects that don’t satisfy those properties are sets and iterators over values of dictionaries.
⚠️ Note: in other words, pass the parameters in an ordered collection such as a list. Sets and iterators over dictionary values do not have a run-to-run deterministic ordering and should not be used.
Parameters:
params (iterable) – an iterable of torch.Tensor s or dict s. Specifies what Tensors should be optimized.
defaults – (dict): a dict containing default values of optimization options (used when a parameter group doesn’t specify them).
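To see how params and defaults fit together, here is a deliberately minimal SGD-like subclass (not part of torch.optim; written here only to illustrate the base class and its param_groups):
import torch
from torch.optim import Optimizer

class PlainSGD(Optimizer):
    """Minimal SGD-like optimizer, only to illustrate the base class."""

    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)               # copied into every param group
        super(PlainSGD, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:      # one dict per parameter group
            for p in group['params']:
                if p.grad is None:
                    continue
                p.data.add_(-group['lr'] * p.grad.data)  # plain gradient step
        return loss
A PlainSGD(model.parameters(), lr=0.1) instance can then be used exactly like the built-in optimizers.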
add_param_group(param_group)
def add_param_group(self, param_group):
    r"""Add a param group to the :class:`Optimizer` s `param_groups`.

    This can be useful when fine tuning a pre-trained network as frozen layers can be made
    trainable and added to the :class:`Optimizer` as training progresses.

    Arguments:
        param_group (dict): Specifies what Tensors should be optimized along with group
        specific optimization options.
    """
    assert isinstance(param_group, dict), "param group must be a dict"

    params = param_group['params']
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params]
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)

    for param in param_group['params']:
        if not isinstance(param, torch.Tensor):
            raise TypeError("optimizer can only optimize Tensors, "
                            "but one of the params is " + torch.typename(param))
        if not param.is_leaf:
            raise ValueError("can't optimize a non-leaf Tensor")

    for name, default in self.defaults.items():
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " +
                             name)
        else:
            param_group.setdefault(name, default)

    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    if not param_set.isdisjoint(set(param_group['params'])):
        raise ValueError("some parameters appear in more than one parameter group")

    self.param_groups.append(param_group)
Adds a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network: layers that were frozen can be made trainable and added to the Optimizer as training progresses (a sketch follows the parameter description below).
Parameters:
param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.
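A hedged fine-tuning sketch of this pattern, assuming a made-up stand-in for a pre-trained network: only the classifier is optimized at first, and the previously frozen base is unfrozen and handed to the optimizer later through add_param_group() with its own learning rate.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential()                      # made-up stand-in for a pre-trained net
model.add_module('base', nn.Linear(10, 10))
model.add_module('classifier', nn.Linear(10, 2))

# Freeze the base and train only the classifier at first.
for p in model.base.parameters():
    p.requires_grad = False
optimizer = optim.SGD(model.classifier.parameters(), lr=1e-2, momentum=0.9)

# ... later in training: unfreeze the base and add it as a new group
# with its own (smaller) learning rate.
for p in model.base.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model.base.parameters(), 'lr': 1e-4})

print(len(optimizer.param_groups))           # -> 2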
load_state_dict(state_dict)
def load_state_dict(self, state_dict):
    r"""Loads the optimizer state.

    Arguments:
        state_dict (dict): optimizer state. Should be an object returned
            from a call to :meth:`state_dict`.
    """
    # deepcopy, to be consistent with module API
    state_dict = deepcopy(state_dict)
    # Validate the state_dict
    groups = self.param_groups
    saved_groups = state_dict['param_groups']

    if len(groups) != len(saved_groups):
        raise ValueError("loaded state dict has a different number of "
                         "parameter groups")
    param_lens = (len(g['params']) for g in groups)
    saved_lens = (len(g['params']) for g in saved_groups)
    if any(p_len != s_len for p_len, s_len in zip(param_lens, saved_lens)):
        raise ValueError("loaded state dict contains a parameter group "
                         "that doesn't match the size of optimizer's group")

    # Update the state
    id_map = {old_id: p for old_id, p in
              zip(chain(*(g['params'] for g in saved_groups)),
                  chain(*(g['params'] for g in groups)))}

    def cast(param, value):
        r"""Make a deep copy of value, casting all tensors to device of param."""
        if isinstance(value, torch.Tensor):
            # Floating-point types are a bit special here. They are the only ones
            # that are assumed to always match the type of params.
            if param.is_floating_point():
                value = value.to(param.dtype)
            value = value.to(param.device)
            return value
        elif isinstance(value, dict):
            return {k: cast(param, v) for k, v in value.items()}
        elif isinstance(value, Iterable):
            return type(value)(cast(param, v) for v in value)
        else:
            return value

    # Copy state assigned to params (and cast tensors to appropriate types).
    # State that is not assigned to params is copied as is (needed for
    # backward compatibility).
    state = defaultdict(dict)
    for k, v in state_dict['state'].items():
        if k in id_map:
            param = id_map[k]
            state[param] = cast(param, v)
        else:
            state[k] = v

    # Update parameter groups, setting their 'params' value
    def update_group(group, new_group):
        new_group['params'] = group['params']
        return new_group
    param_groups = [
        update_group(g, ng) for g, ng in zip(groups, saved_groups)]
    self.__setstate__({'state': state, 'param_groups': param_groups})
Loads the optimizer state.
Parameters:
state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a list containing all parameter groups
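The usual checkpointing pattern built on state_dict() and load_state_dict() looks roughly like the following sketch (the file name and model are made up for illustration):
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Save model and optimizer state (e.g. Adam's moment estimates).
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

# Restore: rebuild the same model/optimizer, then load both state dicts.
model2 = nn.Linear(10, 2)
optimizer2 = optim.Adam(model2.parameters(), lr=1e-3)
checkpoint = torch.load('checkpoint.pth')
model2.load_state_dict(checkpoint['model'])
optimizer2.load_state_dict(checkpoint['optimizer'])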
step(closure)
Performs a single optimization step (parameter update).
Parameters:
closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
zero_grad()
Clears the gradients of all optimized torch.Tensors.
Next, the available optimization algorithms:
class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Implements Adadelta algorithm.
class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
Implements Adagrad algorithm.
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
Implements Adam algorithm.
class torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
Implements lazy version of Adam algorithm suitable for sparse tensors.
class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Implements Adamax algorithm (a variant of Adam based on infinity norm).
class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
Implements Averaged Stochastic Gradient Descent.
class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)
Implements L-BFGS algorithm.
class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Implements RMSprop algorithm.
class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
Implements the resilient backpropagation algorithm.
class torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Implements stochastic gradient descent (optionally with momentum).
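All of the classes above share the construction and step() interface described earlier, so swapping one optimizer for another is usually a one-line change. A small sketch (the hyperparameters shown are simply the documented defaults, not a recommendation):
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)

# Any of these can be dropped into the same training loop.
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)
# optimizer = optim.Adadelta(model.parameters(), lr=1.0, rho=0.9)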
How to adjust Learning Rate
torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. torch.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reducing based on some validation measurements.
class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
Sets the learning rate of each parameter group to the initial lr times a given function. When last_epoch=-1, sets initial lr as lr.
class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
Sets the learning rate of each parameter group to the initial lr decayed by gamma every step_size epochs. When last_epoch=-1, sets initial lr as lr.
class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
Set the learning rate of each parameter group to the initial lr decayed by gamma once the number of epoch reaches one of the milestones. When last_epoch=-1, sets initial lr as lr.
class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
Set the learning rate of each parameter group to the initial lr decayed by gamma every epoch. When last_epoch=-1, sets initial lr as lr.
class torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
Set the learning rate of each parameter group using a cosine annealing schedule.
class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
Reduce learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metrics quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.
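A sketch of how the schedulers are driven from the training loop, assuming a made-up model and training step; an epoch-based scheduler such as StepLR is stepped once per epoch, while ReduceLROnPlateau is stepped with the validation metric it should watch:
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Decay the lr by a factor of 0.1 every 30 epochs.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(100):
    # ... forward/backward passes and optimizer.step() go here ...
    scheduler.step()                 # update the learning rate once per epoch

# Alternatively, reduce the lr when a validation metric stops improving.
plateau = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1,
                                         patience=10)
for epoch in range(100):
    val_loss = 1.0                   # made-up validation loss
    plateau.step(val_loss)           # pass the metric the scheduler watches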