@(Paper summaries)[Neural Networks|Optimization]
Use case: the training set is very large.
Benefit: helps avoid getting stuck in a local optimum.
Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum.
Furthermore, training neural networks is usually done using only the first-order gradient of the loss function with respect to the parameters. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.
In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) simply updates each parameter by subtracting the gradient of the loss with respect to the parameter, scaled by the learning rate $\eta$, a hyperparameter. If $\eta$ is too large, SGD will diverge; if it is too small, it will converge slowly. The update rule is simply
$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$$
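The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `sgd_step`, `toy_grad`, and `run_sgd` are hypothetical, and the toy loss $0.5 \sum_i \theta_i^2$ is deterministic, whereas in real training the gradient would come from a different minibatch at each iteration.

```python
def sgd_step(theta, grad, eta):
    """Return theta_{t+1} = theta_t - eta * grad, elementwise."""
    return [p - eta * g for p, g in zip(theta, grad)]


def toy_grad(theta):
    """Gradient of the toy loss 0.5 * sum(p**2); it equals theta itself."""
    return list(theta)


def run_sgd(theta, eta, steps):
    """Apply `steps` SGD updates on the toy loss and return the final theta."""
    for _ in range(steps):
        theta = sgd_step(theta, toy_grad(theta), eta)
    return theta


# A small learning rate shrinks theta toward the minimum at 0.
small_eta = run_sgd([1.0, -2.0], eta=0.1, steps=100)
# A learning rate that is too large makes the iterates grow in magnitude.
large_eta = run_sgd([1.0, -2.0], eta=2.5, steps=20)
print(small_eta, large_eta)
```

On this toy loss each step multiplies every parameter by $1 - \eta$, which is why $\eta = 0.1$ contracts and $\eta = 2.5$ diverges.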
Momentum
In SGD, the gradient $\nabla \mathcal{L}(\theta_t)$ often changes rapidly at each iteration $t$ because the loss is being computed over different data. This is often partially mitigated by re-using the update step from the previous iteration, scaled by a momentum hyperparameter $\mu$, as follows:
$$\begin{aligned}
v_{t+1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t) \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}$$
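As a rough sketch of the two momentum equations above, the velocity and the parameters can be updated together in one function. The names `momentum_step` and `toy_grad`, the toy quadratic loss, and the values $\eta = 0.1$, $\mu = 0.9$ are assumptions made for illustration only.

```python
def momentum_step(theta, v, grad, eta, mu):
    """One momentum update; returns (theta_{t+1}, v_{t+1})."""
    # v_{t+1} = mu * v_t - eta * grad(theta_t)
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    # theta_{t+1} = theta_t + v_{t+1}
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


def toy_grad(theta):
    """Gradient of the toy loss 0.5 * sum(p**2)."""
    return list(theta)


theta, v = [1.0, -2.0], [0.0, 0.0]  # velocity starts at zero
for _ in range(300):
    theta, v = momentum_step(theta, v, toy_grad(theta), eta=0.1, mu=0.9)
print(theta)
```

With the velocity initialized to zero, the first step is identical to a plain SGD step; later steps carry over a fraction $\mu$ of the previous update.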
It has been argued that including the previous gradient step has the effect of approximating some second-order information about the gradient.
Nesterov’s Accelerated Gradient
In Nesterov’s Accelerated Gradient (NAG), the gradient of the loss at each step is computed at $\theta_t + \mu v_t$ instead of $\theta_t$. In momentum, the parameter update could be written $\theta_{t+1} = \theta_t + \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$, so NAG effectively computes the gradient at the new parameter location but without considering the gradient term. In practice, this causes NAG to behave more stably than regular momentum in many situations. A more thorough analysis can be found in ((Sutskever, Martens, Dahl, and Hinton, “On the importance of initialization and momentum in deep learning” (ICML 2013) )). The update rules are then as follows:
$$\begin{aligned}
v_{t+1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t + \mu v_t) \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}$$
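The only change relative to momentum is where the gradient is evaluated, which the following sketch tries to make explicit. Because the gradient is taken at a look-ahead point, the step function here takes a gradient function rather than a precomputed gradient. The names `nag_step` and `toy_grad` and the toy quadratic loss are hypothetical.

```python
def nag_step(theta, v, grad_fn, eta, mu):
    """One NAG update; returns (theta_{t+1}, v_{t+1})."""
    # Evaluate the gradient at the look-ahead point theta_t + mu * v_t.
    lookahead = [p + mu * vi for p, vi in zip(theta, v)]
    grad = grad_fn(lookahead)
    # v_{t+1} = mu * v_t - eta * grad(theta_t + mu * v_t)
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    # theta_{t+1} = theta_t + v_{t+1}
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


def toy_grad(theta):
    """Gradient of the toy loss 0.5 * sum(p**2)."""
    return list(theta)


theta, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nag_step(theta, v, toy_grad, eta=0.1, mu=0.9)
print(theta)
```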
Adagrad
Adagrad effectively rescales the learning rate for each parameter according to the history of the gradients for that parameter. This is done by dividing each term in $\nabla \mathcal{L}$ by the square root of the sum of squares of its historical gradients. Rescaling in this way effectively lowers the learning rate for parameters which consistently have large gradient values. It also effectively decreases the learning rate over time, because the sum of squares will continue to grow with each iteration. After setting the rescaling term $g_0 = 0$, the updates are as follows:
$$\begin{aligned}
g_{t+1} &= g_t + \nabla \mathcal{L}(\theta_t)^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1}} + \epsilon}
\end{aligned}$$
where division is elementwise and $\epsilon$ is a small constant included for numerical stability. Adagrad has nice theoretical guarantees and empirical results ((Dyer, “Notes on AdaGrad”)) ((Duchi, Hazan, and Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization” (COLT 2010) )).
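A small sketch may help show both effects described above: per-parameter rescaling and the shrinking effective learning rate. The function name `adagrad_step`, the default `eps=1e-8`, and the fixed example gradient are assumptions chosen for illustration.

```python
import math


def adagrad_step(theta, g, grad, eta, eps=1e-8):
    """One Adagrad update; returns (theta_{t+1}, g_{t+1})."""
    # g_{t+1} = g_t + grad**2, accumulated separately for each parameter.
    g_new = [gi + d * d for gi, d in zip(g, grad)]
    # theta_{t+1} = theta_t - eta * grad / (sqrt(g_{t+1}) + eps)
    theta_new = [
        p - eta * d / (math.sqrt(gi) + eps)
        for p, d, gi in zip(theta, grad, g_new)
    ]
    return theta_new, g_new


# Two parameters whose gradients differ by a factor of 100.
grad = [10.0, 0.1]
theta0, g0 = [0.0, 0.0], [0.0, 0.0]
theta1, g1 = adagrad_step(theta0, g0, grad, eta=0.1)
theta2, g2 = adagrad_step(theta1, g1, grad, eta=0.1)
print(theta1, theta2)
```

On the first step both parameters move by roughly $\eta$ despite the very different gradient magnitudes, and the second step is smaller than the first because the accumulated sum of squares has grown.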
RMSProp
In its originally proposed form ((Hinton, Srivastava, and Swersky, “rmsprop: Divide the gradient by a running average of its recent magnitude”)), RMSProp is very similar to Adagrad. The only difference is that the $g_t$ term is computed as an exponentially decaying average instead of an accumulated sum. This makes $g_t$ an estimate of the second moment of $\nabla \mathcal{L}$ and avoids the effective shrinking of the learning rate over time seen in Adagrad. The name “RMSProp” comes from the fact that the update step is normalized by a decaying RMS of recent gradients. The update is as follows:
$$\begin{aligned}
g_{t+1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1}} + \epsilon}
\end{aligned}$$
In the original lecture slides where it was proposed, $\gamma$ is set to $0.9$. In ((Dauphin, de Vries, Chung and Bengio, “RMSProp and equilibrated adaptive learning rates for non-convex optimization”)), it is shown that the $\sqrt{g_{t+1}}$ term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are $\mathcal{N}(0, 1)$ distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems which may have many saddle points.
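The difference from Adagrad can be seen in a short sketch: with a constant gradient, the decaying average settles at the squared gradient, so the step size levels off instead of shrinking indefinitely. The name `rmsprop_step`, the default `eps=1e-8`, and the constant gradient of 2.0 are illustrative assumptions; `gamma=0.9` follows the value quoted above.

```python
import math


def rmsprop_step(theta, g, grad, eta, gamma=0.9, eps=1e-8):
    """One RMSProp update; returns (theta_{t+1}, g_{t+1})."""
    # g_{t+1} = gamma * g_t + (1 - gamma) * grad**2
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # theta_{t+1} = theta_t - eta * grad / (sqrt(g_{t+1}) + eps)
    theta_new = [
        p - eta * d / (math.sqrt(gi) + eps)
        for p, d, gi in zip(theta, grad, g_new)
    ]
    return theta_new, g_new


theta, g = [0.0], [0.0]
step_sizes = []
for _ in range(200):
    new_theta, g = rmsprop_step(theta, g, [2.0], eta=0.01)
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta
print(step_sizes[0], step_sizes[-1], g)
```

The early steps are larger than $\eta$ because $g$ starts at zero and underestimates the second moment; once $g$ has converged, each step has magnitude close to $\eta$.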
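Alternatively, in ((Graves, “Generating Sequences with Recurrent Neural Networks”)), a first-order moment approximator $m_t$ is added. It is included in the denominator of the preconditioner so that the learning rate is effectively normalized by the standard deviation of $\nabla \mathcal{L}$. There is also a $v_t$ term included for momentum. This gives
$$\begin{aligned}
m_{t+1} &= \gamma m_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t) \\
g_{t+1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t+1} &= \mu v_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} - m_{t+1}^2 + \epsilon}} \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}$$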
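The four equations above translate fairly directly into code. In this sketch the name `graves_rmsprop_step` and all hyperparameter values in the example call are assumptions for illustration, not values taken from the source.

```python
import math


def graves_rmsprop_step(theta, m, g, v, grad, eta, gamma, mu, eps):
    """One update of the RMSProp variant with a first-moment term and
    momentum; returns (theta_{t+1}, m_{t+1}, g_{t+1}, v_{t+1})."""
    # Decaying averages of the gradient and of the squared gradient.
    m_new = [gamma * mi + (1.0 - gamma) * d for mi, d in zip(m, grad)]
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # g - m**2 estimates the gradient variance, so the step is normalized
    # by an estimate of the standard deviation (eps sits inside the root).
    v_new = [
        mu * vi - eta * d / math.sqrt(gi - mi * mi + eps)
        for vi, d, gi, mi in zip(v, grad, g_new, m_new)
    ]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, m_new, g_new, v_new


# One step from an all-zero state with a unit gradient.
theta, m, g, v = graves_rmsprop_step(
    [0.0], [0.0], [0.0], [0.0], [1.0],
    eta=0.1, gamma=0.9, mu=0.9, eps=1e-4,
)
print(theta, m, g, v)
```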
Adadelta
Adadelta ((Zeiler, “Adadelta: An Adaptive Learning Rate Method”)) uses the same exponentially decaying moving average estimate of the gradient second moment $g_t$ as RMSProp. It also computes a moving average $x_t$ of the updates $v_t$ similar to momentum, but when updating this quantity it squares the current step, which I don’t have any intuition for.
$$\begin{aligned}
g_{t+1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\
v_{t+1} &= -\frac{\sqrt{x_t + \epsilon}\, \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} + \epsilon}} \\
x_{t+1} &= \gamma x_t + (1 - \gamma) v_{t+1}^2 \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}$$
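A minimal sketch of these four equations follows. Note that no learning rate $\eta$ appears in the update as written. The name `adadelta_step` and the values `gamma=0.95` and `eps=1e-6` are assumptions for illustration.

```python
import math


def adadelta_step(theta, g, x, grad, gamma=0.95, eps=1e-6):
    """One Adadelta update; returns (theta_{t+1}, g_{t+1}, x_{t+1})."""
    # Decaying average of squared gradients, as in RMSProp.
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # The step is scaled by the RMS of past updates over the RMS of gradients.
    v_new = [
        -math.sqrt(xi + eps) * d / math.sqrt(gi + eps)
        for xi, d, gi in zip(x, grad, g_new)
    ]
    # Decaying average of squared updates, using the step just computed.
    x_new = [gamma * xi + (1.0 - gamma) * vi * vi for xi, vi in zip(x, v_new)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, g_new, x_new


# One step from an all-zero state with a unit gradient.
theta, g, x = adadelta_step([0.0], [0.0], [0.0], [1.0])
print(theta, g, x)
```

Because $x_0 = 0$, the very first step is governed entirely by $\epsilon$ in the numerator, so it is small; subsequent steps grow as $x_t$ accumulates.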
Adam
Adam is somewhat similar to Adagrad/Adadelta/RMSProp in that it computes a decayed moving average of the gradient and squared gradient (first and second moment estimates) at each time step. It differs mainly in two ways: First, the first order moment moving average coefficient is decayed over time. Second, because the first and second order moment estimates are initialized to zero, some bias-correction is used to counteract the resulting bias towards zero. In most cases, the use of the first and second order moments ensures that the gradient descent step size is $\approx \pm\eta$ and that in magnitude it is less than $\eta$. However, as $\theta_t$ approaches a true minimum, the uncertainty of the gradient will increase and the step size will decrease. It is also invariant to the scale of the gradients. Given hyperparameters $\gamma_1$, $\gamma_2$, $\lambda$, and $\eta$, and setting $m_0 = 0$ and $g_0 = 0$ (note that the paper denotes $\gamma_1$ as $\beta_1$, $\gamma_2$ as $\beta_2$, $\eta$ as $\alpha$ and $g_t$ as $v_t$), the update rule is as follows: ((Kingma and Ba, “Adam: A Method for Stochastic Optimization”))
$$\begin{aligned}
m_{t+1} &= \gamma_1 m_t + (1 - \gamma_1) \nabla \mathcal{L}(\theta_t) \\
g_{t+1} &= \gamma_2 g_t + (1 - \gamma_2) \nabla \mathcal{L}(\theta_t)^2 \\
\hat{m}_{t+1} &= \frac{m_{t+1}}{1 - \gamma_1^{t+1}} \\
\hat{g}_{t+1} &= \frac{g_{t+1}}{1 - \gamma_2^{t+1}} \\
\theta_{t+1} &= \theta_t - \frac{\eta \hat{m}_{t+1}}{\sqrt{\hat{g}_{t+1}} + \epsilon}
\end{aligned}$$
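The bias correction and the scale invariance are easiest to see on the first step, as in the sketch below. It follows the five equations as written, so the decay hyperparameter $\lambda$ does not appear. The name `adam_step`, the 0-based iteration index `t`, and the default hyperparameter values are assumptions of this sketch.

```python
import math


def adam_step(theta, m, g, t, grad,
              eta=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam update from iteration t (0-based) to t + 1;
    returns (theta_{t+1}, m_{t+1}, g_{t+1})."""
    # Decaying first and second moment estimates.
    m_new = [gamma1 * mi + (1.0 - gamma1) * d for mi, d in zip(m, grad)]
    g_new = [gamma2 * gi + (1.0 - gamma2) * d * d for gi, d in zip(g, grad)]
    # Bias correction for the zero initialization of m and g.
    m_corr = 1.0 - gamma1 ** (t + 1)
    g_corr = 1.0 - gamma2 ** (t + 1)
    theta_new = [
        p - eta * (mi / m_corr) / (math.sqrt(gi / g_corr) + eps)
        for p, mi, gi in zip(theta, m_new, g_new)
    ]
    return theta_new, m_new, g_new


# Two parameters whose gradients differ by four orders of magnitude.
theta, m, g = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 0, [100.0, -0.01])
print(theta)
```

After bias correction the first-step estimates equal the gradient and its square, so both parameters move by approximately $\eta$ in the direction opposite to their gradient, regardless of gradient scale.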
ESGD
((Dauphin, de Vries, Chung and Bengio, “RMSProp and equilibrated adaptive learning rates for non-convex optimization”))
Adasecant
((Gulcehre and Bengio, “Adasecant: Robust Adaptive Secant Method for Stochastic Gradient”))
vSGD
((Schaul, Zhang, LeCun, “No More Pesky Learning Rates”))
Rprop
((Riedmiller and Braun, “A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm”))
Source:
http://colinraffel.com/wiki/stochastic_optimization_techniques