A Tour of Recurrent Neural Network Algorithms for Deep Learning
Recurrent neural networks, or RNNs, are a type of artificial neural network that adds additional weights to the network to create cycles in the network graph in an effort to maintain an internal state.
The promise of adding state to neural networks is that they will be able to explicitly learn and exploit context in sequence prediction problems, such as problems with an order or temporal component.
In this post, you are going to take a tour of recurrent neural networks used for deep learning.
After reading this post, you will know:
· How top recurrent neural networks used for deep learning work, such as LSTMs, GRUs, and NTMs.
· How top RNNs relate to the broader study of recurrence in artificial neural networks.
· How research in RNNs has led to state-of-the-art performance on a range of challenging problems.
Note, we’re not going to cover every possible recurrent neural network. Instead, we will focus on recurrent neural networks used for deep learning (LSTMs, GRUs and NTMs) and the context needed to understand them.
Let’s get started.
Overview
We will start off by setting the scene for the field of recurrent neural networks.
Next, we will take a closer look at LSTMs, GRUs, and NTMs used for deep learning.
We will then spend some time on advanced topics related to using RNNs for deep learning.
· Recurrent Neural Networks
· Fully Recurrent Networks
· Recursive Neural Networks
· Neural History Compressor
· Long Short-Term Memory Networks
· Gated Recurrent Unit Neural Networks
· Neural Turing Machines
Recurrent Neural Networks
Let’s set the scene.
Popular belief suggests that recurrence imparts a memory to the network topology.
A better way to consider this is that the training set contains examples with a set of inputs for the current training example.
This is “conventional,” e.g. a traditional multilayer Perceptron.
X(i) -> y(i)
But the training example is supplemented with a set of inputs from the previous example.
This is “unconventional,” e.g. a recurrent neural network.
[X(i-1), X(i)] -> y(i)
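To make the two framings concrete, here is a minimal sketch of how a sequence could be turned into [X(i-1), X(i)] -> y(i) examples. The function name `make_recurrent_examples` and the toy data are assumptions made for illustration. Note that a real RNN carries the previous step forward through its internal state rather than through an explicit window like this.

```python
def make_recurrent_examples(xs, ys):
    """Pair each input with its predecessor: [X(i-1), X(i)] -> y(i)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    examples = []
    # The first input has no predecessor, so we start at index 1.
    for i in range(1, len(xs)):
        examples.append(((xs[i - 1], xs[i]), ys[i]))
    return examples


# Toy sequence (hypothetical values).
xs = [0.1, 0.2, 0.3, 0.4]
ys = [0, 1, 0, 1]
examples = make_recurrent_examples(xs, ys)
for inputs, target in examples:
    print(inputs, "->", target)
```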
As with all feedforward network paradigms, the issues are how to connect the input layer to the output layer, include feedback activations, and then train the construct to converge.
Let’s now take a tour of the different types of recurrent neural networks, starting with very simple conceptions.
Fully Recurrent Networks
The layered topology of a multilayer Perceptron is preserved, but every element has a weighted connection to every other element in the architecture and has a single feedback connection to itself.
Not all connections are trained, and the extreme non-linearity of the error derivatives means conventional Backpropagation will not work, so Backpropagation Through Time approaches or Stochastic Gradient Descent are employed.
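As a rough sketch of what “every element connected to every other element, plus a feedback connection to itself” looks like in code, the update below computes one time step for a small fully connected recurrent layer. The tanh activation, the layer sizes, the weight ranges and the name `recurrent_step` are all assumptions chosen for illustration, and no training is shown.

```python
import math
import random


def recurrent_step(W, V, b, h, x):
    """One update of a fully connected recurrent layer.

    W[i][j] is the weight from unit j to unit i (W[i][i] is the self-feedback),
    V[i][k] is the weight from input k to unit i, and b[i] is the bias of unit i.
    """
    n = len(h)
    new_h = []
    for i in range(n):
        net = b[i]
        net += sum(W[i][j] * h[j] for j in range(n))
        net += sum(V[i][k] * x[k] for k in range(len(x)))
        new_h.append(math.tanh(net))
    return new_h


rng = random.Random(0)
n_units, n_inputs = 3, 2
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_units)] for _ in range(n_units)]
V = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_units)]
b = [0.0] * n_units

# Feed a short input sequence; the state h carries information across steps.
h = [0.0] * n_units
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = recurrent_step(W, V, b, h, x)
print(h)
```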
Also, see Bill Wilson’s Tensor Product Networks (1991).
Recursive Neural Networks
Recurrent neural networks are a linear architectural variant of recursive networks.
Recursion promotes branching in hierarchical feature spaces, and the resulting network architecture mimics this as training proceeds.
Training is achieved with Gradient Descent by sub-gradient methods.
This is described in detail in R. Socher, et al., Parsing Natural Scenes and Natural Language with Recursive Neural Networks, 2011.
Neural History Compressor
Schmidhuber reported a very deep learner, first in 1991, that was able to perform credit assignment over hundreds of neural layers by unsupervised pre-training for a hierarchy of RNNs.
Each RNN is trained unsupervised to predict the next input. Then only inputs generating an error are fed forward, conveying new information to the next RNN in the hierarchy, which then processes at a slower, self-organizing time scale.
It was shown that no information is lost, just compressed. The RNN stack is a “deep generative model” of the data. The data can be reconstructed from the compressed form.
See J. Schmidhuber, Deep Learning in Neural Networks: An Overview, 2014.
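The idea of forwarding only the unpredicted inputs can be sketched with a toy predictor. In the snippet below a simple lookup table (the last seen successor of each symbol) stands in for the lower-level RNN. This is an assumption made to keep the example short, not the method from the original work. The names `compress` and `reconstruct` are hypothetical, and symbols are assumed never to be `None`.

```python
def compress(sequence):
    """Keep only the inputs that the predictor failed to predict."""
    table = {}       # stand-in predictor: last seen successor of each symbol
    surprises = []   # (time step, symbol) pairs passed up the hierarchy
    prev = None
    for t, symbol in enumerate(sequence):
        if table.get(prev) != symbol:
            surprises.append((t, symbol))
        table[prev] = symbol   # the predictor keeps learning online
        prev = symbol
    return surprises, len(sequence)


def reconstruct(surprises, length):
    """Replay the same predictor, filling in whatever it got right."""
    table = {}
    pending = dict(surprises)
    out = []
    prev = None
    for t in range(length):
        symbol = pending[t] if t in pending else table[prev]
        out.append(symbol)
        table[prev] = symbol
        prev = symbol
    return out


sequence = list("abababababcabab")
surprises, length = compress(sequence)
restored = reconstruct(surprises, length)
print(len(surprises), "of", length, "inputs were forwarded")
```

Because the decoder replays exactly the same predictor, the original sequence is recovered from the shorter list of surprises, which mirrors the “no information is lost, just compressed” claim.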
Backpropagation fails because the non-linear derivatives become increasingly extreme as the error is propagated backwards through large topologies, making credit assignment difficult, if not impossible.
Long Short-Term Memory Networks
With conventional Back-Propagation Through Time (BPTT) or Real Time Recurrent Learning (RTRL), error signals flowing backward in time tend to either explode or vanish.
The temporal evolution of the back-propagated error depends exponentially on the size of the weights. Exploding error may lead to oscillating weights, while vanishing error means that learning to bridge long time lags takes a prohibitive amount of time, or does not work at all.
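The exponential dependence on the weights is easy to see with a little arithmetic. The sketch below assumes a single self-connected unit whose activation derivative is a constant, which is a deliberate simplification. The name `backflow_scale` is hypothetical.

```python
def backflow_scale(weight, steps, derivative=1.0):
    """Factor applied to an error signal after `steps` steps back in time
    through one self-connected unit (simplified: constant derivative)."""
    return (weight * derivative) ** steps


for w in (0.9, 1.0, 1.1):
    print(f"weight={w}: error scaled by {backflow_scale(w, 100):.3e} after 100 steps")
```

A weight slightly below one makes the error vanish, a weight slightly above one makes it explode, and only a factor of exactly one keeps it constant.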
· LSTM is a novel recurrent network architecture trained with an appropriate gradient-based learning algorithm.
· LSTM is designed to overcome error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps.
· This is true in the presence of noisy, incompressible input sequences, without loss of short time lag capabilities.
Error back-flow problems are overcome by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units. These units reduce the effects of the “Input Weight Conflict” and the “Output Weight Conflict.”
The Input Weight Conflict: Provided the input is non-zero, the same incoming weight has to be used for both storing certain inputs and ignoring others, and so will often receive conflicting weight update signals.
These signals will attempt to make the weight participate in storing the input and protecting the input. This conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling “write operations” through input weights.
The Output Weight Conflict: As long as the output of a unit is non-zero, the weight on the output connection from this unit will attract conflicting weight update signals generated during sequence processing.
These signals will attempt to make the outgoing weight participate in accessing the information stored in the processing unit and, at different times, protect the subsequent unit from being perturbed by the output of the unit being fed forward.
These conflicts are not specific to long-term lags and can equally impinge on short-term lags. Of note, though, is that as lag increases, stored information must be protected from perturbation, especially in the advanced stages of learning.
Network Architecture: Different types of units may convey useful information about the current state of the network. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell.
Memory cells contain gates. Gates are specific to the connection they mediate. Input gates work to remedy the Input Weight Conflict while output gates work to eliminate the Output Weight Conflict.
Gates: Specifically, to alleviate the input and output weight conflicts and perturbations, a multiplicative input gate unit is introduced to protect the stored memory contents from perturbation by irrelevant inputs, and a multiplicative output gate unit protects other units from perturbation by currently irrelevant stored memory contents.
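The protective role of the two gates can be sketched with a single memory cell that has an input gate and an output gate but no forget gate, in the spirit of the 1997 formulation. The snippet takes the pre-activations directly instead of computing them from weights, and it uses tanh for both squashing functions. Both choices are simplifying assumptions, and the name `memory_cell_step` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def memory_cell_step(state, net_cell, net_in, net_out):
    """One step of a memory cell with multiplicative input and output gates."""
    y_in = sigmoid(net_in)     # input gate: how much of the new input to write
    y_out = sigmoid(net_out)   # output gate: how much of the state to expose
    # The self-connection has a fixed weight of 1.0, so the state is only
    # changed by what the input gate lets through.
    new_state = state + y_in * math.tanh(net_cell)
    output = y_out * math.tanh(new_state)
    return new_state, output


# Write a value with the input gate open and the output gate closed.
state, _ = memory_cell_step(0.0, net_cell=2.0, net_in=20.0, net_out=-20.0)
stored = state

# Present distracting inputs with both gates closed.
for distractor in (3.0, -3.0, 1.5):
    state, hidden_output = memory_cell_step(
        state, net_cell=distractor, net_in=-20.0, net_out=-20.0
    )

# Read the value back by opening the output gate.
state, read_output = memory_cell_step(state, net_cell=0.0, net_in=-20.0, net_out=20.0)
print(stored, state, read_output)
```

With the input gate closed the stored value is barely disturbed by the distractors, and with the output gate closed the cell contributes almost nothing to the rest of the network.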
Example of an LSTM net with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in1 marks the input gate, out1 marks the output gate, and cell1 = block1 marks the first memory cell of block 1.
Taken from Long Short-Term Memory, 1997.
Connectivity in LSTM is complicated compared to the multilayer Perceptron because of the diversity of processing elements and the inclusion of feedback connections.
Memory cell blocks: Memory cells sharing the same input gate and the same output gate form a structure called a “memory cell block”.
Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. A memory cell block of size 1 is just a simple memory cell.
Learning: A variant of Real Time Recurrent Learning (RTRL) that takes into account the altered, multiplicative dynamics caused by input and output gates is used. To ensure non-decaying error is back propagated through the internal states of memory cells, errors arriving at “memory cell net inputs” do not get propagated back further in time.
Guessing: This stochastic approach can outperform many long time lag algorithms. It has been established that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms.
See S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, 1997.
The most interesting application of LSTM Recurrent Neural Networks has been the work done with language processing. See the work of Gers for a comprehensive description.
· F. Gers and J. Schmidhuber, LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages, 2001.
· F. Gers, Long Short-Term Memory in Recurrent Neural Networks, Ph.D. Thesis, 2001.
LSTM Limitations
The efficient, truncated version of LSTM will not easily solve problems similar to “strongly delayed XOR.”
Each memory cell block needs an input gate and an output gate. These are not necessary in other recurrent approaches.
Constant error flow through “Constant Error Carrousels” inside memory cells produces the same effect as a conventional feed-forward architecture being presented with the entire input string at once.
LSTM is as flawed with the concept of “recency” as other feed-forward approaches. Additional counting mechanisms may be required if fine-precision counting of time steps is needed.
LSTM Advantages
The algorithm’s ability to bridge long time lags is the result of constant error Backpropagation in the architecture’s memory cells.
LSTM can approximate noisy problem domains, distributed representations, and continuous values.
LSTM generalizes well over the problem domains considered. This is important given some tasks are intractable for already established recurrent networks.
Fine tuning of network parameters over the problem domains appears to be unnecessary.
In terms of update complexity per weight and time step, LSTM is essentially equivalent to BPTT.
LSTMs have proven to be powerful, achieving state-of-the-art results in domains like machine translation.
Gated Recurrent Unit Neural Networks
Gated Recurrent Neural Networks have been successfully applied to sequential or temporal data.
They are most suitable for speech recognition, natural language processing, and machine translation and, together with LSTM, have performed well in long sequence problem domains.
Gating was considered in the LSTM topic and involves a gating network generating signals that act to control how the present input and previous memory work to update the current activation, and thereby the current network state.
Gates are themselves weighted and are selectively updated according to an algorithm, throughout the learning phase.
Gate networks introduce added computational expense in the form of increased complexity, and therefore added parameterization.
The LSTM RNN architecture uses the computation of the simple RNN as an intermediate candidate for the internal memory cell (state). The Gated Recurrent Unit (GRU) RNN reduces the gating signals to two from the LSTM RNN model. The two gates are called an update gate and a reset gate.
The gating mechanism in the GRU (and LSTM) RNN is a replica of the simple RNN in terms of parameterization. The weights corresponding to these gates are also updated using BPTT stochastic gradient descent as it seeks to minimize a cost function.
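To show how the update gate and reset gate control the state, here is a small sketch of one GRU step in plain Python. It follows the commonly used formulation in which the new state is a blend of the previous state and a candidate state. The parameter layout (a dict of `(W, U, b)` tuples) and the names `affine` and `gru_step` are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b as a list with one entry per hidden unit."""
    return [
        sum(W[i][k] * x[k] for k in range(len(x)))
        + sum(U[i][j] * h[j] for j in range(len(h)))
        + b[i]
        for i in range(len(b))
    ]


def gru_step(params, x, h_prev):
    """One GRU update; each params entry is a (W, U, b) tuple."""
    z = [sigmoid(v) for v in affine(*params["update"], x, h_prev)]
    r = [sigmoid(v) for v in affine(*params["reset"], x, h_prev)]
    # The reset gate scales the previous state before the candidate is computed.
    gated_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    cand = [math.tanh(v) for v in affine(*params["candidate"], x, gated_h)]
    # The update gate blends the previous state with the candidate.
    return [(1.0 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h_prev, cand)]


# One hidden unit, one input, hand-picked weights (hypothetical values).
params = {
    "update": ([[0.0]], [[0.0]], [30.0]),     # update gate forced open
    "reset": ([[0.0]], [[0.0]], [0.0]),
    "candidate": ([[1.0]], [[0.0]], [0.0]),
}
h_open = gru_step(params, [0.5], [0.2])       # state is replaced by the candidate

params["update"] = ([[0.0]], [[0.0]], [-30.0])  # update gate forced closed
h_closed = gru_step(params, [0.5], [0.2])     # state is carried over unchanged
print(h_open, h_closed)
```

Note that each gate has the same `W x + U h + b` shape as a simple RNN layer, which is the sense in which the gates replicate the simple RNN in terms of parameterization.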
Each parameter update will involve information pertaining to the state of the overall network. This can have detrimental effects.
The concept of gating is explored further and extended with three new variant gating mechanisms.
The three gating variants that have been considered are: GRU1, where each gate is computed using only the previous hidden state and the bias; GRU2, where each gate is computed using only the previous hidden state; and GRU3, where each gate is computed using only the bias. A significant reduction in parameters is observed, with GRU3 yielding the smallest number.
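The parameter savings follow directly from which terms each gate keeps. The sketch below counts parameters under the assumption that a full gate has an input weight matrix, a recurrent weight matrix and a bias vector, and that the candidate state keeps all three in every variant. The layer sizes and the function names are hypothetical.

```python
def gate_parameters(variant, n_hidden, n_input):
    """Number of parameters in ONE gate for each variant."""
    counts = {
        "GRU": n_hidden * n_input + n_hidden * n_hidden + n_hidden,  # W, U, b
        "GRU1": n_hidden * n_hidden + n_hidden,                      # U, b
        "GRU2": n_hidden * n_hidden,                                 # U only
        "GRU3": n_hidden,                                            # b only
    }
    return counts[variant]


def total_parameters(variant, n_hidden, n_input):
    """Two gates plus the (unchanged) candidate state computation."""
    candidate = n_hidden * n_input + n_hidden * n_hidden + n_hidden
    return 2 * gate_parameters(variant, n_hidden, n_input) + candidate


for variant in ("GRU", "GRU1", "GRU2", "GRU3"):
    print(variant, total_parameters(variant, n_hidden=100, n_input=28))
```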
The three variants and the GRU RNN were benchmarked using data from the MNIST Database of handwritten digits and the IMDB movie review dataset.
Two sequence lengths were generated from the MNIST dataset and one was generated from the IMDB dataset.
The main driving signal of the gates appears to be the (recurrent) state as it contains essential information about other signals.
The use of stochastic gradient descent implicitly carries information about the network state. This may explain the relative success in using the bias alone in the gate signals, as its adaptive update carries information about the state of the network.
Gated variants explore the mechanisms of gating with limited evaluation of topologies.
For more information see:
· R. Dey and F. M. Salem, Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks, 2017.
· J. Chung, et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014.
Neural Turing Machines
Neural Turing Machines extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with through attention processes.
The combined system is analogous to a Turing Machine or Von Neumann architecture, but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent.
Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms, such as copying, sorting, and associative recall, from input and output examples.
RNNs stand out from other machine learning methods for their ability to learn and carry out complicated transformations of data over extended periods of time. Moreover, it is known that RNNs are Turing-Complete and therefore have the capacity to simulate arbitrary procedures, if properly wired.
The capabilities of standard RNNs are extended to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory and so, by analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, the device is dubbed a “Neural Turing Machine” (NTM).
Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.
NTM Architecture is generically shown above. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads from and writes to a memory matrix via a set of parallel read-and-write heads. The dashed line indicates the division between the NTM circuit and the outside world.
Taken from Neural Turing Machines, 2014.
Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. This was achieved by defining ‘blurry’ read-and-write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer).
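A ‘blurry’ operation can be sketched as a weighted sum over every memory row, where the weights come from a softmax over how similar each row is to a key. The snippet below is a simplified illustration of content-based addressing with a read and an erase-then-add write. The function names, the `sharpness` parameter and the tiny memory matrix are assumptions made for the example, and other addressing steps are left out.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)   # small constant avoids division by zero


def content_weights(memory, key, sharpness):
    """Softmax over the similarity between the key and every memory row."""
    scores = [sharpness * cosine_similarity(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def blurry_read(memory, weights):
    """Weighted sum of all rows: every row contributes a little."""
    width = len(memory[0])
    return [sum(w * row[c] for w, row in zip(weights, memory)) for c in range(width)]


def blurry_write(memory, weights, erase, add):
    """Erase then add, with each row changed in proportion to its weight."""
    return [
        [m * (1.0 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for w, row in zip(weights, memory)
    ]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = content_weights(memory, key=[1.0, 0.0], sharpness=50.0)
read_vector = blurry_read(memory, weights)
updated = blurry_write(memory, [1.0, 0.0, 0.0], erase=[1.0, 1.0], add=[0.5, 0.5])
print(weights, read_vector, updated)
```

Every step is a smooth function of the weights, which is what allows gradients to flow through memory access.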
For more information see:
· A. Graves, et al., Neural Turing Machines, 2014.
· R. Greve, et al., Evolving Neural Turing Machines for Reward-based Learning, 2016.
NTM Experiments
The copy task tests whether NTM can store and recall a long sequence of arbitrary information. The network is presented with an input sequence of random binary vectors followed by a delimiter flag.
The networks were trained to copy sequences of eight-bit random vectors where the sequence lengths were randomized between 1 and 20. The target sequence was simply a copy of the input sequence (without the delimiter flag).
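A training example for the copy task might be generated along these lines. The encoding of the delimiter as an extra input channel and the function name `make_copy_example` are assumptions for illustration.

```python
import random


def make_copy_example(rng, max_len=20, bits=8):
    """Random eight-bit vectors followed by a delimiter flag; target is a copy."""
    length = rng.randint(1, max_len)
    sequence = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(length)]
    # Assumption: one extra channel is reserved for the delimiter flag.
    inputs = [vec + [0] for vec in sequence]
    inputs.append([0] * bits + [1])
    target = [list(vec) for vec in sequence]
    return inputs, target


rng = random.Random(0)
inputs, target = make_copy_example(rng)
print(len(inputs), "input steps,", len(target), "target steps")
```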
The repeat copy task extends copy by requiring the network to output the copied sequence a specified number of times and then emit an end-of-sequence marker. The main motivation was to see if the NTM could learn a simple nested function.
The network receives random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies, which appears on a separate input channel.
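A repeat copy example could be built in a similar way. The maximum length, the maximum repeat count, the unnormalized repeat value and the extra target channel for the end-of-sequence marker are all assumptions here, as is the name `make_repeat_copy_example`.

```python
import random


def make_repeat_copy_example(rng, max_len=10, max_repeats=10, bits=8):
    """Sequence plus a repeat count in; the sequence repeated plus an end marker out."""
    length = rng.randint(1, max_len)
    repeats = rng.randint(1, max_repeats)
    sequence = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(length)]
    # Assumption: the repeat count arrives on its own extra input channel.
    inputs = [vec + [0] for vec in sequence]
    inputs.append([0] * bits + [repeats])
    # Assumption: the end-of-sequence marker uses an extra target channel.
    target = [vec + [0] for _ in range(repeats) for vec in sequence]
    target.append([0] * bits + [1])
    return inputs, target, repeats


rng = random.Random(1)
inputs, target, repeats = make_repeat_copy_example(rng)
print(len(inputs) - 1, "vectors repeated", repeats, "times")
```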
Associative recall tasks involve organizing data arising from “indirection”, that is, when one data item points to another. A list of items is constructed so that querying with one of the items demands that the network returns the subsequent item.
A sequence of binary vectors that is bounded on the left and right by delimiter symbols is defined. After several items have been propagated to the network, the network is queried by showing a random item, and seeing if the network can produce the next item.
The Dynamic N-Grams task tests if the NTM can adapt quickly to new predictive distributions by using memory as a re-writable table that it could use to keep count of transition statistics, thereby emulating a conventional N-Gram model.
Consider the set of all possible 6-Gram distributions over binary sequences. Each 6-Gram distribution can be expressed as a table of 32 numbers, specifying the probability that the next bit will be one, given all possible length five binary histories. A particular training sequence was generated by drawing 200 successive bits using the current lookup table. The network observes the sequence one bit at a time and is then asked to predict the next bit.
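The table of 32 numbers and the sampling procedure can be sketched as follows. How the table probabilities are drawn and how the first five bits are chosen are not stated here, so the uniform draws below are assumptions, as are the function names.

```python
import random


def make_ngram_table(rng, history_len=5):
    """One probability of 'next bit is 1' per possible binary history."""
    # Assumption: probabilities are drawn uniformly for this illustration.
    return [rng.random() for _ in range(2 ** history_len)]


def sample_sequence(rng, table, n_bits=200, history_len=5):
    # Assumption: the first five bits are random, as no history exists yet.
    bits = [rng.randint(0, 1) for _ in range(history_len)]
    while len(bits) < n_bits:
        history = bits[-history_len:]
        index = int("".join(map(str, history)), 2)   # history read as a binary number
        bits.append(1 if rng.random() < table[index] else 0)
    return bits


rng = random.Random(0)
table = make_ngram_table(rng)
bits = sample_sequence(rng, table)
print(len(table), "table entries,", len(bits), "sampled bits")
```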
The priority sort task tests the NTM’s ability to sort. A sequence of random binary vectors is input to the network along with a scalar priority rating for each vector. The priority is drawn uniformly from the range [-1, 1]. The target sequence contains the binary vectors sorted according to their priorities.
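A priority sort example could be generated as below. The number of vectors, sorting from highest to lowest priority, and the name `make_priority_sort_example` are assumptions for illustration.

```python
import random


def make_priority_sort_example(rng, n_vectors=5, bits=8):
    """Random binary vectors with priorities in [-1, 1]; target is the sorted list."""
    vectors = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(n_vectors)]
    priorities = [rng.uniform(-1.0, 1.0) for _ in range(n_vectors)]
    inputs = list(zip(vectors, priorities))
    # Assumption: highest priority first.
    ordered = sorted(inputs, key=lambda pair: pair[1], reverse=True)
    target = [vec for vec, _ in ordered]
    return inputs, target


rng = random.Random(0)
inputs, target = make_priority_sort_example(rng)
for vec, priority in inputs:
    print(round(priority, 3), vec)
```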
NTMs use either a feed-forward network or an LSTM as one of their components.
Summary
In this post, you discovered recurrent neural networks for deep learning.
Specifically, you learned:
· How top recurrent neural networks used for deep learning work, such as LSTMs, GRUs, and NTMs.
· How top RNNs relate to the broader study of recurrence in artificial neural networks.
· How research in RNNs has led to state-of-the-art performance on a range of challenging problems.
This was a big post.