Gated Recurrent Units Viewed Through the Lens of Continuous Time Dynamical Systems

1 Introduction

Recurrent neural networks (RNNs) can capture and utilize sequential structure in natural and artificial languages, speech, video, and various other forms of time series. The recurrent information flow within an RNN implies that data seen in the past influences the current state of the network, forming a mechanism for memory through (nonlinear) temporal traces that encode both what happened and when. Unfortunately, training vanilla RNNs to capture long-range dependences within a sequence is challenging due to the vanishing gradient problem [13, 4]. Several special RNN architectures have been proposed to mitigate this issue, notably the long short-term memory (LSTM) unit [14], which explicitly guards against unwanted corruption of the information stored in the hidden state until necessary. Recently, a simplification of the LSTM called the gated recurrent unit (GRU) [5] has become popular in the machine learning community thanks to its performance in speech [21], music [6], video [10], and extracting nonlinear dynamics underlying neural data [20]. However, certain mechanistic tasks, specifically unbounded counting, come easily to LSTM networks but not to GRU networks [23]; we will explain why.

Despite these empirical findings, we lack a systematic understanding of the internal time evolution of the GRU's memory structure and its capability to represent nonlinear temporal dynamics. Such an understanding would make clear which specific tasks can and cannot be performed [4], how computation is implemented [22, 2], and help to predict qualitative behavior [25]. In addition, a great deal of the literature discusses the local dynamics (equilibrium points) of RNNs, but a complete theory requires an understanding of the global properties as well [1]. Furthermore, a deterministic understanding of a GRU network's topological structure (its dynamical properties under a given set of parameters) provides fundamental insight into a trained network's generalization ability, and therefore helps in understanding how to seed RNNs for specific tasks [9].

In general, an RNN can be written as
$$h_t = f(h_{t-1}, x_t),$$
where $x_t$ is the current input in a sequence indexed by $t$, $f$ is a nonlinear function, and $h_t$ represents the hidden memory state that carries all information responsible for future output. In the absence of input, $h_t$ evolves over time on its own:
$$h_t = F(h_{t-1}), \tag{1}$$
where $F(h) := f(h, 0)$ for notational simplicity. In other words, we can consider the temporal evolution of memory stored within an RNN as a trajectory of an autonomous dynamical system defined by (1), and use dynamical systems theory to further investigate and classify the temporal features obtainable in an RNN. In this paper, we aim to provide a deep intuition for the inner workings of the GRU through a continuous time analysis. We discuss all observed local and global dynamical structures, and validate the theory by training GRUs to predict time series with prescribed dynamics. In addition, we look into the role the continuous analysis plays in the context of generalizing a working memory task. So as not to compromise the presentation, we restrict our analysis to low dimensions for easy visualization.

¹ However, given a trained GRU of any finite dimension, the findings here still apply, and can be generalized on a case-by-case basis. We recommend Meiss [19] for more background on dynamical systems theory.

2 Underlying Continuous Time System of Gated Recurrent Units

The GRU uses two internal gating variables: the update gate $z_t$, which protects the $d$-dimensional hidden state $h_t \in \mathbb{R}^d$, and the reset gate $r_t$, which allows overwriting of the hidden state and controls the interaction with the input $x_t \in \mathbb{R}^m$:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{2}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \tag{3}$$
$$h_t = (1 - z_t) \odot \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) + z_t \odot h_{t-1}, \tag{4}$$
where $W_z, W_r, W_h \in \mathbb{R}^{d \times m}$ and $U_z, U_r, U_h \in \mathbb{R}^{d \times d}$ are the parameter matrices, $b_z, b_r, b_h \in \mathbb{R}^d$ are bias vectors, $\odot$ represents the Hadamard product, and $\sigma(\cdot)$ is the element-wise logistic sigmoid function. Note that the hidden state is asymptotically contained within $(-1, 1)^d$ due to the saturating nonlinearities, implying that if the state is initialized outside of this trapping region, it must eventually enter it in finite time and remain in it for all later time.
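To make the update concrete, here is a minimal NumPy sketch of the discrete step (2)-(4); the parameter shapes follow the notation above, while the random weight values and the dictionary layout are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, p):
    """One discrete GRU update following (2)-(4); p is a dict of parameters."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])              # update gate (2)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])              # reset gate (3)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return (1.0 - z) * h_tilde + z * h_prev                            # hidden state (4)

# Illustrative 2D GRU with a 1D input; the weights are arbitrary placeholders.
d, m = 2, 1
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d, m)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.normal(size=(d, d)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(d) for k in ("bz", "br", "bh")})

h = np.zeros(d)
for t in range(5):
    h = gru_step(h, rng.normal(size=m), p)
print(h)  # a convex combination of values in (-1, 1), so it remains in the trapping region
```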

Note that the update gate $z_t$ controls how fast each dimension of the hidden state decays, providing an adaptive time constant for memory. Specifically, as $z_{t,i} \to 1$, GRUs can implement perfect memory of the past and ignore $x_t$. Hence, a $d$-dimensional GRU is capable of keeping a near-constant memory through the update gate (near constant since $z_{t,i} < 1$, where $z_{t,i}$ denotes the $i$-th component of the vector $z_t$). Moreover, the autoregressive weights (mainly $U_h$ and $U_r$) can support time-evolving memory (Laurent and von Brecht [18] considered this a hindrance and proposed removing all complex dynamical behavior in a simplified GRU).

To investigate the memory structure further, let us consider the dynamics of the hidden state in the absence of input, i.e. $x_t = 0$, which is of the form (1). From a dynamical systems point of view, all inputs can be understood as perturbations to the autonomous system, and therefore have no effect on the set of achievable dynamics. To utilize the rich descriptive language of continuous time dynamical systems theory, we recognize the autonomous GRU-RNN as a forward Euler discretization of the following continuous time dynamical system:

$$z(h) = \sigma(U_z h + b_z) \qquad \text{(continuous update gate)} \tag{5}$$
$$r(h) = \sigma(U_r h + b_r) \qquad \text{(continuous reset gate)} \tag{6}$$
$$\dot{h} = \left(-h + \tanh\!\left(U_h (r(h) \odot h) + b_h\right)\right) \oslash \tau(h) \qquad \text{(continuous hidden state)} \tag{7}$$

where $\oslash$ denotes point-wise division and $\tau(h) = 1 \oslash (1 - z(h))$. Since both $\sigma$ and $\tanh$ are smooth, this continuous limit is justified and serves as a basis for further analysis, as all GRU networks are attempting to approximate this continuous limit.²

² Note that this analysis technique can be applied to any RNN with differentiable activation functions.

In the following, GRU refers to the continuous time version (7). Note that the update gate again plays the role of a state-dependent time constant for memory decay. Furthermore, since $\tau(h)$ is positive and finite, it does not change the topological structure of the dynamics (only the speed of the flow). For the theoretical analysis in sections 3 and 4, we can therefore safely ignore the effects of the update gate. A derivation of the continuous-time GRU can be found in Appendix A.
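As a sketch of how (5)-(7) can be simulated, the snippet below evaluates the autonomous flow and integrates it with forward Euler; it uses the algebraically equivalent form $\dot{h} = (1 - z(h)) \odot (\,\cdot\,)$ rather than dividing by $\tau(h)$, and the weight values, step size, and horizon are arbitrary placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_vector_field(h, Uz, Ur, Uh, bz, br, bh):
    """Continuous-time GRU flow (5)-(7): dh/dt for the autonomous (input-free) system."""
    z = sigmoid(Uz @ h + bz)   # update gate, eq. (5)
    r = sigmoid(Ur @ h + br)   # reset gate, eq. (6)
    return (1.0 - z) * (-h + np.tanh(Uh @ (r * h) + bh))   # eq. (7), (1 - z) form

def euler_trajectory(h0, params, dt=0.01, steps=2000):
    """Forward Euler integration of the continuous-time GRU."""
    h = np.array(h0, dtype=float)
    traj = [h.copy()]
    for _ in range(steps):
        h = h + dt * gru_vector_field(h, *params)
        traj.append(h.copy())
    return np.array(traj)

# Placeholder 2D parameters: update/reset weights and all biases set to zero.
Uh = np.array([[3.0, 0.0], [0.0, 3.0]])
zeros2x2, zeros2 = np.zeros((2, 2)), np.zeros(2)
params = (zeros2x2, zeros2x2, Uh, zeros2, zeros2, zeros2)
traj = euler_trajectory([0.1, -0.2], params)
print(traj[-1])  # converges to one of the stable fixed points of this bistable example
```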

3 Stability Analysis of a One Dimensional GRU

Figure 1: Three possible types of one dimensional flow for a 1D GRU. When $\dot{h} > 0$, $h$ increases; this flow is indicated by a rightward arrow. Nodes (fixed points, where $\dot{h} = 0$) are represented as circles and classified by their stability [19].

For a 1D GRU ($d = 1$), (7) reduces to a one dimensional dynamical system in which every variable is a scalar. (The number/dimension of GRUs referenced will indicate the dimension of the latent dynamics of a GRU network.) The expressive power of a 1D GRU is quite limited, as only three stability structures (topologies) exist (see Appendix B): (A) a single stable node, (B) a stable node and a half-stable node, and (C) two stable nodes separated by an unstable node (see Fig. 1). The corresponding time evolutions of the hidden state are (A) decay to a fixed value, (B) decay to a fixed value, but from one direction halting at an intermediate value until perturbed, or (C) decay to one of two fixed values (bistability). The bistability can be used to capture a binary latent state in the input sequence. It should be noted that a one dimensional continuous time autonomous system cannot exhibit oscillatory behavior, unlike its discrete time counterpart [19].

The topology the GRU takes is determined by its parameters. If the GRU begins in a region of the parameter space corresponding to (A), we can smoothly vary the parameters to traverse (B) in the parameter space, and end up at (C). This is commonly known as a saddle-node bifurcation. Generally speaking, a bifurcation is a change in the topology of a dynamical system resulting from a smooth change in parameters. The point in the parameter space at which the bifurcation occurs is called the bifurcation point (e.g. Fig. 1B), and we will refer to the fixed point that changes its stability at the bifurcation point as the bifurcation fixed point (e.g. the half-stable fixed point in Fig. 1B). The codimension of a bifurcation is the number of parameters which must vary in order to remain on the bifurcation manifold; in the case of our example, a saddle-node bifurcation is codimension-1 [17]. Right before transitioning to (B) from (A), the flow near where the half-stable node would appear can be arbitrarily slow. We will refer to such regions as slow points [22]. In this context, slow points allow for metastable states, where a trajectory flows towards the slow point, remains there for a period of time, and then moves on to the stable fixed point.
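A small numerical sketch of this picture: the scalar flow (7) is scanned for sign changes to locate fixed points and classify their stability, and the bias $b_h$ is swept to show the two topologies on either side of the saddle-node bifurcation. The positive update-gate factor is omitted (it only rescales speed), and the specific parameter values are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flow_1d(h, Uh, Ur, bh, br):
    """Scalar continuous-time GRU flow; the positive update-gate factor is dropped
    since it affects neither the fixed points nor their stability."""
    return -h + np.tanh(Uh * sigmoid(Ur * h + br) * h + bh)

def fixed_points(Uh, Ur, bh, br, grid=np.linspace(-0.99999, 0.99999, 200001)):
    """Locate fixed points by sign changes of the flow and classify their stability."""
    f = flow_1d(grid, Uh, Ur, bh, br)
    pts = []
    for i in np.where(np.sign(f[:-1]) != np.sign(f[1:]))[0]:
        h_star = 0.5 * (grid[i] + grid[i + 1])
        stable = f[i] > 0 > f[i + 1]   # flow points toward the root from both sides
        pts.append((round(h_star, 4), "stable" if stable else "unstable"))
    return pts

# Hypothetical parameters: bh = 0 gives the bistable case (C); a sufficiently
# negative bh leaves a single stable node, case (A). The saddle-node lies in between.
for bh in (0.0, -2.0):
    print(bh, fixed_points(Uh=6.0, Ur=0.0, bh=bh, br=0.0))
```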

4 Analysis of a Two Dimensional GRU

We will see that the addition of a second GRU opens up a substantial variety of possible topological structures. For notational simplicity, we denote the two dimensions of $h$ as $h_1$ and $h_2$. We visualize the flow fields defined by (7) in two dimensions as phase portraits, which reveal the topological structures of interest [19]. For starters, the phase portrait of two independent bistable GRUs is visualized in Figure 2A. It clearly shows 4 stable states as expected, with a total of 9 fixed points. This can be thought of as a continuous-time, continuous-space implementation of a finite state machine with 4 states (Fig. 2B). The 3 types of observed fixed points (stable nodes or sinks, unstable nodes or sources, and saddle points) exhibit locally linear dynamics; however, the global geometry is nonlinear, and the topological structure can vary depending on their arrangement.

Figure 2: Illustrative example of two independent bistable GRUs. (A) Phase portrait. The flow field is decomposed into direction (black arrows) and speed (color). Purple lines represent trajectories of the hidden state which converge to one of the four stable fixed points. Note the four quadrants coincide with the basin of attraction for each of the stable nodes. The fixed points appear when the x- and y-nullclines intersect. (B) The four stable nodes of this system can be interpreted as a continuous analogue of 4-discrete states with input-driven transitions.
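Phase portraits like Figure 2A can be reproduced with a short script. The sketch below evaluates the continuous-time flow (7) on a grid and draws streamlines colored by log speed, using the Figure 2 row of Table 1 ($U_h = 3I$, all other weights and biases zero); the grid resolution and colormap are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Parameters from Table 1 (Figure 2 row): Uh = 3*I, all other weights and biases zero.
Uh = np.array([[3.0, 0.0], [0.0, 3.0]])
Uz = Ur = np.zeros((2, 2))
bz = br = bh = np.zeros(2)

# Evaluate the continuous-time flow (7) on a grid covering the trapping region.
h1, h2 = np.meshgrid(np.linspace(-1, 1, 60), np.linspace(-1, 1, 60))
H = np.stack([h1, h2], axis=-1)                       # shape (60, 60, 2)
z = sigmoid(H @ Uz.T + bz)
r = sigmoid(H @ Ur.T + br)
dH = (1.0 - z) * (-H + np.tanh((r * H) @ Uh.T + bh))  # eq. (7) on the grid

speed = np.linalg.norm(dH, axis=-1)
plt.streamplot(h1, h2, dH[..., 0], dH[..., 1], color=np.log(speed + 1e-9), cmap="viridis")
plt.xlabel("$h_1$")
plt.ylabel("$h_2$")
plt.title("Two independent bistable GRUs")
plt.show()
```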

We explored stability structures attainable by 2D GRUs. Due to the relatively large number of observed topologies, this section's main focus will be on demonstrating all observed local dynamical features obtainable by 2D GRUs. In addition, the existence of two non-local dynamical features will be presented. A complete catalog of all observed topologies can be found in Appendix D, along with the parameters of every phase portrait depicted in this paper.

Before proceeding, let us describe all of the local dynamical features observed. In addition to the previously mentioned three types of fixed points, 2D GRUs can exhibit a variety of bifurcation fixed points, arising from regions of parameter space that separate the topologies restricted to simple fixed points (i.e., stable, unstable, and saddle points). Behaviorally speaking, these fixed points act as hybrids of the previous three, resulting in a much richer set of obtainable dynamics. In Figure 3, we show all observed types of fixed points. 2D GRUs feature both codimension-1 and codimension-2 bifurcation fixed points. In codimension-1, we have the saddle-node bifurcation fixed point, as expected from its existence in the 1D GRU case. These can be thought of as the fusion of a stable fixed point and a saddle point, or the fusion of an unstable fixed point and a saddle point; we will refer to these as saddle-node bifurcation fixed points of the first kind and second kind respectively. The only type of codimension-2 bifurcation fixed point that has been observed in the 2D GRU system (Fig. 3B) acts as the fusion of all three simple fixed points. More specifically, these points arise from the fusion of a stable fixed point, an unstable fixed point, and two saddle points.

In Figure 3A, we see 11 fixed points, the maximum number observed in a 2D GRU system. A closer look at this system reveals one interpretation as a continuous analogue of five discrete states with input-driven transitions, similar to that depicted in Figure 2. This imposes a strict upper bound on the network's capacity to encode a finite set of states in this manner (more on this in section 5.4).

Figure 3: Existence of all observed simple fixed points and bifurcation fixed points with 2D GRUs, depicted in phase space. Orange and pink lines represent the x and y nullclines respectively. Purple lines indicate various trajectories of the hidden state. The direction of the flow is given by the black arrows, and the colormap underlying the figure depicts the magnitude of the velocity of the flow in log scale.

The addition of bifurcation fixed points opens the door to dynamically realizing more sophisticated models. Take for example the four-state system depicted in Figure 3B. If the hidden state is initialized in the first quadrant of phase space, the trajectory will flow towards the codimension-2 bifurcation fixed point at the origin. Introducing noise through the input will stochastically cause the trajectory to approach a stable fixed point, either directly or by first flowing into one of the two saddle-node bifurcation fixed points of the first kind. Models of this sort can be used in a variety of applications, such as perceptual decision making [24, 7].

We will begin our investigation of the non-local dynamics observed in 2D GRUs by showing the existence of homoclinic orbits. A trajectory initialized on a homoclinic orbit will approach the same fixed point in both forward and backward time. We observe that 2D GRUs can exhibit one or two bounded planar regions containing uncountably many homoclinic orbits for a given set of parameters, as shown in Figures 4A and 4B respectively. Any trajectory initialized in one of these regions will flow into the codimension-2 bifurcation fixed point at the origin, regardless of the direction in which time flows. This behavior enables the accurate depiction of various models, including neuron spiking [15].

Figure 4: Two GRUs exhibit bounded regions of homoclinic orbits. Panels C and D show the hidden state as a function of time for a single initial condition within the homoclinic region(s) of the single and double homoclinic region cases respectively (denoted by solid black trajectories in each corresponding phase portrait).

Regarding the second non-local dynamical feature, it can be shown that 2D GRUs can exhibit an Andronov-Hopf bifurcation, whereby a stable fixed point bifurcates into an unstable fixed point surrounded by a limit cycle. Behaviorally speaking, a limit cycle is a type of attractor, in the sense that there exists a defined basin of attraction. However, unlike a stable fixed point, where trajectories initialized in the basin of attraction flow towards a single point, a limit cycle pulls trajectories into a stable periodic orbit around the unstable fixed point at its center. To demonstrate this phenomenon, let (8) define the parameters of (7):
$$U_h = 3\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}, \qquad U_z = U_r = 0, \qquad b_z = b_r = b_h = 0, \tag{8}$$
for $\alpha \in (0, \pi/2)$ (the specific values used are listed in Table 1).

For large $\alpha$, the system has a single stable fixed point (a stable spiral), as depicted in Figure 5A. If we continuously decrease $\alpha$, the system undergoes an Andronov-Hopf bifurcation at a critical value of $\alpha$. As $\alpha$ continues to decrease, the orbital period increases, and since the nullclines can be made arbitrarily close together, the length of this orbital period can be set arbitrarily. Figure 5B shows an example of a relatively short orbital period, and Figure 5C depicts the behavior seen for slower orbits.
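A sketch of this demonstration, integrating (7) with $U_h$ taken directly from rows 5a and 5c of Table 1 (all other weights and biases zero, so both gates are constant at 0.5); the step size, horizon, and initial condition are arbitrary.

```python
import numpy as np

def simulate(Uh, h0, dt=0.01, steps=20000):
    """Forward Euler integration of (7) with Uz = Ur = 0 and zero biases,
    so the update and reset gates are both constant at 0.5."""
    h = np.array(h0, dtype=float)
    traj = np.empty((steps, 2))
    for t in range(steps):
        h = h + dt * 0.5 * (-h + np.tanh(Uh @ (0.5 * h)))
        traj[t] = h
    return traj

# Uh values from Table 1, rows 5a (before the bifurcation) and 5c (after it).
Uh_5a = np.array([[1.5, -2.598], [2.598, 1.5]])
Uh_5c = np.array([[2.9665, -0.4471], [0.4471, 2.9665]])

for name, Uh in [("5a", Uh_5a), ("5c", Uh_5c)]:
    traj = simulate(Uh, h0=[0.3, 0.0])
    print(name, "norm of final state:", np.linalg.norm(traj[-1]))
    # 5a decays toward the origin (stable spiral); 5c settles onto a limit cycle.
```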

Figure 5: Two GRUs exhibit an Andronov-Hopf bifurcation, where the parameters are defined by (8). Before the bifurcation, the system exhibits a single stable fixed point at the origin (Fig. 5A). As $\alpha$ decreases continuously, a limit cycle emerges around the fixed point, and the fixed point changes stability (Fig. 5B). Allowing $\alpha$ to decrease further increases the size and orbital period of the limit cycle (Fig. 5C). The bottom row shows the hidden state as a function of time for a single trajectory (denoted by black trajectories in each corresponding phase portrait).

With finite-fixed-point topologies and global structures out of the way, the next logical question to ask is: can 2D GRUs exhibit an infinite number of fixed points? Such behavior is often desirable in models that require stationary attraction to non-point structures, such as line attractors and ring attractors. While such behavior has yet to be observed in the continuous GRU system, an approximation of a line attractor can be made, as depicted in Figure 9 (Appendix C). We will refer to this phenomenon as a pseudo-line attractor: the nullclines remain sufficiently close over a small finite interval, thereby allowing for arbitrarily slow flow by means of slow points.

5 Experiments: Sequence Prediction and Classification

To put our theory into practice, in this section we explore several examples of time series prediction of continuous time planar dynamical systems using 2D GRUs. Results from the previous section indicate what dynamical features can be learned by this RNN, and suggest cases in which training will fail. All of the following computer experiments consist of an RNN whose hidden layer is a 2D GRU, followed by a linear output layer. The network is trained to make a 29-step prediction from a given initial observation, with no further input during prediction. As such, to produce accurate predictions, the RNN must rely solely on the hidden layer dynamics.

We train the network to minimize the following multi-step loss function:
$$\mathcal{L}(\theta) = \sum_{i} \sum_{k=1}^{K} \left\| x^{(i)}_{k} - \hat{x}^{(i)}_{k}(\theta) \right\|_2^2, \tag{9}$$
where $\theta$ denotes the parameters of the GRU and linear readout, $K$ is the prediction horizon, $x^{(i)}_{k}$ is the $i$-th time series generated by the true system, and $\hat{x}^{(i)}_{k}(\theta)$ is the $k$-step prediction given $x^{(i)}_{0}$.

The hidden states are initialized at zero for each trajectory. The RNN is then trained for 4000 epochs using ADAM [16] in whole-batch mode to minimize the loss function, i.e., the mean squared error between the predicted trajectory and the data. Multiple time series were used for training. Figure 6 depicts the experimental results of the RNN's attempt at learning each dynamical system described below.
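A minimal PyTorch sketch of this training setup, under one reading of the protocol (the initial observation is the only input, after which the hidden state evolves on its own and the prediction is read out through the linear layer); the placeholder data tensor, trajectory count, and default learning rate are illustrative, not the values used here.

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """2D GRU hidden state with a linear readout, trained for multi-step prediction."""
    def __init__(self, obs_dim=2, hidden_dim=2):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, obs_dim)

    def forward(self, x0, horizon=29):
        # The initial observation is the only input; afterwards the input is zero,
        # so the prediction relies solely on the hidden layer dynamics.
        h = torch.zeros(x0.shape[0], self.cell.hidden_size)   # hidden state at zero
        x, preds = x0, []
        for _ in range(horizon):
            h = self.cell(x, h)
            preds.append(self.readout(h))
            x = torch.zeros_like(x0)
        return torch.stack(preds, dim=1)                       # (batch, horizon, obs_dim)

# data: (num_series, horizon + 1, obs_dim) trajectories from the true system;
# the random tensor below is a stand-in for real training trajectories.
data = torch.randn(50, 30, 2)
model = GRUForecaster()
opt = torch.optim.Adam(model.parameters())

for epoch in range(4000):                  # whole-batch training, as described above
    opt.zero_grad()
    pred = model(data[:, 0, :])            # 29-step prediction from the first observation
    loss = ((pred - data[:, 1:, :]) ** 2).mean()   # multi-step MSE, cf. (9)
    loss.backward()
    opt.step()
```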

Figure 6: Training 2-dimensional GRUs. (Top row) Phase portraits of the target dynamical systems; the red solid line represents a 1-dimensional attractor. See the main text for each system. (Middle row) GRU dynamics learned from the corresponding 29-step forecasting tasks. The prediction is an affine transformation of the hidden state. (Bottom row) An example time series generated through closed-loop prediction by the trained GRU (denoted by a black trajectory). The GRU fails to learn the ring attractor.
Figure 7: Learning curves (training loss) for the ring attractor (left) and the FitzHugh-Nagumo (right) dynamics. Note that performance on the ring attractor improves as the dimensionality of the GRU increases, unlike for the FHN dynamics. Four network sizes (2, 4, 8, and 16 dimensional GRUs) were trained 3 times with different initializations.

5.1 Limit Cycle

To test whether 2D GRUs can learn a limit cycle, we use a simple nonlinear oscillator, the FitzHugh-Nagumo model:
$$\dot{v} = v - \frac{v^3}{3} - w + I_{\text{ext}}, \qquad \tau \dot{w} = v + a - b w,$$
where in this experiment the parameters $I_{\text{ext}}$, $a$, $b$, and $\tau$ are fixed so that the system exhibits an unstable fixed point (unstable spiral) surrounded by a limit cycle (Fig. 6). As shown in section 4, 2D GRUs are capable of representing this topology. The results of this experiment verify this claim (Fig. 6), as 2D GRUs capture topologically equivalent dynamics.
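A sketch of how training trajectories for this experiment could be generated. The parameter values below are illustrative stand-ins (the values used in this experiment are not reproduced here), chosen from the commonly used oscillatory regime of the FitzHugh-Nagumo model, and the integration step is arbitrary; the output layout matches the training sketch above.

```python
import numpy as np

# Illustrative FitzHugh-Nagumo parameters (stand-ins, not the paper's values),
# in a regime where the fixed point is an unstable spiral inside a limit cycle.
a, b, tau, I_ext = 0.7, 0.8, 12.5, 0.5

def fhn(state):
    v, w = state
    dv = v - v**3 / 3 - w + I_ext
    dw = (v + a - b * w) / tau
    return np.array([dv, dw])

def make_trajectory(x0, dt=0.1, steps=30):
    """Euler-integrate one short training trajectory (initial point + 29 further steps)."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps - 1):
        x = x + dt * fhn(x)
        traj.append(x.copy())
    return np.array(traj)

rng = np.random.default_rng(0)
dataset = np.stack([make_trajectory(rng.uniform(-2, 2, size=2)) for _ in range(50)])
print(dataset.shape)   # (50, 30, 2): trajectories in the layout used by the training sketch
```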

5.2 Line Attractor

As discussed in section 4, 2D GRUs can exhibit a pseudo-line attractor, by which the system mimics an analytic line attractor on a small finite domain. We use the simplest representation of a planar line attractor, $\dot{x} = 0$, $\dot{y} = -y$. This system exhibits a line attractor along the $x$-axis, at $y = 0$ (Fig. 6), and trajectories flow perpendicularly towards the attractor. We added white Gaussian noise to the training data. While the hidden state dynamics of the trained network do not perfectly match those of an analytic line attractor, there exists a small subinterval near each of the fixed points acting as a pseudo-line attractor (Fig. 6). As such, the affine transformation (linear readout) can scale and reorient this subinterval on a finite domain. Since all attractors of a $d$-dimensional GRU are bound to $(-1, 1)^d$, no line attractor can extend infinitely in any given direction, which matches well with the GRU's inability to perform unbounded counting, as the continuous analogue of such a task would require a trajectory to move along such an attractor.

5.3 Ring Attractor

For this experiment, a dynamical system representing a standard ring attractor of radius one is used; in polar coordinates, $\dot{\rho} = \rho(1 - \rho^2)$ and $\dot{\phi} = 0$. This system exhibits an attracting ring of fixed points, centered around an unstable fixed point at the origin. We added Gaussian noise to the training data.

In our analysis we did not observe two GRUs exhibiting this set of dynamics, and the results of this experiment, shown in Figure 6, suggest that they cannot. Rather, the hidden state dynamics fall into one of the observed finite fixed point topologies (see case xxix in Appendix D). In addition, we robustly see this over multiple initializations, and the quality of approximation improves as the dimensionality of the GRU increases (Fig. 7), suggesting that many GRUs are required to obtain a sufficient approximation of this set of dynamics for a practical task [11].

5.4 Working Memory Task

To stress the importance of this analysis in the context of a task, our theory will be used on the 2-sequence problem [4]. In this experiment, we train a 2D GRU to classify a given ten-element input sequence as one of two categories, based only on the first three elements. These first three elements are either all 1's or all -1's, with equal probability. The remaining elements of each sequence are drawn from a Gaussian distribution with zero mean and variance $\sigma^2$.
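A minimal sketch of a data generator for this task, following the description above; the sequence count, array layout, and the use of the noise scale as a standard deviation are illustrative choices.

```python
import numpy as np

def make_two_sequence_batch(n_sequences, sigma, seq_len=10, rng=None):
    """2-sequence problem: the first three elements are all +1 or all -1 (the label);
    the remaining elements are zero-mean Gaussian noise with standard deviation sigma."""
    rng = rng or np.random.default_rng()
    labels = rng.integers(0, 2, size=n_sequences)   # 0 or 1 with equal probability
    signs = np.where(labels == 1, 1.0, -1.0)
    x = rng.normal(0.0, sigma, size=(n_sequences, seq_len))
    x[:, :3] = signs[:, None]                       # overwrite the first three elements
    return x, labels

x, y = make_two_sequence_batch(n_sequences=100, sigma=1.0)
print(x.shape, y[:5], x[0, :4])
```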

We train the network to minimize the cross entropy loss. The hidden states are initialized randomly. The RNN was then trained for 4000 epochs using ADAM, with multiple sequences used for training. Figure 8 depicts the learned underlying dynamics for the two noise levels (1 and 2.5), along with the trajectories of the test data for each. The final test losses were low for both noise levels.

Figure 8: (A), (C): planar dynamical systems learned by 2 GRUs for the 2-sequence problem, with the noise level set to 1 and 2.5 respectively and the weights initialized randomly. (B), (D): corresponding trajectories of the test data with respect to (A) and (C) respectively. Red dots indicate the initial condition of each trajectory, and the light and dark blue trajectories distinguish between the two classes, as learned by the RNN.

Under this specific set of initial parameters, we can see that the GRU-RNN solves the problem using two different topologies depending on the noise variance. While the task is not explicitly stated in terms of a continuous dynamical system, the geometry of each solution gives immediate insight into how the RNN will perform on more general sequences. At the lower noise level, the system exhibits two stable fixed points separated by a saddle (Fig. 8A), and can therefore be expected to generalize well to sequences of arbitrary length, as the added elements will not be able to escape the basin of attraction of the sequence's corresponding fixed point. However, given a sequence with sufficiently large noise, the trajectory may be thrown into the basin of attraction of the wrong fixed point. At the higher noise level, the trajectories fall into a limit cycle (Fig. 8C). As such, this solution will not generalize well to sequences of arbitrary length, as the classification depends on where along the periodic orbit the trajectory ends up.

6 Conclusion

Gated neural networks were designed and used primarily as memory devices, but the temporal computation aspect is also important. Recurrent neural networks with gated structure have also been used to interpret cortical circuit dynamics [8] and to design canonical circuits using spiking neural networks [12, 3]. Understanding not only the memory structure, but also what neural networks can compute over time, is critical for extending the horizon of recurrent neural networks. Our analysis provides an intuition for how GRUs express dynamic behavior in continuous time, and reveals the rich but limited classes of dynamics the GRU can approximate in one and two dimensions. We show that 2D GRUs can exhibit a variety of expressive dynamical features, such as limit cycles, uncountably many homoclinic orbits, and a substantial catalog of stability structures and bifurcations. These claims were experimentally verified in two dimensions, and then used to analyze the potential for generalization of a trained GRU-RNN on a working memory task. We believe these findings also unlock new avenues of research on the trainability of recurrent neural networks.

References

  • [1] Randall D Beer. On the dynamics of small Continuous-Time recurrent neural networks. Adapt. Behav., 3(4):469–509, March 1995.
  • [2] Randall D Beer. Parameter space structure of continuous-time recurrent neural networks. Neural Comput., 18(12):3009–3051, December 2006.
  • [3] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems 31, pages 787–797. Curran Associates, Inc., 2018.
  • [4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994. doi: 10.1109/72.279181.
  • [5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat], June 2014. URL http://arxiv.org/abs/1406.1078.
  • [6] K. Choi, G. Fazekas, M. Sandler, and K. Cho. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2392–2396, March 2017. doi: 10.1109/ICASSP.2017.7952585.
  • [7] Mark M. Churchland and John P. Cunningham. A Dynamical Basis Set for Generating Reaches. Cold Spring Harbor Symposia on Quantitative Biology, 79:67–80, 2014. doi: 10.1101/sqb.2014.79.024703.
  • [8] Rui Costa, Ioannis Alexandros Assael, Brendan Shillingford, Nando de Freitas, and Tim Vogels. Cortical microcircuits as gated-recurrent neural networks. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems 30, pages 272–283. Curran Associates, Inc., 2017.
  • [9] K Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Trans. Neural Netw., 1993.
  • [10] Debidatta Dwibedi, Pierre Sermanet, and Jonathan Tompson. Temporal Reasoning in Videos Using Convolutional Gated Recurrent Units. pages 1111–1116, 2018. URL http://openaccess.thecvf.com/content_cvpr_2018_workshops/w19/html/Dwibedi_Temporal_Reasoning_in_CVPR_2018_paper.html.
  • [11] Ken-ichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, January 1993. doi: 10.1016/S0893-6080(05)80125-X. URL http://www.sciencedirect.com/science/article/pii/S089360800580125X.
  • [12] David J Heeger and Wayne E Mackey. ORGaNICs: A theory of working memory in brains and machines. arXiv:1803.06288 [cs], March 2018.
  • [13] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, TU Munich, 1991. Advisor J. Schmidhuber.
  • [14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
  • [15] Eugene M. Izhikevich. Dynamical systems in neuroscience. MIT Press, 2007.
  • [16] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. URL http://arxiv.org/abs/1412.6980.
  • [17] Yuri A. Kuznetsov. Elements of Applied Bifurcation Theory (2nd Ed.). Springer-Verlag, Berlin, Heidelberg, 1998. ISBN 978-0-387-98382-0.
  • [18] Thomas Laurent and James von Brecht. A recurrent neural network without chaos. arXiv:1612.06212 [cs], December 2016. URL http://arxiv.org/abs/1612.06212.
  • [19] J. Meiss. Differential Dynamical Systems. Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics, January 2007. doi: 10.1137/1.9780898718232. URL https://epubs.siam.org/doi/book/10.1137/1.9780898718232.
  • [20] Chethan Pandarinath, Daniel J. O'Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D. Stavisky, Jonathan C. Kao, Eric M. Trautmann, Matthew T. Kaufman, Stephen I. Ryu, Leigh R. Hochberg, Jaimie M. Henderson, Krishna V. Shenoy, L. F. Abbott, and David Sussillo. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, October 2018. doi: 10.1038/s41592-018-0109-9. URL https://www.nature.com/articles/s41592-018-0109-9.
  • [21] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly. A Comparison of Sequence-to-Sequence Models for Speech Recognition. In Interspeech 2017, pages 939–943. ISCA, August 2017. doi: 10.21437/Interspeech.2017-233. URL http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0233.html.
  • [22] David Sussillo and Omri Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, December 2012. doi: 10.1162/NECO_a_00409. URL https://www.mitpressjournals.org/doi/10.1162/NECO_a_00409.
  • [23] Gail Weiss, Yoav Goldberg, and Eran Yahav. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. arXiv:1805.04908 [cs, stat], May 2018. URL http://arxiv.org/abs/1805.04908.
  • [24] Kong-Fatt Wong and Xiao-Jing Wang. A recurrent network mechanism of time integration in perceptual decisions. The Journal of Neuroscience, 26(4):1314–1328, January 2006. doi: 10.1523/JNEUROSCI.3733-05.2006.
  • [25] Yuan Zhao and Il Memming Park. Interpretable nonlinear dynamic modeling of neural trajectories. In Advances in Neural Information Processing Systems (NIPS), 2016.

Appendix A Continuous Time System Derivation

We begin with the fully gated GRU as a discrete time system, where the input vector $x_t$ has been set equal to zero, as depicted in (10) - (12), where $\odot$ is the Hadamard product and $\sigma$ is the element-wise logistic sigmoid function:
$$z_t = \sigma(U_z h_{t-1} + b_z), \tag{10}$$
$$r_t = \sigma(U_r h_{t-1} + b_r), \tag{11}$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tanh\!\left(U_h (r_t \odot h_{t-1}) + b_h\right). \tag{12}$$

We recognize that (12) is a forward Euler discretization of a continuous time dynamical system. This allows us to consider the underlying continuous time dynamics on the basis of the discretization. The following steps are a walkthrough of the derivation.

Since $\sigma$ is a bounded function with range $(0, 1)$, there exists a function $\bar{z}_t$ such that $\bar{z}_t = 1 - z_t$ at each time step (due to the symmetry of $\sigma$, $\bar{z}_t$ is the result of vertically flipping $\sigma$ about $1/2$, the midpoint of its range). As such, we can rewrite (12) with $\bar{z}_t$ as depicted in (13),
$$h_t = (1 - \bar{z}_t) \odot h_{t-1} + \bar{z}_t \odot \tanh\!\left(U_h (r_t \odot h_{t-1}) + b_h\right), \tag{13}$$
where
$$\bar{z}_t = 1 - z_t = \sigma(-U_z h_{t-1} - b_z). \tag{14}$$

Let $\Delta h_t := h_t - h_{t-1}$. As a result, we can say $h_t = h_{t-1} + \Delta h_t$ and rearrange (13), as depicted in (17),
$$\Delta h_t = \bar{z}_t \odot \left(-h_{t-1} + \tanh\!\left(U_h (r_t \odot h_{t-1}) + b_h\right)\right), \tag{17}$$
where, viewing the gates as functions of the hidden state,
$$\bar{z}(h) = \sigma(-U_z h - b_z), \tag{18}$$
$$r(h) = \sigma(U_r h + b_r), \tag{19}$$
so that $\bar{z}_t = \bar{z}(h_{t-1})$ and $r_t = r(h_{t-1})$.

Let $\Delta t \in (0, 1]$ define an arbitrary time interval. Then (17) becomes
$$h(t + \Delta t) - h(t) = \Delta t \, \bar{z}(h(t)) \odot \left(-h(t) + \tanh\!\left(U_h (r(h(t)) \odot h(t)) + b_h\right)\right). \tag{20}$$

Dividing both sides of the equation by $\Delta t$ yields (21),
$$\frac{h(t + \Delta t) - h(t)}{\Delta t} = \bar{z}(h(t)) \odot \left(-h(t) + \tanh\!\left(U_h (r(h(t)) \odot h(t)) + b_h\right)\right). \tag{21}$$

If we take the limit as $\Delta t \to 0$, we get the continuous time system analogous to (10) - (12),
$$\dot{h} = \bar{z}(h) \odot \left(-h + \tanh\!\left(U_h (r(h) \odot h) + b_h\right)\right), \tag{22}$$
where $\bar{z}$ and $r$ are given by (18) and (19).

Finally, we can rewrite (22) as follows:
$$\dot{h} = \left(-h + \tanh\!\left(U_h (r(h) \odot h) + b_h\right)\right) \oslash \tau(h), \tag{23}$$
where
$$\tau(h) = 1 \oslash \bar{z}(h) = 1 \oslash (1 - z(h)), \tag{24}$$
and $\oslash$ denotes point-wise division.

Appendix B Single GRU Fixed Point Proofs

The fixed points of our continuous time system (22) exist where the derivative $\dot{h} = 0$. In the single GRU case, the Hadamard product reduces to standard scalar multiplication, yielding
$$0 = \left(-h^* + \tanh\!\left(U_h\, r(h^*)\, h^* + b_h\right)\right) / \tau(h^*), \tag{25}$$
where $\tau$ and $r$ are defined by (24) and (19) respectively, and $h^*$ represents a solution of (25).

Since $\tau(h^*)$ is positive and finite, this factor can be removed, indicating that the update gate does not play a part in the stability of the system. For simplicity, let us expand $r$ in (25) by its definition (19),
$$h^* = \tanh\!\left(U_h\, \sigma(U_r h^* + b_r)\, h^* + b_h\right). \tag{26}$$

Lemma 1.

For any choice of the parameters $U_h$, $U_r$, $b_h$, and $b_r$, there exists at least one $h^*$ such that (26) is satisfied.

Proof.

The hyperbolic tangent function is continuous and bounded, with range $(-1, 1)$, so $\tanh(U_h\, \sigma(U_r h + b_r)\, h + b_h)$ is continuous and bounded in $h$. The function $-h$ is monotonic and achieves all values on $\mathbb{R}$, as it is bijective on $\mathbb{R}$. Thus, their sum is unbounded and obtains every value in $\mathbb{R}$ at least once. By the intermediate value theorem, there is at least one point $h^*$ at which this sum equals zero, i.e., (26) is satisfied, regardless of the choice of parameters. ∎

Lemma 2.

There exists a set of parameters such that there exist one, two, or three solutions to (26).

Proof.

We prove this lemma by showing the existence of each case. Fix all of the parameters except one, and allow the remaining parameter to vary; the existence of each of the three cases is shown in Figure 1. At one end of this parameter sweep, there exists a single solution to (26). As the parameter varies continuously, a second root appears and then splits in two. Analogously, the system (22) goes through a saddle-node bifurcation, where a half-stable fixed point appears and then splits into a stable/unstable fixed point pair. ∎

Theorem 1.

For any choice of the parameters $U_h$, $U_r$, $b_h$, and $b_r$, there can only exist one, two, or three solutions to (26), and all solutions lie in the interval $(-1, 1)$.

Proof.

We begin with the argument of the hyperbolic tangent function in (26),
$$g(h) = U_h\, \sigma(U_r h + b_r)\, h + b_h. \tag{27}$$

Taking the derivative of (27) yields
$$g'(h) = U_h\, \sigma(U_r h + b_r) + U_h U_r h\, \sigma(U_r h + b_r)\left(1 - \sigma(U_r h + b_r)\right). \tag{28}$$

Setting (28) to zero and simplifying allows us to find the nontrivial critical points of (27), as shown in (29). Note that if $U_r = 0$, (27) is equal to the affine function $U_h\, \sigma(b_r)\, h + b_h$, yielding no critical points.
$$1 + U_r h \left(1 - \sigma(U_r h + b_r)\right) = 0 \tag{29}$$

Let $y = U_r h + b_r$ and $u = y - b_r + 1$, and solve (29) for $h$:
$$h = -\frac{1 + W\!\left(e^{\,b_r - 1}\right)}{U_r}, \tag{30}$$
where $W$ is the principal branch of the Lambert W function. Therefore, (27) has exactly one local maximum or minimum, so long as $U_r \neq 0$.

Now consider
$$\tanh\!\left(U_h\, \sigma(U_r h + b_r)\, h + b_h\right) = \tanh(g(h)). \tag{31}$$

The hyperbolic tangent function preserves intervals of monotonic behavior in its argument. Therefore, (31) has at most one local maximum or minimum.

Finally, we take into account the fact that the hyperbolic tangent function bounds its output to the interval $(-1, 1)$: every solution of (26) satisfies $h^* \in (-1, 1)$, since outside this interval the curve (31) and the line $h$ cannot intersect. Because (31) is bounded on this interval and has at most one interior extremum, its graph can cross the line $h$ at most three times. This result in conjunction with the previous two lemmas completes the proof. ∎
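A quick numerical sanity check of the closed-form critical point (30); the scalar parameter values are arbitrary, and scipy's lambertw returns a complex number, so its real part is taken.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import lambertw

# Arbitrary scalar GRU parameters for the check (Ur must be nonzero).
Uh, Ur, br = 2.0, 3.0, -0.5

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# g'(h) from (28), with the common factor Uh * sigma(x) pulled out of the bracket.
dg = lambda h: Uh * sigmoid(Ur * h + br) * (1.0 + Ur * h * (1.0 - sigmoid(Ur * h + br)))

h_closed = -(1.0 + lambertw(np.exp(br - 1.0)).real) / Ur   # closed form (30)
h_numeric = brentq(dg, -5.0, 5.0)                          # numerical root of g'
print(h_closed, h_numeric)                                 # the two values agree
```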

Appendix C Pseudo-line attractor

See Fig 9 for an example of a pseudo-line attractor, as expressed by a 2D GRU.

Figure 9: Two GRUs exhibit a pseudo-line attractor. The nullclines intersect at only one point, but are close enough over a finite region to mimic an analytic line attractor in practice. Panels A and B depict the phase portrait over two different regions of phase space.

Appendix D All Topological Stability Structures Observed in 2-D GRUs

Table 2 lists all observed topologies of multiple-fixed-point structures using 2D GRUs. Figures 10 - 12 display an example phase portrait from a 2D GRU system for each case listed in Table 2. Note that all fixed points are denoted by a red dot, regardless of classification. Table 3 lists the parameters used for each of the observed cases. Note that all of the update gate parameters are set to zero.

Each case in this paper was discovered by hand by considering the geometric constraints on the structure of the nullclines for both the decoupled and coupled systems (i.e., reset gate inactive and active respectively). An exhaustive analysis of the one dimensional GRU allowed for a natural extension to the two dimensional decoupled GRU. A set of base cases was then established through a combinatorial argument regarding all possible ways the decoupled nullclines (topologically conjugate to linear and cubic polynomials) can intersect. From these base cases, the reset gate can be used as a means of bending and manipulating the structure of the decoupled nullclines in order to obtain new intersection patterns in the coupled system.

Fig Uh11 Uh12 Uh21 Uh22 Ur11 Ur12 Ur21 Ur22 bh1 bh2 br1 br2
2 3 0 0 3 0 0 0 0 0 0 0 0
3a 2 0 0 2 5 8 8 5 0 0 5 5
3b 2 0 0 2 -1 0 0 -1 0 0 0 0
3c 2 0 0 2 1 -2 3 1 -0.06 0 0.2 -0.85
4a 2 0 0 2 5 9 5 9 0 0 0 0
4b 2 0 0 2 5 9 9 5 0 0 0 0
5a 1.5 -2.598 2.598 1.5 0 0 0 0 0 0 0 0
5b 2.4271 -1.7634 1.7634 2.4271 0 0 0 0 0 0 0 0
5c 2.9665 -0.4471 0.4471 2.9665 0 0 0 0 0 0 0 0
9 0.1 -0.1 -1 0 0 0 0 0 0 0 0 0
Table 1: Parameters Used for all Phase Portraits in Section 4
Case | Total Fixed Points | Stable | Unstable | Saddle | Saddle Point and Stable Node Collisions | Saddle Point and Unstable Node Collisions | Codim. 2 Bifurcation Point | Fig
i 2 1 - - 1 - - 9i
ii 3 2 - 1 - - - 9ii
iii 3 1 - - 2 - - 9iii
iv 4 1 - - 2 - 1 9iv
v 4 2 - 1 - - 1 9v
vi 4 2 - 1 - 1 - 9vi
vii 4 2 - 1 1 - - 9vii
viii 4 1 - - 3 - - 9viii
ix 5 2 1 2 - - - 9ix
x 5 3 - 2 - - - 9x
xi 5 3 - 1 - 1 - 9xi
xii 5 2 - 1 - 2 - 9xii
xiii 5 2 - 1 1 1 - 10i
xiv 5 - 1 - 4 - - 10ii
xv 6 2 - 1 2 1 - 10iii
xvi 6 3 - 2 - 1 - 10iv
xvii 6 2 1 2 1 - - 10v
xviii 6 3 - 2 - - 1 10vi
xix 6 2 1 2 - 1 - 10vii
xx 6 3 - 2 1 - - 10viii
xxi 6 1 1 1 3 - - 10ix
xxii 7 3 1 3 - - - 10x
xxiii 7 2 2 3 - - - 10xi
xxiv 7 4 3 - - - - 10xii
xxv 7 2 1 2 2 - - 11i
xxvi 7 3 - 2 2 - - 11ii
xxvii 7 3 - 2 1 1 - 11iii
xxviii 8 4 - 3 - 1 - 11iv
xxix 8 3 1 3 1 - - 11v
xxx 8 3 - 2 2 1 - 11vi
xxxi 9 4 1 4 - - - 11vii
xxxii 9 3 1 3 2 - - 11viii
xxxiii 9 5 - 4 - - - 11ix
xxxiv 10 4 1 4 1 - - 11x
xxxv 10 5 - 4 - 1 - 11xi
xxxvi 11 5 1 5 - - - 11xii
Table 2: Multiple Fixed Point Stability Structures Obtainable with 2D GRUs
Figure 10: Thirty six multiple fixed-point topologies obtainable with 2D GRUs, depicted in phase space. Orange and pink lines represent the x and y nullclines respectively. Red dots indicate fixed points. Each subfigure contains 64 purple lines, indicating trajectories in forward time, whose initial conditions were chosen to be evenly spaced on the vertices of a square grid on . Direction of the flow is determined by the black arrows, and the underlaying color map represents the magnitude of the velocity of the flow in log scale.
Figure 11: Thirty six multiple fixed-point topologies obtainable with 2D GRUs, depicted in phase space. Orange and pink lines represent the x and y nullclines respectively. Red dots indicate fixed points. Each subfigure contains 64 purple lines, indicating trajectories in forward time, whose initial conditions were chosen to be evenly spaced on the vertices of a square grid on . Direction of the flow is determined by the black arrows, and the underlaying color map represents the magnitude of the velocity of the flow in log scale.
Figure 12: Thirty six multiple fixed-point topologies obtainable with 2D GRUs, depicted in phase space. Orange and pink lines represent the x and y nullclines respectively. Red dots indicate fixed points. Each subfigure contains 64 purple lines, indicating trajectories in forward time, whose initial conditions were chosen to be evenly spaced on the vertices of a square grid on . Direction of the flow is determined by the black arrows, and the underlaying color map represents the magnitude of the velocity of the flow in log scale.
Case Uh11 Uh12 Uh21 Uh22 Ur11 Ur12 Ur21 Ur22 bh1 bh2 br1 br2
i 2 2 2 2 100 100 100 100 0 0 0 0
ii 2 2 2 2 0 0 0 0 0 0 0 0
iii 1.2 0 0 2 5 12 8 5 0 0 -0.22 -0.5
iv 2 0 0 2 -1 0 0 -1 0 0 0 0
v 2 0 0 2 1 -1 1 1 0 0 0 0
vi 2 0 0 2 1 -2 3 1 -0.06 0 0.2 -0.85
vii 2 0 0 4 1 -2 3 1 -0.06 0 -0.3 -0.42
viii 1.2 0 0 2 5 20.5 8 5 0 0 -0.275 -3.3
ix 3 3 0 3 0 0 0 0 0 0 0 0
x 6 0 0 6 0 -2 0 0 0 0 -1.695 0
xi 6 0 0 6 0 1 0 0 0 0 -2.5 0
xii 2 0 0 2 1 -2 3 1 -0.055 0 -0.1 -0.02
xiii 2 0 0 4 1 -2 3 1 -0.06 0 -0.085 -0.22
xiv 2.9674 -0.4409 0.4409 2.9674 0 0 0 0 0 0 0 0
xv 6 0 0 2 -1 0 0 0 0 0 0 0
xvi 2 0 0 2 1 -2 3 1 -0.06 0 -0.08 3
xvii 2 0 0 2 1 -2 3 1.1 -0.06 0 0.2 0
xviii 2 0 0 2 3 2 2 3 0 0 0 0
xix 10 0 0 6 -3 5 -5 3 0 0 -3 -3
xx 1.2 0 0 2 -17 35 10 -8 0 0 1.2 -1.2
xxi 2.9763 -0.376 0.376 2.9763 0 0 0 0 0.0315 0 -0.015 -0.015
xxii 6 0 0 6 0 -2 0 0 0 0 0 0
xxiii 3 0 0 3 -3 -5 -5 -3 0 0 -2 -2
xxiv 1.5 0 0 2 -4 -7 8 5 0 0 -0.4 -1.2
xxv 2 0 0 3 12.4 11.6 -8 -5 0 0 1.8 7
xxvi 3 0 0 3 5.175 9 9 5.175 0.3 0.3 3.95 3.95
xxvii 7 0 0 3 6 3 9 6 0.62 0.373 4 3.4
xxviii 6 0 0 10 0 0 -5 -8 -0.08 0 0 -2
xxix 2 0 0 3 -10 -11.6 8 5 0 0 4 4.8
xxx 3 0 0 3 5.26 9 9 5.26 0.25 0.25 3.95 3.95
xxxi 3 0 0 3 0 0 0 0 0 0 0 0
xxxii 2 0 0 2 5 8 8 5 0 0 4.4 4.4
xxxiii 3 0 0 3 6 9 9 6 0.3 0.3 3.75 3.75
xxxiv 1 0 0 1 5 8 8 5 0 0 4.4 4.9
xxxv 3 0 0 3 6 9 9 6 0.24 0.24 3.95 3.95
xxxvi 2 0 0 2 5 8 8 5 0 0 5 5
Table 3: Parameters of each multiple fixed-point stability structure example
