A Formal Hierarchy of RNN Architectures

Authors

  • William Cooper Merrill
  • Gail Weiss
  • Yoav Goldberg
  • Roy Schwartz
  • Noah A. Smith
  • Eran Yahav
  • ACL 2020

Abstract

We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which measures the RNN's memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, which formally separates it from the related QRNN (Bradbury et al., 2016). We also show how these models' expressive capacity is expanded by stacking multiple layers or composing them with different pooling functions. Our results build on the theory of "saturated" RNNs (Merrill, 2019). While formally extending these findings to unsaturated RNNs is left to future work, we hypothesize that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy. Experimental findings from training unsaturated networks on formal languages support this conjecture.

1 Introduction

While neural networks are central to the performance of today's strongest NLP systems, theoretical understanding of the formal properties of different kinds of networks is still limited. It is established, for example, that the Elman (1990) RNN is Turing-complete, given infinite precision and computation time (Siegelmann and Sontag, 1992, 1994; Chen et al., 2018). But tightening these unrealistic assumptions has serious implications for expressive power (Weiss et al., 2018), leaving a significant gap between classical theory and practice, which the theorems in this paper attempt to address.

Recently, Peng et al. (2018) introduced rational RNNs, a subclass of RNNs whose internal state can be computed by independent weighted finite automata (WFAs). Intuitively, such models have a computationally simpler recurrent update than conventional models like long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997). Empirically, rational RNNs like the quasi-recurrent neural network (QRNN; Bradbury et al., 2016) and unigram rational RNN (Dodge et al., 2019) perform comparably to the LSTM, with a smaller computational budget. Still, the underlying simplicity of rational models raises the question of whether their expressive power is fundamentally limited compared to other RNNs.

In a separate line of work, Merrill (2019) introduced the saturated RNN 1 as a formal model for analyzing the capacity of RNNs. A saturated RNN is a simplified network where all activation functions have been replaced by step functions. The saturated network may be seen intuitively as a "stable" version of its original RNN, in which the internal activations act discretely. A growing body of work, including this paper, finds that the saturated theory predicts differences in practical learnable capacity for various RNN architectures (Weiss et al., 2018; Merrill, 2019; Suzgun et al., 2019a).

We compare the expressive power of rational and non-rational RNNs, distinguishing between state expressiveness (what kind and amount of information the RNN states can capture) and language expressiveness (what languages can be recognized when the state is passed to a classifier). To do this, we build on the theory of saturated RNNs.

State Expressiveness

We introduce a unified hierarchy (Figure 1) of the functions expressible by the states of rational and non-rational RNN encoders. The hierarchy is defined by two formal properties: space complexity, which is a measure of network memory, 2 and rational recurrence, i.e., whether the internal structure of the RNN can be described by WFAs. The hierarchy reveals concrete differences between LSTMs and QRNNs, and further separates both from a class containing convolutional neural networks (CNNs; LeCun and Bengio, 1995; Kim, 2014), Elman RNNs, and gated recurrent units (GRUs; Cho et al., 2014).

Figure 1: Hierarchy of state expressiveness for saturated RNNs and related models. The y axis represents increasing space complexity. ∅ means provably empty. Models are in bold with qualitative descriptions in gray.

We provide the first formal proof that LSTMs can encode functions that rational recurrences cannot. On the other hand, we show that the saturated Elman RNN and GRU are rational recurrences with constant space complexity, whereas the QRNN has unbounded space complexity. We also show that an unrestricted WFA has rich expressive power beyond any saturated RNN we consider, including the LSTM. This difference potentially opens the door to more expressive RNNs that incorporate the computational efficiency of rational recurrences.

Language expressiveness When applied to classification tasks like language recognition, RNNs are typically combined with a "decoder": additional layer(s) that map their hidden states to a prediction. Thus, despite differences in state expressiveness, rational RNNs might be able to achieve comparable empirical performance to non-rational RNNs on NLP tasks. In this work, we consider the setup in which the decoder only views the final hidden state of the RNN. 3 We demonstrate that a sufficiently strong decoder can overcome some of the differences in state expressiveness between different models. For example, an LSTM can recognize a^n b^n with a single decoding layer, whereas a QRNN provably cannot until the decoder has two layers. However, we also construct a language that an LSTM can recognize without a decoder, but a QRNN cannot recognize with any decoder. Thus, no decoder can fully compensate for the weakness of the QRNN compared to the LSTM.

Experiments Finally, we conduct experiments on formal languages, showing that our theorems correctly predict which languages unsaturated recognizers trained by gradient descent can learn. Thus, we view our hierarchy as a useful formal tool for understanding the relative capabilities of different RNN architectures.

Roadmap We present the formal devices for our analysis of RNNs in Section 2. In Section 3 we develop our hierarchy of state expressiveness for single-layer RNNs. In Section 4, we shift to study RNNs as language recognizers. Finally, in Section 5, we provide empirical results evaluating the relevance of our predictions for unsaturated RNNs.

2 Building Blocks

In this work, we analyze RNNs using formal models from automata theory-in particular, WFAs and counter automata. In this section, we first define the basic notion of an encoder studied in this paper, and then introduce more specialized formal concepts: WFAs, counter machines (CMs), space complexity, and, finally, various RNN architectures.

2.1 Encoders

We view both RNNs and automata as encoders: machines that can be parameterized to compute a set of functions f : Σ* → Q^k, where Σ is an input alphabet and Q is the set of rational reals. Given an encoder M and parameters θ, we use M_θ to represent the specific function that the parameterized encoder computes. For each encoder, we refer to the set of functions that it can compute as its state expressiveness. For example, a deterministic finite state acceptor (DFA) is an encoder whose parameters are its transition graph. Its state expressiveness is the set of indicator functions for the regular languages.

2.2 WFAs

Formally, a WFA is a non-deterministic finite automaton where each starting state, transition, and final state is weighted. Let Q denote the set of states, Σ the alphabet, and Q the rational reals. 4 This weighting is specified by three functions:

1. Initial state weights λ : Q → Q
2. Transition weights τ : Q × Σ × Q → Q
3. Final state weights ρ : Q → Q

The weights are used to encode any string x ∈ Σ*:

Definition 1 (Path score). Let π be a path of the form q_0 →^{x_1} q_1 →^{x_2} ⋯ →^{x_t} q_t through WFA A. The score of π is given by

A[π] = λ(q_0) ( ∏_{i=1}^{t} τ(q_{i−1}, x_i, q_i) ) ρ(q_t).

By Π(x), denote the set of paths producing x.

Definition 2 (String encoding). The encoding computed by a WFA A on string x is

A[x] = Σ_{π ∈ Π(x)} A[π].
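To make Definitions 1 and 2 concrete, the following minimal Python sketch (not from the paper) represents the transition weights τ as one matrix per symbol, so that the encoding A[x] reduces to a chain of matrix products; the example automaton that counts occurrences of "a" is an illustrative choice.

```python
import numpy as np

class WFA:
    """Weighted finite automaton over the rationals (Definitions 1-2)."""
    def __init__(self, init, trans, final):
        self.init = np.asarray(init, dtype=float)      # lambda : Q -> Q
        self.trans = {s: np.asarray(m, dtype=float) for s, m in trans.items()}  # tau
        self.final = np.asarray(final, dtype=float)    # rho : Q -> Q

    def encode(self, x):
        # A[x] = sum over paths of lambda(q0) * prod tau(...) * rho(qt),
        # computed by multiplying one transition matrix per input symbol.
        v = self.init
        for sym in x:
            v = v @ self.trans[sym]
        return float(v @ self.final)

# Illustrative example: a 2-state WFA that counts occurrences of "a".
count_a = WFA(
    init=[1.0, 0.0],
    trans={
        "a": [[1.0, 1.0], [0.0, 1.0]],   # q0 -a/1-> q0, q0 -a/1-> q1, q1 -a/1-> q1
        "b": [[1.0, 0.0], [0.0, 1.0]],   # b keeps both states with weight 1
    },
    final=[0.0, 1.0],
)
assert count_a.encode("abaab") == 3.0
```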

Hankel matrix Given a function f : Σ * → Q and two enumerations α, ω of the strings in Σ * , we define the Hankel matrix of f as the infinite matrix

[H_f]_{ij} = f(α_i • ω_j),   (1)

where • denotes concatenation. It is sometimes convenient to treat H_f as though it is directly indexed by Σ*, e.g. [H_f]_{α_i, ω_j} = f(α_i • ω_j), or to refer to a sub-block of a Hankel matrix, row- and column-indexed by prefixes and suffixes P, S ⊆ Σ*. The following result relates the Hankel matrix to WFAs:

Theorem 1 (Carlyle and Paz, 1971; Fliess, 1974) . For any f : Σ * → Q, there exists a WFA that computes f if and only if H f has finite rank.

Rational series (Sakarovitch, 2009) For all k ∈ N, f : Σ* → Q^k is a rational series if there exist WFAs A_1, …, A_k such that, for all x ∈ Σ* and 1 ≤ i ≤ k, A_i[x] = f_i(x).

2.3 Counter Machines

We now turn to introducing a different type of encoder: the real-time counter machine (CM; Merrill, 2020; Fischer, 1966; Fischer et al., 1968). CMs are deterministic finite-state machines augmented with finitely many integer counters. While processing a string, the machine updates these counters, and may use them to inform its behavior. We view counter machines as encoders mapping Σ* → Z^k. For m ∈ N and ◦ ∈ {+, −, ×}, let ◦m denote the function f(n) = n ◦ m.

Definition 3 (General CM; Merrill, 2020). A k-counter CM is a tuple ⟨Σ, Q, q_0, u, δ⟩ with

1. A finite alphabet Σ
2. A finite set of states Q, with initial state q_0
3. A counter update function u : Σ × Q × {0, 1}^k → {×0, −1, +0, +1}^k
4. A state transition function δ : Σ × Q × {0, 1}^k → Q

A CM processes input tokens {x_t}_{t=1}^{n} sequentially. Denoting by ⟨q_t, c_t⟩ ∈ Q × Z^k a CM's configuration at time t, we define its next configuration:

q_{t+1} = δ(x_{t+1}, q_t, 1_{=0}(c_t))   (2)
c_{t+1} = u(x_{t+1}, q_t, 1_{=0}(c_t))(c_t)   (3)

where 1_{=0} is a broadcasted "zero-check" operation, i.e., 1_{=0}(v)_i ≜ 1_{=0}(v_i). In (2) and (3), note that the machine only views the zeroness of each counter, and not its actual value. A general CM's encoding of a string x is the value of its counter vector c_t after processing all of x.
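As an illustration of Definition 3 and the update in (2)-(3) (a sketch, not the paper's notation), the simulator below makes explicit that the update and transition functions only ever see the zero-check vector, never the raw counter values; the example machine implements rectified counting on a*b* and reappears later as Figure 3.

```python
def run_cm(q0, u, delta, x, k):
    """Simulate a k-counter CM on string x (Definition 3).

    u(sym, q, z)     -> tuple of k operations from {"x0", "-1", "+0", "+1"}
    delta(sym, q, z) -> next state
    where z is the broadcast zero-check 1_{=0}(c) of the counter vector c.
    """
    ops = {"x0": lambda n: 0, "-1": lambda n: n - 1,
           "+0": lambda n: n, "+1": lambda n: n + 1}
    q, c = q0, [0] * k
    for sym in x:
        z = tuple(int(ci == 0) for ci in c)   # the machine sees only zeroness
        c = [ops[o](ci) for o, ci in zip(u(sym, q, z), c)]
        q = delta(sym, q, z)
    return q, c                                # the encoding is the counter vector

# Illustrative 1-counter CM (rectified counting on a*b*, cf. Figure 3 below):
u = lambda sym, q, z: ("+1",) if sym == "a" else (("+0",) if z[0] else ("-1",))
delta = lambda sym, q, z: q
assert run_cm(0, u, delta, "aaabbbbb", 1)[1] == [0]
assert run_cm(0, u, delta, "aaaab", 1)[1] == [3]
```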

Restricted CMs

1. A CM is Σ-restricted iff u and δ depend only on the current input σ ∈ Σ.
2. A CM is (Σ × Q)-restricted iff u and δ depend only on the current input σ ∈ Σ and the current state q ∈ Q.
3. A CM is Σ^w-restricted iff it is (Σ × Q)-restricted, and the states Q are windows over the last w input tokens, e.g., Q = Σ^{≤w}. 5

These restrictions prevent the machine from being "counter-aware": u and δ cannot condition on the counters' values. As we will see, restricted CMs have natural parallels in the realm of rational RNNs. In Subsection 3.2, we consider the relationship between counter awareness and rational recurrence.

2.4 Space Complexity

As in Merrill (2019) , we also analyze encoders in terms of state space complexity, measured in bits.

Definition 4 (Bit complexity). An encoder M : Σ* → Q^k has T(n) space iff

max_θ |{ s_{M_θ}(x) : x ∈ Σ^{≤n} }| = 2^{T(n)},

where s_{M_θ}(x) is a minimal representation 6 of M's internal configuration immediately after x.

We consider three asymptotic space complexity classes: Θ(1), Θ(log n), and Θ(n), corresponding to encoders that can reach a constant, polynomial, and exponential (in sequence length) number of configurations respectively. Intuitively, encoders that can dynamically count but cannot use more complex memory like stacks-such as all CMs-are in Θ(log n) space. Encoders that can uniquely encode every input sequence are in Θ(n) space.

2.5 Saturated Networks

A saturated neural network is a discrete approximation of a neural network considered by Merrill (2019), who calls it an "asymptotic network." Given a parameterized neural encoder M_θ(x), we construct the saturated network as

s-M_θ(x) = lim_{N→∞} M_{Nθ}(x),   (4)

where N θ denotes the parameters θ multiplied by a scalar N . This transforms each "squashing" function (sigmoid, tanh, etc.) to its extreme values (0, ±1). In line with prior work (Weiss et al., 2018; Merrill, 2019; Suzgun et al., 2019b) , we consider saturated networks a reasonable approximation for analyzing practical expressive power. For clarity, we denote the saturated approximation of an architecture by prepending it with s, e.g., s-LSTM.
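The limit in (4) can be visualized with a small numpy sketch (illustrative, not from the paper): scaling a sigmoid gate's parameters by a growing N drives its output to its extreme value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -2.0]), 0.3          # arbitrary gate parameters
x = np.array([0.4, 0.1])

for N in [1, 10, 100, 1000]:
    # Parameters scaled by N, as in s-M_theta(x) = lim_{N -> inf} M_{N theta}(x).
    print(N, sigmoid(N * (w @ x + b)))
# As N grows, the gate saturates to 1 here (the pre-activation is positive);
# a negative pre-activation would saturate to 0, and tanh would go to +/-1.
```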

2.6 RNNs

A recurrent neural network (RNN) is a parameterized update function g_θ : Q^k × Q^{d_x} → Q^k, where θ are the rational-valued parameters of the RNN and d_x is the dimension of the input vector. g_θ takes as input a current state h ∈ Q^k and an input vector x ∈ Q^{d_x}, and produces the next state. Defining the initial state as h_0 = 0, an RNN can be applied to an input sequence x ∈ (Q^{d_x})* one vector at a time to create a sequence of states {h_t}_{t≤|x|}, each representing an encoding of the prefix of x up to that time step. RNNs can be used to encode sequences over a finite alphabet x ∈ Σ* by first applying a mapping (embedding) e : Σ → Q^{d_x}.

Multi-layer RNNs "Deep" RNNs are RNNs that have been arranged in L stacked layers R_1, ..., R_L. In this setting, the series of output states h_1, h_2, ..., h_{|x|} generated by each RNN on its input is fed as input to the layer above it, and only the first layer receives the original input sequence x ∈ Σ* as input. The recurrent update function g can take several forms. The original and simplest form is that of the Elman RNN. Since then, more elaborate forms using gating mechanisms have become popular, among them the LSTM, GRU, and QRNN.

Elman RNNs (Elman, 1990) Let x_t be a vector embedding of x_t. For brevity, we suppress the bias terms in this (and the following) affine operations.

h_t = tanh(W x_t + U h_{t−1}).   (5)

We refer to the saturated Elman RNN as the s-RNN.

The s-RNN has Θ(1) space (Merrill, 2019) .

LSTMs (Hochreiter and Schmidhuber, 1997) An LSTM is a gated RNN with a state vector h_t ∈ Q^k and memory vector c_t ∈ Q^k. 7

f_t = σ(W^f x_t + U^f h_{t−1})   (6)
i_t = σ(W^i x_t + U^i h_{t−1})   (7)
o_t = σ(W^o x_t + U^o h_{t−1})   (8)
c̃_t = tanh(W^c x_t + U^c h_{t−1})   (9)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (10)
h_t = o_t ⊙ tanh(c_t)   (11)

The LSTM can use its memory vector c t as a register of counters (Weiss et al., 2018) . Merrill (2019) showed that the s-LSTM has Θ(log n) space.

GRUs (Cho et al., 2014) Another kind of gated RNN is the GRU.

z_t = σ(W^z x_t + U^z h_{t−1})   (12)
r_t = σ(W^r x_t + U^r h_{t−1})   (13)
h̃_t = tanh(W^h x_t + U^h (r_t ⊙ h_{t−1}))   (14)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t   (15)

Weiss et al. (2018) found that, unlike the LSTM, the GRU cannot use its memory to count dynamically. Merrill (2019) showed the s-GRU has Θ(1) space.

QRNNs Bradbury et al. (2016) propose QRNNs as a computationally efficient hybrid of LSTMs and CNNs. Let * denote convolution over time, let W_z, W_f, W_o ∈ Q^{d_x × w × k} be convolutions with window length w, and let X ∈ Q^{n × d_x} denote the matrix of n input vectors. An ifo-QRNN (henceforth referred to as a QRNN) with window length w is defined by W_z, W_f, and W_o as follows:

Z = tanh(W_z * X)   (16)
F = σ(W_f * X)   (17)
O = σ(W_o * X)   (18)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ z_t   (19)
h_t = o_t ⊙ c_t   (20)

where z_t, f_t, o_t are respectively rows of Z, F, O. A QRNN Q can be seen as an LSTM in which all uses of the state vector h_t have been replaced with a computation over the last w input tokens; in this way it is similar to a CNN. The s-QRNN has Θ(log n) space, as the analysis of Merrill (2019) for the s-LSTM directly applies. Indeed, any s-QRNN is also a Σ^w-restricted CM extended with := ±1 ("set to ±1") operations.
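The recurrence in (16)-(20) can be sketched in a few lines of numpy (illustrative, not the authors' code). Since the text lists only W_z, W_f, and W_o while (19) uses an input gate i_t, the separate input-gate convolution W_i below is an assumption borrowed from the ifo-pooling of Bradbury et al. (2016); the window length and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, k, w = 6, 3, 4, 2                    # sequence length, input dim, hidden dim, window

X = rng.normal(size=(n, d_x))
# One weight tensor per gate, shape (window, d_x, k); W_i is assumed (ifo-pooling).
W_z, W_f, W_o, W_i = (rng.normal(size=(w, d_x, k)) for _ in range(4))

def conv(W, X, t):
    """Windowed 'convolution over time': combine the last w inputs ending at t."""
    out = np.zeros(W.shape[-1])
    for j in range(w):
        if t - j >= 0:
            out += X[t - j] @ W[j]
    return out

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

c = np.zeros(k)
for t in range(n):
    z_t = np.tanh(conv(W_z, X, t))
    f_t = sigmoid(conv(W_f, X, t))
    o_t = sigmoid(conv(W_o, X, t))
    i_t = sigmoid(conv(W_i, X, t))           # assumed input gate
    c = f_t * c + i_t * z_t                  # eq. (19)
    h = o_t * c                              # eq. (20)
print(h)
```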

3 State Expressiveness

We now turn to presenting our results. In this section, we develop a hierarchy of single-layer RNNs based on their state expressiveness. A set-theoretic view of the hierarchy is shown in Figure 2 .

Figure 2: Diagram of the relations between encoders. Neural networks are underlined. We group by asymptotic upper bound (O), as opposed to tight (Θ).

Let R be the set of rational series. The hierarchy relates Θ(log n) space to the following sets:

• RR As in Peng et al. (2018), we say that an encoder is rationally recurrent (RR) iff its state expressiveness is a subset of R.

• RR-hard An encoder is RR-hard iff its state expressiveness contains R. A Turing machine is RR-hard, as it can simulate any WFA.

• RR-complete Finally, an encoder is RR-complete iff its state expressiveness is equivalent to R. A trivial example of an RR-complete encoder is a vector of k WFAs.

The different RNNs are divided between the intersections of these classes. In Subsection 3.1, we prove that the s-LSTM, already established to have Θ(log n) space, is not RR. In Subsection 3.2, we demonstrate that encoders with restricted counting ability (e.g., QRNNs) are RR, and in Subsection 3.3, we show the same for all encoders with finite state (CNNs, s-RNNs, and s-GRUs). In Subsection 3.4, we demonstrate that none of these RNNs are RR-hard. In Appendix F, we extend this analysis from RNNs to self attention.

3.1 Counting Beyond RR

We find that encoders like the s-LSTM, which, as discussed in Subsection 2.3, are "aware" of their current counter values, are not RR. To do this, we construct a function f_0 : {a, b}* → N that requires counter awareness to compute on strings of the form a*b*, making it not rational. We then construct an s-LSTM computing f_0 over a*b*.

Let #_{a−b}(x) denote the number of as in string x minus the number of bs.

Definition 5 (Rectified counting).

f_0 : x ↦ #_{a−b}(x) if #_{a−b}(x) > 0, and 0 otherwise.

Lemma 1. For all f : {a, b}* → N, if f(a^i b^j) = f_0(a^i b^j) for all i, j ∈ N, then f ∉ R.

Proof. Consider the Hankel sub-block A_n of H_f with prefixes P_n = {a^i}_{i≤n} and suffixes S_n = {b^j}_{j≤n}. A_n is lower-triangular:

A_n =
[ 0 0 0 ⋯
  1 0 0 ⋯
  2 1 0 ⋯
  ⋮ ⋮ ⋮ ⋱ ]   (21)

Therefore rank(A_n) = n − 1. Thus, for all n, there is a sub-block of H_f with rank n − 1, and so rank(H_f) is unbounded. It follows from Theorem 1 that there is no WFA computing f.
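The unbounded-rank argument in Lemma 1 is easy to check numerically (illustrative sketch): build the sub-block A_n with entries f_0(a^i b^j) = max(i − j, 0) and watch its rank grow with n.

```python
import numpy as np

def f0(i, j):
    # Rectified counting evaluated on a^i b^j (Definition 5).
    return max(i - j, 0)

for n in [2, 4, 8, 16]:
    A = np.array([[f0(i, j) for j in range(n + 1)] for i in range(n + 1)])
    print(n, np.linalg.matrix_rank(A))
# The rank keeps growing with n, so the full Hankel matrix of f0 has
# unbounded rank and, by Theorem 1, no WFA computes f0.
```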

Theorem 2. The s-LSTM is not RR.

Figure 3: A 1-CM computing f_0 for x ∈ {a^i b^j | i, j ∈ N}: a single state q_0 with self-loops a/+1, (b, ≠0)/−1, and (b, =0)/+0. Let σ/±m denote a transition that consumes σ and updates the counter by ±m. We write σ, =0/±m (or ≠0) for a transition that requires the counter to be 0 (or nonzero).

Proof. Assume the input has the form a^i b^j for some i, j. Consider the following LSTM, in which f_t and o_t are set to 1 so that c_t = c_{t−1} + i_t ⊙ c̃_t:

EQUATION (25): Not extracted; please refer to original document.

Let N → ∞. Then i_t = 0 iff x_t = b and h_{t−1} = 0 (i.e., c_{t−1} = 0). Meanwhile, c̃_t = 1 iff x_t = a. The update term becomes

i_t ⊙ c̃_t = +1 if x_t = a;  −1 if x_t = b and c_{t−1} ≠ 0;  +0 otherwise.   (26)

For a string a^i b^j, the update in (26) is equivalent to the CM in Figure 3. Thus, by Lemma 1, the s-LSTM (and the general CM) is not RR.
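The saturated update (26) can be simulated directly. The sketch below (illustrative; it hard-codes the counter semantics rather than the explicit LSTM weights) confirms that it implements the CM of Figure 3 on strings of the form a^i b^j.

```python
def s_lstm_counter(x):
    """Saturated-LSTM cell value under the update of eq. (26) / Figure 3."""
    c = 0
    for sym in x:
        if sym == "a":
            c += 1                    # i_t = 1, c~_t = +1
        elif c != 0:
            c -= 1                    # x_t = b and c_{t-1} != 0
        # else: x_t = b and c_{t-1} == 0, so the update is +0
    return c

assert s_lstm_counter("aaabb") == 1
assert s_lstm_counter("aabbbb") == 0  # the counter never goes negative
```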

3.2 Rational Counting

While the counter awareness of a general CM enables it to compute non-rational functions, CMs that cannot view their counters are RR.

Theorem 3. Any Σ-restricted CM is RR.

Proof. We show that any function that a Σ-restricted CM can compute can also be computed by a collection of WFAs. The CM update operations (−1, +0, +1, or ×0) can all be re-expressed in terms of functions r, u : Σ → Z^k to get:

c_t = r(x_t) ⊙ c_{t−1} + u(x_t)   (27)

[c_t]_i = Σ_{j=1}^{t} u_i(x_j) ∏_{l=j+1}^{t} r_i(x_l).   (28)

A WFA computing [c_t]_i is shown in Figure 4.

Figure 4: WFA simulating unit i of a Σ-restricted CM. Let ∀σ/w(σ) denote a set of transitions consuming each token σ with weight w(σ). We use standard DFA notation to show initial weights λ(q0) = 1, λ(q1) = 0 and accepting weights ρ(q0) = 0, ρ(q1) = 1.
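Concretely, the two-state WFA of Figure 4 can be written down and checked against a direct simulation of a Σ-restricted counter; the token-wise update function u_i and decay function r_i below are an illustrative choice, not taken from the paper.

```python
import numpy as np

# Illustrative Sigma-restricted counter: "a" adds 1, "b" subtracts 1, "r" resets (x0).
u = {"a": 1.0, "b": -1.0, "r": 0.0}      # additive update u_i(sigma)
r = {"a": 1.0, "b": 1.0, "r": 0.0}       # decay r_i(sigma): 0 iff the op is x0

def wfa_counter(x):
    # Figure 4: q0 self-loop /1, q0 -> q1 / u_i(sigma), q1 self-loop / r_i(sigma).
    v = np.array([1.0, 0.0])             # initial weights (lambda)
    for sym in x:
        T = np.array([[1.0, u[sym]],
                      [0.0, r[sym]]])
        v = v @ T
    return v @ np.array([0.0, 1.0])      # accepting weights (rho)

def direct_counter(x):
    c = 0.0
    for sym in x:
        c = 0.0 if sym == "r" else c + u[sym]
    return c

for s in ["aab", "abrab", "aaarba"]:
    assert wfa_counter(s) == direct_counter(s)
```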

The WFA in Figure 4 also underlies unigram rational RNNs (Peng et al., 2018) . Thus, Σ-restricted CMs are actually a special case of unigram WFAs. In Appendix A, we show the more general result:

Theorem 4. Any (Σ × Q)-restricted CM is RR.

In many rational RNNs, the updates at different time steps are independent of each other outside of a window of w tokens. Theorem 4 tells us this independence is not an essential property of rational encoders. Rather, any CM where the update is conditioned by finite state (as opposed to being conditioned by a local window) is in fact RR.

Furthermore, since Σ^w-restricted CMs are a special case of (Σ × Q)-restricted CMs, Theorem 4 can be directly applied to show that the s-QRNN is RR. See Appendix A for further discussion of this.

3.3 Finite-Space RR

Theorem 4 motivates us to also think about finite-space encoders, i.e., encoders with no counters, where the output at each prefix is fully determined by a finite amount of memory. The following lemma implies that any finite-space encoder is RR:

Lemma 2. Any function f : Σ * → Q computable by a Θ(1)-space encoder is a rational series.

Proof. Since f is computable in Θ(1) space, there exists a DFA A f whose accepting states are isomorphic to the range of f . We convert A f to a WFA by labelling each accepting state by the value of f that it corresponds to. We set the starting weight of the initial state to 1, and 0 for every other state. We assign each transition weight 1.

Since the CNN, s-RNN, and s-GRU have finite state, we obtain the following result:

Theorem 5. The CNN, s-RNN, and s-GRU are RR.

While Peng et al. (2018) showed the CNN to be RR over the max-plus semiring, Theorem 5 shows the same holds for ⟨Q, ·, +⟩.

3.4 RR-Completeness

While "rational recurrence" is often used to indicate the simplicity of an RNN architecture, we find in this section that WFAs are surprisingly computationally powerful. Figure 5 shows a WFA mapping binary string to their numeric value, proving WFAs have Θ(n) space. We now show that none of our RNNs are able to simulate an arbitrary WFA, even in the unsaturated form.

Figure 5: A WFA mapping binary strings to their numeric value: state q_0 (initial weight 1) has a self-loop ∀σ/1 and a transition ∀σ/σ to state q_1 (accepting weight 1), which has a self-loop ∀σ/2. This can be extended for any base > 2. Cortes and Mohri (2000) present a similar construction. Notation is the same as Figure 4.
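A quick sketch (illustrative) of the Figure 5 construction: the weight-2 self-loop on q_1 doubles the accumulated value at every step, so the final encoding is the binary value of the string.

```python
import numpy as np

def binary_value_wfa(bits):
    """WFA of Figure 5: maps a binary string to its numeric value, e.g. '101' -> 5."""
    v = np.array([1.0, 0.0])                 # lambda(q0) = 1, lambda(q1) = 0
    for b in bits:
        T = np.array([[1.0, float(b)],       # q0 self-loop /1, q0 -> q1 / sigma
                      [0.0, 2.0]])           # q1 self-loop /2
        v = v @ T
    return v @ np.array([0.0, 1.0])          # rho(q0) = 0, rho(q1) = 1

assert binary_value_wfa("101") == 5.0
assert binary_value_wfa("1101") == 13.0
```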

Theorem 6. Both the saturated and unsaturated RNN, GRU, QRNN, and LSTM 9 are not RR-hard.

Proof. Consider the function f b mapping binary strings to their value, e.g. 101 → 5. The WFA in Figure 5 shows that this function is rational.

The value of f b grows exponentially with the sequence length. On the other hand, the value of the RNN and GRU cell is bounded by 1, and QRNN and LSTM cells can only grow linearly in time. Therefore, these encoders cannot compute f b .

In contrast, memory networks can have Θ(n) space. Appendix G explores this for stack RNNs.

3.5 Towards Transformers

Appendix F presents preliminary results extending saturation analysis to self attention. We show saturated self attention is not RR and consider its space complexity. We hope further work will more completely characterize saturated self attention.

4 Language Expressiveness

Having explored the set of functions expressible internally by different saturated RNN encoders, we turn to the languages recognizable when using them with a decoder. We consider the following setup:

1. An s-RNN encodes x to a vector h_t ∈ Q^k.
2. A decoder function maps the last state h_t to an accept/reject decision, respectively: {1, 0}.

We say that a language L is decided by an encoder-decoder pair e, d if d(e(x)) = 1 for every sequence x ∈ L and otherwise d(e(x)) = 0. We explore which languages can be decided by different encoder-decoder pairings.

Some related results can be found in Cortes and Mohri (2000) , who study the expressive power of WFAs in relation to CFGs under a slightly different definition of language recognition.

4.1 Linear Decoders

Let d_1 be the single-layer linear decoder

d_1(h_t) = 1_{>0}(w · h_t + b)   (29)

parameterized by w and b. For an encoder architecture E, we denote by D_1(E) the set of languages decidable by E with d_1. We use D_2(E) analogously for a 2-layer decoder with 1_{>0} activations, where the first layer has arbitrary width.

4.2 A Decoder Adds Power

We refer to sets of strings using regular expressions, e.g. a* = {a^i | i ∈ N}. To illustrate the purpose of the decoder, consider the following language:

L_≤ = { x ∈ {a, b}* | #_{a−b}(x) ≤ 0 }.   (30)

The Hankel sub-block of the indicator function for L_≤ over P = a*, S = b* is lower triangular. Therefore, no RR encoder can compute it. However, adding the D_1 decoder allows us to compute this indicator function with an s-QRNN, which is RR. We set the s-QRNN layer to compute the simple series c_t = #_{a−b}(x) (by increasing on a and decreasing on b). The D_1 layer then checks c_t ≤ 0. So, while the indicator function for L_≤ is not itself rational, it can be easily recovered from a rational representation. Thus, L_≤ ∈ D_1(s-QRNN).
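The L_≤ construction can be spelled out in a few lines (illustrative sketch): the rational part maintains c_t = #_{a−b}(x), and the D_1 decoder is a single thresholded affine map whose weights below are one possible choice.

```python
def encoder_count(x):
    """Rational (unigram) part: c_t = #_{a-b}(x)."""
    return sum(+1 if s == "a" else -1 for s in x)

def d1(c, w=-1.0, b=0.5):
    """Single-layer decoder 1_{>0}(w*c + b); with these weights it accepts iff c <= 0."""
    return int(w * c + b > 0)

assert d1(encoder_count("abb")) == 1   # #a - #b = -1 <= 0, in L_<=
assert d1(encoder_count("aab")) == 0   # #a - #b = +1 > 0, not in L_<=
assert d1(encoder_count("")) == 1      # empty string: 0 <= 0
```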

4.3 Case Study: a^n b^n

We compare the language expressiveness of several rational and non-rational RNNs on the following:

a^n b^n = { a^n b^n | n ∈ N }   (32)

a^n b^n is more interesting than L_≤ because the D_1 decoder cannot decide it simply by asking the encoder to track #_{a−b}(x), as that would require it to compute the non-linearly separable zero-check (=0) function.

Thus, it appears at first that deciding a^n b^n with D_1 might require a non-rational RNN encoder. However, we show below that this is not the case. Let • denote stacking two layers. We will go on to discuss the following results:

a^n b^n ∈ D_1(WFA)   (33)
a^n b^n ∈ D_1(s-LSTM)   (34)
a^n b^n ∉ D_1(s-QRNN)   (35)
a^n b^n ∈ D_1(s-QRNN • s-QRNN)   (36)
a^n b^n ∈ D_2(s-QRNN)   (37)
a^n b^n Σ* ∈ D_1(s-LSTM)   (38)
a^n b^n Σ* ∉ D(s-QRNN) for any D   (39)
a^n b^n Σ* ∪ {ε} ∈ D_1(s-QRNN • s-QRNN)   (40)

WFAs (Appendix B) In Theorem 8 we present a function f : Σ* → Q satisfying f(x) > 0 iff x ∈ a^n b^n, and show that H_f has finite rank. It follows that there exists a WFA that can decide a^n b^n with the D_1 decoder. Counterintuitively, a^n b^n can be recognized using rational encoders.

QRNNs (Appendix C) Although a^n b^n ∈ D_1(WFA), it does not follow that every rationally recurrent model can also decide a^n b^n with the help of D_1. Indeed, in Theorem 9, we prove that a^n b^n ∉ D_1(s-QRNN), whereas a^n b^n ∈ D_1(s-LSTM) (Theorem 13).

It is important to note that, with a more complex decoder, the QRNN could recognize a^n b^n. For example, the s-QRNN can encode c_1 = #_{a−b}(x) and set c_2 to check whether x contains ba, from which a D_2 decoder can recognize a^n b^n (Theorem 10).

This does not mean the hierarchy dissolves as the decoder is strengthened. We show that a^n b^n Σ*, which seems like a trivial extension of a^n b^n, is not recognizable by the s-QRNN with any decoder.

This result may appear counterintuitive, but in fact highlights the s-QRNN's lack of counter awareness: it can only passively encode the information needed by the decoder to recognize a^n b^n. Failing to recognize that a valid prefix has been matched, it cannot act to preserve that information after additional input tokens are seen. We present a proof in Theorem 11. In contrast, in Theorem 14 we show that the s-LSTM can directly encode an indicator for a^n b^n Σ* in its internal state.

Proof sketch: a^n b^n Σ* ∉ D(s-QRNN). A sequence s_1 ∈ a^n b^n Σ* is shuffled to create s_2 ∉ a^n b^n Σ* with an identical multi-set of counter updates. 10 Counter updates would be order-agnostic if not for reset operations, and resets mask all history, so extending s_1 and s_2 with a single suffix s containing all of their w-grams reaches the same final state. Then for any D, D(s-QRNN) cannot separate them. We formalize this in Theorem 11.

We refer to this technique as the suffix attack, and note that it can be used to prove for multiple other languages L ∈ D_2(s-QRNN) that L·Σ* is not in D(s-QRNN) for any decoder D.

2-layer QRNNs Adding another layer overcomes the weakness of the 1-layer s-QRNN, at least for deciding a^n b^n. This follows from the fact that a^n b^n ∈ D_2(s-QRNN): the second QRNN layer can be used as a linear layer.

Similarly, we show in Theorem 12 that a 2-layer s-QRNN can recognize a^n b^n Σ* ∪ {ε}. This suggests that adding a second s-QRNN layer compensates for some of the weakness of the 1-layer s-QRNN, which, by the same argument as for a^n b^n Σ*, cannot recognize a^n b^n Σ* ∪ {ε} with any decoder.

4.4 Arbitrary Decoder

Finally, we study the theoretical case where the decoder is an arbitrary recursively enumerable (RE) function. We view this as a loose upper bound of stacking many layers after a rational encoder. What information is inherently lost by using a rational encoder? WFAs can uniquely encode each input, making them Turing-complete under this setup; however, this does not hold for rational s-RNNs.

RR-complete

Assuming an RR-complete encoder, a WFA like Figure 5 can be used to encode each possible input sequence over Σ to a unique number. We then use the decoder as an oracle to decide any RE language. Thus, an RR-complete encoder with an RE decoder is Turing-complete.

Bounded space However, the Θ(log n) space bound of saturated rational RNNs like the s-QRNN means these models cannot fully encode the input. In other words, some information about the prefix x_{:t} must be lost in c_t. Thus, rational s-RNNs are not Turing-complete with an RE decoder.

5 Experiments

In Subsection 4.3, we showed that different saturated RNNs vary in their ability to recognize a^n b^n and a^n b^n Σ*. We now test empirically whether these predictions carry over to the learnable capacity of unsaturated RNNs. 11 We compare the QRNN and LSTM when coupled with a linear decoder D_1. We also train a 2-layer QRNN ("QRNN2") and a 1-layer QRNN with a D_2 decoder ("QRNN+").

Figure 6: Accuracy recognizing L_5 and a^n b^n Σ*. "QRNN+" is a QRNN with a 2-layer decoder, and "2QRNN" is a 2-layer QRNN with a 1-layer decoder.

We train on strings of length 64, and evaluate generalization on longer strings. We also compare to a baseline that always predicts the majority class. The results are shown in Figure 6. We provide further experimental details in Appendix E.

Experiment 1 We use the following language, which has similar formal properties to a^n b^n, but with a more balanced label distribution:

L_5 = { x ∈ (a|b)* : |#_{a−b}(x)| < 5 }.   (41)
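A sketch of the labeling scheme for L_5 (strings sampled uniformly over {a, b} and labeled at every prefix, as described in Appendix E; the helper names here are illustrative):

```python
import random

def in_L5(x):
    """Membership in L_5: |#_a(x) - #_b(x)| < 5."""
    diff = x.count("a") - x.count("b")
    return abs(diff) < 5

def sample(length=64, rng=random.Random(0)):
    x = "".join(rng.choice("ab") for _ in range(length))
    # The models are trained to classify every prefix of x (Appendix E).
    labels = [int(in_L5(x[: t + 1])) for t in range(length)]
    return x, labels

x, labels = sample()
print(x[:16], labels[:16])
```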

In line with (34), the LSTM decides L_5 perfectly for n ≤ 64, and generalizes fairly well to longer strings. As predicted in (35), the QRNN cannot fully learn L_5 even for n = 64. Finally, as predicted in (36) and (37), the 2-layer QRNN and the QRNN with D_2 do learn L_5. However, we see that they do not generalize as well as the LSTM for longer strings. We hypothesize that these multi-layer models require more epochs to reach the same generalization performance as the LSTM. 12

Experiment 2 We also consider a^n b^n Σ*. As predicted in (38) and (40), the LSTM and 2-layer QRNN decide a^n b^n Σ* flawlessly for n = 64. A 1-layer QRNN performs at the majority baseline for all n with both a 1 and 2-layer decoder. Both of these failures were predicted in (39). Thus, the only models that learned a^n b^n Σ* were exactly those predicted by the saturated theory.

6 Conclusion

We develop a hierarchy of saturated RNN encoders, considering two angles: space complexity and rational recurrence. Based on the hierarchy, we formally distinguish the state expressiveness of the non-rational s-LSTM and its rational counterpart, the s-QRNN. We show further distinctions in state expressiveness based on encoder space complexity.

Moreover, the hierarchy translates to differences in language recognition capabilities. Strengthening the decoder alleviates some, but not all, of these differences. We present two languages, both recognizable by an LSTM. We show that one can be recognized by an s-QRNN only with the help of a decoder, and that the other cannot be recognized by an s-QRNN with the help of any decoder.

While this means existing rational RNNs are fundamentally limited compared to LSTMs, we find that it is not necessarily being rationally recurrent that limits them: in fact, we prove that a WFA can perfectly encode its input, something no saturated RNN can do. We conclude with an analysis showing that an RNN architecture's strength must also take into account its space complexity. These results further our understanding of the inner workings of NLP systems. We hope they will guide the development of more expressive rational RNNs.

A Rational Counting

We extend the result in Theorem 3 as follows.

Theorem 7. Any (Σ × Q)-restricted CM is rationally recurrent.

Proof. We present an algorithm to construct a WFA computing an arbitrary counter in a (Σ × Q)-restricted CM. First, we create two independent copies of the transition graph for the restricted CM. We refer to one copy of the CM graph as the add graph, and the other as the multiply graph.

The initial state in the add graph receives a starting weight of 1, and every other state receives a starting weight of 0. Each state in the add graph receives an accepting weight of 0, and each state in the multiply graph receives an accepting weight of 1. In the add graph, each transition receives a weight of 1. In the multiply graph, each transition receives a weight of 0 if it represents ×0, and 1 otherwise. Finally, for each non-multiplicative update σ/+m 13 from q_i to q_j in the original CM, we add a WFA transition σ/m from q_i in the add graph to q_j in the multiply graph.

Each counter update creates one path ending in the multiply graph. The path score is set to 0 if that counter update is "erased" by a ×0 operation. Thus, the sum of all the path scores in the WFA equals the value of the counter. This construction can be extended to accommodate :=m counter updates from q_i to q_j by adding an additional transition from the initial state to q_j in the multiplication graph with weight m. This allows us to apply it directly to s-QRNNs, whose update operations include :=1 and :=−1.

B WFAs

We show that while WFAs cannot directly encode an indicator for the language a^n b^n = {a^n b^n | n ∈ N}, they can encode a function that can be thresholded to recognize a^n b^n, i.e.:

Theorem 8. The language a^n b^n = {a^n b^n | n ∈ N} over Σ = {a, b} is in D_1(WFA).

We prove this by showing a function whose Hankel matrix has finite rank that, when combined with the identity transformation (i.e., w = 1, b = 0) followed by thresholding, is an indicator for a^n b^n. Using the shorthand σ(x) = #_σ(x), the function is:

f(x) = 0.5 − 2(a(x) − b(x))^2 if x ∈ a*b*, and −0.5 otherwise.   (42)

Immediately, f satisfies 1_{>0}(f(x)) ⟺ x ∈ a^n b^n. To prove that its Hankel matrix, H_f, has finite rank, we will create 3 infinite matrices of ranks 3, 3 and 1, which sum to H_f. The majority of the proof will focus on the rank-3 matrices, which have similar compositions. We now define three series r, s, t, and a set of series that they can be combined to create. These series will be used to create the base vectors for the rank-3 matrices.

For every i > 2,

r_i = (i^2 − 3i + 2)/2,   s_i = 1 − (i − 1)^2,   t_i = (i − 1)i/2,   (47)

where for every j ≤ 2,

r_j = 1_{j=0},   s_j = 1_{j=1},   t_j = 1_{j=2}.   (48)

Lemma 3. Let c_i = 1 − 2i^2 and let {c^(k)}_{k∈N} be the set of series defined by c^(k)_i = c_{|i−k|}. Then for every i, k ∈ N,

c^(k)_i = c^(k)_0 r_i + c^(k)_1 s_i + c^(k)_2 t_i.

Proof. For i ∈ {0, 1, 2}, r_i, s_i and t_i collapse to a 'select' operation, giving the true statement c^(k)_i = c^(k)_i · 1.

We now consider the case i > 2. Substituting the series definitions in the right side of the equation gives

(1 − 2k^2) r_i + (1 − 2(k−1)^2) s_i + (1 − 2(k−2)^2) t_i,   (49)

which can be expanded to

(1 − 2k^2) · (i^2 − 3i + 2)/2 + (1 − 2(k−1)^2) · (1 − (i−1)^2) + (1 − 2(k−2)^2) · (i−1)i/2.

Reordering the first component and partially opening the other two gives

(−2k^2 + 1)(i^2 − 3i + 2)/2 + (−2k^2 + 4k − 1)(2i − i^2) + (−k^2 + 4k − 3.5)(i^2 − i),

and a further expansion gives

−k^2 i^2 + 0.5 i^2 + 3k^2 i − 1.5 i − 2k^2 + 1
+ 2k^2 i^2 − 4k i^2 + i^2 − 4k^2 i + 8k i − 2i
− k^2 i^2 + 4k i^2 − 3.5 i^2 + k^2 i − 4k i + 3.5 i,

which reduces to −2i^2 + 4ki − 2k^2 + 1 = 1 − 2(k − i)^2 = c^(k)_i.

We restate this as:

Corollary 1. For every k ∈ N, the series c (k) is a linear combination of the series r, s and t.

We can now show that f is computable by a WFA, proving Theorem 8. By Theorem 1, it is sufficient to show that H f has finite rank.

Lemma 4. H f has finite rank.

Proof. For every P, S ⊆ {a, b}*, denote

[H_f|_{P,S}]_{u,v} = [H_f]_{u,v} if u ∈ P and v ∈ S, and 0 otherwise.

Using regular expressions to describe P and S, we create the 3 finite-rank matrices which sum to H_f:

A = (H_f + 0.5)|_{a*, a*b*}   (50)
B = (H_f + 0.5)|_{a*b+, b*}   (51)
C = the constant matrix −0.5   (52)

Intuitively, these may be seen as a "split" of H_f into sections as in Figure 7, such that A and B together cover the sections of H_f on which u•v does not contain the substring ba (and are equal on them to H_f + 0.5), and C is simply the constant matrix −0.5. Immediately, H_f = A + B + C, and rank(C) = 1.

Figure 7: Intuition of the supports of A, B and C.

We now consider A. Denote P_A = a*, S_A = a*b*. A is non-zero only on indices u ∈ P_A, v ∈ S_A, and for these, u•v ∈ a*b* and A_{u,v} = 0.5 + f(u•v) = 1 − 2(a(u) + a(v) − b(v))^2. This gives that for every u ∈ P_A, v ∈ S_A,

A_{u,v} = c_{|a(u)−(b(v)−a(v))|} = c^{(a(u))}_{b(v)−a(v)}.   (53)

For each τ ∈ {r, s, t}, define τ̄ ∈ Q^{{a,b}*} as

τ̄_v = 1_{v ∈ a*b*} · τ_{b(v)−a(v)}.   (54)

We get from Corollary 1 that for every u ∈ a*, the u-th row of A is a linear combination of r̄, s̄, and t̄. The remaining rows of A are all 0 and so also a linear combination of these, and so rank(A) ≤ 3.

Similarly, we find that the nonzero entries of B satisfy

B_{u,v} = c_{|b(v)−(a(u)−b(u))|} = c^{(b(v))}_{a(u)−b(u)},   (55)

and so, for τ ∈ {r, s, t}, the columns of B are linear combinations of the columns τ̃ ∈ Q^{{a,b}*} defined

τ̃_u = 1_{u ∈ a*b+} · τ_{a(u)−b(u)}.   (56)

Thus we conclude rank(B) ≤ 3.

Finally, H_f = A + B + C, and so by the sub-additivity of rank in matrices,

rank(H_f) ≤ rank(A) + rank(B) + rank(C) ≤ 7.   (57)

In addition, the rank of H̃_f ∈ Q^{{a,b}^{≤2} × {a,b}^{≤2}} defined by [H̃_f]_{u,v} = [H_f]_{u,v} is 7, and so we can conclude that the bound in the proof is tight, i.e., rank(H_f) = 7. From here, H̃_f is a complete sub-block of H_f and can be used to explicitly construct a WFA for f, using the spectral method described by Balle et al. (2014).

C s-QRNNs

Theorem 9. No s-QRNN with a linear threshold decoder can recognize a^n b^n = {a^n b^n | n ∈ N}, i.e., a^n b^n ∉ D_1(s-QRNN).

Proof. An ifo s-QRNN can be expressed as a Σ^k-restricted CM with the additional update operations {:=−1, :=1}, where k is the window size of the QRNN. So it is sufficient to show that such a machine, when coupled with the decoder D_1 (linear translation followed by thresholding), cannot recognize a^n b^n. Let A be some such CM, with window size k and h counters. Take n = k + 10 and for every m ∈ N denote w_m = a^n b^m and the counter values of A after w_m as c^m ∈ Q^h. Denote by u_t the vector of counter update operations made by this machine on input sequence w_m at time t ≤ n + m. As A is dependent only on the last k tokens, necessarily all u_{k+i} are identical for every i ≥ 1.

It follows that for all counters in the machine that go through an assignment (i.e., :=) operation in u_{k+1}, their values in c^{k+i} are identical for every i ≥ 1, and for every other counter j, c^{k+i}_j − c^k_j = i · δ for some δ ∈ Z. Formally: for every i ≥ 1 there are two sets I, J = [h] \ I and constant vectors u ∈ N^I, v ∈ N^J s.t. c^{k+i}|_I = u and [c^{k+i} − c^k]|_J = i · v.

We now consider the linear thresholder, defined by weights w and bias b. In order to recognize a^n b^n, the thresholder must satisfy:

w · c^{k+9} + b < 0   (58)
w · c^{k+10} + b > 0   (59)
w · c^{k+11} + b < 0   (60)

Opening these equations gives:

w|_J · c^k|_J + 9 w|_J · v|_J + w|_I · u + b < 0   (61)
w|_J · c^k|_J + 10 w|_J · v|_J + w|_I · u + b > 0   (62)
w|_J · c^k|_J + 11 w|_J · v|_J + w|_I · u + b < 0   (63)

but this gives 9 w|_J · v|_J < 10 w|_J · v|_J > 11 w|_J · v|_J, which is impossible.

However, this does not mean that the s-QRNN is entirely incapable of recognizing a^n b^n. Increasing the decoder power allows it to recognize a^n b^n quite simply:

Theorem 10. For the two-layer decoder D_2, a^n b^n ∈ D_2(s-QRNN).

Proof. Let #_{ba}(x) denote the number of ba 2-grams in x. We use an s-QRNN with window size 2 to maintain two counters:

[c_t]_1 = #_{a−b}(x_{:t})   (64)
[c_t]_2 = #_{ba}(x_{:t})   (65)

[c_t]_2 can be computed provided the QRNN window size is ≥ 2. A two-layer decoder can then check

([c_t]_1 = 0) ∧ ([c_t]_2 = 0).   (66)
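The two-counter construction of Theorem 10 can be sketched directly; since the exact decoder in (66) is not reproduced from the paper, the two-layer threshold check below is one possible realization.

```python
def counters(x):
    """c1 = #_{a-b}(x); c2 = number of 'ba' 2-grams (window size 2)."""
    c1 = sum(+1 if s == "a" else -1 for s in x)
    c2 = sum(1 for i in range(1, len(x)) if x[i - 1] == "b" and x[i] == "a")
    return c1, c2

def d2(c1, c2):
    """Two-layer decoder with 1_{>0} activations: accepts iff c1 == 0 and c2 == 0."""
    h = [int(c1 > 0), int(-c1 > 0), int(c2 > 0)]   # first layer
    return int(0.5 - sum(h) > 0)                   # second layer: all three must be 0

assert d2(*counters("aaabbb")) == 1
assert d2(*counters("aabbb")) == 0   # unbalanced counts
assert d2(*counters("abab")) == 0    # contains the 2-gram 'ba'
```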

Theorem 11 (Suffix attack). No s-QRNN and decoder can recognize the language a^n b^n Σ* = a^n b^n (a|b)*, n > 0, i.e., a^n b^n Σ* ∉ D(s-QRNN) for any decoder D.

The proof will rely on the s-QRNN's inability to "freeze" a computed value, protecting it from manipulation by future input.

Proof. As in the proof of Theorem 9, it is sufficient to show that no Σ^k-restricted CM with the additional operations {:=−1, :=1} can recognize a^n b^n Σ* with any decoder.

Let A be some such CM, with window size k and h counters. For every w ∈ Σ* denote by c(w) ∈ Q^h the counter values of A after processing w. Denote by u_t the vector of counter update operations made by this machine on an input sequence w at time t ≤ |w|. Recall that A is Σ^k-restricted, meaning that u_i depends exactly on the window of the last k tokens for every i.

We now denote j = k + 10 and consider the sequences w_1 = a^j b^j a^j b^j a^j b^j and w_2 = a^j b^{j−1} a^j b^{j+1} a^j b^j. w_2 is obtained from w_1 by removing the 2j-th token of w_1 and reinserting it at position 4j.

As all of w_1 is composed of blocks of ≥ k identical tokens, the windows preceding all of the other tokens in w_1 are unaffected by the removal of the 2j-th token. Similarly, being added onto the end of a substring b^k, its insertion does not affect the windows of the tokens after it, nor is its own window different from before. This means that, overall, the set of all operations u_i performed on the counters is identical in w_1 and in w_2. The only difference is in their ordering.

w_1 and w_2 begin with a shared prefix a^k, and so necessarily the counters are identical after processing it. We now consider the updates to the counters after these first k tokens; these are determined by the windows of k tokens preceding each update.

First, consider all the counters that undergo some assignment (:=) operation during these sequences, and denote by {w} the multiset of windows in w ∈ Σ^k for which they are reset. w_1 and w_2 only contain k-windows of types a^x b^{k−x} or b^x a^{k−x}, and so these must all re-appear in the shared suffix b^j a^j b^j of w_1 and w_2, at which point they will be synchronized. It follows that these counters all finish with identical values in c(w_1) and c(w_2).

All the other counters are only updated using addition of −1, 1 and 0, and so the order of the updates is inconsequential. It follows that they too are identical in c(w_1) and c(w_2), and therefore necessarily that c(w_1) = c(w_2).

From this we have w_1, w_2 satisfying w_1 ∈ a^n b^n Σ* and w_2 ∉ a^n b^n Σ*, but also c(w_1) = c(w_2). Therefore, it is not possible to distinguish between w_1 and w_2 with the help of any decoder, and so the CM (and hence the s-QRNN) cannot recognize a^n b^n Σ* with any decoder.

For the opposite extension Σ* a^n b^n, in which the language is augmented by a prefix, we cannot use such a "suffix attack". In fact, Σ* a^n b^n can be recognized by an s-QRNN with window length w ≥ 2 and a linear threshold decoder as follows: a counter counts #_{a−b}(x) and is reset to 1 on appearances of ba, and the decoder compares it to 0.

Note that we define decoders as functions from the final state to the output. Thus, adding an additional QRNN layer does not count as a "decoder" (as it reads multiple states). In fact, we show that having two QRNN layers allows recognizing a^n b^n Σ*.

Theorem 12. Let ε be the empty string. Then a^n b^n Σ* ∪ {ε} ∈ D_1(s-QRNN • s-QRNN).

Proof. We construct a two-layer s-QRNN from which a^n b^n Σ* can be recognized. Let $ denote the left edge of the string. The first layer computes two quantities d_t and e_t as follows:

EQUATION (68): Not extracted; please refer to original document.

Note that e t can be interpreted as a binary value checking whether the first token was b. The second layer computes c t as a function of d t , e t , and x t (which can be passed through the first layer). We will demonstrate a construction for c t by creating linearly separable functions for the gate terms f t and z t that update c t .

EQUATION (70): Not extracted; please refer to original document.

Now, the update function u_t to c_t can be expressed as

u_t = f_t z_t = { +0 if 0 < d_t;  +1 if d_t ≤ 0 ∧ (x_t = a ∨ e_t);  −1 otherwise. }   (71)

Finally, the decoder accepts iff c_t ≤ 0. To justify this, we consider two cases: either x starts with b or with a. If x starts with b, then e_t = 0, so we increment c_t by 1 and never decrement it. Since 0 < c_t for any t, we will reject x. If x starts with a, then we accept iff there exists a sequence of bs following the prefix of as such that both sequences have the same length.

D s-LSTMs

In contrast to the s-QRNN, we show that the s-LSTM paired with a simple linear threshold decoder can recognize both a^n b^n and a^n b^n Σ*.

Theorem 13. a^n b^n ∈ D_1(s-LSTM).

Proof. Assuming a string a^i b^j, we set two units of the LSTM state to compute the following functions using the CM in Figure 3:

EQUATION (73): Not extracted; please refer to original document.

We also add a third unit [c_t]_3 that tracks whether the 2-gram ba has been encountered, which is equivalent to verifying that the string has the form a^i b^j. Allowing h_t = tanh(c_t), we set the linear threshold layer to check

[h_t]_1 + [h_t]_2 + [h_t]_3 ≤ 0.   (74)

Theorem 14. a^n b^n Σ* ∈ D_1(s-LSTM).

Proof. We use the same construction as Theorem 13, augmenting it with

[c_t]_4 ≜ [h_{t−1}]_1 + [h_{t−1}]_2 + [h_{t−1}]_3 ≤ 0.   (75)

We decide x according to the (still linearly separable) equation

0 < [h_t]_4 ∨ ([h_t]_1 + [h_t]_2 + [h_t]_3 ≤ 0).   (76)

E Experimental Details

Models were trained on strings up to length 64 and, at each index t, were asked to classify whether or not the prefix up to t was a valid string in the language. Models were then tested on independent datasets of lengths 64, 128, 256, 512, 1024, and 2048. The training dataset contained 100,000 strings, and the validation and test datasets contained 10,000. We discuss task-specific schemes for sampling strings in the next paragraph. All models were trained for a maximum of 100 epochs, with early stopping after 10 epochs based on the validation cross entropy loss. We used default hyperparameters provided by the open-source AllenNLP framework (Gardner et al., 2018). The code is available at https://github.com/viking-sudo-rm/rr-experiments.

Sampling strings For the language L_5, each token was sampled uniformly at random from Σ = {a, b}. For a^n b^n Σ*, half the strings were sampled in this way, and for the other half, we sampled n uniformly between 0 and 32, fixing the first 2n characters of the string to a^n b^n and sampling the suffix uniformly at random.

Experimental cost Experiments were run for 20 GPU hours on a Quadro RTX 8000.

F Self Attention

Architecture We place saturated self attention (Vaswani et al., 2017) into the state expressiveness hierarchy. We consider a single-head self attention encoder that is computed as follows:

1. At time t, compute queries q t , keys k t , and values v t from the input embedding x t using a linear transformation.

2. Compute attention head h t by attending over the keys and values up to time t (K :t and V :t ) with query q t .

3. Let • L denote a layer normalization operation (Ba et al., 2016) .

EQUATION (78): Not extracted; please refer to original document.

This simplified architecture has only one attention head, and does not incorporate residual connections. It is also masked (i.e., at time t, can only see the prefix X :t ), which enables direct comparison with unidirectional RNNs. For simplicity, we do not add positional information to the input embeddings.
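A small sketch of the saturated, masked, single-head attention described above (illustrative; layer normalization and the subsequent feedforward layers are omitted): saturation turns soft attention into an average over the positions with maximal key-query score.

```python
import numpy as np

def saturated_attention(Q, K, V):
    """Masked single-head saturated attention: at time t, average the values
    at positions j <= t whose key-query score k_j . q_t is maximal."""
    n = Q.shape[0]
    H = np.zeros_like(V)
    for t in range(n):
        scores = K[: t + 1] @ Q[t]             # only the prefix is visible
        mask = scores == scores.max()          # hardmax (saturated softmax)
        H[t] = V[: t + 1][mask].mean(axis=0)
    return H

# With constant keys and queries, every score ties and h_t is the running mean
# of the inputs, as in the Theorem 15 construction below.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # one-hot a, b, a (illustrative)
ones = np.ones((3, 1))
H = saturated_attention(ones, ones, X)
print(H)   # row t equals the mean of X[:t+1]
```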

Theorem 15. Saturated masked self attention is not RR.

Proof. Let #_σ(x) denote the number of occurrences of σ ∈ Σ in string x. We construct a self attention layer to compute the following function over {a, b}*:

f(x) = 0 if #_a(x) = #_b(x), and 1 otherwise.   (79)

Since the Hankel sub-block over P = a*, S = b* has infinite rank, f ∉ R.

Fix v t = x t .

As shown by Merrill (2019), saturated attention over a prefix of input vectors X_{:t} reduces to an average over the subsequence for which key-query similarity is maximized, i.e., denoting M_t = argmax_{j≤t} k_j · q_t,

h_t = (1/|M_t|) Σ_{j∈M_t} v_j.   (80)

For all t, set the key and query k_t, q_t = 1. Thus, all the key-query similarities are 1, and we obtain:

h_t = (1/t) Σ_{t'=1}^{t} x_{t'}   (81)

= (1/t) ( #_a(x_{:t}), #_b(x_{:t}) ).   (82)

Applying layer norm to this quantity preserves equality of the first and second elements. Thus, we set the layer in (77) to independently check 0 < [h^0_t]_1 − [h^0_t]_2 and [h^0_t]_1 − [h^0_t]_2 < 0 using ReLU. The final layer c_t sums these two quantities, returning 0 if neither condition is met, and 1 otherwise.

Since saturated self attention can represent f ∉ R, it is not RR.

G Stack RNNs

The stack RNN has Θ(n) space. Additionally, we find that it is not RR. This places it in the upper-right box of Figure 1.

Classically, a stack is a dynamic list of objects to which elements v ∈ V can be added and removed in a LIFO manner (using push and pop operations). The stack RNN proposed in Suzgun et al. (2019b) maintains a differentiable variant of such a stack, as follows:

Differentiable Stack In a differentiable stack, the update operation takes an element s t to push and a distribution π t over the update operations push, pop, and no-op, and returns the weighted average of the result of applying each to the current stack. The averaging is done elementwise along the stacks, beginning from the top entry. To facilitate this, differentiable stacks are padded with infinite 'null entries'. Their elements must also have a weighted average operation defined.

Definition 6 (Geometric k-stack RNN encoder). Initialize the stack S to an infinite list of null entries, and denote by S_t the stack value at time t. Using 1-indexing for the stack and denoting [S_{t−1}]_0 ≜ s_t, the geometric k-stack RNN recurrent update is: 15

s_t = f_s(x_t, c_{t−1})
π_t = f_π(x_t, c_{t−1})
∀i ≥ 1:  [S_t]_i = Σ_{a=1}^{3} [π_t]_a [S_{t−1}]_{i+a−2}.

In this work we consider the case where the null entries are 0 and the encoding c_t is produced as a geometric-weighted sum of the stack contents,

c_t = Σ_{i=1}^{∞} (1/2)^{i−1} [S_t]_i.

This encoding gives preference to the latest values in the stack, giving initial stack encoding c 0 = 0.
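A sketch of the geometric stack update and encoding (illustrative; the controller functions f_s and f_π are replaced by a hard-coded policy that always pushes the input, as in the proof of Lemma 5 below):

```python
def stack_rnn_encode(bits, depth=32):
    """Geometric stack RNN (Definition 6) with a controller that always pushes x_t.
    Returns c = sum_i (1/2)^(i-1) * S[i] over a truncated stack."""
    S = [0.0] * depth                        # truncated stack of 'null' (zero) entries
    for x in bits:
        push, noop, pop = 1.0, 0.0, 0.0      # hard-coded controller distribution pi_t
        prev = [float(x)] + S + [0.0]        # prev[0] is the pushed element s_t
        S = [push * prev[i] + noop * prev[i + 1] + pop * prev[i + 2]
             for i in range(depth)]
    return sum(S[i] / 2 ** i for i in range(depth))

assert stack_rnn_encode([1, 0, 1]) == 1 * 1 + 0 * 0.5 + 1 * 0.25   # 1.25 (Lemma 5)
```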

Space Complexity The memory introduced by the stack data structure pushes the encoder into Θ(n) space. We formalize this by showing that, like a WFA, the stack RNN can encode binary strings to their value.

Lemma 5. The saturated stack RNN can compute the converging binary encoding function, i.e., 101 → 1 • 1 + 0.5 • 0 + 0.25 • 1 = 1.25.

Proof. Choose k = 1. Fix the controller to always push x t . Then, the encoding at time t will be

EQUATION (84): Not extracted; please refer to original document.

This is the value of the prefix x :t in binary.

Rational Recurrence We provide another construction to show that the stack RNN can compute non-rational series. Thus, it is not RR.

Definition 7 (Geometric counting). Define f 2 : {a, b} * → N such that

f 2 (x) = exp 1 2 # a−b (x) − 1.

Like similar functions we analyzed in Section 3, the Hankel matrix H_{f_2} has infinite rank over the sub-block indexed by a^i b^j.

Lemma 6. The saturated stack RNN can compute f 2 .

Proof. Choose k = 1. Fix the controller to push 1 for x t = a, and pop otherwise.

Originally referred to as the asymptotic RNN.

Space complexity measures the number of different configurations an RNN can reach as a function of input length. Formal definition deferred until Section 2.

This is common, but not the only possibility. For example, an attention decoder observes the full sequence of states.

WFAs are often defined over a generic semiring; we consider only the special case when it is the field of rational reals.

The states q ∈ Σ

I.e., the minimal state representation needed to compute M θ correctly. This distinction is important for architectures like attention, for which some implementations may retain unusable information such as input embedding order.

With respect to our presented definition of RNNs, the concatenation of ht and ct can be seen as the recurrently updated state. However in all discussions of LSTMs we treat only ht as the LSTM's 'state', in line with common practice.

As well as CMs.

Since QRNN counter updates depend only on the w-grams present in the sequence.

As shown by the baseline, generalization is challenging because positive labels become less likely as strings get longer.

Note that m = −1 for the −1 counter update.

Note that any periodic positional encoding will also have finite image.

Intuitively, [πt]a corresponds to the operations push, noop, and pop, for the values a = 1, 2, 3 respectively.