RNN Architecture Learning with Sparse Regularization

Jesse Dodge
Roy Schwartz
Hao Peng
Noah A. Smith
EMNLP/IJCNLP
2019
View in Semantic Scholar

Abstract

Neural models for NLP typically use large numbers of parameters to reach state-of-the-art performance, which can lead to excessive memory usage and increased runtime. We present a structure learning method for learning sparse, parameter-efficient NLP models. Our method applies group lasso to rational RNNs (Peng et al., 2018), a family of models that is closely connected to weighted finite-state automata (WFSAs). We take advantage of rational RNNs’ natural grouping of the weights, so the group lasso penalty directly removes WFSA states, substantially reducing the number of parameters in the model. Our experiments on a number of sentiment analysis datasets, using both GloVe and BERT embeddings, show that our approach learns neural structures which have fewer parameters without sacrificing performance relative to parameter-rich baselines. Our method also highlights the interpretable properties of rational RNNs. We show that sparsifying such models makes them easier to visualize, and we present models that rely exclusively on as few as three WFSAs after pruning more than 90% of the weights. We publicly release our code.

1 Introduction

State-of-the-art neural models for NLP are heavily parameterized, requiring hundreds of millions (Devlin et al., 2019) and even billions (Radford et al., 2019) of parameters. While overparameterized models can sometimes be easier to train (Livni et al., 2014) , they may also introduce memory problems on small devices and lead to increased carbon emission (Strubell et al., 2019; Schwartz et al., 2019) .

In feature-based NLP, structured-sparse regularization, in particular the group lasso (Yuan and 1 https://github.com/dodgejesse/ sparsifying_regularizers_for_RRNNs

w (1) 1 w (1) 2 w (1) 3ŵ

(1) 1ŵ

(1) 2ŵ

(1) 3

EQUATION (0): Not extracted; please refer to original document.

Base structure Learned structure Figure 1 : Our approach learns a sparse structure (right hand side) of a base rational RNN (left hand side) where each hidden unit corresponds to a WFSA (in this example, three hidden units, represented by the three rows). Grayed-out, dashed states are removed from the model, while retained states are marked in bold green. Lin, 2006) , has been proposed as a method to reduce model size while preserving performance (Martins et al., 2011) . But, in neural NLP, some of the most widely used models-LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014) -do not have an obvious, intuitive notion of "structure" in their parameters (other than, perhaps, division into layers), so the use of structured sparsity at first may appear incongruous.

Figure 1: Our approach learns a sparse structure (right hand side) of a base rational RNN (left hand side) where each hidden unit corresponds to a WFSA (in this example, three hidden units, represented by the three rows). Grayed-out, dashed states are removed from the model, while retained states are marked in bold green.

In this paper we show that group lasso can be successfully applied to neural NLP models. We focus on a family of neural models for which the hidden state exhibits a natural structure: rational RNNs (Peng et al., 2018) . In a rational RNN, the value of each hidden dimension is the score of a weighted finite-state automaton (WFSA) on (a prefix of) the input vector sequence. This property offers a natural grouping of the transition function parameters for each WFSA. As shown by and Peng et al. (2018) , a variety of state-of-the-art neural architectures are rational (Lei et al., 2017; Bradbury et al., 2017; Foerster et al., 2017, inter alia) , so learning parameterefficient rational RNNs is of practical value. We also take advantage of the natural interpretation of rational RNNs as "soft" patterns .

We apply a group lasso penalty to the WFSA parameters of rational RNNs during training, where each group is comprised of the parameters associated with one state in one WFSA ( Fig. 1; §2 ). This penalty pushes the parameters in some groups to zero, effectively eliminating them, and making the WFSA smaller. When all of the states for a given WFSA are eliminated, the WFSA is removed entirely, so this approach can be viewed as learning the number of WFSAs (i.e., the RNN hidden dimension) as well as their size. We then retain the sparse structure, which results in a much smaller model in terms of parameters.

We experiment with four text classification benchmarks ( §3), using both GloVe and BERT embeddings. As we increase the regularization strength, we end up with smaller models. These models have a better tradeoff between the number of parameters and model performance compared to setting the number of WFSAs and their lengths by hand or using hyperparameter search. In almost all cases, our approach results in models with fewer parameters and similar or better performance compared to our baselines.

In contrast to neural architecture search (Jozefowicz et al., 2015; Zoph and Le, 2017), which can take several GPU years to learn an appropriate neural architecture, our approach requires only two training runs: one to learn the structure, and the other to estimate its parameters. Other approaches either ignore the structure of the model and only look at the value of individual weights (Liu et al., 2019; LeCun et al., 1990; Frankle and Carbin, 2019) or only use highlevel structures like the number of layers of the network (Wen et al., 2016; Scardapane et al., 2017; Gordon et al., 2018) .

Finally, our approach touches on another appealing property of rational RNNs-their interpretability. Each WFSA captures a "soft" version of patterns like "such a great X", and can be visualized as such . By retaining a small number of WFSAs, model structures learned using our method can be visualized succinctly. In §4 we show that some of our sentiment analysis models rely exclusively on as few as three WFSAs. 2 2 That is, a rational RNN with hidden size 3.

x 7 ! 1

x 7 ! u (1) (x) x 7 ! u (2) (x) x 7 ! u (3) (x) x 7 ! u (4) (x) q 0 x 7 ! f (1) (x) q 1 x 7 ! f (2) (x) q 2 x 7 ! f (3) (x) q 3 x 7 ! f (4) (x) q 4

Figure 2: A 4-gram WFSA, from which we derive the rational RNN ( §2). The rational RNN's hidden states correspond to a set of WFSAs, each separately parameterized. We apply group lasso to each WFSA.

2 Method

We describe the proposed method. At a high level, we follow the standard practice for using 1 regularization for sparsification (Wen et al., 2016) :

1. Fit a model on the training data, with the group lasso regularizer added to the loss during training (the parameters associated with one state comprise one group). 2. After convergence, eliminate the states whose parameters are zero. 3. Finetune the resulting, smaller model, by minimizing the unregularized loss with respect to its parameters. In this work, we assume a single layer rational RNN, but our approach is equally applicable to multi-layer models. For clarity of the discussion, we start with a one-dimensional rational RNN (i.e., one based on a single WFSA only). We then generalize to the d-dimensional case (computing the scores of d WFSAs in parallel).

Rational recurrent networks Following Peng et al. (2018), we parameterize the transition functions of WFSAs with neural networks, such that each transition (main path or self loop) defines a weighted function over the input word vector. We consider a 5-state WFSA, diagrammed in Fig. 2 .

A path starts at q 0 ; at least four tokens must be consumed to reach q 4 , and in this sense it captures 4-gram "soft" patterns (Peng et al., 2018; . In addition to q 4 , we also designate q 1 , q 2 , and q 3 as final states, allowing for the interpolation between patterns of different lengths. 3 The self-loop transitions over q 1 , q 2 , q 3 , and q 4 aim to allow, but downweight, nonconsecutive patterns, as the self-loop transition functions yield values between 0 and 1 (using a sigmoid function). The recurrent function is equivalent to applying the Forward dynamic programming algorithm (Baum and Petrie, 1966) .

Promoting sparsity with group lasso We aim to learn a sparse rational model with fewer WFSA states. This can be achieved by penalizing the parameters associated with a given state, specifically the parameters associated with entering that state, by a transition from another state or a self-loop on that state. For example, the parameters of the WFSA in Fig. 2 (excluding the word embedding parameters) are assigned to four nonoverlapping groups, one for each non-starting state.

During training, the regularization term will push all parameters toward zero, and some will converge close to zero. 4 After convergence, we remove groups for which the 2 norm falls below

. 5 The resulting smaller model is then finetuned by continuing training without regularizing. With our linear-structured WFSA, zeroing out the group associated with a state in the middle effectively makes later states inaccessible. While our approach offers no guarantee to remove states from the end first (thus leaving no unreachable states), in our experiments it always did so.

d-dimensional Case To construct a rational RNN with d WFSAs (a d-dimensional model),

we stack d one-dimensional models, each of them separately parameterized. The parameters of a d-dimensional rational model derived from the WFSA in Fig. 2 are organized into 4d groups, four for each dimension. Since there is no direct interaction between different dimensions (e.g., through an affine transformation), group lasso sparsifies each dimension/WFSA independently. Hence the resulting rational RNN can consist of WFSAs of different sizes, the number of which could be smaller than d if any of the WFSAs have all states eliminated.

One can treat the numbers and sizes of WF-SAs as hyperparameters (Oncina et al., 1993; Ron et al., 1994; de la Higuera, 2010; . By eliminating states from WFSAs with group lasso, we learn the WFSA structure while fitting the models' parameters, reducing the number of training cycles by reducing the number of tunable hyperparameters.

3 Experiments

We run sentiment analysis experiments. We train the rational RNN models ( §2) with group lasso regularization, using increasingly large regularization strengths, resulting in increasingly compact models. As the goal of our experiments is to demonstrate the ability of our approach to reduce the number of parameters, we only consider rational baselines: the same rational RNNs trained without group lasso. 6 We manually tune the number and sizes of the baselines WFSAs, and then compare the tradeoff curve between model size and accuracy. We describe our experiments below. For more details, see Appendix A.

Data We experiment with the Amazon reviews binary sentiment classification dataset (Blitzer et al., 2007) , composed of 22 product categories. We examine the standard dataset (original mix) comprised of a mixture of data from the different categories (Johnson and Zhang, 2015) . 7 We also examine three of the largest individual categories as separate datasets (kitchen, dvd, and books), following Johnson and Zhang (2015) . The three category datasets do not overlap with each other (though they do with original mix), and are significantly different in size (see Appendix A), so we can see how our approach behaves with different amounts of training data.

Implementation Details

To classify text, we concatenate the scores computed by each WFSA, then feed this d-dimensional vector of scores into a linear binary classifier. We use log loss. We experiment with both type-level word embeddings (GloVe.6B.300d; Pennington et al., 2014) and contextual embeddings (BERT large; Devlin et al., 2019). 8 In both cases, we keep the embeddings fixed, so the vast majority of the learnable parameters are in the WFSAs.

We train models using GloVe embeddings on all datasets. Due to memory constraints we evaluate BERT embeddings (frozen, not fine-tuned) only on the smallest dataset (kitchen).

Baselines As baselines, we train five versions of each rational architecture without group lasso, using the same number of WFSAs as our regularized models (24 for GloVe, 12 for BERT). Four of the baselines each use the same number of transitions for all WFSAs (1, 2, 3, and 4, corresponding to 2-5 states, and to 24, 48, 72, and 96 total transitions). The fifth baseline has an equal mix of all lengths (6 WFSAs of each size for GloVe, leading to 60 total transitions, and 3 WFSAs of each size for BERT, leading to 30 total transitions).

Each transition in our model is independently parameterized, so the total number of transitions linearly controls the number of learnable parameters (in addition to the parameters in the embedding layer).

Results Fig. 3 shows our classification test accuracy as a function of the total number of WFSA transitions in the model. We first notice that, as expected, the performance of our unregularized baselines improves as models are trained with more transitions (i.e., more parameters).

Figure 3: Text classification with GloVe embeddings (four leftmost graphs) and BERT (rightmost): accuracy (yaxis) vs. number of WFSA transitions (x-axis). Higher and to the left is better. Our method (dashed orange line, varying regularization strength) provides a better tradeoff than the baseline (solid blue line, directly varying the number of transitions). Vertical lines encode one standard deviation for accuracy, while horizontal lines encode one standard deviation in the number of transitions (applicable only to our method).

Compared to the baselines, training with group lasso provides a better tradeoff between performance and number of transitions. In particular, our heavily regularized models perform substantially better than the unigram baselines, gaining between 1-2% absolute improvements in four out of five cases. As our regularization strength decreases, we naturally gain less compared to our baselines, although still similar or better than the best baselines in four out of five cases.

4 Visualization

Using our method with a high regularization strength, the resulting sparse structures often contain only a handful of WFSAs. In such cases, building on the interpretability of individual WF-SAs, we are able visualize every hidden unit, i.e., the entire model. To visualize a single WFSA B, we follow and compute the score of B on every phrase in the training corpus. We then select the top and bottom scoring phrases for B, 9 and get a prototype-like description of the pattern representing B. 10 9 As each WFSA score is used as a feature that is fed to a linear classifier, negative scores are also meaningful.

10 While the WFSA scores are the sum of all paths deriving a document (plus-times semiring), here we search for the max (or min) scoring one. Despite the mismatch, a WFSA scores Each column represents one main-path transition, plus potential self-loops preceding it (marked like this SL ). ". . . SL " marks more than 2 self loops. "" marks an end-of-document token. Table 1 visualizes a sparse rational RNN trained on original mix with only three WFSAs, (8 mainpath transitions in total). 11 The table shows that looking at the top scores of each WFSA, two of the patterns respectively capture the phrases "not worth X " and "miserable/returned X ''. Pattern 3 is not as coherent, but most examples do contain sentiment-bearing words such as bad, horrible, or best. This might be the result of the tuning process of the sparse rational structure simply learning a collection of words, rather than coherant phrases. As a result, this WFSA is treated as a unigram pattern rather than a trigram. The lowest scoring phrases show a similar trend. Appendix B shows the same visualization for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings.

Table 1: Visualization of a sparse rational RNN trained on original mix containing only 3 WFSAs. For each WFSA (i.e., pattern), we show the 4 top and bottom scoring phrases in the training corpus with this WFSA. Each column represents one main-path transition, plus potential self-loops preceding it (marked like thisSL). “. . . SL” marks more than 2 self loops. “” marks an end-of-document token.

We observe another interesting trend: two of the three patterns prefer expressions that appear near the end of the document. This could result from the nature of the datasets (e.g., many reviews end every possible path, and thus the max/min scoring path selection is still valid. As our examples show, many of these extracted paths are meaningful.

11 The test performance of this model is 88%, 0.6% absolute below the average of the five models reported in Fig. 3 . 24 48 72 96 Number of Transitions 83 84 85 86 87 88 Classification Accuracy kitchen 24 48 72 96 86 87 88 89 90 dvd 24 48 72 96 87 88 89 90 books 24 48 72 96 88 89 90 91 original_mix 12 24 36 48 Number of Transitions 90 91 92 93 Classification Accuracy kitchen BERT Figure 3 : Text classification with GloVe embeddings (four leftmost graphs) and BERT (rightmost): accuracy (yaxis) vs. number of WFSA transitions (x-axis). Higher and to the left is better. Our method (dashed orange line, varying regularization strength) provides a better tradeoff than the baseline (solid blue line, directly varying the number of transitions). Vertical lines encode one standard deviation for accuracy, while horizontal lines encode one standard deviation in the number of transitions (applicable only to our method).

with a summary, containing important sentiment information), and/or our rational models' recency preference. More specifically, the first self loop has weight 1, and hence the model is not penalized for taking self loops before the match; in contrast, the weights of the last self loop take values in (0, 1) due to the sigmoid, forcing a penalty for earlier phrase matches. 12

5 Conclusion

We presented a method for learning parameterefficient RNNs.

Our method applies group lasso regularization on rational RNNs, which are strongly connected to weighted finite-state automata, and thus amenable to learning with structured sparsity. Our experiments on four text classification datasets, using both GloVe and BERT embeddings, show that our sparse models provide a better performance/model size tradeoff. We hope our method will facilitate the development of "thin" NLP models, that are faster, consume less memory, and are interpretable (Schwartz et al., 2019) .

A Experiment Details

Dataset statistics Table 2 shows the sizes of the datasets experimented with.

Training

Dev. Test kitchen 3,298 822 4,118 dvd 14,066 3,514 17,578 books 20,000 5,000 25,000 original mix 20,000 5,000 25,000 Preprocessing As preprocessing for the data for each individual category, we tokenize using the NLTK word tokenizer. We removed reviews with text shorter than 5 tokens. We binarize the review score using the standard procedure, assigning 1-and 2-star reviews as negative, and 4-and 5-star reviews as positive (discarding 3-star reviews). Then, if there were more than 25,000 negative reviews, we downsample to 25,000 (otherwise we keep them all), and then downsample the positive reviews to be the same number as negative, to have a balanced dataset. We match the train, development, and test set proportions of 4:1:5 from the original mixture.

We generate the BERT embeddings using the sum of the last four hidden layers of the large uncased BERT model, so our embedding size is 1024. Summing the last four layers was the best performing approach in the ablation of Devlin et al. (2019) that had fewer than 4096 embedding size (which was too large to fit in memory). We embed each sentence individually (there can be multiple sentences within one example).

Implementation details For GloVe, we train rational models with 24 5-state WFSAs, each corresponding to a 4-gram soft-pattern (Fig. 2) . For BERT, we train models with 12 WFSAs. 13 Experiments For each model (regularized or baseline), we run random search to select our hyperparameters (evaluating 20 uniformly sampled hyperparameter configurations). For the hyperparameter configuration that leads to the best development result, we train the model again 5 times with different random seeds, and report the mean and standard deviation of the models' test performance.

Parameters The models are trained with Adam (Kingma and Ba, 2015) . During training with group lasso we turn off the learning rate schedule (so the learning rate stays fixed), similarly to Gordon et al. (2018) . This leads to improved stability in the learned structure for a given hyperparameter assignment.

Table 3: Visualization of a sparse rational RNN containing 4 WFSAs only, trained on kitchen using BERT.

Following Peng et al. (2018) we sample 20 hyperparameters uniformly, for which we train and evaluate our models. Hyperparameter ranges are presented in Table 4 . For the BERT experiments, we reduced both the upper and lower bound on the learning rate by two orders of magnitude.

Table 4: Hyperparameter ranges considered in our experiments.

Regularization strength search We searched for model structures that were regularized down to close to 20, 40, 60, or 80 transitions (10, 20, 30, and 40 for BERT experiments) . For a particular goal size, we uniformly sample 20 hyperparameter assignments from the ranges in Table 4 , then sorted the samples by increasing learning rate. For each hyperparameter assignment, we trained a model with the current regularization strength. If the resulting learned structure was too large (small), we doubled (halved) the regularization strength, repeating until we were within 10 transitions of our goal (5 for BERT experiments). 14 Finally, we finetuned the appropriately-sized learned structure by continuing training without the regularizer, and computed the result on the development set. For the best model on the development set, we retrained (first with the regularizer to learn a structure, then finetuned) five times, and plot the mean and variance of the test accuracy and learned structure size. Table 3 shows the same visualization shown in §4 for another sparse rational RNN containing only four WFSAs and 11 main-path transitions, trained with BERT embeddings on kitchen. It also shows a few clear patterns (e.g., Patt. 2). Interpretation here is more challenging though, as contextual embeddings make every token embedding depend 14 If the regularization strength became larger than 10 2 or smaller than 10 −9 , we threw out the hyperparameter assignment and resampled (this happened when, e.g., the learning rate was too small for any of the weights to actually make it to zero). on the entire context. 15 A particular example of this is the excessive use of the start token ([CLS]), whose contextual embedding has been shown to capture the sentiment information at the sentence level (Devlin et al., 2019) .

B Visualization

Regularization strength recommendation If a practitioner wishes to learn a single small model, we recommend they start with λ such that the loss L(w) and the regularization term are equal at initialization (before training). We found that having equal contribution led to eliminating approximately half of the states, though this varies with data set size, learning rate, and gradient clipping, among other variables.

15 Indeed, contextual embeddings raise problems for interpretation methods that work by targeting individual words, e.g., attention (Bahdanau et al., 2015) , as these embeddings also depend on other words. Interpretation methods for contextual embeddings are an exciting direction for future work.

We found this to be more stable than using only q4.

There are optimization methods that achieve "strong" sparsity(Parikh and Boyd, 2013), where some parameters are exactly set to zero during training. Recent work has shown these approaches can converge in nonconvex settings(Reddi et al., 2016), but our experiments found them to be unstable.5 We use 0.1. This threshold was lightly tuned in preliminary experiments on the validation set and found to reliably remove those parameters which converged around zero without removing others.

Rational RNNs have shown strong performance on the dataset we experiment with: a 2-layer rational model with between 100-300 hidden units obtained 92.7% classification accuracy, substantially outperforming an LSTM baseline(Peng et al., 2018). The results of our models, which are single-layered and capped at 24 hidden units, are not directly comparable to these baselines, but are still within two points of the best result from that paper.7 http://riejohnson.com/cnn_data.html 8 https://github.com/huggingface/ pytorch-pretrained-BERT

Changing this behavior could be easily done by fixing the final self-loop to 1 as well.

The BERT embedding dimension is significantly larger than GloVe (1024 compared to 300), so we used a smaller number of WFSAs. As our results show, the BERT models still substantially outperform the GloVe ones.