## Abstract

We present PaLM, a hybrid parser and neural language model. Building on an RNN language model, PaLM adds an attention layer over text spans in the left context. An unsupervised constituency parser can be derived from its attention weights, using a greedy decoding algorithm. We evaluate PaLM on language modeling, and empirically show that it outperforms strong baselines. If syntactic annotations are available, the attention component can be trained in a supervised manner, providing syntactically-informed representations of the context, and further improving language modeling performance.

## 1 Introduction

Recent language models have shown very strong data-fitting performance (Jozefowicz et al., 2016; Merity et al., 2018) . They offer useful products including, most notably, contextual embeddings (Peters et al., 2018; Radford et al., 2019) , which benefit many NLP tasks such as text classification (Howard and Ruder, 2018) and dataset creation (Zellers et al., 2018) .

Language models are typically trained on large amounts of raw text, and therefore do not explicitly encode any notion of structural information. Structures in the form of syntactic trees have been shown to benefit both classical NLP models (Gildea and Palmer, 2002; Punyakanok et al., 2008; Das et al., 2012, inter alia) and recent state-of-the-art neural models Swayamdipta et al., 2018; Peng et al., 2018b; Strubell et al., 2018, inter alia) . In this paper we show that LMs can benefit from syntacticallyinspired encoding of the context.

We introduce PaLM (parser and language model; Fig. 1 Figure 1 : An illustration of PaLM. The LM (first line) predicts the next word (x t+1 , double blue arrow) by attending over previous spans ending in time t − 1 (dashed lines). The parser (lines 2-4) splits the prefix into two spans (line 2) by the taking the top scoring attended span (red solid line) and the prefix leading to it. It then recursively splits the two sub-spans using the same procedure (line 3). Finally, spans of length two are trivially split into terminal nodes (line 4).

tokens, implicitly learning which syntactic constituents are likely. A span-based parser is then derived from the attention information (Stern et al., 2017) . PaLM has several benefits. First, it is an intuitive and lightweight way of incorporating structural information ( §2.1), requiring no marginal inference, which can be computationally expensive (Jelinek and Lafferty, 1991; Chelba and Jelinek, 1998; Roark, 2001; Buys and Blunsom, 2018; Kim et al., 2019, inter alia) . Second, the attention can be syntactically informed, in the sense that the attention component can optionally be supervised using syntactic annotations, either through pretraining or by joint training with the LM ( §2.2). Last, PaLM can derive an unsupervised constituency parser ( §2.2), whose parameters are estimated purely using the language modeling objective.

To demonstrate the empirical benefits of PaLM, we experiment with language modeling ( §3). PaLM outperforms the AWD-LSTM model (Merity et al., 2018) Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets by small but consistent margins in the unsupervised setup. When the parser is trained jointly with the language model, we see additional perplexity reductions in both cases. Our implementation is available at https: //github.com/Noahs-ARK/PaLM.

## 2 Palm-Parser And Language Model

We describe PaLM in detail. At its core is an attention component, gathering the representations of preceding spans at each time step. Similar to self-attention, PaLM can be implemented on top of RNN encoders (Parikh et al., 2016) , or as it is (Vaswani et al., 2017) . Here we encode the tokens using a left-to-right RNN, denoted with vectors h t . 1 Below we describe the span-attention component and the parsing algorithm. We use [i, j], i ≤ j to denote text span x i . . . x j , i.e., inclusive on both sides. When i = j, it consists of a single token.

## 2.1 Span Attention

We want the language model attention to gather context information aware of syntactic structures. A constituency parse can be seen as a collection of syntactic constituents, i.e., token spans. Therefore we attend over preceding spans. 2 At step t, PaLM attends over the spans ending at t − 1, up to a maximum length m, i.e.,

{[i, t − 1]} t−1 i=t−m . 3

Essentially, this can be seen as splitting the prefix span [t − m, t − 1] into two, and attending over the one on the right. Such a span attention mechanism is inspired by the top-down greedy span parser of Stern et al. (2017) , which recursively divides phrases. In §2.2, we will use a similar algorithm to derive a constituency parser from the span attention weights.

Bidirectional span representation with rational RNNs. Meaningful span representations are crucial in span-based tasks (Lee et al., 2017; Peng et al., 2018c; Swayamdipta et al., 2018, inter 1 We experiment with a strong LSTM implementation for language modeling (Merity et al., 2018) , see §3.

2 Standard token-based self-attention naturally relates to dependency structures through head selection (Strubell et al., 2018) . In a left-to-right factored language model, dependencies are less natural if we want to allow a child to precede its parent.

3 m is set to 20. This reduces the number of considered spans from O(n 2 ) to O(mn). Besides practical concerns, it makes less sense if a phrase goes beyond one single sentence (the average sentence length of WSJ training sentences is 21). alia). Typical design choices are based on start and end token vectors contextualized by bidirectional RNNs. However, a language model does not have access to future words, and hence running a backward RNN from right to left is less straightforward: one will have to start an RNN running at each token, which is computationally daunting (Kong et al., 2016) . To compute span representations efficiently, we use rational RNNs (RRNNs; Peng et al., 2018a) .

RRNNs are a family of RNN models, where the recurrent function can be computed with weighted finite-state automata (WFSAs). We use the unigram WFSA-inspired RRNN (Peng et al., 2018a) , where the cell state update is

`EQUATION (1a): Not extracted; please refer to original document.`

`EQUATION (1b): Not extracted; please refer to original document.`

c t = f t c t−1 + u t . (1c)

f t is a forget gate implemented with the elementwise sigmoid function σ, and denotes elementwise multiplication. W u and W f are learned matrices. Bias terms are suppressed for clarity. 4 Slightly overloading the notation, let − → c i,j denote the encoding of span [i, j] by running a forward RRNN in Eq. 1, from left to right. It can be efficiently computed by subtracting − → c i−1 from − → c j , weighted by a product of forget gates:

`EQUATION (2): Not extracted; please refer to original document.`

− → f k vectors are the forget gates. See §B for a detailed derivation. Using this observation, we now derive an efficient algorithm to calculate the span representations based on bidirectional RRNNs. In the interest of space, Alg. 1 describes the forward span representations. It takes advantage of the distributivity property of rational RNNs (Peng et al., 2018a) , and the number of RNN function calls is linear in the input length. 5 Although overall asymptotic time complexity is still quadratic, Alg. 1 only involves elementwise operations, which can be eas-

Algorithm 1 RRNN-based span representation. 6 1: procedure SPANREPR({ − → c t, − → f t}) 2:

Accumulate forward forget gates 3:

for i = 1, . . . , n do 4: − → f 1,i = − → f 1,i−1 − → f i 5:

end for 6:

for j = 1, . . . , n do 7:

for i = 1, . . . , j do 8: − → c i,j= − → c j − − → c i−1 − → f 1,j / − → f 1,i−1 9:

end for 10: end for 11:

return − → c i,j vectors 12: end procedure ily parallelized on modern GPUs. The backward one (right to left) is analogous.

Computing attention. As in standard attention, we use a normalized weighted sum of the span representations. Let g(

[i, j]) = [ − → c i,j ; ← − c i,j ] denote the representation of span [i, j]

, which concatenates the forward and backward representations calculated using Alg. 1. The context vector a t is

`EQUATION (3b): Not extracted; please refer to original document.`

Here s t,i is implemented as an MLP, taking as input the concatenation of h t+1 and g([t − i, t]) and outputs the attention score. The context vector is then concatenated with the hidden stateh t+1 = [h t+1 ; a t+1 ], and fed into onward computation. In summary, given an input sequence, PaLM: 1. First uses a standard left-to-right RNN to calculate the hidden states h t . 2. Feed h t vectors into a one-layer bidirectional rational RNN (Eq. 1), using Alg. 1 to compute the span representations. 3. Attends over spans (Eq. 3b) to predict the next word.

## 2.2 Attention-Based Constituency Parsing

We next describe the other facet or PaLM: the constituency parser. Our parsing algorithm is similar to the greedy top-down algorithm proposed by Stern et al. (2017) . It recursively divides a span into two smaller ones, until a single-token span, i.e., a leaf, is reached. The order of the partition specifies the tree structure. 7 Formally, for a maximum span length m, at each time step j + 1, we split the span [j − m + 1, j] into two smaller parts [j − m + 1, k 0 ] and [k 0 + 1, j]. The partitioning point is greedily selected, maximizing the attention scores of spans ending at j: 8

`EQUATION (4): Not extracted; please refer to original document.`

The span is directly returned as a leaf if it contains a single token. A full parse is derived by running the algorithm recursively, starting with the input as a single span (with a special end-of-sentence mark at the end). The runtime is O(n 2 ), with n − 1 partitioning points. See Fig. 1 for an illustration.

Supervising the attention. Now that we are able to derive phrase structures from attention weights, we can further inform the attention if syntactic annotations are available, using oracle span selections. For each token, the gold selection is a m-dimensional binary vector, and then normalized to sum to one, denoted y t . 9 We add a crossentropy loss (averaged across the training data) to the language modeling objective, with λ trading off between the two:

`EQUATION (5): Not extracted; please refer to original document.`

with ω being the attention distribution at step t, and N the length of the training corpus. As we will see in §3, providing syntactically guided span attention improves language modeling performance.

Discussion. PaLM provides an intuitive way to inject structural inductive bias into the language model-by supervising the attention distribution. This setting can be seen as a very lightweight multitask learning, where no actual syntactic tree is predicted during language modeling training or evaluation. The attention weight predictor (i.e., the s scores in Eq. 3b) can be replaced with an off-the-shelf parser, or deterministically set (e.g., to simulate left/right-branching).

We evaluate PaLM on language modeling. We experiment with the Penn Treebank corpus (PTB) and WikiText-2 (WT2). We follow the preprocessing of Mikolov et al. (2010) for PTB and Merity et al. (2018) for WT2. More implementation details are described in Appendix A. We compare two configurations of PaLM:

• PaLM-U builds on top of AWD-LSTM (Merity et al., 2018) , a state-of-the-art of LSTM implementation for language modeling. The span attention is included before the last layer. 10 • PaLM-S is the same model as PaLM-U, but uses phrase syntax annotation to provide additional supervision to the attention component ( §2.2). 11 We compare against the AWD-LSTM baseline. On PTB, we also compare to two models using structural information in language modeling: parsing-reading-predict networks (PRPN; Shen et al., 2018a) predicts syntactic distance as structural features for language modeling; orderedneuron LSTM (ON-LSTM; Shen et al., 2018b) posits a novel ordering on LSTM gates, simulating the covering of phrases at different levels in a constituency parse. On PTB we also compare to PaLM-RB, a baseline deterministically setting the attention scores (Eq. 3b) in decreasing order, such that the derived trees will be right-branching. 12 Tables 1 and 2 summarize the language modeling results. On both datasets, the unsupervised configuration (PaLM-U) outperforms AWD-LSTM. On PTB, PaLM-U achieves similar performance to ON-LSTM and much better performance than PRPN. PaLM-S further reduces the perplexity by 1.6-3.4% (relative), showing that incorporating structural information with supervised span attention helps language modeling. Naively promoting right-branching attention (PaLM-RB) yields no improvement over the baseline.

Unsupervised constituency parsing. We evaluate the parser component of PaLM-U on WSJ-40. It uses the same data as in language modeling, but filters out sentences longer than 40 tokens after 10 Preliminary experiments show that including the span attention after the last layer yields similar empirical results, but is more sensitive to hyperparameters. 11 We use the WSJ portion of PTB for parsing annotations. 12 We set scores to m, m − 1, . . . , 1, before the softmax.

13 Several recent works report better language modeling perplexity Takase et al., 2018; Dai et al., 2019, inter alia) . Their contribution is orthogonal to ours and not head-to-head comparable to the models in the punctuation removal. The model is selected based on language modeling validation perplexity. In addition to PRPN, we compare to DIORA (Drozdov et al., 2019) , which uses an inside-outside dynamic program in an autoencoder. Table 3 shows the F 1 results. PaLM outperforms the right branching baseline, but is not as accurate as the other models. 14 This indicates that the type of syntactic trees learned by it, albeit useful to the LM component, do not correspond well to PTB-like syntactic trees.

Discussion. Despite its strong performance, the parsing algorithm used by Shen et al. (2018a) and Shen et al. (2018b) suffers from an incomplete support issue. 15 More precisely, it fails to produce "close-open-open," i.e., )(( structures. As a result, the parser is intrinsically biased toward rightbranching structures. PaLM, on the other hand, scores all the spans, and therefore can produce any binary tree spanning a given sentence: the algorithm recovers any given binary tree by letting s j,j−i = 1 if the tree contains nonterminal [i, j] , and 0 otherwise. 16 Is PaLM empirically biased toward any branching direction? In greedily selected trees, we measure the percentage of left-branching splits (divid- Table 4 summarizes the results on WSJ-40 test set. The first row shows the results for randomly initialized models without training. We observe no significant trend of favoring one branching direction over the other. However, after training with the language modeling objective, PaLM-U shows a clear right-skewness more than it should: it produces much more right-branching structures than the gold annotation. This means that the span attention mechanism has learned to emphasize longer prefixes, rather than make strong Markov assumptions. More exploration of this effect is left to future work.

## 4 Conclusion

We present PaLM, a hybrid parser and language model. PaLM attends over the preceding text spans. From its attention weights phrase structures can be derived. The attention component can be separately trained to provide syntacticallyinformed context gathering. PaLM outperforms strong baselines on language modeling. Incorporating syntactic supervision during training leads to further language modeling improvements. Training our unsupervised model on large-scale corpora could result in both stronger language models and, potentially, stronger parsers. Our code is publicly available at https:// github.com/Noahs-ARK/PaLM.

Unlike other RNNs such as LSTM(Hochreiter and Schmidhuber, 1997) or GRU(Cho et al., 2014), RRNNs do not apply an affine transformation or a nonlinear dependency of ct on ct−1.5 In contrast, the dynamic program ofKong et al. (2016) for segmental (span-scoring) RNNs requires a quadratic number of recurrent function calls, since they use LSTMs, where distributivity does not hold.

/ denotes elementwise division. Both elementwise product and division are implemented in log-space.

It is only able to produce binarized unlabeled trees. 8 Another natural choice is to maximize the sum of the scores of [i, k0] and [k0+1, j]. The attention score of [i, k0] is computed at time step k0, and hence does not know anything about the other span on the right. Therefore we consider only the score of the right span.9 Not necessarily one-hot: multiple spans can end at the same token.

Evaluation on WSJ-10, which contains sentences with 10 or less tokens, shows a similar trend.15 Chris Dyer, personal communication.16 The maximum span length m is only forced in language modeling training and evaluation.

We exclude trivial splits dividing a length-2 span into two tokens.