Deep Semantic Role Labeling: What Works and What's Next
We introduce a new deep learning model for semantic role labeling (SRL) that significantly improves the state of the art, along with detailed analyses to reveal its strengths and limitations. We use a deep highway BiLSTM architecture with constrained decoding, while observing a number of recent best practices for initialization and regularization. Our 8-layer ensemble model achieves 83.2 F1 on the CoNLL 2005 test set and 83.4 F1 on CoNLL 2012, roughly a 10% relative error reduction over the previous state of the art. Extensive empirical analysis of these gains shows that (1) deep models excel at recovering long-distance dependencies but can still make surprisingly obvious errors, and (2) there is still room for syntactic parsers to improve these results.
Semantic role labeling (SRL) systems aim to recover the predicate-argument structure of a sentence, to determine essentially "who did what to whom", "when", and "where." Recent breakthroughs involving end-to-end deep models for SRL without syntactic input (Zhou and Xu, 2015; Marcheggiani et al., 2017) seem to overturn the long-held belief that syntactic parsing is a prerequisite for this task (Punyakanok et al., 2008). In this paper, we show that this result can be pushed further using deep highway bidirectional LSTMs with constrained decoding, again significantly moving the state of the art (another 2 points on CoNLL 2005). We also present a careful empirical analysis to determine what works well and what might be done to progress even further.
Our model combines a number of best practices in the recent deep learning literature. Following Zhou and Xu (2015), we treat SRL as a BIO tagging problem and use deep bidirectional LSTMs. However, we differ by (1) simplifying the input and output layers, (2) introducing highway connections (Srivastava et al., 2015; Zhang et al., 2016), (3) using recurrent dropout (Gal and Ghahramani, 2016), (4) decoding with BIO constraints, and (5) ensembling with a product of experts. Our model gives a 10% relative error reduction over the previous state of the art on the test sets of CoNLL 2005 and 2012. We also report performance with predicted predicates to encourage future exploration of end-to-end SRL systems.
We present detailed error analyses to better understand the performance gains, including (1) design choices on architecture, initialization, and regularization that have a surprisingly large impact on model performance; (2) different types of prediction errors showing, e.g., that deep models excel at predicting long-distance dependencies but still struggle with known challenges such as PP attachment errors and adjunct-argument distinctions; (3) the role of syntax, showing that there is significant room for improvement given oracle syntax but errors from existing automatic parsers prevent effective use in SRL.
In summary, our main contributions include:
• A new state-of-the-art deep network for end-to-end SRL, supported by publicly available code and models.
• An in-depth error analysis indicating where the model works well and where it still struggles, including discussion of structural consistency and long-distance dependencies.
• Experiments that point toward directions for future improvements, including a detailed discussion of how and when syntactic parsers could be used to improve these results.
Two major factors contribute to the success of our deep SRL model: (1) applying recent advances in training deep recurrent neural networks such as highway connections (Srivastava et al., 2015) and RNN-dropouts (Gal and Ghahramani, 2016),² and (2) using an A* decoding algorithm (Lewis and Steedman, 2014; Lee et al., 2016) to enforce structural consistency at prediction time without adding more complexity to the training process.
Formally, our task is to predict a sequence $y$ given a sentence-predicate pair $(w, v)$ as input. Each $y_i \in y$ belongs to a discrete set of BIO tags $T$. Words outside argument spans have the tag O, and words at the beginning and inside of argument spans with role $r$ have the tags $B_r$ and $I_r$ respectively. Let $n = |w| = |y|$ be the length of the sequence.
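As a small illustration of the BIO encoding just described, the sketch below converts labeled argument spans into a tag sequence. The function name `spans_to_bio`, the hyphenated tag strings (`B-ARG0`, `I-ARG0`), and the example sentence are assumptions made for illustration; they are not taken from the released code or data.

```python
from typing import List, Tuple

def spans_to_bio(n: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_inclusive, role) argument spans into n BIO tags."""
    tags = ["O"] * n                      # words outside any argument span
    for start, end, role in spans:
        tags[start] = f"B-{role}"         # first word of the span
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{role}"         # remaining words of the span
    return tags

# Hypothetical sentence; the predicate token is left as O purely for simplicity.
words = ["The", "cat", "chased", "a", "mouse"]
tags = spans_to_bio(len(words), [(0, 1, "ARG0"), (3, 4, "ARG1")])
print(list(zip(words, tags)))
```

Each (sentence, predicate) pair yields one such tag sequence of length n.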
Predicting an SRL structure under our model involves finding the highest-scoring tag sequence over the space of all possibilities Y:
$$\hat{y} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} f(w, y) \tag{1}$$
We use a deep bidirectional LSTM (BiLSTM) to learn a locally decomposed scoring function conditioned on the input: $\sum_{t=1}^{n} \log p(y_t \mid w)$. To incorporate additional information (e.g., structural consistency, syntactic input), we augment the scoring function with penalization terms:
$$f(w, y) = \sum_{t=1}^{n} \log p(y_t \mid w) - \sum_{c \in \mathcal{C}} c(w, y_{1:t}) \tag{2}$$
Each constraint function $c$ applies a non-negative penalty given the input $w$ and a length-$t$ prefix $y_{1:t}$. These constraints can be hard or soft depending on whether the penalties are finite.
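The penalized score can be sketched in a few lines. This is a minimal illustration under assumptions: per-token log-probabilities are given as dictionaries, tags are hyphenated strings such as `B-ARG0`, and the helper names `score` and `bio_transition_penalty` are hypothetical. A hard constraint returns an infinite penalty; a soft one would return a finite value.

```python
import math
from typing import Callable, Dict, List, Sequence

# A constraint maps (words, tag prefix) to a non-negative penalty.
Constraint = Callable[[Sequence[str], Sequence[str]], float]

def bio_transition_penalty(words: Sequence[str], prefix: Sequence[str]) -> float:
    """Hard constraint: infinite penalty if the newest tag is an I- tag
    that does not continue a span with the same role."""
    tag = prefix[-1]
    if not tag.startswith("I-"):
        return 0.0
    prev = prefix[-2] if len(prefix) > 1 else "O"
    return 0.0 if prev[2:] == tag[2:] else math.inf

def score(words: Sequence[str], tags: Sequence[str],
          log_probs: List[Dict[str, float]],
          constraints: Sequence[Constraint]) -> float:
    """Sum of local log-probabilities minus constraint penalties on each prefix."""
    total = 0.0
    for t, tag in enumerate(tags):
        total += log_probs[t][tag]
        total -= sum(c(words, tags[: t + 1]) for c in constraints)
    return total

# Toy distributions for a two-word input (values are made up).
toy_words = ["a", "b"]
toy_log_probs = [
    {"B-ARG0": math.log(0.9), "O": math.log(0.1)},
    {"I-ARG0": math.log(0.8), "I-ARG1": math.log(0.2)},
]
valid = score(toy_words, ["B-ARG0", "I-ARG0"], toy_log_probs, [bio_transition_penalty])
invalid = score(toy_words, ["B-ARG0", "I-ARG1"], toy_log_probs, [bio_transition_penalty])
```

Because each penalty is charged on the prefix ending at position t, the constraint only needs to inspect the newest tag, which avoids counting the same violation twice.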
2.1 Deep BiLSTM Model
Our model computes the distribution over tags using stacked BiLSTMs, which we define as follows:
EQUATION (8): Not extracted; please refer to original document.
where $x_{l,t}$ is the input to the LSTM at layer $l$ and timestep $t$. $\delta_l$ is either 1 or −1, indicating the directionality of the LSTM at layer $l$.
2 We thank Mingxuan Wang for suggesting highway connections with simplified inputs and outputs. Part of our model is extended from his unpublished implementation.
To stack the LSTMs in an interleaving pattern, as proposed by Zhou and Xu (2015), the layer-specific inputs $x_{l,t}$ and directionality $\delta_l$ are arranged in the following manner:
EQUATION (10): Not extracted; please refer to original document.
The input vector $x_{1,t}$ is the concatenation of token $w_t$'s word embedding and an embedding of the binary feature $(t = v)$ indicating whether $w_t$ is the given predicate. Finally, the locally normalized distribution over output tags is computed via a softmax layer:
EQUATION (11): Not extracted; please refer to original document.
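The input construction and the softmax output layer can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the embedding sizes, the random embedding tables, and in particular the direct projection of the inputs to tag scores, which stands in for the top-layer BiLSTM output that the real model would use.

```python
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM, PRED_DIM, NUM_TAGS = 4, 2, 3        # toy sizes, chosen for illustration
word_emb = rng.normal(size=(10, WORD_DIM))    # toy vocabulary of 10 word ids
pred_emb = rng.normal(size=(2, PRED_DIM))     # row 0: not predicate, row 1: predicate

def build_inputs(token_ids, predicate_index):
    """Concatenate each token's word embedding with its predicate-indicator embedding."""
    token_ids = np.asarray(token_ids)
    is_predicate = (np.arange(len(token_ids)) == predicate_index).astype(int)
    return np.concatenate([word_emb[token_ids], pred_emb[is_predicate]], axis=1)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

x1 = build_inputs([3, 7, 1], predicate_index=1)

# Stand-in for the top-layer BiLSTM output: project x1 directly to tag scores.
W_out = rng.normal(size=(WORD_DIM + PRED_DIM, NUM_TAGS))
tag_probs = softmax(x1 @ W_out)               # one distribution over tags per token
```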
Highway Connections To alleviate the vanishing gradient problem when training deep BiLSTMs, we use gated highway connections (Zhang et al., 2016; Srivastava et al., 2015). We include transform gates $r_t$ to control the weight of linear and non-linear transformations between layers (see Figure 1). The output $h_{l,t}$ is changed to:
EQUATION (13): Not extracted; please refer to original document.
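The gated combination given as equation (14) in this document, a gate-weighted mix of the LSTM output and a linear transform of the layer input, can be sketched as follows. The gate values are passed in directly, since how the gate is computed is not recoverable from the text here; the function and argument names are hypothetical.

```python
import numpy as np

def highway_output(r, lstm_out, W_h, x):
    """Mix the non-linear LSTM output with a linear transform of the layer input.

    r        : transform gate values in [0, 1], shape (hidden,)
    lstm_out : LSTM output before the highway connection, shape (hidden,)
    W_h      : linear projection of the input, shape (hidden, input_dim)
    x        : layer input, shape (input_dim,)
    """
    return r * lstm_out + (1.0 - r) * (W_h @ x)

rng = np.random.default_rng(0)
lstm_out = rng.normal(size=5)
x = rng.normal(size=3)
W_h = rng.normal(size=(5, 3))

all_nonlinear = highway_output(np.ones(5), lstm_out, W_h, x)   # gate fully open
all_linear = highway_output(np.zeros(5), lstm_out, W_h, x)     # gate fully closed
```

With the gate at 1 the layer behaves like a plain LSTM layer; with the gate at 0 the input passes through linearly, which gives gradients a short path through deep stacks.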
Recurrent Dropout To reduce over-fitting, we use dropout as described in Gal and Ghahramani (2016). A shared dropout mask $z_l$ is applied to the hidden state:
EQUATION (16): Not extracted; please refer to original document.
$z_l$ is shared across timesteps at layer $l$ to avoid amplifying the dropout noise along the sequence.
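The idea of one mask per layer, reused at every timestep, can be shown with a toy recurrence. This sketch uses a plain tanh RNN instead of an LSTM and rescales kept units by 1/keep (inverted dropout); both choices are assumptions made to keep the example short, and all names are hypothetical.

```python
import numpy as np

def sample_shared_mask(hidden_size, drop_prob, rng):
    """Sample one dropout mask per sequence; kept units are rescaled by 1/keep."""
    keep = 1.0 - drop_prob
    return rng.binomial(1, keep, size=hidden_size) / keep

def run_toy_rnn(inputs, W, U, mask):
    """Toy tanh recurrence that applies the SAME mask to the hidden state at every step."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ (mask * h) + U @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden, input_dim, steps = 8, 3, 5
W = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, input_dim))
inputs = rng.normal(size=(steps, input_dim))

mask = sample_shared_mask(hidden, drop_prob=0.1, rng=rng)   # sampled once, not per timestep
states = run_toy_rnn(inputs, W, U, mask)
```

Sampling a fresh mask at every timestep would instead inject independent noise into the recurrence at each step, which is the amplification the shared mask avoids.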
2.2 Constrained A* Decoding
The approach described so far does not model any dependencies between the output tags. To incorporate constraints on the output structure at decoding time, we use A* search over tag prefixes for decoding. Starting with an empty sequence, the tag sequence is built from left to right. The score for a partial sequence with length $t$ is defined as:
$$f(w, y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid w) - \sum_{c \in \mathcal{C}} c(w, y_{1:i}) \tag{17}$$
An admissible A* heuristic can be computed efficiently by summing over the best possible tags for all timesteps after $t$:
$$g(w, y_{1:t}) = \sum_{i=t+1}^{n} \max_{y_i \in T} \log p(y_i \mid w) \tag{18}$$
Exploration of the prefixes is determined by an agenda $A$ which is sorted by $f(w, y_{1:t}) + g(w, y_{1:t})$. In the worst case, A* explores exponentially many prefixes, but because the distribution $p(y_t \mid w)$ learned by the BiLSTM models is very peaked, the algorithm is efficient in practice. We list some example constraints as follows:
BIO Constraints These constraints reject any sequence that does not produce valid BIO transitions, such as $B_{\text{ARG0}}$ followed by $I_{\text{ARG1}}$.
SRL Constraints Punyakanok et al. (2008) and Täckström et al. (2015) described a list of SRL-specific global constraints:
• Unique core roles (U): Each core role (ARG0-ARG5, ARGA) should appear at most once for each predicate.
• Continuation roles (C): A continuation role C-X can exist only when its base role X is realized before it.
• Reference roles (R): A reference role R-X can exist only when its base role X is realized (not necessarily before R-X).
We only enforce U and C constraints, since the R constraints are more commonly violated in gold data and enforcing them results in worse performance (see discussions in Section 4.3).
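The U and C constraints can be checked on a tag sequence with a short helper. This is a sketch under assumptions: tags are hyphenated strings such as `B-ARG1` and `B-C-ARG1`, and the names `CORE_ROLES` and `count_uc_violations` are hypothetical. Reference roles are ignored here because the R constraint is not enforced.

```python
from typing import Dict, List

CORE_ROLES = {"ARG0", "ARG1", "ARG2", "ARG3", "ARG4", "ARG5", "ARGA"}

def count_uc_violations(tags: List[str]) -> Dict[str, int]:
    """Count unique-core-role (U) and continuation-role (C) violations."""
    seen = set()                      # base roles realized so far, left to right
    unique_violations = 0
    continuation_violations = 0
    for tag in tags:
        if not tag.startswith("B-"):
            continue                  # only span starts introduce a role
        role = tag[2:]
        if role.startswith("C-"):
            if role[2:] not in seen:  # C-X requires X earlier in the sequence
                continuation_violations += 1
        elif role.startswith("R-"):
            continue                  # R constraint is not checked here
        else:
            if role in CORE_ROLES and role in seen:
                unique_violations += 1
            seen.add(role)
    return {"U": unique_violations, "C": continuation_violations}

print(count_uc_violations(["B-ARG0", "O", "B-ARG2", "I-ARG2", "B-ARG2"]))
```

Because both checks depend only on the prefix seen so far, they fit the prefix-based constraint functions used during decoding.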
We can enforce consistency with a given parse tree by rejecting or penalizing arguments that are not constituents. In Section 4.4, we will discuss the motivation behind using syntactic constraints and experimental results using both predicted and gold syntax.
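The A* search over tag prefixes described in this section can be sketched with a priority queue. This is a minimal illustration with hard BIO constraints only, assuming per-token log-probabilities stored as dictionaries and hyphenated tag strings; `astar_decode` and `bio_allowed` are hypothetical names, not the released implementation.

```python
import heapq
import itertools
import math

def bio_allowed(prev_tag, tag):
    """Hard BIO constraint: I-X may only follow B-X or I-X."""
    if not tag.startswith("I-"):
        return True
    return prev_tag is not None and prev_tag != "O" and prev_tag[2:] == tag[2:]

def astar_decode(log_probs):
    """Return the best BIO-consistent tag sequence and its score."""
    n = len(log_probs)
    # heuristic[t]: best achievable score for positions t..n-1, ignoring constraints.
    heuristic = [0.0] * (n + 1)
    for t in range(n - 1, -1, -1):
        heuristic[t] = heuristic[t + 1] + max(log_probs[t].values())
    tie = itertools.count()                       # tie-breaker for equal priorities
    # heapq is a min-heap, so priorities are negated.
    agenda = [(-heuristic[0], next(tie), 0.0, ())]
    while agenda:
        _, _, f, prefix = heapq.heappop(agenda)
        t = len(prefix)
        if t == n:                                # first complete sequence popped is optimal
            return list(prefix), f
        prev = prefix[-1] if prefix else None
        for tag, lp in log_probs[t].items():
            if bio_allowed(prev, tag):            # hard constraint: drop invalid extensions
                new_f = f + lp
                priority = -(new_f + heuristic[t + 1])
                heapq.heappush(agenda, (priority, next(tie), new_f, prefix + (tag,)))
    return None, -math.inf

# Toy example: independent per-token argmax would give the invalid pair B-A, I-B.
toy = [
    {"O": math.log(0.30), "B-A": math.log(0.60), "I-A": math.log(0.05), "I-B": math.log(0.05)},
    {"O": math.log(0.08), "B-A": math.log(0.02), "I-A": math.log(0.40), "I-B": math.log(0.50)},
]
best_tags, best_score = astar_decode(toy)
```

The heuristic is admissible because penalties can only lower the score of any completion, so the unconstrained per-token maximum never underestimates what remains.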
2.3 Predicate Detection
While the CoNLL 2005 shared task assumes gold predicates as input (Carreras and Màrquez, 2005) , this information is not available in many downstream applications. We propose a simple model for end-to-end SRL, where the system first predicts a set of predicate words v from the input sentence w. Then each predicate in v is used as an input to argument prediction. We independently predict whether each word in the sentence is a predicate, using a binary softmax over the outputs of a bidirectional LSTM trained to maximize the likelihood of the gold labels.
We measure the performance of our SRL system on two PropBank-style, span-based SRL datasets: CoNLL-2005 (Carreras and Màrquez, 2005) and CoNLL-2012 (Pradhan et al., 2013).³ Both datasets provide gold predicates (their index in the sentence) as part of the input. Therefore, each provided predicate corresponds to one training/test tag sequence. We follow the train-development-test split for both datasets and use the official evaluation script from the CoNLL 2005 shared task for evaluation on both datasets.
3.2 Model Setup
Our network consists of 8 BiLSTM layers (4 forward LSTMs and 4 reversed LSTMs) with 300-dimensional hidden units, and a softmax layer for predicting the output distribution.
Initialization All the weight matrices in BiLSTMs are initialized with random orthonormal matrices as described in Saxe et al. (2013). All tokens are lower-cased and initialized with 100-dimensional GloVe embeddings pre-trained on 6B tokens (Pennington et al., 2014) and updated during training. Tokens that are not covered by GloVe are replaced with a randomly initialized UNK embedding.
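One common way to draw a random orthonormal matrix is to take the Q factor of a QR decomposition of a Gaussian matrix. The sketch below shows that recipe with NumPy; it is an assumption about how such an initialization can be done, not a description of the exact procedure used in our code, and `orthonormal_init` is a hypothetical name.

```python
import numpy as np

def orthonormal_init(rows, cols, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns."""
    big, small = max(rows, cols), min(rows, cols)
    gaussian = rng.normal(size=(big, small))
    q, r = np.linalg.qr(gaussian)          # q: (big, small) with orthonormal columns
    q = q * np.sign(np.diag(r))            # fix column signs so the draw is not biased
    return q if rows >= cols else q.T

rng = np.random.default_rng(0)
W_square = orthonormal_init(6, 6, rng)
W_wide = orthonormal_init(4, 6, rng)
```

For a square matrix both the rows and the columns are orthonormal; for a rectangular one only the shorter dimension can be.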
Training We use Adadelta (Zeiler, 2012) with ε = 1e−6 and ρ = 0.95 and mini-batches of size 80. We set the RNN-dropout probability to 0.1 and clip gradients with norm larger than 1. All the models are trained for 500 epochs with early stopping based on development results.⁴
Ensembling We use a product of experts (Hinton, 2002) to combine predictions of 5 models, each trained on 80% of the training corpus and validated on the remaining 20%. For the CoNLL 2012 corpus, we split the training data from each sub-genre into 5 folds, such that the training data will have similar genre distributions.
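A product of experts over per-token tag distributions amounts to summing log-probabilities across models and renormalizing. The sketch below assumes each model outputs strictly positive softmax probabilities in an array of shape (tokens, tags); the function name and the toy numbers are illustrative only.

```python
import numpy as np

def product_of_experts(probs):
    """Combine model outputs of shape (models, tokens, tags) into (tokens, tags)."""
    log_p = np.log(probs).sum(axis=0)                 # multiply experts in log space
    log_p -= log_p.max(axis=-1, keepdims=True)        # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)          # renormalize per token

# Two toy experts, one token, two tags.
experts = np.array([
    [[0.5, 0.5]],     # an uncertain expert
    [[0.8, 0.2]],     # a confident expert
])
combined = product_of_experts(experts)
```

An expert that is uniform over the tags leaves the combined distribution unchanged, while agreement between confident experts sharpens it.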
Constrained Decoding We experimented with different types of constraints on the CoNLL 2005 and CoNLL 2012 development sets. Only the BIO hard constraints significantly improve over the ensemble model. Therefore, in our final results, we only use BIO hard constraints during decoding.⁵
In Tables 1 and 2, we compare our best single and ensemble models with previous work. Our ensemble (PoE) has an absolute improvement of 2.1 F1 on both CoNLL 2005 and CoNLL 2012 over the previous state of the art. Our single model also achieves more than a 0.4 improvement on both datasets. In comparison with the best reported results, our percentage of completely correct predicates improves by 5.9 points. While the continuing trend of improving SRL without syntax seems to suggest that neural end-to-end systems no longer need parsers, our analysis in Section 4.4 will show that accurate syntactic information can improve these deep models.
Figure 2 shows learning curves of our model ablations on the CoNLL 2005 development set. We ablate our full model by removing highway connections, RNN-dropout, and orthonormal initialization independently. Without dropout, the model overfits at around 300 epochs at 78 F1. Orthonormal parameter initialization is surprisingly important: without it, the model achieves only 65 F1 within the first 50 epochs. All 8-layer ablations suffer a loss of more than 1.7 in absolute F1 compared to the full model.
3.5 End-to-End SRL
The network for predicate detection (Section 2.3) contains 2 BiLSTM layers with 100-dimensional hidden units, and is trained for 30 epochs. For end-to-end evaluation, all arguments predicted for the false positive predicates are counted as precision loss, and all arguments for the false negative predicates are considered as recall loss. Table 3 shows the predicate detection F1 as well as end-to-end SRL results with predicted predicates.⁶ On CoNLL 2005, the predicate detector achieved over 96 F1, and the final SRL results only drop 1.2-3.5 F1 compared to using the gold predicates. However, on CoNLL 2012, the predicate detector has only about 90 F1, and the final SRL results decrease by up to 6.2 F1. This is at least in part due to the fact that CoNLL 2012 contains some nominal and copula predicates (Weischedel et al., 2013), making predicate identification a more challenging problem.
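The end-to-end scoring rule can be illustrated with set-based precision, recall, and F1 over (predicate index, start, end, role) tuples. This is a simplified sketch, not the official CoNLL 2005 evaluation script; the tuple layout, the function name, and the toy data are assumptions.

```python
from typing import Set, Tuple

Argument = Tuple[int, int, int, str]   # (predicate index, span start, span end, role)

def precision_recall_f1(gold: Set[Argument], predicted: Set[Argument]):
    """Arguments of spurious predicates hurt precision; arguments of missed predicates hurt recall."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy case: predicate 2 is found, predicate 6 is missed, predicate 4 is a false positive.
gold_args = {(2, 0, 1, "ARG0"), (2, 3, 4, "ARG1"), (6, 5, 5, "ARG0")}
pred_args = {(2, 0, 1, "ARG0"), (2, 3, 4, "ARG1"), (4, 3, 3, "ARG1")}
p, r, f = precision_recall_f1(gold_args, pred_args)
```

Including the predicate index in each tuple is what makes predicate detection errors propagate into the argument scores.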
To better understand our deep SRL model and its relation to previous work, we address the following questions with a suite of empirical analyses:
• What is the model good at and what kinds of mistakes does it make?
• How well do LSTMs model global structural consistency, despite conditionally independent tagging decisions?
• Is our model implicitly learning syntax, and could explicitly modeling syntax still help?
All the analysis in this section is done on the CoNLL 2005 development set with gold predicates, unless otherwise stated. We are also able to compare to previous systems whose model predictions are available online (Punyakanok et al., 2005; Pradhan et al., 2005).⁷
4.1 Error Types Breakdown
Inspired by Kummerfeld et al. (2012), we define a set of oracle transformations that fix various prediction errors sequentially and observe the relative improvement after each operation (see Table 4). Figure 3 compares our model with previous systems in terms of different types of mistakes. While our model makes a similar number of labeling errors to traditional syntax-based systems, it has far fewer missing arguments (perhaps because parser errors make some arguments difficult to recover for syntax-based systems).
Figure 3: Performance after doing each type of oracle transformation in sequence, compared to two strong non-neural baselines. The gap is closed after the Add Arg. transformation, showing how our approach is gaining from predicting more arguments than traditional systems.
Label Confusion As shown in Table 4, our system most commonly makes labeling errors, where the predicted span is an argument but the role was incorrectly labeled. Table 5 shows a confusion matrix for the most frequent labels. The model often confuses ARG2 with AM-DIR, AM-LOC and AM-MNR. These confusions can arise due to the use of ARG2 in many verb frames to represent semantic relations such as direction or location. For example, ARG2 in the frame move.01 is defined as Arg2-GOL: destination.⁸ This type of argument-adjunct distinction is known to be difficult (Kingsbury et al., 2002), and it is not surprising that our neural model has many such failure cases.
Attachment Mistakes A second common source of error is reflected by the Merge Spans transformation (10.6%) and the Split Spans transformation (14.7%). These errors are closely tied to prepositional phrase (PP) attachment errors, which are also known to be some of the biggest challenges for linguistic analysis (Kummerfeld et al., 2012). Figure 4 shows the distribution of syntactic span labels involved in an attachment mistake, where 62% of the syntactic spans are prepositional phrases. For example, in "Sumitomo financed the acquisition from Sears", our model mistakenly labels the prepositional phrase "from Sears" as the ARG2 of "financed", whereas it should instead attach to "acquisition".
4.2 Long-Range Dependencies
To analyze the model's ability to capture long-range dependencies, we compute the F1 of our model on arguments with various distances to the predicate. Figure 5 shows that performance tends to degrade, for all models, for arguments further from the predicate. Interestingly, the gap between shallow and deep models becomes much larger for the long-distance predicate-argument structures. The absolute gap between our 2-layer and 8-layer models is 3-4 F1 for arguments that are within 2 words of the predicate, and 5-6 F1 for arguments that are farther away from the predicate. Surprisingly, the neural model performance deteriorates less severely on long-range dependencies than that of traditional syntax-based models.
Figure 4: For cases where our model either splits a gold span into two (Z → XY) or merges two gold constituents (XY → Z), we show the distribution of syntactic labels for the Y span. Results show the major cause of these errors is inaccurate prepositional phrase attachment.
4.3 Structural Consistency
We can quantify two types of structural consistencies: the BIO constraints and the SRL-specific constraints. Via our ablation study, we show that deeper BiLSTMs are better at enforcing these structural consistencies, although not perfectly.
Figure 6: Example where performance is hurt by enforcing the constraint that core roles may only occur once (+SRL).
The BIO format requires argument spans to begin with a B tag. Any I tag directly following an O tag or a tag with a different label is considered a violation. Table 6 shows the number of BIO violations per token for BiLSTMs with different depths. The number of BIO violations decreases when we use a deeper model. The gap is biggest between 2-layer and 4-layer models, and diminishes after that. It is surprising that although the deeper models generate impressively accurate token-level predictions, they still make enough BIO errors to significantly hurt performance, even though these constraints are simple enough to be enforced by trivial rules. We compare the average entropy of tokens involved in BIO violations with the average entropy of all tokens. For the 8-layer model, the average entropy on these tokens is 30 times higher than the average entropy on all tokens. This suggests that BIO inconsistencies occur when there is some ambiguity. For example, if the model is unsure whether two consecutive words should belong to an ARG0 or ARG1, it might generate inconsistent BIO sequences such as $B_{\text{ARG0}}, I_{\text{ARG1}}$ when decoding at the token level. Using BIO-constrained decoding can resolve this ambiguity and result in a structurally consistent solution.
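The two quantities used in this comparison, BIO violations and per-token entropy, can be computed as in the sketch below. Hyphenated tag strings and the function names are assumptions for illustration; the sketch counts each offending I tag as one violation, which may differ in detail from how the figures in Table 6 were tallied.

```python
import math
from typing import List, Sequence

def count_bio_violations(tags: Sequence[str]) -> int:
    """Count I- tags that follow O (or sequence start) or a tag with a different role."""
    violations = 0
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            violations += 1
        prev = tag
    return violations

def entropy(distribution: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's distribution over tags."""
    return -sum(p * math.log(p) for p in distribution if p > 0)

def average_entropy(distributions: List[Sequence[float]]) -> float:
    """Mean entropy over a list of per-token distributions."""
    return sum(entropy(d) for d in distributions) / len(distributions)

greedy_tags = ["B-ARG0", "I-ARG1", "O", "I-ARG0"]
num_violations = count_bio_violations(greedy_tags)
```

A peaked distribution has entropy near zero, so a large entropy at a violating token indicates that the model was genuinely torn between labels there.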
SRL Structure Violations The model predictions can also violate the SRL-specific constraints commonly used in prior work (Punyakanok et al., 2008; Täckström et al., 2015). As shown in Table 7, the model occasionally violates these SRL constraints. With our constrained decoding algorithm, it is straightforward to enforce the unique core roles (U) and continuation roles (C) constraints during decoding. The constrained decoding results are shown with the model named L8+PoE+SRL in Table 7.
Although the violations are eliminated, the performance does not significantly improve. This is mainly due to two factors: (1) the model often already satisfies these constraints on its own, so the number of violations to be fixed is relatively small, and (2) the gold SRL structure sometimes violates the constraints, and enforcing hard constraints can hurt performance. Figure 6 shows a sentence in the CoNLL 2005 development set. Our original model produces two ARG2s for the predicate quicken, and this violates the SRL constraints. When the A* decoder fixes this violation, it changes the first ARG2 into ARG1 because ARG0, ARG1, ARG2 is a more frequent pattern in the training data and has a higher overall score.
Table 6: Comparison of BiLSTM models without BIO decoding. We compare F1 and token-level accuracy (Token), averaged BIO violations per token (BIO), overall model entropy (All), and model entropy at tokens involved in BIO violations (BIO).
Increasing the depth of the model beyond 4 does not produce more structurally consistent output, emphasizing the need for constrained decoding.
4.4 Can Syntax Still Help SRL?
The PropBank-style SRL formalism is closely tied to syntax (Bonial et al., 2010; Weischedel et al., 2013). In Table 7, we show that 98.7% of the gold SRL arguments match an unlabeled constituent in the gold syntax tree. Similar to some recent work (Zhou and Xu, 2015), our model achieves strong performance without directly modeling syntax. A natural question follows: are neural SRL models implicitly learning syntax? Table 7 shows the trend of deeper models making predictions that are more consistent with the gold syntax in terms of span boundaries. With our best model (L8+PoE), 94.3% of the predicted argument spans are part of the gold parse tree. This consistency is on par with previous CoNLL 2005 systems that directly model constituency and use predicted parse trees as features (Punyakanok, 95.3% and Pradhan, 93.0%).
Table 7: Comparison of models with different depths and decoding constraints (in addition to BIO) as well as two previous systems. We compare F1, unlabeled agreement with gold constituency (Syn%) and each type of SRL-constraint violations (Unique core roles, Continuation roles and Reference roles). Our best model produces a similar number of constraint violations to the gold annotation, explaining why deterministically enforcing these constraints is not helpful.
Constrained Decoding with Syntax The above analysis raises a further question: would improving consistency with syntax provide improvements for SRL? Our constrained decoding algorithm described in Section 2.2 enables us to inject syntax as a decoding constraint without having to retrain the model. Specifically, if the decoded sequence contains k arguments that do not match any unlabeled syntactic constituent, it will receive a penalty of kC, where C is a single parameter dictating how much the model should trust the provided syntax. In Figure 7, we compare the SRL accuracy with syntactic constraints specified by gold parse or automatic parses. When using gold syntax, the predictions improve up to 2 F1 as the penalty increases. A state-of-the-art parser (Choe and Charniak, 2016) provides smaller gains, while using the Charniak parser (Charniak, 2000) hurts performance if the model places too much trust in it. These results suggest that high-quality syntax can still make a large impact on SRL.
A known challenge for syntactic parsers is robustness on out-of-domain data. Therefore we provide experimental results in Table 8. On the CoNLL 2005 development set, the predicted syntax gives a 0.5 F1 improvement over our best model, while on in-domain test and out-of-domain development sets, the improvement is much smaller. As expected, on CoNLL 2012, syntax improves most on the newswire (NW) domain. These improvements suggest that while decoding with hard constraints is beneficial, joint training or multi-task learning could be even more effective by leveraging full, labeled syntactic structures.
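The syntactic penalty kC can be sketched for a complete tag sequence as follows. Constituents are assumed to be given as a set of (start, end) token spans and tags as hyphenated strings; the helper names are hypothetical, and the prefix-by-prefix bookkeeping needed inside the decoder is omitted.

```python
from typing import List, Sequence, Set, Tuple

def bio_to_spans(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Recover (start, end_inclusive, role) spans from a BIO tag sequence."""
    spans, start, role = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes the last span
        continues = tag.startswith("I-") and role == tag[2:]
        if start is not None and not continues:
            spans.append((start, i - 1, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
    return spans

def syntax_penalty(tags: Sequence[str], constituents: Set[Tuple[int, int]], C: float) -> float:
    """Penalty k * C, where k counts argument spans that match no constituent."""
    k = sum(1 for start, end, _ in bio_to_spans(tags) if (start, end) not in constituents)
    return k * C

toy_tags = ["B-ARG0", "I-ARG0", "O", "B-ARG1"]
toy_constituents = {(0, 1), (2, 3), (0, 3)}           # hypothetical unlabeled parse spans
penalty = syntax_penalty(toy_tags, toy_constituents, C=10.0)
```

Setting C to 0 ignores the parse entirely, while a very large C makes the constraint effectively hard, which is the trade-off explored as the penalty increases.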
5 Related Work
Traditional approaches to semantic role labeling have used syntactic parsers to identify constituents and model long-range dependencies, and enforced global consistency using integer linear programming (Punyakanok et al., 2008) or dynamic programs (Täckström et al., 2015). More recently, neural methods have been employed on top of syntactic features (FitzGerald et al., 2015; Roth and Lapata, 2016). Our experiments show that off-the-shelf neural methods have a remarkable ability to learn long-range dependencies, syntactic constituency structure, and global constraints without coding task-specific mechanisms for doing so.
An alternative line of work has attempted to reduce the dependency on syntactic input for semantic role labeling models. Collobert et al. (2011) first introduced an end-to-end neural-based approach with sequence-level training, using a convolutional neural network to model the context window. However, their best system fell short of traditional feature-based systems. Neural methods have also been used as classifiers in transition-based SRL systems (Henderson et al., 2013; Swayamdipta et al., 2016). Most recently, several successful LSTM-based architectures have achieved state-of-the-art results in English span-based SRL (Zhou and Xu, 2015), Chinese SRL (Wang et al., 2015), and dependency-based SRL (Marcheggiani et al., 2017) with little to no syntactic input. Our techniques push results to more than 3 F1 over the best syntax-based models. However, we also show that there is potential for syntax to further improve performance.
Figure 7: Performance of syntax-constrained decoding as the non-constituent penalty increases for syntax from two parsers (from Choe and Charniak (2016) and Charniak (2000)) and gold syntax. The best existing parser gives a small improvement, but the improvement from gold syntax shows that there is still potential for syntax to help SRL.
6 Conclusion and Future Work
We presented a new deep learning model for span-based semantic role labeling with a 10% relative error reduction over the previous state of the art. Our ensemble of 8-layer BiLSTMs incorporated some of the recent best practices such as orthonormal initialization, RNN-dropout, and highway connections, and we have shown that they are crucial for getting good results with deep models.
Extensive error analysis sheds light on the strengths and limitations of our deep SRL model, with detailed comparison against shallower models and two strong non-neural systems. While our deep model is better at recovering long-distance predicate-argument relations, we still observe structural inconsistencies, which can be alleviated by constrained decoding.
Finally, we posed the question of whether deep SRL still needs syntactic supervision. Despite recent success without syntactic input, we found that our best neural model can still benefit from accurate syntactic parser output via straightforward constrained decoding. In our oracle experiment, we observed a 3 F1 improvement by leveraging gold syntax, showing the potential for high quality parsers to further improve deep SRL models.
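$$h_{l,t} = r_{l,t} \circ h'_{l,t} + (1 - r_{l,t}) \circ W_h^l x_{l,t} \tag{14}$$
Here $h'_{l,t}$ denotes the LSTM output at layer $l$ and timestep $t$ before the highway connection is applied.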
3 We used the version of OntoNotes downloaded at: http://cemantix.org/data/ontonotes.html.
4 Training the full model on CoNLL 2005 takes about 5 days on a single Titan X Pascal GPU.
5 A* search in this setting finds the optimal sequence for all sentences and is therefore equivalent to Viterbi decoding.
6 The frame identification numbers reported in Pradhan et al. (2013) are not comparable, due to errors in the original release of the data, as mentioned in.
7 Model predictions of CoNLL 2005 systems: http://www.cs.upc.edu/~srlconll/st05/st05.html
8 Source: Unified verb index: http://verbs.colorado.edu.