Shallow Syntax in Deep Water


Abstract

Shallow syntax provides an approximation of the phrase-syntactic structure of sentences; it can be produced with high accuracy, and is computationally cheap to obtain. We investigate the role of shallow syntax-aware representations for NLP tasks using two techniques. First, we enhance the ELMo architecture to allow pretraining on predicted shallow syntactic parses, instead of just raw text, so that contextual embeddings make use of shallow syntactic context. Our second method involves shallow syntactic features obtained automatically on downstream task data. Neither approach leads to a significant gain on any of the four downstream tasks we considered relative to ELMo-only baselines. Further analysis using black-box probes confirms that our shallow-syntax-aware contextual embeddings do not transfer to linguistic tasks any more easily than ELMo's embeddings. We take these findings as evidence that ELMo-style pretraining discovers representations which make additional awareness of shallow syntax redundant.

1 Introduction

The NLP community is revisiting the role of linguistic structure in applications with the advent of contextual word representations (CWRs) derived from pretraining language models on large corpora (Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018). Recent work has shown that downstream task performance may benefit from explicitly injecting a syntactic inductive bias into model architectures (Kuncoro et al., 2018), even when CWRs are also used (Strubell et al., 2018). However, high-quality linguistic structure annotation at a large scale remains expensive; a trade-off needs to be made between the quality of the annotations and the computational expense of obtaining them.

Figure 1: A sentence with its phrase-syntactic tree (brown) and shallow syntactic (chunk) annotations (red). Nodes in the tree which percolate down as chunk labels are in red. Not all tokens in the sentence get chunk labels; e.g., punctuation is not part of a chunk.

Shallow syntactic structures (Abney, 1991; also called chunk sequences) offer a viable middle ground, by providing a flat, non-hierarchical approximation to phrase-syntactic trees (see Fig. 1 for an example). These structures can be obtained efficiently, and with high accuracy, using sequence labelers. In this paper we consider shallow syntax to be a proxy for linguistic structure.

While shallow syntactic chunks are almost as ubiquitous as part-of-speech tags in standard NLP pipelines (Jurafsky and Martin, 2000), their relative merits in the presence of CWRs remain unclear. We investigate the role of these structures using two methods. First, we enhance the ELMo architecture (Peters et al., 2018b) to allow pretraining on predicted shallow syntactic parses, instead of just raw text, so that contextual embeddings make use of shallow syntactic context (§2). Our second method involves classical addition of chunk features to CWR-infused architectures for four different downstream tasks (§3). Shallow syntactic information is obtained automatically using a highly accurate model (97% F1 on standard benchmarks). In both settings, we observe only modest gains on three of the four downstream tasks relative to ELMo-only baselines (§4).

Recent work has probed the knowledge encoded in CWRs and found they capture a surprisingly large amount of syntax (Blevins et al., 2018; Liu et al., 2019; Tenney et al., 2019). We further examine the contextual embeddings obtained from the enhanced architecture and a shallow syntactic context, using black-box probes from Liu et al. (2019). Our analysis indicates that our shallow-syntax-aware contextual embeddings do not transfer to linguistic tasks any more easily than ELMo embeddings (§4.2).

Overall, our findings show that while shallow syntax can be somewhat useful, ELMo-style pretraining discovers representations which make additional awareness of shallow syntax largely redundant.

2 Pretraining With Shallow Syntactic Annotations

We briefly review the shallow syntactic structures used in this work, and then present a model architecture to obtain embeddings from shallow Syntactic Context (mSynC).

2.1 Shallow Syntax

Base phrase chunking is a cheap sequence-labeling-based alternative to full syntactic parsing, where the sequence consists of non-overlapping labeled segments (Fig. 1 includes an example). Full syntactic trees can be converted into such shallow syntactic chunk sequences using a deterministic procedure (Jurafsky and Martin, 2000). Tjong Kim Sang and Buchholz (2000) offered a rule-based transformation deriving non-overlapping chunks from phrase-structure trees as found in the Penn Treebank (Marcus et al., 1993). The procedure percolates some syntactic phrase nodes from the phrase-syntactic tree down to the leaves, where they become chunk labels. All overlapping embedded phrases are then removed, and the remainder of the phrase gets the percolated label; this remainder usually corresponds to the head word of the phrase. In order to obtain shallow syntactic annotations on a large corpus, we train a BiLSTM-CRF model (Lample et al., 2016; Peters et al., 2017), which achieves 97% F1 on the CoNLL 2000 benchmark test set. The training data is obtained from the CoNLL 2000 shared task (Tjong Kim Sang and Buchholz, 2000), as well as the remaining sections of the Penn Treebank (except §23 and §20), using the official script for chunk generation. 1 The standard task definition from the shared task includes eleven chunk labels, as shown in Table 1.

Table 1: Shallow syntactic chunk phrase types from CoNLL 2000 shared task (Tjong Kim Sang and Buchholz, 2000) and their occurrence % in the training data.
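To make the chunk representation concrete, the following toy example (our own construction, not drawn from the training data) shows a sentence segmented into non-overlapping labeled chunks and the corresponding token-level BIOUL encoding, the same encoding used for the features in §3:

```python
# Illustrative example (toy sentence, not from the corpus) of shallow syntactic
# annotation: non-overlapping labeled chunks and their BIOUL-encoded token tags.
sentence = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
chunks = [("NP", 0, 3), ("VP", 4, 4), ("PP", 5, 5), ("NP", 6, 8)]  # (label, start, end), inclusive

def to_bioul(tokens, chunks):
    """Convert labeled spans to BIOUL tags; tokens outside any chunk get 'O'."""
    tags = ["O"] * len(tokens)
    for label, start, end in chunks:
        if start == end:
            tags[start] = f"U-{label}"            # single-token chunk
        else:
            tags[start] = f"B-{label}"            # chunk-initial token
            tags[end] = f"L-{label}"              # chunk-final token
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"            # chunk-internal tokens
    return tags

print(to_bioul(sentence, chunks))
# ['B-NP', 'I-NP', 'I-NP', 'L-NP', 'U-VP', 'U-PP', 'B-NP', 'I-NP', 'L-NP', 'O']
```

Note that the final punctuation token receives no chunk label, mirroring Fig. 1.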

2.2 Pretraining Objective

Traditional language models are estimated to maximize the likelihood of each word $x_i$ given the words that precede it, $p(x_i \mid x_{<i})$. We instead condition each word on its shallow syntactic history as well, estimating $p(x_i \mid x_{<i}, c_{<i})$, where $c_{<i}$ denotes the chunks (boundaries and labels) preceding position $i$; the chunks themselves are observed, not predicted. 2

Figure 2: Model architecture for pretraining with shallow syntax. A sequential encoder converts the raw text into CWRs (shown in blue). Observed shallow syntactic structure (chunk boundaries and labels, shown in red) is combined with these CWRs in a shallow syntactic encoder to get contextualized representations for chunks (shown in orange). Both representations are passed through a projection layer to get mSynC embeddings (details shown only in some positions, for clarity), used both for computing the data likelihood, as shown, and in downstream tasks.

A right-to-left model is constructed analogously, conditioning on $c_{>i}$ alongside $x_{>i}$. Following Peters et al. (2018a), we use a joint objective maximizing the data likelihood in both directions, with shared softmax parameters.
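Putting the two directions together, the joint objective can be sketched as below; the notation ($\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ for direction-specific parameters, with softmax parameters shared between them) is introduced here for exposition and is not reproduced verbatim from the original:

$$
\mathcal{L}(\theta) = \sum_{i=1}^{n} \Big( \log p(x_i \mid x_{<i}, c_{<i}; \overrightarrow{\theta}) + \log p(x_i \mid x_{>i}, c_{>i}; \overleftarrow{\theta}) \Big),
$$

for a sentence $x_1, \dots, x_n$ with observed chunk sequence $c$.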

2.3 Pretraining Model Architecture

Our model uses two encoders: $e_{\text{seq}}$ for encoding the sequential history ($x_{<i}$), and $e_{\text{syn}}$ for the shallow syntactic (chunk) history ($c_{<i}$). Both encoders are based on transformers (Vaswani et al., 2017), which consist of large feedforward networks equipped with multi-headed self-attention mechanisms.

As inputs to $e_{\text{seq}}$, we use a context-independent embedding for each token $x_i$, obtained from a CNN character encoder (Kim et al., 2016). The outputs $h_i$ from $e_{\text{seq}}$ represent words in context. Next, we build representations for (observed) chunks in the sentence by concatenating a learned embedding for the chunk label with the $h$ vectors for the chunk's boundary tokens, and applying a linear projection ($f_{\text{proj}}$). The output from $f_{\text{proj}}$ is input to $e_{\text{syn}}$, the shallow syntactic encoder, which produces contextualized chunk representations, $g$. Note that the number of chunks in the sentence is less than or equal to the number of tokens.

Each $h_i$ is now concatenated with $g_{c_i}$, where $c_i$ is the last chunk before position $i$ and $g_{c_i}$ is its contextualized representation. Finally, the output is given by

$$
\text{mSynC}_i = u_{\text{proj}}(h_i, g_{c_i}) = W\,[h_i ; g_{c_i}],
$$

where $W$ is a model parameter. For training, $\text{mSynC}_i$ is used to compute the probability of the next word, using a sampled softmax (Bengio et al., 2003). For downstream tasks, we use a learned linear weighting of all layers in the encoders to obtain a task-specific mSynC, following Peters et al. (2018a).
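A minimal PyTorch sketch of this combination is given below. It illustrates the description above and is not the authors' released code: `seq_encoder` and `syn_encoder` stand in for $e_{\text{seq}}$ and $e_{\text{syn}}$, and the tensor names, shapes, and the `token_to_chunk` index (mapping each token to the last chunk before it) are our own assumptions.

```python
import torch
import torch.nn as nn

class MSynC(nn.Module):
    """Sketch of the mSynC combination: h_i from e_seq, g_{c_i} from e_syn."""

    def __init__(self, seq_encoder, syn_encoder, num_chunk_labels,
                 token_dim=512, label_dim=128):
        super().__init__()
        self.seq_encoder = seq_encoder            # e_seq: token inputs -> h (CWRs)
        self.syn_encoder = syn_encoder            # e_syn: chunk inputs -> g
        self.label_emb = nn.Embedding(num_chunk_labels, label_dim)
        # f_proj: [h_start; h_end; label embedding] -> chunk input of size token_dim
        self.f_proj = nn.Linear(2 * token_dim + label_dim, token_dim)
        # u_proj (the W above): [h_i; g_{c_i}] -> mSynC_i
        self.u_proj = nn.Linear(2 * token_dim, token_dim)

    def forward(self, char_embs, chunk_starts, chunk_ends, chunk_labels, token_to_chunk):
        # h: contextualized token representations, (batch, n_tokens, token_dim)
        h = self.seq_encoder(char_embs)
        dim = h.size(-1)
        # Build one input vector per observed chunk from its boundary tokens + label.
        starts = torch.gather(h, 1, chunk_starts.unsqueeze(-1).expand(-1, -1, dim))
        ends = torch.gather(h, 1, chunk_ends.unsqueeze(-1).expand(-1, -1, dim))
        chunk_in = self.f_proj(torch.cat([starts, ends, self.label_emb(chunk_labels)], dim=-1))
        # g: contextualized chunk representations, (batch, n_chunks, token_dim)
        g = self.syn_encoder(chunk_in)
        # For each token i, pick g for the last chunk before i (index 0 by convention
        # for tokens preceding the first chunk).
        g_ci = torch.gather(g, 1, token_to_chunk.unsqueeze(-1).expand(-1, -1, g.size(-1)))
        # mSynC_i = W [h_i ; g_{c_i}]
        return self.u_proj(torch.cat([h, g_ci], dim=-1))
```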

Staged parameter updates Jointly training both the sequential encoder $e_{\text{seq}}$ and the syntactic encoder $e_{\text{syn}}$ can be expensive, due to the large number of parameters involved. To reduce cost, we initialize our sequential CWRs $h$ using pretrained embeddings from ELMo-transformer. Once initialized as such, the encoder is fine-tuned to the data likelihood objective (§2.2). This results in a staged parameter update, which reduces training duration by a factor of 10 in our experiments. We discuss the empirical effect of this approach in §4.3.

3 Shallow Syntactic Features

Our second approach incorporates shallow syntactic information in downstream tasks via token-level chunk label embeddings. Task training (and test) data is automatically chunked, and chunk boundary information is passed into the task model via BIOUL encoding of the labels. We add randomly initialized chunk label embeddings to task-specific input encoders, which are then fine-tuned for task-specific objectives. This approach does not require a shallow syntactic encoder or chunk annotations for pretraining CWRs, only a chunker. Hence, this can more directly measure the impact of shallow syntax for a given task. 3
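The sketch below illustrates this feature-based approach; it is a simplified stand-in for the task models we use (the class name and the 1024-dimensional CWR size are illustrative assumptions; the 50-dimensional chunk embedding follows Appendix A.1):

```python
import torch
import torch.nn as nn

class ChunkFeatureConcat(nn.Module):
    """Embed BIOUL chunk tags predicted on task data and concatenate them with CWRs."""

    def __init__(self, num_bioul_tags, cwr_dim=1024, chunk_dim=50):
        super().__init__()
        # Randomly initialized chunk-label embeddings, fine-tuned with the task objective.
        self.chunk_emb = nn.Embedding(num_bioul_tags, chunk_dim)
        self.output_dim = cwr_dim + chunk_dim

    def forward(self, cwrs, chunk_tag_ids):
        # cwrs: (batch, n_tokens, cwr_dim); chunk_tag_ids: (batch, n_tokens)
        # The concatenated output feeds the task-specific encoder (e.g., a BiLSTM-CRF).
        return torch.cat([cwrs, self.chunk_emb(chunk_tag_ids)], dim=-1)
```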

4 Experiments

Our experiments evaluate the effect of shallow syntax, via contextualization (mSynC, §2) and features (§3). We provide comparisons with four baselines: ELMo-transformer (Peters et al., 2018b), our reimplementation of the same, and two CWR-free baselines, with and without shallow syntactic features. Both ELMo-transformer and mSynC are trained on the 1B word benchmark corpus (Chelba et al., 2013); the latter also employs chunk annotations (§2.1). Experimental settings are detailed in Appendix §A.1.

Table 2: Test-set performance of ELMo-transformer (Peters et al., 2018b), our reimplementation, and mSynC, compared to baselines without CWR. Evaluation metric is F1 for all tasks except sentiment, which reports accuracy. Reported results show the mean and standard deviation across 5 runs for coarse-grained NER and sentiment classification and 3 runs for other tasks.

| Model | NER | Fine-grained NER | Parsing | Sentiment |
|---|---|---|---|---|
| Baseline (no CWR) | 88.1 ± 0.27 | 78.5 ± 0.19 | 88.9 ± 0.05 | 51.6 ± 1.63 |
| + shallow syn. features | 88.6 ± 0.22 | 78.9 ± 0.13 | 90.8 ± 0.14 | 51.1 ± 1.39 |
| ELMo-transformer (Peters et al., 2018b) | 91.1 ± 0.26 | - | 93.7 ± 0.00 | - |
| ELMo-transformer (our reimplementation) | 91.5 ± 0.25 | 85.7 ± 0.08 | 94.1 ± 0.06 | 53.0 ± 0.72 |
| + shallow syn. features | 91.6 ± 0.40 | 85.9 ± 0.28 | 94.3 ± 0.03 | 52.6 ± 0.54 |
| Shallow syn. contextualization (mSynC) | 91.5 ± 0.19 | 85.9 ± 0.20 | 94.1 ± 0.07 | 53.0 ± 1.07 |
Table 3: Test performance of ELMo-transformer (Peters et al., 2018b) vs. mSynC on several linguistic probes from Liu et al. (2019). In each case, performance of the best layer from the architecture is reported. Details on the probes can be found in §4.2.1.
Table 4: Validation F1 for fine-grained NER across syntactic pretraining schemes, with mean and standard deviations across 3 runs.

4.1 Downstream Task Transfer

We employ four tasks to test the impact of shallow syntax. The first three, namely coarse-grained and fine-grained named entity recognition (NER) and constituency parsing, are span-based; the fourth is a sentence-level sentiment classification task. Following Peters et al. (2018a), we do not apply fine-tuning to task-specific architectures, allowing us to do a controlled comparison with ELMo. Given an identical base architecture across models for each task, we can attribute any difference in performance to the incorporation of shallow syntax or contextualization. Details of downstream architectures are provided below, and overall dataset statistics for all tasks are shown in the Appendix, Table 5.


NER We use the English portion of the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003), which provides named entity annotations on newswire data across four different entity types (PER, LOC, ORG, MISC). A bidirectional LSTM-CRF architecture (Lample et al., 2016) and a BIOUL tagging scheme were used.

Fine-grained NER The same architecture and tagging scheme from above is also used to predict fine-grained entity annotations from OntoNotes 5.0 (Weischedel et al., 2011). There are 18 fine-grained NER labels in the dataset, covering regular named entities as well as entities such as dates, times, and common numerical entries.

Phrase-structure parsing We use the standard Penn Treebank splits, and adopt the span-based model from Stern et al. (2017). Following their approach, we used predicted part-of-speech tags from the Stanford tagger (Toutanova et al., 2003) for training and testing. About 51% of phrase-syntactic constituents align exactly with the predicted chunks used, with a majority being single-width noun phrases. Given that the rule-based procedure used to obtain chunks only propagates the phrase type to the head word and removes all overlapping phrases to the right, this is expected. We did not employ jack-knifing to obtain predicted chunks on PTB data; as a result, there might be differences in the quality of shallow syntax annotations between the train and test portions of the data.

Sentiment analysis We consider fine-grained (5-class) classification on the Stanford Sentiment Treebank (Socher et al., 2013). The labels are negative, somewhat negative, neutral, positive, and somewhat positive. Our model was based on the biattentive classification network (McCann et al., 2017). We used all phrase lengths in the dataset for training, but test results are reported only on full sentences, following prior work.

Results are shown in Table 2. Consistent with previous findings, CWRs offer large improvements across all tasks. Though helpful to span-level task models without CWRs, shallow syntactic features offer little to no benefit to ELMo models. mSynC's performance is similar. This holds even for phrase-structure parsing, where (gold) chunks align with syntactic phrases, indicating that task-relevant signal learned from exposure to shallow syntax is already learned by ELMo. On sentiment classification, chunk features are slightly harmful on average (but variance is high); mSynC again performs similarly to ELMo-transformer. Overall, the performance differences across all tasks are small enough to infer that shallow syntax is not particularly helpful when using CWRs.

4.2 Linguistic Probes

We further analyze whether awareness of shallow syntax carries over to other linguistic tasks, via probes from Liu et al. (2019). Probes are linear models trained on frozen CWRs to make predictions about linguistic (syntactic and semantic) properties of words and phrases. Unlike in §4.1, there is minimal downstream task architecture, bringing into focus the transferability of CWRs, as opposed to task-specific adaptation.
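Schematically, a probe in this setup is just a per-token linear classifier over frozen CWRs; the sketch below is our own illustration of the idea, not code from Liu et al. (2019):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear classifier over frozen contextual representations (per-token labels)."""

    def __init__(self, cwr_dim, num_labels):
        super().__init__()
        self.classifier = nn.Linear(cwr_dim, num_labels)

    def forward(self, frozen_cwrs):
        # frozen_cwrs: (batch, n_tokens, cwr_dim); detach() ensures no gradient
        # flows back into the pretrained encoder, so only the probe is trained.
        return self.classifier(frozen_cwrs.detach())
```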

4.2.1 Probing Tasks

The ten different probing tasks we used include CCG supertagging (Hockenmaier and Steedman, 2007), part-of-speech tagging from the PTB (Marcus et al., 1993) and the EWT (Universal Dependencies; Silveira et al., 2014), named entity recognition (Tjong Kim Sang and De Meulder, 2003), base-phrase chunking (Tjong Kim Sang and Buchholz, 2000), grammar error detection (Yannakoudakis et al., 2011), semantic tagging (Bjerva et al., 2016), preposition supersense identification (Schneider et al., 2018), and event factuality detection (Rudinger et al., 2018). Metrics and references for each are summarized in Table 6. For more details, please see Liu et al. (2019).

Results on the ten probes are shown in Table 3. Again, we see that the performance of the baseline ELMo-transformer and mSynC are similar, with mSynC doing slightly worse on 7 out of 9 tasks. As we would expect, on the probe for predicting chunk tags, mSynC achieves 96.9 F1 vs. 92.2 F1 for ELMo-transformer, indicating that mSynC is indeed encoding shallow syntax. Overall, the results further confirm that explicit shallow syntax does not offer any benefits over ELMo-transformer.

Table 6: Dataset and metrics for each probing task from Liu et al. (2019), corresponding to Table 3.

4.3 Effect Of Training Scheme

We test whether our staged parameter training (§2.3) is a viable alternative to end-to-end training of both $e_{\text{syn}}$ and $e_{\text{seq}}$. We make a further distinction between fine-tuning $e_{\text{seq}}$ vs. not updating it at all after initialization (frozen). Downstream validation-set F1 on fine-grained NER, reported in Table 4, shows that the end-to-end strategy lags behind the others, perhaps indicating the need to train longer than 10 epochs. However, a single epoch on the 1B-word benchmark takes 36 hours on 2 Tesla V100s, making this prohibitive. Interestingly, the frozen strategy, which takes the least amount of time to converge (24 hours on 1 Tesla V100), also performs almost as well as fine-tuning.
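The three schemes differ only in how $e_{\text{seq}}$ is initialized and whether it receives gradient updates; a minimal sketch is below, assuming a module with `seq_encoder` ($e_{\text{seq}}$) and `syn_encoder` ($e_{\text{syn}}$) attributes as in the sketch of §2.3 (names are illustrative):

```python
def configure_training(model, scheme, pretrained_elmo_state=None):
    """Return the parameters to optimize under a given pretraining scheme."""
    if scheme in ("fine-tuned", "frozen") and pretrained_elmo_state is not None:
        # Staged update: initialize e_seq from a pretrained ELMo-transformer.
        model.seq_encoder.load_state_dict(pretrained_elmo_state)
    if scheme == "frozen":
        # e_seq is never updated after initialization; only e_syn and the
        # projection layers are trained.
        for p in model.seq_encoder.parameters():
            p.requires_grad = False
    # "end-to-end": e_seq starts from random init and trains jointly with e_syn.
    return [p for p in model.parameters() if p.requires_grad]
```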

5 Conclusion

We find that exposing CWR-based models to shallow syntax, either through new CWR learning architectures or explicit pipelined features, has little effect on their performance, across several tasks. Linguistic probing also shows that CWRs aware of such structures do not improve task transferability. Our architecture and methods are general enough to be adapted for richer inductive biases, such as those given by full syntactic trees (RNNGs; Dyer et al., 2016), or to different pretraining objectives, such as masked language modeling (BERT; Devlin et al., 2018); we leave this pursuit to future work.

A Supplemental Material

A.1 Hyperparameters

ELMo-transformer Our baseline pretraining model was a reimplementation of that given in Peters et al. (2018b). Hyperparameters were generally identical, but we trained on only 2 GPUs with (up to) 4,000 tokens per batch. This difference in batch size meant we used 6,000 warm-up steps with the learning rate schedule of Vaswani et al. (2017).

mSynC The sequential encoder $e_{\text{seq}}$ is identical to the 6-layer biLM used in ELMo-transformer. The shallow syntactic encoder $e_{\text{syn}}$, on the other hand, uses only 2 layers. The learned embeddings for the chunk labels have 128 dimensions and are concatenated with the two boundary vectors $h$, each of dimension 512. Thus $f_{\text{proj}}$ maps 1024 + 128 dimensions to 512. Further, we did not perform weight averaging over several checkpoints.

Shallow Syntax The size of the shallow syntactic feature embedding was 50 across all experiments, initialized uniformly at random. All model implementations are based on the AllenNLP library (Gardner et al., 2017).

Table 5: Downstream dataset statistics describing the number of train, heldout and test set instances for each task.

| Dataset | Train | Heldout | Test |
|---|---|---|---|
| CoNLL 2003 NER (Tjong Kim Sang and De Meulder, 2003) | 23,499 | 5,942 | 5,648 |
| OntoNotes NER (Weischedel et al., 2013) | 81,828 | 11,066 | 11,257 |
| Penn Treebank (Marcus et al., 1993) | 39,832 | 1,700 | 2,416 |
| Stanford Sentiment Treebank (Socher et al., 2013) | 8,544 | 1,101 | 2,210 |

1 https://www.clips.uantwerpen.be/conll2000/chunking/

2 A different objective could consider predicting the next chunks, along with the next word. However, this chunker would have access to strictly less information than usual, since the entire sentence would no longer be available.

3 In contrast, in §2, the shallow syntactic encoder itself, as well as the quality of predicted chunks on the large pretraining corpus, could affect downstream performance.

References

  • Steven P. Abney. 1991. Parsing by chunks. In Principle-Based Parsing, pages 257-278. Springer.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, pages 1137-1155.
  • Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st edition. Prentice Hall PTR, Upper Saddle River, NJ, USA.
  • Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741-2749. AAAI Press.
  • Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proc. of ACL.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL-HLT.
  • Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proc. of NAACL-HLT.
  • Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proc. of NeurIPS, pages 6294-6305.
  • Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proc. of ACL.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proc. of NAACL-HLT.
  • Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proc. of EMNLP, pages 1499-1509.
  • Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In Proc. of COLING.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. In Proc. of ACL.
  • Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives.
  • Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel R. Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proc. of LREC, pages 2897-2904.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP.
  • Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proc. of ACL.
  • Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proc. of EMNLP.
  • Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proc. of ICLR.
  • Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of CoNLL.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of NAACL.
  • Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proc. of ACL.
  • Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of NAACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS, pages 5998-6008.
  • Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
  • Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2011. OntoNotes release 4.0 LDC2011T03. Linguistic Data Consortium, Philadelphia, PA.
  • Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proc. of ACL.
  • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. ArXiv:1312.3005.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805.
  • Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proc. of NAACL-HLT.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. ArXiv:1803.07640.
  • Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3).
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proc. of ACL.