
Structural Scaffolds for Citation Intent Classification in Scientific Publications

Authors

Abstract

Identifying the intent of a citation in scientific papers (e.g., background information, use of methods, comparing results) is critical for machine reading of individual publications and automated analysis of the scientific literature. We propose structural scaffolds, a multitask model to incorporate structural information of scientific papers into citations for effective classification of citation intents. Our model achieves a new state-of-the-art on an existing ACL anthology dataset (ACL-ARC) with a 13.3% absolute increase in F1 score, without relying on external linguistic resources or hand-engineered features as done in existing methods. In addition, we introduce a new dataset of citation intents (SciCite) which is more than five times larger than existing datasets and covers multiple scientific domains. Our code and data are available at: https://github.com/allenai/scicite.

1 Introduction

Citations play a unique role in scientific discourse and are crucial for understanding and analyzing scientific work (Luukkonen, 1992; Leydesdorff, 1998). They are also typically used as the main measure for assessing the impact of scientific publications, venues, and researchers (Li and Ho, 2008). The nature of citations can differ: some citations indicate direct use of a method, while others merely acknowledge prior work. Therefore, identifying the intent of citations (Figure 1) is critical in improving automated analysis of academic literature and scientific impact measurement (Leydesdorff, 1998; Small, 2018). Other applications of citation intent classification are enhanced research experience (Moravcsik and Murugesan, 1975), information retrieval (Ritchie, 2009), summarization (Cohan and Goharian, 2015), and studying the evolution of scientific fields (Jurgens et al., 2018).

Figure 1: Example of citations with different intents (BACKGROUND and METHOD). One of the examples shown in the figure reads: "A previously described computerized force sensitive system was used to quantify gait cycle timing, specifically the swing time and the stride-to-stride variability of swing time (Bazner et al. 2000)."

In this work, we approach the problem of citation intent classification by modeling the language expressed in the citation context. A citation context includes text spans in a citing paper describing a referenced work and has been shown to be the primary signal for intent classification (Teufel et al., 2006; Abu-Jbara et al., 2013; Jurgens et al., 2018). Existing models for this problem are feature-based, modeling the citation context with respect to a set of predefined hand-engineered features (such as linguistic patterns or cue phrases) and ignoring other signals that could improve prediction.

In this paper we argue that better representations can be obtained directly from data, sidestepping problems associated with external features. To this end, we propose a neural multitask learning framework to incorporate knowledge from the structure of scientific papers into citations. In particular, we propose two auxiliary tasks as structural scaffolds to improve citation intent prediction: (1) predicting the section title in which the citation occurs and (2) predicting whether a sentence needs a citation.

Figure 2: Our proposed scaffold model for identifying citation intents. The main task is predicting the citation intent (top left) and two scaffolds are predicting the section title and predicting if a sentence needs a citation (citation worthiness).

Unlike the primary task of citation intent prediction, it is easy to collect large amounts of training data for the scaffold tasks since the labels naturally occur in the process of writing a paper, and thus there is no need for manual annotation. On two datasets, we show that the proposed neural scaffold model outperforms existing methods by large margins. Our contributions are: (i) we propose a neural scaffold framework for citation intent classification that incorporates knowledge from the structure of scientific papers into citations; (ii) we achieve a new state-of-the-art of 67.9% F1 on the ACL-ARC citations benchmark, an absolute 13.3% increase over the previous state-of-the-art (Jurgens et al., 2018); and (iii) we introduce SciCite, a new dataset of citation intents which is at least five times as large as existing datasets and covers a variety of scientific domains.

2 Model

We propose a neural multitask learning framework for classification of citation intents. In particular, we introduce and use two structural scaffolds, auxiliary tasks related to the structure of scientific papers. The auxiliary tasks may not be of interest by themselves but are used to inform the main task. Our model uses a large auxiliary dataset to incorporate this structural information available in scientific documents into the citation intents. The overview of our model is illustrated in Figure 2 .

Let C denote the citation and x denote the citation context relevant to C. We encode the tokens in the citation context of size $n$ as $\mathbf{x} = \{x_1, ..., x_n\}$, where $x_i \in \mathbb{R}^{d_1}$ is a word vector of size $d_1$ which concatenates non-contextualized word representations (GloVe; Pennington et al., 2014) and contextualized embeddings (ELMo; Peters et al., 2018), i.e.:

$$x_i = \left[ x_i^{\text{GloVe}}; x_i^{\text{ELMo}} \right]$$

We then use a bidirectional long short-term memory network (BiLSTM; Hochreiter and Schmidhuber, 1997) with hidden size $d_2$ to obtain a contextual representation of each token vector with respect to the entire sequence:

$$h_i = \left[ \overrightarrow{\mathrm{LSTM}}(x, i); \overleftarrow{\mathrm{LSTM}}(x, i) \right],$$

where $h \in \mathbb{R}^{(n, 2d_2)}$ and $\overrightarrow{\mathrm{LSTM}}(x, i)$ processes $x$ from left to right and returns the LSTM hidden state at position $i$ (and vice versa for the backward direction $\overleftarrow{\mathrm{LSTM}}$). We then use an attention mechanism to obtain a single vector representing the whole input sequence:

$$z = \sum_{i=1}^{n} \alpha_i h_i, \qquad \alpha_i = \mathrm{softmax}(\mathbf{w} \cdot h_i),$$

where $\mathbf{w}$ is a parameter vector serving as the query vector for dot-product attention. So far we have obtained the citation representation as a vector $z$. Next, we describe our two proposed structural scaffolds for citation intent prediction.
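To make the encoder concrete, the following is a minimal PyTorch sketch of the computation described above. It is not the authors' AllenNLP implementation; the module name `CitationEncoder` and the assumption that the GloVe and ELMo vectors are concatenated outside the module are ours.

```python
# Minimal sketch of the citation encoder: pre-concatenated GloVe+ELMo token
# vectors -> BiLSTM -> dot-product attention with a learned query vector w
# -> a single citation representation z.
import torch
import torch.nn as nn


class CitationEncoder(nn.Module):
    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.bilstm = nn.LSTM(d1, d2, batch_first=True, bidirectional=True)
        self.w = nn.Parameter(torch.randn(2 * d2))  # attention query vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d1) token vectors; h: (batch, n, 2*d2) BiLSTM states.
        h, _ = self.bilstm(x)
        alpha = torch.softmax(h @ self.w, dim=1)   # alpha_i = softmax(w . h_i)
        z = (alpha.unsqueeze(-1) * h).sum(dim=1)   # z = sum_i alpha_i h_i
        return z
```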

2.1 Structural Scaffolds

In scientific writing there is a connection between the structure of scientific papers and the intent of citations. To leverage this connection for more effective classification of citation intents, we propose a multitask framework with two structural scaffolds (auxiliary tasks) related to the structure of scientific documents. A key point for our proposed scaffolds is that they do not need any additional manual annotation as labels for these tasks occur naturally in scientific writing. The structural scaffolds in our model are the following:

Citation worthiness. The first scaffold task that we consider is the "citation worthiness" of a sentence, indicating whether a sentence needs a citation. The language expressed in citation sentences is likely distinctive from regular sentences in scientific writing, and such information could also be useful for better language modeling of citation contexts. To this end, using citation markers such as "[12]" or "Lee et al (2010)", we identify sentences in a paper that include citations; the negative samples are sentences without citation markers. The goal of the model for this task is to predict whether a particular sentence needs a citation.

Section title. The second scaffold task relates to predicting the section title in which a citation appears. Scientific documents follow a standard structure in which the authors typically first introduce the problem, describe the methodology, share results, discuss findings, and conclude the paper. The intent of a citation could be relevant to the section of the paper in which the citation appears. For example, method-related citations are more likely to appear in the methods section. Therefore, we use section title prediction as a scaffold for predicting citation intents. Note that this scaffold task is different from simply adding the section title as an additional feature in the input. We use the section titles from a larger set of data than the training data for the main task as a proxy to learn linguistic patterns that are helpful for citation intents. In particular, we leverage a large number of scientific papers for which the section information is known for each citation to automatically generate large amounts of training data for this scaffold task.

Multitask formulation. Multitask learning as defined by Caruana (1997) is an approach to inductive transfer learning that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It requires the model to have at least some sharable parameters between the tasks. In a general setting in our model, we have a main task $Task^{(1)}$ and $n-1$ auxiliary tasks $Task^{(i)}$. As shown in Figure 2, each scaffold task has its own task-specific parameters for effective classification, while the parameters for the lower layers of the network are shared across tasks. We use a Multi Layer Perceptron (MLP) for each task, followed by a softmax layer to obtain prediction probabilities. In particular, given the vector $z$, we pass it to $n$ MLPs and obtain $n$ output vectors $y^{(i)}$:

$$y^{(i)} = \mathrm{softmax}(\mathrm{MLP}^{(i)}(z))$$

We are only interested in the output $y^{(1)}$; the remaining outputs $(y^{(2)}, ..., y^{(n)})$ correspond to the scaffold tasks and are used only during training to inform the model of knowledge in the structure of scientific documents. For each task, we output the class with the highest probability in $y$. An alternative inference method is to sample from the output distribution.
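As a hedged illustration of the multitask heads (again a sketch rather than the released code), the shared representation $z$ is passed through one task-specific MLP and softmax per task; only the first head is used at inference time:

```python
import torch
import torch.nn as nn


class ScaffoldHeads(nn.Module):
    """One MLP + softmax head per task; index 0 is the main citation-intent
    task, the remaining heads are the scaffold tasks used only in training."""

    def __init__(self, z_dim: int, hidden_dim: int, classes_per_task: list):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(z_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, num_classes))
            for num_classes in classes_per_task
        ])

    def forward(self, z: torch.Tensor) -> list:
        # y^(i) = softmax(MLP^(i)(z)) for each task i.
        return [torch.softmax(mlp(z), dim=-1) for mlp in self.mlps]
```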

2.2 Training

Let $\mathcal{D}_1$ be the labeled dataset for the main task $Task^{(1)}$, and $\mathcal{D}_i$ denote the labeled dataset corresponding to scaffold task $Task^{(i)}$, where $i \in \{2, ..., n\}$. Similarly, let $\mathcal{L}_1$ and $\mathcal{L}_i$ be the main loss and the loss of auxiliary task $i$, respectively. The final loss of the model is:

$$\mathcal{L} = \sum_{(x,y) \in \mathcal{D}_1} \mathcal{L}_1(x, y) \; + \; \sum_{i=2}^{n} \lambda_i \sum_{(x,y) \in \mathcal{D}_i} \mathcal{L}_i(x, y), \qquad (1)$$

where $\lambda_i$ is a hyperparameter specifying the sensitivity of the parameters of the model to each specific task. Here we have two scaffold tasks and hence $n{=}3$. Each $\lambda_i$ can be tuned based on performance on the validation set (see §4 for details).

We train this model jointly across tasks in an end-to-end fashion. In each training epoch, we construct mini-batches with the same number of instances from each of the $n$ tasks. We compute the total loss for each mini-batch as described in Equation 1, where $\mathcal{L}_j{=}0$ for all instances of the other tasks $j \neq i$. We compute the gradient of the loss for each mini-batch and tune the model parameters using the AdaDelta optimizer (Zeiler, 2012) with a gradient clipping threshold of 5.0. We stop training the model when the development macro F1 score does not improve for five consecutive epochs.
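A minimal sketch of one training step under Equation 1 is shown below. Assumed details: cross-entropy task losses and a hypothetical `model(x, task=i)` interface that routes the input through the shared encoder and the i-th head; the mini-batch iterators and AllenNLP trainer wiring are omitted.

```python
import torch
import torch.nn as nn

lambdas = [1.0, 0.1, 0.05]        # main task weight and the scaffold weights
criterion = nn.CrossEntropyLoss()

def training_step(model, optimizer, task_batches):
    """task_batches[i] = (inputs, labels) mini-batch for task i. Instances of
    task i contribute only L_i (losses of the other tasks are zero for them),
    so the total loss is the weighted sum over the per-task batches."""
    optimizer.zero_grad()
    loss = sum(lambdas[i] * criterion(model(x, task=i), y)
               for i, (x, y) in enumerate(task_batches))
    loss.backward()
    # The paper reports a gradient clipping threshold of 5.0 (applied here as
    # a norm clip, which is an assumption) and the AdaDelta optimizer.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```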

3 Data

We compare our results on two datasets from different scientific domains. While there is a long history of studying citation intents, there are only a few existing publicly available datasets for the task of citation intent classification. We use the most recent and comprehensive one, the ACL-ARC citations dataset by Jurgens et al. (2018), as a benchmark dataset to compare the performance of our model to previous work. In addition, to address the limited scope and size of this dataset, we introduce SciCite, a new dataset of citation intents that covers multiple scientific domains and is more than five times larger than ACL-ARC. Below is a description of both datasets.

Table 1: The definition and examples of citation intent categories in our SciCite.

BACKGROUND
Definition: The citation states, mentions, or points to the background information giving more context about a problem, concept, approach, topic, or importance of the problem in the field.
Examples: "Recent evidence suggests that co-occurring alexithymia may explain deficits [12]." / "Locally high-temperature melting regions can act as permanent termination sites [6][7][8][9]." / "One line of work is focused on changing the objective function (Mao et al., 2016)."

METHOD
Definition: Making use of a method, tool, approach, or dataset.
Examples: "Fold differences were calculated by a mathematical model described in [4]." / "We use Orthogonal Initialization (Saxe et al., 2014)."

RESULTCOMPARISON
Definition: Comparison of the paper's results/findings with the results/findings of other work.
Examples: "Weighted measurements were superior to T2-weighted contrast imaging which was in accordance with former studies [25][26][27]." / "Similar results to our study were reported in the study of Lee et al (2010)."

3.1 ACL-ARC Citations Dataset

ACL-ARC is a dataset of citation intents released by Jurgens et al. (2018). The dataset is based on a sample of papers from the ACL Anthology Reference Corpus (Bird et al., 2008), includes 1,941 citation instances from 186 papers, and is annotated by domain experts in the NLP field. The data was split into three standard stratified sets of train, validation, and test, with 85% of the data used for training and the remaining 15% divided equally between validation and test. Each citation unit includes information about the immediate citation context and the surrounding context, as well as information about the citing and cited papers. The data includes six intent categories, outlined in Table 2.

Table 2: Characteristics of SciCite compared with ACL-ARC dataset by Jurgens et al. (2018)

3.2 SciCite Dataset

Most existing datasets contain citation categories that are too fine-grained. Some of these intent categories are very rare or not useful in meta-analysis of scientific publications. Since some of these fine-grained categories only cover a minimal percentage of all citations, it is difficult to use them to gain insights or draw conclusions about the impact of papers. Furthermore, these datasets are usually domain-specific and relatively small (fewer than 2,000 annotated citations).

To address these limitations, we introduce SciCite, a new dataset of citation intents that is significantly larger, more coarse-grained, and more general-domain compared with existing datasets. Through examination of citation intents, we found that many of the categories defined in previous work, such as motivation, extension, or future work, can be considered background information providing more context for the current research topic. More interesting intent categories are the direct use of a method or the comparison of results. Therefore, our dataset provides a concise annotation scheme that is useful for navigating research topics and machine reading of scientific papers. We consider three intent categories, outlined in Table 1: BACKGROUND, METHOD and RESULTCOMPARISON. Below we describe data collection and annotation details.

3.2.1 Data Collection And Annotation

Citation intents of extracted sentences were labeled through the crowdsourcing platform Figure Eight. We selected a sample of papers from the Semantic Scholar corpus, consisting of papers in the general computer science and medicine domains. Citation contexts were extracted using science-parse. The annotators were asked to identify the intent of a citation and were directed to select among three citation intent options: METHOD, RESULTCOMPARISON and BACKGROUND. The annotation interface also included a dummy option, OTHER, which helps improve the quality of annotations of the other categories. We later removed instances annotated with the OTHER option from our dataset (less than 1% of the annotated data), many of which were due to citation contexts that were incomplete or too short for the annotator to infer the citation intent.

We used 50 test questions annotated by a domain expert to ensure that crowdsource workers were following directions, and we disqualified annotators with accuracy less than 75%. Furthermore, crowdsource workers were required to remain on the annotation page (five annotations) for at least ten seconds before proceeding to the next page. Annotations were collected dynamically and were aggregated along with a confidence score describing the level of agreement between multiple crowdsource workers. The confidence score is the agreement on a single instance weighted by a trust score (the accuracy of the annotator on the initial 50 test questions).

To collect only high-quality annotations, instances with a confidence score of ≤0.7 were discarded. In addition, a subset of the dataset with 100 samples was re-annotated by a trained, expert annotator to check for quality, and the agreement rate with crowdsource workers was 86%. Citation contexts were annotated by 850 crowdsource workers who made a total of 29,926 annotations, each individually making between 4 and 240 annotations. Each sentence was annotated, on average, 3.74 times. This resulted in a total of 9,159 crowdsourced instances, which were divided into training and validation sets with 90% of the data used for the training set. In addition to the crowdsourced data, a separate test set of size 1,861 was annotated by a trained, expert annotator to ensure the high quality of the dataset.
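As an illustration of the filtering step, one plausible formulation of such a trust-weighted agreement score (an assumption on our part, not Figure Eight's documented algorithm) is the weighted share of votes for the winning label:

```python
from collections import defaultdict

def confidence(votes):
    """votes: list of (label, trust) pairs for one citation instance, where
    trust is the worker's accuracy on the 50 expert-annotated test questions."""
    weights = defaultdict(float)
    for label, trust in votes:
        weights[label] += trust
    return max(weights.values()) / sum(weights.values())

# Three workers with trust 0.9, 0.8, 0.75; the instance would be kept only if
# the score exceeds the 0.7 threshold (here it is ~0.69, so it is discarded).
print(confidence([("method", 0.9), ("method", 0.8), ("background", 0.75)]))
```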

3.3 Data For Scaffold Tasks

For the first scaffold (citation worthiness), we sample sentences from papers and consider the sentences with citations as positive instances. We also remove the citation markers from those sentences, such as numbered citations (e.g., [1]) or name-year combinations (e.g., Lee et al (2012)), so as not to make the task artificially easy by only detecting citation markers. For the second scaffold (citation section title), respective to each test dataset, we sample citations from the ACL-ARC corpus and the Semantic Scholar corpus and extract the citation context as well as the corresponding sections. We manually define regular expression patterns that map section headings to normalized section titles: "introduction", "related work", "method", "experiments", "conclusion". Section titles which did not map to any of the aforementioned titles were excluded from the dataset. Overall, the size of the data for the scaffold tasks on the ACL-ARC dataset is about 47K (section title scaffold) and 50K (citation worthiness), while on SciCite it is about 91K and 73K for the section title and citation worthiness scaffolds, respectively.
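These two preprocessing steps can be sketched with simple regular expressions; the patterns below are illustrative assumptions of ours, not the exact expressions used to build the scaffold datasets.

```python
import re

# Citation markers: numbered citations like "[1]" or "[6][7]", and name-year
# combinations like "Lee et al (2012)" or "(Bazner et al. 2000)".
CITATION_MARKER = re.compile(
    r"(?:\[\d+\])+"
    r"|\(?[A-Z][\w-]+(?: et al\.?)?,? \(?(?:19|20)\d{2}\)?\)?"
)

# Mapping from raw section headings to the five normalized section titles.
SECTION_PATTERNS = [
    (re.compile(r"intro", re.I), "introduction"),
    (re.compile(r"related|previous work|background", re.I), "related work"),
    (re.compile(r"method|approach|model", re.I), "method"),
    (re.compile(r"experiment|result|evaluation", re.I), "experiments"),
    (re.compile(r"conclusion|discussion", re.I), "conclusion"),
]

def strip_citation_markers(sentence: str) -> str:
    """Remove citation markers so citation worthiness cannot be predicted
    from the markers alone."""
    return CITATION_MARKER.sub("", sentence).strip()

def normalize_section_title(heading: str):
    """Return a normalized section title, or None to exclude the citation."""
    for pattern, normalized in SECTION_PATTERNS:
        if pattern.search(heading):
            return normalized
    return None
```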

4.1 Implementation

We implement our proposed scaffold framework using the AllenNLP library (Gardner et al., 2018). For word representations, we use 100-dimensional GloVe vectors (Pennington et al., 2014) trained on a corpus of 6B tokens from Wikipedia and Gigaword. For contextual representations, we use the ELMo vectors released by Peters et al. (2018), with an output dimension of 1,024, which were trained on a dataset of 5.5B tokens. We use a single-layer BiLSTM with a hidden dimension of 50 for each direction. For each of the scaffold tasks, we use a single-layer MLP with 20 hidden nodes, ReLU (Nair and Hinton, 2010) activation, and a dropout rate (Srivastava et al., 2014) of 0.2 between the hidden and input layers. The hyperparameters $\lambda_i$ are tuned for best performance on the validation set of the respective datasets using a 0.0 to 0.3 grid search. For example, the following hyperparameters are used for ACL-ARC. Citation worthiness scaffold: $\lambda_2{=}0.08$, $\lambda_3{=}0$; section title scaffold: $\lambda_3{=}0.09$, $\lambda_2{=}0$; both scaffolds: $\lambda_2{=}0.1$, $\lambda_3{=}0.05$. The batch size is 8 for the ACL-ARC dataset and 32 for the SciCite dataset (recall that SciCite is larger than ACL-ARC). We use Beaker for running the experiments. On the smaller dataset, our best model takes approximately 30 minutes per epoch to train (training time without ELMo is significantly faster). It is known that multiple runs of probabilistic deep learning models can have variance in overall scores (Reimers and Gurevych, 2017). We control this by setting random-number generator seeds; the reported overall results are the average of multiple runs with different random seeds. To facilitate reproducibility, we release our code, data, and trained models.
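For reference, the reported hyperparameters can be gathered in one place; the key names below are ours, not the AllenNLP configuration schema, and the λ values are the ACL-ARC settings quoted above.

```python
CONFIG = {
    "glove_dim": 100,               # GloVe trained on 6B Wikipedia/Gigaword tokens
    "elmo_dim": 1024,               # ELMo trained on 5.5B tokens
    "bilstm_hidden_per_direction": 50,
    "scaffold_mlp_hidden": 20,      # ReLU activation
    "dropout": 0.2,
    "optimizer": "adadelta",
    "grad_clip": 5.0,
    "batch_size": {"acl_arc": 8, "scicite": 32},
    "lambda_grid": (0.0, 0.3),      # grid-search range for each lambda
    "acl_arc_lambdas": {
        "citation_worthiness_scaffold": {"lambda_2": 0.08, "lambda_3": 0.0},
        "section_title_scaffold": {"lambda_2": 0.0, "lambda_3": 0.09},
        "both_scaffolds": {"lambda_2": 0.1, "lambda_3": 0.05},
    },
    "early_stopping_patience_epochs": 5,
}
```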

4.2 Baselines

We compare our results to several baselines including the model with state-of-the-art performance on the ACL-ARC dataset.

• BiLSTM Attention (with and without ELMo).

This baseline uses a similar architecture to our proposed neural multitask learning framework, except that it only optimizes the network for the main loss regarding the citation intent classification (L 1 ) and does not include the structural scaffolds. We experiment with two variants of this model: with and without using the contextualized word vector representations (ELMo) of Peters et al. (2018) . This baseline is useful for evaluating the effect of adding scaffolds in controlled experiments.

• Jurgens et al. (2018).

To make sure our results are competitive with state-of-the-art results on this task, we also compare our model to Jurgens et al. (2018), which has the best reported results on the ACL-ARC dataset. We do not use leave-one-out cross validation in our experiments since it is impractical to re-train each variant of our deep learning models thousands of times. Therefore, we opted for a standard setup of stratified train/validation/test data splits, with 85% of the data used for training and the rest split equally between validation and test.

Table 3: Results on the ACL-ARC citations dataset (macro F1).

Baselines
  Jurgens et al. (2018): 54.6

This work
  BiLSTM-Attn + section title scaffold: 56.9
  BiLSTM-Attn + citation worthiness scaffold: 56.3
  BiLSTM-Attn + both scaffolds: 63.1
  BiLSTM-Attn w/ ELMo + both scaffolds: 67.9

4.3 Results

Our main results for the ACL-ARC dataset (Jurgens et al., 2018) are shown in Table 3. We observe that our scaffold-enhanced models achieve clear improvements over the state-of-the-art approach on this task. Starting with the 'BiLSTM-Attn' baseline with a macro F1 score of 51.8, adding the first scaffold task in 'BiLSTM-Attn + section title scaffold' improves the F1 score to 56.9 (∆=5.1). Adding the second scaffold in 'BiLSTM-Attn + citation worthiness scaffold' results in similar improvements: 56.3 (∆=4.5). When both scaffolds are used simultaneously in 'BiLSTM-Attn + both scaffolds', the F1 score further improves to 63.1 (∆=11.3), suggesting that the two tasks provide complementary signals that are useful for citation intent prediction. The best result is achieved when we also add ELMo vectors (Peters et al., 2018) to the input representations in 'BiLSTM-Attn w/ ELMo + both scaffolds', achieving an F1 of 67.9, a major improvement over the previous state-of-the-art result of 54.6 by Jurgens et al. (2018) (∆=13.3). We note that the scaffold tasks provide major contributions on top of the ELMo-enabled baseline (∆=13.6), demonstrating the efficacy of using structural scaffolds for citation intent prediction. We note that these results were obtained without using hand-curated features or additional linguistic resources as used in Jurgens et al. (2018). We also experimented with adding the features used in Jurgens et al. (2018) to our best model; not only did we not see any improvements, but we observed at least a 1.7% decline in performance. This suggests that these additional manual features do not provide the model with any useful signals beyond what the model already learns from the data.

Table 4: Results on the SciCite dataset (macro F1).

Baselines
  Jurgens et al. (2018): 79.6

This work
  BiLSTM-Attn + section title scaffold: 77.8
  BiLSTM-Attn + citation worthiness scaffold: 78.1
  BiLSTM-Attn + both scaffolds: 79.1
  BiLSTM-Attn w/ ELMo + both scaffolds: 84.0

Table 4 shows the main results on the SciCite dataset, where we see similar patterns. Each scaffold task improves model performance, adding both scaffolds results in further improvements, and the best results are obtained by using the ELMo representation in addition to both scaffolds. Note that this dataset is more than five times larger than ACL-ARC; therefore the performance numbers are generally higher and the F1 gains are generally smaller, since it is easier for the models to learn optimal parameters utilizing the larger annotated data. On this dataset, the best baseline is the neural baseline with the addition of ELMo contextual vectors, achieving an F1 score of 82.6, followed by Jurgens et al. (2018). This is expected because neural models generally achieve higher gains when more training data is available and because Jurgens et al. (2018) was not designed with the SciCite dataset in mind.

The breakdown of results by intent on the ACL-ARC and SciCite datasets is shown in Tables 5 and 6, respectively. Generally, we observe that results on categories with more instances are higher. For example, on ACL-ARC, the results on the BACKGROUND category are the highest, as this category is the most common. Conversely, the results on the FUTUREWORK category are the lowest. This category has the fewest data points (see the distribution of the categories in Table 2), and thus it is harder for the model to learn the optimal parameters for correct classification in this category.

Table 5: Detailed per-category classification results on the ACL-ARC dataset.

Table 6: Detailed per-category classification results on the SciCite dataset.

4.4 Analysis

To gain more insight into why the scaffolds help the model improve citation intent classification, we examine the attention weights assigned to inputs for our best proposed model ('BiLSTM-Attn w/ ELMo + both scaffolds') compared with the best neural baseline ('BiLSTM-Attn w/ ELMo'). We conduct this analysis for examples from both datasets. Figure 3 shows example input citations along with heatmaps of the attention weights for these inputs resulting from our model versus the baseline. For the first example (3a) the true label is FUTUREWORK. We observe that our model puts more weight on words surrounding the word "future", which is plausible given the true label. On the other hand, the baseline model attends most to the word "compare" and consequently incorrectly predicts a COMPARE label. In the second example (3b) the true label is RESULTCOMPARISON. The baseline incorrectly classifies it as BACKGROUND, likely due to attending to another part of the sentence ("analyzed separately"). Our model correctly classifies this instance by putting more attention weight on words that relate to the comparison of results. This suggests that our model is more successful than the baseline in learning optimal parameters for representing the citation text and classifying its respective intent. Note that the only difference between our model and the neural baseline is the inclusion of the structural scaffolds, which suggests the effectiveness of the scaffolds in informing the main task of relevant signals for citation intent classification.

Figure 3: Visualization of attention weights corresponding to our best scaffold model compared with the best neural baseline model without scaffolds. The citation context in example (3a) reads: "A possible future direction would be to compare the query string to retrieved results using a method similar to that of Tsuruoka and Tsujii (2003)."

Error analysis. We next investigate errors made by our best model (Figure 4 plots the classification errors). One general error pattern is that the model is more prone to making false positive errors in the BACKGROUND category, likely because this category dominates both datasets. Interestingly, for the ACL-ARC dataset some prediction errors are due to the model failing to properly differentiate the USE category from BACKGROUND. We found that some of these errors could possibly have been prevented by using additional context. Table 7 shows a sample of such classification errors.

Figure 4: Confusion matrix showing classification errors of our best model on the two datasets. The diagonal is masked to bring focus only on errors.

Table 7: A sample of the model's classification errors on the ACL-ARC dataset (example, true label, predicted label).

1. "Our work is inspired by the latent left-linking model in (CITATION) and the ILP formulation from (CITATION)." (True: Motivation; Predicted: Use)
2. "ASARES is presented in detail in (CITATION)." (True: Use; Predicted: Background)
3. "The advantage of tuning similarity to the application of interest has been shown previously by (CITATION)." (True: Compare; Predicted: Background)
4. "One possible direction is to consider linguistically motivated approaches, such as the extraction of syntactic phrase tables as proposed by (CITATION)." (True: FutureWork; Predicted: Background)
5. "After the extraction, pruning techniques (CITATION) can be applied to increase the precision of the extraction."

For the citation in the first row of the table, the model is likely distracted by "model in (CITATION)" and "ILP formulation from (CITATION)", deeming that the sentence refers to the use of another method from a cited paper, and it misses the first part of the sentence describing the motivation. This is likely due to the small number of training instances in the MOTIVATION category, preventing the model from learning such nuances. For the examples in the second and third rows, it is not clear whether the correct prediction can be made without additional context, and similarly the instance in the last row seems ambiguous without access to additional context. As shown in Figure 4a, two of the FUTUREWORK labels are wrongly classified. One of them is illustrated in the fourth row of Table 7, where perhaps additional context could have helped the model identify the correct label. One possible way to prevent this type of error is to provide the model with an additional input modeling the extended surrounding context. We experimented with encoding the extended surrounding context using a BiLSTM and concatenating it with the main citation context vector (z), but this resulted in a large decline in overall performance, likely due to the noise introduced by the additional context. A possible future direction is to investigate alternative, more effective approaches for incorporating the extended surrounding context.

5 Related Work

There is a large body of work studying the intent of citations and devising categorization systems (Stevens and Giuliano, 1965; Moravcsik and Murugesan, 1975; Garzone and Mercer, 2000; White, 2004; Ahmed et al., 2004; Teufel et al., 2006; Agarwal et al., 2010; Dong and Schäfer, 2011) . Most of these efforts provide citation categories that are too fine-grained, some of which rarely occur in papers. Therefore, they are hardly useful for automated analysis of scientific publications.

To address these problems and to unify previous efforts, in recent work Jurgens et al. (2018) proposed a six-category system for citation intents. In this work, we focus on two schemes: (1) the scheme proposed by Jurgens et al. (2018) and

(2) an additional, more coarse-grained, general-purpose category system that we propose (details in §3). Unlike other schemes that are domain-specific, our scheme is general and naturally fits scientific discourse in multiple domains. Early works in automated citation intent classification were based on rule-based systems (e.g., Garzone and Mercer, 2000; Pham and Hoffmann, 2003). Later, machine learning methods based on linguistic patterns and other hand-engineered features from the citation context were found to be effective. For example, Teufel et al. (2006) proposed the use of "cue phrases", a set of expressions that talk about the act of presenting research in a paper. Abu-Jbara et al. (2013) relied on lexical, structural, and syntactic features and a linear SVM for classification. Researchers have also investigated methods of finding cited spans in the cited papers. Examples include feature-based methods, domain-specific knowledge (Cohan and Goharian, 2017), and a recent CNN-based model for joint prediction of cited spans and citation function (Su et al., 2018). We also experimented with CNNs but found the attention BiLSTM model to work significantly better. Jurgens et al. (2018) expanded all pre-existing feature-based efforts on citation intent classification by proposing a comprehensive set of engineered features, including bootstrapped patterns, topic modeling, dependency-based, and metadata features for the task. We argue that we can capture the necessary information from the citation context using a data-driven method, without the need for hand-engineered, domain-dependent features or external resources. We propose a novel scaffold neural model for citation intent classification to incorporate structural information of scientific discourse into citations, borrowing the "scaffold" terminology from Swayamdipta et al. (2018), who use auxiliary syntactic tasks for semantic problems.

6 Conclusions And Future Work

In this work, we show that structural properties related to scientific discourse can be effectively used to inform citation intent classification. We propose a multitask learning framework with two auxiliary tasks (predicting section titles and citation worthiness) as two scaffolds related to the main task of citation intent prediction. Our model achieves a state-of-the-art result (F1 score of 67.9%) on the ACL-ARC dataset, a 13.3% absolute increase over the best previous results. We additionally introduce SciCite, a new large dataset of citation intents, and also show the effectiveness of our model on this dataset. Unlike existing datasets that are designed for a specific domain, our dataset is more general and fits scientific discourse from multiple scientific domains.

We demonstrate that carefully chosen auxiliary tasks that are inherently relevant to a main task can be leveraged to improve performance on the main task. An interesting line of future work is to explore the design of such tasks or to explore the properties of, and similarities between, the auxiliary and main tasks. Another relevant line of work is adapting our model to other domains containing documents with a similar linked structure, such as Wikipedia articles. Future work may benefit from replacing ELMo with other types of contextualized representations such as BERT in our scaffold model. For example, at the time of finalizing the camera-ready version of this paper, Beltagy et al. (2019) showed that a BERT contextualized representation model (Devlin et al., 2018) trained on scientific text can achieve promising results on the SciCite dataset.

Footnotes

1. We borrow the scaffold terminology from Swayamdipta et al. (2018) in the context of multitask learning.
2. In our experiments BiGRUs resulted in similar performance.
3. We also experimented with BiLSTMs without attention; we found that BiLSTMs/BiGRUs along with attention provided the best results. Other types of attention, such as additive attention, result in similar performance.
4. We note that this task may also be useful for helping authors improve their paper drafts. However, this is not the focus of this work.
5. We also experimented with adding section titles as an additional feature to the input; however, it did not result in any improvements.
6. https://www.figure-eight.com/platform/
7. https://semanticscholar.org/
8. https://github.com/allenai/science-parse
9. https://semanticscholar.org/
10. https://allennlp.org/elmo
11. Experiments with other types of RNNs, such as BiGRUs, and with more layers showed similar or slightly worse performance.