Multi-class Hierarchical Question Classification for Multiple Choice Science Exams
Authors
Abstract
Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed classification labels from a fine-grained hierarchical taxonomy of 406 problem domains. We then show that a BERT-based model trained on this dataset achieves a large (+0.12 MAP) gain compared with previous methods, while also achieving state-of-the-art performance on benchmark open-domain and biomedical QC datasets. Finally, we show that using this model’s predictions of question topic significantly improves the accuracy of a question answering system by +1.7% P@1, with substantial future gains possible as QC performance improves.
1. Introduction
Understanding what a question is asking is one of the first steps that humans use to work towards an answer. In the context of question answering, question classification allows automated systems to intelligently target their inference systems to domain-specific solvers capable of addressing specific kinds of questions and problem solving methods with high confidence and answer accuracy (Hovy et al., 2001; Moldovan et al., 2003).
To date, question classification has primarily been studied in the context of open-domain TREC questions (Voorhees and Tice, 2000), with smaller recent datasets available in the biomedical (Roberts et al., 2014; Wasim et al., 2019) and education (Godea and Nielsen, 2018) domains. The open-domain TREC question corpus is a set of 5,952 short factoid questions paired with a taxonomy developed by Li and Roth (2002) that includes 6 coarse answer types (such as entities, locations, and numbers), and 50 fine-grained types (e.g. specific kinds of entities, such as animals or vehicles). While a wide variety of syntactic, semantic, and other features and classification methods have been applied to this task, culminating in near-perfect classification performance (Madabushi and Lee, 2016), recent work has demonstrated that QC methods developed on TREC questions generally fail to transfer to datasets with more complex questions such as those in the biomedical domain (Roberts et al., 2014), likely due in part to the simplicity and syntactic regularity of the TREC questions, and the ability of simpler term-frequency models to achieve near-ceiling performance (Xia et al., 2018).
In this work we explore question classification in the context of multiple choice science exams. Standardized science exams have been proposed as a challenge task for question answering (Clark, 2015), as most questions contain a variety of challenging inference problems (Clark et al., 2013; Jansen et al., 2016), require detailed scientific and common-sense knowledge to answer and explain the reasoning behind those answers (Jansen et al., 2018), and questions are often embedded in complex examples or other distractors. Question classification taxonomies and annotation are difficult and expensive to generate, and because of the unavailability of this data, to date most models for science questions use one or a small number of generic solvers that perform little or no question decomposition (e.g. Khot et al., 2015; Clark et al., 2016; Khashabi et al., 2016; Khot et al., 2017; Jansen et al., 2017). Our long-term interest is in developing methods that intelligently target their inferences to generate both correct answers and compelling human-readable explanations for the reasoning behind those answers. The lack of targeted solving (using the same methods for inferring answers to spatial questions about planetary motion, chemical questions about photosynthesis, and electrical questions about circuit continuity) is a substantial barrier to increasing performance (see Figure 1).

Figure 1: [...] (QC label) can provide an important contextual signal to guide a QA system to the correct answer (A'). Here, knowing the problem domain of Gravitational Pull allows the model to recognize that some properties (such as weight) change when objects move between celestial bodies, while others (including density) are unaffected by such a change.
arXiv:1908.05441v1 [cs.CL] 15 Aug 2019

To address this need for developing methods of targeted inference, this work makes the following contributions:
1. We provide a large challenge dataset of question classification labels for 7,787 standardized science exam questions labeled using a hierarchical taxonomy of 406 detailed problem types across 6 levels of granularity. To the best of our knowledge this is the most detailed question classification dataset constructed by nearly an order of magnitude, while also being 30% larger than TREC, and nearly three times the size of the largest biomedical dataset.
2. We empirically demonstrate large performance gains of +0.12 MAP (+13.5% P@1) on science exam question classification using a BERT-based model over five previous state-of-the-art methods, while improving performance on two biomedical question datasets by 4-5%. This is the first model to show consistent state-of-the-art performance across multiple question classification datasets.
3. We show predicted question labels significantly improve a strong QA model by +1.7% P@1, where ceiling performance with perfect classification can reach +10.0% P@1. We also show that the error distribution of question classification matters when coupled with multiple choice QA models, and that controlling for correlations between classification labels and incorrect answer candidates can increase performance.
2. Related Work
Question classification typically makes use of a combination of syntactic, semantic, surface, and embedding methods. Syntactic patterns (Li and Roth, 2006; Silva et al., 2011; Patrick and Li, 2012; Mishra et al., 2013) and syntactic dependencies (Roberts et al., 2014) have been shown to improve performance, while syntactically or semantically important words are often expanded using WordNet hypernyms or Unified Medical Language System categories (for the medical domain) to help mitigate sparsity (Huang et al., 2008; Yu and Cao, 2008; Van-Tu and Anh-Cuong, 2016). Keyword identification helps identify specific terms useful for classification (Liu et al., 2011; Roberts et al., 2014; Khashabi et al., 2017). Similarly, named entity recognizers (Li and Roth, 2002; Neves and Kraus, 2016) or lists of semantically related words (Li and Roth, 2002; Van-Tu and Anh-Cuong, 2016) can also be used to establish broad topics or entity categories and mitigate sparsity, as can word embeddings (Kim, 2014; Lei et al., 2018). Here, we empirically demonstrate that many of these existing methods do not transfer to the science domain. The highest performing question classification systems tend to make use of customized rule-based pattern matching (Lally et al., 2012; Madabushi and Lee, 2016), or a combination of rule-based and machine learning approaches (Silva et al., 2011), at the expense of increased model construction time. A recent emphasis on learned methods has shown that a large set of CNN (Lei et al., 2018) and LSTM (Xia et al., 2018) variants achieve similar accuracy on TREC question classification, with these models exhibiting at best small gains over simple term frequency models. These recent developments echo the observations of Roberts et al. (2014), who showed that existing methods beyond term frequency models failed to generalize to medical domain questions. Here we show that strong performance across multiple datasets is possible using a single learned model.
Due to the cost involved in their construction, question classification datasets and classification taxonomies tend to be small, which can create methodological challenges. Roberts et al. (2014) generated the next-largest dataset after TREC, containing 2,936 consumer health questions classified into 13 question categories. More recently, Wasim et al. (2019) generated a small corpus of 780 biomedical domain questions organized into 88 categories. In the education domain, Godea and Nielsen (2018) collected a set of 1,155 classroom questions and organized these into 16 categories. To enable a detailed study of science domain question classification, here we construct a large-scale challenge dataset that exceeds the size and classification specificity of other datasets, in many cases by nearly an order of magnitude.
3. Questions And Classification Taxonomy
Questions: We make use of the 7,787 science exam questions of the AI2 Reasoning Challenge (ARC) corpus (Clark et al., 2018), which contains standardized 3rd to 9th grade science questions from 12 US states from the past decade. Each question is a 4-choice multiple choice question. Summary statistics comparing the complexity of ARC and TREC questions are shown in Table 1.
Taxonomy: Starting with the syllabus for the NY Regents exam, we identified 9 coarse question categories (Astronomy, Earth Science, Energy, Forces, Life Science, Matter, Safety, Scientific Method, Other), then through a data-driven analysis of 3 exam study guides and the 3,370 training questions, expanded the taxonomy to include 462 fine-grained categories across 6 hierarchical levels of granularity. The taxonomy is designed to allow categorizing questions into broad curriculum topics at its coarsest level, while labels at full specificity separate questions into narrow problem domains suitable for targeted inference methods. Because of its size, a subset of the classification taxonomy is shown in Table 2.

Annotation: Because of the complexity of the questions, it is possible for one question to bridge multiple categories; for example, a wind power generation question may span both renewable energy and energy conversion. We allow up to 2 labels per question, and found that 16% of questions required multiple labels. Each question was independently annotated by two annotators, with the lead annotator a domain expert in standardized exams. Annotators first independently annotated the entire question set, then questions without complete agreement were discussed until resolution. Before resolution, interannotator agreement (Cohen's Kappa) was κ = 0.58 at the finest level of granularity, and κ = 0.85 when considering only the coarsest 9 categories. This is considered moderate to strong agreement (McHugh, 2012). Based on the results of our error analysis (see Section 4.3.), we estimate the overall accuracy of the question classification labels after resolution to be approximately 96%. While the full taxonomy contains 462 fine-grained categories derived from standardized questions, study guides, and exam syllabi, we observed that only 406 of these categories are tested in the ARC question set.
4.1. Question Classification On Science Exams
We identified 5 common models in previous work primarily intended for learned classifiers rather than hand-crafted rules. We adapt these models to a multi-label hierarchical classification task by training a series of one-vs-all binary classifiers (Tsoumakas and Katakis, 2007), one for each label in the taxonomy. With the exception of the CNN and BERT models, following previous work (e.g. Silva et al., 2011; Roberts et al., 2014; Xia et al., 2018) we make use of an SVM classifier using the LIBSVM framework (Chang and Lin, 2011) with a linear kernel. Models are trained and evaluated from coarse to fine levels of taxonomic specificity. At each level of taxonomic evaluation, a set of non-overlapping confidence scores for each binary classifier is generated and sorted to produce a list of ranked label predictions. We evaluate these ranks using Mean Average Precision (MAP). ARC questions are evaluated using the standard 3,370 questions for training, 869 for development, and 3,548 for testing.
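The ranking-based evaluation described above can be illustrated with a minimal sketch. It assumes each question comes with a dictionary of per-label confidence scores (one score per one-vs-all classifier) and a set of gold labels; the function names and the toy scores are illustrative and are not taken from the paper's code.

```python
from typing import Dict, List, Set


def rank_labels(scores: Dict[str, float]) -> List[str]:
    """Sort labels from highest to lowest classifier confidence."""
    return sorted(scores, key=lambda label: scores[label], reverse=True)


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Average precision of one ranked label list against its gold labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(all_scores: List[Dict[str, float]],
                           all_gold: List[Set[str]]) -> float:
    """Mean of per-question average precision."""
    aps = [average_precision(rank_labels(s), g)
           for s, g in zip(all_scores, all_gold)]
    return sum(aps) / len(aps)


def precision_at_1(all_scores: List[Dict[str, float]],
                   all_gold: List[Set[str]]) -> float:
    """Proportion of questions whose top-ranked label is a gold label."""
    correct = sum(rank_labels(s)[0] in g
                  for s, g in zip(all_scores, all_gold))
    return correct / len(all_scores)


# Toy example: two questions, the second with two gold labels.
toy_scores = [
    {"Matter": 0.9, "Energy": 0.4, "Forces": 0.1},
    {"Matter": 0.2, "Energy": 0.7, "Forces": 0.5},
]
toy_gold = [{"Matter"}, {"Forces", "Matter"}]
print(mean_average_precision(toy_scores, toy_gold))
print(precision_at_1(toy_scores, toy_gold))
```

Because questions may carry up to two gold labels, average precision rewards placing both labels near the top of the ranking, whereas P@1 only checks the single highest-ranked prediction.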
N-grams, POS, Hierarchical Features: A baseline bag-of-words model incorporating both tagged and untagged unigrams and bigrams. We also implement the hierarchical classification feature of Li and Roth (2002), where for a given question, the output of the classifier at coarser levels of granularity serves as input to the classifier at the current level of granularity.
Dependencies: Bigrams of Stanford dependencies (De Marneffe and ...). For each word, we create one unlabeled bigram for each outgoing link from that word to its dependency (Patrick and Li, 2012; Roberts et al., 2014).
Question Expansion with Hypernyms: We perform hypernym expansion (Huang et al., 2008; Silva et al., 2011; Roberts et al., 2014) [...] its direct outgoing links. WordNet sense is identified using Lesk word-sense disambiguation (Lesk, 1986), using question text for context. We implement the heuristic of Van-Tu and Anh-Cuong (2016), where more distant hypernyms receive less weight.
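The distance-weighting heuristic can be sketched as follows. This is only an illustration: the toy hypernym chains stand in for WordNet, the exponential decay of 0.5 per step is an assumed weighting scheme (the text states only that more distant hypernyms receive less weight), and the choice of which question words to expand is left to the caller.

```python
from typing import Dict, List

# Toy hypernym chains standing in for WordNet (hypothetical data).
# Each list is ordered from the nearest to the most distant hypernym.
TOY_HYPERNYMS: Dict[str, List[str]] = {
    "granite": ["rock", "material", "substance"],
    "oak": ["tree", "plant", "organism"],
}


def expand_with_hypernyms(words: List[str],
                          hypernyms: Dict[str, List[str]],
                          decay: float = 0.5) -> Dict[str, float]:
    """Return weighted features for words and their hypernyms.

    Original words get weight 1.0; a hypernym at distance d gets
    weight decay ** d (an assumed scheme, not the published one).
    """
    features: Dict[str, float] = {}
    for word in words:
        features[word] = 1.0
        for distance, hyper in enumerate(hypernyms.get(word, []), start=1):
            key = "HYPER_" + hyper
            weight = decay ** distance
            # Keep the largest weight if a hypernym is reached twice.
            features[key] = max(features.get(key, 0.0), weight)
    return features


print(expand_with_hypernyms(["granite"], TOY_HYPERNYMS))
```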
Essential Terms: Though not previously reported for QC, we make use of unigrams of keywords extracted using the Science Exam Essential Term Extractor of Khashabi et al. (2017) . For each keyword, we create one binary unigram feature.
CNN: Kim (2014) demonstrated near state-of-the-art performance on a number of sentence classification tasks (including TREC question classification) by using pre-trained word embeddings (Mikolov et al., 2013) as feature extractors in a CNN model. Lei et al. (2018) showed that 10 CNN variants perform within ±2% of Kim's (2014) model on TREC QC. We report performance of our best CNN model based on the MP-CNN architecture of Rao et al. (2016), which works to establish the similarity between question text and the definition text of the question classes. We adapt the MP-CNN model, which uses a "Siamese" structure (He et al., 2015), to create separate representations for both the question and the question class. The model then makes use of a triplet ranking loss function to minimize the distance between the representations of questions and the correct class while simultaneously maximizing the distance between questions and incorrect classes. We optimize the network using the method of Tu (2018).
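The ranking objective described above can be sketched in a few lines. This is a generic margin-based ranking loss over a (question, correct class, incorrect class) triple; the use of cosine distance and the margin of 0.5 are assumptions for illustration, not settings reported in the paper.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ranking_loss(question: Sequence[float],
                 correct_class: Sequence[float],
                 incorrect_class: Sequence[float],
                 margin: float = 0.5) -> float:
    """Hinge loss that is zero once the correct class is closer to the
    question than the incorrect class by at least `margin`."""
    pos = cosine_distance(question, correct_class)
    neg = cosine_distance(question, incorrect_class)
    return max(0.0, margin + pos - neg)


# Toy 2-d representations (hypothetical values).
q = [1.0, 0.0]
print(ranking_loss(q, [1.0, 0.0], [0.0, 1.0]))  # correct class is closer
print(ranking_loss(q, [0.0, 1.0], [1.0, 0.0]))  # incorrect class is closer
```

Minimizing this quantity pulls question representations towards their correct class definition and pushes them away from incorrect ones, which is the behaviour the paragraph above describes.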
BERT-QC (This Work): We make use of BERT (Devlin et al., 2018), a language model using bidirectional encoder representations from transformers, in a sentence-classification configuration. As the original settings of BERT do not support multi-label classification scenarios, and training a series of 406 binary classifiers would be computationally expensive, we use the duplication method of Tsoumakas and Katakis (2007) where we enumerate multi-label questions as multiple single-label instances during training by duplicating question text, and assigning each instance one of the multiple labels. Evaluation follows the standard procedure where we generate a list of ranked class predictions based on class probabilities, and use this to calculate Mean Average Precision (MAP) and Precision@1 (P@1). As shown in Table 3, this BERT-QC model achieves our best question classification performance, significantly exceeding baseline performance on ARC by 0.12 MAP and 13.5% P@1.
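The duplication step is simple enough to show directly. The sketch below assumes training data is held as (question text, list of labels) pairs; the data structure, function name, and the wording of the toy questions are illustrative only.

```python
from typing import List, Tuple


def duplicate_multilabel(
        examples: List[Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    """Turn each multi-label question into one single-label training
    instance per label, repeating the question text."""
    return [(text, label) for text, labels in examples for label in labels]


train = [
    ("How does a wind turbine produce electricity?",
     ["Renewable Energy", "Energy Conversion"]),
    ("What happens to water molecules during the boiling process?",
     ["Boiling"]),
]
for text, label in duplicate_multilabel(train):
    print(label, "|", text)
```

At evaluation time no duplication is needed: the classifier produces one probability per class, and the ranked list is scored against all gold labels for the question.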
4.2. Comparison With Benchmark Datasets
Apart from term frequency methods, question classification methods developed on one dataset generally do not exhibit strong transfer performance to other datasets (Roberts et al., 2014). While BERT-QC achieves large gains over existing methods on the ARC dataset, here we demonstrate that BERT-QC also matches state-of-the-art performance on TREC (Li and Roth, 2002), while surpassing state-of-the-art performance on the GARD corpus of consumer health questions (Roberts et al., 2014) and MLBioMedLAT corpus of biomedical questions (Wasim et al., 2019). As such, BERT-QC is the first model to achieve strong performance across multiple question classification datasets.
4.2.1. TREC Question Classification
TREC-6 includes 6 coarse question classes (abbreviation, entity, description, human, location, numeric), while TREC-50 expands these into 50 more fine-grained types. TREC question classification methods can be divided into those that learn the question classification task, and those that make use of either hand-crafted or semi-automated syntactic or semantic extraction rules to infer question classes. To date, the best reported accuracy for learned methods is 98.0% by Xia et al. (2018) for TREC-6, and 91.6% by Van-Tu and Anh-Cuong (2016) for TREC-50. Madabushi and Lee (2016) achieve the highest to-date performance on TREC-50 at 97.2%, using rules that leverage the strong syntactic regularities in the short TREC factoid questions.
We compare the performance of BERT-QC with recently reported performance on this dataset in Table 4. BERT-QC achieves state-of-the-art performance on fine-grained classification (TREC-50) for a learned model at 92.0% accuracy, and near state-of-the-art performance on coarse classification (TREC-6) at 96.2% accuracy.
4.2.2. Medical Question Classification
Because of the challenges with collecting biomedical questions, the datasets and classification taxonomies tend to be small, and rule-based methods often achieve strong results (e.g. Sarrouti et al., 2015). Roberts et al. (2014) created the largest biomedical question classification dataset to date, annotating 2,937 consumer health questions drawn from the Genetic and Rare Diseases (GARD) question database with 13 question types, such as anatomy, disease cause, diagnosis, disease management, and prognoses. Roberts et al. (2014) found these questions largely resistant to learning-based methods developed for TREC questions. Their best model (CPT2), shown in Table 5, makes use of stemming and lists of semantically related words and cue phrases to achieve 80.4% accuracy. BERT-QC reaches 84.9% accuracy on this dataset, an increase of +4.5% over the best previous model. We also compare performance on the recently released MLBioMedLAT dataset (Wasim et al., 2019), a multi-label biomedical question classification dataset with 780 questions labeled using 88 classification types drawn from 133 Unified Medical Language System (UMLS) categories. Table 6 shows BERT-QC exceeds their best model, focus-driven semantic features (FDSF), by +0.05 Micro-F1 and +3% accuracy.
4.3. Error Analysis
We performed an error analysis on 50 ARC questions where the BERT-QC system did not predict the correct label, with a summary of major error categories listed in Table 7 .
Associative Errors: In 35% of cases, predicted labels were nearly correct, differing from the correct label only by the finest-grained (leaf) element of the hierarchical label (for example, predicting Matter → Changes of State → Boiling instead of Matter → Changes of State → Freezing). The bulk of the remaining errors were due to questions containing words highly correlated with a different class, or classes themselves being highly correlated. For example, a specific question about Weather Models discusses "environments" changing over "millions of years", where discussions of environments and long time periods tend to be associated with questions about Locations of Fossils. Similarly, a question containing the word "evaporation" could be primarily focused on either Changes of State or the Water Cycle (cloud generation), and a model must rely on knowledge from the entire question text to determine the correct problem domain. We believe these associative errors are addressable technical challenges that could ultimately lead to increased performance in subsequent models.
Errors specific to the multiple-choice domain: We observed that using both question and all multiple choice answer text produced large gains in question classification performance; for example, BERT-QC performance increases from 0.516 (question only) to 0.654 (question and all four answer candidates), an increase of 0.138 MAP. Our error analysis observed that while this substantially increases QC performance, it changes the distribution of errors made by the system. Specifically, 25% of errors become highly correlated with an incorrect answer candidate, which (we show in Section 5.) can reduce the performance of QA solvers (see footnote 6).

Proportion    Error Type
46%           Question contains words correlated with incorrect class
35%           Predicted class is nearly correct, and distance 1 from gold class (different leaf node selected in taxonomy)
25%           Predicted class is highly correlated with an incorrect multiple choice answer
18%           Predicted class and gold class are on different aspects of similar topics/otherwise correlated
10%           Annotation: Gold label appears incorrect, predicted label is good.
8%            Annotation: Predicted label is good, but not in gold list.
8%            Correctly predicting the gold label may require knowing the correct answer to the question.

Table 7: BERT-QC Error Analysis: Classes of errors for 50 randomly selected questions from the development set where BERT-QC did not predict the correct label. These errors reflect the BERT-QC model trained and evaluated with terms from both the question and all multiple choice answer candidates. Questions can occupy more than one error category, and as such proportions do not sum to 100%.
5. Question Answering With QC Labels
Because errorful label predictions can correlate with incorrect answers, it is difficult to determine the ultimate benefit a QA model might receive from QC performance reported in isolation. Coupling QA and QC systems can often be laborious: either a large number of independent solvers targeted to specific question types must be constructed (e.g. Minsky, 1986), or an existing single model must be able to productively incorporate question classification information. Here we demonstrate the latter: that a BERT QA model is able to incorporate question classification information through query expansion. BERT (Devlin et al., 2018) recently demonstrated state-of-the-art performance on benchmark question answering datasets such as SQuAD (Rajpurkar et al., 2016), and near human-level performance on SWAG (Zellers et al., 2018). Similarly, Pan et al. (2019) demonstrated that BERT achieves the highest accuracy on the most challenging subset of ARC science questions. We make use of a BERT QA model using the same QA paradigm described by Pan et al. (2019), where QA is modeled as a next-sentence prediction task that predicts the likelihood of a given multiple choice answer candidate following the question text. We evaluate the question text and the text of each multiple choice answer candidate separately, where the answer candidate with the highest probability is selected as the predicted answer for a given question. Performance is evaluated using Precision@1 (Manning et al., 2008). Additional model details and hyperparameters are included in the Appendix.
We incorporate QC information into the QA process by implementing a variant of a query expansion model (Qiu and Frei, 1993). Specifically, for a given {question, QC label} pair, we expand the question text by concatenating the definition text of the question classification label to the start of the question. We use the top predicted question classification label for each question. Because QC labels are hierarchical, we include the label definition text for each level of the label L1...Ln. An example of this process is shown in Table 8.

Table 8:
Original Question Text: What happens to water molecules during the boiling process?
Expanded Text (QC Label): Matter Changes of State Boiling What happens to water molecules during the boiling process?

Figure 2 shows QA performance using predicted labels from the BERT-QC model, compared to a baseline model that does not contain question classification information. As predicted by the error analysis, while a model trained with question and answer candidate text performs better at QC than a model using question text alone, a large proportion of the incorrect predictions become associated with a negative answer candidate, reducing overall QA performance, and highlighting the importance of evaluating QC and QA models together. When using BERT-QC trained on question text alone, at the finest level of specificity (L6) where overall question classification accuracy is 57.8% P@1, question classification significantly improves QA performance by +1.7% P@1 (p < 0.01). Using gold labels shows ceiling QA performance can reach +10.0% P@1 over baseline, demonstrating that as question classification model performance improves, substantial future gains are possible. An analysis of expected gains for a given level of QC performance is included in the Appendix, showing approximately linear gains in QA performance above baseline for QC systems able to achieve over 40% classification accuracy. Below this level, the decreased performance from noise induced by incorrect labels surpasses gains from correct labels.

Figure 2: Question answering performance (the proportion of questions answered correctly) for models that include question classification labels using query expansion, compared to a no-label baseline model. While BERT-QC trained using question and answer text achieves higher QC performance, it leads to unstable QA performance due to its errors being highly correlated with incorrect answers. Predicted labels using BERT-QC (question text only) show a significant increase of +1.7% P@1 at L6 (p < 0.01). Models with gold labels show the ceiling performance of this approach with perfect question classification performance. Each point represents the average of 10 runs.

Footnote 6: When a model is trained using only question text (instead of both question and answer candidate text), the distribution of these highly-correlated errors changes to the following: 17% chose the correct label, 17% chose the same label, and 66% chose a different label not correlated with an incorrect answer candidate.
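The query expansion step used in this section can be sketched as a small helper. It assumes a hierarchical label is available as a list of level names and that definition text is looked up in a dictionary; the `definitions` mapping, the fallback to the level name, and the `depth` parameter are illustrative assumptions rather than details of the released code.

```python
from typing import Dict, List, Optional


def expand_question(question: str,
                    label_path: List[str],
                    definitions: Dict[str, str],
                    depth: Optional[int] = None) -> str:
    """Prepend the definition text for levels L1..Ln of a hierarchical
    QC label to the question text.

    If a level has no entry in `definitions`, its name is used as-is.
    `depth` optionally truncates the label to its first `depth` levels.
    """
    levels = label_path if depth is None else label_path[:depth]
    parts = [definitions.get(level, level) for level in levels]
    return " ".join(parts + [question])


question = "What happens to water molecules during the boiling process?"
label = ["Matter", "Changes of State", "Boiling"]
print(expand_question(question, label, definitions={}))
print(expand_question(question, label, definitions={}, depth=1))
```

Truncating the label with `depth` mirrors evaluating QA at coarser levels of the taxonomy, where only the first few levels of the predicted label are supplied to the QA model.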
5.1. Automating Error Analyses With QC
Detailed error analyses for question answering systems are typically labor intensive, often requiring hours or days to perform manually. As a result, error analyses are typically completed infrequently, in spite of their utility to key decisions in the algorithm or knowledge construction process.
Here we show having access to detailed question classification labels specifying fine-grained problem domains provides a mechanism to automatically generate error analyses in seconds instead of days.
To illustrate the utility of this approach, Table 9 shows the performance of the BERT QA+QC model broken down by specific question classes. This allows automatically identifying a given model's strengths; for example, here questions about Human Health, Material Properties, and Earth's Inner Core are well addressed by the BERT-QA model, and achieve well above the average QA performance of 49%. Similarly, areas of deficit include Changes of State, Reproduction, and Food Chain Processes questions, which see below-average QA performance. The lowest performing class, Safety Procedures, demonstrates that while this model has strong performance in many areas of scientific reasoning, it is worse than chance at answering questions about safety, and would be inappropriate to deploy for safety-critical tasks.
While this analysis is shown at an intermediate (L2) level of specificity for space, more detailed analyses are possible. For example, overall QA performance on Scientific Inference questions is near average (47%), but increasing granularity to L3 we observe that questions addressing Experiment Design or Making Inferences (challenging questions even for humans) perform poorly (33% and 20%) when answered by the QA system. This allows a system designer to intelligently target problem-specific knowledge resources and inference methods to address deficits in specific areas.
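A per-class breakdown of this kind amounts to grouping QA outcomes by the QC label truncated to the desired level. The sketch below assumes each evaluated question is recorded as a (label path, answered correctly) pair; the record format and the toy data are hypothetical and do not reproduce the figures in Table 9.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_class(results: List[Tuple[List[str], bool]],
                      depth: int) -> Dict[str, float]:
    """QA accuracy grouped by QC label truncated to `depth` levels."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for label_path, is_correct in results:
        key = " -> ".join(label_path[:depth])
        totals[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Toy QA outcomes (hypothetical, for illustration only).
toy_results = [
    (["Scientific Inference", "Experiment Design"], False),
    (["Scientific Inference", "Experiment Design"], True),
    (["Scientific Inference", "Making Inferences"], False),
    (["Safety", "Safety Procedures"], False),
    (["Life Science", "Human Health"], True),
]

# Report classes from weakest to strongest at two levels of granularity.
for depth in (1, 2):
    breakdown = accuracy_by_class(toy_results, depth)
    for name, acc in sorted(breakdown.items(), key=lambda kv: kv[1]):
        print(depth, name, round(acc, 2))
```

Raising `depth` reproduces the move from coarse to fine analysis described above, at the cost of fewer questions per class and therefore noisier estimates.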
6. Conclusion
Question classification can enable targeting question answering models, but is challenging to implement with high performance without using rule-based methods. In this work we generate the most fine-grained challenge dataset for question classification, using complex and syntactically diverse questions, and show gains of up to 12% are possible with our question classification model across datasets in open, science, and medical domains. This model is the first demonstration of a question classification model achieving state-of-the-art results across benchmark datasets in open, science, and medical domains. We further demonstrate that attending to question type can significantly improve question answering performance, with large gains possible as question classification performance improves. Our error analysis suggests that developing high-precision methods of question classification independent of their recall can offer the opportunity to incrementally make use of the benefits of question classification without suffering the consequences of classification errors on QA performance.
7. Resources
Our Appendix and supplementary material (available at http://www.cognitiveai.org/explanationbank/) includes data, code, experiment details, and negative results.
8. Acknowledgements
The authors wish to thank Elizabeth Wainwright and Stephen Marmorstein for piloting an earlier version of the question classification annotation. We thank the Allen Institute for Artificial Intelligence and National Science Foundation (NSF 1815948 to PJ) for funding this work.
9. Appendix
9.1. Annotation
Classification Taxonomy: The full classification taxonomy is included in separate files, both coupled with definitions, and as a graphical visualization.
Annotation Procedure: Primary annotation took place over approximately 8 weeks. Annotators were instructed to provide up to 2 labels from the full classification taxonomy (462 labels) that were appropriate for each question, and to provide the most specific label available in the taxonomy for a given question. Of the 462 labels in the classification taxonomy, the ARC questions had non-zero counts in 406 question types. Rarely, questions were encountered by annotators that did not clearly fit into a label at the end of the taxonomy, and in these cases the annotators were instructed to choose a more generic label higher up the taxonomy that was appropriate. This occurred when the production taxonomy failed to have specific categories for infrequent questions testing knowledge that is not a standard part of the science curriculum. For example, the question:
Which material is the best natural resource to use for making water-resistant shoes? (A) cotton (B) leather (C) plastic (D) wool
tests a student's knowledge of the water resistance of different materials. Because this is not a standard part of the curriculum, and wasn't identified as a common topic in the training questions, the annotators tagged this question as belonging to Matter → Properties of Materials, rather than a more specific category.
Questions from the training, development, and test sets were randomly shuffled to counterbalance any learning effects during the annotation procedure, but were presented in grade order (3rd to 9th grade) to reduce context switching (a given grade level tends to use a similar subset of the taxonomy; for example, 3rd grade questions generally do not address Chemical Equations or Newton's 1st Law of Motion).
Interannotator Agreement: To increase quality and consistency, each annotator annotated the entire dataset of 7,787 questions. Two annotators were used, with the lead annotator possessing previous professional domain expertise. Annotation proceeded in a two-stage process, where in stage 1 annotators completed their annotation independently, and in stage 2 each of the questions where the annotators did not have complete agreement were manually resolved by the annotators, resulting in high-quality classification annotation.
Because each question can have up to two labels, we treat each label for a given question as a separate evaluation of interannotator agreement. That is, for questions where both annotators labeled each question as having 1 or 2 labels, we treat this as 1 or 2 separate evaluations of interannotator agreement. For cases where one annotator labeled a question as having 1 label, and the other annotator labeled that same question as having 2 labels, we conservatively treat this as two separate interannotator agreements where one annotator failed to specify the second label and had zero agreement on that unspecified label. Though the classification procedure was fine-grained compared to other question classification taxonomies, containing an unusually large number of classes (406), overall raw interannotator agreement before resolution was high (Cohen's κ = 0.58). When labels are truncated to a maximum taxonomy depth of N, raw interannotator agreement increases to κ = 0.85 at the coarsest (9 class) level (see Table 10). This is considered moderate to strong agreement (see McHugh (2012) for a discussion of the interpretation of the Kappa statistic). Based on the results of an error analysis on the question classification system (see Section 9.3.2.), we estimate that the overall accuracy of the question classification labels after resolution is approximately 96%.
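To make the effect of truncating labels on agreement concrete, the following sketch computes Cohen's kappa over two annotators' hierarchical labels at a chosen depth. It assumes the labels have already been aligned into one pair per evaluation (as described above) and uses toy label paths; it illustrates the statistic and is not the authors' evaluation script.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def kappa_at_depth(paths_a: List[Tuple[str, ...]],
                   paths_b: List[Tuple[str, ...]],
                   depth: int) -> float:
    """Kappa after truncating every label path to `depth` levels."""
    trunc_a = ["/".join(p[:depth]) for p in paths_a]
    trunc_b = ["/".join(p[:depth]) for p in paths_b]
    return cohens_kappa(trunc_a, trunc_b)


# Toy annotations (hypothetical): the annotators differ on one leaf label.
annotator_1 = [("Matter", "Changes of State", "Boiling"), ("Energy", "Heat"),
               ("Matter", "Changes of State", "Freezing"), ("Energy", "Light")]
annotator_2 = [("Matter", "Changes of State", "Freezing"), ("Energy", "Heat"),
               ("Matter", "Changes of State", "Freezing"), ("Energy", "Light")]
print(kappa_at_depth(annotator_1, annotator_2, depth=1))
print(kappa_at_depth(annotator_1, annotator_2, depth=3))
```

In the toy data the single disagreement is at the leaf level, so agreement is perfect at depth 1 and lower at depth 3, the same direction of change as reported between the coarsest and finest levels.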
Annotators disagreed on 3,441 (44.2%) of questions. Primary sources of disagreement before resolution included each annotator choosing a single category for questions requiring multiple labels (e.g. annotator 1 assigning a label of X, and annotator 2 assigning a label of Y, when the gold label was multilabel X, Y), which was observed in 18% of disagreements. Similarly, we observed annotators choosing similar labels but at different levels of specificity in the taxonomy (e.g. annotator 1 assigning a label of Matter → Changes of State → Boiling, where annotator 2 assigned Matter → Changes of State), which occurred in 12% of disagreements before resolution.
9.2. Question Classification
9.2.1. Precision@1
Because of space limitations, the question classification results are reported in Table 3 only using Mean Average Precision (MAP). We also include Precision@1 (P@1), the overall accuracy of the highest-ranked prediction for each question classification model, in Table 11.
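For readers unfamiliar with the two metrics, the following minimal sketch computes P@1 and MAP using their standard definitions for ranked multi-label predictions. The function names and the toy rankings are illustrative assumptions; the evaluation code used for Tables 3 and 11 is not shown in this paper and may differ in detail.

```python
def precision_at_1(ranked_labels, gold_labels):
    """1.0 if the highest-ranked predicted label is a gold label, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] in gold_labels else 0.0

def average_precision(ranked_labels, gold_labels):
    """Mean of the precision values at the rank of each gold label."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label in gold_labels:
            hits += 1
            total += hits / rank
    return total / len(gold_labels) if gold_labels else 0.0

# Hypothetical ranked predictions paired with gold label sets
# (each question has one or two gold labels).
examples = [
    (["A", "B", "C"], {"A"}),
    (["B", "A", "C"], {"A", "C"}),
]

p_at_1 = sum(precision_at_1(r, g) for r, g in examples) / len(examples)
mean_ap = sum(average_precision(r, g) for r, g in examples) / len(examples)
print(p_at_1, round(mean_ap, 4))
```

P@1 only rewards the top-ranked label, whereas MAP also gives partial credit when a gold label appears lower in the ranking, which matters for questions with two gold labels.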
9.2.2. Negative Results
CNN: We implemented the CNN sentence classifier of Kim (2014), which demonstrated near state-of-the-art performance on a number of sentence classification tasks (including TREC question classification) by using pre-trained word embeddings (Mikolov et al., 2013) as feature extractors in a CNN model. We adapted the original CNN non-static model to multi-label classification by changing the fully connected softmax layer to a sigmoid layer, producing a sigmoid output for each label simultaneously. We followed the same parameter settings reported by Kim (2014) except the learning rate, which was tuned based on the development set. Pilot experiments did not show a performance improvement over the baseline model.
Table 11: Performance of each question classification model, expressed in Precision@1 (P@1). * signifies that a given model is significantly different from the baseline model (p < 0.01).
Label Definitions: Question terms can be mapped to categories using manual heuristics (e.g. Silva et al., 2011). To mitigate sparsity and limit heuristic use, here we generated a feature comparing the cosine similarity of composite embedding vectors (e.g. Jansen et al., 2014) representing question text and category definition text, using pretrained GloVe embeddings (Pennington et al., 2014). Pilot experiments showed that performance did not significantly improve.
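A minimal sketch of such a definition-similarity feature is given below. It is illustrative only: the three-dimensional vectors are hypothetical stand-ins for pretrained GloVe embeddings, the composite vector is assumed to be a simple sum of word vectors, and the names `composite_vector` and `cosine` are introduced here.

```python
import math

# Hypothetical toy embeddings; a real setup would load pretrained GloVe vectors.
TOY_EMBEDDINGS = {
    "water": [0.9, 0.1, 0.0], "boils": [0.8, 0.2, 0.1],
    "liquid": [0.85, 0.15, 0.05], "gas": [0.7, 0.3, 0.1],
    "cat": [0.0, 0.9, 0.4], "animal": [0.1, 0.95, 0.3],
}

def composite_vector(text, embeddings):
    """Sum the vectors of all in-vocabulary tokens (None if there are none)."""
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vectors:
        return None
    return [sum(v[i] for v in vectors) for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or all zeros."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

question = composite_vector("water boils", TOY_EMBEDDINGS)
sim_matter = cosine(question, composite_vector("liquid gas", TOY_EMBEDDINGS))
sim_life = cosine(question, composite_vector("cat animal", TOY_EMBEDDINGS))
print(round(sim_matter, 3), round(sim_life, 3))
```

One similarity value per category definition would then be supplied to the classifier as a feature.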
Question Expansion with Hypernyms (Probase Version): One of the challenges of hypernym expansion (e.g. Huang et al., 2008; Silva et al., 2011; Roberts et al., 2014) is determining a heuristic for the termination depth of hypernym expansion, as in Van-tu et al. (2016). Because science exam questions are often grounded in specific examples (e.g. a car rolling down a hill coming to a stop due to friction), we hypothesized that knowing certain categories of entities can be important for identifying specific question types. For example, observing that a question contains a kind of animal may be suggestive of a Life Science question, while vehicles or materials present in questions may similarly suggest questions about Forces or Matter, respectively. The challenge with WordNet is that key hypernyms can be at very different depths from query terms: "cat" is distance 10 away from living thing, "car" is distance 4 away from vehicle, and "copper" is distance 2 away from material. Choosing a static threshold (or decaying threshold, as in Van-tu et al. (2016)) will inherently reduce recall and limit the utility of this method of query expansion.
To address this, we piloted a hypernym expansion experiment using the Probase taxonomy (Wu et al., 2012), a collection of 20.7M is-a pairs mined from the web, in place of WordNet. Because the taxonomic pairs in Probase come from use in naturalistic settings, links tend to jump levels in the WordNet taxonomy and be expressed in common forms. For example, cat → animal, car → vehicle, and copper → material are each distance 1 in the Probase taxonomy, and are high-frequency (i.e. high-confidence) taxonomic pairs.
Similar to query expansion using WordNet hypernyms, our pilot experiments did not observe a benefit to using Probase hypernyms over the baseline model. An error analysis suggested that the large number of noisy and out-of-context links present in Probase may have reduced performance. In response, we constructed a list of 710 key hypernym categories, manually filtered from a list of hypernyms seeded using high-frequency words from an in-house corpus of 250 in-domain science textbooks. We also did not observe a benefit to question classification over the baseline model when expanding only to this manually curated list of key hypernyms.
9.2.3. Additional Positive Results
Topic words: We made use of the 77 TREC word lists of Li and Roth (2002), containing a total of 3,257 terms, as well as an in-house set of 144 word lists on general and elementary science topics mined from the web, such as ANIMALS, VEGETABLES, and VEHICLES, containing a total of 29,059 words. To mitigate sparsity, features take the form of counts for a specific topic: detecting the words turtle and giraffe in a question would provide a count of 2 for the ANIMALS feature. This provides a light form of domain-specific entity and action (e.g. types of changes) recognition. Pilot experiments showed that this wordlist feature added a modest performance benefit of approximately 2% to question classification accuracy. Taken together with our results on hypernym expansion, this suggests that manually curated wordlists can show modest benefits for question classification performance, but at the expense of substantial effort in authoring or collecting these extensive wordlists.
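The count-based topic feature can be sketched as follows. The two tiny word lists, the regular-expression tokenizer, and the function name `topic_count_features` are assumptions made for illustration; the actual lists are far larger, and no stemming or lemmatization is applied in this sketch.

```python
import re

# Hypothetical miniature word lists standing in for the TREC and in-house lists.
WORD_LISTS = {
    "ANIMALS": {"turtle", "giraffe", "cat"},
    "VEHICLES": {"car", "truck", "bicycle"},
}

def topic_count_features(question, word_lists):
    """Return, for each topic, how many question tokens appear in its list."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {topic: sum(1 for t in tokens if t in words)
            for topic, words in word_lists.items()}

features = topic_count_features("Which is heavier, a turtle or a giraffe?",
                                WORD_LISTS)
print(features)  # {'ANIMALS': 2, 'VEHICLES': 0}
```

Counting per topic rather than per word keeps the feature space at one dimension per list, which is how the sparsity of individual terms is mitigated.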
9.2.4. Additional BERT-QC Model Details
Hyperparameters: For each layer of the class label hierarchy, we tune the hyperparameters based on the development set. We use the pretrained BERT-Base (uncased) checkpoint. We use the following hyperparameters: maximum sequence length = 256, batch size = 16, learning rates: 2e-5 (L1), 5e-5 (L2-L6), epochs: 5 (L1), 25 (L2-L6).
Statistics: We use non-parametric bootstrap resampling to compare the baseline (the Li and Roth (2002) model) to all experimental models to determine significance, using 10,000 bootstrap resamples.
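One common form of this test is a paired bootstrap over questions, sketched below. The exact resampling variant used here is not specified in the text, so the pairing by question, the one-sided p-value, the per-question scores, and the name `paired_bootstrap_p` should all be read as assumptions.

```python
import random

def paired_bootstrap_p(baseline_scores, model_scores, n_resamples=10000, seed=0):
    """Fraction of resamples in which the model fails to beat the baseline.

    Both lists hold one score per question (e.g. average precision), aligned
    so that index i refers to the same question in each list.
    """
    rng = random.Random(seed)
    n = len(baseline_scores)
    not_better = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        delta = sum(model_scores[i] - baseline_scores[i] for i in sample)
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical per-question scores for 100 questions.
baseline = [0.0] * 50 + [1.0] * 50
improved = [1.0] * 100

p_improved = paired_bootstrap_p(baseline, improved)
p_identical = paired_bootstrap_p(baseline, baseline)
print(p_improved, p_identical)
```

A small p-value indicates that the improvement over the baseline persists across nearly all resampled question sets.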
9.3. Question Answering With QC Labels
Hyperparameters: Pilot experiments on both pre-trained BERT-Base and BERT-Large checkpoints showed similar performance benefits at the finest levels of question classification granularity (L6), but the BERT-Large model demonstrated higher overall baseline performance, and larger incremental benefits at lower levels of QC granularity, so we evaluated using that model. We lightly tuned hyperparameters on the development set surrounding those reported by Devlin et al. (2018), and ultimately settled on parameters similar to their original work, tempered by technical limitations in running the BERT-Large model on available hardware: maximum sequence length = 128, batch size = 16, learning rate: 1e-5. We report performance as the average of 10 runs for each datapoint. The number of epochs was tuned for each run on the development set (to a maximum of 8 epochs), where most models converged to maximum performance within 5 epochs.
Preference for uncorrelated errors in multiple choice question classification: We primarily report QA performance using BERT-QC trained using text from only the multiple choice questions and not their answer candidates. While this model achieved lower overall QC performance compared to the model trained with both question and multiple choice answer candidate text, it achieved slightly higher performance in the QA+QC setting. Our error analysis in Section 4.3. shows that though models trained on both question and answer text can achieve higher QC performance, when they make QC errors, the errors tend to be highly correlated with an incorrect answer candidate, which can substantially reduce QA performance. This is an important result for question classification in the context of multiple choice exams: correlated noise can substantially reduce QA performance, meaning the kinds of errors that a model makes are important, and evaluating QC performance in context with QA models that make use of those QC systems is critical.
Related to this result, we provide an analysis of the noise sensitivity of the QA+QC model for different levels of question classification prediction accuracy. Here, we perturb gold question labels by randomly selecting a proportion of questions (between 5% and 40%) and randomly assigning each selected question a different label. Figure 3 shows that this uncorrelated noise provides roughly linear decreases in performance, and still shows moderate gains at 60% accuracy (40% noise). This suggests that when making errors, making random errors (that are not correlated with incorrect multiple choice answers) is preferable.
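The perturbation procedure can be sketched as below. This is a minimal illustration under stated assumptions: the function name `perturb_labels`, the four-label toy taxonomy, and the fixed seed are all hypothetical, whereas the analysis itself draws replacements from the 406 L6 labels.

```python
import random

def perturb_labels(gold_labels, label_set, noise_proportion, seed=0):
    """Replace a random proportion of gold labels with a different random label."""
    rng = random.Random(seed)
    noisy = list(gold_labels)
    n_noisy = round(noise_proportion * len(noisy))
    # Choose which questions to corrupt, without replacement.
    for i in rng.sample(range(len(noisy)), n_noisy):
        alternatives = [label for label in label_set if label != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

# Hypothetical toy data: 100 questions over a 4-label taxonomy.
LABELS = ["Life", "Matter", "Forces", "Earth"]
gold = LABELS * 25
noisy = perturb_labels(gold, LABELS, noise_proportion=0.4)
n_changed = sum(g != n for g, n in zip(gold, noisy))
print(n_changed)  # 40
```

Because each selected question is guaranteed a label different from its gold label, the resulting label accuracy is exactly one minus the noise proportion, and the errors are independent of the answer candidates.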
Training with predicted labels: We observed small gains when training the BERT-QA model with predicted QC labels. We generate predicted labels for the training set using 5-fold cross-validation over only the training questions.
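The idea of labeling the training set with held-out predictions can be sketched as follows. The fold assignment by index, the `train_fn` interface, and the majority-class stand-in for BERT-QC are assumptions for illustration only.

```python
from collections import Counter

def out_of_fold_predictions(questions, labels, train_fn, k=5):
    """Predict a label for every training question using a model that never
    saw that question: train on k-1 folds, predict on the held-out fold."""
    n = len(questions)
    predictions = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = train_fn([questions[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            predictions[i] = model(questions[i])
    return predictions

def train_majority(train_questions, train_labels):
    """Hypothetical stand-in classifier: always predicts the majority label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda question: majority

# Hypothetical toy training set of 10 questions.
questions = ["q%d" % i for i in range(10)]
labels = ["Life"] * 7 + ["Matter"] * 3
predicted = out_of_fold_predictions(questions, labels, train_majority, k=5)
print(predicted)
```

Predicting each fold with a model trained on the remaining folds keeps the training-set labels about as noisy as the labels the QA model will see at test time.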
Statistics: We use non-parametric bootstrap resampling to compare baseline (no label) and experimental (QC labeled) runs of the QA+QC experiment. Because the BERT-QA model produces different performance values across successive runs, we perform 10 runs of each condition. We then compute pairwise p-values for each of the 10 no label and QC labeled runs (generating 100 comparisons), then use Fisher's method to combine these into a final statistic.
Figure 3: Analysis of noisy question classification labels on overall QA performance (X axis: Proportion of Noisy QC Labels; Y axis: QA Performance (Precision@1)). Here, the X axis represents the proportion of gold QC labels that have been randomly switched to another of the 406 possible labels at the finest level of granularity in the classification taxonomy (L6). QA performance decreases approximately linearly as the proportion of noisy QC labels increases. Each point represents the average of 20 experimental runs, with different questions and random labels for each run. QA performance reported is on the development set. Note that due to the runtime associated with this analysis, the results reported are using the BERT-Base model.
9.3.1. Interpretation of non-linear question answering gains between levels
Question classification paired with question answering shows statistically significant gains of +1.7% P@1 at L6 using predicted labels, and a ceiling gain of up to +10% P@1 using gold labels. The QA performance graph in Figure 2 contains two deviations from the expectation of linear gains with increasing specificity, at L1 and L3.
Region at L2 → L3: On gold labels, L3 provides small gains over L2, whereas L4 provides large gains over L3. We hypothesize that this is because approximately 57% of question labels belong to the Earth Science or Life Science categories, which have much more depth than breadth in the standardized science curriculum, and as such these categories are primarily differentiated from broad topics into detailed problem types at levels L4 through L6. Most other curriculum categories have more breadth than depth, and show strong (but not necessarily full) differentiation at L2.
Region at L1: Predicted performance at L1 is higher than gold performance at L1. We hypothesize this is because we train using predicted rather than gold labels, which provides a boost in performance. Training on gold labels and testing on predicted labels substantially reduces the difference between gold and predicted performance.
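As a supplement to the Statistics procedure described in Section 9.3, the step that combines pairwise p-values with Fisher's method can be sketched as follows. The function name `fisher_combined_p` and the `floor` parameter, which guards against p-values of exactly zero from a finite number of bootstrap resamples, are assumptions introduced here. Fisher's method is usually derived for independent tests, which this sketch simply takes as given.

```python
import math

def fisher_combined_p(p_values, floor=1e-12):
    """Combine p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!,
    evaluated here in log space for numerical stability.
    """
    k = len(p_values)
    half = -sum(math.log(max(p, floor)) for p in p_values)  # equals X / 2
    if half == 0.0:
        return 1.0
    log_terms = [i * math.log(half) - math.lgamma(i + 1) for i in range(k)]
    peak = max(log_terms)
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return min(1.0, math.exp(log_sum - half))

# Hypothetical p-values from a handful of pairwise run comparisons.
pairwise_p = [0.04, 0.10, 0.03, 0.20]
combined = fisher_combined_p(pairwise_p)
print(combined)
```

With a single p-value the function returns that value unchanged, which is a quick sanity check on the closed-form tail probability.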
9.3.2. Overall Annotation Accuracy
Though initial raw interannotator agreement was measured at kappa = 0.58, to maximize the quality of the annotation the annotators performed a second pass where all disagreements were manually resolved. Table 11 shows question
http://cogcomp.org/Data/QA/QC/
4 Model performance is occasionally reported only on TREC-6 rather than the more challenging TREC-50, making direct comparisons between some algorithms difficult.
5 Xia et al. (2018) also report QC performance on MS Marco (Nguyen et al., 2016), a million-question dataset using 5 of the TREC-6 labels. We believe this to be in error as MS Marco QC labels are automatically generated. Still, for purposes of comparison, BERT-QC reaches 96.2% accuracy, an increase of +3% over Xia et al. (2018)'s best model.