Question Answering as Global Reasoning Over Semantic Abstractions



We propose a novel method for exploiting the semantic structure of text to answer multiple-choice questions. The approach is especially suitable for domains that require reasoning over a diverse set of linguistic constructs but have limited training data. To address these challenges, we present the first system, to the best of our knowledge, that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pre-trained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, we translate question answering (QA) into a search for an optimal subgraph that satisfies certain global and local properties. This formulation generalizes several prior structured QA systems. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.


Question answering (QA) is a fundamental challenge in AI, particularly in natural language processing (NLP) and reasoning. We consider the multiple-choice setting where Q is a question, A is a set of answer candidates, and the knowledge required for answering Q is available in the form of raw text P . This setting applies naturally to reading comprehension questions, as well as to open format multiple-choice questions for which information retrieval (IR) techniques are able to obtain relevant text from a corpus.

Several state-of-the-art QA systems excel at locating the correct answer in P based on proximity to question words, the distributional semantics of the text, answer type, etc. (Clark et al. 2016). However, they often struggle with questions that require some form of reasoning or appeal to a more subtle understanding of the supporting text or the question. We demonstrate that we can use existing NLP modules, such as semantic role labeling (SRL) systems with respect to multiple predicate types (verbs, prepositions, nominals, etc.), to derive multiple semantic views of the text and perform reasoning over these views to answer a variety of questions.

As an example, consider the following snippet of sports news text and an associated question:

P: Teams are under pressure after PSG purchased Neymar this season. Chelsea purchased Morata. The Spaniard looked like he was set for a move to Old Trafford for the majority of the summer, only for Manchester United to sign Romelu Lukaku instead, paving the way for Morata to finally move to Chelsea for an initial £56m.

Q: Who did Chelsea purchase this season?

A: { Alvaro Morata, Neymar, Romelu Lukaku }

Given the bold-faced sentence in P, simple word matching suffices to correctly answer Q. However, P could have stated the same information in many different ways. As paraphrases become more complex, they begin to involve more linguistic constructs such as coreference, punctuation, prepositions, and nominals. This makes understanding the text, and thus the QA task, more challenging.

For instance, P could instead say Morata is the recent acquisition by Chelsea. This simple-looking transformation can be surprisingly confusing for highly successful systems such as BIDAF (Seo et al. 2016), which produces the partially correct phrase "Neymar this season. Morata". On the other hand, one can still answer the question confidently by abstracting relevant parts of Q and P, and connecting them appropriately. Specifically, a verb SRL frame for Q would indicate that we seek the object of the verb purchase, a nominal SRL frame for P would capture that the acquisition was of Morata and was done by Chelsea, and textual similarity would align purchase with acquisition.

Similarly, suppose P instead said Morata, the recent acquisition by Chelsea, will start for the team tomorrow. BIDAF now incorrectly chooses Neymar as the answer, presumably due to its proximity to the words purchased and this season. However, with the right abstractions, one could still arrive at the correct answer, as depicted in Figure 1 for our proposed system, SEMANTICILP. This reasoning uses comma SRL to realize that Morata refers to the acquisition, and a preposition SRL frame to capture that the acquisition was done by Chelsea.

Figure 1: Depiction of SEMANTICILP reasoning for the example paragraph given in the text. Semantic abstractions of the question, answers, knowledge snippet are shown in different colored boxes (blue, green, and yellow, resp.). Red nodes and edges are the elements that are aligned (used) for supporting the correct answer. There are many other unaligned (unused) annotations associated with each piece of text that are omitted for clarity.

One can continue to make P more complex. For example, P could introduce the need for coreference resolution by phrasing the information as: Chelsea is hoping to have a great start this season by actively hunting for new players in the transfer period. Morata, the recent acquisition by the team, will start for the team tomorrow. Nevertheless, with appropriate semantic abstractions of the text, the underlying reasoning remains relatively simple.

Given sufficiently large QA training data, one could conceivably perform end-to-end training (e.g., using a deep learning method) to address these linguistic challenges. However, existing large-scale QA datasets such as SQuAD (Rajpurkar et al. 2016) often either have limited linguistic richness or do not necessarily require reasoning to arrive at the answer (Jia and Liang 2017). Consequently, the resulting models do not transfer easily to other domains. For instance, the above-mentioned BiDAF model trained on the SQuAD dataset performs substantially worse than a simple IR approach on our datasets. On the other hand, many of the QA collections in domains that require some form of reasoning, such as the science questions we use, are small (100s to 1000s of questions). This brings into question the viability of the aforementioned paradigm that attempts to learn everything from only the QA training data.

Towards the goal of effective structured reasoning in the presence of data sparsity, we propose to use a rich set of general-purpose, pre-trained NLP tools to create various semantic abstractions of the raw text in a domain-independent fashion, as illustrated for an example in Figure 1. We represent these semantic abstractions as families of graphs, where the family (e.g., trees, clusters, labeled predicate-argument graphs, etc.) is chosen to match the nature of the abstraction (e.g., parse trees, coreference sets, SRL frames, etc., respectively). The collection of semantic graphs is then augmented with inter- and intra-graph edges capturing lexical similarity (e.g., word-overlap score or word2vec distance).

This semantic graph representation allows us to formulate QA as the search for an optimal support graph, a subgraph G of the above augmented graph connecting (the semantic graphs of) Q and A via P. The reasoning used to answer the question is captured by a variety of requirements or constraints that G must satisfy, as well as a number of desired properties, encapsulating the "correct" reasoning, that make G preferable over other valid support graphs. For instance, a simple requirement is that G must be connected and it must touch both Q and A. Similarly, if G includes a verb from an SRL frame, it is preferable to also include the corresponding subject. Finally, the resulting constrained optimization problem is formulated as an Integer Linear Program (ILP), and optimized using an off-the-shelf ILP solver.

This formalism may be viewed as a generalization of systems that reason over tables of knowledge (Cohen 2000; Khashabi et al. 2016) : instead of operating over table rows (which are akin to labeled sequence graphs or predicateargument graphs), we operate over a much richer class of semantic graphs. It can also be viewed as a generalization of the recent TUPLEINF system (Khot, Sabharwal, and Clark 2017) , which converts P into a particular kind of semantic abstraction, namely Open IE tuples (Banko et al. 2007) .

This generalization to multiple semantic abstractions poses two key technical challenges: (a) unlike clean knowledge-bases (e.g., Dong et al. (2015) ) used in many QA systems, abstractions generated from NLP tools (e.g., SRL) are noisy; and (b) even if perfect, using their output for QA requires delineating what information in Q, A, and P is relevant for a given question, and what constitutes valid reasoning. The latter is especially challenging when combining information from diverse abstractions that, even though grounded in the same raw text, may not perfectly align. We address these challenges via our ILP formulation, by using our linguistic knowledge about the abstractions to design requirements and preferences for linking these abstractions.

We present a new QA system, SEMANTICILP, based on these ideas, and evaluate it on multiple-choice questions from two domains involving rich linguistic structure and reasoning: elementary and middle-school level science exams, and early-college level biology reading comprehension. Their data sparsity, as we show, limits the performance of state-of-the-art neural methods such as BiDAF (Seo et al. 2016). SEMANTICILP, on the other hand, is able to successfully capitalize on existing general-purpose NLP tools in order to outperform existing baselines by 2%-6% on the science exams, leading to a new state of the art. It also generalizes well, as demonstrated by its strong performance on biology questions in the PROCESSBANK dataset (Berant et al. 2014). Notably, while the best existing system for the latter relies on domain-specific structural annotation and question processing, SEMANTICILP needs neither.

Related Work

There are numerous QA systems operating over large knowledge-bases. Our work, however, is most closely related to systems that work either directly on the input text or on a semantic representation derived on-the-fly from it. In the former category are IR and statistical correlation based methods with surprisingly strong performance (Clark et al. 2016), as well as a number of recent neural architectures such as BiDAF (Seo et al. 2016), the Decomposable Attention Model (Parikh et al. 2016), etc. In the latter category are approaches that perform structured reasoning over some abstraction of text. For instance, Khashabi et al. (2016) perform reasoning on knowledge tables constructed using semi-automatic methods, Khot, Sabharwal, and Clark (2017) use Open IE subject-verb-object relations (Etzioni et al. 2008), Banarescu et al. (2013) use AMR annotators (Wang et al. 2015), and Krishnamurthy, Tafjord, and Kembhavi (2016) use a semantic parser (Zettlemoyer and Collins 2005) to answer a given question. Our work differs in that we use a wide variety of semantic abstractions simultaneously, and perform joint reasoning over them to identify which abstractions are relevant for a given question and how best to combine information from them.

Our formalism can be seen as an extension of some of the prior work on structured reasoning over semi-structured text. For instance, in our formalism, each table used by Khashabi et al. (2016) can be viewed as a semantic frame and represented as a predicate-argument graph. The table-chaining rules used there are equivalent to the reasoning we define when combining two annotation components. Similarly, the Open IE tuples used by Khot, Sabharwal, and Clark (2017) can also be viewed as a predicate-argument structure.

One key abstraction we use is the predicate-argument structure provided by Semantic Role Labeling (SRL). Many SRL systems have been designed (Gildea and Jurafsky 2002; Punyakanok, Roth, and Yih 2008) using linguistic resources such as FrameNet (Baker, Fillmore, and Lowe 1998), PropBank (Kingsbury and Palmer 2002), and NomBank (Meyers et al. 2004). These systems are meant to convey high-level information about predicates (which can be a verb, a noun, etc.) and related elements in the text. The meaning of each predicate is conveyed by a frame, a schematic representation of a situation. Phrases with similar semantics ideally map to the same frame and roles. Our system also uses other NLP modules, such as for coreference resolution (Lee et al. 2013) and dependency parsing (Chang et al. 2015).

While it appears simple to use SRL systems for QA (Palmer, Gildea, and Kingsbury 2005), this has found limited success (Kaisser and Webber 2007; Pizzato and Mollá 2008; Moreda et al. 2011). The challenges earlier approaches faced stem from relying on verb SRL alone, whereas QA depends on richer information than verb predicates and their arguments, as well as from some level of brittleness of all NLP systems. Shen and Lapata (2007) partly addressed the latter challenge with an inference framework that formulates the task as a bipartite matching problem over the assignment of semantic roles, and managed to slightly improve QA. In this work we address both of these challenges and go beyond the limitation of using a single-predicate SRL system; we make use of SRL abstractions that are based on verbs, nominals, prepositions, and comma predicates, as well as textual similarity. We then develop an inference framework capable of exploiting combinations of these multiple SRL (and other) views, thus operating over a more complete semantic representation of the text.

A key aspect of QA is handling textual variations, on which there has been prior work using dependency-parse transformations (Punyakanok, Roth, and Yih 2004) . These approaches often define inference rules, which can generate new trees starting from a base tree. Bar-Haim, Dagan, and Berant (2015) and Stern et al. (2012) search over a space of a pre-defined set of text transformations (e.g., coreference substitutions, passive to active). Our work differs in that we consider a much wider range of textual variations by combining multiple abstractions, and make use of a more expressive inference framework.

Another aspect of our approach is the graphical representation of knowledge and the reasoning over it. We model the QA problem as a search in the space of potential support graphs. Search-based reasoning systems have been successful in several NLP areas (Roth and Yih 2004; Chang et al. 2010; Berant, Dagan, and Goldberger 2010; Srikumar and Roth 2011; Goldwasser and Roth 2011; Schüller 2014; Fei et al. 2015) . Compared to previous works, we use a larger set of annotators to account for diverse linguistic phenomena in questions and to handle errors of individual annotators.

Knowledge Abstraction And Representation

We begin with our formalism for abstracting knowledge from text and representing it as a family of graphs, followed by specific instantiations of these abstractions using off-the-shelf NLP modules.

Semantic Abstractions

The pivotal ingredient of the abstraction is the raw text. This representation is used for the question Q, each answer option A_i, and the knowledge snippet P, which potentially contains the answer to the question. The KB for a given raw text consists of the text itself, embellished with various SemanticGraphs attached to it, as depicted in Figure 2.

Figure 2: Knowledge Representation used in our formulation. Raw text is associated with a collection of SemanticGraphs, which convey certain information about the text. There are implicit similarity edges among the nodes of the connected components of the graphs, and from nodes to the corresponding raw-text spans.

Each SemanticGraph is drawn from a family of graphs. In principle there need not be any constraints on the permitted graph families; however, for ease of representation we choose the graphs to belong to one of the following 5 families: Sequence graphs represent labels for each token in the sentence. The Span family represents labels for spans of the text. Tree is a tree representation over text spans. The Cluster family contains spans of text in different groups. The PredArg family represents predicates and their arguments; in this view, edges represent the connections between each single predicate and its arguments. Each SemanticGraph belongs to one of the graph families, and its content is determined by the semantics of the information it represents and the text itself.

We define the knowledge more formally here. For a given paragraph T, its representation K(T) consists of a set of semantic graphs K(T) = {g_1, g_2, ...}. We define v(g) = {c_i} and e(g) = {(c_i, c_j)} to be the set of nodes and edges of a given graph, respectively.
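As an illustration, the representation K(T) just defined can be sketched with simple data structures. This is a hypothetical sketch, not the authors' implementation; all class and field names are our own.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the knowledge representation K(T): raw text plus a
# set of SemanticGraphs, each tagged with the family it belongs to.

@dataclass(frozen=True)
class Node:
    span: tuple   # (start, end) character offsets into the raw text
    label: str    # e.g., a POS tag, chunk type, or SRL role

@dataclass
class SemanticGraph:
    family: str                                  # "Sequence" | "Span" | "Tree" | "Cluster" | "PredArg"
    nodes: list = field(default_factory=list)    # v(g)
    edges: list = field(default_factory=list)    # e(g): pairs of node indices

@dataclass
class Knowledge:
    text: str
    graphs: list = field(default_factory=list)   # K(T) = {g_1, g_2, ...}

# A toy Verb-SRL frame for "Chelsea purchased Morata.":
frame = SemanticGraph(
    family="PredArg",
    nodes=[Node((8, 17), "predicate:purchase"),
           Node((0, 7), "A0"),     # agent: Chelsea
           Node((18, 24), "A1")],  # theme: Morata
    edges=[(0, 1), (0, 2)],        # predicate -> each of its arguments
)
K_P = Knowledge(text="Chelsea purchased Morata.", graphs=[frame])
```

A real instantiation would attach several such graphs (one per annotator) to the same text, with node spans pointing back into it.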

Semantic Graph Generators

Having introduced a graph-based abstraction for knowledge and categorized it into families of graphs, we now delineate the instantiations we use for each family. Many of the pre-trained extraction tools we use are available in COGCOMPNLP.³

• Sequence: labels for each token; for example, Lemma and POS (Roth and Zelenko 1998).

• Span: labeled spans of text, such as Shallow-Parse chunks and Tokens.

• Tree: dependency parse trees.

• Cluster: coreference chains.

• PredArg: SRL frames with respect to verb, preposition, and comma predicates, e.g., Comma-SRL (Arivazhagan, Christodoulopoulos, and Roth 2016).

Given these SemanticGraph generators, the question, answers, and paragraph are each represented as a collection of graphs.

3 Available at: http://github.com/CogComp/cogcomp-nlp

Given the instance graphs, creating the augmented graph is done implicitly as part of an optimization problem in the next step.

Qa As Reasoning Over Semantic Graphs

We introduce our treatment of QA as an optimal subgraph selection problem over knowledge. We treat question answering as the task of finding the best support in the knowledge snippet for a given question-answer pair, measured in terms of the strength of a "support graph" defined as follows.

The inputs to the QA system are a question K(Q), the set of answers {K(A_i)}, and a knowledge snippet K(P).⁴ Given such representations, we form a reasoning problem, formulated as an optimization problem that searches for a "support graph" connecting the question nodes to a unique answer option through nodes and edges of the snippet.

Define the instance graph I = I(Q, {A_i}, P) as the union of knowledge graphs:

I ≜ K(Q) ∪ (∪_i K(A_i)) ∪ K(P).

Intuitively, we would like the support graph to be connected, and to include nodes from the question, the answer option, and the knowledge. Since the SemanticGraphs are composed of many disjoint sub-graphs, we define an augmented graph I+ to model a bigger structure over the instance graph I. Essentially, we augment the instance graph and weight the new edges. Define a scoring function f: (v_1, v_2) that labels a pair of nodes v_1 and v_2 with a score representing their phrase-level entailment or similarity.

Definition 1. An augmented graph I+, for a question Q, answers {A_i} and knowledge P, is defined with the following properties:

1. Nodes: v(I+) = v(I(Q, {A_i}, P))

2. Edges:⁵ e(I+) = e(I) ∪ [K(Q) ⊗ K(P)] ∪ [∪_i K(P) ⊗ K(A_i)]

3. Edge weights: for any e ∈ I+:

• If e ∉ I, the edge connects two nodes in different connected components:

∀e = (v_1, v_2) ∉ I : w(e) = f(v_1, v_2)

• If e ∈ I, the edge belongs to a connected component, and the edge weight carries information about the reliability of the SemanticGraph and the semantics of the two nodes:

∀g ∈ I, ∀e ∈ g : w(e) = f(e, g)

Next, we define support graphs, the set of graphs that support the reasoning of a question. For this we apply structural constraints to the augmented graph.

Definition 2. A support graph G = G(Q, {A_i}, P) for a question Q, answer options {A_i} and paragraph P, is a subgraph (V, E) of I+ with the following properties:

4 For simplicity, from now on, we drop "knowledge"; e.g., instead of saying "question knowledge", we say "question".

5 Define K(T_1) ⊗ K(T_2) ≜ ∪_{(g_1, g_2) ∈ K(T_1)×K(T_2)} v(g_1) × v(g_2), where v(g_1) × v(g_2) = {(v, w) : v ∈ v(g_1), w ∈ v(g_2)}.
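The cross-product edge construction K(T_1) ⊗ K(T_2) of Definition 1 can be sketched as follows. The scoring function f here is a toy token-overlap measure and all phrases are illustrative; the paper's actual f uses WordNet-based entailment and Paragram scores.

```python
from itertools import product

def f(v1, v2):
    """Toy phrase-level similarity: Jaccard overlap of lowercased tokens."""
    t1, t2 = set(v1.lower().split()), set(v2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

def cross_edges(K_T1, K_T2):
    """K(T1) (x) K(T2): all node pairs across every pair of graphs."""
    return [(v, w) for g1, g2 in product(K_T1, K_T2)
                   for v, w in product(g1, g2)]

def augment(K_Q, K_A_list, K_P):
    """Weighted inter-graph edges of I+: Q-P pairs plus P-A_i pairs."""
    edges = cross_edges(K_Q, K_P)
    for K_A in K_A_list:
        edges += cross_edges(K_P, K_A)
    return {(v, w): f(v, w) for v, w in edges}

# Each K(T) is a list of graphs; each graph is reduced to its node phrases.
K_Q = [["Chelsea", "purchase"]]
K_P = [["acquisition by Chelsea", "Morata"]]
K_A = [[["Alvaro Morata"]], [["Neymar"]]]
weights = augment(K_Q, K_A, K_P)
```

Intra-graph edges (e ∈ I) would keep their annotator-derived weights; only the new inter-component edges are scored by f.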


1. G is connected.

2. G ∩ K(Q) ≠ ∅, G ∩ K(P) ≠ ∅, ∃! i : G ∩ K(A_i) ≠ ∅

3. G satisfies structural properties per each connected component, as summarized in Table 1.

Table 1: Minimum requirements for using each family of graphs. Each graph connected component (e.g., a PredArg frame, or a Coreference chain) cannot be used unless the condition below is satisfied.

PredArg: Use at least (a) a predicate and its argument, or (b) two arguments
Cluster: Use at least two nodes
Tree: Use two nodes with distance less than k
SpanLabelView: Use at least k nodes

Definition 2 characterizes what we call a potential solution to a question. A given question and paragraph give rise to a large number of possible support graphs. We define the space of feasible support graphs as 𝒢 (i.e., all the graphs that satisfy Definition 2, for a given (Q, {A_i}, P)). To rank various feasible support graphs in such a large space, we define a scoring function score(G) as:

score(G) = Σ_{v ∈ v(G)} w(v) + Σ_{e ∈ e(G)} w(e) − Σ_{c ∈ C} w_c · 1{c is violated}    (1)

for some set of preferences (or soft constraints) C. When c is violated, denoted by the indicator function 1{c is violated} in Eq. (1), we penalize the objective value by some fixed amount w_c. The second term is meant to induce sparsity in the desired solutions, much like regularization terms in machine learning models (Natarajan 1995). The first term is the sum of the weights defined when constructing the augmented graph, and is meant to give more weight to solutions that have better and more reliable alignments between their nodes. The role of the inference process is to choose the "best" graph under our notion of desirable support graphs:
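The scoring function of Eq. (1) can be sketched directly. All weights and the constraint set are illustrative stand-ins for the paper's Table 2 preferences.

```python
def score(node_weights, edge_weights, constraints):
    """
    Eq. (1): sum of node and edge weights, minus penalties for violated
    soft constraints. `constraints` is a list of (w_c, violated) pairs,
    where `violated` plays the role of the indicator 1{c is violated}.
    """
    total = sum(node_weights.values()) + sum(edge_weights.values())
    for w_c, violated in constraints:
        if violated:
            total -= w_c
    return total

# Toy support graph: two nodes, one edge, one violated preference.
s = score({"q1": 0.9, "p1": 0.8}, {("q1", "p1"): 0.5},
          [(0.3, True), (0.2, False)])   # 0.9 + 0.8 + 0.5 - 0.3
```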

G* = argmax_{G ∈ 𝒢} score(G)    (2)

ILP Formulation

Our QA system, SEMANTICILP, models the above support graph search of Eq. (2) as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints. A summary of the model is given below.

The augmented graph is not explicitly created; instead, it is created implicitly. The nodes and edges of the augmented graph are encoded as a set of binary variables. The value of each binary variable reflects whether the corresponding node or edge is used in the optimal graph G*. The properties listed in Table 1 are implemented as weighted linear constraints using the variables defined for the nodes and edges.

As mentioned, edge weights in the augmented graph come from a function, f, which captures (soft) phrasal entailment between question and paragraph nodes, or between paragraph and answer nodes, to account for lexical variability. In our evaluations, we use two types of f. (a) Similar to Khashabi et al. (2016), we use a WordNet-based (Miller 1995) function to score word-to-word alignments, and use this as a building block to compute a phrase-level alignment score as the weighted sum of word-level alignment scores. Word-level scores are computed using WordNet's hypernym and synonym relations, and weighted using relevant word-sense frequency. f for similarity (as opposed to entailment) is taken to be the average of the entailment scores in both directions. (b) For longer phrasal alignments (e.g., when aligning phrasal verbs) we use the Paragram system of Wieting et al. (2015).

The final optimization is done on Eq. (1). The first part of the objective is the sum of the weights of the sub-graph, which is what an ILP maximizes, since the nodes and edges are modeled as variables in the ILP. The second part of Eq. (1) contains a set of preferences C, summarized in Table 2, meant to apply soft structural properties that partly depend on the knowledge instantiation. These preferences are soft in the sense that they are applied with a weight to the overall scoring function (as compared to a hard constraint). For each preference function c there is an associated binary or integer variable with weight w_c, and we create appropriate constraints to simulate the corresponding behavior.

Table 2: The set of preference functions in the objective.

- Number of sentences used is more than k
- Active edges connected to each chunk of the answer option, more than k
- More than k chunks in the active answer option
- More than k edges to each question constituent
- Number of active question-terms
- If using PredArg of K(Q), at least one argument should be used
- If using PredArg(Verb-SRL) of K(Q), at least one predicate should be used
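The bidirectional averaging used to turn directional entailment scores into a similarity can be sketched as follows. A toy synonym table stands in for the WordNet-based word-level scorer; the 0.8 synonym score and the max-over-words aggregation are our own simplifying assumptions.

```python
SYNONYMS = {("purchase", "acquisition"), ("buy", "purchase")}

def word_score(w1, w2):
    """Toy stand-in for the WordNet-based word-to-word alignment score."""
    if w1 == w2:
        return 1.0
    if (w1, w2) in SYNONYMS or (w2, w1) in SYNONYMS:
        return 0.8   # illustrative score for a synonym match
    return 0.0

def entail(phrase1, phrase2):
    """Directional score: how well phrase2 covers each word of phrase1."""
    words1, words2 = phrase1.lower().split(), phrase2.lower().split()
    return sum(max(word_score(w1, w2) for w2 in words2)
               for w1 in words1) / len(words1)

def similarity(p1, p2):
    """Symmetric f: average of entailment scores in both directions."""
    return (entail(p1, p2) + entail(p2, p1)) / 2
```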

We note that the ILP objective and constraints are not tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well-supported answer for a given question.
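To make the support-graph optimization of Eq. (2) concrete, the following sketch enumerates all binary node/edge assignments for a toy instance and applies simplified versions of the constraints above: an edge requires both endpoints, and the graph must touch the question and exactly one answer option. A real implementation hands the same objective and constraints to an ILP solver such as SCIP rather than enumerating; all names and weights here are illustrative.

```python
from itertools import product

node_w = {"q1": 0.5, "p1": 0.8, "a1": 0.6, "a2": 0.4}
edge_w = {("q1", "p1"): 0.7, ("p1", "a1"): 0.9, ("p1", "a2"): 0.2}
answer_nodes = {"a1", "a2"}

def feasible(active_nodes, active_edges):
    # An edge may be active only if both its endpoints are.
    if any(u not in active_nodes or v not in active_nodes
           for (u, v) in active_edges):
        return False
    # Must touch the question, and exactly one answer option.
    return "q1" in active_nodes and len(active_nodes & answer_nodes) == 1

best, best_score = None, float("-inf")
for node_bits in product([0, 1], repeat=len(node_w)):
    active_nodes = {n for n, b in zip(node_w, node_bits) if b}
    for edge_bits in product([0, 1], repeat=len(edge_w)):
        active_edges = {e for e, b in zip(edge_w, edge_bits) if b}
        if not feasible(active_nodes, active_edges):
            continue
        s = sum(node_w[n] for n in active_nodes) + \
            sum(edge_w[e] for e in active_edges)
        if s > best_score:
            best, best_score = (active_nodes, active_edges), s
```

The enumeration is exponential in the number of variables, which is precisely why the full system delegates this search to an ILP engine.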

Empirical Evaluation

We evaluate on two domains that differ in the nature of the supporting text (concatenated individual sentences vs. a coherent paragraph), the underlying reasoning, and the way questions are framed. We show that SEMANTICILP outperforms a variety of baselines, including retrieval-based methods, neural networks, structured systems, and the current best system for each domain. These datasets and systems are described next, followed by results.

Question Sets

For the first domain, we have a collection of question sets containing elementary-level science questions from standardized tests (Clark et al. 2016; Khot, Sabharwal, and Clark 2017).⁷

For the second domain, we use the PROCESSBANK⁸ dataset for the reading comprehension task proposed by Berant et al. (2014). It contains paragraphs about biological processes and two-way multiple-choice questions about them. We used a broad subset of this dataset that asks about events or about an argument that depends on another event or argument.⁹ The resulting dataset has 293 train and 109 test questions, based on 147 biology paragraphs.

Test scores are reported as percentages. For each question, a system gets a score of 1 if it chooses the correct answer, 1/k if it reports a k-way tie that includes the correct answer, and 0 otherwise.

Question Answering Systems

We consider a variety of baselines, including the best system for each domain.

IR (information retrieval baseline). We use the IR solver from Clark et al. (2016) , which selects the answer option that has the best matching sentence in a corpus. The sentence is forced to have a non-stopword overlap with both q and a.

SEMANTICILP (our approach). Given the input instance (question, answer options, and a paragraph), we invoke various NLP modules to extract semantic graphs. We then generate an ILP and solve it using the open source SCIP engine (Achterberg 2009) , returning the active answer option a m from the optimal solution found. To check for ties, we disable a m , re-solve the ILP, and compare the score of the second-best answer, if any, with that of the best score.

For the science question sets, where we don't have any paragraphs attached to each question, we create a passage by using the above IR solver to retrieve scored sentences for each answer option and then combining the top 8 unique sentences (across all answer options) to form a paragraph.
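The passage-construction step described above can be sketched as follows. The `retrieve` callable is a hypothetical stand-in for the IR solver; only the "top 8 unique sentences across all options" logic is taken from the text.

```python
def build_passage(question, options, retrieve, k=8):
    """retrieve(question, option) -> list of (score, sentence) pairs."""
    scored = []
    for opt in options:
        scored.extend(retrieve(question, opt))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    passage, seen = [], set()
    for _, sent in scored:
        if sent not in seen:          # keep each sentence once
            seen.add(sent)
            passage.append(sent)
        if len(passage) == k:         # stop at the top k unique sentences
            break
    return " ".join(passage)

def toy_retrieve(question, option):
    data = {"a": [(0.9, "S1."), (0.5, "S2.")],
            "b": [(0.8, "S1."), (0.7, "S3.")]}
    return data[option]

passage = build_passage("q", ["a", "b"], toy_retrieve, k=2)
```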

While the sub-graph optimization could be done over the entire augmented graph in one shot, our current implementation uses multiple simplified solvers, each performing reasoning over augmented graphs for a commonly occurring annotator combination, as listed in Table 3. For all of these annotator combinations, we let the representation of the answers be K(A) = {Shallow-Parse, Tokens}. Importantly, our choice of working with a few annotator combinations is mainly for simplicity of implementation and suffices to demonstrate that reasoning over even just two annotators at a time can be surprisingly powerful. There is no fundamental limitation to implementing SEMANTICILP using one single optimization problem as stated in Eq. (2).

Each simplified solver associated with an annotator combination in Table 3 produces a confidence score for each answer option. We create an ensemble of these solvers as a linear combination of these scores, with weights trained using the union of training data from all question sets.

7 AI2 Science Questions V1 at http://data.allenai.org/ai2science-questions
8 https://nlp.stanford.edu/software/bioprocess
9 These are referred to as "dependency questions" by Berant et al. (2014), and cover around 70% of all questions.

Table 3: The semantic annotator combinations used in our implementation of SEMANTICILP.

Combination | Representation
Comb-1 | K(Q) = {Shallow-Parse, Tokens}; K(P) = {Shallow-Parse, Tokens, Dependency}
Comb-2 | K(Q) = {Verb-SRL, Shallow-Parse}; K(P) = {Verb-SRL}
Comb-3 | K(Q) = {Verb-SRL, Shallow-Parse}; K(P) = {Verb-SRL, Coreference}
Comb-4 | K(Q) = {Verb-SRL, Shallow-Parse}; K(P) = {Comma-SRL}
Comb-5 | K(Q) = {Verb-SRL, Shallow-Parse}; K(P) = {Prep-SRL}
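The ensemble of simplified solvers can be sketched as a weighted vote over per-option confidences. The solver names follow Table 3, but the confidence values and weights here are illustrative, not learned.

```python
def ensemble_choice(solver_scores, weights):
    """
    solver_scores: {solver_name: {option: confidence}}
    weights:       {solver_name: learned weight}
    Returns the option maximizing the linear combination of confidences.
    """
    options = next(iter(solver_scores.values())).keys()
    combined = {
        opt: sum(weights[s] * scores[opt]
                 for s, scores in solver_scores.items())
        for opt in options
    }
    return max(combined, key=combined.get)

choice = ensemble_choice(
    {"Comb-1": {"A": 0.2, "B": 0.7}, "Comb-2": {"A": 0.9, "B": 0.1}},
    {"Comb-1": 0.4, "Comb-2": 0.6},
)
```

In the actual system the weights would be fit on the union of the training sets, e.g., by logistic regression over solver confidences.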

BIDAF (neural network baseline). We use the recent deep learning reading comprehension model of Seo et al. (2016), which is one of the top performing systems on the SQuAD dataset and has been shown to generalize to another domain as well (Min, Seo, and Hajishirzi 2017). Since BIDAF was designed for fill-in-the-blank style questions, we follow the variation used by Kembhavi et al. (2017) to apply it to our multiple-choice setting. Specifically, we compare the predicted answer span to each answer candidate and report the one with the highest similarity.

We use two variants: the original system, BIDAF, pre-trained on 100,000+ SQuAD questions, as well as an extended version, BIDAF', obtained by performing continuous training to fine-tune the SQuAD-trained parameters using our (smaller) training sets. For the latter, we convert multiple-choice questions into reading comprehension questions by generating all possible text spans within sentences, with token length at most the correct answer length + 2, and choose the ones with the highest similarity score with the correct answer. We use the ALLENNLP re-implementation of BIDAF, train it on SQuAD, and then train it further on our dataset. We tried different variations (epochs and learning rates) and selected the model which gives the best average score across all the datasets. As we will see, the variant that was further trained on our data often gives better results.
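The span-to-choice mapping described above can be sketched as follows. Token-overlap F1 stands in for the similarity measure; the exact measure used in the referenced variation is not specified here.

```python
def token_f1(span, candidate):
    """Token-overlap F1 between a predicted span and a candidate answer."""
    a, b = span.lower().split(), candidate.lower().split()
    common = sum(min(a.count(t), b.count(t)) for t in set(a))
    if common == 0:
        return 0.0
    prec, rec = common / len(b), common / len(a)
    return 2 * prec * rec / (prec + rec)

def choose_option(predicted_span, candidates):
    """Pick the answer candidate most similar to the predicted span."""
    return max(candidates, key=lambda c: token_f1(predicted_span, c))

winner = choose_option("Neymar this season. Morata",
                       ["Alvaro Morata", "Neymar", "Romelu Lukaku"])
```

Note that with this particular similarity, the partially correct span from the introduction still maps to a wrong option, illustrating why span-level errors propagate to the multiple-choice setting.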

TUPLEINF (semi-structured inference baseline). Recently proposed by Khot, Sabharwal, and Clark (2017) , this is a state-of-the-art system designed for science questions. It uses Open IE (Banko et al. 2007) tuples derived from the text as the knowledge representation, and performs reasoning over it via an ILP. It has access to a large knowledge base of Open IE tuples, and exploits redundancy to overcome challenges introduced by noise and linguistic variability.

PROREAD and SYNTPROX. PROREAD is a specialized system and the best performing one on the PROCESSBANK question set. Berant et al. (2014) annotated the training data with events and event relations, and trained a system to extract the process structure. Given a question, PROREAD converts it into a query (using regular expression patterns and keywords) and executes it on the process structure as the knowledge base. Its reliance on a question-dependent query generator and on a process structure extractor makes it difficult to apply to other domains.

SYNTPROX is another solver proposed by Berant et al. (2014). It aligns content-word lemmas in both the question and the answer against the paragraph, and selects the answer whose tokens are closest to the aligned question tokens, where distance is measured along dependency-tree edges. To support multiple sentences, it connects the roots of adjacent sentences with bidirectional edges.
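The distance computation underlying SYNTPROX can be sketched as a breadth-first search over dependency edges (with adjacent sentence roots linked). This is a simplified sketch with our own names; the actual solver's alignment and tie-breaking details are not reproduced here.

```python
from collections import deque

def dep_distance(edges, source, target):
    """BFS distance over dependency edges; edges maps a token to its
    (undirected) neighbors, including root-to-root links across sentences."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == target:
            return dist
        for nb in edges.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")

def syntprox_score(edges, question_tokens, answer_token):
    """Total tree distance from an answer token to the aligned question
    tokens; a lower score indicates a more proximate answer."""
    return sum(dep_distance(edges, answer_token, q) for q in question_tokens)
```

An answer candidate is then preferred when its tokens yield the lowest total distance to the aligned question tokens.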

Experimental Results

We evaluate various QA systems on datasets from the two domains. The results are summarized below, followed by some insights into SEMANTICILP's behavior and an error analysis.

Science Exams. The results of experimenting on different grades' science exams are summarized in Table 4 , which shows the exam scores as a percentage. The table demonstrates that SEMANTICILP consistently outperforms the best baselines in each case by 2%-6%.

Table 4: Science test scores as a percentage. On elementary level science exams, SEMANTICILP consistently outperforms baselines. In each row, the best score is in bold and the best baseline is italicized.

Further, there is no absolute winner among the baselines; while IR is good on the 8th grade questions, TUPLEINF and BIDAF' are better on 4th grade questions. This highlights the differing nature of questions for different grades.

Biology Exam. The results on the PROCESSBANK dataset are summarized in Table 5 . While SEMANTICILP's performance is substantially better than most baselines and close to that of PROREAD, it is important to note that this latter baseline enjoys additional supervision of domain-specific event annotations. This, unlike our other relatively general baselines, makes it limited to this dataset, which is also why we don't include it in Table 4 .

Table 5: Biology test scores as a percentage. SEMANTICILP outperforms various baselines on the PROCESSBANK dataset and roughly matches the specialized best method.

We evaluate IR on this reading comprehension dataset by creating an ElasticSearch index, containing the sentences of the knowledge paragraphs.
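The sentence-level retrieval behind this IR evaluation can be mimicked in a few lines. The pure-Python inverted index below, with a simple IDF-weighted overlap score and a naive period-based sentence split, is only a conceptual stand-in for the actual ElasticSearch index; all names are our own.

```python
import math
from collections import Counter

def build_index(paragraphs):
    """Split paragraphs into sentences and build a token -> sentence-ids
    inverted index (a stand-in for the ElasticSearch sentence index)."""
    sentences = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]
    inverted = {}
    for i, sent in enumerate(sentences):
        for tok in set(sent.lower().split()):
            inverted.setdefault(tok, []).append(i)
    return sentences, inverted

def retrieve(sentences, inverted, query, k=3):
    """Rank sentences by an IDF-weighted count of shared query tokens."""
    n = len(sentences)
    scores = Counter()
    for tok in set(query.lower().split()):
        postings = inverted.get(tok, [])
        if postings:
            idf = math.log(n / len(postings))
            for i in postings:
                scores[i] += idf
    return [sentences[i] for i, _ in scores.most_common(k)]
```

Given a question and answer candidate as the query, the top-ranked sentences play the role of the retrieved evidence.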

Error And Timing Analysis

For some insight into the results, we include a brief analysis of our system's output compared to that of other systems. We identify a few main reasons for SEMANTICILP's errors. Not surprisingly, some mistakes (see Appendix figure for an example) can be traced back to failures in generating proper annotation (SemanticGraph). Improvement in SRL modules or redundancy can help address this. Some mistakes arise because the current ILP model does not support the ideal reasoning: the requisite knowledge exists in the annotations, but the reasoning fails to exploit it. Another group of mistakes is due to the complexity of the sentences and the lack of a way to represent the underlying phenomena with our current annotators.

A weakness (that doesn't seem to be particular to our solver) is reliance on explicit mentions. If there is a meaning indirectly implied by the context and our annotators are not able to capture it, our solver will miss such questions. There will be more room for improvement on such questions with the development of discourse analysis systems.

When solving the questions that don't have an attached paragraph, relevant sentences need to be fetched from a corpus. A subset of mistakes on this dataset occurs because the extracted knowledge does not contain the correct answer.

ILP Solution Properties. Our system is implemented using many constraints and linear inequalities that are instantiated for each input instance, so the number of variables and inequalities differs across instances. There is also overhead for pre-processing an input instance and converting it into an instance graph. In the timing analysis below, we ignore annotation time, as it is performed by black-box modules outside our solver. Table 6 summarizes various ILP and support-graph statistics for SEMANTICILP, averaged across PROCESSBANK questions. Next to SEMANTICILP we include numbers for TABLEILP, which has similar implementation machinery but operates on a very different representation. While the size of the model is a function of the input instance, SEMANTICILP tends to have a bigger model on average (in the number of constraints and variables). Model creation is significantly more time-consuming in SEMANTICILP, as it involves many graph-traversal operations and jumps between nodes and edges. We also provide timing statistics for TUPLEINF, which takes roughly half the time of TABLEILP and is thus faster than SEMANTICILP.

Table 6: SEMANTICILP statistics averaged across questions, as compared to TABLEILP and TUPLEINF statistics.

Ablation Study

To better understand the results, we ablate the contribution of different annotation combinations by dropping each combination from the ensemble model in turn and retraining the ensemble on the remaining combinations.
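The ablation procedure just described amounts to a simple leave-one-out loop. The sketch below abstracts retraining and evaluation as hypothetical callables (`retrain`, `evaluate`); it illustrates the bookkeeping only, not the actual training machinery.

```python
def ablation_study(combinations, retrain, evaluate):
    """Retrain the ensemble with each annotation combination dropped in
    turn and report the score change relative to the full system."""
    full_score = evaluate(retrain(combinations))
    deltas = {}
    for comb in combinations:
        rest = [c for c in combinations if c != comb]
        deltas[comb] = evaluate(retrain(rest)) - full_score
    return full_score, deltas
```

Each delta corresponds to one "no Comb-i" row of the ablation table: a large negative value indicates the dropped combination was important for that dataset.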

The results are summarized in Table 7 . While Comb-1 seems to be important for science tests, it has limited contribution to the biology tests. On 8th grade exams, the Verb-SRL and Comma-SRL-based alignments provide high value. Structured combinations (e.g., Verb-SRL-based alignments) are generally more important for the biology domain.

Table 7: Ablation study of SEMANTICILP components on various datasets. The first row shows the overall test score of the full system, while other rows report the change in the score as a result of dropping an individual combination. The combinations are listed in Table 3.

                     AI2PUBLIC 8TH   PROCESSBANK
Full SEMANTICILP          55.9           67.9
no Comb-1                 -3.1           -1.8
no Comb-2                 -2.0           -4.6
no Comb-3                 -0.6           -1.8
no Comb-4                 -3.1           -1.8
no Comb-5                 -0.1           -5.1

Complementarity to IR. Given that in the science domain the input snippets fed to SEMANTICILP are retrieved through a process similar to the IR solver, one might naturally expect some similarity in the predictions. The pie-chart in Figure 3 shows the overlap between mistakes and correct predictions of SEMANTICILP and IR on 50 randomly chosen training questions from AI2PUBLIC 4TH. While there is substantial overlap in questions that both answer correctly (the yellow slice) and both miss (the red slice), there is also a significant number of questions solved by SEMANTICILP but not IR (the blue slice), almost twice as many as the questions solved by IR but not SEMANTICILP (the green slice).

Figure 3: Overlap of the predictions of SEMANTICILP and IR on 50 randomly-chosen questions from AI2PUBLIC 4TH.

Cascade Solvers. In Tables 4 and 5 , we presented one single instance of SEMANTICILP with state-of-the-art results on multiple datasets, where the solver was an ensemble of semantic combinations (presented in Table 3 ). Here we show a simpler approach that achieves stronger results on individual datasets, at the cost of losing a little generalization across domains. Specifically, we create two "cascades" (i.e., decision lists) of combinations, where the ordering of combinations in the cascade is determined by the training-set precision of the simplified solver representing an annotator combination (combinations with higher precision appear earlier). One cascade solver targets science exams and the other the biology dataset.
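The cascade construction can be sketched as follows. Each simplified solver is modeled as a hypothetical function mapping a question to an answer or `None` (abstain); the names and interfaces are our own illustrative choices.

```python
def build_cascade(solvers, train_data):
    """Order solvers by training-set precision, highest first.
    train_data is a list of (question, gold_answer) pairs; each solver
    maps a question to an answer or None if it abstains."""
    def precision(solver):
        answered = [(solver(q), gold) for q, gold in train_data
                    if solver(q) is not None]
        if not answered:
            return 0.0
        return sum(pred == gold for pred, gold in answered) / len(answered)
    return sorted(solvers, key=precision, reverse=True)

def cascade_answer(cascade, question):
    """Return the answer of the first solver in the decision list that
    does not abstain."""
    for solver in cascade:
        ans = solver(question)
        if ans is not None:
            return ans
    return None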

The results are reported in Table 8 . On the 8th grade data, the cascade solver created for science tests achieves higher scores than the generic ensemble solver. Similarly, the cascade solver for the biology domain outperforms the ensemble solver on the PROCESSBANK dataset.

We also analyze the performance of the system as a function of the length of the paragraph fed into SEMANTICILP, for 50 randomly selected training questions from the REGENTS 4TH set. Figure 4 (left) shows the overall system and two combinations introduced earlier, as a function of knowledge length, counted as the number of sentences in the paragraph. As expected, the solver improves with more sentences, until around 12-15 sentences, after which it starts to worsen with the addition of more irrelevant knowledge. While the cascade combinations did not show much generalization across domains, they exhibit a smaller drop than the ensemble solver when irrelevant knowledge is added. This can be explained by the simplicity of cascading and the minimal training involved, compared to the ensemble of annotation combinations. It is worth highlighting that while Comb-1 (blue) often achieves higher coverage and good scores on simple paragraphs (e.g., science exams), it is highly sensitive to knowledge length. On the other hand, highly-constrained combinations perform more consistently as knowledge length increases, at the cost of lower coverage.

Figure 4: Performance change for varying knowledge length.
Table 8: Comparison of test scores of SEMANTICILP using a generic ensemble vs. domain-targeted cascades of annotation combinations.


Question answering is challenging in the presence of linguistic richness and diversity, especially when arriving at the correct answer requires some level of reasoning. Departing from the currently popular paradigm of generating a very large dataset and learning "everything" from it in an end-to-end fashion, we argue, and demonstrate via our QA system, that one can successfully leverage pre-trained NLP modules to extract a sufficiently complete linguistic abstraction of the text that allows answering interesting questions about it. This approach is particularly valuable in settings where there is a small amount of data. Instead of exploiting peculiarities of a large but homogeneous dataset, as many state-of-the-art QA systems end up doing, we focus on confidently performing certain kinds of reasoning, as captured by our semantic graphs and the ILP formulation of support graph search over them. Bringing these ideas to practice, our system, SEMANTICILP, achieves state-of-the-art performance on two domains with very different characteristics, outperforming both traditional and neural models.

Here we provide some details of our reasoning formulation and its implementation as an ILP.

The support graph search for QA is modeled as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints. We note that the ILP objective and constraints aren't tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well-supported answer for a given question.
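The shape of this optimization can be illustrated on a toy instance. The brute-force 0/1 search below stands in for a real ILP solver (the actual system uses an off-the-shelf industrial solver on much larger instances); the example constraints, such as allowing at most one answer candidate and permitting a support node only when an answer is chosen, are simplified illustrations of the general properties described here.

```python
from itertools import product

def solve_ilp(weights, constraints):
    """Brute-force a tiny 0/1 ILP: maximize sum(w_i * x_i) over binary
    variables x, subject to every constraint(x) holding. Feasible only
    for toy instances; it mirrors the structure, not the scale, of the
    real support-graph search."""
    n = len(weights)
    best_x, best_val = None, float("-inf")
    for x in product([0, 1], repeat=n):
        if all(c(x) for c in constraints):
            val = sum(w * xi for w, xi in zip(weights, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# Toy instance: x0, x1 are answer-candidate variables, x2 a support node.
weights = [3, 5, 2]
constraints = [
    lambda x: x[0] + x[1] <= 1,   # at most one answer candidate chosen
    lambda x: x[2] <= x[0] + x[1],  # support requires a chosen answer
]
solution, value = solve_ilp(weights, constraints)
```

On this instance the optimum picks the higher-weight candidate together with its support node, yielding `solution == (0, 1, 1)` with objective value 7.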

Table 9. Not extracted; please refer to original document.
Table 10: Comb-1 (Shallow Alignment)
Table 11. Not extracted; please refer to original document.
Table 12. Not extracted; please refer to original document.
Table 13. Not extracted; please refer to original document.

Our formulation consists of multiple kinds of reasoning, encapsulating our semantic understanding of the types of knowledge (e.g., verbs, corefs, etc.) extracted from the text, the family of graphs used to represent them, and how they interact in order to provide support for an answer candidate. Each kind of reasoning is added to a general body, defined in Table 14 , that is shared among different reasoning types. This general body encapsulates basic requirements such as at most one answer candidate being chosen in the resulting support graph.

Table 14. Not extracted; please refer to original document.

In what follows we delineate various forms of reasoning, capturing different semantic abstractions and valid interactions between them.