Abstract
State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However, does their strong performance on popular multi-hop datasets really justify this added design complexity? Our results suggest that the answer may be no, because even our simple pipeline based on BERT, named Quark, performs surprisingly well. Specifically, on HotpotQA, Quark outperforms these models on both question answering and support identification (and achieves performance very close to a RoBERTa model). Our pipeline has three steps: 1) use BERT to identify potentially relevant sentences independently of each other; 2) feed the set of selected sentences as context into a standard BERT span prediction model to choose an answer; and 3) use the sentence selection model, now with the chosen answer, to produce supporting sentences. The strong performance of Quark resurfaces the importance of carefully exploring simple model designs before using popular benchmarks to justify the value of complex techniques.
1 Introduction
Textual Multi-hop Question Answering (QA) is the task of answering questions by combining information from multiple sentences or documents. This is a challenging reasoning task that requires QA systems to identify relevant pieces of information in the given text and learn to compose them to answer a question. To enable progress in this area, many datasets (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Khot et al., 2020) and models (Min et al., 2019b; Xiao et al., 2019; Tu et al., 2019) with varying complexities have been proposed over the past few years. Our work focuses on HotpotQA (Yang et al., 2018), which contains 105,257 multi-hop questions derived from two Wikipedia paragraphs, where the correct answer is a span in these paragraphs or yes/no. Due to the multi-hop nature of this dataset, it is natural to assume that the relevance of a sentence for a question would depend on the other sentences considered to be relevant. E.g., the relevance of "Obama was born in Hawaii." to the question "Where was the 44th President of the USA born?" depends on the other relevant sentence: "Obama was the 44th President of the US." As a result, many approaches designed for this task focus on jointly identifying the relevant sentences (or paragraphs) via mechanisms such as cross-document attention, graph networks, and entity linking.
Our results question this basic assumption. We show that a simple model, QUARK (see Fig. 1), that first identifies relevant sentences from each paragraph independently of the other paragraphs, is surprisingly powerful on this task: in 90% of the questions, QUARK's relevance module recovers all gold supporting sentences within the top-5 sentences. For QA, it uses a standard BERT (Devlin et al., 2019) span prediction model (similar to current published models) on the output of this module. Additionally, QUARK exploits the inherent similarity between the relevant sentence identification task and the task of generating an explanation given an answer produced by the QA module: it uses the same architecture for both tasks.
We show that this independent sentence scoring model results in a simple QA pipeline that outperforms all other BERT models in both 'distractor' and 'fullwiki' settings of HotpotQA. In the distractor setting (10 paragraphs, including two gold, provided as context), QUARK achieves joint scores (answer and support prediction) within 0.75% of the current state of the art. Even in the fullwiki setting (all 5M Wikipedia paragraphs as context), by combining our sentence selection approach with a commonly used paragraph selection approach (Nie et al., 2019), we outperform all previously published BERT models. In both settings, the only models scoring higher use RoBERTa (Liu et al., 2019), a more robustly trained language model that is known to outperform BERT across various tasks.
While our design uses multiple transformer models (now considered a standard starting point in NLP), our contribution is a simple pipeline without any bells and whistles, such as NER, graph networks, entity linking, etc.
The closest effort to QUARK is by Min et al. (2019a), who also propose a simple QA model for HotpotQA. Their approach selects answers independently from each paragraph to achieve competitive performance on the question-answering subtask of HotpotQA (they do not address the support identification subtask). We show that while relevant sentences can be selected independently, operating jointly over these sentences chosen from multiple paragraphs can lead to state-of-the-art question-answering results, outperforming independent answer selection by several points.
Finally, our ablation study demonstrates that the sentence selection module benefits substantially from using context from the corresponding paragraph. It also shows that running this module a second time, with the chosen answer as input, results in more accurate support identification.
2 Related Work
Most approaches for HotpotQA attempt to capture the interactions between the paragraphs by either relying on cross-attention between documents or sequentially selecting paragraphs based on the previously selected paragraphs.
While Nishida et al. (2019) also use a standard Reading Comprehension (RC) model, they combine it with a special Query Focused Extractor (QFE) module to select relevant sentences for QA and explanation. The QFE module sequentially identifies relevant sentences by updating an RNN state representation in each step, allowing the model to capture the dependency between sentences across time-steps. Xiao et al. (2019) propose a Dynamically Fused Graph Networks (DFGN) model that first extracts entities from paragraphs to create an entity graph, then dynamically extracts subgraphs and fuses them with the paragraph representation. The Select, Answer, Explain (SAE) model (Tu et al., 2019) is similar to our approach in that it also first selects relevant documents and uses them to produce answers and explanations. However, it relies on self-attention over all document representations to capture potential interactions. Additionally, it relies on a Graph Neural Network (GNN) to answer the questions. The Hierarchical Graph Network (HGN) model (Fang et al., 2019) builds a hierarchical graph with three levels (entities, sentences, and paragraphs) to allow for joint reasoning. DecompRC (Min et al., 2019b) takes a completely different approach of learning to decompose the question (using additional annotations) and then answering the decomposed questions using a standard single-hop RC system.

Others such as Min et al. (2019a) have also noticed that many HotpotQA questions can be answered based on just a single paragraph. Our findings are both qualitatively and quantitatively different. They did not consider the support identification task, and showed strong (but not quite SoTA) QA performance by running a QA model independently on each paragraph. We, on the other hand, show that interaction is not essential for selecting relevant sentences but actually valuable for QA! Specifically, by using a context of relevant sentences spread across multiple paragraphs in steps 2 and 3, our simple BERT model outperforms previous models with complex entity- and graph-based interactions on top of BERT. We thus view QUARK as a different, stronger baseline for multi-hop QA.
In the fullwiki setting, each question has no associated context, and models are expected to select paragraphs from Wikipedia. To be able to scale to such a large corpus, the proposed systems often select the paragraphs independently of each other. A recent retrieval method in this setting is Semantic Retrieval (Nie et al., 2019), where paragraphs are first selected based on the question, followed by individual sentences from these paragraphs. However, unlike our approach, they do not use the paragraph context to select the sentences, missing key context needed to identify relevance.
3 Pipeline Model: Quark
Our model works in three steps. First, we score individual sentences from an input set of paragraphs D based on their relevance to the question. Second, we feed the highest-scoring sentences to a span prediction model to produce an answer to the question. Third, we score sentences from D a second time to identify the supporting sentences using the answer. These three steps are implemented using the two modules described next in Sections 3.1 and 3.2.
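For concreteness, the control flow of the pipeline can be sketched as follows. This is only an illustrative Python sketch, not the released implementation: the helper callables (score_sentence, build_context, predict_span, select_support) stand in for the modules described in Sections 3.1 and 3.2.

```python
from typing import Callable, Dict, List, Tuple

def quark_pipeline(
    question: str,
    paragraphs: List[List[str]],   # each paragraph is a list of sentence strings
    score_sentence: Callable,      # (question, paragraph, sent_idx, answer) -> float
    build_context: Callable,       # (paragraphs, scores) -> context string
    predict_span: Callable,        # (question, context) -> answer string
    select_support: Callable,      # (scores) -> list of (paragraph_idx, sent_idx)
) -> Tuple[str, List[Tuple[int, int]]]:
    # Step 1: score every sentence independently of the other paragraphs,
    # with the answer slot masked (the r_na variant).
    scores_na: Dict[Tuple[int, int], float] = {
        (p, s): score_sentence(question, para, s, "[MASK]")
        for p, para in enumerate(paragraphs)
        for s in range(len(para))
    }
    # Step 2: pack the top-scoring sentences into a context and predict a span.
    context = build_context(paragraphs, scores_na)
    answer = predict_span(question, context)
    # Step 3: re-score every sentence, now conditioned on the chosen answer
    # (the r_a variant), and pick the highest-scoring support set.
    scores_a = {
        (p, s): score_sentence(question, para, s, answer)
        for p, para in enumerate(paragraphs)
        for s in range(len(para))
    }
    return answer, select_support(scores_a)
```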
3.1 Sentence Scoring Module
In the distractor setting, HotpotQA provides 10 context paragraphs with a combined average length of 41.4 sentences and 1106 tokens. This is too long for standard language-model-based span prediction: most models scale quadratically with the number of tokens, and some are limited to 512 tokens. This motivates selecting a few relevant sentences E to reduce the size of the input to the span-prediction model without losing important context. In a similar vein, the support identification subtask of HotpotQA also involves selecting a few sentences that best explain the chosen answer. We solve both of these problems with the same transformer-based sentence scoring module, with slight variation in its input.
Our sentence scorer uses the BERT-Large-Cased model (Devlin et al., 2019) trained with whole-word masking, with an additional linear layer over the [CLS] token. Here, whole-word masking refers to a BERT variant that masks entire words instead of word pieces during pre-training.
We score every sentence s from every paragraph p ∈ D independently by feeding the following sequence to the model:

[CLS] question [SEP] p [SEP] answer [SEP]

This sequence is the same for every sentence in the paragraph, but the sentence being classified is indicated using segment IDs: they are set to 1 for tokens from the sentence and to 0 for the rest. If a paragraph has more than 512 tokens, we restrict the input to the first 512. Each annotated support sentence forms a positive example and all other sentences from D form the negative examples. Note that our classifier scores each sentence independently and never sees sentences from two paragraphs at the same time. (See Appendix A.1 for further detail.)
We train two variants of this model: (1) r_na(s) is trained to score sentences given a question but no answer (the answer is replaced with a [MASK] token); and (2) r_a(s) is trained to score sentences given a question and its gold answer. We use r_na(s) for relevant sentence selection and r_a(s) for support identification (Sec. 3.3).
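To make the input format concrete, the sketch below encodes one sentence with the HuggingFace transformers API and scores it with a linear head over the [CLS] token. The checkpoint name, the character-offset bookkeeping, and the (untrained) scoring head are illustrative assumptions, not the authors' released code; the point is how segment IDs single out the sentence being scored.

```python
import torch
from transformers import BertModel, BertTokenizerFast

MODEL_NAME = "bert-large-cased-whole-word-masking"        # assumed checkpoint
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
bert = BertModel.from_pretrained(MODEL_NAME)
score_head = torch.nn.Linear(bert.config.hidden_size, 1)  # trained in practice

def score_sentence(question: str, sentences: list, sent_idx: int,
                   answer: str = "[MASK]") -> float:
    """Score one sentence of a paragraph. The text is the same for every
    sentence; only the segment IDs mark which sentence is being classified."""
    paragraph = " ".join(sentences)
    text = f"{question} [SEP] {paragraph} [SEP] {answer}"
    enc = tokenizer(text, truncation=True, max_length=512,
                    return_offsets_mapping=True, return_tensors="pt")

    # Character span of the target sentence inside `text`.
    start = len(f"{question} [SEP] ") + len(" ".join(sentences[:sent_idx]))
    start += 1 if sent_idx > 0 else 0
    end = start + len(sentences[sent_idx])

    # Segment ID 1 for tokens inside the target sentence, 0 elsewhere.
    token_type_ids = torch.zeros_like(enc["input_ids"])
    for i, (cs, ce) in enumerate(enc["offset_mapping"][0].tolist()):
        if cs >= start and ce <= end and ce > cs:
            token_type_ids[0, i] = 1

    with torch.no_grad():
        out = bert(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"],
                   token_type_ids=token_type_ids)
    cls_vec = out.last_hidden_state[:, 0]    # [CLS] representation
    return score_head(cls_vec).item()        # logit: higher = more relevant
```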
3.2 Question Answering Module
To find answers to questions, we use Wolf et al. (2019)'s implementation of Devlin et al. (2019)'s span prediction model. To achieve our best score, we use their BERT-Large-Cased model with whole-word masking and SQuAD (Rajpurkar et al., 2016) fine-tuning.[1] We fine-tune this model on the HotpotQA dataset with input QA context E from r_na(s). Since BERT models have a hard limit of 512 word-pieces, we use r_na(s) to select the most relevant sentences that can fit within this limit, as described next. (See Appendix A.2 for training details.)
To accomplish this, we compute the score r_na(s) for each sentence in the input D. Then we add sentences in decreasing order of their scores to the QA context E, until we have filled no more than 508 word-pieces (incl. question word-pieces). For every new paragraph considered, we also add its first sentence and the title of the article (enclosed in special marker tokens).
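A minimal sketch of this packing step and of span extraction, assuming the HuggingFace transformers API and a SQuAD-fine-tuned whole-word-masking checkpoint; the exact packing order, tie-breaking, and title markers are our assumptions for illustration (see Appendix A.2 for the actual training setup).

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

CKPT = "bert-large-cased-whole-word-masking-finetuned-squad"  # assumed checkpoint
qa_tokenizer = BertTokenizerFast.from_pretrained(CKPT)
qa_model = BertForQuestionAnswering.from_pretrained(CKPT)

def build_context(paragraphs, scores, question, budget=508):
    """paragraphs: list of (title, sentences); scores: {(p, s): r_na score}."""
    used = len(qa_tokenizer.tokenize(question))
    picked, seen = set(), set()
    for (p, s), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        new = {(p, s)} - picked
        if p not in seen:
            new |= {(p, 0)}                     # first sentence of a new paragraph
        cost = sum(len(qa_tokenizer.tokenize(paragraphs[q][1][i])) for q, i in new)
        if p not in seen:
            cost += len(qa_tokenizer.tokenize(paragraphs[p][0]))   # article title
        if used + cost > budget:
            break                               # the 508 word-piece budget is full
        used += cost
        picked |= new
        seen.add(p)
    # Re-assemble in the original order: title, then selected sentences.
    parts = []
    for p, (title, sentences) in enumerate(paragraphs):
        if p in seen:
            parts.append(title)
            parts.extend(s for i, s in enumerate(sentences) if (p, i) in picked)
    return " ".join(parts)

def predict_span(question, context):
    enc = qa_tokenizer(question, context, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = qa_model(**enc)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    # A real implementation would constrain end >= start and exclude the question.
    return qa_tokenizer.decode(enc["input_ids"][0][start:end + 1])
```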
3.3 Bringing It Together: Distractor Setting
Given a question along with 10 distractor paragraphs D, we use the r_na(s) variant of our sentence scoring module to score each sentence s in D, again without looking at other paragraphs. In the second step, the selected sentences are fed as context E into the QA module (as described in Section 3.2) to choose an answer. In the final step, to find sentences supporting the chosen answer, we use r_a(s) to score each sentence in D, this time with the chosen answer as part of the input.[2] We define the score n(S) of a set of sentences S ⊂ D to be the sum of the individual sentence scores; that is, n(S) = Σ_{s ∈ S} r_a(s).[3] In HotpotQA, supporting sentences always come from exactly two paragraphs. We compute this score for all possible S satisfying this constraint and take the highest-scoring set of sentences as our support.
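A sketch of this search, under our reading that each of the two chosen paragraphs must contribute at least one sentence; since r_a(s) is a logit and can be negative (footnote [3]), only positive-scoring sentences ever increase n(S), which keeps the enumeration trivial. The data layout is an assumption for illustration.

```python
from itertools import combinations

def select_support(scores_a):
    """scores_a: {(paragraph_idx, sentence_idx): r_a logit}.
    Returns the set S maximizing n(S) = sum of r_a(s), with S drawn from
    exactly two paragraphs (the HotpotQA constraint)."""
    by_para = {}
    for (p, s), score in scores_a.items():
        by_para.setdefault(p, []).append(((p, s), score))

    def best_subset(sents):
        # Positive logits always help the sum; if none are positive we still
        # must keep at least one sentence from the paragraph, so take the best.
        positive = [(k, v) for k, v in sents if v > 0]
        return positive if positive else [max(sents, key=lambda kv: kv[1])]

    best_set, best_score = [], float("-inf")
    for p1, p2 in combinations(by_para, 2):
        chosen = best_subset(by_para[p1]) + best_subset(by_para[p2])
        total = sum(v for _, v in chosen)
        if total > best_score:
            best_set, best_score = [k for k, _ in chosen], total
    return best_set
```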
3.4 Bringing It Together: Fullwiki Setting
Since there are too many paragraphs in the fullwiki setting, we use paragraphs from the SR-MRS system (Nie et al., 2019) as our context D for each question. On the Dev set, we found QUARK to perform best with a paragraph score threshold of −8.0 in SR-MRS. Neither the sentence scorers r_na(s) and r_a(s) nor the QA module were retrained in this setting.
4 Experiments
We evaluate on both the distractor and fullwiki settings of HotpotQA with the following goal: Can a simple pipeline model outperform previous, more complex, approaches? We present the EM (Exact Match) and F1 scores on the evaluation metrics proposed for HotpotQA:
(1) answer selection, (2) support selection, and (3) joint score.

Table 1 shows that on the distractor setting, QUARK outperforms all previous models based on BERT, including HGN, which like us also uses whole-word masking for contextual embeddings. Moreover, we are within 1 point of models that use RoBERTa embeddings, a much stronger language model that has shown improvements of 1.5 to 6 points in previous HotpotQA models.
QUARK also performs better than the recent single-paragraph approach for the QA subtask (Min et al., 2019a) by 14 F1 points. While most of this gain comes from using a larger language model, QUARK scores 2 points higher even with a language model of the same size (BERT-Base).
We observe a similar trend in the fullwiki setting (Table 2), where QUARK again outperforms previous approaches (except HGN with RoBERTa). While we rely on retrieval from SR-MRS (Nie et al., 2019) for our initial paragraphs, we outperform the original work. We attribute this improvement to two factors: our sentence selection capitalizing on the sentence's paragraph context, leading to better support selection; and a better span selection model, leading to improved QA.

Table 3: Ablation study on sentence selection in the distractor setting. Top-n indicates the number of sentences required to cover the annotated support sentences in 90% of the questions.
4.1 Ablation
To evaluate the impact of context on our sentence selection model in isolation, we look at the number of sentences that score at least as high as the lowest-scoring annotated support sentence. In other words, this is the number of sentences we must send to the QA model to ensure all annotated support is included. Table 3 shows that providing the model with the context from the paragraph gives a substantial boost on this metric, bringing it down from 10 to only 6 when using BERT-Base (an oracle would need 3 sentences). It further shows that this boost carries over to the downstream tasks of span selection and choosing support sentences (improving it by 9 points to 83%). Finally, the table shows the value of running the sentence selection model a second time: with BERT-Large, r_a(s) outperforms r_na(s) by 1.62% on the Support F1 metric.

Looking deeper, we analyzed the accuracy of our third stage, r_a(s), as a function of the correctness of the QA stage. When QA finds the correct gold answer, r_a(s) obtains the right support in 65.9% of the cases. If the answer from QA is incorrect, the success rate of r_a(s) is only 50.9%.
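For reference, the coverage metric above can be computed per question with a few lines; the dictionary layout of scores and gold-support keys below is an assumption for illustration.

```python
def sentences_needed(scores, gold_support):
    """Number of sentences scoring at least as high as the lowest-scoring
    annotated support sentence, i.e. how many top-ranked sentences must be
    sent to the QA model to guarantee all gold support is included.
    scores: {(p, s): relevance score}; gold_support: iterable of (p, s) keys."""
    threshold = min(scores[key] for key in gold_support)
    return sum(1 for score in scores.values() if score >= threshold)
```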
5 Conclusion
Our work shows that on the HotpotQA tasks, a simple pipeline model can do as well as or better than more complex solutions. Powerful pre-trained models allow us to score sentences one at a time, without looking at other paragraphs. By operating jointly over these sentences chosen from multiple paragraphs, we arrive at answers and supporting sentences on par with state-of-the-art approaches. This result shows that retrieval in HotpotQA is not itself a multi-hop problem, and suggests focusing on other multi-hop datasets to demonstrate the value of more complex techniques.
[1] While we use the model fine-tuned on SQuAD, ablations show that this only adds 0.2% to the final score.
[2] We simply append the answer string to the question even if it is "yes" or "no".
[3] Note that r_a(s) is the logit score and can be negative, so adding a sentence may not always improve this score.