IIRC: A Dataset of Incomplete Information Reading Comprehension Questions


Abstract

Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system's performance at identifying a potential lack of sufficient information and locating sources for that information. To fill this gap, we present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear. This process also yielded many questions without answers, as well as questions that require discrete reasoning, increasing the difficulty of the task. We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset, finding that it achieves 31.1% F1 on this task, while estimated human performance is 88.4%. The dataset, code for the baseline system, and a leaderboard can be found at this https URL.

1 Introduction

Humans often read text with the goal of obtaining information. Given that a single document is unlikely to contain all the information a reader might need, the reading process frequently involves identifying the information present in the given document and what is missing, followed by locating a different source that could potentially contain the missing information.

Figure 1: An example from IIRC. At the top is a context paragraph which provides only partial information required to answer the question. The bold spans in the context indicate links to other Wikipedia pages. The colored boxes below the question show snippets from four of these pages that provide the missing information for answering the question. The answer is the underlined span.


Most recent reading comprehension tasks, such as SQuAD 2.0 (Rajpurkar et al., 2018), DROP (Dua et al., 2019b), or Quoref (Dasigi et al., 2019), evaluate models in a simpler setup where all the information required to answer the questions (including judging them as unanswerable) is provided in the associated contexts. While this setup has led to significant advances in reading comprehension (Ran et al., 2019; Zhang et al., 2020), the tasks are still limited because they do not evaluate the ability of models to identify precisely what information, if any, is missing to answer a question, and where that information might be found.

On the other hand, open-domain question answering tasks (Chen et al., 2017; Joshi et al., 2017; Dhingra et al., 2017) present a model with a question by itself, requiring the model to retrieve relevant information from some corpus. However, this approach loses grounding in a particular passage of text, and it has so far been challenging to collect diverse, complex questions in this setting.

Alternatively, complex questions grounded in context can be converted to open-domain or incomplete-information QA datasets such as HotpotQA (Yang et al., 2018). However, such datasets do not capture the information-seeking questions that arise from reading a single document with partial information (Min et al., 2019b; Chen and Durrett, 2019).

We present a new dataset of incomplete information reading comprehension questions, IIRC, to address both of these limitations. IIRC is a crowdsourced dataset of 13441 questions over 5698 paragraphs from English Wikipedia, with most of the questions requiring information from one or more documents hyperlinked to the associated paragraphs, in addition to the original paragraphs themselves. Our crowdsourcing process (Section 2) ensures the questions are naturally information-seeking by decoupling the question and answer collection pipelines. Crowd workers are instructed to ask follow-up questions after reading a paragraph, giving links to pages where they would expect to find the answer. This process yields questions like the one shown in Figure 1. As the example illustrates, this setup produces questions requiring complex reasoning, with an estimated 39% of the questions in IIRC requiring discrete reasoning. Moreover, 30% of the questions in IIRC require more than one linked document in addition to the original paragraph, and 30% of them are unanswerable even given the additional information. When present, the answers are either extracted spans, binary (yes/no) answers, or values resulting from numerical operations.

To evaluate the quality of the data, we run experiments with a modified version of NumNet+ (Ran et al., 2019), a state-of-the-art model on DROP (Dua et al., 2019b), chosen because a significant portion of questions in IIRC require numerical reasoning similar to that found in DROP. Because DROP uses only a single paragraph of context, we add a two-stage pipeline to retrieve the necessary context for the model from the linked articles. The pipeline first identifies which links are pertinent, and then selects the most relevant passage from each of those links, concatenating them to serve as input for the model (Section 3). This baseline achieves an F1 score of 31.1% on IIRC, while the estimated human performance is 88.4% F1. Even giving the model oracle pipeline components results in a performance of only 70.3%. Taken together, these results show that substantial modeling work is needed both to identify and retrieve missing information, and to combine the retrieved information to answer the question (Section 4). We additionally perform a qualitative analysis of the data, and find that the errors of the baseline model are evenly split between retrieving incorrect information, identifying unanswerable questions, and reasoning over the retrieved information.

By construction, all examples in IIRC require identifying missing information. Even though current model performance is quite low, a model trained on this data could theoretically leverage that fact to achieve artificially high performance on test data, because it does not have to first determine whether more information is needed. To account for this issue, we additionally sample questions from SQuAD 2.0 (Rajpurkar et al., 2018) and DROP (Dua et al., 2019b), which have similar question language to what is in IIRC, putting forward this kind of combined evaluation as a challenging benchmark for the community. Predictably, our baseline model performs substantially worse in this setting, reaching only 28% F1 on the IIRC portion of this combined evaluation (Section 5).

2 Building IIRC

We used Wikipedia to build IIRC and relied on the fact that entities in Wikipedia articles are linked to other articles about those entities, providing more information about them. Our goals were to build a dataset with naturally information-seeking questions anchored in paragraphs with incomplete information, such that identifying the location of missing information is non-trivial, and answering the questions would require complex cross-document reasoning.

We ensured that the questions are information-seeking by separating the question and answer collection processes, and by not providing the question writers access to the contexts where the answers occur. This process also ensured that the questions have minimal lexical overlap with the answer contexts. We used Wikipedia paragraphs with many outgoing links to increase the difficulty of identifying the articles that provide the missing information.

Table 1: Frequency of different types of retrieval, reasoning, and answers that appear in IIRC.

To ensure complex cross-document reasoning, we asked the crowd workers to create questions that need information from the seed paragraph as well as from one or more linked articles. This constraint resulted in questions that are answerable neither from the original paragraph alone, nor from any one of the linked articles alone, often requiring three or more passages to answer. The remainder of this section describes our data collection process.

2.1 Seed Paragraphs

We started by collecting paragraphs from Wikipedia articles containing ten or more links to other Wikipedia articles. This resulted in roughly 130K passages. We then created two separate crowdsourcing tasks on Amazon Mechanical Turk (www.mturk.com): one for collecting questions, and one for collecting answers. Workers for each task were chosen based on a qualification task. Their submissions were manually inspected, and those who produced high-quality questions and correct answers, respectively, continued to work on the main annotation tasks.

2.2 Collecting Questions

Given a paragraph with links to other articles highlighted, crowd workers were tasked with writing questions that require information from the paragraph, as well as from one or more of the linked articles. Workers could see the links and the titles of the articles they pointed to, but not the contents of the linked articles. Since the linked articles were not provided, the workers were asked to frame questions based on the information they thought would be contained in those articles. For each human intelligence task (HIT), workers were presented with a collection of ten paragraphs and were asked to write a total of ten questions using any of those paragraphs, with two questions requiring following two or more links. For example, given a passage about an actor that mentions Rasulala had roles in Cool Breeze (1972), Blacula (1972), and Willie Dynamite (1973), an example of a question requiring multiple links would be: How many different directors did Rasulala work with in 1972?

In order to minimize questions with shortcut reasoning, we provided workers with extensive instructions along with examples of good and bad questions. Examples of bad questions included questions that did not require any links, such as "Who did the Arakanese kings compare themselves to?" when the context included "They compared themselves to Sultans"; and questions that did not require information from the original passage, such as "What was Syed Alaol's most famous work?" when the context included "Syed Alaol was a renowned poet."

In addition to writing questions, workers also provided the context from the original paragraph that they thought would be necessary to answer the question, as well as the links they expected to contain the remaining necessary information. Workers were paid $4.00 per set of ten questions, and reported taking 25 minutes on average, coming out to $9.60 per hour. 40 workers passed the qualification and worked on the main task.

2.3 Collecting Answers

For the answer task, workers were given a collection of ten questions, their respective original paragraphs, and the context/links selected by the question writer. For each paragraph, workers were able to see the links and could follow them to view the text of the linked document, not including tables or images.

They were then asked to select an answer of one of four types: a span of text from either the question or a document, a number and unit, yes/no, or no answer. For answerable questions, i.e., any of the first three types, they were additionally asked to provide the minimal context span(s) necessary to answer the question. For unanswerable questions, there is typically no explicit indication that the answer is not given, so no such context can be provided. For example, the following question was written for a passage about a ship called the Italia: Who was the mayor of New York City when Italia was transferred to Genoa-NYC? Following the link to New York City mentions the current mayor, but not past mayors, making the question unanswerable.

Annotators were also given the option of labeling a question as bad if it did not make sense, and these bad questions were then filtered out. This could happen, for example, if the question writer misinterpreted the passage, as in the following question written about a horse, Crystal Ocean, and St Leger, which the writer thought was a horse but which is actually a horse race: Is Crystal Ocean taller than St Leger? Additionally, a small percentage of questions that can be answered from the original paragraph alone were also marked as bad.

For the training set, comprising 80% of the data, each question was answered by a single annotator. For the development and test sets, comprising 10% each, three annotators answered each question, and only questions where at least two annotators agreed on the answer were kept. Workers were paid $3.00 per set of ten answers, and reported taking 20 minutes on average, coming out to $9.00 per hour. 33 workers passed the qualification and worked on the main task.

2.4 Dataset Analysis

In Figure 2 we show some examples from IIRC, labeled with different kinds of processing required to solve them. The types are described in detail in Table 1. These types and percentages were computed from a manual analysis of 100 examples.

Figure 2: Examples from IIRC, labeled with what kinds of processing are required to answer each question. See Table 1 for more details. The passages on the left are the original passage, with bold spans indicating links. The highlighted sections contain the necessary context found in linked articles. Purple highlights indicate either the answer, for the second question, or the information used to compute the answer.

In Table 2 we provide some global statistics of the dataset. In total, there are 13441 questions over 5698 passages. Each passage contains an average of 14.5 outgoing links. Using the context provided by the answer annotators, we are able to compute a distribution of the number of links required to answer questions in the dataset, included in Table 4 . While the majority of questions require information from only one linked document in addition to the original paragraph, 30% of questions require two or more, with some requiring reasoning over as many as 12 documents to reach the answer. This variability in the number of context documents adds an extra layer of complexity to the task.

Table 2: Statistics of IIRC.
Table 4: Exact match and F1 of the baseline model on the IIRC dev set broken down by number of links necessary to answer the question. The numbers in parentheses are the percentage of questions in the full dataset that require that number of context documents.

We also analyzed the initial trigrams of questions to quantify the diversity of questions in the dataset. We found that the most common types of questions, those related to time (e.g., "How old was", "How long did"), make up 15% of questions. There are 3.5K different initial trigrams across the 10.8K questions in the training set.
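
As a rough illustration of how such a prefix analysis can be reproduced (assuming the questions are available as a list of strings; the simple whitespace tokenization here is an assumption and may differ from the analysis above):

```python
from collections import Counter

def initial_trigrams(questions):
    """Count question-initial trigrams such as ('how', 'old', 'was')."""
    counts = Counter()
    for question in questions:
        tokens = question.lower().split()
        if len(tokens) >= 3:
            counts[tuple(tokens[:3])] += 1
    return counts

counts = initial_trigrams([
    "How old was the director when the film premiered?",
    "How long did the siege last?",
])
print(len(counts), "distinct initial trigrams")   # 2 for this toy input
```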

3 Modeling IIRC

3.1 Task Overview

Formally, a system tackling IIRC is provided with the following inputs: a question $Q$; a passage $P$; a set of links contained in the passage, $L = \{l_i\}_{i=1}^{N}$; and the set of articles those links lead to, $A = \{a_i\}_{i=1}^{N}$. The surface form of each link $l_i$ is a sequence of tokens in $P$ and is linked to an article $a_i$. The target output is either a number, a sequence of tokens from one of $P$, $Q$, or $a_i$, Yes, No, or NULL (for unanswerable questions).
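
This input/output structure can be pictured as a small record type; the field names below are illustrative and are not the released data format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Link:
    surface_form: str        # the token span l_i as it appears in P
    article_text: str        # the linked article a_i

@dataclass
class IIRCInstance:
    question: str            # Q
    passage: str             # P, the original paragraph
    links: List[Link]        # L and A, the links and their articles
    answer: Optional[str]    # a number, a span, "yes"/"no", or None for unanswerable
```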

3.2 Baseline Model

To evaluate the difficulty of IIRC, we construct a baseline model adapted from a state-of-the-art model built for DROP. We choose a DROP model due to the inclusion of numerical reasoning questions in our dataset. Because the model was not originally used for data requiring multiple paragraphs and retrieval, we first predict relevant context to serve as input to the QA model using a pipeline with three stages:

1. Identify relevant links.
2. Select passages from the linked articles.
3. Pass the concatenated passages to a QA model.
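
A hedged sketch of how these three stages might compose; score_link, select_best_window, and qa_model are placeholder callables standing in for the components described below, not functions from our released code:

```python
def answer_question(question, passage, links,
                    score_link, select_best_window, qa_model, threshold=0.5):
    """Illustrative glue code for the three-stage pipeline."""
    # Stage 1: keep links whose predicted relevance exceeds the threshold.
    relevant = [link for link in links if score_link(question, passage, link) > threshold]
    # Stage 2: take the highest-scoring window from each linked article.
    contexts = [select_best_window(question, link.article_text) for link in relevant]
    # Stage 3: concatenate the retrieved contexts and hand everything to the QA model.
    return qa_model(question, passage, " ".join(contexts))
```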

3.2.1 Identifying Links

To identify the set of relevant links, $L'$, in a passage $P$ for a question $Q$, the model first encodes the concatenation of the question and original passage using BERT (Devlin et al., 2019). It then concatenates the encoded representations of the first and last tokens of each link as input to a scoring function, following the span classification procedure used by Joshi et al. (2020), selecting any links that score above a threshold $g$:

$$\mathbf{P} = \text{BERT}([Q \,\|\, P])$$
$$\text{Score}(l) = f([\mathbf{p}_i ; \mathbf{p}_j]), \qquad l = (p_i \ldots p_j,\ a)$$
$$L' = \{\, l : \text{Score}(l) > g \,\}$$

where $l$ is a link covering tokens $p_i \ldots p_j$ and linking to article $a$, and $\mathbf{p}_i$, $\mathbf{p}_j$ are the corresponding rows of $\mathbf{P}$.
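
A minimal sketch of this scoring function using the HuggingFace transformers library; the boundary-token pooling follows the description above, but the exact architecture, tokenization, and batching details of our implementation are not reproduced here, and the example span indices are arbitrary:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class LinkScorer(nn.Module):
    """Score each link span from the concatenated encodings of its boundary tokens."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.score = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, question, passage, link_spans, tokenizer):
        # Encode [Q || P]; link_spans are (start, end) token indices into that sequence.
        inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
        states = self.encoder(**inputs).last_hidden_state[0]      # (seq_len, hidden)
        reps = torch.stack([torch.cat([states[s], states[e]]) for s, e in link_spans])
        return self.score(reps).squeeze(-1)                       # one score per link

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
scorer = LinkScorer()
scores = scorer("Who directed Blacula?", "He starred in Blacula (1972).", [(9, 10)], tokenizer)
selected = scores > 0.5    # threshold g = 0.5, as in Section 4.2
```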

3.2.2 Selecting Context

Given the set $L'$ from the previous step, the model must then select relevant context passages from the linked documents. For each document, it first splits the document into overlapping windows (see Section 4.2 for details), $w_0, w_1, \ldots, w_n$. Each window is then concatenated with the question, prepended with a [CLS] token, and encoded with BERT. The encoded [CLS] tokens are then passed through a linear predictor to score each window, and the highest-scoring sections from each document are concatenated to form the context for the final model, $C$:

$$c_{a_i} = \max_{w_j \in \text{Split}(a_i)} f(\text{BERT}([Q \,\|\, w_j]))$$
$$C = [\, c_{a_i} : a_i \in L' \,]$$
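
A sketch of the window-selection step, keeping the window that achieves the maximum score above. Here the encoder and scorer would be the BERT model and linear layer described in Section 4.2, passed in as arguments; the details are illustrative:

```python
import torch

def select_best_window(question, windows, encoder, scorer, tokenizer):
    """Return the window c_{a_i} with the highest [CLS]-based relevance score."""
    best_window, best_score = None, float("-inf")
    with torch.no_grad():
        for window in windows:
            inputs = tokenizer(question, window, return_tensors="pt", truncation=True)
            cls = encoder(**inputs).last_hidden_state[:, 0]   # encoded [CLS] token
            score = scorer(cls).item()                        # linear layer + sigmoid
            if score > best_score:
                best_window, best_score = window, score
    return best_window
```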

3.2.3 QA Model

As mentioned above, the final step in the pipeline passes the concatenated context, along with the question and a selected window from the original passage, as input to a QA model. For our experiments, we use NumNet+, because it is the best-performing model on the DROP leaderboard with publicly available code. At a high level, NumNet+ encodes the input using RoBERTa (Liu et al., 2019) together with a numerical reasoning component. It then passes these into a classifier to determine the type of answer expected by the question, which we modified by adding binary and unanswerable as additional answer types. This model is trained using the gold context for answerable questions and the predicted context for unanswerable questions, because, by definition, unanswerable questions do not have annotated answer context.

4.1 Evaluation Metrics

We use two evaluation metrics to compare model performance: Exact-Match (EM) and a numeracy-focused (macro-averaged) F1 score, which measures overlap between bag-of-words representations of the gold and predicted answers. Due to the number of numeric answers in the data, we follow the evaluation methods used by DROP (Dua et al., 2019b). Specifically, we employ the same implementation of Exact-Match accuracy as used by SQuAD (Rajpurkar et al., 2016), which removes articles and performs other simple normalization, and our F1 score is based on that used by SQuAD. We define F1 to be 0 when there is a number mismatch between the gold and predicted answers, regardless of other word overlap. When an answer has multiple spans, we first perform a greedy one-to-one alignment based on bag-of-words overlap between the sets of spans and then compute the average F1 over the aligned spans. For numeric answers, we ignore the units. Binary and unanswerable questions are both treated as span questions: in the unanswerable case, the answer is a special NONE token, and in the binary case, the answer is either yes or no.
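
A simplified sketch of this answer-level F1, with SQuAD-style normalization and the number-mismatch rule; the official evaluation script additionally handles multi-span alignment, decimals, and other edge cases not reproduced here:

```python
import re
import string

def normalize(text):
    """Lowercase, strip articles and punctuation (SQuAD-style), split into a bag of words."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()

def answer_f1(gold, predicted):
    """Bag-of-words F1 that is 0 whenever the numbers in the two answers differ."""
    gold_bag, pred_bag = normalize(gold), normalize(predicted)
    gold_nums = {t for t in gold_bag if t.isdigit()}
    pred_nums = {t for t in pred_bag if t.isdigit()}
    if gold_nums != pred_nums:
        return 0.0
    common = sum(min(gold_bag.count(t), pred_bag.count(t)) for t in set(gold_bag))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_bag), common / len(gold_bag)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("4 years", "four years"))        # 0.0: the numbers do not match
print(answer_f1("the Pacific Ocean", "Pacific")) # 0.67: partial span overlap
```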

4.2 Implementation Details

For the link selection model, we initialized the encoder with pretrained BERT-base, and fine-tuned it during training. For the scoring function, we used a single linear layer with a sigmoid activation function. The model was trained using Adam, and the score threshold to select links was set to 0.5. Additionally, we truncated any passages longer than 512 tokens to 512. This occurred in less than 1% of the data. This model is trained using a cross-entropy objective with the information provided in the gold context by annotators. Any links pointing to articles with an annotated context span are labeled 1, and all other links are labeled 0.
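
For concreteness, the supervision for this step can be sketched as follows; using article titles as keys is an illustrative choice, not the released annotation format:

```python
import torch

def link_targets(link_titles, gold_context_titles):
    """1.0 for links whose article contains an annotated gold context span, else 0.0."""
    return torch.tensor([1.0 if title in gold_context_titles else 0.0
                         for title in link_titles])

# Toy example: the annotated context came from the "Blacula" article only.
targets = link_targets(["Blacula", "Cool Breeze", "Willie Dynamite"], {"Blacula"})
predicted = torch.tensor([0.9, 0.4, 0.2])   # sigmoid outputs from the link scorer
loss = torch.nn.functional.binary_cross_entropy(predicted, targets)
```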

For the passage selection model, we again initialized the encoder with pretrained BERT-base and fine-tuned it during training. We set the window size such that the concatenation of all selected contexts, along with the question and a selection from the original passage, has a maximum length of 512 tokens. More specifically, given the number of links $N_l$ selected in the previous step and a question with $N_Q$ tokens, we set the window size to $\frac{512 - N_Q}{N_l + 1}$. We set the stride to $\frac{1}{4}$ of the window size; i.e., if the first window contains tokens [0, 200], the second window contains tokens [50, 250]. We used a single linear layer with a sigmoid activation as the scoring function. We train this model with a cross-entropy objective, using the gold context provided by annotators: sections that contain the entirety of the annotated context are labeled 1, and all other sections are labeled 0.
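
The window arithmetic above, as a small worked example (the use of integer division is an assumption about rounding):

```python
def window_and_stride(num_question_tokens, num_links, max_length=512):
    """Window size (512 - N_Q) / (N_l + 1), with a stride of a quarter window."""
    window = (max_length - num_question_tokens) // (num_links + 1)
    stride = max(1, window // 4)
    return window, stride

# A 24-token question with 3 selected links gives 122-token windows and a stride
# of 30 tokens, so consecutive windows overlap by roughly three quarters.
print(window_and_stride(24, 3))   # (122, 30)
```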

For NumNet+, we followed the hyperparameter and training settings specified in the original paper (Ran et al., 2019). We trained the model on the gold context provided by annotators when available, i.e., for answerable questions, and on predicted context from the previous steps otherwise.

Table 3 presents the performance of the baseline model. It additionally shows the results of using gold information at each stage of the pipeline, as well as human performance computed on a subset of 200 examples from the test set. The model achieves 31.1% F1, which is well below the human performance of 88.4%. Even with the benefit of the gold input, there is still room for improvement in reasoning over multiple contexts, as performance remains 18% absolute below human levels. The model does a good job of predicting the relevant links, as evidenced by the fact that using the gold links only improves performance by 1 point, but it still struggles to identify the appropriate context within the linked documents. This is likely because annotators could not see the linked context when the questions were written, which makes this step more difficult: the model is not given surface-level lexical cues in the question that it could use to easily select the appropriate context.

Analysis of Number of Linked Documents Table 4 shows the results of running the full pipeline, broken down according to the number of linked documents required to answer the question. These performance differences are the result of a few factors. First, the more links required to answer a question, the more chances there are for a failure to retrieve the necessary information; this is exacerbated by the pipeline nature of our baseline model. However, the spike in performance for questions requiring four or more links is caused by the number of unanswerable questions: nearly half of the questions in that category are unanswerable, and the model largely predicts No Answer on those questions. Finally, the distribution of question types differs conditioned on the number of links. Questions that require more links often also require some form of discrete reasoning, which is more difficult for the model to handle.

Table 3: Baseline and oracle results on IIRC. Human evaluation was obtained from a subset of 200 examples from the test set. We evaluate the model when given oracle links (L) and retrieved contexts (C). Retrieving the correct contexts is a significant challenge, but even given oracle contexts there is a substantial gap between model and human performance.
Table 6: Precision, recall, and F1 of identifying unanswerable questions in the dev set with various baselines that use different combinations of the question, original passage, and predicted context.


Analyzing Different Answer Types Table 5 shows the performance broken down according to the type of answer each question has. The model performs worst on questions with numeric answers. This is because these questions often require the model to do arithmetic, which, as discussed above, the model struggles with relative to other types of questions.

Table 6 shows how well a simple model can identify unanswerable questions with varying amounts of information. We set this up as a binary prediction, answerable or not, and use a linear classifier that takes the BERT [CLS] token as input. We also include the result of always predicting unanswerable as a baseline. When the model can only see the question, it improves over this baseline by around 10 F1, meaning that there is some signal in the question alone, without any context. Some types of questions are more likely to be unanswerable, such as those asking for information about a specific year, e.g., What was the population of New York in 1989? This is caused by Wikipedia generally including current statistics but not the corresponding figures for all previous years. Additionally, adding the original passage does not significantly improve performance. This is not surprising, as the original passage always contains information relevant to the question, and the question annotators could see that text when writing the question.
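
A sketch of such a probe; the exact classifier and input formatting we used are not reproduced here, and the choices below are illustrative:

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class AnswerabilityProbe(nn.Module):
    """Linear classifier over the BERT [CLS] encoding of (question, optional context)."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, question, context=None):
        # context may be omitted (question only), the original passage, or the
        # passage concatenated with the predicted retrieved context.
        inputs = self.tokenizer(question, context, return_tensors="pt", truncation=True)
        cls = self.encoder(**inputs).last_hidden_state[:, 0]
        return torch.sigmoid(self.classifier(cls)).squeeze(-1)   # probability of answerable
```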

Table 5: Exact match and F1 of the baseline model on the IIRC dev set broken down by answer type. F1 equals EM for non-span types, so is not repeated.

4.4 Error Analysis

In order to better understand the challenges of the dataset, we manually analyzed 100 erroneous predictions by the model.

Incorrect context (39%) These are cases where the model identified the correct links but selected the wrong portion of the linked document. It often selects semantically similar context but misses the crucial information, e.g., selecting a duration instead of an end date.

Modeling errors (32%) These are the cases in which the context passed to the final QA model contained all of the necessary information, but the model failed to predict the correct answer. This occurred most commonly for questions requiring math, with the model including unnecessary dates in the computation, resulting in predictions that were orders of magnitude off. For example, predicting -1984 when the question was asking for the age of a person.

Identifying unanswerable questions (24%) In these cases, the QA model was provided with related context that was missing crucial information, similar to the first class of errors. However, here the full articles also did not contain the necessary information. The model often selected a related entity; e.g., for a question asking In which ocean is the island nation located?, the model predicted the island nation, Papua New Guinea, as opposed to the ocean, which was not mentioned.

Insufficient links (5%) These are cases where insufficient links were selected from the original passage, thus not providing enough information to answer the question. While the model can handle over-selection of links, we found that the vast majority of the time, the system correctly identified both the necessary and sufficient links, rarely over-predicting the required links.

5 Combined Evaluation

By construction, all the questions in IIRC require more than the original paragraph to answer. This means that a reading comprehension model built for IIRC does not actually have to detect whether more information is required than what is in the given paragraph, as it can always assume that this is true. In order to combat this bias, we recommend an additional, more stringent evaluation that combines IIRC with other reading comprehension datasets that do not require retrieving additional information. This is in line with recently recommended evaluation methodologies for reading comprehension models (Talmor and Berant, 2019; Dua et al., 2019a).

Table 7: Results for link identification and QA when training the baseline model on IIRC and sampled questions from SQuAD (S) and DROP (D).

In this section, we present the results of one such evaluation. Noting that IIRC has similar properties to both SQuAD 2.0 (Rajpurkar et al., 2018) and DROP (Dua et al., 2019b), and even similar question language in places, we sample questions from these datasets to form a combined dataset for training and evaluating our baseline model.

Sampling from SQuAD 2.0 and DROP To construct the data for the combined evaluation, we sample an additional 3360 questions from SQuAD 2.0 and DROP, so that they make up 20% of the questions in the new data. We sample from SQuAD 2.0 and DROP with a ratio of 3:1 in order to match the distribution of numeric questions in IIRC, and we used a Wikifier (Cheng and Roth, 2013) to identify links to Wikipedia articles in these questions.
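
As a quick check of the counts implied by the stated 20% share and 3:1 ratio (illustrative arithmetic only; the exact per-dataset counts are not reported above):

```python
def augmentation_counts(n_iirc, extra_fraction=0.2, squad_to_drop_ratio=3):
    """How many SQuAD 2.0 / DROP questions to add so they form 20% of the
    combined data, split 3:1 between the two datasets."""
    n_extra = round(n_iirc * extra_fraction / (1 - extra_fraction))
    n_squad = round(n_extra * squad_to_drop_ratio / (squad_to_drop_ratio + 1))
    return n_squad, n_extra - n_squad

print(augmentation_counts(13441))   # (2520, 840), i.e. 3360 additional questions
```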

Results

We train the full baseline on IIRC augmented with the sampled DROP and SQuAD data, and evaluate it on the IIRC dev set without any additional sampled data. We do not include any sampled data in this evaluation in order to make a direct comparison to IIRC and see how adding questions that do not require external context affects the model's ability to identify necessary context. We also include the results of running just the link identification model trained under each setting. We show the results in Table 7. Adding the extra dimension of determining whether extra information is necessary causes the model to become less confident, significantly hurting recall on link selection. These missed predictions then propagate down the pipeline, resulting in a loss of almost 8% F1 compared to a model trained on just IIRC.

We also evaluated the combined model on a dev set with sampled SQuAD and DROP data to see how well the model learned to identify that no external information was necessary. Given that none of the SQuAD or DROP data requires external links, this evaluation could only negatively impact precision. We find that precision dropped by 8 points, compared to a drop of 28 points when the model trained only on IIRC was used, indicating that the combined model is able to learn to identify when no external information is required.

6 Related Work

Questions requiring multiple contexts Prior multi-context reading comprehension datasets were built by starting from discontiguous contexts and forming compositional questions by stringing together multiple facts, either by relying on knowledge graphs, as in QAngaroo (Welbl et al., 2018), or by having crowd workers do so, as in HotpotQA (Yang et al., 2018). It has been shown that many of these questions can be answered by focusing on just one of the facts used to build the questions (Min et al., 2019b). In contrast, each question in IIRC was written by a crowd worker who had access to just one paragraph, with the goal of obtaining information missing from it, thus minimizing lexical overlap between questions and the answer contexts. Additionally, IIRC provides a unique question type: questions requiring aggregating information from many related documents, such as the second question in Figure 2.

Separation Of Questions From Answer Contexts

Many prior datasets (e.g., Who Did What (Onishi et al., 2016), NewsQA (Trischler et al., 2016), DuoRC (Saha et al., 2018), Natural Questions (Kwiatkowski et al., 2019), and TyDi QA (Clark et al., 2020)) have tried to remove simple lexical heuristics from reading comprehension tasks by separating the contexts that questions are anchored in from those that are used to answer them. IIRC also separates the two contexts, but it is unique in that the linked documents elaborate on the information present in the original contexts, naturally giving rise to follow-up questions instead of open-ended ones.

Open-Domain Question Answering

In the open-domain QA setting, a system is given a question without any associated context, and must retrieve the necessary context to answer the question (Chen et al., 2017; Joshi et al., 2017; Dhingra et al., 2017; Yang et al., 2018; Seo et al., 2019; Karpukhin et al., 2020; Min et al., 2019a). IIRC is similar in that it also requires the retrieval of missing information. However, its questions are grounded in a given paragraph, meaning that a system must examine more than just the question in order to know what to retrieve. Most questions in IIRC do not make sense in an open-domain setting without their associated paragraphs.

Unanswerable questions Unlike SQuAD 2.0 (Rajpurkar et al., 2018), where the unanswerable questions were written to be close to answerable questions, IIRC contains naturally unanswerable questions that were not written with the goal of being unanswerable, a property that our dataset shares with NewsQA (Trischler et al., 2016), Natural Questions (Kwiatkowski et al., 2019), and TyDi QA (Clark et al., 2020). Results shown in Section 4.3 indicate that these questions cannot be trivially distinguished from answerable questions.

Incomplete Information QA A few prior datasets have explored question answering given incomplete information, such as science facts (Mihaylov et al., 2018; Khot et al., 2019) . However, these datasets contain multiple choice questions, and the answer choices provide hints as to what information may be needed. Yuan et al. (2020) explore this as well using a POMDP in which the context in existing QA datasets is hidden from the model until it explicitly searches for it.

7 Conclusion

We introduced IIRC, a new dataset of incomplete-information reading comprehension questions. These questions require identifying what information is missing from a paragraph in order to answer a question, predicting where to find it, and then synthesizing the retrieved information in complex ways. Our baseline model, built on top of state-of-the-art models for the most closely related existing datasets, performs quite poorly in this setting, even when given oracle retrieval results, and especially when combined with other reading comprehension datasets. IIRC both provides a promising new avenue for studying complex reading and retrieval problems and demonstrates that much more research is needed in this area.

Acknowledgments

and gifts from the Sloan Foundation and the Allen Institute for AI. The authors would also like to thank members of the Allen Institute for AI, UW-NLP, and the H2Lab at the University of Washington for their valuable feedback and comments.

