Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning
Authors
Abstract
Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark—the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.
1 Introduction
Paragraphs and other longer texts typically make multiple references to the same entities. Tracking these references and resolving coreference is essential for full machine comprehension of these texts. Significant progress has recently been made in reading comprehension research, due to large crowdsourced datasets (Rajpurkar et al., 2016; Bajaj et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019, inter alia). However, these datasets focus largely on understanding local predicate-argument structure, with very few questions requiring long-distance entity tracking. Obtaining such questions is hard for two reasons: (1) teaching crowdworkers about coreference is challenging, with even experts disagreeing on its nuances (Pradhan et al., 2007; Versley, 2008; Recasens et al., 2011; Poesio et al., 2018), and (2) even if we can get crowdworkers to target coreference phenomena in their questions, these questions may contain giveaways that let models arrive at the correct answer without performing the desired reasoning (see §3 for examples).

Paragraph: "Byzantines were avid players of tavli (Byzantine Greek: τάβλη), a game known in English as backgammon, which is still popular in former Byzantine realms, and still known by the name tavli in Greece. Byzantine nobles were devoted to horsemanship, particularly tzykanion, now known as polo. The game came from Sassanid Persia in the early period and a Tzykanisterion (stadium for playing the game) was built by Theodosius inside the Great Palace of Constantinople. Emperor Basil I (r. 867-886) excelled at it; Emperor Alexander (r. 912-913) died from exhaustion while playing, Emperor Alexios I Komnenos (r. 1081-1118) was injured while playing with Tatikios, and John I of Trebizond (r. 1235-1238) died from a fatal injury during a game. Aside from Constantinople and Trebizond, other Byzantine cities also featured tzykanisteria, most notably Sparta, Ephesus, and Athens, an indication of a thriving urban aristocracy."

Q1. What is the Byzantine name of the game that Emperor Basil I excelled at? (it → tzykanion)
Q2. What are the names of the sport that is played in a Tzykanisterion? (the game → tzykanion; polo)
Q3. What cities had tzykanisteria? (cities → Constantinople; Trebizond; Sparta; Ephesus; Athens)

Figure 1: Example paragraph and questions from the dataset. Highlighted text in paragraphs is where the questions with matching highlights are anchored. Next to the questions are the relevant coreferent mentions from the paragraph. They are bolded for the first question, italicized for the second, and underlined for the third in the paragraph.
We introduce a new dataset, QUOREF, that contains questions requiring coreferential reasoning (see examples in Figure 1). The questions are derived from paragraphs taken from a diverse set of English Wikipedia articles and are collected using an annotation process (§2) that deals with the aforementioned issues in the following ways. First, we devise a set of instructions that gets workers to find anaphoric expressions and their referents, asking questions that connect two mentions in a paragraph. These questions mostly revolve around traditional notions of coreference (Figure 1, Q1), but they can also involve referential phenomena that are more nebulous (Figure 1, Q3). Second, inspired by Dua et al. (2019), we disallow questions that can be answered by an adversary model (uncased base BERT (Devlin et al., 2019) trained on SQuAD 1.1 (Rajpurkar et al., 2016)) running in the background as the workers write questions. This adversary is not particularly skilled at answering questions requiring coreference, but it can follow obvious lexical cues; it thus helps workers avoid writing questions that shortcut coreferential reasoning. QUOREF contains more than 24K questions whose answers are spans or sets of spans in over 4.7K paragraphs from English Wikipedia that can be arrived at by resolving coreference in those paragraphs. We manually analyze a sample of the dataset (§3) and find that 78% of the questions cannot be answered without resolving coreference. We also show (§4) that the best system performance is 70.5 F1, while the estimated human performance is 93.4 F1. These findings indicate that this dataset is an appropriate benchmark for coreference-aware reading comprehension.
2 Dataset Construction
Collecting paragraphs We scraped paragraphs from Wikipedia pages about English movies, art and architecture, geography, history, and music. For movies, we followed the list of English-language films1 and extracted plot summaries that are at least 40 tokens long, and for the remaining categories, we followed the lists of featured articles2. Since movie plot summaries usually mention many characters, it was easier to find hard QUOREF questions for them, and we sampled about 60% of the paragraphs from this category.

Crowdsourcing setup We crowdsourced questions about these paragraphs on Mechanical Turk. We asked workers to find two or more co-referring spans in the paragraph, and to write questions such that answering them would require the knowledge that those spans are coreferential. We did not ask them to explicitly mark the co-referring spans.
Workers were asked to write questions for a random sample of paragraphs from our pool, and we showed them examples of good and bad questions in the instructions (see Appendix A). For each question, the workers were also required to select one or more spans in the corresponding paragraph as the answer, and these spans are not required to be the same as the coreferential spans that triggered the questions.3 We used an uncased base BERT QA model (Devlin et al., 2019) trained on SQuAD 1.1 (Rajpurkar et al., 2016) as an adversary running in the background that attempted to answer the questions written by workers in real time, and the workers were able to submit their questions only if their answer did not match the adversary's prediction.4 Appendix A further details the logistics of the crowdsourcing tasks. Some basic statistics of the resulting dataset can be seen in Table 1.
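To make the filtering step concrete, the sketch below shows the kind of check the adversarial loop performs: a question is accepted only if the background QA model's prediction does not match the worker's answer. The qa_model.predict call and the normalization routine are illustrative stand-ins, not the actual crowdsourcing code.

```python
import re
import string


def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop punctuation, articles, extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def question_is_acceptable(qa_model, paragraph, question, worker_answers):
    """Accept the question only if the adversary fails to answer it.

    qa_model stands in for the uncased base BERT model trained on SQuAD 1.1
    that ran in the background while workers wrote questions (hypothetical API).
    """
    prediction = qa_model.predict(paragraph, question)
    return normalize(prediction) not in {normalize(a) for a in worker_answers}
```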
3 Semantic Phenomena In Quoref
To better understand the phenomena present in QUOREF, we manually analyzed a random sample of 100 paragraph-question pairs. The following are some empirical observations.
Requirement Of Coreference Resolution
We found that 78% of the manually analyzed questions cannot be answered without coreference resolution. The remaining 22% contain coreferent mentions but can be answered even without resolving them. One such example is a paragraph that mentions only one city, "Bristol", and a sentence that says "the city was bombed". The associated question, Which city was bombed?, does not really require coreference resolution from a model that can identify city names, making the content in the question after Which city unnecessary.
Types of coreferential reasoning Questions in QUOREF require resolving pronominal and nominal mentions of entities. Table 2 shows percentages and examples of analyzed questions that fall into these two categories. These are not disjoint sets, since we found that 32% of the questions require both (row 3). We also found that 10% require some form of commonsense reasoning (row 4).
4 Baseline Model Performance On Quoref
We evaluated three types of initial baselines on QUOREF: state-of-the-art reading comprehension models that predict a single span (§4.1), state-of-the-art single-span reading comprehension models extended to predict multiple spans (§4.2), and heuristic baselines to look for annotation artifacts (§4.3). We use two evaluation metrics to compare model performance: Exact Match (EM), and a (macro-averaged) F1 score that measures overlap between a bag-of-words representation of the gold and predicted answers. We use the same implementation of EM as SQuAD, and we employ the F1 metric used for DROP (Dua et al., 2019). See Appendix B for model training hyperparameters and other details.5
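For concreteness, a minimal sketch of the two metrics is given below. It follows the standard token-overlap formulation with a simplified normalization and is not the official SQuAD or DROP evaluation script (the DROP F1 additionally handles sets of spans).

```python
from collections import Counter


def _tokens(answer):
    # Simplified normalization: lowercase and split on whitespace.
    return answer.lower().split()


def exact_match(gold, predicted):
    return float(_tokens(gold) == _tokens(predicted))


def bag_of_words_f1(gold, predicted):
    gold_bag, pred_bag = Counter(_tokens(gold)), Counter(_tokens(predicted))
    overlap = sum((gold_bag & pred_bag).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_bag.values())
    recall = overlap / sum(gold_bag.values())
    return 2 * precision * recall / (precision + recall)
```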
4.1 Single-Span Reading Comprehension
We test different single-span (SQuAD-style) reading comprehension models: (1) QANet (Yu et al., 2018), currently the best-performing published model on SQuAD 1.1 without data augmentation or pretraining; (2) QANet + BERT, which enhances the QANet model by concatenating frozen BERT representations to the original input embeddings; (3) BERT QA, the adversarial baseline used in data construction. When training our model, we use the summed likelihood objective function proposed by Clark and Gardner (2018), which marginalizes the model's output over all occurrences of the answer text. We use the AllenNLP implementation of QANet, modified to use the marginal objective. All BERT experiments use the base uncased model.6
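The summed likelihood objective can be written as the negative log of the total probability mass the model assigns to all occurrences of the gold answer. The snippet below is a minimal PyTorch sketch of that loss for a single example; it illustrates the objective and is not the AllenNLP implementation we used.

```python
import torch


def marginal_span_loss(start_logits, end_logits, gold_spans):
    """Negative log of the summed probability over all gold-span occurrences.

    start_logits, end_logits: tensors of shape [passage_length].
    gold_spans: list of (start, end) token indices, one per occurrence of the answer text.
    """
    log_p_start = torch.log_softmax(start_logits, dim=-1)
    log_p_end = torch.log_softmax(end_logits, dim=-1)
    # log P(span) = log P(start) + log P(end); marginalize with logsumexp over occurrences.
    span_log_probs = torch.stack([log_p_start[s] + log_p_end[e] for s, e in gold_spans])
    return -torch.logsumexp(span_log_probs, dim=0)
```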
4.2 Multi-Span Reading Comprehension
The single-span reading comprehension baselines are incapable of answering questions with multiple answer spans (e.g., the third question in Figure 1) ; these constitute ∼12% of QUOREF questions. Motivated by this shortcoming, we also evaluated a simple extension of the BERT QA model (the strongest single-span model evaluated), where the model is equipped with a variable number of prediction heads. Each prediction head contains two softmax classifiers for predicting the start and end indices of an answer span. 7 We set the number of prediction heads to equal the maximum number of answer spans associated with any training dataset question (8), enabling the model to correctly answer all questions in the training dataset. Each softmax classifier in each prediction head is independently trained, and prediction heads can opt out of predicting an answer.
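A minimal sketch of this extension is shown below, assuming token-level BERT representations of shape [batch, sequence_length, hidden] as input. The module and its name are illustrative; in particular, modeling opt-out by letting a head point at the [CLS] position is an assumption made for the sketch, since the exact mechanism is not spelled out here.

```python
import torch
from torch import nn


class MultiSpanHeads(nn.Module):
    """Illustrative multi-span extension: several independent start/end classifiers.

    Each head scores every token as a potential span start and span end. A head can
    opt out by placing its probability mass on position 0 (the [CLS] token); this
    opt-out mechanism is an assumption made for the sketch.
    """

    def __init__(self, hidden_size, num_heads=8):
        super().__init__()
        self.start_classifiers = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))
        self.end_classifiers = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))

    def forward(self, token_embeddings):
        # token_embeddings: [batch, seq_len, hidden] from the BERT encoder.
        spans = []
        for start_clf, end_clf in zip(self.start_classifiers, self.end_classifiers):
            start_logits = start_clf(token_embeddings).squeeze(-1)  # [batch, seq_len]
            end_logits = end_clf(token_embeddings).squeeze(-1)      # [batch, seq_len]
            spans.append((start_logits, end_logits))
        return spans  # one (start_logits, end_logits) pair per prediction head
```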
4.3 Heuristic Baselines
In light of recent work exposing predictive artifacts in crowdsourced NLP datasets (Gururangan et al., 2018; Kaushik and Lipton, 2018, inter alia), we estimate the effect of such artifacts by training a BERT-based model to predict a single start and end index given only the passage as input (passage-only).

4.4 Results

Table 3 presents the performance of all baseline models on QUOREF. All models perform significantly worse than on other prominent reading comprehension datasets, and human performance remains high.8 For instance, the BERT QA model yields the highest performance among our baselines, but its performance on QUOREF is nearly 40 absolute F1 points lower than its performance on SQuAD 1.1.

Our simple extensions to BERT QA for predicting multiple answer spans failed to improve upon the single-span counterpart. Qualitatively, the multi-span BERT QA model is prone to overpredicting answer spans. The passage-only baseline underperforms all other systems; examining its predictions reveals that it almost always predicts the most frequent entity in the passage. Its relatively low performance, despite the tendency for Wikipedia articles and passages to be written about a single entity, indicates that a large majority of questions likely require coreferential reasoning.
5 Related Work
Traditional coreference datasets Unlike traditional coreference annotations in datasets like those of Pradhan et al. (2007), Ghaddar and Langlais (2016), Chen et al. (2018), and Poesio et al. (2018), which aim to obtain complete coreference clusters, our questions require understanding coreference between only a few spans. While this means that the notion of coreference captured by our dataset is less comprehensive, it is also less conservative and allows questions about coreference relations that are not marked in OntoNotes annotations. Since the notion is not as strict, it does not require linguistic expertise from annotators, making it more amenable to crowdsourcing.
Reading comprehension datasets There are many reading comprehension datasets (Richardson et al., 2013; Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Dua et al., 2019, inter alia) . Most of these datasets principally require understanding local predicate-argument structure in a paragraph of text. QUOREF also requires understanding local predicate-argument structure, but makes the reading task harder by explicitly querying anaphoric references, requiring a system to track entities throughout the discourse.
6 Conclusion
We present QUOREF, a focused reading comprehension benchmark that evaluates the ability of models to resolve coreference. We crowdsourced questions over paragraphs from Wikipedia, and manual analysis confirmed that most cannot be answered without coreference resolution. We show that current state-of-the-art reading comprehension models perform poorly on this benchmark, significantly lower than human performance. Both these findings provide evidence that QUOREF is an appropriate benchmark for coreference-aware reading comprehension.
A.1 Instructions
The crowdworkers were given the following instructions:
"In this task, you will look at paragraphs that contain several phrases that are references to names of people, places, or things. For example, in the first sentence from sample paragraph below, the references Unas and the ninth and final king of Fifth Dynasty refer to the same person, and Pyramid of Unas, Unas's pyramid and the pyramid refer to the same construction. You will notice that multiple phrases often refer to the same person, place, or thing. Your job is to write questions that you would ask a person to see if they understood that the phrases refer to the same entity. To help you write such questions, we provided some examples of good questions you can ask about such phrases. We also want you to avoid questions that can be answered correctly by someone without actually understanding the paragraph. To help you do so, we provided an AI system running in the background that will try to answer the questions you write. You can consider any question it can answer to be too easy. However, please note that the AI system incorrectly answering a question does not necessarily mean that it is good. Please read the examples below carefully to understand what kinds of questions we are interested in."
A.2 Examples Of Good Questions
We illustrate examples of good questions for the following paragraph.
"The Pyramid of Unas is a smooth-sided pyramid built in the 24th century BC for the Egyptian pharaoh Unas, the ninth and final king of the Fifth Dynasty. It is the smallest Old Kingdom pyramid, but significant due to the discovery of Pyramid Texts, spells for the king's afterlife incised into the walls of its subterranean chambers. Inscribed for the first time in Unas's pyramid, the tradition of funerary texts carried on in the pyramids of subsequent rulers, through to the end of the Old Kingdom, and into the Middle Kingdom through the Coffin Texts which form the basis of the Book of the Dead. Unas built his pyramid between the complexes of Sekhemket and Djoser, in North Saqqara. Anchored to the valley temple via a nearby lake, a long causeway was constructed to provide access to the pyramid site. The causeway had elaborately decorated walls covered with a roof which had a slit in one section allowing light to enter illuminating the images. A long wadi was used as a pathway. The terrain was difficult to negotiate and contained old buildings and tomb superstructures. These were torn down and repurposed as underlay for the causeway. A significant stretch of Djoser's causeway was reused for embankments. Tombs that were on the path had their superstructures demolished and were paved over, preserving their decorations."
The following questions link pronouns:
Q1: What is the name of the person whose pyramid was built in North Saqqara? A: Unas
Q2: What is significant due to the discovery of Pyramid Texts? A: The Pyramid of Unas
Q3: What were repurposed as underlay for the causeway? A: old buildings; tomb superstructures
The following questions link other references:
Q1: What is the name of the king for whose afterlife spells were incised into the walls of the pyramid? A: Unas
Q2: Where did the final king of the Fifth Dynasty build his pyramid? A: between the complexes of Sekhemket and Djoser, in North Saqqara
A.3 Examples Of Bad Questions
We illustrate examples of bad questions for the following paragraph.
"Decisions by Republican incumbent Peter Fitzgerald and his Democratic predecessor Carol Moseley Braun to not participate in the election resulted in wide-open Democratic and Republican primary contests involving fifteen candidates. In the March 2004 primary election, Barack Obama won in an unexpected landslidewhich overnight made him a rising star within the national Democratic Party, started speculation about a presidential future, and led to the reissue of his memoir, Dreams from My Father. In July 2004, Obama delivered the keynote address at the 2004 Democratic National Convention, seen by 9.1 million viewers. His speech was well received and elevated his status within the Democratic Party. Obama's expected opponent in the general election, Republican primary winner Jack Ryan, withdrew from the race in June 2004. Six weeks later, Alan Keyes accepted the Republican nomination to replace Ryan. In the November 2004 general election, Obama won with 70 percent of the vote. Obama cosponsored the Secure America and Orderly Immigration Act. He introduced two initiatives that bore his name: LugarObama, which expanded the NunnLugar cooperative threat reduction concept to conventional weapons; and the The following question has ambiguous answers: Q1: Whose memoir was called Dreams from My Father? A: Barack Obama; Obama; Senator Obama
A.4 Worker Pool Management
Beyond training workers with the detailed instructions shown above, we ensured that the questions were of high quality by selecting a good pool of 21 workers using a two-stage selection process, allowing only those workers who clearly understood the requirements of the task to produce the final set of questions. Both the qualification and final HITs had 4 paragraphs per HIT for paragraphs from movie plot summaries, and 5 per HIT for the other domains, from which the workers could choose. For each HIT, workers typically spent 20 minutes, were required to write 10 questions, and were paid US$7.
B Experimental Setup Details
Unless otherwise mentioned, we adopt the original published procedures and hyperparameters used for each baseline.
BERT QA We truncate paragraphs to 300 (word) tokens during training and 400 tokens during evaluation. Questions are always truncated to 30 tokens. We train our model with a batch size of 10 and a sequence length of 512 wordpieces. We use the Adam optimizer with a learning rate of 3e-5. We train for 10 epochs with an early stopping patience of 5, checkpointing the model after each epoch. We report the performance of the checkpoint with the highest development set F1 score. The summed likelihood objective of Clark and Gardner (2018) requires a probability distribution over words in the paragraph, and we take each word's BERT representation to be the vector associated with its first wordpiece.
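As an illustration of the first-wordpiece pooling described above, the helper below gathers one vector per whitespace word from the wordpiece-level BERT output, given how many wordpieces each word was split into. It is a sketch of the idea rather than our exact preprocessing code.

```python
import torch


def first_wordpiece_vectors(wordpiece_embeddings, wordpieces_per_word):
    """Select one vector per word: the embedding of its first wordpiece.

    wordpiece_embeddings: [num_wordpieces, hidden] BERT output for one passage
    (assuming [CLS]/[SEP] rows have already been stripped).
    wordpieces_per_word: number of wordpieces each whitespace word was split into.
    """
    first_indices, offset = [], 0
    for count in wordpieces_per_word:
        first_indices.append(offset)
        offset += count
    # e.g. a rare word split into several wordpieces still contributes a single row.
    return wordpiece_embeddings[torch.tensor(first_indices)]
```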
QANet During training, we truncate paragraphs to 400 (word) tokens and questions to 50 tokens. During evaluation, we truncate paragraphs to 1000 tokens and questions to 100 tokens.
Passage-only baseline We truncate paragraphs to 300 (word) tokens during training and 400 tokens during evaluation. To calculate BERT embeddings for each passage, we prepend the special classification token [CLS] and append the separator token [SEP] to the passage. As in the BERT QA model, the summed likelihood objective of Clark and Gardner (2018) requires a probability distribution over words in the paragraph, so we take each word's BERT representation to be the vector associated with its first wordpiece.
1 https://en.wikipedia.org/wiki/Category:English-language_films
2 https://en.wikipedia.org/wiki/Wikipedia:Featured_articles
3 For example, the last question in Table 2 is about the coreference of {she, Fania, his mother}, but none of these mentions is the answer.
4 Among models with acceptably low latency, we qualitatively found uncased base BERT to be the most effective.
5 We will release code for reproducing results.
6 The large uncased model does not fit in GPU memory.
7 The single-span BERT model is a special case of the multi-span BERT model, with only one prediction head.
8 Human performance was estimated from the authors' answers to 400 questions from the test set, scored with the same metric used for systems.