
SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach


Abstract

The SimpleQuestions dataset is one of the most commonly used benchmarks for studying single-relation factoid questions. In this paper, we present new evidence that this benchmark can be nearly solved by standard methods. First, we show that ambiguity in the data bounds performance at 83.4%; many questions have more than one equally plausible interpretation. Second, we introduce a baseline that sets a new state-of-the-art performance level at 78.1% accuracy, despite using standard methods. Finally, we report an empirical analysis showing that the upperbound is loose; roughly a quarter of the remaining errors are also not resolvable from the linguistic signal. Together, these results suggest that the SimpleQuestions dataset is nearly solved.

1 Introduction

We present new evidence that the SimpleQuestions benchmark can be nearly solved by standard methods. First, we show that ambiguity in the data bounds performance; there are often multiple answers that cannot be disambiguated from the linguistic signal alone. Second, we introduce a baseline that sets a new state-of-the-art performance level, despite using standard methods.

Our first main contribution is to show that performance on the SimpleQuestions benchmark is bounded. This benchmark requires predicting a relation (e.g. /film/film/story by) and subject (e.g. 090s 0 [gulliver's travels]) given a question. Consider the question "who wrote gulliver's travels?" from the SimpleQuestions dataset (Table 1). This example introduces a fundamental ambiguity: the linguistic signal provides equal evidence for the TV miniseries and the book, even though only one of the options is labeled as the correct answer. We introduce a method for automatically identifying many such ambiguities in the data, thereby producing a new 83.4% upperbound.

Our second main contribution is a baseline that sets a new state-of-the-art performance level, despite using standard methods. Our approach includes (1) a CRF tagger to determine the subject alias, and (2) a BiLSTM to classify the relation, yielding 78.1% accuracy for predicting correct subject-relation pairs.

Finally, we present an empirical error analysis of this model which shows the upperbound is loose and that there is likely not much more than 4% of performance to be gained with future work on the data. We will publicly release all code and models.

2 Background

Single-relation factoid questions (simple questions) are common in many settings (e.g. Microsoft's search query logs and WikiAnswers questions). The SimpleQuestions dataset is one of the most commonly used benchmarks for studying such questions but, prior to this work, has remained unsolved. This section reviews this benchmark.

The Freebase knowledge graph (KG) provides the facts for answering the questions in the SimpleQuestions dataset. It includes 3 billion triples of the form (subject, relation, object), e.g. (04b5zb, location/location/containedby, 0f80hy). We denote such triples as (s, r, o).

The SimpleQuestions task is to rewrite questions into subject-relation pairs of the form (subject, relation), denoted in this paper as (s, r). Each pair defines a graph query that can be used to answer the corresponding natural language question. The subject is a Freebase object with an identifier called an MID (e.g. 04b5zb). Freebase objects also typically include one or more string aliases (e.g. MID 04b5zb is named "fires creek"), which we will use later when computing our upperbound. The relation is an object property (e.g. location/location/containedby) defined by the Freebase ontology. For example, the question "which forest is fires creek in" corresponds with the subject-relation pair (04b5zb [fires creek], location/location/containedby). Finally, the task is evaluated on subject-relation pair accuracy.
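To make this setup concrete, here is a minimal sketch of how a predicted subject-relation pair defines a graph query over a KG. The toy triple store and the answer helper are our own illustration, not part of any released code.

```python
# Toy illustration of how a (subject, relation) pair answers a question.
# The triples and helper below are illustrative, not the authors' code.

# A tiny in-memory stand-in for a Freebase subset: (s, r) -> set of objects o.
KG = {
    ("04b5zb", "location/location/containedby"): {"0f80hy"},
}

def answer(subject_mid, relation):
    """Execute the graph query defined by a subject-relation pair."""
    return KG.get((subject_mid, relation), set())

# "which forest is fires creek in" -> (04b5zb [fires creek], location/location/containedby)
print(answer("04b5zb", "location/location/containedby"))  # {'0f80hy'}
```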

The SimpleQuestions dataset provides a set of 108,442 simple questions; each question is accompanied by a ground truth triple (s, r, o). This dataset also provides two subsets of Freebase: FB2M and FB5M.¹

¹ The FB2M and FB5M subsets of the Freebase KG can complete 7,188,636 and 7,688,234 graph queries respectively; therefore, the FB5M subset is 6.9% larger than the FB2M subset. More previous research has cited FB2M numbers than FB5M; therefore, we report our numbers on FB2M.

3 Dataset Ambiguity And Upperbound

Our first main contribution is to show that performance on the SimpleQuestions benchmark is bounded. Consider the question "who wrote gulliver's travels?" in Table 1: the linguistic signal provides equal evidence for six subject-relation pairs in the cross product of Table 2 and Table 3, including:

Table 1: Unanswerable example from the SimpleQuestions dataset
Table 2: FB2M entities with the alias “gulliver’s travels”
Table 3: Relations that co-occur with the abstract predicate "who wrote e?" in the SimpleQuestions dataset, with their counts

• (Gulliver's Travels [Book], book/written work/author)
• (Gulliver's Travels [TV miniseries], film/film/written by)
• (Gulliver's Travels [TV miniseries], film/film/story by)

The subject-relation pairs cannot be disambiguated from the linguistic signal alone; therefore, the question is unanswerable. We say a question is unanswerable if there exist multiple subject-relation pairs that are equally plausible given the linguistic signal.

Table 1 pairs the question "who wrote gulliver's travels?" with the ground truth subject 090s 0 and relation film/film/story by.

The ambiguity perhaps stems from the annotation process. Annotators were asked to write a natural language question for a corresponding triple (s, r, o). Given only this triple, it would be difficult to anticipate possible ambiguities in Freebase.

3.1 Approach

Given an example question q with the ground truth (s, r, o), our goal is to determine the set of all subject-relation pairs that accurately interpret q.

We first determine a string alias a for the subject by matching a phrase in q with a Freebase alias for s, in our example yielding "gullivers travels". We then find all other Freebase entities that share this alias and add them to a set S; in our example, S is the subject column of Table 2.

We define an abstract predicate p (e.g. "who wrote e?") as q with the alias a abstracted. We determine the set of potential relations R (see Table 3) as the relations that p co-occurs with in the SimpleQuestions dataset.

Finally, if there exists a subject-relation pair (s, r) ∈ KG such that r ∈ R ∧ s ∈ S, we define it as a valid semantic interpretation of q. We say q is unanswerable if there exist multiple valid subject-relation pairs (s, r).
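A minimal sketch of this procedure follows. It assumes simple precomputed indexes (an alias-to-subjects map, a predicate-to-relations co-occurrence table, and the set of KG subject-relation pairs); all names are ours, and details such as alias normalization are simplified.

```python
def interpretations(question, gold_subject, alias_index, predicate_relations, kg_pairs):
    """Return all subject-relation pairs consistent with the question.

    alias_index:         alias string -> set of subject MIDs sharing that alias
    predicate_relations: abstract predicate (e.g. "who wrote e?") -> set of relations
                         it co-occurs with in the SimpleQuestions questions
    kg_pairs:            set of (subject, relation) pairs present in the KG
    """
    # 1. Find an alias of the ground-truth subject that appears in the question.
    alias = next((a for a in alias_index
                  if a in question and gold_subject in alias_index[a]), None)
    if alias is None:
        # The question does not reference the subject (see Section 3.2).
        return set()

    # 2. All entities that share this alias (the set S).
    subjects = alias_index[alias]

    # 3. Abstract the alias out of the question to form the predicate p,
    #    and collect the relations R that p co-occurs with.
    predicate = question.replace(alias, "e")
    relations = predicate_relations.get(predicate, set())

    # 4. Keep only the subject-relation pairs that exist in the KG.
    return {(s, r) for s in subjects for r in relations if (s, r) in kg_pairs}


def is_unanswerable(question, gold_subject, alias_index, predicate_relations, kg_pairs):
    """A question is unanswerable if it has more than one valid interpretation."""
    return len(interpretations(question, gold_subject,
                               alias_index, predicate_relations, kg_pairs)) > 1
```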

3.2 Results

We find that 33.9% of examples (3,675 of 10,845) in the SimpleQuestions dataset are unanswerable. Taking into account the frequency of relations for each subject in the KG, we can further improve accuracy by guessing according to their empirical distribution, yielding an upperbound of 85.2%.
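One plausible reading of this calculation is sketched below, as an assumption rather than the authors' exact procedure: for each unanswerable question, the best available strategy is to guess the interpretation whose relation is most frequent for its subject in the KG, and the upperbound counts how often that guess happens to match the labeled pair. The fact_count helper is our own illustrative stand-in.

```python
def best_guess_is_correct(candidates, gold_pair, fact_count):
    """Guess the interpretation whose relation is most frequent for its subject in the KG.

    candidates: set of equally plausible (subject, relation) pairs for one question
    gold_pair:  the labeled (subject, relation) pair
    fact_count: function (subject, relation) -> number of facts of that type in the KG
    """
    guess = max(candidates, key=lambda pair: fact_count(*pair))
    return guess == gold_pair

# The upperbound is then the fraction of questions that are either unambiguous
# or ambiguous with a best guess that happens to match the labeled pair.
```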

Finally, we also found that 1.8% of example questions (1,587 of 86,755) in the SimpleQuestions dataset did not reference the subject. For example, "Which book is written about?" does not reference the corresponding subject 01n7q (california). We consider these examples unanswerable as well, yielding an upperbound of 83.4%.

4 Baseline Model

Our second main contribution is a baseline that sets a new state-of-the-art performance level, despite using standard methods. Our approach includes (1) a CRF tagger to determine the subject alias, and (2) a BiLSTM to classify the relation.

4.1 Approach

Given a question q (e.g. "who wrote gulliver's travels?") our model must predict the corresponding subject-relation pair (s, r). We predict with a pipeline that first does top-k subject recognition and then relation classification.

We make use of two learned distributions. The subject recognition model P(a|q) ranges over text spans A within the question q, in our example including the correct answer "gullivers travels". This distribution is modeled with a CRF, as defined in more detail below. The relation classification model P(r|q, a) is used to select a Freebase relation r that matches q. This distribution ranges over all relations in Freebase that take objects that have an alias matching a. It is modeled with a BiLSTM that encodes q, again as defined in more detail below.

Given these distributions, we predict the final subject-relation pair (s, r) as follows. We first find the most likely subject alias according to P(a|q) that also matches a subject alias in the KG. We then find all other Freebase entities that share this alias and add them to a set S; in our example, S is the subject column of Table 2. We define the candidate relation set R = {r | (s, r) ∈ KG for some s ∈ S}. Using the relation classification model P(r|q, a), we predict the most likely relation r_max ∈ R.

Now, the answer candidates are the subject-relation pairs (s, r_max) ∈ KG with s ∈ S. In our example question, if r_max is film/film/story by, then there are two candidate subjects: 06znpjr (Gulliver's Travels, American film) and 02py9bj (Gulliver's Travels, French film). Because there is no explicit linguistic signal to disambiguate this choice, we pick the subject s_max that has the most facts of type r_max.
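A condensed sketch of this pipeline is shown below. It assumes the two trained models expose simple scoring interfaces and that the KG indexes from Section 3.1 are available; the function and parameter names are ours, not the released implementation.

```python
def predict(question, subject_model, relation_model, alias_index, kg_pairs, fact_count):
    """Two-stage prediction of a (subject, relation) pair for a question.

    subject_model:  yields candidate alias spans for the question, best first
                    (top-k CRF decoding)
    relation_model: scores a relation r given the question and alias, i.e. P(r | q, a)
    alias_index:    alias string -> set of subject MIDs sharing that alias
    kg_pairs:       set of (subject, relation) pairs present in the KG
    fact_count:     function (subject, relation) -> number of facts of that type
    """
    # 1. Top-k subject recognition: highest-ranked span that matches a KG alias.
    alias = next(a for a in subject_model.top_k_spans(question) if a in alias_index)

    # 2. Candidate subjects S and the relations R they participate in.
    #    (A linear scan here for clarity; in practice this would be an index lookup.)
    subjects = alias_index[alias]
    relations = {r for (s, r) in kg_pairs if s in subjects}

    # 3. Relation classification: most likely relation among the candidates.
    r_max = max(relations, key=lambda r: relation_model.score(question, alias, r))

    # 4. Among subjects paired with r_max in the KG, pick the one with the most such facts.
    candidates = [s for s in subjects if (s, r_max) in kg_pairs]
    s_max = max(candidates, key=lambda s: fact_count(s, r_max))
    return s_max, r_max
```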

4.2 Model Details

Our approach requires two models; in this section we cover training and configuring them. Note that we train on the 75,910 SimpleQuestions training examples and tune on the 10,845 validation examples.

Top-K Subject Recognition

We model top-k subject recognition P(a|q) using a linear-chain conditional random field (CRF) tagger with a conditional log likelihood loss objective. The k candidates are inferred with the top-k Viterbi algorithm.

Our model is trained on a dataset of question tokens and their corresponding subject alias spans using IO tagging. The subject alias spans are determined by matching a phrase in the question with a Freebase alias for the subject.
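Constructing this tagging data amounts to marking the tokens inside the matched alias span as I and everything else as O. A simplified sketch, with our own tokenization and matching shortcuts rather than the exact preprocessing:

```python
def io_tags(question_tokens, alias):
    """Label question tokens with I (inside the subject alias span) or O (outside).

    Simplified: the alias is matched as a contiguous token subsequence; the real
    span matching against Freebase aliases is more involved.
    """
    alias_tokens = alias.split()
    tags = ["O"] * len(question_tokens)
    for i in range(len(question_tokens) - len(alias_tokens) + 1):
        if question_tokens[i:i + len(alias_tokens)] == alias_tokens:
            tags[i:i + len(alias_tokens)] = ["I"] * len(alias_tokens)
            break
    return tags

# io_tags("who wrote gulliver 's travels ?".split(), "gulliver 's travels")
# -> ['O', 'O', 'I', 'I', 'I', 'O']
```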

As for hyperparameters, our model word embeddings are initialized with GloVe (Pennington et al., 2014) and frozen. Adam, initialized with a learning rate of 0.0001, is employed to optimize the model weights. Finally, we halve the learning rate if the validation accuracy has not improved in 3 epochs.
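In PyTorch terms, this training configuration corresponds roughly to the following sketch; the placeholder model stands in for the CRF tagger, which is omitted here.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the CRF tagger; the real model (frozen GloVe
# embeddings + linear-chain CRF) is omitted in this sketch.
model = nn.Linear(10, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Halve the learning rate when validation accuracy has not improved for 3 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)

# Inside the training loop, after computing validation accuracy for an epoch:
# scheduler.step(val_accuracy)
```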

All hyperparameters are hand tuned, and then a limited set is further tuned with grid search to increase validation accuracy. In total we evaluated at most 100 hyperparameter configurations.

Relation Classification

P(r|q, a) is modeled with a one-layer BiLSTM batch-norm softmax classifier that encodes the abstract predicate p and is trained with a negative log likelihood loss objective. As in Section 3.1, the abstract predicate p (e.g. "who wrote e?") is q with the alias a abstracted.

The model is trained on a dataset that maps each predicate p and candidate relation set R to the ground truth relation r. These values are obtained by following the approach in Section 4.1 up to the point where they are defined.
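A minimal PyTorch sketch of such a classifier follows. The hidden size, pooling, and placement of batch normalization are our assumptions, since the paper only names the components (frozen FastText embeddings, a one-layer BiLSTM, batch norm, and a softmax over relations trained with a negative log likelihood loss).

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """One-layer BiLSTM -> batch norm -> softmax over candidate relations.

    An illustrative sketch; hidden size and pooling are assumptions, not the
    authors' released configuration.
    """

    def __init__(self, pretrained_embeddings, num_relations, hidden_size=256):
        super().__init__()
        # Frozen FastText word embeddings.
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        self.lstm = nn.LSTM(pretrained_embeddings.size(1), hidden_size,
                            num_layers=1, bidirectional=True, batch_first=True)
        self.norm = nn.BatchNorm1d(2 * hidden_size)
        self.out = nn.Linear(2 * hidden_size, num_relations)

    def forward(self, predicate_token_ids):
        # predicate_token_ids: (batch, seq_len) token ids of the abstract predicate p.
        embedded = self.embedding(predicate_token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states.
        encoded = torch.cat([hidden[0], hidden[1]], dim=-1)
        return torch.log_softmax(self.out(self.norm(encoded)), dim=-1)

# Trained with nn.NLLLoss() on the log-probabilities, and restricted at
# prediction time to the candidate relation set R from Section 4.1.
```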

As for hyperparameters, the model word embeddings are initialized with FastText (Bojanowski et al., 2016) and frozen. The AMSGrad variant of Adam (Reddi et al., 2018), initialized with a learning rate of 0.0001, is employed to optimize the model weights. Finally, we double the batch size (Smith et al., 2017) if the validation accuracy has not improved in 3 epochs.

Previous Work                              Acc.
Random guess (Bordes et al., 2015)          4.9
Memory NN (Bordes et al., 2015)            61.6
Attn. LSTM (He and Golub, 2016)            70.9
GRU (Lukovnikov et al., 2017)              71.2
BiGRU-CRF & BiGRU (Mohammed et al., 2017)  73.7
BiLSTM & BiGRU (Mohammed et al., 2017)     74.9
BiGRU & BiGRU (Dai et al., 2016)

Table 4: Summary of past results on the SimpleQuestions benchmark along with the neural models employed. Note that an "&" indicates multiple neural models.

All hyperparameters are hand tuned and then a limited set are further tuned with Hyperband (Li et al., 2017) to increase validation accuracy. Hyperband is allowed at most 30 epochs per model and a total of 1000 epochs. In total we evaluated at most 500 hyperparameter configurations.

4.3 Results

After running our model on the 21,687 SimpleQuestions test examples, we present our results on the SimpleQuestions task. Note that we run on the test set only once, to measure generalization.

SimpleQuestions Task Our baseline model achieves 78.1% accuracy, a new state-of-the-art result without ensembling or data augmentation (Table 4). These results suggest that relatively standard architectures work well when carefully tuned, and approach the upperbound established in Section 3.

Further Qualitative Analysis We also analyze the remaining errors to point toward directions for future work.

In Section 3, we showed that questions can provide equal evidence for multiple subject-relation pairs. To remove this ambiguity, we count any of these options as correct, and our performance jumps to 91.5%.

The remaining 8.5% error comes from a number of sources. First, we find that 1.9% of examples were incorrect due to the noise mentioned in Section 3. That leaves a 6.5% gap. To understand this gap, we perform an empirical error analysis on a sample of 50 errors.

First, we found that in 14 of 50 cases the question provided equal linguistic evidence for both the ground truth and the predicted answer, similar to the dataset ambiguity found in Section 3, suggesting that our upperbound is loose. We note that Section 3 did not cover all possible question-subject-relation pair ambiguities: the approach relied on exact string matching to discover ambiguity, and therefore missed similar paraphrases. For example, the abstract predicate "what classification is e" had more examples than "what classification is the e", allowing our approach to programmatically identify more subject-relation pair ambiguities for the former predicate than for the latter.

The remaining 36 of 50 cases were linguistic mistakes by our model. Among these 36 cases, we identified the following error categories:

• Low Shot (16 of 36) The relation was seen in the training data less than 10 times.

• Subject Span (14 of 36) The subject span was incorrect.

• Noise (2 of 36) The question did not make grammatical sense.

Finally, the error analysis of this model shows that the upperbound is loose. There is likely not much more than 4% of performance to be gained with future work on the data.

5 Conclusions And Future Work

The SimpleQuestions dataset is one of the most commonly used benchmarks for studying single-relation factoid questions. In this paper, we presented new evidence suggesting that this benchmark can be nearly solved by standard methods. These results suggest there is likely not much more than 4% to be gained with future work on the data.