Query-Reduction Networks for Question Answering

Minjoon Seo
Sewon Min
Ali Farhadi
Hannaneh Hajishirzi
ICLR
2017
View in Semantic Scholar

Abstract

In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.

1 Introduction

In this paper, we address the problem of question answering (QA) when reasoning over multiple facts is required. For example, consider we know that Frogs eat insects and Flies are insects. Then answering Do frogs eat flies? requires reasoning over both of the above facts. Question answering, more specifically context-based QA, has been extensively studied in machine comprehension tasks (Richardson et al., 2013; Hermann et al., 2015; Hill et al., 2016; Rajpurkar et al., 2016) . However, most of the datasets are primarily focused on lexical and syntactic understanding, and hardly concentrate on inference over multiple facts. Recently, several datasets aimed for testing multi-hop reasoning have emerged; among them are story-based QA and the dialog task .

Recurrent Neural Network (RNN) and its variants, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) , are popular choices for modeling natural language. However, when used for multi-hop reasoning in question answering, purely RNN-based models have shown to perform poorly . This is largely due to the fact that RNN's internal memory is inherently unstable over a long term. For this reason, most recent approaches in the literature have mainly relied on global attention mechanism and shared external memory (Sukhbaatar et al., 2015; Peng et al., 2015; Xiong et al., 2016; Graves et al., 2016) . The attention mechanism allows these models to focus on a single sentence in each layer. They can sequentially read multiple relevant sentences from the memory with multiple layers to perform multi-hop reasoning. However, one major drawback of these standard attention mechanisms is that they are insensitive to the time step (memory address) of the sentences when accessing them.

Our proposed model, Query-Reduction Network 1 (QRN), is a single recurrent unit that addresses the long-term dependency problem of most RNN-based models by simplifying the recurrent update, while taking the advantage of RNN's capability to model sequential data ( Figure 1 ). QRN considers the context sentences as a sequence of state-changing triggers, and transforms (reduces) the original query to a more informed query as it observes each trigger through time. For instance in Figure 1b , the original question, Where is the apple?, cannot be directly answered by any single sentence from the story. After observing the first sentence, Sandra got the apple there, QRN transforms the original question to a reduced query Where is Sandra?, which is presumably Where is the apple? x, q,ŷ are the story, question and predicted answer in natural language, respectively. x = x1, . . . , xT , q,ŷ are their corresponding vector representations (upright font). α and ρ are update gate and reduce functions, respectively.ŷ is assigned to be h 2 5 , the local query at the last time step in the last layer. Also, red-colored text is the inferred meanings of the vectors (see 'Interpretations' of Section 5.3). easier to answer than the original question given the context provided by the first sentence. 2 Unlike RNN-based models, QRN's candidate state (h t in Figure 1a ) does not depend on the previous hidden state (h t−1 ). Compared to memory-based approaches Sukhbaatar et al., 2015; Peng et al., 2015; Kumar et al., 2016; Xiong et al., 2016) , QRN can better encodes locality information because it does not use a global memory access controller (circle nodes in Figure 2 ), and the query updates are performed locally.

Figure 1: (1a) QRN unit, (1b) 2-layer QRN on 5-sentence story, and (1c) entire QA system (QRN and input / output modules). x, q, ŷ are the story, question and predicted answer in natural language, respectively. x = 〈x1, . . . ,xT 〉,q, ŷ are their corresponding vector representations (upright font). α and ρ are update gate and reduce functions, respectively. ŷ is assigned to be h25, the local query at the last time step in the last layer. Also, red-colored text is the inferred meanings of the vectors (see ‘Interpretations’ of Section 5.3).

Figure 2: The schematics of QRN and the two state-of-the-art models, End-to-End Memory Networks (N2N) and Improved Dynamic Memory Networks (DMN+), simplified to emphasize the differences among the models. AGRU is a variant of GRU where the update gate is replaced with soft attention, proposed by Kumar et al. (2016). For QRN and DMN+, only forward direction arrows are shown.

In short, the main contribution of QRN is threefold. First, QRN is a simple variant of RNN that reduces the query given the context sentences in a differentiable manner. Second, QRN is situated between the attention mechanism and RNN, effectively handling time dependency and long-term dependency problems of each technique, respectively. Hence it is well-suited for sequential data with both local and global interactions (note that QRN is not the replacement of RNN, which is arguably better for modeling complex local interactions). Third, unlike most RNN-based models, QRN can be parallelized over time by computing candidate reduced queries (h t ) directly from local input queries (q t ) and context sentence vectors (x t ). In fact, the parallelizability of QRN implies that QRN does not suffer from the vanishing gradient problem of RNN, hence effectively addressing the long-term dependency. We experimentally demonstrate these contributions by achieving the state-of-the-art results on story-based QA and interactive dialog datasets.

2 Model

In story-based QA (or dialog dataset), the input is the context as a sequence of sentences (story or past conversations) and a question in natural language (equivalent to the user's last utterance in the dialog). The output is the predicted answer to the question in natural language (the system's next utterance in the dialog). The only supervision provided during training is the answer to the question.

In this paper we particularly focus on end-to-end solutions, i.e., the only supervision comes from questions and answers, and we restrain from using manually defined rules or external language resources, such as lexicon or dependency parser. Let x 1 , . . . , x T denote the sequence of sentences, where T is the number of sentences in the story, and let q denote the question. Letŷ denote the predicted answer, and y denote the true answer. Our proposed system for end-to-end QA task is divided into three modules ( Figure 1c ): input module, QRN layers, and output module.

Input module. Input module maps each sentence x t and the question q to d-dimensional vector space, x t ∈ R d and q t ∈ R d . We adopt a previous solution for the input module (details in Section 5).

QRN layers. QRN layers use the sentence vectors and the question vector from the input module to obtain the predicted answer in vector space,ŷ ∈ R d . A QRN layer refers to the recurrent application of a QRN unit, which can be considered as a variant of RNN with two inputs, two outputs, and a hidden state (reduced query), all of which operate in vector space. The details of the QRN module is explained throughout this section (2.1, 2.2).

Output module. Output module mapsŷ obtained from QRN to a natural language answerŷ. Similar to the input module, we adopt a standard solution for the output module (details in Section 5).

We first formally define the base model of a QRN unit, and then we explain how we connect the input and output modules to it (Section 2.1). We also present a few extensions to the network that can improve QRN's performance (Section 2.2). Finally, we show that QRN can be parallelized over time, giving computational advantage over most RNN-based models by one order of magnitude (Section 3).

2.1 Qrn Unit

As an RNN-based model, QRN is a single recurrent unit that updates its hidden state (reduced query) through time and layers. Figure 1a depicts the schematic structure of a QRN unit, and Figure 1b demonstrates how layers are stacked. A QRN unit accepts two inputs (local query vector q t ∈ R d and sentence vector x t ∈ R d ), and two outputs (reduced query vector h t ∈ R d , which is similar to the hidden state in RNN, and the sentence vector x t from the input without modification). The local query vector is not necessarily identical to the original query (question) vector q. In order to compute the outputs, we use update gate function α :

R d × R d → [0, 1] and reduce function ρ : R d × R d → R d .

Intuitively, the update gate function measures the relevance between the sentence and the local query and is used to update the hidden state. The reduce function transforms the local query input to a candidate state which is a new reduced (easier) query given the sentence. The outputs are calculated with the following equations:

EQUATION (2): Not extracted; please refer to original document.

h t = z tht + (1 − z t )h t−1 (3)

where z t is the scalar update gate,h t is the candidate reduced query, and h t is the final reduced query at time step t, σ(•) is sigmoid activation, tanh(•) is hyperboolic tangent activation (applied element-wise),

W (z) ∈ R 1×d , W (h) ∈ R d×2d are weight matrices, b (z) ∈ R, b (h) ∈ R d are bias terms,

• is element-wise vector multiplication, and [; ] is vector concatenation along the row. As a base case, h 0 = 0. Here we have explicitly defined α and ρ, but they can be any reasonable differentiable functions.

The update gate is similar to the global attention mechanism (Sukhbaatar et al., 2015; Xiong et al., 2016) in that it measures the similarity between the sentence (a memory slot) and the query. However, a significant difference is that the update gate is computed using sigmoid (σ) function on the current memory slot only (hence internally embedded within the unit), whereas the global attention is computed using softmax function over the entire memory (hence globally defined). The update gate can be rather considered as local sigmoid attention.

Stacking Layers

We just showed the single-layer case of QRN, but QRN with multiple layers is able to perform reasoning over multiple facts more effectively, as shown in the example of Figure 1b . In order to stack several layers of QRN, the outputs of the current layer are used as the inputs to the next layer. That is, using superscript k to denote the current layer's index (assuming 1-based indexing), we let q k+1 t = h k t . Note that x t is passed to the next layer without any modification, so we do not put a layer index on it.

Bi-direction. So far we have assumed that QRN only needs to look at past sentences, whereas often times, query answers can depend on future sentences. For instance, consider a sentence "John dropped the football." at time t. Then, even if there is no mention about the "football" in the past (at time i < t), it can be implied that "John" has the "football" at the current time t. In order to incorporate the future dependency, we obtain − → h t and ← − h t in both forward and backward directions, respectively, using Equation 3. We then add them together to get q t for the next layer. That is,

q k+1 t = − → h k t + ← − h k t (4)

for layer indices 1 ≤ k ≤ K − 1. Note that the variables (h) are shared between the two directions.

W (z) , b (z) , W (h) , b

Connecting input and output modules. Figure 1c depicts how QRN is connected with the input and output modules. In the first layer of QRN, q 1 t = q for all t, where q is obtained from the input module by processing the natural language question input q. x t is also obtained from x t by the same input module. The output at the last time step in the last layer is passed to the output module. That is, y = h K t where K represent the number of layers in the network. Then the output module gives the predicted answerŷ in natural language.

2.2 Extensions

Here we introduce a few extensions of QRN, and later in our experiments, we test QRN's performance with and without each of these extensions.

Reset gate. Inspired by GRU (Cho et al., 2014) , we found that it is useful to allow the QRN unit to reset (nullify) the candidate reduced query (i.e.,h t ) when necessary. For this we use a reset gate function β :

R d × R d → [0, 1]

, which can be defined similarly to the update gate function:

EQUATION (5): Not extracted; please refer to original document.

where W (r) ∈ R 1×d is a weight matrix, and b (r) ∈ R is a bias term. Equation 3 is rewritten as

EQUATION (6): Not extracted; please refer to original document.

Note that we do not use the reset gate in the last layer.

Vector gates. As in LSTM and GRU, update and reset gates can be vectors instead of scalar values for fine-controlled gating. For vector gates, we modify the row dimension of weights and biases in Equation 1 and 5 from 1 to d. Then we obtain z t , r t ∈ R d (instead of z t , r t ∈ R), and these can be element-wise multiplied (•) instead of being broadcasted in Equation 3 and 6.

3 Parallelization

An important advantage of QRN is that the recurrent updates in Equation 3 and 5 can be computed in parallel across time. This is in contrast with most RNN-based models that cannot be parallelized, where computing the candidate hidden state at time t explicitly requires the previous hidden state. In QRN, the final reduced queries (h t ) can be decomposed into computing over candidate reduced queries (h t ), without looking at the previous reduced query. Here we primarily show that the query update in Equation 3 can be parallelized by rewriting the equation with matrix operations. The extension to Equation 5 is straightforward. The proof for QRN with vector gates is shown in Appendix B. The recursive definition of Equation 3 can be explicitly written as

EQUATION (7): Not extracted; please refer to original document.

Let b i = log(1 − z i ) for brevity. Then we can rewrite Equation 7 as the following equation: (Sukhbaatar et al., 2015 ) (Xiong et al., 2016) Figure 2: The schematics of QRN and the two state-of-the-art models, End-to-End Memory Networks N2Nand Improved Dynamic Memory Networks (DMN+), simplified to emphasize the differences among the models. AGRU is a variant of GRU where the update gate is replaced with soft attention, proposed by Kumar et al. (2016) . For QRN and DMN+, only forward direction arrows are shown.

       h 1 h 2 h 3 . . . h T        =        exp                     0 −∞ −∞ . . . −∞ b 2 0 −∞ . . . −∞ b 2 + b 3 b 3 0 . . . −∞ . . . . . . . . . . . . . . . T j=2 b j T j=3 b j T j=4 b j . . . 0                                   z 1h 1 z 2h 2 z 3h 3 . . . z Th T        (8) " # $ q ⋯ " # $ QRN ⋯ $ # = ) " " # " $ " " # " " # " # # QRN QRN QRN QRN QRN (a) QRN " # $ = ( " # $ ⋯ " + = # Σ Σ ⋯ (b) N2N

" # $ = ( AGRU AGRU AGRU GRU ⋯ " # $ AGRU AGRU AGRU GRU ⋯ " + = # (c) DMN+

Let H = [h 1 ; . . . ; h T ] be a T -by-d matrix where the transposes ( ) of the column vectors h t are concatenated across row. We similarly defineH fromh t . Also, let z = [z 1 ; . . . ; z T ] and b = [0; b 2 ; . . . ; b T ] be column vectors (note that we use 0 instead of b 1 ). Then Equation 8 is:

H = [L • exp (L [B • L ])] Z •H (9)

where L, L ∈ R T ×T are lower and strictly lower triangular matrices of 1's, respectively, • is elementwise multiplication, and B is a matrix where

4 Related Work

QRN is inspired by RNN-based models with gating mechanism, such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) . While GRU and LSTM use the previous hidden state and the current input to obtain the candidate hidden state, QRN only uses the current two inputs to obtain the candidate reduced query (equivalent to candidate hidden state). We conjecture that this not only gives computational advantage via parallelization, but also makes training easier, i.e., avoiding vanishing gradient (which is critical for long-term dependency), overfitting (by simplifying the model), and converging to local minima.

The idea of structurally simplifying (constraining) RNNs for learning longer-term patterns has been explored in recent previous work, such as Structurally Constrained Recurrent Network (Mikolov et al., 2015) and Strongly-Typed Recurrent Neural Network (STRNN) (Balduzzi and Ghifary, 2016) . QRN is similar to STRNN in that both architectures use gating mechanism, and the gates and the candidate hidden states do not depend on the previous hidden states, which simplifies the recurrent relation. However, QRN can be distinguished from STRNN in three ways. First, QRN's update gate simulates attention mechanism, measuring the relevance between the input sentence and query. On the other hand, the gates in STRNN can be considered as the simplification of LSTM/GRU by removing their dependency on previous hidden state. Second, QRN is an RNN that is natively compatible with context-based QA tasks, where the QRN unit accepts two inputs, i.e. each context sentence and query. This is distinct from STRNN which has only one input. Third, we show that QRN is timewise-parallelizable on GPUs. Our parallelization algorithm is also applicable to STRNN.

End-to-end Memory Network (N2N) (Sukhbaatar et al., 2015) uses external memory with multi-layer attention mechanism to focus on sentences that are relevant to the question. There are two key differences between N2N and our QRN. First, N2N summarizes the entire memory in each layer to control the attention in the next layer (circle nodes in Figure 2b ). Instead, QRN does not have any controller node (Figure 2a ) and is able to focus on relevant sentences through the update gate that is internally embodied within its unit. Second, N2N adds time-dependent trainable weights to the sentence representations to model the time dependency of the sentences (as discussed in Section 1). QRN does not need such additional weights as its inherent RNN architecture allows QRN to effectively model the time dependency. Neural Reasoner (Peng et al., 2015) and Gated End-toend Memory Network (Perez and Liu, 2016) ) are variants of MemN2N that share its fundamental characteristics.

Improved Dynamic Memory Network (DMN+) (Xiong et al., 2016) uses the hybrid of the attention mechanism and the RNN architecture to model the sequence of sentences. It consists of two distinct GRUs, one for the time axis (rectangle nodes in Figure 2c ) and one for the layer axis (circle nodes in Figure 2c ). Note that the update gate of the GRU for the time axis is replaced with external softmax attention weights. DMN+ uses the time-axis GRU to summarizes the entire memory in each layer, and then the layer-axis GRU controls the attention weights in each layer. In contrast, QRN is simply a single recurrent unit without any controller node.

5 Experiments

5.1 DATA bAbI story-based QA dataset bAbI story-based QA dataset is composed of 20 different tasks (Appendix A), each of which has 1,000 (1k) synthetically-generated story-question pair. A story can be as short as two sentences and as long as 200+ sentences. A system is evaluated on the accuracy of getting the correct answers to the questions. The answers are single words or lists (e.g. "football, apple"). Answering questions in each task requires selecting a set of relevant sentences and applying different kinds of logical reasoning over them. The dataset also includes 10k training data (for each task), which allows training more complex models. Note that DMN+ (Xiong et al., 2016) only reports on the 10k dataset.

Table 1: (top) bAbI QA dataset (Weston et al., 2016): number of failed tasks and average error rates (%). † is obtained from github.com/therne/dmn-tensorflow. (bottom) bAbI dialog and DSTC2 dialog dataset (Bordes and Weston, 2016) average error rates (%) of QRN and previous work (LSTM, N2N, DMN+, GMemN2N, and DNC). For QRN, the first number (1, 2, 3) indicates the number of layers, ‘r’ means the reset gate is used, and the last number (100, 200), if exists, indicates the dimension of the hidden state, where the default value is 50. ‘+’ indicates that ‘match’ (See Appendix for details) is used. The task-wise results are shown in Appendices: Table 2 (bAbI QA) and Table 3 (dialog datasets). See Section 5.3 for details.

Table 2: bAbI QA dataset (Weston et al., 2016) error rates (%) of QRN and previous work: LSTM (Weston et al., 2016), End-to-end Memory Networks (N2N) (Sukhbaatar et al., 2015), Dynamic Memory Networks (DMN+) (Xiong et al., 2016), Gated End-to-end Memory Networks(GMemN2N) (Perez and Liu, 2016). Results within each task of Differentiable Neural Computer(DNC) were not provided in its paper Graves et al. (2016)). For QRN, a number in the front (1, 2, 3, 6) indicates the number of layers. A number in the back (200) indicates the dimension of hidden vector, while the default value is 50. ‘r’ indicates that the reset gate is used, and ‘v’ indicates that the gates were vectorized. ‘*’ indicates joint training.

bAbI dialog dataset bAbI dialog dataset consists of 5 different tasks (Table 3) , each of which has 1k synthetically-generated goal-oriented dialogs between a user and the system in the domain of restaurant reservation. Each dialog is as long as 96 utterances and comes with external knowledge base (KB) providing information of each restaurant. The authors also provide Out-Of-Vocabulary (OOV) version of the dataset, where many of the words and KB keywords in test data are not seen during training. A system is evaluated on the accuracy of its response to each utterance of the user, choosing from up to 2500 possible candidate responses. A system is required not only to understand the user's request but also refer to previous conversations in order to obtain the context information of the current conversation. transformed the Second Dialog State Tracking Challenge (DSTC2) dataset (Henderson et al., 2014) into the same format as the bAbI dialog dataset, for the measurement of performance on a real dataset. Each dialog can be as long as 800+ utterances, and a system needs to choose from 2407 possible candidate responses for each utterance of the user. Note that the evaluation metric of the original DSTC2 is different from that of the transformed DSTC2, so previous work on the original DSTC2 should not be directly compared to our work. We will refer to this transformed DSTC2 dataset by "Task 6" of dialog dataset.

Table 3: bAbI dialog and DSTC2 dialog dataset (Bordes and Weston, 2016) average error rates (%) of QRN and previous work: End-to-end Memory Networks(N2N (Bordes and Weston, 2016)) and Gated End-to-end Memory Networks(GMemN2N (Perez and Liu, 2016)). For QRN, a number in the front (1, 2, 3, 6) indicates the number of layers and a number in the back (100) indicates the dimension of hidden vector, while the default value is 50. ‘r’ indicates that the reset gate is used, ‘v’ indicates that the gates were vectorized, and ‘+’ indicates that ‘match’ was used.

5.2 Model Details

Input Module. In the input module, we are given sentences (previous conversations in dialog) x t and a question (most recent user utterance) q, and we want to obtain their vector representations, x t , q ∈ R d . We use a trainable embedding matrix A ∈ R d×V to encode the one-hot vector of each word x tj in each sentence x t into a d-dimensional vector x tj ∈ R d . Then the sentence representation x t is obtained by Position Encoder . The same encoder with the same embedding matrix is also used to obtain the question vector q from q.

Output Module for story-based QA. In the output module, we are given the vector representation of the predicted answerŷ and we want to obtain the natural language form of the answer,ŷ. We use a V -way single-layer softmax classifier to mapŷ to a V -dimensional sparse vector,v = softmax W (y)ŷ ∈ R V , where W (y) ∈ R V ×d is a weight matrix. Then the final answerŷ is simply the argmax word inv. To handle questions with multiple-word answers, we consider each of them as a single word that contains punctuations such as space and comma, and put it in the vocabulary.

Output Module for dialog. We use a fixed number single-layer softmax classifiers, each of which is similar to that of the sotry-based QA model, to sequentially output each word of the system's response. While it is similar in spirit to the RNN decoder (Cho et al., 2014) , our output module does not have a recurrent hidden state or gating mechanism. Instead, it solely uses the final ouptut of the QRN,ŷ, and the current word output to influence the prediction of the next word among possible candidates.

Training. We withhold 10% of the training for development. We use the hidden state size of 50 by deafult. Batch sizes of 32 for bAbI story-based QA 1k, bAbI dialog and DSTC2 dialog, and 128 for bAbI QA 10k are used. The weights in the input and output modules are initialized with zero mean and the standard deviation of 1/ √ d. Weights in the QRN unit are initialized using techniques by Glorot and Bengio (2010) , and are tied across the layers. Forget bias of 2.5 is used for update gates (no bias for reset gates). L2 weight decay of 0.001 (0.0005 for QA 10k) is used for all weights. The loss function is the cross entropy betweenv and the one-hot vector of the true answer. The loss is minimized by stochastic gradient descent for maximally 500 epochs, but training is early stopped if the loss on the development data does not decrease for 50 epochs. The learning rate is controlled by AdaGrad (Duchi et al., 2011) with the initial learning rate of 0.5 (0.1 for QA 10k). Since the model is sensitive to the weight initialization, we repeat each training procedure 10 times (50 times for 10k) with the new random initialization of the weights and report the result on the test data with the lowest loss on the development data.

5.3 Results.

We compare our model with baselines and previous state-of-the-art models on story-based and dialog tasks (Table 1) . These include LSTM (Hochreiter and Schmidhuber, 1997) , End-to-end Memory Networks (N2N) (Sukhbaatar et al., 2015) , Dynamic Memory Networks (DMN+) (Xiong et al., 2016) , Gated End-to-end Memory Networks (GMemN2N) (Perez and Liu, 2016) , and Differentiable Neural Computer (DNC) (Graves et al., 2016) .

Story-based QA. Table 1(top) reports the summary of results of our model (QRN) and previous work on bAbI QA (task-wise results are shown in Table 2 in Appendix). In 1k data, QRN's '2r' (2 layers + reset gate + d = 50) outperforms all other models by a large margin (2.8+%). In 10k dataset, the average accuracy of QRN's '6r200' (6 layers + reset gate + d = 200) model outperforms all previous models by a large margin (2.5+%), achieving a nearly perfect score of 99.7%.

Dialog. Table 1(bottom) reports the summary of the results of our model (QRN) and previous work on bAbI dialog and Task 6 dialog (task-wise results are shown in Table 3 in Appendix). As done in previous work Perez and Liu, 2016) , we also report results when we use 'Match' for dialogs. 'Match' is the extension to the model which additionally takes as input whether each answer candidate matches with context (more details on Appendix). QRN outperforms previous work by a large margin (2.0+%) in every comparison.

Ablations. We test four types of ablations (also discussed in Section 2.2): number of layers (1, 2, 3, or 6), reset gate (r), and gate vectorization (v) and the dimension of the hidden vector (50, 100). We show a subset of combinations of the ablations for bAbI QA in Table 1 and Table 2 ; other combinations performed poorly and/or did not give interesting observations. According to the ablation results, we infer that: (a) When the number of layers is only one, the model lacks reasoning capability. In the case of 1k dataset, when there are too many layers (6), it seems correctly training the model becomes increasingly difficult. In the case of 10k dataset, many layers (6) and hidden dimensions (200) helps reasoning, most notably in difficult task such as task 16. (b) Adding the reset gate helps. (c) Including vector gates hurts in 1k datasets, as the model either overfits to the training data or converges to local minima. On the other hand, vector gates in bAbI story-based QA 10k dataset sometimes help. (d) Increasing the dimension of the hidden state to 100 in the dialog's Task 6 (DSTC2) helps, while there is not much improvement in the dialog's Task 1-5. It can be hypothesized that a larger hidden state is required for real data. : number of failed tasks and average error rates (%). † is obtained from github.com/therne/dmn-tensorflow. (bottom) bAbI dialog and DSTC2 dialog dataset average error rates (%) of QRN and previous work (LSTM, N2N, DMN+, GMemN2N, and DNC) . For QRN, the first number (1, 2, 3) indicates the number of layers, 'r' means the reset gate is used, and the last number (100, 200), if exists, indicates the dimension of the hidden state, where the default value is 50. '+' indicates that 'match' (See Appendix for details) is used. The task-wise results are shown in Appendices:

− → r 1 ← − r 1 z 2 resto-paris-expen-frech-8stars?

0.00 1.00 0.96 0.91 Do you have something else? 0.41 0.99 0.00 0.00 Sure let me find another option.

1.00 0.00 0.00 0.12 resto-paris-expen-frech-5stars? 0.00 1.00 0.96 0.91 No this does not work for me. 0.00 0.00 0.14 0.00 Sure let me find an other option. 1.00 0.00 0.00 0.12 What do you think of this? resto-paris-expen-french-4stars Parallelization. We implement QRN with and without parallelization in TensorFlow (Abadi et al., 2016 ) on a single Titan X GPU to qunaitify the computational gain of the parallelization. For QRN without parallelization, we use the RNN library provided by TensorFlow. QRN with parallelization gives 6.2 times faster training and inference than QRN without parallelization on average. We expect that the speedup can be even higher for datasets with larger context.

Layer 1 Layer 2 Task 6: DSTC2 dialog z 1 − → r 1 ← − r 1 z 2 Spanish

Interpretations. An advantage of QRN is that the intermediate query updates are interpretable. Figure 1 shows intermediate local queries (q k t ) interpreted in natural language, such as "Where is Sandra?". In order to obtain these, we place a decoder on the input question embedding q and add its loss for recovering the question to the classification loss (similarly to Peng et al. (2015) ). We then use the same decoder to decode the intermediate queries. This helps us understand the flow of information in the networks. In Figure 1 , the question Where is apple? is transformed into Where is Sandra? at t = 1. At t = 2, as Sandra dropped the apple, the apple is no more relevant to Sandra. We obtain Where is Daniel? at time t = 3, and it is propagated until t = 5, where we observe a sentence (fact) that can be used to answer the query.

Visualization. Figure 3 shows vizualization of the (scalar) magnitudes of update and reset gates on story sentences and dialog utterances. More visualizations are shown in Appendices: Figure 4 and Figure 5 . In Figure 3 , we observe high values on facts that provide information to answer question (the system's next utterance for dialog). In QA Task 2 example (top left), we observe high update gate values in the first layer on facts that state who has the apple, and in the second layer, the high update gate values are on those that inform where that person went to. We also observe that the forward reset gate at t = 2 in the first layer ( − → r 1 2 ) is low, which is signifying that apple no more belongs to Sandra. In dialog Task 3 (bottom left), the model is able to infer that three restaurants are already recommended so that it can recommend another one. In dialog Task 6 (bottom), the model focuses on the sentences containing Spanish, and does not concentrate much on other facts such as I don't care.

Figure 3: (top) bAbI QA dataset (Weston et al., 2016) visualization of update and reset gates in QRN ‘2r’ model (bottom two) bAbI dialog and DSTC2 dialog dataset (Bordes and Weston, 2016) visualization of update and reset gates in QRN ‘2r’ model. Note that the stories can have as many as 800+ sentences; we only show part of them here. More visualizations are shown in Figure 4 (bAbI QA) and Figure 5 (dialog datasets).

Figure 4: Visualization of update and reset gates in QRN ‘2r’ model for on several tasks of bAbI QA (Table 2). We do not put reset gate in the last layer. Note that we only show some of recent sentences here, though the stories can have as many as 200+ sentences.

Figure 5: Visualization of update and reset gates in QRN ‘2r’ model for on several tasks of bAbI dialog and DSTC2 dialog (Table 3). We do not put reset gate in the last layer. Note that we only show some of recent sentences here, even the dialog has more sentences.

6 Conclusion

In this paper, we introduce Query-Reduction Network (QRN) to answer context-based questions and carry out conversations with users that require multi-hop reasoning. We show the state-of-theart results in the three datasets of story-based QA and dialog. We model a story or a dialog as a sequence of state-changing triggers and compute the final answer to the question or the system's next utterance by recurrently updating (or reducing) the query. QRN is situated between the attention mechanism and RNN, effectively handling time dependency and long-term dependency problems of each technique, respectively. It addresses the long-term dependency problem of most RNNs by simplifying the recurrent update, in which the candidate hidden state (reduced query) does not depend on the previous state. Moreover, QRN can be parallelized and can address the well-known problem of RNN's vanishing gradients.

A Task-Wise Results

Here we provide detailed per-task breakdown of our results in QA (Table 2 ) and dialog datasets (Table 3) . error rates (%) of QRN and previous work: LSTM , End-to-end Memory Networks (N2N) (Sukhbaatar et al., 2015) , Dynamic Memory Networks (DMN+) (Xiong et al., 2016) , Gated End-to-end Memory Networks(GMemN2N) (Perez and Liu, 2016) . Results within each task of Differentiable Neural Computer(DNC) were not provided in its paper Graves et al. (2016) Table 3 : bAbI dialog and DSTC2 dialog dataset average error rates (%) of QRN and previous work: End-to-end Memory Networks(N2N ) and Gated End-to-end Memory Networks(GMemN2N (Perez and Liu, 2016) ). For QRN, a number in the front (1, 2, 3, 6) indicates the number of layers and a number in the back (100) indicates the dimension of hidden vector, while the default value is 50. 'r' indicates that the reset gate is used, 'v' indicates that the gates were vectorized, and '+' indicates that 'match' was used.

B Vector Gate Parallelization

For vector gates, we have z t ∈ R d instead of z t ∈ R. Therefore the following equation replaces Equation 7:

EQUATION (10): Not extracted; please refer to original document.

where z j k is the k-th column vector of z j . Let b ij = log(1 − z i j ) for brevity. Then, we can rewrite Equation 8 as following:

EQUATION (11): Not extracted; please refer to original document.

Let H = [h 1 ; . . . ; h T ] be a T -by-d matrix where the transposes ( ) of the column vectors h t are concatenated across row. We similarly defineH fromh t . Also, let z = [z 1 ; . . . ; z T ], and B d be a T -by-T matrix where T [0; b 2d ; . . . ; b T d ]'s are tiled across the column.

Then Equation 11 is:

EQUATION (12): Not extracted; please refer to original document.

where L, L ∈ R T ×T are lower and strictly lower triangular matrices of 1's are tiled across the column. Z = [z 1 , . . . , z d ] ∈ R T ×d .

C Model Details

Match. While similar in spirit, our 'Match' model is slightly different from previous work (Bordes and Perez and Liu, 2016) . We use answer candidate embedding matrix, and add 2 dimension of 0-1 matrix which expresses whether the answer candidate matches with any word in the paragraph and the question. In other words, the softmax is computed bŷ v = softmax W[W (y) ; M (y) ]ŷ ∈ R V , where W ∈ R d×d and W (y) ∈ R V ×(d−2) are trainable weight matrices, and M (y) ∈ R V ×2 is the 0-1 match matrix.

D Visualizations

Visualization of Story-based QA. Figure 4 shows visualization of models for story-based QA tasks.

In the task 3 (left), the model focuses on the facts that contain 'football' in the first layer, and found out where Mary journeyed to before the bathroom in the second layer. In task 7 (right), the model focuses on the facts that provide information about the location of Sandra.

& & & " &

This mechanism is akin to logic regression in situation calculus(Reiter, 2001).