Natural Language to Structured Query Generation via Meta-Learning


Abstract

In conventional supervised training, a model is trained to fit all the training examples. However, having a monolithic model may not always be the best strategy, as examples could vary widely. In this work, we explore a different learning protocol that treats each example as a unique pseudo-task, by reducing the original learning problem to a few-shot meta-learning scenario with the help of a domain-dependent relevance function. When evaluated on the WikiSQL dataset, our approach leads to faster convergence and achieves 1.1%–5.4% absolute accuracy gains over the non-meta-learning counterparts.

1 Introduction

Conventional supervised training is a pervasive paradigm for NLP problems. In this setting, a model is trained to fit all the training examples and their corresponding targets. However, while sharing the same surface form of the prediction task, examples of the same problem may vary widely. For instance, recognizing textual entailment is a binary classification problem on whether the hypothesis follows a given textual statement, but the challenge datasets consist of a huge variety of inference categories and genres (Dagan et al., 2013; Williams et al., 2017). Similarly, for a semantic parsing problem that maps natural language questions to SQL statements, the number of conditions in a SQL query or the length of a question can vary substantially (Zhong et al., 2017).

The inherently high variety of the examples suggests an alternative training protocol: instead of learning a monolithic, one-size-fits-all model, it could be more effective to learn multiple models, where each one is designed for a specific "task" that covers a group of similar examples. However, this strategy faces at least two difficulties. As the number of tasks increases, each task has far fewer training examples for learning a robust model. In addition, the notion of "task", namely the grouping of examples, is typically not available in the dataset.

In this work, we explore this alternative learning setting and address the two difficulties by adapting the meta-learning framework. Motivated by the few-shot learning scenario (Andrychowicz et al., 2016; Ravi and Larochelle, 2016; Vinyals et al., 2016), meta-learning aims to learn a general model that can quickly adapt to a new task given very few examples, without retraining the model from scratch (Finn et al., 2017). We extend this framework by effectively creating pseudo-tasks with the help of a relevance function. During training, each example is viewed as the test example of an individual "task", where its top-K relevant instances are used as the training examples for this specific task. A general model is trained for all tasks in aggregation. Similarly, during testing, instead of applying the general model directly, the top-K relevant instances (in the training set) to the given test example are first selected to update the general model, which then makes the final prediction.

When empirically evaluated on a recently proposed, large semantic parsing dataset, WikiSQL (Zhong et al., 2017), our approach leads to faster convergence and achieves 1.1%-5.4% absolute accuracy gains over the non-meta-learning counterparts, establishing a new state-of-the-art result. More importantly, we demonstrate how to design a relevance function to successfully reduce a regular supervised learning problem to a meta-learning problem. To the best of our knowledge, this is the first successful attempt at adapting meta-learning to a semantic parsing task.

2 Background: Meta-Learning

Our work is built on the recently proposed Model-Agnostic Meta-Learning (MAML) framework (Finn et al., 2017), which we describe briefly here. MAML aims to learn the learners (for the tasks) and the meta-learner in the few-shot meta-learning setup (Vinyals et al., 2016; Andrychowicz et al., 2016; Ravi and Larochelle, 2016). Formally, it considers a model represented by a function f_θ with parameters θ. When the model adapts to a new task T_i, its parameters change from θ to θ_i, where a task contains K training examples and one or more test examples (K-shot learning). MAML computes θ_i by one or a few rounds of gradient descent on the training examples of task T_i. For example, with one gradient update,

θ_i = θ − α ∇_θ L_{T_i}(f_θ),

where the step size α is a hyper-parameter, and L_{T_i}(f_θ) is a loss function that evaluates the error between the prediction f_θ(x^(j)) and the target y^(j), with (x^(j), y^(j)) an input/output pair sampled from the training examples of task T_i. Model parameters θ are trained to optimize the performance of f_{θ_i} on the unseen test examples from T_i across tasks. The meta-objective is:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ_i}) = Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ − α ∇_θ L_{T_i}(f_θ)})

The goal of MAML is to optimize the model parameters θ such that the model can learn to adapt to new tasks with parameters θ_i via a few gradient steps on the training examples of those tasks. The model is improved by considering how the test error on unseen test data from T_i changes with respect to the parameters. The meta-objective across tasks is optimized using stochastic gradient descent (SGD). The model parameters θ are updated as follows:

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ_i}),

where β is the meta step size.
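To make the inner and outer updates concrete, here is a minimal, self-contained sketch of one MAML meta-update for a toy linear-regression learner. It is illustrative only and not the paper's implementation; consistent with the training setup described in Appendix B, the outer update uses a first-order approximation (the test-example gradient is taken at the adapted parameters, without back-propagating through the inner update).

```python
import numpy as np

def loss_and_grad(theta, X, y):
    """Mean squared error of a linear model f_theta(x) = X @ theta, and its gradient."""
    err = X @ theta - y
    return np.mean(err ** 2), 2 * X.T @ err / len(y)

def maml_step(theta, tasks, alpha=0.01, beta=0.1):
    """One meta-update over a batch of tasks.

    Each task is (X_support, y_support, X_test, y_test). The inner update adapts
    theta on the support set; the outer update accumulates the test-example
    gradient taken at the adapted parameters (first-order approximation).
    """
    meta_grad = np.zeros_like(theta)
    for X_s, y_s, X_t, y_t in tasks:
        _, g_support = loss_and_grad(theta, X_s, y_s)
        theta_i = theta - alpha * g_support           # inner (task-level) update
        _, g_test = loss_and_grad(theta_i, X_t, y_t)  # evaluated on the task's test example
        meta_grad += g_test
    return theta - beta * meta_grad                   # outer (meta) update

# toy usage: two pseudo-tasks for a 3-dimensional linear model
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
tasks = []
for _ in range(2):
    X_s, X_t = rng.normal(size=(5, 3)), rng.normal(size=(1, 3))
    w = rng.normal(size=3)
    tasks.append((X_s, X_s @ w, X_t, X_t @ w))
theta = maml_step(theta, tasks)
```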

3 Approach

As discussed in Section 1, to reduce traditional supervised learning to a few-shot meta-learning problem, we introduce a relevance function, which effectively helps group examples into pseudo-tasks. Because the relevance function is problem-dependent, we first describe the semantic parsing problem below, followed by the design of our relevance function and the complete algorithm.

3.1 The Semantic Parsing Task

The specific semantic parsing problem we study in this work is to map a natural language question to a SQL query, which can be executed against a given table to find the answer to the original question. In particular, we use WikiSQL (Zhong et al., 2017), currently the largest natural-language-to-SQL dataset, to develop our model and to conduct the experiments.

3.2 Relevance Function

The intuition behind the design of a relevance function is that examples of the same type should receive higher scores. For the question-to-SQL problem, we design a simple relevance function that depends on (1) the predicted type of the corresponding SQL query and (2) the question length.

There are six SQL types in the WikiSQL dataset: {Count, Min, Max, Sum, Avg, Select}. We train a SQL type classifier f_sql using an SVM with bag-of-words features of the input question, which achieves 93.5% training accuracy and 88% test accuracy in SQL type prediction. Another soft indication of whether two questions can be viewed as belonging to the same "task" is their lengths, as they correlate with the lengths of the mapped SQL queries. The length of a question is the number of tokens it contains after normalizing entity mentions (phrases in the question that match table cells) to single tokens. Our relevance function only considers examples with the same predicted SQL type. If examples x^(i) and x^(j) have the same predicted SQL type, their relevance score is 1 − |q_len(x^(i)) − q_len(x^(j))|, where q_len computes the question length. Notice that the relevance function does not need to be highly accurate, as there is no formal definition of which examples should be grouped into the same pseudo-task. A heuristic-based function that encodes some domain knowledge typically works well in our preliminary study. In principle, the relevance function can also be learned jointly with the meta-learning model, which we leave for future work.
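Below is a minimal sketch of how such a relevance function could be assembled. The SQL-type classifier is a bag-of-words linear SVM as described above, but the tiny training set, the simplified q_len (which skips entity normalization), and the use of -inf for examples of different predicted types are illustrative placeholders rather than details from the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the SQL-type classifier f_sql; the paper fits it on WikiSQL questions.
train_questions = [
    "how many teams scored more than 10 goals",
    "what is the highest attendance at the ground",
    "which player won the award in 1998",
    "what is the total number of gold medals",
]
train_types = ["Count", "Max", "Select", "Sum"]

vectorizer = CountVectorizer()
f_sql = LinearSVC().fit(vectorizer.fit_transform(train_questions), train_types)

def q_len(question):
    # Question length in tokens; the paper first collapses entity mentions
    # (phrases matching table cells) to single tokens, which is omitted here.
    return len(question.split())

def relevance(x_i, x_j):
    """Relevance of two questions: only pairs with the same predicted SQL type are
    considered, and closer question lengths give higher scores (1 - |len_i - len_j|)."""
    type_i, type_j = f_sql.predict(vectorizer.transform([x_i, x_j]))
    if type_i != type_j:
        return float("-inf")   # different predicted types: never grouped together
    return 1 - abs(q_len(x_i) - q_len(x_j))
```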

3.3 Algorithm

Given a relevance function, the adaptation of meta-learning using the MAML framework is summarized in Algorithm 1. For each training example x^(j), we create a pseudo-task T_j using its top-K relevant examples as the support set S_K^(j) and the example itself as the task's test example D_j = (x^(j), y^(j)) (Step 1). The remaining steps of the algorithm mimic the original MAML design, updating the task-level models (Step 8) and the meta-level, general model (Step 11) using gradient descent.

Algorithm 1: Proposed Algorithm
Require: training datapoints D = {(x^(j), y^(j))}
Require: α, β: step size hyperparameters
Require: K: support set size hyperparameter
1: Construct a task T_j for each training example, with a support set S_K^(j) (its top-K relevant examples) and a test example D_j = (x^(j), y^(j))
2: Denote p(T) as the distribution over tasks
3: Randomly initialize θ
4: while not done do
5:   Sample a batch of tasks T_i ∼ p(T)
6:   for all T_i do
7:     Evaluate ∇_θ L_{T_i}(f_θ) using S_K^(i)
8:     Compute adapted parameters with gradient descent: θ_i = θ − α ∇_θ L_{T_i}(f_θ)
9:     Use the test example D_i from T_i for the meta-update
10:  end for
11:  Update θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ_i}) using each D_i and L_{T_i}
12: end while
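As a complement to Algorithm 1, the following sketch shows Step 1, the construction of pseudo-tasks, plus the analogous support-set retrieval used at test time. The function names and the dictionary layout are assumptions made for illustration; `dataset` is a list of (question, SQL) pairs and `relevance` is a scoring function like the one in Section 3.2.

```python
def build_pseudo_tasks(dataset, relevance, K=2):
    """Step 1 of Algorithm 1: one pseudo-task per training example, pairing a
    support set of the K most relevant *other* examples with the example itself
    as the task's test point."""
    tasks = []
    for j, (q_j, sql_j) in enumerate(dataset):
        scored = [(relevance(q_j, q_i), i) for i, (q_i, _) in enumerate(dataset) if i != j]
        scored.sort(key=lambda s: s[0], reverse=True)
        support = [dataset[i] for _, i in scored[:K]]
        tasks.append({"support": support, "test": (q_j, sql_j)})
    return tasks

def test_time_support(question, train_set, relevance, K=2):
    """At test time, the support set for a new question is retrieved from the
    training set before the general model is adapted (Section 1)."""
    order = sorted(range(len(train_set)),
                   key=lambda i: relevance(question, train_set[i][0]),
                   reverse=True)
    return [train_set[i] for i in order[:K]]
```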

4 Experiments

We apply our approach to the WikiSQL dataset. Details of the dataset and preprocessing steps can be found in Appendix A. Here we focus our discussion on the basic learner model and the results.

4.1 Learner Model

We use the model of Wang et al. (2018) as the learner in our meta-learning setup. The model is a grammar-aware Seq2Seq encoder-decoder model with attention. The encoder is a bidirectional LSTM that takes the concatenation of the table header (column names) of the queried table and the question as input to learn a joint representation. The decoder is another LSTM with an attention mechanism. There are three output layers corresponding to three decoding types, each of which restricts the vocabulary the decoder can sample from at that decoding step. The three decoding types are defined as follows:

• τ_V (SQL operator): The output must be a SQL operator, i.e., a terminal from V = {Select, From, Where, Id, Max, Min, Count, Sum, Avg, And, =, >, ≥, <, ≤, , }.
• τ_C (column name): The output must be a column name, which is copied from either the table header or the question section of the input sequence. Note that the column required for the correct SQL output may or may not be mentioned explicitly in the question.
• τ_Q (constant value): The output is a constant copied from the question section of the input sequence.

The grammar of SQL expressions in the WikiSQL dataset can be described by the regular expression "Select f c From t Where (c op v)*", where f is an aggregation function, c a column name, t the table name, op a comparator, and v a value. This form corresponds to the decoding-type sequence τ_V τ_V τ_C τ_V τ_C τ_V (τ_C τ_V τ_Q)*, which ensures that only tokens of the correct decoding type can be sampled at each decoding step.
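The sketch below illustrates the idea of decoding-type constraints with a per-type token mask and a regular-expression check on the decoding-type sequence. It is a simplification: the actual model of Wang et al. (2018) enforces these constraints through its three output layers, and the operator set shown here is only indicative.

```python
import re

# Illustrative per-type token masks; the real model restricts each of its three
# output layers to the corresponding vocabulary instead.
SQL_OPERATORS = {"select", "from", "where", "id", "max", "min", "count", "sum", "avg",
                 "and", "=", ">", ">=", "<", "<="}

def allowed_tokens(decoding_type, column_names, question_tokens):
    """Tokens that may be emitted at a decoding step of the given type."""
    if decoding_type == "V":      # SQL operator
        return SQL_OPERATORS
    if decoding_type == "C":      # column name, copied from the header or the question
        return set(column_names) | set(question_tokens)
    if decoding_type == "Q":      # constant value, copied from the question
        return set(question_tokens)
    raise ValueError(f"unknown decoding type: {decoding_type}")

# Decoding-type sequence for "Select f c From t Where (c op v)*"
TYPE_PATTERN = re.compile(r"VVCVCV(?:CVQ)*")

def valid_type_sequence(types):
    """Check that a decoding-type string such as 'VVCVCVCVQ' is grammatical."""
    return TYPE_PATTERN.fullmatch(types) is not None
```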

The authors of Wang et al. (2018) propose three cross-entropy based loss functions: "Pointer loss", the cross-entropy between the target index and the chosen index; "Max loss", which computes the probability of copying a token v in the input as the maximum probability over pointers that point to v; and "Sum loss", which computes the probability of copying a token v as the sum of the probabilities over pointers that point to v. We refer readers to Wang et al. (2018) for more details. The detailed hyper-parameter setup is described in Appendix B.

Table 1 shows the experimental results of our model on the WikiSQL dataset. We select the model based on the best logical form accuracy on the development set, and compare our results to the augmented pointer network and the Seq2SQL model (with RL) of Zhong et al. (2017). We report both logical form accuracy (denoted Acc_lf), which checks for an exact SQL syntax match, and execution accuracy (denoted Acc_ex), which compares SQL execution results. We compare our approach with its non-meta-learning counterparts trained with the "Pointer loss", "Max loss", and "Sum loss" objectives of Wang et al. (2018). Our model achieves 1.1%-5.3% and 1.2%-5.4% gains on test set logical form and execution accuracy, respectively. Detailed error analysis between the "Sum loss" and "Meta + Sum loss" models can be found in Appendix C.
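To make the difference between the "Max loss" and "Sum loss" copy objectives concrete, here is a small illustrative sketch; the "Pointer loss" (standard cross-entropy on a single target index) is omitted, and the toy pointer distribution is made up for the example.

```python
import numpy as np

def copy_loss(pointer_probs, input_tokens, target_token, mode="sum"):
    """Cross-entropy of copying `target_token`, where pointer_probs[i] is the
    probability of pointing at input position i (probabilities sum to 1).

    "Max loss" scores the copy with the maximum pointer probability over positions
    holding the target token; "Sum loss" uses the sum of those probabilities.
    """
    probs = [p for p, tok in zip(pointer_probs, input_tokens) if tok == target_token]
    p_copy = max(probs) if mode == "max" else sum(probs)
    return -np.log(p_copy)

# toy example: the token "goorambat" appears twice in the input
tokens = ["min", "losses", "goorambat", "wins", "goorambat"]
probs = [0.1, 0.1, 0.4, 0.1, 0.3]
print(copy_loss(probs, tokens, "goorambat", mode="max"))  # -log(0.4)
print(copy_loss(probs, tokens, "goorambat", mode="sum"))  # -log(0.7)
```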

Table 1: Experimental Results on the WikiSQL dataset, where Acclf represents the logical form accuracy and Accex represents the SQL execution accuracy. “Pointer loss”, “Max loss”, and “Sum loss” are the non-meta-learning counterpart from Wang et al. (2018). “Meta + X” denotes the metalearning model with learner “X”.

4.2 Results

We also investigate the training and development set logical form accuracy over different epochs for the "Meta + Sum loss" and "Sum loss" models. The results are shown in Figure 1. One interesting observation is that the "Meta + Sum loss" model converges much faster than the "Sum loss" model, especially in the first 10 epochs. We attribute this improvement to the ability to adapt to new tasks even with a small number of training examples.

Figure 1: Logical form accuracy comparison, where “meta train” and “meta dev” are the train and development set accuracy using the “Meta + Sum loss” model, “train” and “dev” are the train and development set accuracy using the “Sum loss” model (Wang et al., 2018).

5 Related Work

Meta-Learning One popular direction of meta-learning (Thrun and Pratt, 1998; Schmidhuber, 1987; Naik and Mammone, 1992) is to train a meta-learner that learns how to update the parameters of the learner's model (Bengio et al., 1992; Schmidhuber, 1992). This direction has been applied to learning to optimize deep neural networks (Hochreiter et al., 2001; Andrychowicz et al., 2016; Li and Malik, 2017; Ha et al., 2017). Few-shot learning methods have also adapted meta-learning approaches for image recognition (Koch, 2015; Ravi and Larochelle, 2016; Vinyals et al., 2016) and reinforcement learning (Finn et al., 2017). Given that the few-shot learning setup cannot directly work in standard supervised learning problems, we explore reducing a regular supervised learning problem to the few-shot meta-learning scenario by creating pseudo-tasks with a relevance function.

Semantic Parsing Mapping natural language to logical forms has been actively studied in natural language processing research (Zettlemoyer and Collins, 2005; Artzi and Zettlemoyer, 2011; Berant et al., 2013; Yih et al., 2014, 2015; Wang et al., 2015; Golub and He, 2016; Iyer et al., 2017). However, unlike conventional approaches, which fit one model to all training examples, the proposed approach learns to adapt to new tasks. By using the support set selected by the relevance function, the proposed model can adapt to a unique model for each example.

Program Induction / Synthesis Program induction (Reed and De Freitas, 2016; Neelakantan et al., 2015; Graves et al., 2014; Yin et al., 2015) aims to infer latent programs given input/output examples, while program synthesis models (Zhong et al., 2017; Parisotto et al., 2017) aim to generate explicit programs and then execute them to get output. The learner model we used in this work follows the line of program synthesis models and trains on pairs of natural language (question) and program (SQL) directly.

6 Conclusion

In this paper, we propose a new learning protocol that reduces a regular supervised learning problem to the few-shot meta-learning scenario. This is done by effectively creating pseudo-tasks with the help of a relevance function. When evaluated on the newly released, large semantic parsing dataset, WikiSQL, our approach leads to faster convergence and enjoys 1.1%-5.4% absolute accuracy gains over the non-meta-learning counterparts, achieving a new state-of-the-art result.

While the initial finding is encouraging, we believe the potential of this meta-learning framework has not yet been fully realized. In the future, we plan to explore more variations of the meta-learning setup, such as using different relevance functions, including the ones that are jointly learned. We also would like to understand this approach better by testing it on more natural language processing tasks.

A Dataset Details

We evaluate our model on the WikiSQL dataset (Zhong et al., 2017). We follow the data preprocessing of Wang et al. (2018), whose model we use as the learner in the meta-learning setup. Specifically, we first preprocess the dataset by running both tables and question-query pairs through Stanford Stanza (Manning et al., 2014) using the script included with the WikiSQL dataset, which normalizes punctuation and casing. We further normalize each question based on its corresponding table: for table entries and columns occurring in questions or queries, we normalize their format to be consistent with the table. After preprocessing, we filter the training set by removing pairs whose ground-truth SQL contains constants not mentioned in the question, as our model requires constants to be copied from the question. We train and tune our model only on the filtered training and development sets, but we report our evaluation on the full development and test sets. We obtain 59,845 (originally 61,297) training pairs, 8,928 (originally 9,145) development pairs, and 17,283 test pairs (the test set is not filtered).
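A minimal sketch of the filtering step described above, assuming constants are compared as lowercased strings against the normalized question; the exact matching rule used by the preprocessing script may differ.

```python
def keep_example(question, sql_constants):
    """Keep a question-query pair only if every constant in the gold SQL query
    appears verbatim in the (normalized, lowercased) question."""
    q = question.lower()
    return all(str(c).lower() in q for c in sql_constants)

# hypothetical usage on (question, constants) pairs extracted from the gold queries
pairs = [("when goorambat has less than 13 wins , what is the least amount of losses ?",
          ["goorambat", "13"]),
         ("who won the award ?", ["1998"])]
filtered = [p for p in pairs if keep_example(*p)]   # keeps only the first pair
```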

B Model Hyperparameters

We use the pre-trained n-gram embeddings of Hashimoto et al. (2017) (100 dimensions) and the GloVe word embeddings (100 dimensions) of Pennington et al. (2014); each token is thus embedded into a 200-dimensional vector. The encoder is a 3-layer bidirectional LSTM with hidden states of size 100, and the decoder is a 3-layer unidirectional LSTM with hidden states of size 100. The model is trained on question-query pairs with a batch size of 200 for 100 epochs. During training, we clip gradients at 5 and add gradient noise with η = 0.3, γ = 0.55 to stabilize training (Neelakantan et al., 2015). We found that the meta-learning model trains stably without back-propagating through second-order gradients. We set the support set size K to 2 for "Meta + Sum loss" based on the development set; empirically, the performance does not improve with larger K. We set the learning rates α = 0.001 and β = 0.1 based on the development set. The model is implemented in TensorFlow and trained using the Adagrad optimizer (Duchi et al., 2011).
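The gradient post-processing can be sketched as follows. It assumes clipping by gradient norm and the annealed noise schedule σ_t² = η / (1 + t)^γ of Neelakantan et al. (2015); whether the paper clips by value or by norm is not stated, so the norm-based variant here is an assumption.

```python
import numpy as np

def clip_and_noise(grad, step, clip_norm=5.0, eta=0.3, gamma=0.55, rng=None):
    """Clip the gradient norm at `clip_norm`, then add annealed Gaussian noise with
    variance eta / (1 + step)**gamma (Neelakantan et al., 2015)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    return grad + rng.normal(scale=sigma, size=grad.shape)
```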

C Error Analysis

We compare the logical form errors on the test set between the "Sum loss" model (Wang et al., 2018) and the proposed "Meta + Sum loss" model. Among the 17,283 test examples, there are 6,661 and 6,428 errors by "Sum loss" and "Meta + Sum loss", respectively, of which 5,190 errors are shared by both models. We examine the test examples where "Sum loss" is correct while "Meta + Sum loss" is not, and vice versa, shown in Figure 2. We observe that the differences are mainly at ground-truth SQL lengths 7 and 10, where the "Meta + Sum loss" model outperforms the "Sum loss" model by a large margin. We show some examples for the two cases below.

Figure 2: Logical form accuracy comparison, where “Meta + Sum loss (o), Sum loss (x)” indicates the generated SQL is incorrect by the “Sum loss” model and is correct by the “Meta + Sum loss” model. Similarly “Meta + Sum loss (x), Sum loss (o)” indicates the generated SQL is incorrect by the “Meta + Sum loss” model and is correct by the “Sum loss” model.

Example 1: Meta + Sum loss is correct and Sum loss is incorrect
• Table: 2-17982145-1, Header: [benalla dfl, wins, losses, draws, byes, against]
• Question: when benall dfl is benalla dfl goorambat with less than 13 wins , what is the least amount of losses ?
• Ground Truth: min losses benalla dfl = goorambat wins < 13
• Prediction (Sum loss): min losses wins = goorambat wins < 13

Example 4: Meta + Sum loss is incorrect and Sum loss is correct
• Table: 1-22546460-4, Header: [best male mc, best female mc, best male artist, best female artist, best male lyricist, best female lyricist, best male record]
• Question: who won the best female artist where best r&b contributor won the best male lyricist ?
• Ground Truth: select best female artist best male lyricist = best r&b contributor
• Prediction (Sum loss): select best female artist best male lyricist = best r&b contributor
• Support 1 stage winner = josé-manuel fuente
• Support 1 Table: 2-1676921-5, Header: [date, tournament, surface, partner, opponents in final, score in final]
• Support 1 Question: is the opponents in final in the match with a score in final of 4-6, 1-6 played on clay surface ?
• Support 1 Ground Truth: select opponents in final score in final = 4-6, 1-6 surface = clay
• Support 2 Table: 1-170958-2, Header: [official name, status, area km 2, population, census ranking]
• Support 2 Question: what is the land area of official name hopewell parish in km2 ?
• Support 2 Ground Truth: select area km 2 official name = hopewell
• Prediction (Meta + Sum loss): select distance (km) stage winner = josé-manuel fuente stage < 16
