Global Reasoning over Database Structures for Text-to-SQL Parsing

Abstract

State-of-the-art semantic parsers rely on auto-regressive decoding, emitting one symbol at a time. When tested against complex databases that are unobserved at training time (zero-shot), the parser often struggles to select the correct set of database constants in the new database, due to the local nature of decoding. In this work, we propose a semantic parser that globally reasons about the structure of the output query to make a more contextually informed selection of database constants. We use message-passing through a graph neural network to softly select a subset of database constants for the output query, conditioned on the question. Moreover, we train a model to rank queries based on the global alignment of database constants to question words. We apply our techniques to the current state-of-the-art model for SPIDER, a zero-shot semantic parsing dataset with complex databases, increasing accuracy from 39.4% to 47.4%.

1 Introduction

The goal of zero-shot semantic parsing (Krishnamurthy et al., 2017; Xu et al., 2017; Yu et al., 2018b,a; Herzig and Berant, 2018) is to map language utterances into executable programs in a new environment, or database (DB). The key difficulty in this setup is that the parser must map new lexical items to DB constants that weren't observed at training time.

Existing semantic parsers handle this mostly through a local similarity function between words and DB constants, which considers each word and DB constant in isolation. This function is combined with an auto-regressive decoder, where the decoder chooses the DB constant that is most similar to the words it is currently attending to. Thus, selecting DB constants is done one at a time rather than as a set, and informative global considerations are ignored. Consider the example in Figure 1, where a question is mapped to a SQL query over a complex DB. After decoding SELECT, the decoder must now choose a DB constant. Assuming its attention is focused on the word 'name' (highlighted), and given local similarities only, the choice between the lexically related DB constants (singer.name and song.name) is ambiguous. However, if we globally reason over the DB constants and the question, we can combine additional cues. First, a subsequent word 'nation' is similar to the DB column country, which belongs to the table singer, so selecting the column singer.name from the same table is more likely. Second, the next appearance of the word 'name' is next to the phrase 'Hey', which appears as a value in one of the cells of the column song.name. Assuming a one-to-one mapping between words and DB constants, singer.name is again preferred.

Figure 1: An example where choosing a DB constant based on local similarities is difficult, but the ambiguity can be resolved through global reasoning (see text).

In this paper, we propose a semantic parser that reasons over the DB structure and the question to make a global decision about which DB constants should be used in a query. We extend the parser of Bogin et al. (2019), which learns a representation for the DB schema at parsing time. First, we perform message-passing through a graph neural network representation of the DB schema to softly select the set of DB constants that are likely to appear in the output query. Second, we train a model that takes the top-K queries output by the auto-regressive model and re-ranks them based on a global match between the DB and the question. Both of these technical contributions can be applied to any zero-shot semantic parser.

We test our parser on SPIDER, a zero-shot semantic parsing dataset with complex DBs. We show that both our contributions improve performance, leading to an accuracy of 47.4%, well beyond the current state-of-the-art of 39.4%.

Our code is available at https://github.com/benbogin/spider-schema-gnn-global.

2 Schema-Augmented Semantic Parser

Problem Setup We are given a training set $\{(x^{(k)}, y^{(k)}, S^{(k)})\}_{k=1}^{N}$, where $x^{(k)}$ is a question, $y^{(k)}$ is its translation to a SQL query, and $S^{(k)}$ is the schema of the corresponding DB. We train a model to map question-schema pairs $(x, S)$ to the correct SQL query. Importantly, at test time the parser is given a schema $S$ that was not seen at training time.

A DB schema $S$ includes: (a) a set of DB tables, (b) a set of columns for each table, and (c) a set of foreign key-primary key column pairs, where each pair is a relation from a foreign key in one table to a primary key in another. Schema tables and columns are termed DB constants, denoted by $V$.
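For concreteness, a DB schema can be represented with a small data structure like the following sketch; the field names and layout are illustrative rather than taken from the paper or its code:

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """Illustrative container for a DB schema S (names are not from the paper)."""
    tables: list[str]                                   # (a) DB tables
    columns: dict[str, list[str]]                       # (b) table -> its columns
    foreign_keys: list[tuple[str, str]] = field(default_factory=list)  # (c) (foreign-key col, primary-key col)

    @property
    def db_constants(self) -> list[str]:
        """The DB constants V: all tables plus all table-qualified columns."""
        return self.tables + [f"{t}.{c}" for t, cols in self.columns.items() for c in cols]
```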

We now describe a recent semantic parser from Bogin et al. (2019), focusing on the components relevant for selecting DB constants.

Base Model The base parser is a standard top-down semantic parser with grammar-based decoding (Xiao et al., 2016; Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017; Lin et al., 2019). The input question $(x_1, \dots, x_{|x|})$ is encoded with a BiLSTM, where the hidden states $e_i$ of the BiLSTM are used as contextualized representations for the word $x_i$. The output query $y$ is decoded top-down with another LSTM using a SQL grammar, where at each time step a grammar rule is decoded. Our main focus is the decoding of DB constants, and we elaborate on this part below.
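The grammar-based decoder is involved; as a small sketch of just the encoder side, the contextualized representations $e_i$ can be produced with a standard bidirectional LSTM (the dimensions below are arbitrary, not the paper's):

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN_DIM, VOCAB_SIZE = 100, 200, 10_000     # illustrative sizes

embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
bilstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)

def encode_question(token_ids: torch.Tensor) -> torch.Tensor:
    """Map a question (1 x |x| token ids) to contextualized word
    representations e_i of shape 1 x |x| x 2*HIDDEN_DIM."""
    outputs, _ = bilstm(embed(token_ids))
    return outputs
```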

The parser decodes a DB constant whenever the previous step decoded the non-terminals Table or Column. To select the DB constant, it first computes an attention distribution over the question words $\{\alpha_i\}_{i=1}^{|x|}$ in the standard manner (Bahdanau et al., 2015). Then the score for a DB constant $v$ is $s_v = \sum_i \alpha_i \cdot s_{\text{link}}(v, x_i)$, where $s_{\text{link}}$ is a local similarity score, computed from learned embeddings of the word and DB constant, and a few manually-crafted features, such as the edit distance between the two inputs and the fraction of string overlap between them. The output distribution of the decoder is simply $\text{softmax}(\{s_v\}_{v \in V})$. Importantly, the dependence between decoding decisions for DB constants is weak: the similarity function is independent for each constant and question word, and decisions are far apart in the decoding sequence, especially in a top-down parser.
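A minimal sketch of this local scoring step, assuming the linking scores are already given as a matrix; how $s_{\text{link}}$ combines embeddings and hand-crafted features is only hinted at in the text, so the helper below takes it as an input rather than computing it:

```python
import torch
import torch.nn.functional as F

def db_constant_distribution(alpha: torch.Tensor, s_link: torch.Tensor) -> torch.Tensor:
    """
    alpha:  |x| attention weights over question words.
    s_link: |V| x |x| local similarity scores s_link(v, x_i), combining learned
            embeddings with features such as edit distance and string overlap.
    Returns softmax({s_v}) where s_v = sum_i alpha_i * s_link(v, x_i).
    """
    s = s_link @ alpha          # |V| scores, one per DB constant
    return F.softmax(s, dim=0)
```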

DB schema encoding In the zero-shot setting, the schema structure of a new DB can affect the output query. To capture DB structure, Bogin et al. (2019) learned a representation $h_v$ for every DB constant, which the parser later used at decoding time. This was done by converting the DB schema into a graph, where nodes are DB constants, and edges connect tables and their columns, as well as primary and foreign keys (Figure 2). To focus the GCN's capacity on important nodes, a relevance probability $\rho_v$ was computed for every node and used to "gate" the input to the GCN, conditioned on the question. Specifically, given a learned embedding $r_v$ for every DB constant, the GCN input is $h_v^{(0)} = \rho_v \cdot r_v$. Then, the GCN recurrence is applied for $L$ steps. At each step, nodes re-compute their representation based on the representations of their neighbors, where different edge types are associated with different learned parameters (Li et al., 2016). The final representation of each DB constant is $h_v = h_v^{(L)}$.

Figure 2: High-level overview, where our contributions are in thick orange boxes. First, a relevance score is predicted for each of the DB constants using the gating GCN. Then, a learned representation is computed for each DB constant using the encoder GCN, which is then used by the decoder to predict K candidate queries. Finally, the re-ranking GCN scores each of these candidates, basing its score only on the selected DB constants. The dashed line and arrow indicate that no gradients are propagated from the re-ranking GCN to the decoder, as the decoder outputs SQL queries. Names of loss terms are written below models that are trained with a loss on their output.
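A simplified sketch of the relevance-gated encoder GCN described above: the input to message passing is $\rho_v \cdot r_v$, and each of the $L$ steps aggregates neighbor representations with edge-type-specific weights. The update rule below (a sum of typed messages plus a residual, followed by ReLU) is a simplification of the gated recurrence of Li et al. (2016), and all shapes are illustrative:

```python
import torch
import torch.nn as nn

class SchemaGCN(nn.Module):
    """Relevance-gated message passing over the schema graph (simplified sketch)."""
    def __init__(self, dim: int, num_edge_types: int, num_layers: int):
        super().__init__()
        self.num_layers = num_layers
        self.edge_transforms = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_edge_types)]
        )

    def forward(self, r: torch.Tensor, rho: torch.Tensor, adjacencies: list) -> torch.Tensor:
        """
        r:           |V| x dim learned DB-constant embeddings.
        rho:         |V| relevance probabilities used to gate the input.
        adjacencies: list of (edge_type_index, |V| x |V| 0/1 adjacency matrix).
        Returns h_v = h_v^(L) for every DB constant.
        """
        h = rho.unsqueeze(-1) * r                        # h_v^(0) = rho_v * r_v
        for _ in range(self.num_layers):
            messages = torch.zeros_like(h)
            for edge_type, adj in adjacencies:
                messages = messages + adj @ self.edge_transforms[edge_type](h)
            h = torch.relu(h + messages)                 # residual instead of a GRU update
        return h
```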

Importantly, the relevance probability $\rho_v$, which can be viewed as a soft selection for whether the DB constant should appear in the output, was computed based on local information only: first, a distribution $p_{\text{link}}(v \mid x_i) \propto \exp(s_{\text{link}}(v, x_i))$ was defined, and then $\rho_v = \max_i p_{\text{link}}(v \mid x_i)$ was computed deterministically. Thus, $\rho_v$ does not consider the full question or DB structure. We address this next.

3 Global Reasoning over DB Structures

Figure 2 gives a high-level view of our model, where the contributions of this paper are marked by thick orange boxes. First, the aforementioned relevance probabilities are estimated with a learned gating GCN, allowing global structure to be taken into account. Second, the model discriminatively re-ranks the top-K queries output by the generative decoder.

Global gating Bogin et al. (2019) showed that an oracle relevance probability can increase model performance, but computed $\rho_v$ from local information only.

We propose to train a GCN to directly predict $\rho_v$ from the global context of the question and DB.

The input to the gating GCN is the same graph described in §2, except we add a new node $v_{\text{global}}$, connected to all other nodes with a special edge type. To predict the question-conditioned relevance of a node, we need a representation for both the DB constant and the question. Thus, we define the input to the GCN at node $v$ to be $g_v^{(0)} = FF([r_v; \bar{h}_v; \rho_v])$, where ';' is concatenation, $FF(\cdot)$ is a feed-forward network, and $\bar{h}_v = \sum_i p_{\text{link}}(x_i \mid v) \cdot e_i$ is a weighted average of contextual representations of question tokens. The initial embedding of $v_{\text{global}}$ is randomly initialized. A relevance probability is computed per DB constant based on the final graph representation: $\rho_v^{\text{global}} = \sigma(FF(g_v^{(L)}))$. This probability replaces $\rho_v$ at the input to the encoder GCN (Figure 2).
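A sketch of the gating GCN's interface; the graph propagation itself (over the schema graph plus the extra $v_{\text{global}}$ node) is abstracted behind a generic `graph_encoder` callable, and all layer shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatingGCN(nn.Module):
    """Predicts a global relevance probability rho_v^global per DB constant (sketch)."""
    def __init__(self, dim: int, graph_encoder):
        super().__init__()
        # graph_encoder: any callable mapping |V| x dim node inputs and the graph
        # structure to |V| x dim node outputs; the randomly initialized v_global
        # embedding is assumed to be handled inside it.
        self.graph_encoder = graph_encoder
        self.input_ff = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.ReLU())
        self.output_ff = nn.Linear(dim, 1)

    def forward(self, r, h_bar, rho, graph) -> torch.Tensor:
        """
        r:     |V| x dim  DB-constant embeddings.
        h_bar: |V| x dim  link-weighted averages of question word representations
                          (bar-h_v = sum_i p_link(x_i | v) * e_i, projected to dim).
        rho:   |V|        local relevance probabilities.
        """
        g0 = self.input_ff(torch.cat([r, h_bar, rho.unsqueeze(-1)], dim=-1))  # g_v^(0)
        gL = self.graph_encoder(g0, graph)                                     # g_v^(L)
        return torch.sigmoid(self.output_ff(gL)).squeeze(-1)                   # rho_v^global
```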

Because we have the gold query $y$ for each question, we can extract the gold subset of DB constants $U_y$, i.e., all DB constants that appear in $y$. We can now add a relevance loss term $-\sum_{v \in U_y} \log \rho_v^{\text{global}} - \sum_{v \notin U_y} \log(1 - \rho_v^{\text{global}})$ to the objective (a short sketch is given below). Thus, the parameters of the gating GCN are trained from both the relevance loss and the usual decoding loss, a maximum-likelihood objective over the gold sequence of decisions that output the query $y$.

Discriminative re-ranking Global gating provides a more accurate model for softly predicting the correct subset of DB constants. However, parsing is still auto-regressive and performed with a local similarity function. To overcome this, we separately train a discriminative model (Collins and Koo, 2005; Ge and Mooney, 2006; Lu et al., 2008; Fried et al., 2017) to re-rank the top-K queries in the decoder's output beam. The re-ranker scores each candidate tuple $(x, S, \hat{y})$, and thus can globally reason over the entire candidate query $\hat{y}$.
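For completeness, the relevance loss above is a summed per-node binary cross-entropy against the gold set $U_y$; a minimal sketch with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def relevance_loss(rho_global: torch.Tensor, gold_mask: torch.Tensor) -> torch.Tensor:
    """
    rho_global: |V| predicted probabilities rho_v^global.
    gold_mask:  |V| tensor with 1.0 iff the DB constant appears in the gold query y.
    Equals -sum_{v in U_y} log rho_v^global - sum_{v not in U_y} log(1 - rho_v^global).
    """
    return F.binary_cross_entropy(rho_global, gold_mask, reduction="sum")
```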

We focus the re-ranker capacity on the main pain point of zero-shot parsing: the set of DB constants $U_{\hat{y}}$ that appear in $\hat{y}$. At a high level (Figure 3), for each candidate we compute a logit $s_{\hat{y}} = w^\top FF(f_{U_{\hat{y}}}, e^{\text{align}})$, where $w$ is a learned parameter vector, $f_{U_{\hat{y}}}$ is a representation of the set $U_{\hat{y}}$, and $e^{\text{align}}$ is a representation of the global alignment between question words and DB constants. The re-ranker is trained to minimize the re-ranker loss, the negative log-probability of the correct query $y$. We now describe the computation of $f_{U_{\hat{y}}}$ and $e^{\text{align}}$, based on a re-ranking GCN.

Figure 3: The re-ranking GCN architecture (see text).

Unlike the gating GCN, the re-ranking GCN takes as input only the sub-graph induced by the selected DB constants $U_{\hat{y}}$, together with the global node $v_{\text{global}}$. The input is represented by $f_v^{(0)} = FF([r_v; \bar{h}_v])$, and after $L$ propagation steps we obtain $f_{U_{\hat{y}}} = f_{v_{\text{global}}}^{(L)}$. Note that the global node representation is used to describe and score the question-conditioned sub-graph, unlike in the gating GCN, where the global node mostly created shorter paths between other graph nodes.

The representation $f_{U_{\hat{y}}}$ captures global properties of the selected nodes but ignores nodes that were not selected and are possibly relevant. Thus, we compute a representation $e^{\text{align}}$, which captures whether question words are aligned to selected DB constants. We define a representation for every node $v \in V$:

$$\varphi_v = \begin{cases} f_v^{(L)} & \text{if } v \in U_{\hat{y}} \\ r_v & \text{otherwise.} \end{cases}$$

Now, we compute for every question word $x_i$ a representation of the DB constants it aligns to: $l_i^{\varphi} = \sum_{v \in V} p_{\text{link}}(v \mid x_i) \cdot \varphi_v$. We concatenate this representation to every word, $e_i^{\text{align}} = [e_i; l_i^{\varphi}]$, and compute the vector $e^{\text{align}}$ using attention over the question words, where the attention score for every word is $e_i^{\text{align}} \cdot w_{\text{att}}$ for a learned vector $w_{\text{att}}$. The goal of this term is to allow the model to recognize attended question words that are aligned with DB constants which were not selected in $U_{\hat{y}}$.
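Putting the pieces of the re-ranker score together, here is a simplified sketch. The re-ranking GCN itself is elided (its final node representations $f_v^{(L)}$ and the global-node summary are taken as inputs), and all shapes and names are illustrative:

```python
import torch
import torch.nn as nn

class RerankerScorer(nn.Module):
    """Computes the candidate logit from the selected DB constants and the question (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_att = nn.Parameter(torch.randn(2 * dim))            # scores e_i^align
        self.ff = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.w = nn.Parameter(torch.randn(dim))                    # final logit weights

    def forward(self, f_global, f_L, r, p_link, e, selected) -> torch.Tensor:
        """
        f_global: dim        global-node output of the re-ranking GCN (= f_{U_y-hat}).
        f_L:      |V| x dim  re-ranking GCN outputs f_v^(L) (meaningful for selected v).
        r:        |V| x dim  raw DB-constant embeddings, used for unselected nodes.
        p_link:   |V| x |x|  linking distribution p_link(v | x_i).
        e:        |x| x dim  contextualized question word representations.
        selected: |V| bool   membership mask for U_y-hat.
        """
        phi = torch.where(selected.unsqueeze(-1), f_L, r)           # phi_v
        l_phi = p_link.t() @ phi                                    # |x| x dim, l_i^phi
        e_align_i = torch.cat([e, l_phi], dim=-1)                   # [e_i; l_i^phi]
        attn = torch.softmax(e_align_i @ self.w_att, dim=0)         # attention over words
        e_align = attn @ e_align_i                                  # question summary
        return self.ff(torch.cat([f_global, e_align])) @ self.w     # candidate logit
```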

In sum, our model adds a gating GCN trained to softly select relevant nodes for the encoder, and a re-ranking GCN that globally reasons over the subset of selected DB constants and captures whether the query properly covers the question words.

4 Experiments And Results

Experimental setup We train and evaluate on SPIDER (Yu et al., 2018b), which contains 7,000/1,034/2,147 train/development/test examples, using the same pre-processing as Bogin et al. (2019). To train the re-ranker, we take K = 40 candidates from the beam output of the decoder. At each training step, if the gold query is in the beam, we calculate the loss on the gold query and 10 randomly selected negative candidates. At test time, we re-rank the best K = 10 candidates in the beam, and break re-ranking ties using the auto-regressive decoder scores (ties happen since the re-ranker considers only the DB constants and not the entire query). We use the official SPIDER script for evaluation, which tests for loose exact match of queries.
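A small sketch of the candidate-selection logic just described; the data structures and helper names are illustrative:

```python
import random

def reranker_training_candidates(beam, gold_query, num_negatives=10):
    """beam: the K = 40 candidate queries from the decoder at training time.
    If the gold query is in the beam, return it with randomly sampled negatives;
    otherwise no re-ranker loss is computed for this example."""
    if gold_query not in beam:
        return None
    negatives = [q for q in beam if q != gold_query]
    return gold_query, random.sample(negatives, min(num_negatives, len(negatives)))

def rerank_at_test(beam, reranker_score, decoder_score, k=10):
    """Pick the best of the top-k candidates; decoder scores break re-ranker ties,
    since the re-ranker only looks at the selected DB constants, not the full query."""
    return max(beam[:k], key=lambda q: (reranker_score(q), decoder_score(q)))
```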

Results As shown in Table 1, the accuracy of our proposed model (GLOBAL-GNN) on the hidden test set is 47.4%, 8 points higher than the current state-of-the-art of 39.4%. Table 2 shows accuracy results on the development set for different experiments. We perform minor modifications to the implementation of Bogin et al. (2019), improving the accuracy from 40.7% to 44.1% (details in Appendix A). We follow Bogin et al. (2019), measuring accuracy on easier examples where queries use a single table (SINGLE) and on those using more than one table (MULTI). GLOBAL-GNN obtains 52.1% accuracy on the development set, substantially higher than all previous scores. Importantly, the performance increase comes mostly from queries that require more than one table, which are usually more complex.

Table 1: Test set accuracy of GLOBAL-GNN compared to prior work on SPIDER.
Table 2: Development set accuracy for various experiments. The column ‘Beam’ indicates the fraction of examples where the gold query is in the beam (K = 10).

Removing either of our two main contributions (NO GLOBAL GATING, NO RE-RANKING) leads to a 4% drop in performance. Training without the relevance loss (NO RELEVANCE LOSS) results in a 2% accuracy drop. Omitting the representation $e^{\text{align}}$ from the re-ranker (NO ALIGN REP.) reduces performance, showing the importance of identifying unaligned question words.

We also consider a model that ranks the entire query and not only the set of DB constants. We re-define $s_{\hat{y}} = w^\top FF(f_{U_{\hat{y}}}, e^{\text{align}}, h^{\text{query}})$, where $h^{\text{query}}$ is a concatenation of the last and first hidden states of a BiLSTM run over the output SQL query (QUERY RE-RANKER). We observe that performance is lower, and most introduced errors are minor mistakes such as min instead of max. This shows that our re-ranker excels at choosing DB constants, while the decoder is better at determining the SQL query structure and the SQL logical constants.

Finally, we compute two oracle scores to estimate future headroom. A perfect global gating, which gives probability 1.0 iff the DB constant is in the gold query, increases accuracy to 63.2%. Adding to that a perfect re-ranker leads to an accuracy of 73.5%.

Qualitative analysis Analyzing the development set, we find two main recurring patterns where the baseline model is wrong but our parser is correct: (a) coverage: relevant question words are not covered by the query, resulting in a missing join of tables or selection of columns; (b) precision: unrelated tables are joined into the query due to high lexical similarity. Selected examples are in Appendix B.

Error analysis In 44.4% of the errors where the correct query was in the beam, the selected set of DB constants was correct but the query was wrong. Most of these errors are caused by minor local mistakes, e.g., min instead of max, while the rest are due to larger structural mistakes, indicating that a global model that jointly selects both DB constants and SQL tokens might further improve performance. Other types of errors include missing or extra columns and tables, especially in complex queries.

5 Conclusion

In this paper, we demonstrate the importance of global decision-making for zero-shot semantic parsing, where selecting the relevant set of DB constants is challenging. We present two main technical contributions. First, we use a gating GCN that globally attends over the input question and the entire DB schema to softly select the relevant DB constants. Second, we re-rank the output of a generative semantic parser by globally scoring the set of selected DB constants. Importantly, these contributions can be applied to any zero-shot semantic parser with minimal modifications. Empirically, we observe a substantial improvement over the state-of-the-art on the SPIDER dataset, showing the effectiveness of both contributions.

A Re-Implementation

We perform a simple modification to the parser of Bogin et al. (2019): we add cell values to the graph, in a similar fashion to Krishnamurthy et al. (2017). Specifically, during the pre-processing phase we extract the cells of the first 5,000 rows of all tables in the schema. We then consider every cell q of a column c that has a partial match with any of the question words $(x_1, \dots, x_{|x|})$, and add nodes representing these cells to all of the model's graphs, with extra edges (c, q) and (q, c).
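A sketch of this pre-processing step; the partial-match test below (case-insensitive substring containment in either direction) is an assumption, as the text only says "partial match":

```python
def matching_cell_nodes(columns_to_cells, question_tokens, max_rows=5000):
    """
    columns_to_cells: dict mapping a column identifier c to its cell values
                      (taken from the first `max_rows` rows of the table).
    Returns the cell nodes to add to the model's graphs, together with the
    extra (c, q) and (q, c) edges.
    """
    words = [w.lower() for w in question_tokens]
    nodes, edges = [], []
    for column, cells in columns_to_cells.items():
        for cell in cells[:max_rows]:
            value = str(cell).lower()
            if any(w in value or value in w for w in words):
                nodes.append(cell)
                edges.extend([(column, cell), (cell, column)])
    return nodes, edges
```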

B Selected Examples

Selected examples are given in Table 3 .

Table 3: Selected correct examples where the baseline model is wrong, but our parser is correct.