
Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing


Abstract

Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In SPIDER, a recently released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser's accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.

1 Introduction

Semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005) has recently seen increased interest in parsing questions into SQL queries, owing to the popularity of SQL as a query language for relational databases (DBs).

Work on parsing to SQL (Zhong et al., 2017; Iyer et al., 2017; Finegan-Dollak et al., 2018; Yu et al., 2018a) has either involved simple DBs that contain just one table, or used a single DB that is observed at both training and test time. Consequently, modeling the schema structure has received little attention. Recently, Yu et al. (2018b) presented SPIDER, a text-to-SQL dataset where at test time questions are executed against unseen and complex DBs. In this zero-shot setup, an informative representation of the schema structure is important. Consider the questions in Figure 1: while their language structure is similar, the first query requires a 'join' operation because the information is distributed across three tables, while the second needs no 'join'.

Figure 1: Examples from SPIDER showing how similar questions can have different SQL queries, conditioned on the schema. Table names are underlined.

x: Find the age of students who do not have a cat pet.
y: SELECT age FROM student WHERE student NOT IN (SELECT ... FROM student JOIN has_pet ... JOIN pets ... WHERE ...)

x: What are the names of teams that do not have match season record?
y: SELECT name FROM team WHERE team_id NOT IN (SELECT team FROM match_season)

In this work, we propose a semantic parser that makes strong use of the schema structure. We represent the structure of the schema as a graph, and use graph neural networks (GNNs) to provide a global representation for each node (Li et al., 2016; De Cao et al., 2019; Sorokin and Gurevych, 2018). We incorporate our schema representation into the encoder-decoder parser of Krishnamurthy et al. (2017), which was designed to parse questions into queries against unseen semi-structured tables. At encoding time, we enrich each question word with a representation of the subgraph it is related to, and at decoding time, we emit symbols from the schema that are related through the graph to previously decoded symbols.

We evaluate our parser on SPIDER, and show that encoding the schema structure improves accuracy from 33.8% to 39.4% (and from 14.6% to 26.8% on questions that involve multiple tables), well beyond the current state of the art of 19.7%. We make our code publicly available at https://github.com/benbogin/spider-schema-gnn.

2 Problem Setup

We are given a training set $\{(x^{(k)}, y^{(k)}, S^{(k)})\}_{k=1}^{N}$, where $x^{(k)}$ is a natural language question, $y^{(k)}$ is its translation to a SQL query, and $S^{(k)}$ is the schema of the DB against which $y^{(k)}$ is executed. Our goal is to learn a function that maps an unseen question-schema pair $(x, S)$ to its correct SQL query. Importantly, the schema $S$ was not seen at training time, that is, $S \neq S^{(k)}$ for all $k$.

Figure 2: The decoder we base our work on (Krishnamurthy et al., 2017). The input to the LSTM ($g_j$) at step $j$ is a learned embedding of the last decoded grammar rule, except when the last rule is schema-specific ($g_3$), where the input is a learned embedding of the schema item type. A grammar rule is selected based on the LSTM output ($o_j$) and the attended hidden state of the input LSTM ($c_j$).

A DB schema $S$ includes: (a) the set of DB tables $\mathcal{T}$ (e.g., singer), (b) a set of columns $\mathcal{C}_t$ for each $t \in \mathcal{T}$ (e.g., singer name), and (c) a set of foreign key-primary key column pairs $\mathcal{F}$, where each $(c_f, c_p) \in \mathcal{F}$ is a relation from a foreign key $c_f$ in one table to a primary key $c_p$ in another. We refer to all schema tables and columns as schema items and denote them by $\mathcal{V} = \mathcal{T} \cup \{\mathcal{C}_t\}_{t \in \mathcal{T}}$.
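To make the setup concrete, here is a minimal Python sketch of a schema as a data structure; the Column and Schema classes and the toy pets fragment are our own illustration, not SPIDER's actual format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    table: str
    name: str

@dataclass
class Schema:
    """A DB schema S = (tables T, columns {C_t}, foreign-primary key pairs F)."""
    tables: set         # table names T
    columns: dict       # table name -> list of its Columns (C_t)
    foreign_keys: list  # (foreign key column, primary key column) pairs F

    def items(self):
        """All schema items V = T ∪ {C_t}_{t∈T}."""
        return list(self.tables) + [c for cs in self.columns.values() for c in cs]

# A fragment of the pets schema from Figure 1 (column names are guesses).
schema = Schema(
    tables={"student", "has_pet"},
    columns={"student": [Column("student", "id"), Column("student", "age")],
             "has_pet": [Column("has_pet", "student_id"), Column("has_pet", "pet_id")]},
    foreign_keys=[(Column("has_pet", "student_id"), Column("student", "id"))],
)
print(len(schema.items()))  # 6 schema items: 2 tables + 4 columns
```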

3 A Neural Semantic Parser for SQL

We base our model on the parser of Krishnamurthy et al. (2017), along with a grammar for SQL provided by AllenNLP (Gardner et al., 2018; Lin et al., 2019), which covers 98.3% of the examples in SPIDER. This parser uses a linking mechanism for handling unobserved DB constants at test time. We review this model in the context of text-to-SQL parsing, focusing on components we expand upon in §4.

Linking schema items To handle unseen schema items, Krishnamurthy et al. (2017) learn a similarity score $s_{\text{link}}(v, x_i)$ between a word $x_i$ and a schema item $v$ of type $\tau$.¹ The score is based on learned word embeddings and a few manually-crafted features.

¹ Types are tables, string columns, number columns, etc.

The linking score is used to compute

$$p_{\text{link}}(v \mid x_i) = \frac{\exp(s_{\text{link}}(v, x_i))}{\sum_{v' \in \mathcal{V}_\tau \cup \{\varnothing\}} \exp(s_{\text{link}}(v', x_i))},$$

where $\mathcal{V}_\tau$ are all schema items of type $\tau$ and $s_{\text{link}}(\varnothing, \cdot) = 0$ for words that do not link to any schema item. The functions $p_{\text{link}}(\cdot)$ and $s_{\text{link}}(\cdot)$ will be used to decode unseen schema items.
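The following PyTorch sketch shows one way to realize this distribution, appending a fixed zero score for the null item $\varnothing$ before the softmax; the function name and tensor layout are our own assumptions:

```python
import torch

def linking_distribution(s_link: torch.Tensor) -> torch.Tensor:
    """Compute p_link(v | x_i) for all words and schema items of one type.

    s_link: [num_words, num_items] linking scores s_link(v, x_i).
    A zero score is appended for the null item, so the softmax is
    taken over V_tau ∪ {∅}; the null column is dropped on return.
    """
    null = torch.zeros(s_link.size(0), 1)                  # s_link(∅, ·) = 0
    probs = torch.softmax(torch.cat([s_link, null], dim=-1), dim=-1)
    return probs[:, :-1]

# Toy check: 3 question words, 2 schema items of some type tau.
scores = torch.tensor([[2.0, -1.0], [0.0, 0.0], [-3.0, 4.0]])
print(linking_distribution(scores))  # each row sums to less than 1
```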

Encoder A bidirectional LSTM (Hochreiter and Schmidhuber, 1997) provides a contextualized representation $h_i$ for each question word $x_i$. Importantly, the encoder input at time step $i$ is $[w_{x_i}; l_i]$: the concatenation of the word embedding for $x_i$ and

$$l_i = \sum_\tau \sum_{v \in \mathcal{V}_\tau} p_{\text{link}}(v \mid x_i) \cdot r_v,$$

where $r_v$ is a learned embedding for the schema item $v$, based on the type of $v$ and its schema neighbors. Thus, $p_{\text{link}}(v \mid x_i)$ augments every word $x_i$ with information on the schema items it should link to.
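Since $l_i$ is a probability-weighted sum of embeddings, it reduces to a matrix product once the linking probabilities for all types are stacked; a minimal sketch under that simplification (names are ours):

```python
import torch

def link_embedding(p_link: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """l_i = sum_v p_link(v | x_i) * r_v for every word, as one matmul.

    p_link: [num_words, num_items] linking probabilities (all types stacked).
    r:      [num_items, dim] learned schema-item embeddings r_v.
    """
    return p_link @ r

num_words, num_items, dim = 5, 7, 16
l = link_embedding(torch.rand(num_words, num_items).softmax(-1),
                   torch.randn(num_items, dim))
word_emb = torch.randn(num_words, dim)
encoder_input = torch.cat([word_emb, l], dim=-1)  # [w_{x_i}; l_i] per word
print(encoder_input.shape)                        # torch.Size([5, 32])
```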

Decoder We use a grammar-based (Xiao et al., 2016; Cheng et al., 2017; Yin and Neubig, 2017; Rabinovich et al., 2017) LSTM decoder with attention on the input question (Figure 2). At each decoding step, a non-terminal of type $\tau$ is expanded using one of the grammar rules. Rules are either schema-independent, generating non-terminals or SQL keywords, or schema-specific, generating schema items.

At each decoding step $j$, the decoding LSTM takes a vector $g_j$ as input, which is an embedding of the grammar rule decoded in the previous step, and outputs a vector $o_j$. If this rule is schema-independent, $g_j$ is a learned global embedding. If it is schema-specific, i.e., a schema item $v$ was generated, $g_j$ is a learned embedding $\tau(v)$ of its type. An attention distribution $a_j$ over the input words is computed in a standard manner (Bahdanau et al., 2015), where the attention score for every word is $h_i^\top o_j$. It is then used to compute a weighted average of the input, $c_j = \sum_i a_{j,i} h_i$. A distribution over grammar rules is then computed by:

$$s^{\text{glob}}_j = \text{FF}([o_j; c_j]) \in \mathbb{R}^{G_{\text{legal}}}, \quad s^{\text{loc}}_j = S_{\text{link}} a_j \in \mathbb{R}^{V_{\text{legal}}}, \quad p_j = \text{softmax}([s^{\text{glob}}_j; s^{\text{loc}}_j]),$$

where $G_{\text{legal}}$ and $V_{\text{legal}}$ are the numbers of legal rules (according to the grammar) that can be chosen at time step $j$, for schema-independent and schema-specific rules respectively. The score $s^{\text{glob}}_j$ is computed with a feed-forward network, and the score $s^{\text{loc}}_j$ is computed for all legal schema items by multiplying the matrix $S_{\text{link}} \in \mathbb{R}^{V_{\text{legal}} \times |x|}$, which contains the relevant linking scores $s_{\text{link}}(v, x_i)$, with the attention vector $a_j$. Thus, decoding unseen schema items is done by first attending to the question words, which are linked to the schema items.
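A sketch of this rule distribution for a single decoding step, under assumed shapes and with a single feed-forward scorer (our own simplification):

```python
import torch

def rule_distribution(o_j, c_j, a_j, S_link, ff):
    """p_j = softmax([s_glob; s_loc]) over the legal rules at step j.

    o_j, c_j: decoder output and attended encoder state, each [dim].
    a_j:      attention over the question words, [num_words].
    S_link:   [V_legal, num_words] linking scores for legal schema items.
    ff:       feed-forward net scoring the G_legal schema-independent rules.
    """
    s_glob = ff(torch.cat([o_j, c_j]))  # [G_legal]
    s_loc = S_link @ a_j                # [V_legal]
    return torch.softmax(torch.cat([s_glob, s_loc]), dim=-1)

dim, num_words, G_legal, V_legal = 8, 5, 4, 3
ff = torch.nn.Linear(2 * dim, G_legal)
p_j = rule_distribution(torch.randn(dim), torch.randn(dim),
                        torch.softmax(torch.randn(num_words), -1),
                        torch.randn(V_legal, num_words), ff)
print(p_j.sum())  # ≈ 1.0
```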

4 Modeling Schemas with GNNs

Schema structure is informative for predicting the SQL query. Consider a table with two columns, each of which is a foreign key to a different table (the student_semester table in Figure 3). Such a table commonly describes a many-to-many relation between two other tables, which affects the output query. We now show how we represent this information in a neural parser and use it to improve predictions.

Figure 3: Left: DB schema and question. Middle: A graph representation of the schema. Bold nodes are tables; other nodes are columns. Dashed red (blue) edges are foreign (primary) key edges; green edges are table-column edges. Right: Use of the schema by the decoder. For clarity, the decoder outputs tokens rather than grammar rules.

At a high level, our model has the following parts (Figure 3): (a) the schema is converted to a graph; (b) the graph is softly pruned conditioned on the input question; (c) a graph neural network generates a representation for each node that is aware of the global schema structure; (d) the encoder and decoder use this schema representation. We now elaborate on each part.

Schema-to-graph

To convert the schema $S$ to a graph (Figure 3, left), we define the graph nodes as the schema items $\mathcal{V}$. We add three types of edges: for each column $c_t$ in a table $t$, we add the edges $(c_t, t)$ and $(t, c_t)$ to the edge set $E_{\leftrightarrow}$ (green edges). For each foreign-primary key column pair $(c_{t_1}, c_{t_2}) \in \mathcal{F}$, we add the edges $(c_{t_1}, c_{t_2})$ and $(t_1, t_2)$ to the edge set $E_{\rightarrow}$, and the edges $(c_{t_2}, c_{t_1})$ and $(t_2, t_1)$ to $E_{\leftarrow}$ (dashed edges). Edge types are used by the graph neural network to capture the different ways in which columns and tables relate to one another.
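A small standalone sketch of this graph construction, with nodes encoded as strings of the form table or table.column (our own encoding, not the paper's implementation):

```python
def schema_to_graph(tables, columns, foreign_keys):
    """Build the three typed edge sets of §4 from a schema.

    tables:       iterable of table names.
    columns:      dict mapping table name -> list of "table.column" strings.
    foreign_keys: list of (foreign column, primary column) pairs.
    """
    edges = {"<->": [], "->": [], "<-": []}
    for t in tables:
        for c in columns[t]:
            edges["<->"] += [(c, t), (t, c)]  # table-column edges (green)
    table_of = lambda col: col.split(".")[0]
    for fk, pk in foreign_keys:               # foreign-primary key edges (dashed)
        edges["->"] += [(fk, pk), (table_of(fk), table_of(pk))]
        edges["<-"] += [(pk, fk), (table_of(pk), table_of(fk))]
    return edges

# Toy schema: has_pet.student_id is a foreign key into student.id.
edges = schema_to_graph(
    tables=["student", "has_pet"],
    columns={"student": ["student.id", "student.age"],
             "has_pet": ["has_pet.student_id", "has_pet.pet_id"]},
    foreign_keys=[("has_pet.student_id", "student.id")],
)
print(len(edges["<->"]), edges["->"])
```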

Question-conditioned relevance Each question refers to different parts of the schema, and thus our representation should change conditioned on the question. For example, in Figure 3, the relation between the tables student_semester and program is irrelevant. To model this, we re-use the distribution $p_{\text{link}}(\cdot)$ from §3 and define a relevance score for a schema item $v$:

$$\rho_v = \max_i \, p_{\text{link}}(v \mid x_i),$$

the maximum probability of $v$ for any word $x_i$. We use this score next to create a question-conditioned graph representation.

Neural graph representation To learn a node representation that considers its relevance score and the global schema structure, we use gated GNNs (Li et al., 2016). Each node $v$ is given an initial embedding conditioned on the relevance score:

$$h_v^{(0)} = r_v \cdot \rho_v.$$
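Both steps, taking the per-item maximum over words and scaling each embedding, fit in a few lines; a PyTorch sketch with our own names:

```python
import torch

def initial_node_embeddings(p_link: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """h_v^(0) = r_v * rho_v, with rho_v = max_i p_link(v | x_i).

    p_link: [num_words, num_items] linking probabilities.
    r:      [num_items, dim] schema-item embeddings r_v.
    """
    rho = p_link.max(dim=0).values  # [num_items] relevance score per item
    return r * rho.unsqueeze(-1)    # scale each r_v by its relevance

h0 = initial_node_embeddings(torch.rand(4, 6).softmax(-1), torch.randn(6, 8))
print(h0.shape)  # torch.Size([6, 8])
```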

We then apply the GNN recurrence for L steps. At each step, each node re-computes its representation based on the representation of its neighbors in the previous step:

$$a_v^{(l)} = \sum_{\text{type} \in \{\rightarrow, \leftarrow, \leftrightarrow\}} \; \sum_{(u,v) \in E_{\text{type}}} W_{\text{type}} h_u^{(l-1)} + b_{\text{type}},$$

and then $h_v^{(l)}$ is computed using a standard GRU (Cho et al., 2014) update:

$$h_v^{(l)} = \text{GRU}(h_v^{(l-1)}, a_v^{(l)})$$

(see Li et al. (2016) for further details). We denote the final representation of each schema item after $L$ steps by $\varphi_v = h_v^{(L)}$. We now show how this representation is used by the parser.
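A minimal dense-adjacency sketch of the $L$-step recurrence; in the real model the per-type weights and the GRU are trained end-to-end, whereas here they are freshly initialized for illustration:

```python
import torch

def gated_gnn(h0, adj, L=2):
    """Run L steps of a gated-GNN update over typed edges.

    h0:  [num_nodes, dim] initial node embeddings h_v^(0).
    adj: dict mapping edge type -> [num_nodes, num_nodes] 0/1 matrix,
         with adj[t][u, v] = 1 iff (u, v) is an edge of type t.
    """
    dim = h0.size(-1)
    W = {t: torch.nn.Linear(dim, dim) for t in adj}  # W_type and b_type
    gru = torch.nn.GRUCell(dim, dim)
    h = h0
    for _ in range(L):
        # a_v = sum over types and incoming edges (u, v) of W_type h_u + b_type
        a = sum(adj[t].T @ W[t](h) for t in adj)
        h = gru(a, h)                                # per-node GRU update
    return h                                         # phi_v = h_v^(L)

num_nodes, dim = 6, 8
adj = {t: torch.rand(num_nodes, num_nodes).round() for t in ("->", "<-", "<->")}
phi = gated_gnn(torch.randn(num_nodes, dim), adj, L=2)
print(phi.shape)  # torch.Size([6, 8])
```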

Encoder In §3, a weighted average over schema items, $l_i$, was concatenated to every word $x_i$. To take advantage of the schema-aware representations, we compute $l_i^{\varphi} = \sum_\tau \sum_{v \in \mathcal{V}_\tau} p_{\text{link}}(v \mid x_i) \cdot \varphi_v$, which is identical to $l_i$, except that $\varphi_v$ is used instead of $r_v$. We concatenate $l_i^{\varphi}$ to the output of the encoder $h_i$, so that each word is augmented with the graph structure around the schema items it is linked to.

Decoder As mentioned (§3), when a schema item $v$ is decoded, the input at the next time step is its type $\tau(v)$. Our first change is to replace $\tau(v)$ with $\varphi_v$, which carries knowledge of the structure around $v$. The second change is a self-attention mechanism that links to the schema, which we describe next.

When scoring a schema item, its score should depend on its relation to previously decoded schema items. E.g., in Figure 3 , once the table semester has been decoded, it is likely to be joined to a related table. We capture this intuition with a self-attention mechanism.

For each decoding step $j$, we denote by $u_j$ the hidden state of the decoder, and by $\hat{J} = (i_1, \dots, i_{|\hat{J}|})$ the list of time steps before $j$ where a schema item was decoded. We define the matrix $\hat{U} \in \mathbb{R}^{d \times |\hat{J}|} = [u_{i_1}, \dots, u_{i_{|\hat{J}|}}]$, which concatenates the hidden states from all these time steps. We now compute a self-attention distribution over these time steps, and score schema items based on this distribution (Figure 3, right):

$$\hat{a}_j = \text{softmax}(\hat{U}^\top u_j) \in \mathbb{R}^{|\hat{J}|}, \quad s^{\text{att}}_j = \hat{a}_j S^{\text{att}}, \quad p_j = \text{softmax}([s^{\text{glob}}_j; s^{\text{loc}}_j + s^{\text{att}}_j]),$$

where the matrix $S^{\text{att}} \in \mathbb{R}^{|\hat{J}| \times V_{\text{legal}}}$ computes a similarity between schema items that were previously decoded and schema items that are legal according to the grammar: $S^{\text{att}}_{v_1, v_2} = F(\varphi_{v_1})^\top F(\varphi_{v_2})$, where $F(\cdot)$ is a feed-forward network. Thus, the score of a schema item increases if substantial attention is placed on schema items to which it bears high similarity.
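A sketch of this scoring for one decoding step, under assumed tensor shapes (all names are ours):

```python
import torch

def self_attention_scores(u_j, U_hat, phi_prev, phi_legal, F):
    """Compute s_att_j by attending over previously decoded schema items.

    u_j:       [dim] current decoder hidden state.
    U_hat:     [dim, J] hidden states at steps where schema items were decoded.
    phi_prev:  [J, g_dim] GNN representations of those decoded items.
    phi_legal: [V_legal, g_dim] GNN representations of currently legal items.
    F:         feed-forward net applied to both sides of the similarity.
    """
    a_hat = torch.softmax(U_hat.T @ u_j, dim=-1)  # attention over the J steps
    S_att = F(phi_prev) @ F(phi_legal).T          # [J, V_legal] similarities
    return a_hat @ S_att                          # [V_legal]

dim, g_dim, J, V_legal = 8, 6, 3, 4
F = torch.nn.Linear(g_dim, g_dim)
s_att = self_attention_scores(torch.randn(dim), torch.randn(dim, J),
                              torch.randn(J, g_dim),
                              torch.randn(V_legal, g_dim), F)
print(s_att.shape)  # torch.Size([4]); added to s_loc_j before the softmax
```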

Training We maximize the log-likelihood of the gold sequence during training, and use beam search (of size 10) at test time, similarly to Krishnamurthy et al. (2017) and prior work. We run the GNN for $L = 2$ steps.

5 Experiments and Results

Experimental setup We evaluate on SPIDER (Yu et al., 2018b), which contains 7,000/1,034/2,147 train/development/test examples.

We pre-process examples to remove table aliases (AS T1/T2/...) from the queries and use the explicit table name instead (i.e., we replace T1.col with table1_name.col), as in the vast majority of cases (> 99% in SPIDER) these aliases are redundant. In addition, we add a table reference to all columns that do not have one (i.e., we replace col with table_name.col).
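A rough regex-based sketch of this alias-removal step; the paper does not specify its implementation, and a robust version would use a SQL parser rather than regular expressions:

```python
import re

def remove_aliases(sql: str) -> str:
    """Rewrite 'table AS T1 ... T1.col' as 'table ... table.col'."""
    alias_to_table = {
        alias.lower(): table
        for table, alias in re.findall(r"(\w+)\s+AS\s+(T\d+)", sql, re.IGNORECASE)
    }
    sql = re.sub(r"\s+AS\s+T\d+", "", sql, flags=re.IGNORECASE)  # drop 'AS T1'
    def repl(m):  # replace 'T1.col' with 'table.col'
        alias, col = m.group(1).lower(), m.group(2)
        return f"{alias_to_table.get(alias, m.group(1))}.{col}"
    return re.sub(r"\b(T\d+)\.(\w+)\b", repl, sql, flags=re.IGNORECASE)

q = ("SELECT T1.name FROM team AS T1 JOIN match_season AS T2 "
     "ON T1.team_id = T2.team")
print(remove_aliases(q))
# SELECT team.name FROM team JOIN match_season ON team.team_id = match_season.team
```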

Table 1: Development set accuracy for all models.

We use the official evaluation script from SPIDER to compute accuracy, i.e., whether the predicted query is equivalent to the gold query.

Results Our full model (GNN) obtains 39.4% accuracy on the test set, substantially higher than the prior state of the art (SYNTAXSQLNET), which is at 19.7%. Removing the GNN from the parser (NO GNN), which results in the parser of Krishnamurthy et al. (2017) augmented with a grammar for SQL, obtains an accuracy of 33.8%, showing the importance of encoding the schema structure. Table 1 shows results on the development set for baselines and ablations. The first column reports accuracy on the entire dataset, and the next two columns report accuracy when partitioning examples into queries involving only one table (SINGLE) vs. more than one table (MULTI).

GNN dramatically outperforms previously published baselines SQLNET and SYNTAXSQLNET, and improves the performance of NO GNN from 34.9% to 40.7%. Importantly, using schema structure specifically improves performance on questions with multiple tables from 14.6% to 26.8%.

We ablate the major novel components of our model to assess their impact. First, we remove the self-attention component (NO SELF ATTEND). Performance drops by 2 points: SINGLE slightly improves, while MULTI drops by 6.5 points. Second, to verify that the improvement is not due to self-attention alone, we ablate all other uses of the GNN; namely, we use a model identical to NO GNN, except that it can access the GNN representations through self-attention (ONLY SELF ATTEND). We observe a large drop in performance to 35.9%, showing that all components are important. Last, we ablate the relevance score by setting $\rho_v = 1$ for all schema items (NO REL.). Indeed, accuracy drops to 37.0%.

To assess the ceiling performance possible with a perfect relevance score, we run an oracle experiment, where we set ρ v = 1 for all schema items that are in the gold query, and ρ v = 0 for all other schema items (GNN ORACLE REL.). We see that a perfect relevance score substantially improves performance to 54.3%, indicating substantial headroom for future research.

Join analysis For any model, we can examine the proportion of predicted queries containing a join whose structure is "bad": (a) the join condition clause uses the same table twice (ON t1.column1 = t1.column2), or (b) the joined tables are not connected through a primary-foreign key relation.
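A sketch of how such a check could be implemented given a schema's foreign key pairs; the function and its input format are our own illustration:

```python
import re

def bad_join(sql: str, foreign_keys: set) -> bool:
    """Flag conditions (a) and (b) above for every ON clause in a query.

    sql:          query with explicit table names (aliases removed).
    foreign_keys: set of ("table.col", "table.col") pairs, closed under
                  reversal, listing all foreign-primary key relations.
    """
    for lhs, rhs in re.findall(r"ON\s+([\w.]+)\s*=\s*([\w.]+)", sql, re.IGNORECASE):
        if lhs.split(".")[0] == rhs.split(".")[0]:  # (a) same table twice
            return True
        if (lhs, rhs) not in foreign_keys:          # (b) not a key relation
            return True
    return False

fks = {("has_pet.student_id", "student.id"), ("student.id", "has_pet.student_id")}
print(bad_join("... JOIN has_pet ON student.id = has_pet.student_id ...", fks))  # False
print(bad_join("... JOIN has_pet ON student.id = student.age ...", fks))         # True
```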

We find that NO GNN predicts such joins in 83.4% of cases, while GNN does so in only 15.6% of cases. When we automatically remove from the beam candidates where condition (a) occurs, NO GNN predicts a "bad" join in 14.2% of cases vs. 4.3% for GNN (total accuracy increases by 0.3% for both models). As an example, in Figure 3, $s^{\text{loc}}_j$ scores the table student the highest, although it is not related to the previously decoded table semester. Adding the self-attention score $s^{\text{att}}_j$ corrects this and leads to the correct student_semester, probably because the model learns to prefer connected tables.

6 Conclusion

We present a semantic parser that encodes the structure of the DB schema with a graph neural network, and uses this representation to make schema-aware decisions at both encoding and decoding time. We demonstrate the effectiveness of this method on SPIDER, a dataset that contains complex schemas which are not seen at training time, and show substantial improvement over the current state of the art.