IKE - An Interactive Tool for Knowledge Extraction

Bhavana Dalvi
Sumithra Bhakthavatsalam
Christopher Clark
P. Clark
Oren Etzioni
Anthony Fader
Dirk Groeneveld
AKBC@NAACL-HLT
2016

Abstract

Recent work on information extraction has suggested that fast, interactive tools can be highly effective; however, creating a usable system is challenging, and few publicly available tools exist. In this paper we present IKE, a new extraction tool that performs fast, interactive bootstrapping to develop high-quality extraction patterns for targeted relations. Central to IKE is the notion that an extraction pattern can be treated as a search query over a corpus. To operationalize this, IKE uses a novel query language that is expressive, easy to understand, and fast to execute, all essential requirements for a practical system. It is also the first interactive extraction tool to seamlessly integrate symbolic (boolean) and distributional (similarity-based) methods for search. An initial evaluation suggests that relation tables can be populated substantially faster than by manual pattern authoring while retaining accuracy, and more reliably than fully automated tools, an important step towards practical KB construction. We are making IKE publicly available (http://allenai.org/software/interactive-knowledge-extraction).

1 Introduction

Knowledge extraction from text remains a fundamental challenge for any system that works with structured data. Automatic extraction algorithms (e.g., Angeli et al., 2015; Carlson et al., 2009; Nakashole et al., 2011; Hoffmann et al., 2011) have proved efficient and scalable, especially when leveraging existing search engine technologies (e.g., Etzioni et al., 2004), but typically produce noisy results: for example, the best F1 score for the KBP slot filling task was 0.28, as reported in (Angeli et al., 2015). Weakly supervised automatic bootstrapping methods (Carlson et al., 2010; Gupta and Manning, 2014) are more precise in the initial bootstrapping iterations, but degrade in later iterations, a problem generally referred to as semantic drift.

More recently there has been work on more interactive methods, which can be seen as a "machine teaching" approach to KB construction (Amershi et al., 2014; Amershi et al., 2015; Li et al., 2012). For example, Soderland et al. (2013) showed that users can be surprisingly effective at authoring and refining extraction rules for a slot filling task, and Freedman et al. (2011) demonstrated that a combination of machine learning and user authoring produced high-quality results. However, none of these approaches has evolved into a publicly available tool.

In this paper we present IKE, a usable, general-purpose tool for interactive extraction. Central to IKE is the notion that an extraction pattern can be treated as a search query over a corpus, building on earlier work by Cafarella et al. (2005). IKE addresses the resulting requirements of expressiveness, comprehensibility, and speed with a novel query language based on chunking rather than parsing, and is the first tool to seamlessly integrate symbolic (boolean) and distributional (similarity-based) methods for search. It also includes a machine learning component for suggesting new queries to the user. A preliminary evaluation suggests that relation tables can be populated substantially faster with IKE than by manual pattern authoring (and more reliably than with fully automated tools) while retaining accuracy, suggesting that IKE has utility for KB construction.

Query                            Interpretation
the dog                          "the" followed by "dog"
NP grows                         An NP followed by "grows"
(NP) grows                       Capture the NP and place in column 1 (1 column table)
(NP) conducts (NP)               Capture the two NPs into columns 1 and 2 (2 column table)
(? NP) is conducted by (? NP)    Capture the two NPs and place in columns named Energy and Material
the {cat,dog}                    "the" followed by "cat" or "dog"
cats and {NN,NNS}                "cats and" followed by NN or NNS
JJ* dog                          Zero or more JJ then "dog"
JJ+ dog                          One or more JJ then "dog"
JJ[2-4] dog                      2 to 4 JJ then "dog"
dog .[0-4] tail                  "dog" followed by any 0 to 4 words followed by "tail"
dog∼50                           "dog" and the 50 words most distributionally similar to "dog"
. dog                            Any word then "dog"
$colors                          Any entry in the single-column "colors" table
$colors ∼100                     Same, plus the 100 most similar words
$flower.color                    Any entry in the "color" column of the "flower" table

2 Interactive Knowledge Extraction (IKE)

We first overview IKE and a sample workflow using it. IKE allows the user to create relation tables, and populate them by issuing pattern-based queries over a corpus. It also has a machine learning component that suggests high-quality broadenings or narrowings of the user's queries. Together, these allow the user to perform fast, interactive bootstrapping.

2.1 IKE's Query Language

A key part of IKE is treating an extraction pattern as a search query. To do this, the query language must be both comprehensible and fast to execute. To meet these requirements, IKE indexes and searches the corpus using a chunk-based rather than dependency-based representation of the text. IKE's query language is presented by example in Table 1. It supports wildcards, window sizes, POS tags, chunk tags, and general regular-expression queries, similar to TokensRegex (Chang and Manning, 2014) and to the query languages of Lucene and ElasticSearch (Gormley and Tong, 2015). Additionally, IKE supports distributional-similarity-based search (e.g., dog∼50 matches "dog" and the 50 words most similar to it). "Capture groups", indicated by parentheses, instruct IKE to capture the matching element(s) as candidate entries in the table being populated. The user can also reference data in other already-constructed tables using the $ prefix.

The use of a chunk-based representation has several advantages over a dependency-based one (e.g., Freedman et al., 2011; Gamallo et al., 2012; Hoffmann et al., 2015; Akbik et al., 2013). First, both indexing and search are very fast (e.g., <1 sec to execute a query over 1.5M sentences), which is essential for an interactive system. Second, authoring queries does not require detailed knowledge of dependency structure, making the language more accessible. Finally, the system avoids parse errors, a considerable challenge for dependency-based systems. Corresponding challenges with chunk-based representations, e.g., defining constituent boundaries in terms of POS chunks, are partially alleviated by providing predefined, higher-level POS-based patterns, e.g., for verb phrases.
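To make the idea of "an extraction pattern as a search query" concrete, the following is a minimal sketch of a matcher for a small subset of the query language: literal words, POS tags, the `.` wildcard, `{a,b}` disjunctions, the quantifiers `*`, `+` and `[m-n]`, and parenthesized capture groups. It is not IKE's implementation (IKE searches a prebuilt index). The approach here, compiling the query to a regular expression over `word/TAG` strings, and all function names and the tiny tag set are assumptions made for illustration. Chunk tags such as NP, `∼` similarity search, and `$table` references are not covered.

```python
import re

# Small illustrative tag set (an assumption); a real tagger emits many more tags.
POS_TAGS = {"DT", "JJ", "NN", "NNS", "VBZ", "IN"}

# Trailing quantifier on a query symbol: *, +, or [m-n].
_QUANT = re.compile(r"^(.+?)(\*|\+|\[(\d+)-(\d+)\])$")


def _atom(sym):
    """Regex matching exactly one encoded token 'word/TAG ' for a query symbol."""
    if sym == ".":
        return r"\S+/\S+ "                      # any single token
    if sym.startswith("{") and sym.endswith("}"):
        options = sym[1:-1].split(",")
        return "(?:" + "|".join(_atom(o) for o in options) + ")"
    if sym in POS_TAGS:
        return r"\S+/" + sym + " "              # any word with this POS tag
    return re.escape(sym) + r"/\S+ "            # this literal word, any tag


def _piece(sym):
    """Regex for a query symbol plus its optional quantifier."""
    m = _QUANT.match(sym)
    if not m:
        return _atom(sym)
    base, quant = m.group(1), m.group(2)
    suffix = quant if quant in ("*", "+") else "{%s,%s}" % (m.group(3), m.group(4))
    return "(?:" + _atom(base) + ")" + suffix


def compile_query(query):
    """Compile a whitespace-separated query into a regex over encoded text."""
    parts = []
    for sym in query.split():
        opens = len(sym) - len(sym.lstrip("("))
        sym = sym.lstrip("(")
        closes = len(sym) - len(sym.rstrip(")"))
        sym = sym.rstrip(")")
        parts.append("(" * opens + _piece(sym) + ")" * closes)
    # The lookbehind forces a match to begin at a token boundary.
    return re.compile(r"(?<![^ ])" + "".join(parts))


def encode(tagged):
    """Encode [(word, tag), ...] as 'word/TAG word/TAG ' (assumes no '/' in words)."""
    return "".join(f"{w}/{t} " for w, t in tagged)


def words(span):
    """Strip the tags from an encoded span, leaving the plain words."""
    return " ".join(tok.rsplit("/", 1)[0] for tok in span.split())


def search(query, sentences):
    """Return one tuple of captured phrases per match, i.e. candidate table rows."""
    pattern = compile_query(query)
    rows = []
    for tagged in sentences:
        for m in pattern.finditer(encode(tagged)):
            rows.append(tuple(words(g) for g in m.groups()))
    return rows


sentences = [
    [("copper", "NN"), ("conducts", "VBZ"), ("electricity", "NN")],
    [("the", "DT"), ("thick", "JJ"), ("wood", "NN"), ("absorbs", "VBZ"), ("sound", "NN")],
]
print(search("(JJ* NN) {conducts,absorbs} (NN)", sentences))
```

Each capture group becomes one column of a candidate row, which mirrors how a two-group query populates a two-column table in Table 1.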

Table 1: IKE’s Query Language, described by example.

2.2 Machine Learning

IKE also has an ML-based Query Suggestor that proposes improved queries to the user. This module performs a depth-limited beam search to explore the space of query variants, evaluated on the user-annotated examples collected so far. Variants are generated by broadening or narrowing a query.

Narrowing a query involves searching the space of restrictions on the current query, e.g., replacing a POS tag with a specific word, adding prefixes or suffixes to the query, or adjusting distributional-similarity-based queries. Similarly, broadening generalizes the given user query, e.g., by replacing a word with its POS tag. In both cases the candidate queries are ranked by the weighted sum of the number of positive (n_p), negative (n_n), and unlabeled (n_u) instances each one matches, the weights being user-configurable (defaults 2, -1, and -0.05 respectively). For example, for the query ($conducts.Material) VBZ ($conducts.Energy) the top three suggested narrowings are:

($conducts.Material) conducts ($conducts.Energy)
($conducts.Material) absorbs ($conducts.Energy)
($conducts.Material) produces ($conducts.Energy)

all of which are patterns that distinguish positive examples from negatives well.
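The ranking criterion described above can be sketched as follows. The weights are the defaults stated in the text (2, -1, -0.05); the `Candidate` class, the function names, and all match counts are hypothetical values chosen for illustration, and the beam search that generates the candidates is not shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    query: str
    n_pos: int        # matched instances the user labeled positive
    n_neg: int        # matched instances the user labeled negative
    n_unlabeled: int  # matched instances with no label yet


def score(c, w_pos=2.0, w_neg=-1.0, w_unl=-0.05):
    """Weighted sum of positive, negative and unlabeled match counts."""
    return w_pos * c.n_pos + w_neg * c.n_neg + w_unl * c.n_unlabeled


def rank(candidates, top_k=3, **weights):
    """Return the top_k candidates, highest score first."""
    ordered = sorted(candidates, key=lambda c: score(c, **weights), reverse=True)
    return ordered[:top_k]


# Hypothetical counts, for illustration only.
candidates = [
    Candidate("($conducts.Material) is ($conducts.Energy)", 6, 9, 400),
    Candidate("($conducts.Material) absorbs ($conducts.Energy)", 5, 1, 20),
    Candidate("($conducts.Material) conducts ($conducts.Energy)", 12, 1, 40),
    Candidate("($conducts.Material) produces ($conducts.Energy)", 4, 2, 30),
]

for c in rank(candidates):
    print(round(score(c), 2), c.query)
```

The small negative weight on unlabeled matches penalizes very general queries (such as the "is" variant here), which match many instances the user has not confirmed.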

2.3 Example Workflow

We now describe these features in more detail by way of an example. Consider the task of acquiring instances of the binary predicate conducts(material,energy), e.g., conducts("steel","electricity").

In IKE, relations are visualized as tables, so we treat this task as one of table population. A typical workflow is illustrated in Figure 1, which we now describe.

Figure 1: IKE interactive bootstrapping workflow

2.3.1 Define The Types Material And Energy

First, the user defines the argument types material and energy. To define a type, IKE lets a user build a single column table, e.g., for type material, the user:

1. Creates a single column table called Material.
2. Manually adds several representative examples to the table, e.g., "iron", "wood", "steel".
3. Expands this set by searching the corpus for phrases with similar embedding vectors, and marking valid and invalid members. For example, the query

$Material ∼20

searches for the 20 phrases most similar to any existing member of the Material table, where similarity is defined as the cosine between the phrase embeddings. Here we use 300 di-

[Figure 1 panels: patterns (e.g., "ARG1 conducts ARG2", "ARG2 flows through ARG1"); Material, Energy table (e.g., Copper, Electricity); Candidate Instances (e.g., metal, sound; human body, electricity; plastic, electricity); Add positive instances to the set.]
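The embedding-based set expansion of Section 2.3.1 (the query $Material ∼20) can be sketched as a nearest-neighbour search under cosine similarity. This is an illustrative sketch, not IKE's code: the function names are hypothetical, the 3-dimensional vectors are made-up toy values, and scoring a candidate by its best similarity to any seed is one reading of "most similar to any existing member".

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand(seeds, embeddings, k):
    """Return the k non-seed phrases closest to any seed, best first."""
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    scored = []
    for phrase, vec in embeddings.items():
        if phrase in seeds:
            continue
        best = max((cosine(vec, sv) for sv in seed_vectors), default=0.0)
        scored.append((best, phrase))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:k]]


# Toy vectors, for illustration only.
embeddings = {
    "iron": [0.9, 0.1, 0.0],
    "steel": [0.8, 0.2, 0.1],
    "wood": [0.5, 0.5, 0.1],
    "copper": [0.85, 0.15, 0.05],
    "plastic": [0.6, 0.4, 0.1],
    "happiness": [0.0, 0.1, 0.9],
}

material = {"iron", "wood", "steel"}
print(expand(material, embeddings, k=2))
```

In the workflow described above, the returned phrases are only candidates; the user still marks each one as a valid or invalid member of the table.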

2.3.3 Bootstrapping To Expand The Table

Figure 2: Search for examples of “X absorbs Y”, where X is distributionally similar (∼500) to existing entries in the Material column of the conducts table. The user then annotates examples for inclusion in the table.

2.4 Execution Speed

IKE uses BlackLab (Institute Dutch Lexicology, 2016) for indexing the corpus. This, combined with the chunk-based representation, results in fast query execution times (e.g., <1 second for a query over 1.5M sentences), an essential requirement for an interactive system (Table 2).

3 Preliminary Evaluation

Table 2: Avg. query-times with different sized corpora.

3.1 Experiments

Although IKE is still under development, we have conducted a preliminary evaluation, comparing it with two other methods for populating relation tables. Our interest is in how these different methods compare in terms of precision, yield and time:

• Manual: The user manually authors and refines patterns (without any automatic assistance) to populate a table.

• Automatic: The user provides an initial table with a few entries, and then lets the system bootstrap on its own, without any further interaction.

• Interactive (IKE): Interactive bootstrapping, as described earlier.

The manual system was implemented in IKE by disabling the embedding-based set expansion and ML-based query suggestion features. The automatic approach was simulated in IKE by removing both user annotation steps in Figure 1, and instead adding all machine-learned patterns suggested by the Query Suggestor (Section 2.2) and all instances that occur at least k times in the corpus (using k = 2). This is a simple baseline method of bootstrapping, compared with more sophisticated methods such as co-training (Collins and Singer, 1999; Neelakantan and Collins, 2014). For a fair comparison, we compared results after 3 bootstrapping iterations (for Automatic and IKE) and a similar amount of user time (∼30 mins, for Manual and IKE). The number of iterations was limited to 3 to keep the annotation time within reasonable limits.
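The instance filter used by the automatic baseline (keep instances that occur at least k times, with k = 2) can be sketched as below. The function name and the extraction list are hypothetical; in the setup described above, the extractions would come from running every suggested pattern over the corpus.

```python
from collections import Counter


def frequent_instances(extractions, k=2):
    """Keep each extracted instance that occurs at least k times."""
    counts = Counter(extractions)
    return sorted(inst for inst, n in counts.items() if n >= k)


# Hypothetical pattern matches, one tuple per occurrence in the corpus.
extractions = [
    ("copper", "electricity"),
    ("copper", "electricity"),
    ("metal", "heat"),
    ("metal", "heat"),
    ("metal", "heat"),
    ("river", "boats"),
]

print(frequent_instances(extractions, k=2))
```

A frequency threshold replaces the user's judgment in this baseline, which is one reason it admits more noise than the interactive setting.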

3.2 Tasks And Datasets

We compared these methods on defining and populating two target relations: conducts(material,energy) and has-part(organism,bodypart). All methods extract knowledge from the same corpus of science text, consisting of ∼1.5M sentences (largely) about elementary science drawn from science textbooks, Simple Wikipedia, and the Web. For each relation, two (different) users familiar with IKE were asked to construct these tables. The numbers presented in Table 3 and Table 4 are averaged over these two users. Although this study is small, it provides helpful indicators about IKE's utility.

Table 3 shows the results for building the conducts(material,energy) table. Most importantly, with IKE the user was able to discover substantially more patterns (31 vs. 7) with higher precision (52.7% vs. 25.5%) than with the manual approach, resulting in a larger table (59 vs. 27 rows) in less time (20 vs. 30 mins). It also shows that fully automatic bootstrapping produced a large number of low-quality rules (34.4% precision), with an overall lower yield (63 rows).

Table 3: conducts(material,energy) table after 3 iterations or ∼30 minutes user time. IKE helps the user discover substantially more patterns than the manual method (31 vs. 7), with better precision and in less time, resulting in an overall yield of 59 relation instances. Fully automatic bootstrapping produced a large number of patterns with lower precision (34.4%) than the IKE patterns (52.7%).

Table 4: has-part(organism,bodypart) table after 3 iterations or ∼30 minutes user time. Again, IKE produces the highest yield by helping the user discover 21 patterns with precision (22.5%) comparable to manual patterns (25.2%). Note that further use of IKE continues to expand the yield (e.g., after 3 more iterations of IKE the yield rises to 262 while maintaining average precision).

3.3 Results

Note that for both Manual and IKE, users have to decide how to spend their time budget, in particular between creating high-quality patterns and annotating the examples found by those patterns. Thus the precision scores in the tables reflect how the users chose to make this tradeoff, while the yield reflects their success at the overall goal, namely building a good table. Table 4 shows similar results for constructing the has-part(organism,bodypart) table, with IKE having the highest overall yield. Although this is a small case study, it suggests that IKE has value for rapid knowledge base construction.

4 Conclusion

We have presented IKE, a usable, general-purpose tool for interactive extraction. It has an expressive, easily comprehensible query language that integrates symbolic (boolean) and distributional (similarity-based) methods for search, and has a fast execution time. A preliminary evaluation suggests that IKE is effective for the task of knowledge-base construction compared to manual pattern authoring or fully automated extraction tools. We are currently using this tool to expand the KB used by the Aristo system (Clark et al., 2016), and are making IKE publicly available on our Web site at http://allenai.org/software/interactive-knowledge-extraction.