Extracting meronyms for a biology knowledge base using distant supervision

Xiao Ling
P. Clark
Daniel S. Weld
AKBC '13
2013
View in Semantic Scholar

Abstract

Knowledge of objects and their parts, meronym relations, are at the heart of many question-answering systems, but manually encoding these facts is impractical. Past researchers have tried hand-written patterns, supervised learning, and bootstrapped methods, but achieving both high precision and recall has proven elusive. This paper reports on a thorough exploration of distant supervision to learn a meronym extractor for the domain of college biology. We introduce a novel algorithm, generalizing the ``at least one'' assumption of multi-instance learning to handle the case where a fixed (but unknown) percentage of bag members are positive examples. Detailed experiments compare strategies for mention detection, negative example generation, leveraging out-of-domain meronyms, and evaluate the benefit of our multi-instance percentage model.

1. Introduction

We are motivated by the vision of Digital Aristotle, specifically an interactive knowledge-based system that can answer a wide range of questions about scientific topics like college biology [5] . Knowledge of object part-whole relations, so called meronyms, is central to many questions. For example, in order to answer the question "What part of a cell allows for selective permiability?" it's important to know that the plasma membrane is part of all cells. However it is almost impractical to manually encode all meronym relations 1 , so we seek to extract them automatically from text.

The problem of meronym extraction has been studied for many years. For example, Berland and Charniak [2] applied two Hearst patterns [6] to find candidate meronyms and then used corpus statistics to order them. Girju et al. [4] added a verbal pattern and leveraged decision trees to learn semantic constraints over the part/whole's WordNet classes. While these efforts yielded promising precision, Berland and Charniak's experiments were limited to finding the parts of five specific objects. Furthermore, Girju et al.'s method required thousands of manually-labeled sentences in order to train a reliable classifier, and this effort would likely need to be repeated for different technical domains such as biology. Minimally supervised approaches have also been explored [1, 14, 8] . Espresso used bootstrap learning and Web statistics to learn meronyms [14] . Ittoo and Bouma extended the method by independently bootstrapping different subtypes of the meronym relation (e.g., located-in vs member-of) [8] . Precision was high (0.67-0.82) on the most confident 500 predictions, but recall was not evaluated.

Distant supervision is a promising alternative, which has seen considerable success for relation extraction [10, 22, 13] . The basic idea is to use a database table, R, to automatically create a training set by heuristically labeling a sentence as a positive example if it contains mentions of two entities that match a tuple in R. While the resulting training set is usually quite noisy, it requires no human labeling effort. To cope with incorrectly labeled examples, Bunescu and Mooney suggest using multiple instance learning [3] and several authors have proposed graphical models that encode the assumption that at least one of the sentences matching each tuple is a true positive example [16, 7, 20] . In this paper, we generalize this model to one which assumes a certain percentage of matches (apriori unknown and potentially different for each relation) are true positives. We extend Hoffmann et al.'s graphical model [7] to enforce percentage constraints, train using perceptron updates, and use grid search via cross-validation to find the best percentage for a relation. In summary, this paper makes the following contributions:

• We introduce a novel generalization of multi-instance learning from the "at least one" assumption to handle the case where a fixed (but unknown) percentage of bag members are positive examples. • We present a detailed evaluation of the efficacy of distant supervision for the problem of extracting meronyms knowledge, only 185 meronym facts are included, far less than the number of the facts in our experiment.

from biological text, specifically comparing strategies for 1) mention detection, 2) negative example generation, 3) multi-instance percentage, and 4) leveraging out-ofdomain meronyms.

2. Our Distant Supervision Model

As input distantly supervised extration learning requires three inputs: 1) a set of target relations R (and a null relation NA); 2) a knowledge base ∆ = {(a1, R, a2)} where R ∈ R (i.e. a triple store); (3) an unlabeled textual corpus Σ. Extractors are typically learned in two stages.

Data Preparation: Given the knowledge base, ∆, and the unlabeled corpus, first identify all mentions of entities in each sentence 2 . Next one enumerates all pairs of entities (i.e. (e1, e2)) in each sentence s ∈ Σ and then groups these examples so each entity pair is associated with a set of sentences that mention both entities. Finally, for each entity pair and target relation R ∈ R, check to see if the ground fact (e1, R, e2) is in ∆; if so call the quadruple (e1, R, e2, s) a fact mention. Clearly there is no guarantee that every fact mention for (e1, R, e2) is actually expressing R (in fact, none of the sentences may state R). Bunescu and Mooney [3] suggested modeling this in terms of multi-instance learning(MIL), where the training set consists of positive and negative bags of instances. One may assume that at least one of the instances in each positive bag is a true positive instance and that every instance in a negative bag is a true negative instance. For distantly supervised extractor learning, we form a positive bag from the union of all mentions of each mentioned has-part fact: {. . . , (e1, R, e2, si), . . .}. Negative bags are formed by taking (some of) the (e1, e2, s) matches where ∆ contained no corresponding fact and are denoted {. . . , (e1, N A, e2, si), . . .}. All instances are represented as a vector of features, such as those used by Mintz et al [13] .

Training the Extractor: From the set of positive and negative bags, one must learn a classifier that will take a new example (e1, e2, s) and predict whether N A or a relation R ∈ R holds. For example, MultiR [7] uses structural perceptron-style additive updates to train a log-linear model using a simple graphical model that aggregates sentencelevel predictions with a deterministic OR of the predicted relation types. For meronym extraction, there is only one target relation: R = {has-part}.

Handling Percentage Constraints

So far we have made the common assumption that at least one instance is positive in each positive bag. One consequence of this assumption is that during MultR's perceptron learning, only a few instances cause updates even if the true positive instances are abundant in some bags. We propose an extension to MultiR that allows more updates by tuning a parameter p 3 .

For each bag, let y be its bag label (has-part or NA), z = {zi} be the instance labels and x = {xi} be the instances. In MultiR, p(z, y|x) p(z|x)p(y|z). The first term is defined as p(z|x) = i p(zi|xi) where p(zi|xi) is a log-linear model. Predicting y given z(the second term) is deterministic. In other words, y =has-part is predicted if and only if ∃i, zi =has-part; y =NA otherwise.

In order to handle the percentage constraint that strictly more than p% of a bag's instances be positive, we make the following modification. Instead of predicting y =has-part using deterministic OR (i.e. seeing only one zi =has-part), we predict y =has-part when strictly more than p% of all the instances have a predicted label has-part.

During learning, when a bag's label prediction,ŷ, disagrees with the true bag label, the conditional probability p(z|ŷ, x) is computed to determine the most probable values of zis. The exact solution is obtained via maximum weighted bipartite matching. Pursuing efficiency, MultiR uses an effective approximation: when the bag label is haspart, assigning that label to the instance with the highest score under the current linear model. However, it is likely that the other instances with high scores are also truly positive. Thus, we propose to instead assign the label has-part to all of the instances with the highest scores in a positive bag, until strictly more than p% denote has-part.

Note that MultiR is a special case of our algorithm when p = 0. In both situations, at least one zi =has-part is equivalent to more than 0% of the instances.

The discussion so far has assumed that one knows the actual percentage of positive instances in a postive bag, but p varies from relation to relation and also depends on the textual corpus in question. To find the best value of p, therefore, our algorithm performs a grid search considering different values and chooses the best using cross validation.

3. Experimental Framework

Knowledge Base ∆: We start with the ontology created in Project Halo [5] , which includes a seed database of meronyms from the biology domain. This includes 640 context-free, universally quantified has-part facts, labeled by human annotators and another 3179 has-part facts having some sort of context. For instance, one KB expression might state that a plant-cell has-part chloroplast with some qualifications such as the plant cells being photosynthetic, etc. Entity Identifier: In order to identify mentions of biological entities in sentences, we use a simple procedure that automatically marks substrings by taking the longest possible match from a compiled dictionary of biological terms. In the dictionary, each concept is associated with a few different lexical phrases (e.g. "cell of plant" is recognized as plant-cell). The entity types are then determined by a curated ontology (e.g., plant-cell has the type living-entity; in total there are nine types). Unlabeled Corpus Σ: We use a digitized version of a popular biology textbook [15] comprised of 41, 892 sentences in 56 chapters. For syntactic analysis, we ran the Stanford CoreNLP pipeline [9] , substituting the parser by Charniak-Johnson using a biomedical model [11] . In the whole data set used in the following experiments, around 20 thousand sentences are used as fact mentions for roughly 4 thousand facts.

Features: We use the commonly adopted "Mintz features" [13] , which include features such as the word sequence between two entity arguments as well as the dependency path connecting them. Since overfitting was a concern, we experimented with filtering features that occurred fewer than k times in the training set, but to our surprise this reduced F1 scores for all attempted values of k. Evaluation: A held-out test set was set up and validated by two human annotators at the start of the project, consisting of 172 has-part facts and 206 NA facts. In most experiments, we report precision and recall for k-fold crossvalidation (k = 5) over the training set as well as performance on the test set.

In the following sections, we evaluate various versions of the distant supervision process. In each case, when evaluating one aspect, we run the experiment using the best combination of other aspects. For example, when considering the use of coreference in mention detection, we use the best setting for creating negative examples.

4. Expanding Mentions With Coref

As mentioned in the previous section, in the baseline condition entities are identified using simple string match against a precompiled dictionary. When two mentions of the same entities appear in the same sentence, only one mention is used depending on the token distance from the mention of the other argument (the shorter one is chosen). However, sometimes the closest mention of an entity with respect to the other is a pronoun or nominal rather than its full name. For instance, consider the task of extracting (X-Chromosome, has-part, Gene) from the sentence "One of the sex-determining chromosomes is X chromosome and it has around 1,100 genes.", the pronoun "it" that refers to "X chromosome" has a more direct syntactic connection with "genes". The best mentions of a concept are chosen based on the length of their dependency path. Besides pronouns, we also found that partitive nouns are often used in the corpus. Only "nucleitide bases" will be identified as a concept in the phrase "a sequence of nucleitide bases". We manually selected 29 partitive nouns (e.g. collection, pair, etc) and expand mentions like the previous example to their whole phrases including the partitive nouns. The top two rows of Table 1 show the 5-fold cross-validation (CV) results on the training set. The following two rows display the results on the held-out test set (TEST). Using coreference to improve mention selection yields a small, but definite imrpovement.

Table 1: An accurate mention detection lifts the performance.

5. Generating Negative Examples

While the distant supervision framework is clear about how to generate positive examples, negative examples are not so straightforward. If one knew that the underlying knowledge base were complete, then one could add every sentence where an entity pair failed to match a fact as a negative example (aka the closed-world assumption). Of course, if one had a complete knowledge base, one would not need to build an extractor in the first place. There is also the issue of skew -increasing the number of negative examples increases precision and reduces recall. Depending on one's objective, varying points along this tradeoff may be desired.

In our dataset, 887 out of 3819 has-part facts have at least one corresponding sentence that mentions both arguments. These form the positive bags. Another 105 argument pairs are manually labeled as definitively not having a has-part relation; of these 105 pairs, 77 have matching sentences, which are taken as negative examples. We call this setup "BASE".

To increase the number of negative examples for training, we exploit the asymmetric nature of has-part and create a negative instance by swapping the arguments of each positive has-part fact: if (e1, has-part, e2) holds, so will (e2, NA, e1). This heuristic(REV) yields an additional 887 negative instances. Furthermore, a transitive closure(TRANS) of existing negative relations lifts the number of negative facts to Table 2 : Performance on different negative data. "#+" indicates the number of has-part facts in the training set; "#-" indicates the number of NA facts. Table 2 shows the results. Since additional negative examples improve precision at the expense of recall, it is unsurprising that BASE has the highest recall in both evaluations, while TRANS has the highest precision. However, it is notable that TRANS also has the highest F1 scores by a sizable amount. The actual percentage of instances in a positive bag that truly denote the corresponding fact is a function of both the relation and the corpus. Without knowing the value of p a priori, our learner picks the best value based on the results of automated cross validation experiments. Figure 1 shows the average F1 values discovered by a grid search from 0.0 to 0.9 with a 0.1 increment on p. Because the grid search finds a peak in F1 at p = 0.2, that value is used when evaluating on the held-out test set (Table 3) . Our approach for finding and exploiting percentage constraints results in a small, but distinct improvement in F1.

Table 2: Performance on different negative data. “#+” indicates the number of has-part facts in the training set; “#-” indicates the number of NA facts.

Table 3: Performance comparing MultiR and MultiR with percentage.

Recall Precision

F1 CV(p = 0) 0.674 0.821 0.740 CV(p = 0.2) 0.730 0.776 0.753 TEST(p = 0) 0.744 0.795 0.769 TEST(p = 0.2) 0.791 0.786 0.788

7. Supervision From Out-Of-Domain

The manner in which meronym relations are expressed varies from domain to domain. In the case of the biology domain, we were able to use distant supervision to exploit the nascent, manually-created knowledge-base. Given that WordNet [12] encodes a huge number of part-whole relations over synset pairs, it is natural to wonder if these general facts could be exploited to increase the performance in the biology domain.

After some experimentations, we chose only substance and part meronyms (not member meronyms). Furthermore, we exclude meronym pairs involving a location or a named entity. Initial efforts of matching these facts against a broad corpus led to dramatic performance drops, so we restrict the matching corpus to Wikipedia articles. Unfortunately, WordNet synsets use semantic types that are incompatible with those in the Halo biology ontology. Therefore, we make the features type-free. Take one feature for example:

Living-Entity − [nsubj] → have ← [dobj] − Organ is reduced to ARG1 − [nsubj] → have ← [dobj] − ARG2

We compare the results on the test set when training the model with(TEST+WN) or without(TEST) additional WordNet meronyms (around 700 ground facts). We also show the results when different percentage values are set (p = 0 or p = 0.2). While the experiment (Table 4) shows small improvement due to the additional training instances, the gains are small. Inspection of the data makes it clear that the absence of type features exerts a significant hit on performance. However, we believe that the results suggest that with further effort on ontological alignment there would be bigger benefits from the additional examples. Table 4 : Adding additional out-of domain meronym facts during training yields small improvements despite the lack of type features.

Table 4: Adding additional out-of domain meronym facts during training yields small improvements despite the lack of type features.

8. Compared Systems

We also evaluate the results in the context of other available systems, which are listed below:

• Pattern-based (PAT): Our implementation of 2 Hearst patterns (whole 's part and part of whole ) and 2 verbal patterns (e.g. part form whole, whole consist of part).

• Always Predict has-part (AP): For any entity pair, this baseline always predicts has-part. • Entity type based (TYPE): This baseline predicts solely based on the semantic types of the arguments of the fact mention. If most facts under the same (ordered) type pair in the training data are positive, the test mention will be considered a positive fact. Similarly for the negative predictions.

• Table 5 : MultiR(p) is the system when the percentage value is 0.2. Table 5 shows that our distantly supervised relation extractor is able to achieve reasonably high numbers in both precision and recall among all other systems. Moreover, our proposed extension beyond the "at least one" assumption further lifts the performance.

9. Related Work And Conclusion

In addition to the work described in the introduction, we note that Wang et al.'s p-posterior mixture-model kernels [21] are similar to our density model for positive bags. Our methods differ in two respects, however. First, they seek to predict a label for a complete bag not for instances within the bag; this is akin to the distinction between aggregate and sentential extraction. Secondly, our approaches use dramatically different mechanisms -SVMs vs perceptron updates. Our work is also related to that of Snow et al., who present a probabilistic joint model for taxonomy induction [17] .

Given the importance of meronyms in question answering, extraction of these relations from text deserves more attention. This paper has presented a thorough exploration of distant supervision to this task, comparing various strategies for mention detection, negative example generation, and example transfer. In addition, we presented a novel generalization of multi-instance learning that replaces the "at least one" assumption to handle the case where an unknown but fixed percentage of bag members are positive examples. Our results show that the approach for finding and exploiting percentage constraints results in a small, but distinct improvement in F1.

A common approach on news articles uses named-entity recognition (NER) to find the entity mentions and simple string match to disambiguate the entities; a more sophisticated approach might use named-entity linking (NEL).3 For simplicity, we describe the extension for one single relation.

Surdeanu et al.[19] and Sun et al.[18] propose only taking (a, R, c) as a negative example if (a, R, b) is a positive fact for b = c, but this heuristic depends on the relation being functional, and an entity can have many parts.