
A Two-Stage Masked LM Method for Term Set Expansion

Authors

Abstract

We tackle the task of Term Set Expansion (TSE): given a small seed set of example terms from a semantic class, finding more members of that class. The task is of great practical utility, and also of theoretical utility as it requires generalization from few examples. Previous approaches to the TSE task can be characterized as either distributional or pattern-based. We harness the power of neural masked language models (MLM) and propose a novel TSE algorithm, which combines the pattern-based and distributional approaches. Due to the small size of the seed set, fine-tuning methods are not effective, calling for more creative use of the MLM. The gist of the idea is to use the MLM to first mine for informative patterns with respect to the seed set, and then to obtain more members of the seed class by generalizing these patterns. Our method outperforms state-of-the-art TSE algorithms. Implementation is available at: https://github.com/guykush/TermSetExpansion-MPB/

1 Introduction

Term Set Expansion (TSE) is the task of expanding a small seed set of terms into a larger (ideally complete) set of terms that belong to the same semantic category. For example, the seed set {"orange", "apple"} should expand into a set of fruits, while {"orange", "blue"} into a set of colors, and {"apple", "google"} into a set of tech companies. Beyond being of great practical utility, the TSE task is a challenging instance of the problem of generalizing from few examples. Solving TSE requires the algorithm to: (1) identify the desired concept class based on few examples; and (2) identify additional members of the class.

We present an effective TSE method which is based on querying large, pre-trained masked language models (MLMs). Pre-trained language models (LMs) have been shown to contain semantic (Tenney et al., 2019), syntactic (Goldberg, 2019; Hewitt and Manning, 2019; Linzen et al., 2016) and factual knowledge (Petroni et al., 2019), and to be great starting points for transfer-learning to new tasks via fine-tuning on few examples. However, the TSE seed sets are too small for fine-tuning, calling for a different approach. Our method uses the MLMs directly for the task they were trained for, language modeling, by issuing word-completion queries and operating on the returned word distributions. 1

Previous solutions to the TSE problem (also called semantic class induction) can be roughly categorized into distributional and pattern-based approaches (Shi et al., 2010). Our method can be seen as a combination of the two.

The distributional approach to TSE (Hindle, 1990; Pantel and Lin, 2002; Pantel et al., 2009; Mamou et al., 2018; Mahabal et al., 2018) operates under the hypothesis that similar words appear in similar contexts (Harris, 1968). These methods represent each term in the vocabulary as an embedding vector that summarizes all the contexts in which the term appears in a large corpus, and then look for terms whose vectors are similar to those of the seed terms. The methods differ in their context definitions and in their way of computing similarities. A shortcoming of these methods is that they consider all occurrences of a term in the corpus when calculating its representation, including many contexts that are irrelevant to the concept at hand due to polysemy, noise in the corpus or non-informative contexts. 2

In contrast, the pattern-based approach considers specific indicative patterns that signal the desired concept, looking for them in a large corpus and extracting the terms that appear in them. Patterns can be binary (Hearst, 1992; Ohshima et al., 2006; Zhang et al., 2009) ("such as X or Y"), indicating that both X and Y belong to the same class, or unary (Gupta and Manning, 2014; Wang and Cohen, 2007) ("fruits such as X", "First I painted the wall red, but then I repainted it X"), suggesting that X belongs to a certain category (fruit, color). The patterns can be determined manually (Hearst, 1992) or automatically (Wang and Cohen, 2007; Gupta and Manning, 2014). While well tailored patterns can be precise and interpretable, a notable shortcoming of pattern-based methods is their lack of coverage, due to the challenge of finding patterns that are specific enough to be accurate yet common enough in a large corpus to be useful. Wang and Cohen (2007) use patterns from non-natural language (HTML), while Gupta and Manning (2014) restrict themselves to short patterns of 2-4 words to each side of the masked term.

Our method. By using MLMs, we combine the power of the pattern-based and the distributional approaches: like the pattern-based approaches, we consider only specific, indicative corpus locations (retaining specificity and transparency). We then use the distributional nature of the neural LM to generalize across patterns and corpus locations.

We use sentences with a single masked location as indicative patterns. For example, "We took Rexy, our pet ___, to the vet." is an indicative pattern for the house-animals semantic class. Given an initial set of seed terms, we first search the corpus for indicative patterns for members of the set (Section 2.1). Intuitively, an indicative pattern is a corpus location which is considered by an LM to be a good fit for all seed members. Once we have identified indicative patterns, we extend the set with terms that can appear in similar patterns. We propose two methods for doing this. The first method (Section 2.2) queries an MLM for completions. While effective, this method restricts the expanded set to the LM vocabulary. The second method (Section 2.3) uses the MLM to define a similarity metric over patterns, and searches the corpus for terms that appear in patterns that are similar to the indicative ones. To summarize, we embrace the pattern-based approach, while using distributional similarity both for identifying good patterns and for generalizing across patterns.

2 Method

Task formulation. We are given a seed set S = t_1, ..., t_k of k terms, 3 which come from a larger (and unknown) gold set S_g. Our goal is to return S_g. Practically, our (and other) algorithms return a ranked list of terms rather than a fixed set. The evaluation is then performed over the ranking: ideally, all terms in S_g will be ranked above all terms not in S_g.

We operate in stages. First, we search the corpus for indicative masked patterns m_1, ..., m_ℓ that are likely to signal the concept class of S_g. Then, we use the patterns to extend the set.

2.1 Finding Indicative Masked-Patterns

A masked pattern m is a sequence of words with a single masked location (marked as "___"), where the mask stands for one or more words. We look for patterns such that, with high probability, instances of the desired semantic class will make good mask replacements, while instances of other classes will make bad replacements. For example, "The capital of ___" is a good pattern for the countries class.
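For instance, a pattern's fit for a class can be inspected by querying an MLM for its top mask completions. The following is a minimal sketch using the HuggingFace fill-mask pipeline with bert-base-uncased; the pattern is padded into a full sentence purely for illustration and is not taken from the paper.

```python
# Minimal sketch: query a masked pattern and inspect the top completions.
# A good pattern for a class should place members of that class near the top.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The capital of [MASK] is a beautiful city.", top_k=10):
    print(prediction["token_str"], round(prediction["score"], 4))
```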

We collect L pattern candidates for each seed term t_j by querying a corpus for sentences that contain the term, and replacing the term position with a mask. We then score each of the kL resulting pattern candidates m_i, and take the ℓ best ones.

Intuitively, we seek a diverse set of patterns in which all seed terms are ranked high (i.e., have a low rank index) in the MLM's prediction: we look for patterns whose worst-fitting seed term is still high on the list of replacement terms. Formally, let LM(m) be the word completions (mask replacements) predicted by the LM for pattern m, ranked by their probability, and let R_LM(t, m) be the rank (index) of term t in LM(m).

The score of a pattern is then the maximal rank of any of the seed terms: 4

s(m_i) = maxRank(m_i) = max_{t_j ∈ S} R_LM(t_j, m_i)    (1)

We then sort the patterns by s(m_i) and take the ℓ patterns with the minimal values. This min-over-max formulation ensures that the patterns are a good fit for all seed terms. 5 To achieve the diversity objective, we use the following heuristic: after sorting all candidate patterns m_i by s(m_i), rather than taking the first ℓ items we go over the sorted list in order, and keep a pattern only if it differs from every already-kept pattern in at least 50% of its tokens. We do this until we have collected ℓ patterns.
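The pattern-mining stage can be sketched as follows, assuming a HuggingFace BERT MLM; the helper names, the candidate sentences, and the exact form of the 50% diversity check are illustrative choices, not the paper's released implementation.

```python
# Minimal sketch of the pattern-mining stage (Section 2.1) with a BERT MLM.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def rank_of_term(pattern, term):
    """R_LM(t, m): rank of `term` among the MLM's predictions for the mask in `pattern`."""
    inputs = tokenizer(pattern, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    ranked_ids = torch.argsort(logits, descending=True)
    return (ranked_ids == tokenizer.convert_tokens_to_ids(term)).nonzero()[0].item()

def max_rank(pattern, seed_terms):
    """s(m) = maxRank(m): worst (largest) rank of any seed term; lower is better (Eq. 1)."""
    return max(rank_of_term(pattern, t) for t in seed_terms)

def select_indicative_patterns(candidates, seed_terms, num_patterns):
    """Sort candidates by max_rank, keeping only patterns that differ from every
    already-kept pattern in at least 50% of their tokens (diversity heuristic)."""
    kept = []
    for pattern in sorted(candidates, key=lambda m: max_rank(m, seed_terms)):
        tokens = set(pattern.split())
        if all(len(tokens & set(p.split())) < 0.5 * len(tokens) for p in kept):
            kept.append(pattern)
        if len(kept) == num_patterns:
            break
    return kept

# Candidate patterns are corpus sentences containing a seed term, with the term
# replaced by the mask token; the sentences below are made up for illustration.
seeds = ["paris", "london", "berlin"]
candidates = [
    "the capital of france is [MASK] .",
    "he flew to [MASK] last summer .",
    "[MASK] is the capital and largest city of the country .",
]
print(select_indicative_patterns(candidates, seeds, num_patterns=2))
```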

2.2 Seed Set Extension Via MLM Query

Having identified indicative patterns, we now turn to suggesting terms for expanding the seed set. Each indicative pattern m_i naturally provides a ranked list of candidate terms LM(m_i) = t_1, ..., t_{|V|}, where V is the LM's vocabulary and each term t_j is scored by its pattern-conditional probability. We combine the term scores from all chosen indicative patterns using a product-of-experts approach, scoring each term by the product of the probabilities (sum of log probabilities) assigned to it by each context. Let p_LM(t | m_i) be the probability assigned to vocabulary term t in pattern m_i. The term score is:

score(t) = Σ_{i=1}^{ℓ} c_i · log p_LM(t | m_i)    (2)

where

c_i = maxRank(m_i)^{-1} / Σ_{j=1}^{ℓ} maxRank(m_j)^{-1}

is a weighing factor for indicative pattern m_i, giving more weight to "tighter" indicative patterns.

This method is fast and effective, requiring only ℓ queries to the LM. However, it assumes that all the desired terms from S_g appear as vocabulary items in the LM. This assumption often does not hold in practice: first, for efficiency reasons, pre-trained LM vocabularies are often small (∼50k items), precluding rare words. Second, many terms of interest are multi-word units, which do not appear as single items in the LM vocabulary.
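To make the scoring concrete, here is a minimal sketch of the MPB1 step. It reuses the model, tokenizer and max_rank() from the previous snippet; the function names, the +1 rank smoothing and the example inputs are illustrative assumptions rather than the paper's released code.

```python
# Minimal sketch of MPB1 scoring (Section 2.2): a weighted sum of
# log-probabilities over the indicative patterns, as in Equation (2).
import torch

def completion_log_probs(pattern):
    """log p_LM(t | m) for every vocabulary term t at the masked position of `pattern`."""
    inputs = tokenizer(pattern, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return torch.log_softmax(logits, dim=-1)

def mpb1_scores(indicative_patterns, seed_terms):
    """score(t) = sum_i c_i * log p_LM(t | m_i) for every vocabulary term t."""
    inv_ranks = torch.tensor(
        [1.0 / (max_rank(m, seed_terms) + 1) for m in indicative_patterns]
    )  # +1 guards against a rank of 0; the paper defines c_i proportional to maxRank(m_i)^-1
    weights = inv_ranks / inv_ranks.sum()  # the c_i weighing factors, summing to 1
    scores = torch.zeros(tokenizer.vocab_size)
    for c_i, pattern in zip(weights, indicative_patterns):
        scores += c_i * completion_log_probs(pattern)
    return scores  # higher score = better candidate term

# Rank the whole vocabulary and print the top suggestions (seed terms included).
scores = mpb1_scores(["the capital of france is [MASK] ."], ["paris", "london", "berlin"])
top = torch.argsort(scores, descending=True)[:10]
print(tokenizer.convert_ids_to_tokens(top.tolist()))
```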

2.3 Extended Coverage Via Pattern Similarity

We seek a term expansion method that will utilize the power of the pre-trained LM, without being restricted by its vocabulary: we would like to identify rare words, out-of-domain words, and multi-word units.

Our solution is to generalize the indicative patterns. Rather than looking for terms that match the patterns, we instead search a large corpus for patterns which are similar to the indicative ones, and collect the terms that appear within them. Following the distributional hypothesis, these terms should be of the desired concept class.

By looking at patterns that surround corpus locations, we are no longer restricted by the LM vocabulary to single-token terms.

However, considering all corpus locations as candidate patterns is prohibitively expensive. Instead, we take a ranking approach and restrict ourselves only to corpus locations that correspond to occurrences of candidate terms returned by a high-recall algorithm. 6

We use the LM to define a similarity measure between two masked patterns that aims to capture our desired notion of similarity: masked patterns are similar if they are likely to be filled by the same terms. Let top_q(LM(m_i)) be the q highest-scoring terms for pattern m_i. We define the similarity between two patterns as the fraction of shared terms in their top-q predictions (q being a hyperparameter):

sim(m_i, m_j) = |top_q(LM(m_i)) ∩ top_q(LM(m_j))| / q

For a candidate term t, let pats(t) = m^t_1, ..., m^t_n be the set of patterns derived from it: sentences that contain t, where t is replaced with a mask. Note that t can be an arbitrary word or word sequence. We wish to find terms for which the similarity between pats(t) and the indicative patterns is high. However, since words have different senses, it is sufficient for only some of the patterns in pats(t) to be similar to patterns in m_1, ..., m_ℓ. We score a term t as:

score(t) = Σ_{i=1}^{ℓ} c_i · max_{m ∈ pats(t)} sim(m, m_i)    (3)

where c_i is the pattern weighing factor from equation (2). As Σ_{i=1}^{ℓ} c_i = 1, the term score score(t) lies in [0, 1] for every term t.
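The similarity measure and the MPB2 term score can be sketched in the same style; completion_log_probs() comes from the previous snippet, and the candidate patterns, the value of q and the weights below are illustrative placeholders.

```python
# Minimal sketch of pattern similarity and MPB2 term scoring (Section 2.3).
import torch

def top_q_terms(pattern, q=50):
    """The q most likely mask fillers for `pattern`, as a set of vocabulary ids."""
    log_probs = completion_log_probs(pattern)
    return set(torch.argsort(log_probs, descending=True)[:q].tolist())

def pattern_similarity(m_i, m_j, q=50):
    """sim(m_i, m_j): fraction of shared terms among the top-q predictions."""
    return len(top_q_terms(m_i, q) & top_q_terms(m_j, q)) / q

def mpb2_score(candidate_patterns, indicative_patterns, weights, q=50):
    """Equation (3): for each indicative pattern, take the best-matching candidate
    pattern's similarity, then combine with the c_i weights (which sum to 1)."""
    return sum(
        c_i * max(pattern_similarity(m, m_i, q) for m in candidate_patterns)
        for c_i, m_i in zip(weights, indicative_patterns)
    )

# A candidate term t (possibly multi-word, e.g. "new york") is scored through
# pats(t): corpus sentences containing t, with t replaced by the mask token.
pats_t = ["many tourists visit [MASK] every year .",
          "[MASK] is a popular travel destination ."]
indicative = ["the capital of france is [MASK] ."]
print(mpb2_score(pats_t, indicative, weights=[1.0]))
```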

Table 1: Number of indicative patterns used (#patt) and number of candidate seed-term-containing sentences (#sent) used for selecting these indicative patterns. The set is the NFL team set; the method is MPB1. Every value is an average MAP over 5 seeds of size 3 (chosen randomly, fixed for all values of #sent and #patt). NA: #patt cannot be bigger than #sent.
Table 2: Effect of the similarity measure's q on performance, using MPB2 on a single random seed from each set.

3 Experiments And Results

We refer to the method from Section 2.2 as MPB1 and the method from Section 2.3 as MPB2.

Setup. In our experiments we use BERT (Devlin et al., 2019) as the MLM, and English Wikipedia as the corpus. Following previous TSE work (e.g., Mahabal et al., 2018), we measure performance using MAP (MAP 70 for the open set). For each method we report the average MAP over several runs (the exact number is mentioned under each table), each with a different random seed set of size 3. Based on preliminary experiments, for MPB1 we use ℓ = 160 and L = 2000/k, and for MPB2 we use ℓ = 20 and L = 2000/k. 7 When comparing different systems (i.e., in Table 3), each system sees the same random seed sets as the others. For smaller sets we expand to a set of size 200, while for the Countries and Capitals sets, which have expected sizes of > 100, we expand to 350 items.

Dataset. Automatic TSE evaluation is challenging. A good TSE evaluation set should be complete (contain all terms in the semantic class), clean (not contain other terms) and comprehensive (contain all different synonyms for all terms). Such sets are hard to come by. Indeed, previous work either used a small number of sets, or used automatic set-acquisition methods which are commonly not complete. We curated a dataset with 7 closed, well-defined sets, which we make publicly available. The sets are National Football League teams (NFL, size: 32), Major League Baseball teams (MLB, 30), US states (States, 50), Countries (Cntrs, 195), European countries (Euro, 44), Capital cities (Caps, 195) and Presidents of the USA (Pres, 44). We also provide one open-class set: Music Genres (Genre). This set was created by manually verifying the items in the union of the outputs of all the different algorithms, and contains around 600 unique items.
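For reference, the MAP numbers reported below can be thought of as the mean, over seed sets, of the average precision of each ranked expansion against the gold set. The following is a minimal sketch of that computation under a binary-relevance assumption; it is not necessarily identical to the exact scoring script used for the reported numbers.

```python
# Minimal sketch of average precision for a ranked expansion list.
def average_precision(ranked_terms, gold_set, cutoff=None):
    """Mean of precision@k over the ranks k at which a gold term appears,
    normalized by the gold-set size."""
    if cutoff is not None:
        ranked_terms = ranked_terms[:cutoff]
    hits, precisions = 0, []
    for k, term in enumerate(ranked_terms, start=1):
        if term in gold_set:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold_set) if gold_set else 0.0

# MAP over several random seed sets is simply the mean of the per-run values.
print(average_precision(["paris", "rome", "atlantis", "berlin"],
                        {"paris", "rome", "berlin", "madrid"}))
```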

Table 3: Main results. Average MAP scores over 3 random seeds of size 3. */**: excluding 2 or 3 OOV terms. NA: not applicable, because the sets contain many OOV terms. NA': not applicable in the oracle setting, because gold-standard candidates are not available for open sets. †: average value over applicable sets only.

Compared Methods. We compare our methods, MPB1 (MLM-pattern-based, Section 2.2) and MPB2 8 (Section 2.3), to two state-of-the-art systems: setExpander 9 (SE) (Mamou et al., 2018) and category builder (CB) (Mahabal et al., 2018). We also compare to two baselines. The first, BB (basic-BERT), is a baseline for MPB1: a BERT-based baseline that uses the MPB1 method on patterns derived from sentences that include seed terms, without the selection method described in Section 2.1. The second, S2V, is a baseline for MPB2: a basic distributional method that uses sense2vec (Trask et al., 2015) representations, 10 which is also our candidate acquisition method for MPB2. As MPB2 relies on external candidate generation, we also report on the oracle case MPB2+O, where we expand the S2V-generated candidate list to include all the members of the class.

Main Results. Our main results are reported in Table 3. Our first method, MPB1, achieves the best scores on two of the three sets that suit its limitations (i.e., where all or most of the set's terms are in the LM's vocabulary), and second-best results on the third. 11 MPB2 outperforms all other methods on 5 out of 7 closed sets when assuming gold-standard candidates (MPB2+O), and even when considering the missing candidates it outperforms the other expanders on 4 out of 7 closed sets, averaging the best MAP score over all sets. While other methods

8 We follow Mahabal et al. (2018) and limit MPB2 to the 200,000 most frequent terms. MPB2 can work with any number of terms and is limited only by the candidate-supplying method (in this implementation, sense2vec, which has ∼3,400,000 terms).

9 We use the non-grouping release version, because it reaches better results on our dataset than the grouping one.

10 https://explosion.ai/demos/sense2vec

11 MPB1's relatively poor performance on the Presidents set can be a result of the basic terms MPB1 considers. MPB1 ranks only terms which are in the LM's vocabulary, which means that while other expanders can rank terms like "President George W. Bush", MPB1 will consider terms like "bush", which are harder to ascribe to the Presidents set. While this is true for all sets, it seems to be more significant for a set containing person names.

Additional experiments. How many sentences should we query when searching for indicative patterns, and how many patterns should we retain? Table 1 shows a grid of these parameters. We use the NFL set for this experiment, as the terms in this set all have more than one meaning, and for most of them the common usage is not the one that belongs to the NFL set (e.g., "jets", "dolphins"). Therefore, this set should give a pessimistic estimate of the number of sentences we need to extract in order to find quality indicative patterns. The results imply that ∼2000 appearances of seed terms are sufficient, and that good results can be obtained also with fewer instances. This shows that, beyond the data used to train the initial MLM, we do not require a large corpus to achieve good results, suggesting applicability also in new domains. 13

How sensitive is the algorithm to the choice of q when computing the pattern similarity? Table 2 shows that the similarity measure is effective for various q values, with maximal performance at ∼50.

Finally, how do the different methods behave in a case where the seed terms are a part of a subset?

12 SE does not rank the seed terms, as opposed to the other methods. For fairness, we add them at the beginning of the returned list before computing the MAP score.

13 While for MPB1 there are no prominent downsides to using a large number of indicative patterns, for MPB2 doing so would force us to also use a large number of occurrences of the candidate terms. This would (1) be costly in terms of run time, and (2) require many occurrences of rare terms, which might not always be available. Therefore, we choose different parameters for MPB1 and MPB2. While in both cases we use 2000 sentences to search for indicative patterns (L = 2000/k), for MPB1 we use 160 indicative patterns (ℓ = 160) and for MPB2 we use only 20 of them (ℓ = 20).

Table 4: Performance on a subset. Avg MAP over 3 random seeds of size 3.


Table 4 shows a case where the seed terms are European countries. Ideally, we would like the top results to be European countries, later results to be non-European countries, and only then unrelated terms. MPB2+O achieves the best MAP scores on both the set and the subset. In the subset case, even when not provided with all oracle terms, MPB2 is better than all other expanders. While other expanders tend to reach stronger results on either the set or the subset, MPB2+O achieves similar scores on both.

4 Conclusions

We introduce an LM-based TSE method, reaching state-of-the-art results. The method uses the power of LM predictions to locate indicative patterns for the concept class indicated by the seed terms, and then to generalize these patterns to other corpus locations. Beyond strong TSE results, our method demonstrates a novel use of pre-trained MLMs, using their predictions directly rather than relying on their states for fine-tuning.

1 See Amrami and Goldberg (2018) for a method that uses MLM word completions for word-sense induction.

2 The work of Mahabal et al. (2018) is unique in this regard, considering only a subset of the contexts that are relevant for the expansion, as determined from the seed set.

3 In this work we focus on small values of k. Our experiments use k = 3 seed terms.

4 We assume the seed terms are a part of the LM's vocabulary.

5 Contrast this to a min-over-average formulation, which may score very well on some seed terms but badly on others.

6 For example, one that is based on simple distributional similarity to the seed terms. In this work we use the nearest neighbours returned by the sense2vec model (Trask et al., 2015), as implemented in https://spacy.io/universe/project/sense2vec.

7 See the additional experiments for a justification of these parameter choices.
