
On the Limits of Learning to Actively Learn Semantic Representations


Abstract

One of the goals of natural language understanding is to develop models that map sentences into meaning representations. However, training such models requires expensive annotation of complex structures, which hinders their adoption. Learning to actively-learn (LTAL) is a recent paradigm for reducing the amount of labeled data by learning a policy that selects which samples should be labeled. In this work, we examine LTAL for learning semantic representations, such as QA-SRL. We show that even an oracle policy that is allowed to pick examples that maximize performance on the test set (and constitutes an upper bound on the potential of LTAL), does not substantially improve performance compared to a random policy. We investigate factors that could explain this finding and show that a distinguishing characteristic of successful applications of LTAL is the interaction between optimization and the oracle policy selection process. In successful applications of LTAL, the examples selected by the oracle policy do not substantially depend on the optimization procedure, while in our setup the stochastic nature of optimization strongly affects the examples selected by the oracle. We conclude that the current applicability of LTAL for improving data efficiency in learning semantic meaning representations is limited.

1 Introduction

The task of mapping a natural language sentence into a semantic representation, that is, a structure that represents its meaning, is one of the core goals of natural language processing. This goal has led to the creation of many general-purpose formalisms for representing the structure of language, such as semantic role labeling (SRL; Palmer et al., 2005) , semantic dependencies (SDP; Oepen et al., 2014) , abstract meaning representation (AMR; Banarescu et al., 2013) , universal conceptual cognitive annotation (UCCA; Abend and Rappoport, 2013) , question-answer driven SRL (QA-SRL; He et al., 2015) , and universal dependencies (Nivre et al., 2016) , as well as domain-specific semantic representations for particular users in fields such as biology (Kim et al., 2009; Nédellec et al., 2013; Berant et al., 2014) and material science (Mysore et al., 2017; Kim et al., 2019) .

Currently, the dominant paradigm for building models that predict such representations is supervised learning, which requires annotating thousands of sentences with their correct structured representation, usually by experts. This arduous data collection is the main bottleneck for building parsers for different users in new domains.

Past work has proposed directions for accelerating data collection and improving data efficiency through multi-task learning across different representations (Stanovsky and Dagan, 2018; Hershcovich et al., 2018), or having non-experts annotate sentences in natural language (He et al., 2015, 2016). One of the classic and natural solutions for reducing annotation costs is to use active learning, an iterative procedure for selecting unlabeled examples which are most likely to improve the performance of a model, and annotating them (Settles, 2009).

Recently, learning to actively-learn (LTAL) has been proposed (Fang et al., 2017; Bachman et al., 2017), where the procedure for selecting unlabeled examples is trained using methods from reinforcement and imitation learning. In recent work, given a labeled dataset from some domain, active learning is simulated on this dataset, and a policy is trained to iteratively select the subset of examples that maximizes performance on a development set. Then, this policy is used on a target domain to select unlabeled examples for annotation. If the learned policy generalizes well, we can reduce the cost of learning semantic representations. Vu et al. (2019), among others, have shown that such learned policies significantly reduce annotation costs on both text classification and named entity recognition (NER).

In this paper, we examine the potential of LTAL for learning a semantic representation such as QA-SRL. We propose an oracle setup that can be considered as an upper bound to what can be achieved with a learned policy. Specifically, we use an oracle policy that is allowed to always pick a subset of examples that maximizes its target metric on a development set, which has the same distribution as the test set. Surprisingly, we find that even this powerful oracle policy does not substantially improve performance compared to a policy that randomly selects unlabeled examples on two semantic tasks: QA-SRL span (argument) detection and QA-SRL question (role) generation.

To elucidate this surprising finding, we perform a thorough analysis, investigating various factors that could negatively affect the oracle policy selection process. We examine possible explanatory factors including: (a) the search strategy in the unlabeled data space, (b) the procedure for training the QA-SRL model, (c) the architecture of the model, and (d) the greedy nature of the selection procedure. We find that for all factors, it is challenging to get consistent gains with an oracle policy over a random policy.

To further our understanding, we replicate previously reported LTAL experiments on NER, and compare the properties of a successful oracle policy in NER to the less successful case of QA-SRL. We find that optimization stochasticity negatively affects the process of sample selection in QA-SRL: different random seeds for the optimizer result in different selected samples. We propose a measure for quantifying this effect, which can be used to assess the potential of LTAL in new setups.

To conclude, in this work, we conduct a thorough empirical investigation of LTAL for learning a semantic representation, and find that it is difficult to substantially improve data efficiency compared to standard supervised learning. Thus, other approaches should be explored for the important goal of reducing annotation costs in building such models. Code for reproducing our experiments is available at https://github.com/koomri/LTAL_SR/.

2 Learning To Actively Learn

Classic pool-based active learning (Settles, 2009) assumes access to a small labeled dataset S lab and a large pool of unlabeled examples S unlab for a target task. In each iteration, a heuristic is used to select L unlabeled examples, which are sent to annotation and added to S lab . An example heuristic is uncertainty sampling (Lewis and Gale, 1994), which at each iteration chooses examples that the current model is the least confident about.

LTAL proposes to replace the heuristic with a learned policy π θ , parameterized by θ. At training time, the policy is trained by simulating active learning on a labeled dataset and generating training data from the simulation. At test time, the policy is applied to select examples in a new domain. Figure 1 and Algorithm 1 describe this data collection procedure, on which we build our oracle policy ( §3).

Figure 1: A single iteration of LTAL, where candidate example sets are sampled from S_unlab, models are trained together with the examples in S_lab, and performance on S_eval is used to select examples to annotate. See §2 for details.

In LTAL, we assume a labeled dataset D which is partitioned into three disjoint sets: a small labeled set S_lab, a large set S_unlab that will be treated as unlabeled, and an evaluation set S_eval that will be used to estimate the quality of models. Then, active learning is simulated for B iterations. In each iteration i, a model m^i_φ, parameterized by φ, is first trained on the labeled dataset. Then, K subsets {C_j}_{j=1}^K are randomly sampled from S_unlab, and the model m^i_φ is fine-tuned on each candidate set, producing K models {m^i_{φ_j}}_{j=1}^K.

The performance of each model is evaluated on S_eval, yielding the scores {s(C_j)}_{j=1}^K. Let the candidate set with the highest accuracy be C^i_t. We can create training examples for π_θ, where (S_lab, S_unlab, m^i_φ, {s(C_j)}_{j=1}^K) are the inputs and C^i_t is the label. Then C^i_t is moved from S_unlab to S_lab.

Simulating active learning is a computationally expensive procedure: in each iteration we need to train K models over S_lab ∪ C_j. However, a trained network can potentially lead to a policy that is better than standard active learning heuristics.
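To make the simulation concrete, the following is a minimal Python sketch of a single iteration of this procedure (corresponding to Lines 2-9 of Algorithm 1 in §3). The functions `train`, `fine_tune`, and `evaluate` are hypothetical placeholders supplied by the caller, not code from our implementation.

```python
# Sketch of one simulated active-learning iteration under an oracle policy.
# `train`, `fine_tune`, and `evaluate` are task-specific placeholders.
import random

def ltal_oracle_iteration(S_lab, S_unlab, S_eval, train, fine_tune, evaluate, K=5, L=1):
    model = train(S_lab)                                        # Line 2 of Alg. 1
    candidates = [random.sample(S_unlab, L) for _ in range(K)]  # Line 3
    scores = []
    for C_j in candidates:
        m_j = fine_tune(model, S_lab + C_j)                     # Line 5
        scores.append(evaluate(m_j, S_eval))                    # Line 6
    best = max(range(K), key=lambda j: scores[j])               # Line 7: oracle argmax on S_eval
    # A learned policy would be trained to predict `best` from
    # (S_lab, S_unlab, model, candidates); the oracle uses it directly.
    C_t = candidates[best]
    S_lab = S_lab + C_t                                         # Line 9
    S_unlab = [x for x in S_unlab if x not in C_t]
    return S_lab, S_unlab
```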

3 An Oracle Active Learning Policy

Our goal is to examine the potential of LTAL for learning a semantic representation such as QA-SRL. Towards this goal, we investigate an oracle policy that should be an upper bound for what can be achieved with a learned policy π θ .

Algorithm 1: Simulating active learning

Input: S_lab, S_unlab, S_eval
1:  for i ∈ {1, ..., B} do
2:      m^i_φ ← Train(S_lab)
3:      C^i_1, ..., C^i_K ← SampleCandidates(S_unlab)
4:      for j ∈ {1, ..., K} do
5:          m^i_{φ_j} ← FineTune(m^i_φ, S_lab ∪ C^i_j)
6:          s^i_j ← Accuracy(m^i_{φ_j}, S_eval)
7:      t ← argmax_{j ∈ {1,...,K}} s^i_j
8:      CreateTrainEx((S_lab, S_unlab, m^i_φ, {C^i_j}_{j=1}^K), C^i_t)
9:      S_lab ← S_lab ∪ C^i_t;  S_unlab ← S_unlab ∖ C^i_t
10: return S_lab

The oracle policy is allowed to use Algorithm 1

at test time (it does not create training examples for π_θ, thus Line 8 is skipped). Put differently, the oracle policy selects the set of unlabeled examples that maximizes the target metric of our model on a set sampled from the same distribution as the test set. Therefore, the oracle policy enjoys extremely favorable conditions compared to a trained policy, and we expect it to provide an upper bound on the performance of π_θ. Despite these clear advantages, we will show that an oracle policy struggles to substantially improve performance compared to a random policy. While the oracle policy effectively "peeks" at the label to make a decision, there are various factors that could explain the low performance of a model trained under the oracle policy. We now list several hypotheses, and in §5.4 and §6 methodically examine whether they explain the empirical results of LTAL.

• Training: The models m^i_{φ_j} are affected by the training procedure in Lines 2 and 5 of Alg. 1; different training procedures may affect the performance of models trained with the oracle policy.

• Search space coverage: Training over all unlabeled examples in each iteration is intractable, so the oracle policy randomly samples K subsets, each with L examples. Because K·L ≪ |S_unlab|, it is possible that randomly sampling these sets will miss the more beneficial unlabeled examples. Moreover, the parameter L controls the diversity of candidate subsets, since as L increases the similarity between the K different subsets grows. Thus, the hyper-parameters K and L might affect the outcome of the oracle policy.

• Model architecture: The model architecture (e.g., number of parameters) can affect the efficacy of learning under the oracle policy.

• Stochasticity: The oracle policy chooses an unlabeled set based on performance after training with stochastic gradient descent. Differences in performance between candidate sets might be related to this stochasticity, rather than to the actual value of the examples (especially when S_lab is small).

• Myopicity: The oracle policy chooses the set C^i_j that maximizes its performance. However, the success of LTAL depends on the sequence of choices that are made. It is possible that the greedy nature of this procedure results in suboptimal performance. Unfortunately, improving search through beam search or similar measures is intractable in this already computationally expensive procedure.

We now describe QA-SRL (He et al., 2015), which is the focus of our investigation, and then describe the experiments with the oracle policy.

4 Qa-Srl Schema

QA-SRL was introduced by He et al. (2015) as an open variant of the predefined role schema in traditional SRL. QA-SRL replaces the predefined set of roles with the notion of argument questions. These are natural language questions centered around a target predicate, where the answers to a given question are its corresponding arguments. For example, for the sentence "Elizabeth Warren decided to run for president", traditional SRL will label "Elizabeth Warren" as ARG0 of the run predicate (the agent of the predicate, or the entity running in this case), while QA-SRL will assign the more subtle question "who might run?", indicating the uncertainty of this future event. Questions are generated by assigning values to 7 pre-defined slots (where some of the slots are potentially empty). See Table 1 for an example QA-SRL annotation of a full sentence.

Sentence: Elizabeth Warren announced her candidacy at a rally in Massachusetts.

Argument | QA-SRL Role (question) | PropBank Role
Elizabeth Warren | Who announced something? | ARG0
her candidacy | What did someone announce? | ARG1
at a rally in Massachusetts | Where did someone announce something? | ARGM-LOC

Table 1: Example of QA-SRL versus traditional SRL annotation for a given input sentence (top). Each line shows a single argument, its role in QA-SRL (in question form), and its traditional SRL role, using PropBank notation. Roles in QA-SRL have a structured open representation, while SRL assigns discrete roles from a predefined set.

Recently, FitzGerald et al. (2018) demonstrated the scalability of QA-SRL by crowdsourcing the annotation of a large QA-SRL dataset, dubbed QA-SRL Bank 2.0. It consists of 250K QA pairs over 64K sentences in three different domains (Wikipedia, news, and science). Subsequently, this large dataset enabled the development of a neural model which breaks QA-SRL into a pipeline of two tasks, given a target predicate in an input sentence. First, a span detection algorithm identifies arguments of the predicate as contiguous spans in the sentence (e.g., "Elizabeth Warren" in the previous example); then, a question generation model predicts an appropriate role question (e.g., "who might run?").

We find that QA-SRL is a good test-bed for active learning of semantic representations, for several key reasons: (1) it requires semantic understanding of the sentence, beyond syntactic or surface-level features (e.g., identifying the factuality of a given predicate); (2) adopting the formulation of FitzGerald et al. (2018), it consists of two semantic tasks, allowing us to test active learning on both of them; (3) we can leverage the large QA-SRL dataset to simulate active learning scenarios; and lastly (4) QA-SRL's scalability is attractive for the application of active learning policies, as they may further reduce costs for researchers working on developing specialized semantic representations in low-resource domains (e.g., medical, biological, or educational domains).

5 Experimental Evaluation

We now perform a series of experiments comparing the performance of an oracle policy to a random policy. We describe the experimental settings (§5.1), tasks and models (§5.2), present the main results (§5.3), and conclude by investigating factors that might affect our empirical findings (§5.4).

5.1 Experimental Settings

We evaluate the potential of the oracle policy on QA-SRL Bank 2.0 (FitzGerald et al., 2018). We use the training set of the science domain as D, and randomly split it into S_lab, S_unlab, and S_eval. We evaluate the success of a model m^i_φ trained with the oracle policy by periodically measuring performance on the development set of the science domain. Unless mentioned otherwise, all results are an average of 3 experiments, each with a different random split of D. Each experiment used K threads of a 40-core 2.2GHz Xeon Silver 4114 machine.

We compare the results of a base oracle policy (BASEORACLE), corresponding to the best policy we were able to obtain using the architecture from FitzGerald et al. (2018), to the following baselines:

• RANDOM: One of the candidate sets C^i_j is chosen at random and added to S_lab.

• LONGEST: The set C^i_j with the maximal average number of tokens per sentence is added to S_lab.

• UNCERTAINTY: For each candidate set, we use m^i_φ to perform predictions over all of the sentences in the set, and choose the set C^i_j that has the maximal average entropy over the set of predictions.
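For illustration, a minimal sketch of these three selection baselines is given below. Here `model_entropy` is a hypothetical stand-in for running m^i_φ on a sentence and averaging the entropy of its predictions; it is not a function from any library, and candidate sets are assumed to be lists of sentence strings.

```python
# Sketch of the candidate-set selection baselines over K candidate sets.
import random

def select_random(candidate_sets):
    # RANDOM: pick one candidate set uniformly at random.
    return random.choice(candidate_sets)

def select_longest(candidate_sets):
    # LONGEST: candidate set with the maximal average number of tokens per sentence.
    return max(candidate_sets,
               key=lambda c: sum(len(s.split()) for s in c) / len(c))

def select_uncertain(candidate_sets, model_entropy):
    # UNCERTAINTY: candidate set with the maximal average prediction entropy
    # under the current model (model_entropy is a user-supplied callable).
    return max(candidate_sets,
               key=lambda c: sum(model_entropy(s) for s in c) / len(c))
```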

5.2 Tasks And Models

We now describe the three tasks and corresponding models in our analysis:

Span Detection: Here we detect spans that are arguments of a predicate in a sentence (see Table 1). We start with a labeled set of size |S_lab| = 50, and select examples with the oracle policy for B = 460 iterations. We set the number of candidate sets to K = 5 and the size of each set to L = 1, thus the size of the final labeled set is 510 examples. We train the publicly available span detection model released by FitzGerald et al. (2018), which takes as input a sentence x_1, ..., x_n, where x_i is the concatenation of the embedding of the i-th word in the sentence and a learned embedding of a binary indicator for whether this word is the target predicate. This input is fed into a multi-layer encoder, producing a representation h_i for every token. Each span x_{i:j} is represented by concatenating the respective hidden states:

s_{ij} = [h_i; h_j].

A fully connected network consumes the span representation s_{ij} and predicts the probability that the span is an argument.

To accelerate training, we reduce the number of parameters to 488K by freezing the token embeddings, reducing the number of layers in the encoder, and shrinking the dimension of both the hidden representations and the binary predicate indicator embedding. Following FitzGerald et al. (2018), we use GloVe embeddings (Pennington et al., 2014).
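For concreteness, the following is a rough PyTorch-style sketch of the span detector just described. Module names, the LSTM encoder, and all dimensions are illustrative assumptions and do not match the released model exactly.

```python
# Minimal sketch of a span detector with the structure described above.
import torch
import torch.nn as nn

class SpanDetector(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, pred_dim=16, hidden_dim=100):
        super().__init__()
        # Word embeddings (the paper uses frozen GloVe; randomly initialized here).
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.pred_emb = nn.Embedding(2, pred_dim)        # is-predicate indicator embedding
        self.encoder = nn.LSTM(emb_dim + pred_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.scorer = nn.Sequential(                      # FFN over the span representation
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, tokens, predicate_mask, span_starts, span_ends):
        # tokens, predicate_mask: (batch, seq_len); span_starts/ends: (batch, n_spans)
        x = torch.cat([self.word_emb(tokens), self.pred_emb(predicate_mask)], dim=-1)
        h, _ = self.encoder(x)                            # (batch, seq_len, 2*hidden_dim)
        idx_i = span_starts.unsqueeze(-1).expand(-1, -1, h.size(-1))
        idx_j = span_ends.unsqueeze(-1).expand(-1, -1, h.size(-1))
        s_ij = torch.cat([h.gather(1, idx_i), h.gather(1, idx_j)], dim=-1)  # [h_i; h_j]
        return torch.sigmoid(self.scorer(s_ij)).squeeze(-1)  # P(span is an argument)
```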

Question Generation: We generate the question (role) for a given predicate and corresponding argument. We start with a labeled set of size |S_lab| = 500 and perform B = 250 iterations, where in each iteration we sample K = 5 candidate sets, each of size L = 10 (lower values were intractable). Thus, the final size of S_lab is 3,000 examples. We train the publicly available local question generation model from FitzGerald et al. (2018), where the learned argument representation s_{ij} is used to independently predict each of the 7 question slots. We reduce the number of parameters to 360K with the same modifications as in the span detector model. As the metric for the quality of question generation models, we use the official exact match (EM) metric, which reflects the percentage of predicted questions that are identical to the ground-truth questions.
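As an illustration, below is a minimal sketch of such a local question-generation head and of the EM metric; the slot vocabularies, names, and sizes are hypothetical placeholders.

```python
# Sketch of a local question-generation head: the argument representation
# s_ij is used to independently predict each of the 7 question slots.
import torch
import torch.nn as nn

class QuestionSlotPredictor(nn.Module):
    def __init__(self, span_dim, slot_vocab_sizes):
        # slot_vocab_sizes: label-set size for each of the 7 slots (placeholder values).
        super().__init__()
        self.slot_classifiers = nn.ModuleList(
            [nn.Linear(span_dim, n_labels) for n_labels in slot_vocab_sizes]
        )

    def forward(self, s_ij):
        # Returns a list of 7 logit tensors, one per question slot.
        return [clf(s_ij) for clf in self.slot_classifiers]

def exact_match(pred_slots, gold_slots):
    # EM for a single question: every slot must match the ground truth.
    return int(all(p == g for p, g in zip(pred_slots, gold_slots)))
```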

Named Entity Recognition: To reproduce prior LTAL experiments, we run the oracle policy on the English CoNLL-2003 NER dataset (Sang and De Meulder, 2003), replicating the previously described experimental settings (the original code is not publicly available). We run the oracle policy for B = 200 iterations, starting from an empty S_lab, and adding one example (L = 1) chosen from K = 5 candidate sets in each iteration. We use a CRF sequence tagger from AllenNLP (Gardner et al., 2018), and experiment with two variants: (1) NER-MULTILANG, a Bi-LSTM CRF model (20K parameters) with 40-dimensional multi-lingual word embeddings (Ammar et al., 2016), and (2) NER-LINEAR, a linear CRF model of the kind used in the original LTAL experiments.

5.3 Results

Span Detection: Table 2 shows the F1 score (the official metric) of the QA-SRL span detector models for different sizes of S_lab for BASEORACLE and the other baselines. Figure 2 (left) shows the relative improvement of the baselines over RANDOM. We observe that the maximal improvement of BASEORACLE over RANDOM is 9% given 200 examples, but with larger S_lab the improvement drops to less than 5%. This is substantially less than the improvements previously reported for LTAL on text classification and NER. Moreover, LONGEST outperforms BASEORACLE in most of the observed results. This shows that there exists a selection strategy that is better than BASEORACLE, but it is not the one chosen by the oracle policy.

Figure 2: Relative improvement (in %) of different models compared to RANDOM on the development set. Note that the range of the y-axis in NER is different from QA-SRL.
Table 2: Span detection F1 on the development set for all models across different numbers of labeled examples.

Question Generation: To check whether the previous result is specific to span detection, we conduct the same experiment for question generation. However, training question generation models is slower than training span detection models, and thus we explore a smaller space of hyper-parameters. Table 3 reports the EM scores achieved by BASEORACLE and the other baselines, and Figure 2 (center) shows the relative improvement. Here, the performance of BASEORACLE is even worse than for span detection, as its maximal relative improvement over RANDOM is at most 5%.

Table 3: Question generation scores (exact match) on the development set across different numbers of labeled examples.

Named Entity Recognition: Figure 2 (right) shows the relative improvement of NER-LINEAR and NER-MULTILANG compared to RANDOM. We observe that for NER-LINEAR, which replicates the original LTAL setup, the oracle policy indeed obtains a large improvement over RANDOM for various sizes of S_lab, with at least 9.5% relative improvement in performance. However, for NER-MULTILANG the relative gains are smaller, especially when the size of S_lab is small.

5.4 Extended Experiments

Surprisingly, we observed in §5.3 that even an oracle policy, which is allowed to pick the examples that maximize performance on samples from the same distribution as the test set, does not substantially improve performance over a random policy. One possibility is that no active learning policy is better than random. However, LONGEST outperformed BASEORACLE, showing that the problem is at least partially related to BASEORACLE itself. We now examine the possible factors described in §3 and investigate their interaction with the performance of models trained with BASEORACLE. All modifications were tested on span detection, using the experimental settings described in §5.1.

Search Space Coverage

We begin by examining the effect of the parameters K and L on the oracle policy. As K increases, we cover more of the unlabeled data, but training time increases linearly. As L increases, the subsets {C_j}_{j=1}^K become more similar to one another, because we are randomly mixing more examples from the unlabeled data. On the other hand, when L is small, the fine-tuning process is less affected by the candidate sets and more by S_lab; in such a case, it is likely that the difference in scores is also affected by stochasticity. BASEORACLE uses K = 5, L = 1. We examine the performance of the oracle policy as these values are increased in Table 4. We observe that performance does not improve, and perhaps even decreases for larger values of K. We hypothesize that a large K increases the greediness of the procedure, and may result in selecting an example that seems promising in the current iteration but is suboptimal in the long run, similar to large beam sizes reducing performance in neural machine translation (Yang et al., 2018). A moderate K results in a more random, and possibly beneficial, selection.

Table 4: Span detection F1 scores on the development set for different sizes of S_lab. We highlight the best-performing policy for the standard span detector architecture. (*) indicates that the results are from a single run.

Increasing the size of each candidate set to L = 5 or L = 20 results in roughly similar performance to L = 1. We hypothesize that there is a trade-off: as L increases, the similarity between the different sets increases, but training becomes more stable (and vice versa), and thus performance does not vary substantially across different values of L.

Training

In Lines 2 and 5 of Alg. 1 we train on S_lab and then fine-tune on the union S_lab ∪ C^i_j until s^i_j does not significantly improve for 5 epochs. It is possible that fine-tuning from a fixed model reduces the efficacy of training, and that training on S_lab ∪ C^i_j from random weights would improve performance. Of course, training from scratch substantially increases training time. We run an experiment, termed INDEP., where Line 2 is skipped, and in Line 5 we independently train each of the candidate models from random weights. We find that this modification does not achieve better results than BASEORACLE, possibly because training a model from scratch for each of the candidates increases the stochasticity of the optimization.

In addition, we also experiment with fine-tuning on C_j only, rather than on S_lab ∪ C_j. As expected, results are quite poor, since the model uses only a few examples for fine-tuning and forgets the examples in the labeled set.

Lastly, we hypothesize that selecting a candidate set based on the target metric (F1 for span detection) might not be sensitive enough, and thus we run an experiment, termed LOSS-SCORE, where we select the set C_j that minimizes the loss on the development set. We find that this modification achieves lower results than RANDOM, especially when S_lab is small, reflecting the fact that the loss is not perfectly correlated with our target metric.

Model Architecture

In §5.3 we observed that results on NER vary with the model architecture. To see whether this phenomenon also occurs for span detection, we modify the model: we reduce the number of parameters from 488K to 26K by reducing the hidden state size and replacing GloVe embeddings with multi-lingual embeddings (Ammar et al., 2016). We then compare an oracle policy (ORACLESMALLMODEL) with a random policy (RANDOMSMALLMODEL). Table 4 shows that while absolute F1 actually improves in this setup, the oracle policy improves performance compared to a random policy by no more than 4%. Thus, contrary to NER, architecture modifications do not expose an advantage of the oracle policy over the random one. We did not examine a simpler linear model for span detection, in light of recent findings (Lowell et al., 2019) that it is important to test LTAL with state-of-the-art models, as performance is tied to the specific model being trained.

Myopicity

We hypothesized that greedily selecting an example that maximizes performance in a specific iteration might be suboptimal in the long run. Because non-greedy selection strategies are computationally intractable, we perform the following two experiments.

First, we examine EPSILON-GREEDY-P, where in each iteration the oracle policy selects the set C j that maximizes target performance with probability 1 − p and randomly chooses a set with probability p. This is meant to check whether adding random exploration to the oracle policy might prevent it from getting stuck in local optima. We find that when p = 0.3 its performance is comparable to BASEORACLE while reducing the computational costs.

Second, we observe that most of the gain of BASEORACLE compared to RANDOM is in the beginning of the procedure. Thus, we propose to use BASEORACLE in the first b iterations, and then transition to a random policy (termed ORACLE-B). We run this variation with b = 100 and find that it leads to similar performance.

To summarize, we have found that an oracle policy only slightly improves performance for QA-SRL span detection and question generation compared to a random policy, and that improvements in NER are also conditioned on the underlying model. Our results echo recent findings by Lowell et al. (2019) , who have shown that gains achieved by active learning are small and inconsistent when modifying the model architecture.

We have examined multiple factors that might affect the performance of models trained with an oracle policy including the training procedure, model architecture, and search procedure, and have shown that in all of them the oracle policy struggles to improve over the random one. Thus, a learned policy is even less likely to obtain meaningful gains using LTAL.

In the next section we analyze the differences between NER-LINEAR, where LTAL works well, and BASEORACLE, in order to better understand the underlying causes for this phenomenon.

6 When Does Ltal Work?

A basic underlying assumption of active learning (with or without a learned policy), is that some samples in S unlab are more informative for the learning process than others. In LTAL, the informativeness of a candidate example set is defined by the accuracy of a trained model, as evaluated on S eval (Line 6 in Alg. 1). Thus, for active learning to work, the candidate set that is selected should not be affected by the stochasticity of the training process. Put differently, the ranking of the candidate sets by the oracle policy should be consistent and not be dramatically affected by the optimization.

To operationalize this intuition, we use Alg. 1, but run the for-loop in Line 4 twice, using two different random seeds. Let C^i_t be the chosen, or reference, candidate set according to the first run of the for-loop in iteration i. We can measure the consistency of the optimization process by looking at the ranking of the candidate sets C^i_1, ..., C^i_K according to the second fine-tuning, and computing the mean reciprocal rank (MRR) with respect to the reference candidate set C^i_t across all iterations:

MRR = (1/B) · Σ_{i=1}^{B} 1 / rank(C^i_t)    (1)

where rank(C^i_t) is the rank of C^i_t in the second fine-tuning step. The only difference between the two fine-tuning procedures is the random seed. Therefore, an MRR value that is close to 1 means that the ranking of the candidates is mostly affected by the quality of the samples, while a small MRR hints that optimization plays a large role. We prefer MRR over other correlation-based measures (such as Spearman's rank-order correlation) because the oracle is only affected by the candidate set that is ranked first. We can now examine whether the MRR score correlates with whether LTAL works or not.
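As an illustration, below is a small Python sketch of this MRR computation, assuming the index of the candidate chosen in the first run and the K candidate scores from the second run have already been collected for every iteration.

```python
# Sketch of the MRR consistency check between two fine-tuning runs.
def mrr(reference_choices, rerun_scores):
    """reference_choices[i]: index of the set chosen in the first run at iteration i.
    rerun_scores[i]: list of K scores from the second run at iteration i."""
    total = 0.0
    for t, scores in zip(reference_choices, rerun_scores):
        # Rank of the reference candidate under the second run (1 = best);
        # ties are broken optimistically by counting only strictly higher scores.
        rank = 1 + sum(1 for s in scores if s > scores[t])
        total += 1.0 / rank
    return total / len(reference_choices)
```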

We measure the MRR in 3 settings: (1) NER-LINEAR, a linear CRF model for NER which replicates the prior experimental settings, where LTAL works; (2) NER-MULTILANG, a BiLSTM-CRF sequence tagger from AllenNLP (Gardner et al., 2018) with the 40-dimensional multi-lingual word embeddings of Ammar et al. (2016); and (3) BASEORACLE, the baseline model for the span detection task. In all experiments the initial S_lab was empty and B = 200, following the experimental settings in which LTAL has shown good performance (Fang et al., 2017; Vu et al., 2019). Since the MRR might change as the size of S_lab increases, we compute and report the MRR every 10 iterations. Figure 3 (left) presents the MRR in the three experiments. We observe that in NER-LINEAR the MRR has a stable value of 1, while in NER-MULTILANG and BASEORACLE the MRR value is substantially lower, and closer to the MRR value of a random selection (∼0.46). The right side of Figure 3 shows that the NER-LINEAR oracle policy outperforms a random policy by a much larger margin, compared to the other two experiments.

Figure 3: MRR (on the left) and relative improvement (in %) of different models compared to RANDOM on the development set.

These results show that the ranking in NER-LINEAR is not affected by the stochasticity of optimization, which is expected given its underlying convex loss function. On the other hand, the optimization process in the other experiments is over a non-convex loss function and a small S_lab, and thus optimization is more brittle. Interestingly, we observe in Figure 3 that the gains of the oracle policy in NER-LINEAR are higher than in NER-MULTILANG, although the task and the dataset are exactly the same in the two experiments. This shows that the potential of LTAL is affected by the model, where a more complex model leads to smaller gains from LTAL.

We view our findings as a guideline for future work: by tracking the MRR, one can assess the potential of LTAL at development time. When the MRR is small, the potential is limited.

7 Related Work

Active learning has shown promising results on various tasks. The commonly used uncertainty criterion (Lewis and Catlett, 1994; Culotta and McCallum, 2005) focuses on selecting the samples on which the confidence of the model is low. Among other notable approaches, query by committee (Seung et al., 1992) uses disagreement among an ensemble of models when selecting which samples to label.

In a large empirical study, Lowell et al. (2019) have recently shown other limitations of active learning. They investigate the performance of active learning across NLP tasks and model architectures, and demonstrate that it does not achieve consistent gains over supervised learning, mostly because the collected samples are beneficial to a specific model architecture and do not yield better results than random selection when switching to a new architecture.

There has been little research on active learning for semantic representations. Among the relevant work, Siddhant and Lipton (2018) have shown that uncertainty estimation using dropout and Bayes-by-Backprop (Blundell et al., 2015) achieves good results on an SRL formulation. The improvements in performance due to LTAL approaches on various tasks (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017) have raised the question of whether learned policies can also be applied to learning semantic representations.

8 Conclusions

We presented the first experimentation with LTAL techniques for learning parsers for semantic representations. Surprisingly, we find that LTAL, a learned method which was shown to be effective for NER and document classification, does not do significantly better than random selection on two semantic representation tasks within the QA-SRL framework, even when given extremely favorable conditions. We thoroughly analyze the factors leading to this poor performance, and find that the stochasticity of the model optimization negatively affects the performance of LTAL. Finally, we propose a metric which can serve as an indicator for whether LTAL will fare well for a given dataset and model. Our results suggest that different approaches should be explored for the important task of building semantic representation models.