Transfer Learning Between Related Tasks Using Expected Label Proportions


Abstract

Deep learning systems thrive on an abundance of labeled training data, but such data is not always available, calling for alternative methods of supervision. One such method is expectation regularization (XR) (Mann and McCallum, 2007), where models are trained based on expected label proportions. We propose a novel application of the XR framework for transfer learning between related tasks, where knowing the labels of task A provides an estimation of the label proportion of task B. We then use a model trained for A to label a large corpus, and use this corpus with an XR loss to train a model for task B. To make the XR framework applicable to large-scale deep-learning setups, we propose a stochastic batched approximation procedure. We demonstrate the approach on the task of aspect-based sentiment classification, where we effectively use a sentence-level sentiment predictor to train an accurate aspect-based predictor. The method improves upon a fully supervised neural system trained on aspect-level data, and is also cumulative with LM-based pretraining, as we demonstrate by improving a BERT-based aspect-based sentiment model.

1 Introduction

Data annotation is a key bottleneck in many data-driven algorithms. Specifically, deep learning models, which have become a prominent tool in many data-driven tasks in recent years, require large datasets to work well. However, many tasks require manual annotations which are relatively hard to obtain at scale. An attractive alternative is lightly supervised learning (Schapire et al., 2002; Jin and Liu, 2005; Chang et al., 2007; Graça et al., 2007; Quadrianto et al., 2009a; Mann and McCallum, 2010a; Ganchev et al., 2010; Hope and Shahaf, 2016), in which the objective function is supplemented by a set of domain-specific soft constraints over the model's predictions on unlabeled data. For example, in label regularization (Mann and McCallum, 2007) the model is trained to fit the true label proportions of an unlabeled dataset. Label regularization is a special case of expectation regularization (XR) (Mann and McCallum, 2007), in which the model is trained to fit the conditional probabilities of labels given features.

In this work we consider the case of correlated tasks, in the sense that knowing the labels for task A provides information on the expected label composition of task B. We demonstrate the approach using sentence-level and aspect-level sentiment analysis, which we use as a running example: knowing that a sentence has a positive sentiment label (task A), we can expect that most aspects within this sentence (task B) will also have a positive label. While this expectation may be noisy on the individual example level, it holds well in aggregate: given a set of positively-labeled sentences, we can robustly estimate the proportion of positively-labeled aspects within this set. For example, in a random set of positive sentences, we expect to find 90% positive aspects, while in a set of negative sentences, we expect to find 70% negative aspects. These proportions can easily be either guessed or estimated from a small set.

We propose a novel application of the XR framework for transfer learning in this setup. We present an algorithm (Sec 3.1) that, given a corpus labeled for task A (sentence-level sentiment), learns a classifier for performing task B (aspect-level sentiment) instead, without a direct supervision signal for task B. We note that the label information for task A is only used at training time. Furthermore, due to the stochastic nature of the estimation, the task A labels need not be fully accurate, allowing us to make use of noisy predictions which are assigned by an automatic classifier (Sections 3.1 and 4). In other words, given a medium-sized sentiment corpus with sentence-level labels, and a large collection of un-annotated text from the same distribution, we can train an accurate aspect-level sentiment classifier.

The XR loss allows us to use task A labels for training task B predictors. This ability integrates seamlessly into other semi-supervised schemes: we can use the XR loss on top of a pre-trained model to fine-tune the pre-trained representation to the target task, and we can also take the model trained using the XR loss and plentiful data and fine-tune it to the target task using the available small-scale annotated data. In Section 5.3 we explore these options and show that our XR framework also improves the results when applied on top of a pretrained BERT-based model (Devlin et al., 2018).

Finally, to make the XR framework applicable to large-scale deep-learning setups, we propose a stochastic batched approximation procedure (Section 3.2). Source code is available at https://github.com/MatanBN/XRTransfer.

2 Background And Related Work

2.1 Lightly Supervised Learning

An effective way to supplement small annotated datasets is to use lightly supervised learning, in which the objective function is supplemented by a set of domain-specific soft constraints over the model's predictions on unlabeled data. Previous work in lightly-supervised learning focused on training classifiers by using prior knowledge of label proportions (Jin and Liu, 2005; Chang et al., 2007; Musicant et al., 2007; Mann and McCallum, 2007; Quadrianto et al., 2009b; Liang et al., 2009; Ganchev et al., 2010; Mann and McCallum, 2010b; Chang et al., 2012; Wang et al., 2012; Zhu et al., 2014; Hope and Shahaf, 2016) or prior knowledge of feature-label associations (Schapire et al., 2002; Haghighi and Klein, 2006; Druck et al., 2008; Melville et al., 2009; Mohammady and Culotta, 2015). In the context of NLP, Haghighi and Klein (2006) suggested using distributional similarities of words to train sequence models for part-of-speech tagging and a classified-ads information extraction task. Melville et al. (2009) used background lexical information in terms of word-class associations to train a sentiment classifier. Ganchev and Das (2013) and Wang and Manning (2014) suggested exploiting the bilingual correlations between a resource-rich language and a resource-poor language to train a classifier for the resource-poor language in a lightly supervised manner.

2.2 Expectation Regularization (XR)

Expectation Regularization (XR) (Mann and McCallum, 2007) is a lightly supervised learning method in which the model is trained to fit the conditional probabilities of labels given features. In the context of NLP, XR was used by Mohammady and Culotta (2015) to train Twitter-user attribute predictors using hundreds of noisy distributional expectations based on census demographics. Here, we suggest using XR to train a target task (aspect-level sentiment) based on the output of a related source-task classifier (sentence-level sentiment).

Learning Setup The main idea of XR is moving from a fully supervised situation, in which each data point $x_i$ has an associated label $y_i$, to a setup in which sets of data points $U_j$ are associated with corresponding label proportions $\tilde{p}_j$ over those sets.

Formally, let $X = \{x_1, x_2, \ldots, x_n\} \subseteq \mathcal{X}$ be a set of data points, $\mathcal{Y}$ be a set of $|\mathcal{Y}|$ class labels, $U = \{U_1, U_2, \ldots, U_m\}$ be a set of sets where $U_j \subseteq X$ for every $j \in \{1, 2, \ldots, m\}$, and let $\tilde{p}_j \in \mathbb{R}^{|\mathcal{Y}|}$ be the label distribution of set $U_j$. For example, $\tilde{p}_j = \{.7, .2, .1\}$ would indicate that 70% of the data points in $U_j$ are expected to have class 0, 20% are expected to have class 1, and 10% are expected to have class 2. Let $p_\theta(x)$ be a parameterized function with parameters $\theta$ from $\mathcal{X}$ to a vector of conditional probabilities over labels in $\mathcal{Y}$. We write $p_\theta(y \mid x)$ to denote the probability assigned to the $y$th event (the conditional probability of $y$ given $x$).

A typical objective when training on fully labeled data of $(x_i, y_i)$ pairs is to maximize the likelihood of the labeled data using the cross-entropy loss:

$$\mathcal{L}_{\text{cross}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$

Instead, in XR our data comes in the form of pairs $(U_j, \tilde{p}_j)$ of sets and their corresponding expected label proportions, and we aim to optimize $\theta$ to fit the label distribution $\tilde{p}_j$ over $U_j$, for all $j$.

XR Loss As counting the number of predicted class labels over a set $U_j$ leads to a non-differentiable objective, Mann and McCallum (2007) suggest relaxing it and instead using the model's posterior distribution $\hat{p}_j$ over the set:

$$q_j = \sum_{x \in U_j} p_\theta(x) \qquad\qquad (1)$$

$$\hat{p}_j(y) = \frac{q_j(y)}{\sum_{y' \in \mathcal{Y}} q_j(y')} \qquad\qquad (2)$$

where $q_j(y)$ indicates the $y$th entry in $q_j$. Then, we would like to set $\theta$ such that $\hat{p}_j$ and $\tilde{p}_j$ are close. Mann and McCallum (2007) suggest using the KL-divergence for this. The KL-divergence is composed of two parts:

$$D_{\text{KL}}(\tilde{p}_j \,\|\, \hat{p}_j) = -\tilde{p}_j \cdot \log \hat{p}_j + \tilde{p}_j \cdot \log \tilde{p}_j = H(\tilde{p}_j, \hat{p}_j) - H(\tilde{p}_j)$$

Since $H(\tilde{p}_j)$ is constant, we only need to minimize $H(\tilde{p}_j, \hat{p}_j)$; the loss function therefore becomes: [1]

$$\mathcal{L}_{\text{XR}}(\theta) = -\sum_{j=1}^{m} \tilde{p}_j \cdot \log \hat{p}_j \qquad\qquad (3)$$

Notice that computing $q_j$ requires a summation over $p_\theta(x)$ for the entire set $U_j$, which can be prohibitive. We present a batched approximation (Section 3.2) to overcome this.
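For concreteness, the following is a minimal NumPy sketch of the per-set computation in equations (1)-(3); it is our own illustration (array names are ours), not the paper's DyNet code:

```python
import numpy as np

def xr_loss(posteriors, target_proportions, eps=1e-12):
    """XR loss for a single set U_j.

    posteriors:         array of shape (|U_j|, |Y|), row i holding p_theta(y|x_i).
    target_proportions: array of shape (|Y|,), the expected proportions p~_j.
    Returns the cross entropy H(p~_j, p^_j) = -sum_y p~_j(y) log p^_j(y).
    """
    q = posteriors.sum(axis=0)      # eq. (1): q_j(y) = sum over x in U_j of p_theta(y|x)
    p_hat = q / q.sum()             # eq. (2): normalize q_j into a distribution
    return -np.sum(target_proportions * np.log(p_hat + eps))  # eq. (3), one term

# toy example: 4 examples over 3 classes, expected proportions 70/20/10
posteriors = np.array([[0.8, 0.1, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.9, 0.05, 0.05]])
print(xr_loss(posteriors, np.array([0.7, 0.2, 0.1])))
```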

Temperature Parameter Mann and McCallum (2007) find that XR might find a degenerate solution. For example, in a three-class classification task where $\tilde{p}_j = \{.5, .35, .15\}$, it might find a solution such that $p_\theta(y \mid x) = \{.5, .35, .15\}$ for every instance $x$; as a result, every instance would be classified the same way. To avoid this, Mann and McCallum (2007) suggest penalizing flat distributions by using a temperature coefficient $T$, as follows:

$$p_\theta(y \mid x) = \frac{\exp\big((W z + b)_y / T\big)}{\sum_{y' \in \mathcal{Y}} \exp\big((W z + b)_{y'} / T\big)} \qquad\qquad (4)$$

where $z$ is a feature vector and $W$ and $b$ are the linear classifier parameters.
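As an illustration of the temperature-scaled softmax above, here is a small NumPy sketch (ours, not the paper's code); note that, per footnote [11], the experiments in this paper effectively use T = 1:

```python
import numpy as np

def tempered_softmax(z, W, b, T=1.0):
    """Class probabilities from linear scores W @ z + b, with the scores divided
    by a temperature T before the softmax (T = 1 recovers the plain softmax)."""
    scores = (W @ z + b) / T
    scores = scores - scores.max()   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# toy usage: 3 classes, 4-dimensional feature vector
rng = np.random.default_rng(0)
W, b, z = rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=4)
print(tempered_softmax(z, W, b, T=1.0))
print(tempered_softmax(z, W, b, T=10.0))   # flatter output for the same scores
```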

2.3 Aspect-Based Sentiment Classification

In the aspect-based sentiment classification (ABSC) task, we are given a sentence and an aspect, and need to determine the sentiment that is expressed towards the aspect. For example, the sentence "Excellent food, although the interior could use some help." has two aspects: food and interior; a positive sentiment is expressed about the food, but a negative sentiment is expressed about the interior. A sentence $\alpha = (w_1, w_2, \ldots, w_n)$ may contain zero or more aspects $a_i$, where each aspect corresponds to a sub-sequence of the original sentence and has an associated sentiment label (NEG, POS, or NEU). Concretely, we follow the task definition in the SemEval-2015 and SemEval-2016 shared tasks (Pontiki et al., 2015, 2016), in which the relevant aspects are given and the task focuses on finding the sentiment label of the aspects.

Algorithm 1 Stochastic Batched XR
Inputs: A dataset $(U_1, \ldots, U_m, \tilde{p}_1, \ldots, \tilde{p}_m)$, batch size $k$, differentiable classifier $p_\theta(y|x)$
while not converged do
    $j \leftarrow \text{random}(1, \ldots, m)$
    $U \leftarrow \text{random-choice}(U_j, k)$
    $q_u \leftarrow \sum_{x \in U} p_\theta(x)$
    $\hat{p}_u \leftarrow \text{normalize}(q_u)$
    $\ell \leftarrow -\tilde{p}_j \cdot \log \hat{p}_u$    (compute loss; eq (4))
    Compute gradients and update $\theta$
end while
return $\theta$

While sentence-level sentiment labels are relatively easy to obtain, aspect-level annotations are much scarcer, as demonstrated by the small datasets of the SemEval shared tasks.

3.1 Transfer-Training Between Related Tasks With XR

Consider two classification tasks over a shared input space: a source task $s$ from $\mathcal{X}$ to $\mathcal{Y}^s$ and a target task $t$ from $\mathcal{X}$ to $\mathcal{Y}^t$, which are related through a conditional distribution $P(y^t = i \mid y^s = j)$. In other words, a labeling decision for task $s$ induces an expected label distribution over task $t$. For a set of data points $x_1, \ldots, x_n$ that share a source label $y^s$, we expect to see a target label distribution of $P(y^t \mid y^s) = \tilde{p}_{y^s}$. Given a large unlabeled dataset $D^u = (x^u_1, \ldots, x^u_{|D^u|})$, a small labeled dataset for the target task $D^t = ((x^t_1, y^t_1), \ldots, (x^t_{|D^t|}, y^t_{|D^t|}))$, and a classifier $C^s : \mathcal{X} \to \mathcal{Y}^s$ (or sufficient training data to train one) for the source task, [2] we wish to use $C^s$ and $D^u$ to train a good classifier $C^t : \mathcal{X} \to \mathcal{Y}^t$ for the target task. This can be achieved using the following procedure.

• Apply $C^s$ to $D^t$, resulting in noisy source-side labels $\tilde{y}^s_i = C^s(x^t_i)$ for the target task.

• Estimate the conditional probability table $P(y^t \mid \tilde{y}^s)$ using MLE estimates over $D^t$:

$$\tilde{p}_j(y^t = i \mid \tilde{y}^s = j) = \frac{\#(y^t = i,\, \tilde{y}^s = j)}{\#(\tilde{y}^s = j)}$$

where $\#$ is a counting function over $D^t$. [3]

• Apply $C^s$ to the unlabeled data $D^u$, resulting in labels $C^s(x^u_i)$. Split $D^u$ into $|\mathcal{Y}^s|$ sets $U_j$ according to the labeling induced by $C^s$:

$$U_j = \{x^u_i \mid x^u_i \in D^u \wedge C^s(x^u_i) = j\}$$

• Use Algorithm 1 to train a classifier for the target task using input pairs $(U_j, \tilde{p}_j)$ and the XR loss.

In words, by using XR training, we use the expected label proportions over the target task given predicted labels of the source task to train a target-class classifier. Mann and McCallum (2007) and following work take the base classifier $p_\theta(y|x)$ to be a logistic regression classifier, for which they manually derive gradients for the XR loss and train with L-BFGS (Byrd et al., 1995). However, nothing precludes us from using an arbitrary neural network instead, as long as it culminates in a softmax layer.
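As a concrete illustration of the estimation and splitting steps above, here is a short Python sketch; `source_clf` and its `predict` method are hypothetical placeholders for $C^s$, not part of the paper's code:

```python
from collections import Counter, defaultdict

def estimate_proportions(source_clf, labeled_target):
    """MLE estimate of P(y_t | y~_s) from the small labeled target set D_t.

    labeled_target: list of (x, y_t) pairs.
    Returns a dict mapping each noisy source label y~_s to a dict of target-label proportions.
    """
    counts = defaultdict(Counter)
    for x, y_t in labeled_target:
        y_s = source_clf.predict(x)          # noisy source-side label for x
        counts[y_s][y_t] += 1
    return {y_s: {y_t: c / sum(ctr.values()) for y_t, c in ctr.items()}
            for y_s, ctr in counts.items()}

def split_unlabeled(source_clf, unlabeled):
    """Split D_u into sets U_j according to the source classifier's labels."""
    sets = defaultdict(list)
    for x in unlabeled:
        sets[source_clf.predict(x)].append(x)
    return sets
```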

3.2 Stochastic Batched Training For Deep XR

One complicating factor is that the computation of $q_j$ in equation (1) requires a summation over $p_\theta(x)$ for the entire set $U_j$, which in our setup may contain hundreds of thousands of examples, making gradient computation and optimization impractical. We instead propose a stochastic batched approximation in which, instead of requiring that the full constraint set $U_j$ match the expected label posterior distribution, we require that sufficiently large random subsets of it match the distribution. At each training step we compute the loss and update the gradient with respect to a different random subset. Specifically, in each training step we sample a random pair $(U_j, \tilde{p}_j)$, sample a random subset $U$ of $U_j$ of size $k$, and compute the local XR loss of the set $U$:

$$\mathcal{L}_{\text{XR}}(\theta; j, U) = -\tilde{p}_j \cdot \log \hat{p}_u \qquad\qquad (5)$$

where $\hat{p}_u$ is computed by summing over the elements of $U$ rather than of $U_j$ in equations (1-2). The stochastic batched XR training algorithm is given in Algorithm 1. For large enough $k$, the expected label distribution of the subset is the same as that of the complete set.
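A minimal PyTorch sketch of Algorithm 1 follows. The paper's implementation uses DyNet; this version is ours, and it assumes the model maps a stacked batch of pre-tensorized inputs to softmax probabilities. Names and hyperparameters are illustrative only.

```python
import random
import torch

def stochastic_batched_xr(model, sets, proportions, k=450, steps=10000, lr=1e-3, eps=1e-12):
    """Stochastic batched XR training (sketch of Algorithm 1).

    model:       a torch.nn.Module mapping a stacked batch of inputs to softmax probabilities.
    sets:        list of lists U_1..U_m of input tensors.
    proportions: list of tensors p~_1..p~_m over the |Y| labels.
    """
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        j = random.randrange(len(sets))                       # sample a set U_j
        batch = random.sample(sets[j], min(k, len(sets[j])))  # random subset U of size k
        probs = model(torch.stack(batch))                     # p_theta(y|x) for each x in U
        q_u = probs.sum(dim=0)                                # sum posteriors over the subset
        p_hat = q_u / q_u.sum()                               # normalize
        loss = -(proportions[j] * torch.log(p_hat + eps)).sum()  # local XR loss, eq. (5)
        optim.zero_grad()
        loss.backward()
        optim.step()
    return model
```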

4 Application To Aspect-Based Sentiment

We demonstrate the procedure given above by training an aspect-based sentiment classifier (ABSC) using sentence-level [4] sentiment signals.

4.1 Relating The Classification Tasks

We observe that while the sentence-level sentiment does not determine the sentiment of individual aspects (a positive sentence may contain negative remarks about some aspects), it is very predictive of the proportion of sentiment labels of the fragments within a sentence. Positively labeled sentences are likely to have more positive aspects and fewer negative ones, and vice-versa for negatively-labeled sentences. While these proportions may vary on the individual sentence level, we expect them to be stable when aggregating fragments from several sentences: when considering a large enough sample of fragments that all come from positively labeled sentences, we expect the different samples to have roughly similar label proportions to each other. This situation is ideally suited for performing XR training, as described in Section 3.1. The application to ABSC is almost straightforward, but is complicated a bit by the decomposition of sentences into fragments: each sentence-level decision now corresponds to multiple fragment-level decisions. Thus, we apply the sentence-level (task A) classifier $C^s$ on the aspect-level corpus $D^t$ by applying it on the sentence level and then associating the predicted sentence labels with each of the fragments, resulting in a fragment-level labeling. Similarly, when we apply $C^s$ to the unlabeled data $D^u$ we again do it at the sentence level, but the sets $U_j$ are composed of fragments, not sentences:

Figure 1: Illustration of the algorithm. $C^s$ is applied to $D^u$, resulting in a label $\tilde{y}$ for each sentence; $U_j$ is built from the fragments of the sentences sharing the same predicted label; the probabilities for each fragment in $U_j$ are summed and normalized; the XR loss in equation (4) is calculated and the network is updated.
Figure 2: Illustration of the decomposition procedure, when given $a_1$ = "duck confit" and $a_2$ = "foie gras terrine with figs" as the pivot phrases.

$$U_j = \{f^{\alpha}_i \mid \alpha \in D^u \,\wedge\, f^{\alpha}_i \in \text{frags}(\alpha) \,\wedge\, C^s(\alpha) = j\}$$

We then apply Algorithm 1 as is: at each step of training we sample a source label $j \in \{\text{POS}, \text{NEG}, \text{NEU}\}$, sample $k$ fragments from $U_j$, and use the XR loss to fit the expected fragment-label proportions over these $k$ fragments to $\tilde{p}_j$. Figure 1 illustrates the procedure.
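For illustration, a small sketch (ours) of how the fragment sets $U_j$ can be built from sentence-level predictions; `sentence_clf.predict` and `frags` are hypothetical stand-ins for $C^s$ and the fragment decomposition described next:

```python
from collections import defaultdict

def build_fragment_sets(sentence_clf, unlabeled_sentences, frags):
    """Build U_j: fragments grouped by the *sentence-level* label of their sentence."""
    sets = defaultdict(list)
    for sent in unlabeled_sentences:
        label = sentence_clf.predict(sent)   # label the whole sentence (POS/NEG/NEU)
        for fragment in frags(sent):         # ...but store its fragments in U_label
            sets[label].append(fragment)
    return sets
```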

4.2 Classification Architecture

We model the ABSC problem by associating each (sentence, aspect) pair with a sentence fragment, and constructing a neural classifier from fragments to sentiment labels. We heuristically decompose a sentence into fragments. We use the same BiLSTM-based neural architecture for both sentence classification and fragment classification.

Fragment Decomposition

We now describe the procedure we use to associate a sentence fragment with each (sentence, aspect) pair. The shared-task data associates each aspect with a pivot phrase $a$, where a pivot phrase $(w_1, w_2, \ldots, w_l)$ is defined as a pre-determined sequence of words that is contained within the sentence. For a sentence $\alpha$, a set of pivot phrases $A = (a_1, \ldots, a_m)$ and a specific pivot phrase $a_i$, we consult the constituency parse tree of $\alpha$ and look for tree nodes that satisfy the following conditions: [5]

1. The node governs the desired pivot phrase $a_i$.

2. The node governs either a verb (VB, VBD, VBN, VBG, VBP, VBZ) or an adjective (JJ, JJR, JJS) which is different from any $a_j \in A$.

3. The node governs a minimal number of pivot phrases from $(a_1, \ldots, a_m)$, ideally only $a_i$.

We then select the highest node in the tree that satisfies all conditions. The span governed by this node is taken as the fragment associated with aspect $a_i$. [6] The decomposition procedure is demonstrated in Figure 2.
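The following is a rough sketch of this node selection over an NLTK constituency tree. It is our own simplification (for instance, it does not check that the verb or adjective in condition 2 differs from the pivot phrases), not the paper's implementation:

```python
from nltk.tree import Tree

VERB_ADJ_TAGS = {"VB", "VBD", "VBN", "VBG", "VBP", "VBZ", "JJ", "JJR", "JJS"}

def fragment_for_aspect(parse, pivot, other_pivots):
    """Select the span for one pivot phrase from a constituency parse (rough sketch).

    parse:        an nltk.tree.Tree over the sentence tokens.
    pivot:        the pivot phrase a_i as a list of tokens.
    other_pivots: the remaining pivot phrases in A, each a list of tokens.
    """
    def contains(node, phrase):
        toks = node.leaves()
        return any(toks[i:i + len(phrase)] == phrase
                   for i in range(len(toks) - len(phrase) + 1))

    candidates = []
    for node in parse.subtrees():
        if not contains(node, pivot):
            continue                                             # condition 1: governs a_i
        preterminal_tags = {t.label() for t in node.subtrees() if t.height() == 2}
        if not (preterminal_tags & VERB_ADJ_TAGS):
            continue                                             # condition 2 (simplified)
        n_pivots = 1 + sum(contains(node, p) for p in other_pivots)
        candidates.append((n_pivots, -node.height(), node))      # condition 3, then highest node

    if not candidates:
        return parse.leaves()                                    # fallback: whole sentence (footnote [6])
    best = min(candidates, key=lambda c: (c[0], c[1]))[2]
    return best.leaves()
```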

When aspect-level information is given, we take the pivot phrases to be the requested aspects. When aspect-level information is not available, we take each noun in the sentence to be a pivot phrase.

Neural Classifier Our classification model is a simple 1-layer BiLSTM encoder (a concatenation of the last states of a forward- and a backward-running LSTM) followed by a linear predictor. The encoder is fed either a complete sentence or a sentence fragment.
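A sketch of this architecture in PyTorch (the paper's models are implemented in DyNet; the embedding initialization here is simplified, whereas the experiments use pre-trained GloVe vectors):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """1-layer BiLSTM encoder followed by a linear predictor over 3 sentiment labels."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices of a sentence or a fragment
        emb = self.embed(token_ids)
        _, (h_n, _) = self.lstm(emb)                 # h_n: (2, batch, hidden_dim)
        rep = torch.cat([h_n[0], h_n[1]], dim=-1)    # concat last forward/backward states
        return torch.softmax(self.out(self.dropout(rep)), dim=-1)
```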

5 Experiments

Data Our target task is aspect-based fragment classification, with small labeled datasets from the SemEval 2015 and 2016 shared tasks, each dataset containing aspect-level annotations for about 2000 sentences in the restaurant reviews domain. Our source classifier is trained on up to 10,000 sentences from the same domain, with 2000 sentences for validation, labeled only for sentence-level sentiment. We additionally have an unlabeled dataset of up to 670,000 sentences from the same domain. [7] We tokenized all datasets using the Tweet Tokenizer from the NLTK package [8] and parsed the tokenized sentences with the AllenNLP parser. [9]

Training Details Both the sentence-level classification models and the models trained with XR have a hidden state vector dimension of size 300; they use dropout (Hinton et al., 2012) on the sentence or fragment representation vector (rate = 0.5) and are optimized using Adam (Kingma and Ba, 2014). The sentence classifier is trained with a batch size of 30, and the XR models are trained with batches of size $k$ that each contain 450 fragments. [10] We used a temperature parameter of 1. [11] We use pre-trained 300-dimensional GloVe embeddings [12] (Pennington et al., 2014), and fine-tune them during training. The XR training was validated with a validation set consisting of 20% of the SemEval-2015 training set; the sentence-level BiLSTM classifiers were validated with a validation set of 2000 sentences. [13] When fine-tuning to the aspect-based task, we used 20% of the train set in each dataset as validation and evaluated on this set. For each training method, the models were evaluated on the validation set after each epoch and the best model was chosen. The data is highly imbalanced, with only very few sentences receiving a NEU label. We do not deal with this imbalance directly and train both the sentence-level models and the XR aspect-based models on the imbalanced data. However, when training $C^s$, we trained five models and chose the best model that predicts correctly at least 20% of the neutral sentences. The models are implemented using DyNet [14] (Neubig et al., 2017).

Baseline models In recent years many neural network architectures of increasing sophistication were applied to the ABSC task (Nguyen and Shirai, 2015; Vo and Zhang, 2015; Tang et al., 2016a,b; Wang et al., 2016; Zhang et al., 2016; Ruder et al., 2016; Ma et al., 2017; Liu and Zhang, 2017; Chen et al., 2017; Wang et al., 2018b,a; Fan et al., 2018a,b; Li et al., 2018; Ouyang and Su, 2018). We compare to a series of state-of-the-art ABSC neural classifiers that participated in the shared tasks. TDLSTM-ATT (Tang et al., 2016a) encodes the information around an aspect using forward and backward LSTMs, followed by an attention mechanism. ATAE-LSTM (Wang et al., 2016) is an attention-based LSTM variant. MM (Tang et al., 2016b) is a deep memory network with multiple hops of attention layers. RAM (Chen et al., 2017) uses multiple attention mechanisms combined with recurrent neural networks and a weighted memory mechanism. LSTM+SynATT+TarRep (He et al., 2018a) incorporates syntactic information into the attention mechanism and uses an auto-encoder structure to produce aspect representations. All of these models are trained only on the small, fully-supervised ABSC datasets. "Semisupervised" is the semi-supervised setup of He et al. (2018b); it trains an attention-based LSTM model on 30,000 documents in addition to an aspect-based train set, 10,000 documents for each class. We consider two additional simple but strong semi-supervised baselines. Sentence-BiLSTM is our BiLSTM model trained on the $10^4$ sentence-level annotations and applied as-is to the individual fragments. Sentence-BiLSTM+Finetuning is the same model, but fine-tuned on the aspect-based data as explained above. Fine-tuning is performed using our own implementation of the attention-based model of He et al. (2018b). [15] Both of these models are on par with the fully-supervised ABSC models.

Table 1: Average accuracies and Macro-F1 scores over five runs with random initialization, along with their standard deviations. Bold: best results or within one standard deviation of them. * indicates that the method's result is significantly better than all baseline methods; † indicates that the method's result is significantly better than all baseline methods that use the aspect-based data only, with p < 0.05 according to a one-tailed unpaired t-test. The data annotations S, N and A indicate training with sentence-level, noisy sentence-level and aspect-level data, respectively. Numbers for TDLSTM+Att, ATAE-LSTM, MM, RAM and LSTM+SynATT+TarRep are from He et al. (2018a). Numbers for Semisupervised are from He et al. (2018b).

Empirical Proportions The proportion constraint sets $\tilde{p}_j$, based on the SemEval-2015 aspect-based train data, are:

$\tilde{p}_{\text{POS}}$ = {POS: 0.93, NEG: 0.06, NEU: 0.01}
$\tilde{p}_{\text{NEG}}$ = {POS: 0.27, NEG: 0.70, NEU: 0.03}
$\tilde{p}_{\text{NEU}}$ = {POS: 0.45, NEG: 0.41, NEU: 0.14}

Table 1 compares these baselines to three XR conditions. [16] The first condition, BiLSTM-XR-Dev, performs XR training on the automatically-labeled sentence-level dataset. The only access it has to aspect-level annotation is for estimating the proportions of labels for each sentence-level label, which is done based on the validation set of SemEval-2015 (i.e., 20% of the train set). The XR setting is very effective: without using any in-task data, this model already surpasses all other models, both supervised and semi-supervised, except for the He et al. (2018b,a) models, which achieve higher F1 scores. We note that in contrast to XR, the competing models have complete access to the supervised aspect-based labels. The second condition, BiLSTM-XR, is similar, but now the model is allowed to estimate the conditional label proportions based on the entire aspect-based training set (the classifier still does not have direct access to the labels beyond the aggregate proportion information). This improves results further, showing the importance of accurately estimating the proportions. Finally, in BiLSTM-XR+Finetuning, we follow the XR training with fully supervised fine-tuning on the small labeled dataset, using the attention-based model of He et al. (2018b). This achieves the best results, surpassing the semi-supervised He et al. (2018b) baseline on accuracy and matching it on F1. [17]

We report significance tests for the robustness of the method under random parameter initialization. Our reported numbers are averaged over five random initializations. Since the datasets are unbalanced w.r.t. the label distribution, we report both accuracy and macro-F1. The XR training is also more stable than the other semi-supervised baselines, achieving substantially lower standard deviations across different runs.

5.2 Further Experiments

In each experiment in this section we estimate the proportions using the SemEval-2015 train set.

Effect of unlabeled data size How does the XR training scale with the amount of unlabeled data? Figure 3a shows the macro-F1 scores on the entire SemEval-2016 dataset, with different unlabeled corpus sizes (measured in number of sentences). An unannotated corpus of $5 \times 10^4$ sentences is sufficient to surpass the results of the $10^4$ sentence-level trained classifier, and more unannotated data further improves the results.

Figure 3: Macro-F1 scores for the entire SemEval-2016 dataset of the different analyses. (a) the contribution of unlabeled data. (b) the effect of sentence classifier quality. (c) the effect of k. (d) the effect of sentence-level pretraining vs. corpus size.

Effect of Base-classifier Quality Our method requires a sentence-level classifier $C^s$ to label both the target-task corpus and the unlabeled corpus. How does the quality of this classifier affect the overall XR training? We vary the amount of supervision used to train $C^s$ from 0 sentences (assigning the same label to all sentences), to 100, 1000, 5000 and 10000 sentences. We again measure macro-F1 on the entire SemEval 2016 corpus.

The results in Figure 3b show that when using the prior distributions of aspects (0 sentences), the model struggles to learn from this signal: it mostly learns to predict the majority class, and hence reaches a very low F1 score of 35.28. The more data given to the sentence-level classifier, the better the potential results when training with our method using the classifier's labels: with classifiers trained on 100, 1000, 5000 and 10000 labeled sentences, we get F1 scores of 53.81, 58.84, 61.81 and 65.58, respectively. Improvements in the source-task classifier's quality clearly contribute to the target-task accuracy.

Effect of k The Stochastic Batched XR algorithm (Algorithm 1) samples a batch of $k$ examples at each step to estimate the posterior label distribution used in the loss computation. How does the size of $k$ affect the results? We use $k = 450$ fragments in our main experiments, but smaller values of $k$ reduce GPU memory load and may train better in practice. We tested our method with varying values of $k$ on a sample of $5 \times 10^4$ sentences, using batches composed of the fragments of 5, 25, 100, 450, 1000 and 4500 sentences. The results are shown in Figure 3c. Setting $k = 5$ results in low scores. Setting $k = 25$ yields a better F1 score, but with high variance across runs. For $k = 100$ fragments the results begin to stabilize; we also see a slight decrease in F1 scores with larger batch sizes. We attribute this drop, despite the better gradient estimates, to the general trend of larger batch sizes being harder to train with stochastic gradient methods.

5.3 Pre-Training, BERT

The XR training can also be performed over pre-trained representations. We experiment with two pre-training methods: (1) pre-training by training the BiLSTM model to predict the noisy sentence-level predictions; (2) using the pre-trained BERT representation (Devlin et al., 2018). For (1), we compare the effect of pre-training on unlabeled corpora of sizes $5 \times 10^4$, $10^5$ and $6.7 \times 10^5$ sentences. Results in Figure 3d show that this form of pre-training is effective for smaller unlabeled corpora but evens out for larger ones.

BERT For the BERT experiments, we use the BERT-base model [18] with $k = 450$ sets, 30 epochs for XR training or sentence-level fine-tuning [19] and 15 epochs for aspect-based fine-tuning; for each training method we evaluated the model on the dev set after each epoch and the best model was chosen. [20] We compare the following setups:

-BERT→Aspect Based Finetuning: a pretrained BERT model fine-tuned to the aspect-based task.

-BERT→$10^4$: a pretrained BERT model fine-tuned to the sentence-level task on the $10^4$ sentences, and tested by predicting fragment-level sentiment.

-BERT→$10^4$→Aspect Based Finetuning: a pretrained BERT model fine-tuned to the sentence-level task, and fine-tuned again to the aspect-based one.

-BERT→XR: a pretrained BERT model followed by XR training using our method.

-BERT→XR→Aspect Based Finetuning: a pretrained BERT model followed by XR training and then fine-tuned to the aspect-level task.

Table 2: BERT pre-training: average accuracies and Macro-F1 scores from five runs and their standard deviations. * indicates that the method's result is significantly better than all baseline methods; † indicates that the method's result is significantly better than all non-XR baseline methods, with p < 0.05 according to a one-tailed unpaired t-test. The data annotations S, N and A indicate training with sentence-level, noisy sentence-level and aspect-level data, respectively.

The results are presented in Table 2. As before, aspect-based fine-tuning is beneficial for both SemEval-16 and SemEval-15. Training a BiLSTM with XR surpasses the pre-trained BERT models, and using XR training on top of the pre-trained BERT models substantially increases the results even further.

6 Discussion

We presented a transfer learning method based on expectation regularization (XR), and demonstrated its effectiveness for training aspect-based sentiment classifiers using sentence-level supervision. The method achieves state-of-the-art results for the task, and is also effective for improving on top of a strong pre-trained BERT model. The proposed method provides an additional data-efficient tool in the modeling arsenal, which can be applied on its own or together with another training method, in situations where there is a conditional relation between the labels of a source task for which we have supervision and a target task for which we don't.

While we demonstrated the approach on the sentiment domain, the required conditional dependence between task labels is present in many situations. Other possible applications of the method include: training language identification of tweets given geo-location supervision (knowing the geographical region gives a prior on the languages spoken); training predictors for renal failure from textual medical records given a classifier for diabetes (there is a strong correlation between the two conditions); training a political affiliation classifier from social media tweets based on age-group classifiers, zip-code information, or social-status classifiers (there are known correlations between all of these and political affiliation); training hate-speech detection based on emotion detection; and so on.

[1] Note also that $\forall j\; |U_j| = 1 \iff \mathcal{L}_{\text{XR}}(\theta) = \mathcal{L}_{\text{cross}}(\theta)$.

[2] Note that the classifier does not need to be trainable or differentiable. It can be a human, a rule-based system, a nonparametric model, a probabilistic model, a deep learning network, etc. In this work, we use a neural classification model.

[3] In theory, we could estimate, or even "guess", these $|\mathcal{Y}^s| \times |\mathcal{Y}^t|$ values without using $D^t$ at all. In practice, and in particular because we care about the target label proportions given noisy source labels $\tilde{y}^s$ assigned by $C^s$, we use MLE estimates over the tagged $D^t$.

[4] In practice, our "sentences" are in fact short documents, some of which are composed of two or more sentences.

[5] Condition (2), coupled with selecting the highest node, pushes towards complete phrases that contain opinions (which are usually expressed with adjectives or verbs), while the other conditions focus the attention on the desired pivot phrase.

[6] On the rare occasions where we cannot find such a node, we take the root node of the tree (the entire sentence) as the fragment for the given aspect.

[7] All of the sentence-level sentiment data is obtained from the Yelp dataset challenge: https://www.yelp.com/dataset/challenge

[8] https://www.nltk.org/

[9] https://allennlp.org/

[10] We also increased the batch sizes of the baselines to match those of the XR setups. This decreased the performance of the baselines, which is consistent with the folk knowledge in the community according to which smaller batch sizes are more effective overall.

[11] Despite the claim of Mann and McCallum (2007) regarding the temperature parameter, we observed lower performance when using it in our setup. However, in other setups this parameter might be found to be beneficial.

[12] https://nlp.stanford.edu/projects/glove/

[13] We also tested the sentence BiLSTM baselines with a SemEval validation set, and received slightly lower results without a significant statistical difference.

[14] https://github.com/clab/dynet

[15] We changed the LSTM component to a BiLSTM.

[16] To be consistent with existing research (He et al., 2018b), aspects with conflicted polarity are removed.

[17] We note that their setup uses clean and more balanced annotations, i.e., they use 10,000 samples for each label, which helps predicting the infrequent neutral sentiment. We, however, use noisy sentence sentiment labels which are automatically obtained from a classifier trained on 10,000 samples in their natural imbalanced distribution.

[18] We could not fit $k = 450$ sets of BERT-large on our GPU.

[19] When fine-tuning to the sentence-level task, we provide the sentence as input. When fine-tuning to the aspect-level task, we provide the sentence, a separator, and then the aspect.

[20] The other configuration parameters were the default ones in https://github.com/huggingface/pytorch-pretrained-BERT