Variational Pretraining for Semi-supervised Text Classification


Abstract

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find that fine-tuning to in-domain data is crucial to achieving decent performance from contextual embeddings when working with limited supervision. We accompany this paper with code to pretrain and use VAMPIRE embeddings in downstream tasks.

1 Introduction

An effective approach to semi-supervised learning has long been a goal for the NLP community, as unlabeled data tends to be plentiful compared to labeled data. Early work emphasized using unlabeled data drawn from the same distribution as the labeled data (Nigam et al., 2000), but larger and more reliable gains have been obtained by using contextual embeddings trained with a language modeling (LM) objective on massive amounts of text from domains such as Wikipedia or news (Peters et al., 2018a; Devlin et al., 2019; Howard and Ruder, 2018). The latter approaches play to the strengths of high-resource settings (e.g., access to web-scale corpora and powerful machines), but their computational and data requirements can make them less useful in resource-limited environments. In this paper, we instead focus on the low-resource setting (§2.1) and develop a lightweight approach to pretraining for semi-supervised text classification.

Our model, which we call VAMPIRE (VAriational Methods for Pretraining In Resource-limited Environments), combines a variational autoencoder (VAE) approach to document modeling (Kingma and Welling, 2013; Miao et al., 2016; Srivastava and Sutton, 2017) with insights from LM pretraining (Peters et al., 2018a). By operating on a bag-of-words representation, we avoid the time complexity and difficulty of training a sequence-to-sequence VAE (Bowman et al., 2016; Xu et al., 2017; Yang et al., 2017) while retaining the freedom to use a multi-layer encoder that can learn useful representations for downstream tasks. Because VAMPIRE ignores sequential information, the resulting models are much cheaper to train, and they offer strong performance when the amount of labeled data is small. Finally, because VAMPIRE is a descendant of topic models, we are able to explore model selection by topic coherence, rather than validation-set perplexity, which results in better downstream classification performance (§6.1).

In order to evaluate the effectiveness of our method, we experiment with four text classification datasets. We compare our approach to a traditional semi-supervised baseline (self-training), alternative representation learning techniques that have access to the in-domain data, and the full-scale alternative of using large language models trained on out-of-domain data, optionally fine-tuned to the task domain.

Our results demonstrate that effective semi-supervised learning is achievable for limited-resource settings, without the need for computationally demanding sequence-based models. While we observe that fine-tuning a pretrained BERT model to the domain provides the best results, this depends on the existence of such a model in the relevant language, as well as GPUs to fine-tune it. When this is not an option, our model offers equivalent or superior performance to the alternatives with minimal computational requirements, especially when working with limited amounts of labeled data.

The major contributions of this paper are:

• We adapt variational document models to modern pretraining methods for semi-supervised text classification (§3), and highlight the importance of appropriate criteria for model selection (§3.2).

• We demonstrate experimentally that our method is an efficient and effective approach to semi-supervised text classification when data and computation are limited ( §5).

• We confirm that fine-tuning is essential when using contextual embeddings for document classification, and provide a summary of practical advice for researchers wishing to use unlabeled data in semi-supervised text classification ( §8).

• We release code to pretrain variational models on unlabeled data and use learned representations in downstream tasks. 2

2.1 Resource-Limited Environments

In this paper, we are interested in the low-resource setting, which entails limited access to computation, labels, and out-of-domain data. Labeled data can be obtained cheaply for some tasks, but for others, labels may require expensive and time-consuming human annotations, possibly from domain experts, which will limit their availability. While there is a huge amount of unlabeled text available for some languages, such as English, this scale of data is not available for all languages. In-domain data availability, of course, varies by domain. For many researchers, especially outside of STEM fields, computation may also be a scarce resource, such that training contextual embeddings from scratch, or even incorporating them into a model, could be prohibitively expensive.

Moreover, even when such pretrained models are available, they inevitably come with potentially undesirable biases baked in, based on the data on which they were trained (Recasens et al., 2013; Bolukbasi et al., 2016; Zhao et al., 2019) .

Particularly for social science applications, it may be preferable to exclude such confounders by only working with in-domain or curated data.

Given these constraints and limitations, we seek an approach to semi-supervised learning that can leverage in-domain unlabeled data, achieve high accuracy with only a handful of labeled instances, and run efficiently on a CPU.

2.2 Semi-Supervised Learning

Many approaches to semi-supervised learning have been developed for NLP, including variants of bootstrapping (Charniak, 1997; Blum and Mitchell, 1998; Zhou and Li, 2005; McClosky et al., 2006), and representation learning using generative models or word vectors (Mikolov et al., 2013; Pennington et al., 2014). Contextualized embeddings have recently emerged as a powerful way to use out-of-domain data (Peters et al., 2018a; Radford, 2018), but training these large models requires a massive amount of appropriate data (typically on the order of hundreds of millions of words) and industry-scale computational resources (hundreds of hours on multiple GPUs). 3

There have also been attempts to leverage VAEs for semi-supervised learning in NLP, mostly in the form of sequence-to-sequence models (Xu et al., 2017; Yang et al., 2017), which use sequence-based encoders and decoders (see §3). These papers report strong performance, but there are many open questions that necessitate further investigation. First, given the reported difficulty of training sequence-to-sequence VAEs (Bowman et al., 2016), it is questionable whether such an approach is useful in practice. Moreover, it is unclear whether such complex models (which are expensive to train) are actually required for good performance on tasks such as text classification.

Here, we instead base our framework on neural document models (Miao et al., 2016; Srivastava and Sutton, 2017; Card et al., 2018) , which offer both faster training and an explicit interpretation in the form of topics, and explore their utility in the semi-supervised setting.

3 Model

In this work, we assume that we have L documents, D_L = {(x_i, y_i)}_{i=1}^{L}, with observed categorical labels y ∈ Y. We also assume access to a larger set of U documents drawn from the same distribution, but for which the labels are unobserved, i.e., D_U = {x_i}_{i=L+1}^{U+L}. Our primary goal is to learn a probabilistic classifier, p(y | x).

Our approach heavily borrows from past work on VAEs (Kingma and Welling, 2013; Miao et al., 2016; Srivastava and Sutton, 2017) , which we adapt to semi-supervised text classification (see Figure 1 ). We do so by pretraining the document model on unlabeled data ( §3.1), and then using learned representations in a downstream classifier ( §3.3). The downstream classifier makes use of multiple internal states of the pretrained document model, as in Peters et al. (2018b) . We also explore how to best do model selection in a way that benefits the downstream task ( §3.2).

Figure 1: VAMPIRE involves pretraining a deep variational autoencoder (VAE; displayed on left) on unlabeled text. The VAE, which consists entirely of feedforward networks, learns to reconstruct a word frequency representation of the unlabeled text with a logistic normal prior, parameterized by µ and σ. Downstream, the pretrained VAE’s internal states are frozen and concatenated to task-specific word vectors to improve classification in the low-resource setting.

3.1 Unsupervised Pretraining

In order to learn useful representations, we initially ignore labels, and assume each document is generated from a latent variable, z. The functions learned in estimating this model then provide representations which are used as features in supervised learning.

Using a variational autoencoder for approximate Bayesian inference, we simultaneously learn an encoder, which maps from the observed text to an approximate posterior q(z | x), and a decoder, which reconstructs the text from the latent representation. In practice, we instantiate both the encoder and decoder as neural networks and assume that the encoder maps to a normally distributed posterior, i.e., for document i,

µ_i = f_µ(x_i),   log σ_i = f_σ(x_i),   q(z_i | x_i) = N(µ_i, diag(σ_i²))   (2)

Using standard principles of variational inference, we derive a variational bound on the marginal log-likelihood of the observed data,

L(x_i) = E_{q(z_i | x_i)}[ log p(x_i | z_i) ] − KL( q(z_i | x_i) || p(z_i) )   (3)

Intuitively, the first term in the bound can be thought of as a reconstruction loss, ensuring that generated words are similar to the original document. The second term, the KL divergence, encourages the variational approximation to be close to the assumed prior, p(z), which we take to be a spherical normal distribution.

Using the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014), we replace the expectation with a single-sample approximation, 4 i.e.,

z_i^(s) = µ_i + σ_i ⊙ ε^(s),   L(x_i) ≈ log p(x_i | z_i^(s)) − KL( q(z_i | x_i) || p(z_i) )   (5)

where ε^(s) ∼ N(0, I) is sampled from an independent normal. All parameters can then be optimized simultaneously by performing stochastic gradient ascent on the variational bound.

A powerful way of encoding and decoding text is to use sequence models. That is, f_µ(x) and f_σ(x) would map from a sequence of tokens to a pair of vectors, µ and σ, and f_d(z) would similarly decode from z to a sequence of tokens, using recurrent, convolutional, or attention-based networks. Some authors have adopted this approach (Bowman et al., 2016; Xu et al., 2017; Yang et al., 2017), but as discussed above (§2.2), it has a number of disadvantages.

In this paper, we adopt a more lightweight and directly interpretable approach, and work with word frequencies instead of word sequences. Using the same basic structure as Miao et al. (2016) but employing a softmax in the decoder, we encode f_µ(x) and f_σ(x) with multi-layer feedforward neural networks operating on an input vector of word counts, c_i:

π_i = MLP(c_i),   µ_i = W_µ π_i + b_µ,   log σ_i = W_σ π_i + b_σ   (10)

For a decoder, we use the following form, which reconstructs the input in terms of topics (coherent distributions over the vocabulary):

θ_i = softmax(z_i),   p(word j | z_i) = [ softmax(b + B θ_i) ]_j   (13)

where j ranges over the vocabulary.

By placing a softmax on z, we can interpret θ as a distribution over latent topics, as in a topic model (Blei et al., 2003) , and B as representing positive and negative topical deviations from a background b. This form (essentially a unigram LM) allows for much more efficient inference on z, compared to sequence-based encoders and decoders.
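To make this form concrete, the following is a minimal PyTorch sketch of such a unigram document VAE. It is illustrative only, not the released VAMPIRE implementation: the layer sizes, activations, and the class name BagOfWordsVAE are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsVAE(nn.Module):
    """Unigram document VAE in the spirit of VAMPIRE (illustrative sketch)."""

    def __init__(self, vocab_size: int, hidden_dim: int = 512, num_topics: int = 64):
        super().__init__()
        # f_mu and f_sigma share a feedforward trunk over the word-count vector c_i
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim), nn.Softplus(),
            nn.Linear(hidden_dim, hidden_dim), nn.Softplus(),
        )
        self.to_mu = nn.Linear(hidden_dim, num_topics)
        self.to_log_sigma = nn.Linear(hidden_dim, num_topics)
        self.background = nn.Parameter(torch.zeros(vocab_size))      # b
        self.topics = nn.Linear(num_topics, vocab_size, bias=False)  # B
        self.bn = nn.BatchNorm1d(vocab_size)  # batchnorm on the reconstruction (Sec. 3.4)

    def forward(self, counts: torch.Tensor, kl_weight: float = 1.0):
        h = self.encoder(counts)                                 # internal encoder state
        mu, log_sigma = self.to_mu(h), self.to_log_sigma(h)
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)     # reparameterization
        theta = F.softmax(z, dim=-1)                             # document-topic mixture
        word_log_probs = F.log_softmax(
            self.bn(self.background + self.topics(theta)), dim=-1)
        recon = (counts * word_log_probs).sum(-1)                # log p(x_i | z_i)
        kl = 0.5 * (torch.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma).sum(-1)
        loss = -(recon - kl_weight * kl).mean()                  # negative ELBO (Eq. 3)
        return loss, theta, h
```

The kl_weight argument anticipates the KL annealing described in §3.4, and the returned theta and hidden state can later be frozen and used as features in a downstream classifier (§3.3).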

3.2 Model Selection Via Topic Coherence

Because our pretraining ignores document labels, it is not obvious that optimizing it to convergence will produce the best representations for downstream classification. When pretraining using a LM objective, models are typically trained until model fit stops improving (i.e., perplexity on validation data). In our case, however, θ i has a natural interpretation as the distribution (for document i) over the latent "topics" learned by the model (B). As such, an alternative is to use the quality of the topics as a criterion for early stopping.

It has repeatedly been observed that different types of topic models offer a trade-off between perplexity and topic quality (Chang et al., 2009; Srivastava and Sutton, 2017). Several methods for automatically evaluating topic coherence have been proposed (Newman et al., 2010; Mimno et al., 2011), such as normalized pointwise mutual information (NPMI), which Lau et al. (2014) found to be among the most strongly correlated with human judgement. As such, we consider using either log likelihood or NPMI as a stopping criterion for VAMPIRE pretraining (§6.1), and evaluate them in terms of which leads to the better downstream classifier.

NPMI measures the probability that two words collocate in an external corpus (in our case, the validation data). For each topic t in B, we collect the top ten most probable words and compute NPMI between all pairs:

NPMI(w_i, w_j) = log( P(w_i, w_j) / (P(w_i) P(w_j)) ) / ( −log P(w_i, w_j) )   (14)

We then arrive at a global NPMI for B by averaging the NPMIs across all topics. We evaluate NPMI at the end of each epoch during pretraining, and stop training when NPMI has stopped increasing for a pre-defined number of epochs.
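For concreteness, the sketch below computes this criterion from a held-out corpus using document-level co-occurrence counts. The co-occurrence convention, smoothing, and function names here are assumptions for illustration, not the exact code used for VAMPIRE.

```python
import itertools
import math

def topic_npmi(top_words, doc_sets, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    top_words: the topic's ten most probable words.
    doc_sets: list of sets, each the unique tokens of one validation document.
    """
    n_docs = len(doc_sets)

    def p(*words):
        # empirical probability that all of `words` appear in the same document
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 > 0:
            scores.append(math.log(p12 / (p1 * p2)) / (-math.log(p12 + eps)))
        else:
            scores.append(-1.0)  # convention: pairs that never co-occur get the minimum score
    return sum(scores) / len(scores)

def corpus_npmi(topics_top_words, doc_sets):
    """Global NPMI: average the per-topic scores across all topics in B."""
    return sum(topic_npmi(tw, doc_sets) for tw in topics_top_words) / len(topics_top_words)
```

In pretraining, corpus_npmi would be evaluated on the validation documents at the end of each epoch, and training stopped once it fails to improve for a fixed number of epochs.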

3.3 Using A Pretrained VAE For Text Classification

Kingma et al. (2014) proposed using the latent variable of an unsupervised VAE as features in a downstream model for classifying images. However, work on pretraining for NLP, such as Peters et al. (2018a), found that LMs encode different information in different layers, each of which may be more or less useful for certain tasks. Here, for an n-layer MLP encoder on word counts c_i, we build on that idea, and use as representations a weighted sum over θ_i and the internal states of the MLP, h_i^(1), . . . , h_i^(n), with weights to be learned by the downstream classifier. 5 That is, for any sequence-to-vector encoder, f_s2v(x), we propose to augment the vector representations for each document by concatenating them with a weighted combination of the internal states of our variational encoder (Peters et al., 2018a). We can then train a supervised classifier on the weighted combination,

r_i = λ_0 θ_i + Σ_{k=1}^{n} λ_k h_i^(k),   p(y_i | x_i) = f_c([ f_s2v(x_i) ; r_i ])   (16)

where r_i is the weighted combination of the frozen VAMPIRE states, [· ; ·] denotes concatenation, f_c is a neural classifier, and {λ_0, . . . , λ_n} are softmax-normalized trainable parameters.
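A minimal sketch of this learned combination is shown below; it assumes that all of the frozen VAMPIRE states (θ_i and the encoder layers) share a common dimensionality, which may require a projection in practice, and the class name is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VampireScalarMix(nn.Module):
    """Softmax-normalized weighted sum of frozen VAMPIRE states (sketch of Eq. 16)."""

    def __init__(self, num_states: int):
        super().__init__()
        # one scalar per state: [theta, h^(1), ..., h^(n)]
        self.lambdas = nn.Parameter(torch.zeros(num_states))

    def forward(self, states):
        # states: list of (batch, dim) tensors, detached from the pretrained VAE
        weights = F.softmax(self.lambdas, dim=0)
        return sum(w * s for w, s in zip(weights, states))
```

The resulting vector r_i is then concatenated with f_s2v(x_i) before the classifier's final layer.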

3.4 Optimization

In all cases, we optimize models using Adam (Kingma and Ba, 2014). In order to prevent divergence during pretraining, we make use of a batchnorm layer on the reconstruction of x (Ioffe and Szegedy, 2015) . We also use KL-annealing (Bowman et al., 2016) , placing a scalar weight on the KL divergence term in Eq. 3, which we gradually increase from zero to one. Because our model consists entirely of feedforward neural networks, it is easily parallelized, and can run efficiently on either CPUs or GPUs.
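As a small illustration, a linear annealing schedule for the KL weight can be written as follows; the warmup length is an arbitrary placeholder, not the value used in our experiments.

```python
def kl_weight(step: int, warmup_steps: int = 5000) -> float:
    """Scalar weight on the KL term in Eq. 3, increased linearly from 0 to 1."""
    return min(1.0, step / warmup_steps)
```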

4 Experimental Setup

We evaluate the performance of our approach on four text classification tasks, as we vary the amount of labeled data, from 200 to 10,000 instances. In all cases, we assume the existence of about 75,000 to 125,000 unlabeled in-domain examples, which come from the union of the unused training data and any additional unlabeled data provided by the corpus. Because we are working with a small amount of labeled data, we run each experiment with five random seeds, each with a different sample of labeled training instances, and report the mean performance on test data.

4.1 Datasets And Preprocessing

We experiment with text classification datasets that span a variety of label types. The datasets we use are the familiar AG News (Zhang et al., 2015), IMDB (Maas et al., 2011), and YAHOO! Answers datasets (Chang et al., 2008), as well as a dataset of tweets labeled in terms of four HATESPEECH categories (Founta et al., 2018). Summary statistics are presented in Table 1. In all cases, we either use the official test set, or take a random stratified sample of 25,000 documents as a test set. We also sample 5,000 instances as a validation set. We tokenize documents with spaCy, and use up to 400 tokens for sequence encoding (f_s2v(x)). For VAMPIRE pretraining, we restrict the vocabulary to the 30,000 most common words in the dataset, after excluding tokens shorter than three characters, those with digits or punctuation, and stopwords. 6 We leave the vocabulary for downstream classification unrestricted.
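A rough sketch of this preprocessing pipeline is shown below; the filtering rules approximate the description above, and the exact stopword list and edge cases may differ from the ones we used.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # tokenizer only; no tagging or parsing needed

def vampire_tokens(text: str):
    """Tokenize and filter roughly as described in Sec. 4.1."""
    return [
        t.text.lower() for t in nlp(text)
        if len(t.text) >= 3 and t.text.isalpha() and not t.is_stop
    ]

def build_vocab(docs, size=30000):
    """Keep the `size` most common surviving tokens."""
    counts = Counter(tok for doc in docs for tok in vampire_tokens(doc))
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}

def bag_of_words(doc, vocab):
    """Word-count vector c_i used as VAMPIRE's input."""
    vec = [0] * len(vocab)
    for tok in vampire_tokens(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec
```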

Table 1: Datasets used in our experiments.
Table 3: Example topics learned by VAMPIRE in IMDB and YAHOO! datasets. See Appendix D for more examples.
Table 4: VAMPIRE is substantially more compact than Transformer-based ELMO but is still competitive under low-resource settings. Here, we display the computational requirements for pretraining VAMPIRE and ELMO on in-domain unlabeled text from the IMDB dataset. We report results on training VAMPIRE (with hyperparameters listed in Appendix A.1) and ELMO (with its default configuration) on a GeForce GTX 1080 Ti GPU, and VAMPIRE on a 2.60GHz Intel Xeon CPU. VAMPIRE uses about 750MB of memory on a GPU, while ELMO requires about 8.5GB.

4.2 VAMPIRE Architecture

In order to find reasonable hyperparameters for VAMPIRE, we utilize a random search strategy for pretraining. For each dataset, we take the model with the best NPMI for use in the downstream classifiers. We detail sampling bounds and final assignments for each hyperparameter in Table 5 in Appendix A.1.

Table 5: VAMPIRE search space, best assignments, and associated performance on the four datasets we consider in this work.

4.3 Downstream Classifiers

For all experiments we make use of the Deep Averaging Network (DAN) architecture (Iyyer et al., 2015) as our baseline sequence-to-vector encoder, f_s2v(x). That is, embeddings corresponding to each token are summed and passed through a multi-layer perceptron.

p(y_i | x_i) = MLP( (1/|x_i|) Σ_{j=1}^{|x_i|} E(x_i)_j ),   (17)

where E(x) converts a sequence of tokens to a sequence of vectors, using randomly initialized vectors, off-the-shelf GLOVE embeddings (Pennington et al., 2014), or contextual embeddings.

To incorporate the document representations learned by VAMPIRE in a downstream classifier, we concatenate them with the average of randomly initialized trainable embeddings, i.e.,

p(y_i | x_i) = MLP([ (1/|x_i|) Σ_{j=1}^{|x_i|} E(x_i)_j ; r_i ]),   (18)

where r_i is the weighted combination of VAMPIRE's internal states from Eq. (16).

Preliminary experiments found that DANs with one-layer MLPs and moderate dropout provide more reliable performance on validation data than more expressive models, such as CNNs or LSTMs, with less hyperparameter tuning, especially when working with few labeled instances (details in Appendix A.2).
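The following is a minimal PyTorch sketch of such a DAN classifier with the concatenated VAMPIRE representation (Eqs. 17–18); the embedding size, dropout rate, and single linear output layer are illustrative simplifications, not our exact configuration.

```python
import torch
import torch.nn as nn

class DanClassifier(nn.Module):
    """Deep Averaging Network with an optional concatenated VAMPIRE vector."""

    def __init__(self, vocab_size, num_classes, emb_dim=50, vampire_dim=0, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(emb_dim + vampire_dim, num_classes),
        )

    def forward(self, token_ids, vampire_vec=None):
        mask = (token_ids != 0).float().unsqueeze(-1)          # ignore padding positions
        avg = (self.embed(token_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        if vampire_vec is not None:
            avg = torch.cat([avg, vampire_vec], dim=-1)        # concatenation in Eq. (18)
        return self.mlp(avg)                                   # logits over labels
```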

4.4 Resources And Baselines

In these experiments, we consider baselines for both low-resource and high-resource settings, where the high-resource baselines have access to greater computational resources and either a massive amount of unlabeled data or a pretrained model, such as ELMO or BERT. We consider these models to be representative of the high-resource setting, both because they were computationally intensive to train, and because they were made possible by the huge amount of English text that is available online.

Low resource In the low-resource setting we assume that computational resources are at a premium, so we are limited to lightweight approaches such as VAMPIRE, which can run efficiently on a CPU. As baselines, we consider a) a purely supervised model, with randomly initialized 50-dimensional embeddings and no access to unlabeled data; b) the same model initialized with 300-dimensional GLOVE vectors, pretrained on 840 billion words (http://nlp.stanford.edu/projects/glove/); c) 300-dimensional GLOVE vectors trained on only in-domain data; and d) self-training, which has access to the in-domain unlabeled data. For self-training, we iterate over training a model, predicting labels on all unlabeled instances, and adding to the training set all unlabeled instances whose label is predicted with high confidence, repeating this up to five times and using the model with highest validation accuracy. On each iteration, the threshold for a given label is equal to the 90th percentile of predicted probabilities for validation instances with the corresponding label.
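The self-training procedure can be sketched as below. The train_model helper, the scikit-learn-style predict/predict_proba interface, and the assumption of integer labels 0..K−1 are placeholders for illustration; only the overall loop (per-label 90th-percentile thresholds, up to five iterations, keeping the model with the best validation accuracy) follows the description above.

```python
import numpy as np

def self_train(train_x, train_y, unlabeled_x, val_x, val_y, train_model, max_iters=5):
    """Iterative self-training with per-label confidence thresholds (sketch)."""
    best_model, best_acc = None, -1.0
    x, y = list(train_x), list(train_y)
    pool = list(unlabeled_x)
    for _ in range(max_iters):
        model = train_model(x, y)  # any classifier exposing predict / predict_proba
        acc = np.mean(model.predict(val_x) == np.array(val_y))
        if acc > best_acc:
            best_model, best_acc = model, acc
        if not pool:
            break
        # per-label threshold: 90th percentile of predicted probability on
        # validation instances that carry that label
        val_probs = model.predict_proba(val_x)
        thresholds = {
            label: np.percentile(val_probs[np.array(val_y) == label, label], 90)
            for label in np.unique(val_y)
        }
        pool_probs = model.predict_proba(pool)
        preds = pool_probs.argmax(axis=1)
        confident = [i for i, p in enumerate(preds) if pool_probs[i, p] >= thresholds[p]]
        x += [pool[i] for i in confident]
        y += [preds[i] for i in confident]
        kept = set(range(len(pool))) - set(confident)
        pool = [pool[i] for i in sorted(kept)]
    return best_model
```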

High resource In the high-resource setting, we assume access to plentiful computational resources and massive amounts of out-of-domain data, which may be indirectly accessed through pretrained models. Specifically, we evaluate the performance of a Transformer-based ELMO (Peters et al., 2018b) and BERT, both (a) off-the-shelf with frozen embeddings and (b) after semi-supervised fine-tuning to both unlabeled and labeled in-domain data. To perform semi-supervised fine-tuning, we first use ELMO and BERT's original objectives to fine-tune to the unlabeled data. To fine-tune ELMO to the labeled data, we average over the LM states and add a softmax classification layer. We obtain the best results applying slanted triangular learning rates and gradual unfreezing (Howard and Ruder, 2018) to this fine-tuning step. To fine-tune BERT to labeled data, we feed the hidden state corresponding to the [CLS] token of each instance to a softmax classification layer. We use AllenNLP 9 to fine-tune ELMO, and Pytorch-pretrained-BERT 10 to fine-tune BERT.

We also experiment with ELMO trained only on in-domain data as an example of high-resource LM pretraining methods, such as Dai and Le (2015), when there is no out-of-domain data available. Specifically, we generate contextual word representations with a Transformer-based ELMO. During downstream classification, the resulting vectors are frozen and concatenated to randomly initialized word vectors prior to the summation in Eq. (17).

5 Results

In the low-resource setting, we find that VAMPIRE achieves the highest accuracy of all low-resource methods we consider, especially when the amount of labeled data is small. Table 2 shows the performance of all low-resource models on all datasets as we vary the amount of labeled data, and a subset of these are also shown in Figure 2 for easy comparison.

Figure 2: Learning curves for all datasets in the low-resource setting, showing the mean (line) and one standard deviation (bands) over five runs for VAMPIRE, self-training, and 840B-token GLOVE embeddings. Full results are in Table 2.

In the high-resource setting, we find, not surprisingly, that fine-tuning the pretrained BERT model to in-domain data provides the best performance. For both BERT and ELMO, we find that using frozen off-the-shelf vectors results in surprisingly poor performance compared to fine-tuning to the task domain, especially for HATESPEECH and IMDB (see also Howard and Ruder, 2018). For these two datasets, an ELMO model trained only on in-domain data offers far superior performance to frozen off-the-shelf ELMO (see Figure 3). This difference is smaller, however, for YAHOO! and AG (please see Appendix B for full results).

Dataset   Model           200          500          2500         10000
IMDB      Baseline        68.5 (7.8)   79.0 (0.4)   84.4 (0.1)   87.1 (0.3)
          Self-training   73.8 (3.3)   80.0 (0.7)   84.6 (0.2)   87.0 (0.4)
          GLOVE (ID)      74.5 (0.8)   79.5 (0.4)   84.7 (0.2)   87.1 (0.4)
          GLOVE (OD)      74.1 (1.2)   80.0 (0.2)   84.6 (0.3)   87.0 (0.6)
          VAMPIRE         82.2 (2.0)   84.5 (0.4)   85.4 (0.4)   87.1 (0.4)
AG        Baseline        68.8 (2.0)   77.3 (1.0)   84.4 (0.1)   87.5 (0.2)
          Self-training   77.3 (1.7)   81.3 (0.8)   84.8 (0.2)   87.7 (0.1)
          GLOVE (ID)      70.4 (1.2)   78.0 (1.0)   84.1 (0.3)   87.1 (0.2)
          GLOVE (OD)      68.8 (5.7)   78.8 (1.1)   85.3 (0.3)   88.0 (0.3)

Table 2: Test accuracies in the low-resource setting on four text classification datasets under varying levels of labeled training data (200, 500, 2500, and 10000 documents). Each score is reported as an average over five seeds, with standard deviation in parentheses, and the highest mean result in each setting shown in bold.

Figure 3: High-resource methods (plus VAMPIRE) on four datasets; ELMO performance benefits greatly from training on (ID), or fine-tuning (FT) to, the in-domain data (as does BERT; full results in Appendix B). Key: FT (fine-tuned), FR (frozen), ID (in-domain).

These results taken together demonstrate that although pretraining on massive amounts of web text offers large improvements over purely supervised models, access to unlabeled in-domain data is critical, either for fine-tuning a pretrained language model in the high-resource setting, or for training VAMPIRE in the low-resource setting. Similar findings have been reported by Yogatama et al. (2019) for tasks such as natural language inference and question answering.

6.1 NPMI Versus NLL As Stopping Criteria

To analyze the effectiveness of different stopping criteria in VAMPIRE, we pretrain 200 VAMPIRE models on IMDB: 100 selected via NPMI, and 100 selected via negative log likelihood (NLL) on validation data. Interestingly, we observe that VAMPIRE NPMI and NLL values are negatively correlated (ρ = -0.72; Figure 4A), suggesting that upon convergence, trained models that better fit the data also tend to have more coherent topics. We then train 200 downstream classifiers with the same hyperparameters, on a fixed 200-document random subset of the IMDB dataset, uniformly sampling over the NPMI- and NLL-selected VAMPIRE models as additional features. In Figures 4B and 4C, we observe that better pretrained VAMPIRE models (according to either criterion) tend to produce better downstream performance (ρ = 0.55 and ρ = -0.53 for NPMI and NLL, respectively). However, we also observe higher variance in accuracy among the VAMPIRE models obtained using NLL as a stopping criterion (Figure 4D); such models can have poor topic coherence and poor downstream performance. As such, doing model selection using NPMI is the preferred alternative, and all VAMPIRE results in Table 2 are based on pretrained models selected using this criterion.

Figure 4: Comparing NPMI and NLL as early stopping criteria for VAMPIRE model selection. NPMI and NLL are correlated measures of model fit, but NPMI-selected VAMPIRE models have lower variance in downstream classification performance with 200 labeled documents of IMDB. Accuracy is reported on the validation data. See §6.1 for more details.

The experiments in Ding et al. (2018) provide some insight into this behaviour. They find that when training neural topic models, model fit and NPMI initially tend to improve on each epoch. At some point, however, perplexity continues to improve, while NPMI starts to drop, sometimes dramatically. We also observe this phenomenon when training VAMPIRE (see Appendix C). Using NPMI as a stopping criterion, as we propose to do, helps to avoid degenerate models that result from training too long.

In some preliminary experiments, we also observe cases where NPMI is artificially high because of redundancy in topics. Applying batchnorm to the reconstruction markedly improves the diversity of collocating words across topics, which has also been noted by Srivastava and Sutton (2017). Future work may explore augmenting the NPMI criterion with a word diversity regularizer, so as to encourage models that have both stronger coherence and greater word diversity across topics.

6.2 Learned Latent Topics

In addition to being lightweight, one advantage of VAMPIRE is that it produces document representations that can be explicitly interpreted in terms of topics. Although the input we feed into the downstream classifier combines this representation with internal states of the encoder, the topical interpretation helps to summarize what the pretraining has learned. Examples of topics learned by VAMPIRE are provided in Table 3 and Appendix D.

6.3 Learned Scalar Layer Weights

Since the scalar weight parameters in r_i are trainable, we are able to investigate which layers of the pretrained VAE the classifier tends to prefer. We consistently find that the model tends to upweight the first layer of the VAE encoder, h^(1), and θ, and downweight the other layers of the encoder. To improve learning, especially under low-resource settings, we initialize the scalar weights applied to the first encoder layer and θ with high values and downweight those applied to the intermediate layers, which increases validation performance. However, we have also observed that using a multi-layer encoder in VAMPIRE leads to larger gains downstream.

6.4 Computational Requirements

An appealing aspect of VAMPIRE is its compactness. Table 4 shows the computational requirements involved in training VAMPIRE on a single GPU or CPU, compared to training an ELMO model from scratch on the same data on the same GPU.

7 Related Work

In addition to references given throughout, many others have explored ways of enhancing performance when working with limited amounts of labeled data. Early work on speech recognition demonstrated the importance of pretraining and fine-tuning deep models in the semi-supervised setting (Yu et al., 2010) . Chang et al. (2008) considered "dataless" classification, where the names of the categories provide the only supervision. Miyato et al. (2016) showed that adversarial pretraining can offer large gains, effectively augmenting the amount of data available. A long line of work in active learning similarly tries to maximize performance when obtaining labels is costly (Settles, 2012). Xie et al. (2019) describe novel data augmentation techniques leveraging back translation and tf-idf word replacement. All of these approaches could be productively combined with the methods proposed in this paper.

8 Recommendations

Based on our findings in this paper, we offer the following practical advice to those who wish to do effective semi-supervised text classification.

• When resources are unlimited, the best results can currently be obtained by using a pretrained model such as BERT, but fine-tuning to in-domain data is critically important (see also Howard and Ruder, 2018) .

• When computational resources and annotations are limited, but there is plentiful unlabeled data, VAMPIRE offers large gains over other low-resource approaches.

• Training a language model such as ELMO on only in-domain data offers performance comparable to or somewhat better than VAMPIRE, but may be prohibitively expensive, unless working with GPUs.

• Alternatively, resources can be invested in getting more annotations; with sufficient labeled data (tens of thousands of instances), the advantages offered by additional unlabeled data become negligible. Of course, other NLP tasks may involve different tradeoffs between data, speed, and accuracy.

9 Conclusions

The emergence of models like ELMO and BERT has revived semi-supervised NLP, demonstrating that pretraining large models on massive amounts of data can provide representations that are beneficial for a wide range of NLP tasks. In this paper, we confirm that these models are useful for text classification when the number of labeled instances is small, but demonstrate that fine-tuning to in-domain data is also of critical importance. In settings where BERT cannot easily be used, either due to computational limitations, or because an appropriate pretrained model in the relevant language does not exist, VAMPIRE offers a competitive lightweight alternative for pretraining from unlabeled data in the low-resource setting. When working with limited amounts of labeled data, we achieve superior performance to baselines such as self-training, or using word vectors pretrained on out-of-domain data, and approach the performance of ELMO trained only on in-domain data at a fraction of the computational cost.

A Hyperparameter Search

In this section, we describe the hyperparameter search we used to choose model configurations, and include plots illustrating the range of validation performance observed in each setting.

A.1 VAMPIRE Search

For the results presented in the paper, we varied the hyperparameters of VAMPIRE across a number of different dimensions, outlined in Table 5 .

Figure 5: An example learning curve when training VAMPIRE on the IMDB dataset. If trained for too long, we observe many cases in which NPMI (higher is better) degrades while NLL (lower is better) continues to decrease. To avoid selecting a model that has poor topic coherence, we recommend performing model selection with NPMI rather than NLL.

A.2 Classifier Search

To choose a baseline classifier for which we experiment with all pretrained models, we performed a mix of manual tuning and random search over four basic classifiers: CNN, LSTM, Bag-of-Embeddings (i.e., Deep Averaging Networks), and Logistic Regression. Figure 6 shows the distribution of validation accuracies using 200 and 10,000 labeled instances, respectively, for different classifiers on the IMDB and AG datasets. Under the low-resource setting, we observe that logistic regression and DAN-based classifiers tend to lead to more reliable validation accuracies. With enough compute, CNN-based classifiers tend to produce marginally higher validation accuracies, but the probability mass is mostly centered below those of the logistic regression and DAN classifiers. LSTM-based classifiers tend to have extremely high variance under the low-resource setting. For this work, we choose to experiment with the DAN classifier, which comes with the richness of vector-based representations, along with the reliability that comes with having very few hyperparameters to tune.

Figure 6: Probability densities of supervised classification accuracy in low-resource (200 labeled instances; left) and high-resource (10K labeled instances; right) settings for IMDB and AG datasets using randomly initialized trainable embeddings. Each search consists of 300 trials over 5 seeds and varying hyperparameters. We experiment with four different classifiers: Logistic Regression, an LSTM-based classifier, a Deep Averaging Network, and a CNN-based classifier. We choose to use the Deep Averaging Network for all classifier baselines, due to its reliability, expressiveness, and low maintenance.
Table 6: Results in the high-resources setting.

B Results In The High-Resource Setting

Table 6 shows the results of all high-resource methods (along with VAMPIRE) on all datasets, as we vary the amount of labeled data. As can be seen, training ELMO only on in-domain data results in similar or better performance to using an off-the-shelf ELMO or BERT model without fine-tuning it to in-domain data.

Except for one case in which it fails badly (YAHOO! with 200 labeled instances), fine-tuning BERT to the target domain achieves the best performance in every setting. Though we performed a substantial hyperparameter search under this regime, we attribute the failure of fine-tuning BERT under this setting to potential hyperparameter decisions which could be improved with further tuning. Other work has suggested that random initializations have a significant effect on the failure cases of BERT, pointing to the brittleness of fine-tuning (Phang et al., 2018). The performance gap between fine-tuned ELMO and frozen ELMO on the AG News corpus is much smaller than on the other datasets, perhaps because the ELMO model we used was pretrained on the Billion Words Corpus, which is a news crawl. This dataset is also an example where frozen ELMO tends to outperform VAMPIRE. We take the strength of frozen, pretrained ELMO under this setting as further evidence of the importance of in-domain data for effective semi-supervised text classification.

C Further Details On NPMI Vs. NLL As Stopping Criteria

In the main paper, we note that we have observed cases in which training VAMPIRE for too long results in NPMI degradation, while NLL continues to improve. In Figure 5 , we display example learning curves that point to this phenomenon.

D Additional Learned Topics

In Table 7 we display some additional topics learned by VAMPIRE on the YAHOO! dataset.

Table 7: Example topics learned by VAMPIRE in the YAHOO! dataset.

2 http://github.com/allenai/vampire

3 For example, ULMFIT was trained on 100 million words, and BERT used 3.3 billion. While many pretrained models have been made available, they are unlikely to cover every application, especially for rare languages.

4 We leave experimentation with multi-sample approximation (e.g., importance sampling) to future work.

5 We also experimented with the joint training and combined approaches discussed in prior work, but found that neither of these reliably improved performance over our pretraining approach.

9 https://allennlp.org/elmo

10 https://github.com/huggingface/pytorch-pretrained-BERT