
Unsupervised Domain Clusters in Pretrained Language Models



The notion of “in-domain data” in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision, suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall with respect to an oracle selection.

1 Introduction

It is common knowledge in modern NLP that using large amounts of high-quality training data is a key aspect in building successful machine-learning based systems. For this reason, a major challenge when building such systems is obtaining data in the domain of interest. But what defines a domain? Natural language varies greatly across topics, styles, levels of formality, genres and many other linguistic nuances (van der Wees et al., 2015; van der Wees, 2017; Niu et al., 2017). This overwhelming diversity of language makes it hard to find the right data for the task, as it is nearly impossible to define precisely the requirements from such data with respect to all the aforementioned aspects. On top of that, domain labels are usually unavailable, e.g. in large-scale web-crawled data like Common Crawl, which was recently used to train state-of-the-art pretrained language models for various tasks (Raffel et al., 2019).

Domain data selection is the task of selecting the most appropriate data for a domain from a large corpus given a smaller set of in-domain data (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Silva et al., 2018). In this work, we propose to use the recent, highly successful self-supervised pre-trained language models, e.g. Devlin et al. (2019), for domain data selection. As pretrained LMs demonstrate state-of-the-art performance across many NLP tasks after being trained on massive amounts of data, we hypothesize that the robust representations they learn can be useful for mapping sentences to domains in an unsupervised, data-driven approach. We show that these models indeed learn to cluster sentence representations to domains without further supervision (e.g. Figure 1), and quantify this phenomenon by fitting Gaussian Mixture Models (GMMs) to the learned representations and measuring the purity of the resulting unsupervised clustering. We then propose methods to leverage these emergent domain clusters for domain data selection in two ways:

• Via distance-based retrieval in the sentence embedding space induced by the pretrained language model.

• By fine-tuning the pretrained language model for binary classification, where positive examples are from the domain of interest.

Figure 1: A 2D visualization of average-pooled BERT hidden-state sentence representations using PCA. The colors represent the domain for each sentence.

arXiv:2004.02105v2 [cs.CL] 1 May 2020

Our methods make it possible to select relevant data for the task while requiring only a small set of monolingual in-domain data. As they are based solely on the representations learned by self-supervised LMs, they do not require additional domain labels, which are usually vague and over-simplify the notion of domain in textual data. We evaluate our method on data selection for neural machine translation (NMT) using the multi-domain German-English parallel corpus composed by Koehn and Knowles (2017). Our data selection methods make it possible to train NMT models that outperform those trained using the well-established cross-entropy difference method of Moore and Lewis (2010) across five diverse domains, achieving a recall of more than 95% in all cases with respect to an oracle that selects the "true" in-domain data.

Our contributions in this work are as follows. First, we show that pre-trained language models are highly capable of clustering textual data to domains with high accuracy in a purely unsupervised manner. Second, we propose methods to select in-domain data based on this property using vector-space retrieval and positive-unlabeled fine-tuning of pretrained language models for binary classification. Third, we show the applicability of our proposed data selection methods on a popular benchmark for domain adaptation in machine translation. An additional contribution is a new, improved data split we create for this benchmark, as we point out issues with previous splits used in the literature. The code and data for this work are publicly available. We hope this work will encourage more research on understanding the data landscape in NLP, making it possible to "find the right data for the task" in the age of massive models and diverse data sources.

2 Emerging Domain Clusters In Pretrained Language Models

2.1 Motivation

The proliferation of massive pretrained neural language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) or RoBERTa has enabled great progress on many NLP benchmarks (Wang et al., 2019a). Larger and larger models trained on billions of tokens of raw text are released at an ever-increasing pace (Raffel et al., 2019), enabling the NLP community to fine-tune them for the task of interest. While many works tried to "probe" those models for the morphological, syntactic and semantic information they capture (Tenney et al., 2019; Goldberg, 2019; Clark et al., 2019), an important aspect of language remained overlooked in this context: the domain the data comes from, often referred to as the "data distribution". The definition of domain is often vague and over-simplistic (e.g. "medical text" may be used for biomedical research papers and for clinical conversations between doctors and patients, although the two vary greatly in topic, formality etc.). A common definition treats a domain as a data source: "a domain is defined by a corpus from a specific source, and may differ from other domains in topic, genre, style, level of formality, etc." (Koehn and Knowles, 2017). We claim that a more data-driven definition is needed, as different data sources may have sentences with similar traits and vice versa: a single massive web-crawled corpus contains texts in numerous styles, topics and registers. Our analysis in Section 2 shows examples of such cases, e.g. a sentence discussing "Viruses and virus-like organisms" in a legal corpus.

We hypothesize that massive pretrained LMs can learn representations that cluster to domains, as texts from similar domains will appear in similar contexts. We test this hypothesis across several large, publicly-available pretrained LMs; we explore both masked-language-models (MLMs) and auto-regressive LMs.

2.2 Method

We encode multi-domain data at the sentence level into vector representations. We then cluster these vector representations for each model using a Gaussian Mixture Model (GMM) with k pre-defined clusters. We chose GMM as our clustering approach as it allows soft assignments (vs. hard assignments as in e.g. K-means), which we think fits the task better (as a sentence can be seen as drawn from a mixture of several domains). 3 In all cases, to create a sentence representation we perform average pooling of the last hidden state (before the softmax layer) for each token in the sentence. 4 To accelerate the clustering process and enable visualization we also experiment with performing dimensionality reduction with PCA over the sentence vectors before clustering them. We experiment with k = 5, 10 and 15 to test whether adding flexibility improves the domain clustering accuracy.
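A minimal sketch of the average-pooling step described above, under stated assumptions: the function name `average_pool`, the toy vectors and the mask convention (1 for a real token, 0 for padding) are illustrative and not taken from the authors' code. In practice the per-token vectors would be the last hidden states of a pretrained model, and the pooled vectors would then be passed to PCA and a GMM.

```python
from typing import List, Sequence


def average_pool(token_vectors: Sequence[Sequence[float]],
                 mask: Sequence[int]) -> List[float]:
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("sentence has no unmasked tokens")
    dim = len(kept[0])
    # Element-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three positions with 2-dimensional hidden states;
# the last position is padding and must not affect the result.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
sentence_vector = average_pool(hidden, mask=[1, 1, 0])
print(sentence_vector)  # [2.0, 3.0]
```

Masking padding positions is a design choice assumed here; the text only states that token representations are averaged.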

2.3 Models And Baselines

For MLM-based models we use BERT (Devlin et al., 2019), DistilBERT and RoBERTa (in both the base and large versions). For autoregressive models we use GPT-2 (Radford et al., 2018) and XLNet (Yang et al., 2019). In all cases we use the implementations from the HuggingFace Transformers toolkit. We also evaluated three additional, simpler baselines. The first uses representations from word2vec (Mikolov et al., 2013), where we average-pooled the word vectors for the tokens that were present in the model vocabulary. The second uses Latent Dirichlet Allocation (LDA, Blei et al., 2003), which is a classic approach to unsupervised clustering of text. The third assigns sentences by sampling randomly from a uniform distribution over the clusters.

3 See further discussion comparing GMMs and K-means in Daume (2009).
4 Using the penultimate layer or others may result in better performance; we leave this for future work.

2.4 Evaluation

To evaluate the unsupervised domain clustering we used the multi-domain corpus proposed by Koehn and Knowles (2017), which includes textual data in five diverse domains: subtitles, medical text (PDF documents from the European Medicines Agency), legal text (legislative text of the European Union), translations of the Koran, and IT-related text (manuals and localization files of open-source software). This dataset includes parallel sentences in English and German; for this experiment we used the English portion of the data. See more details on the dataset in Section 3.1. We used 2000 distinct sentences from each domain. To evaluate whether the resulting clusters indeed capture the domains the data was drawn from, we measure the clustering purity, which is a well-known metric for evaluating clustering (Manning et al., 2008). To measure the clustering purity, we assign to each unsupervised cluster the most common "true" domain among the sentences assigned to that cluster, and then compute the accuracy according to this majority-based cluster-domain assignment (note that in this case several unsupervised clusters can be assigned to the same domain). In cases where randomness is involved we run each experiment five times with different initializations and report the mean and variance of the purity metric for each model.
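The purity computation described above can be sketched as follows. This is an illustrative implementation rather than the authors' evaluation script; the function name `clustering_purity` and the toy labels are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def clustering_purity(cluster_ids: Sequence[Hashable],
                      true_domains: Sequence[Hashable]) -> float:
    """Fraction of items whose domain equals the majority domain of their cluster."""
    if len(cluster_ids) != len(true_domains) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    domains_per_cluster = defaultdict(Counter)
    for cluster, domain in zip(cluster_ids, true_domains):
        domains_per_cluster[cluster][domain] += 1
    # Each cluster is labeled with its most common domain; several clusters
    # may end up labeled with the same domain.
    correct = sum(counts.most_common(1)[0][1]
                  for counts in domains_per_cluster.values())
    return correct / len(cluster_ids)


# Toy example: 6 sentences in 3 clusters; 5 of them match the majority
# domain of their cluster, so the purity is 5/6.
clusters = [0, 0, 0, 1, 1, 2]
domains = ["law", "law", "it", "medical", "medical", "it"]
print(clustering_purity(clusters, domains))
```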

2.5 Results And Discussion

As can be seen in Table 1, pre-trained language models are indeed highly capable of generating sentence representations that cluster by domains, resulting in up to 87.66%, 89.04% and 89.94% accuracy when using k=5, k=10 and k=15 clusters, respectively, across 10,000 sentences in 5 domains. We find these scores remarkably high given our straightforward average-pooling strategy and that no domain supervision was involved in the process of learning the pre-trained representations. Figure 2 also demonstrates the quality of the obtained clusters in 2D using the BERT-base model, where the ellipses describe the mean and variance parameters learned for each cluster by the GMM with k = 5. We note that some classes of models did better than others: while all vector-based models did far better than the random and LDA baselines, the MLM-based models dominated in all cases over word2vec and the auto-regressive models. This may be explained by the fact that the MLM-based models use the entire sentence context when generating the representations for each token, while the auto-regressive models only use the past context, and word2vec uses a limited window context. Using PCA improved performance in most cases, and especially for the auto-regressive models, although the results for the MLMs remain high in both cases, suggesting that these models encode the information very differently.

Table 1: Unsupervised domain clustering as measured by purity for the different models. Best results are marked in bold for each setting.

2.6 Analysis

As can be seen in Figure 2, in some areas the domains are somewhat overlapping in the embedding space, which may lead to outlier cases where examples from one domain are assigned to a cluster of another domain. We plot a confusion matrix (Figure 3) to analyze this further based on the clustering with BERT-base and k=5. We first note that the outlier sentences are much shorter than the average sentence length in the corpus (11.62 tokens on average for outliers vs. 20.5 tokens on average in general). This makes sense as shorter sentences contain less information, making it harder to assign them to an appropriate cluster. Table 2 shows examples of outlier sentences, assigned to clusters of domains different from their originating domain.

Figure 2: A 2D visualization of the unsupervised GMM clustering for the same sentences as in Figure 1.
Figure 3: A confusion matrix for clustering with k=5 using BERT-base.
Table 2: Sentences from one domain which were assigned to a cluster of another domain by the BERT-based clustering, k=5.

We can see that in many cases the assignments are sensible. For example, for sentences originating from the subtitles corpus, a sentence that mentions "great priest" is assigned to the Koran cluster, a sentence that mentions "The International Criminal Court in The Hague" is assigned to the Law cluster, a sentence that mentions "the virus" is assigned to the Medical cluster and so on. This strengthens our claim that defining domains based on the corpus they originated from may be over-simplistic, and that a more data-driven approach may make it possible to find better domain assignments across different corpora.

Subtitles assigned to Koran:
• I am Spa'am, high priest of the boars.
• Joseph, go in peace, and the Lord be with you.

Subtitles assigned to Medical:
• Oxygen supply at 50%.
• Or it can help her walk again if the virus is kept in check with this.

Subtitles assigned to IT:
• Push it up to the front of the screen.
• Polyalloy requires programming to take permanent form.

Subtitles assigned to Law:
• Statutes, transcripts, redacted immunity agreements.
• The Security Council therefore must press for his immediate referral to the International Criminal Court in The Hague.

Law assigned to Medical:
• - Viruses and virus-like organisms
• where the glucose content is equal to or less than the fructose content.

Law assigned to IT:
• "INFORMATION SOCIETY STATISTICS
• This document must be attached to the certificate and field with it, except where there is a computerised checking system.

Medical assigned to Law:
• This will be introduced by a Regulation adopted by the European Commission.
• The marketing authorisation was renewed on 22 May

Medical assigned to IT:
• An updated and improved version of the CD-ROM was issued to all subscribers during the first half of the year.
• - All tables will be based on generic and not product-specific

The domain that attracted the largest number of outliers is the IT domain cluster, with 597 sentences assigned to it from other domains. Looking more closely we find that more than half of these sentences (340 out of 597) included numbers (e.g. "34% 25% 34%" (from medical), "(b) reference number 20 is deleted;" (from law), "(Command of Prostration # 1)" (from Koran) or "The message, R2." (from subtitles)). As numbers appear in many different contexts, they may be harder to assign to a specific domain by the context-aware language models in such short sentences. The second largest attractor of outliers is the Subtitles cluster, with 372 sentences assigned to it from other domains. We find that most of these sentences contain personal pronouns or question marks (228 out of 372, 61.2%) while the ratio of such sentences in the entire corpus is only 40%. Examples include "Why did you choose the name & amarok;?" (from IT), or "What is Avonex?" (from Medical). This may be expected as the subtitles corpus mainly includes transcriptions of spoken, conversational language, and "conversation tends to have more verbs, more personal pronouns, and more questions" (Conrad and Biber, 2005). Another possible reason for the subtitles domain to attract outliers is the fact that this is the least-topical cluster: movies and TV series may discuss diverse topics, unlike medical, religious, legal and technical texts that may have a more cohesive topic.

3 Neural Machine Translation In A Multi-Domain Scenario

As we showed that pre-trained language models are indeed very useful in clustering sentence representations by domains in an unsupervised manner, we now seek to harness this property for a downstream task: domain data selection for machine translation. Domain data selection is the task of selecting examples from a large corpus which are as close as possible to the domain of interest, given a smaller set of in-domain examples. The selected examples can be used to either (1) train a domain-specific model from scratch (Axelrod et al., 2011), (2) fine-tune a pre-trained general-domain model (Sajjad et al., 2017; Silva et al., 2018), or (3) prioritize data for annotation as in an Active-Learning framework, if only monolingual data is available (Haffari et al., 2009). To demonstrate the need for domain data selection and set the stage for our data selection experiments, we perform preliminary experiments with NMT in a multi-domain scenario.

3.1 Multi-Domain Dataset

To simulate a diverse multi-domain setting we use the dataset proposed in Koehn and Knowles (2017), as it was recently adopted for domain adaptation research in NMT (Hu et al., 2019; Dou et al., 2019a,b). The dataset includes parallel text in German and English from five diverse domains (Medical, Law, Koran, IT, Subtitles; as discussed in Section 2), available via OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019). In a preliminary analysis of the data we found that in both the original train/dev/test split by Koehn and Knowles (2017) and in the more recent split by Müller et al. (2019) there was overlap between the training data and the dev/test data.


Domain       Original Split    New Split
Medical           1,104,752      248,099
Law                 715,372      467,309
IT                  378,477      222,927
Koran               533,128       17,982
Subtitles        22,508,639   14,458,058

Table 3: Number of training examples for each domain in the original split (Müller et al., 2019) and in our split.

Fixing these issues is important, as it may affect the conclusions one draws from experiments with this dataset. For example, as overlapping development sets favor memorization of the training set, one may choose checkpoints and report results on over-fitting models. This is especially relevant with neural sequence-to-sequence models, as they are highly susceptible to memorization (Aharoni and Goldberg, 2018) and hallucination.

To create a better experimental setting to test generalization within and across domains, we create a new data split where we ensure that no such overlap between the training, development and test sets occurs. We started from the split of Müller et al. (2019), as it included newer versions of some of the datasets. Furthermore, we did not allow more than one translation of a given source or target sentence, as such cases were very frequent in the dataset and usually stand for duplicate sentence pairs (see Table 3). For example, applying this filtering reduced the size of the Koran corpus from 533,128 sentence pairs to only 17,982. Finally, following Müller et al. (2019) we cap the subtitles corpus to 500,000 sentence pairs as it is much larger than the rest. We make the new split publicly available and hope it will enable better future experimentation on this important subject.
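The two filtering steps described above (allowing at most one translation per source or target sentence, and removing training pairs that overlap with the held-out sets) could look roughly like the sketch below. The exact matching rules used to build the released split are not specified here, so the exact-string matching and the helper names `deduplicate_pairs` and `remove_overlap` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def deduplicate_pairs(pairs: Sequence[Pair]) -> List[Pair]:
    """Keep a pair only if neither its source nor its target was seen before."""
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        if src in seen_src or tgt in seen_tgt:
            continue
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept


def remove_overlap(train: Sequence[Pair], heldout: Sequence[Pair]) -> List[Pair]:
    """Drop training pairs whose source or target also occurs in dev/test."""
    held_src = {src for src, _ in heldout}
    held_tgt = {tgt for _, tgt in heldout}
    return [(s, t) for s, t in train if s not in held_src and t not in held_tgt]


train = [("Hallo", "Hello"), ("Hallo", "Hi"), ("Danke", "Thanks"), ("Ja", "Yes")]
dev = [("Ja", "Yes")]
clean_train = remove_overlap(deduplicate_pairs(train), dev)
print(clean_train)  # [('Hallo', 'Hello'), ('Danke', 'Thanks')]
```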

3.2 Cross-Domain Experiments

Experimental Setup We follow Hu et al. (2019) and train domain-specific models for all domains. We then evaluate each model across the different domain test sets, enabling us to understand the effect of different domains on the downstream MT performance and to set up strong baselines for data selection experiments. We also train a general-domain model using the available data from all domains, as it is also a common approach in multi-domain scenarios. In all experiments we use a similar Transformer (Vaswani et al., 2017) model.

The results are shown in Table 4. In most cases, the best results for each domain are obtained by training on the in-domain data. Training on all the available data helped mostly for the Koran test set. This is expected as the training data for this domain is considerably smaller than the training data for the rest of the domains (Table 3). We can also see that more data is not necessarily better (Gascó et al., 2012): while the subtitles corpus is the largest of all 5 and includes 500,000 sentence pairs, it is second to last in performance as measured by the average BLEU across all test sets.

Table 4: SacreBLEU (Post, 2018) scores of our baseline systems on the test sets of the new data split. Each row represents the results from one model on each test set. The best result in each column is marked in bold.

Cross-Domain BLEU vs. Cluster Proximity An interesting observation can be made with respect to the visual analysis of the domain clusters as depicted in Figure 2: as the Medical cluster (in Yellow), Law cluster (in Purple) and IT cluster (in Red) are close to each other in the embedding space, their cross-domain BLEU scores are also higher. For example, note how in the results for the Medical domain-specific model (first row in Table 4), the BLEU scores on the Law and IT test sets are much higher in comparison to those on the Koran and Subtitles test sets, whose clusters are farther away in the visualized embedding space. Similarly, as the Subtitles cluster (Blue) is closer to the Koran cluster (Green), the highest cross-domain BLEU score on the Koran test set is from the Subtitles model. To further quantify this phenomenon, we plot and measure Pearson's correlation between the cosine similarity of the centroids for the English BERT-based dev sentence representations for each domain pair, and the cross-domain BLEU score for this domain pair. This is shown in Figure 4. We can see the general trend where the closer the domain centroids are (with a similarity of 1 for training and evaluating on the same domain), the higher the cross-domain BLEU is between those domains, resulting in a Pearson's correlation of 0.81 (strong correlation). This suggests that such preliminary visual analysis can be a useful tool for understanding the relationship between diverse datasets, and motivates the use of pre-trained language model representations for domain data selection in MT.

Figure 4: The cosine similarity between the centroids of the BERT representations for each domain pair vs. the corresponding cross-domain BLEU.
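The correlation analysis in this subsection can be sketched as below: compute a centroid per domain, take the cosine similarity of each centroid pair, and correlate those similarities with the cross-domain BLEU scores. The helper names and the numbers in the usage example are made up for illustration and are not values from the paper.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy numbers (NOT results from the paper): one
# (centroid similarity, cross-domain BLEU) point per domain pair.
similarities = [1.0, 0.9, 0.7, 0.4]
bleu_scores = [50.0, 30.0, 20.0, 5.0]
print(round(pearson(similarities, bleu_scores), 3))
```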

4 Domain Data Selection With Pretrained Language Models

As shown in the previous section, using the right data is critical for achieving good performance on an in-domain test set, and more data is not necessarily better. However, in real-world scenarios, the availability of data labeled by domain is limited, e.g. when working with large-scale, web-crawled data. In this section we focus on a data-selection scenario where only a very small number of in-domain sentences are used to select data from a larger unlabeled parallel corpus. An established method for data selection was proposed by Moore and Lewis (2010), which was also used in training the winning systems in WMT 2019 (Barrault et al., 2019). This method compares the cross-entropy, according to domain-specific and non-domain-specific language models, for each candidate sentence for selection. The sentences are then ranked by the cross-entropy difference, and only the top sentences are selected for training. While the method by Moore and Lewis (2010) is tried-and-true, it is based on simple n-gram language models which cannot generalize beyond the n-grams that are seen in the in-domain set. In addition, it is restricted to the in-domain and general-domain datasets it is trained on, which are usually small. In contrast, pre-trained language models are trained on massive amounts of text, and, as we showed through unsupervised clustering, learn representations with domain-relevant information. In the following sections, we investigate whether this property of pretrained language models makes them useful for domain data selection.
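To make the cross-entropy difference criterion concrete, here is a deliberately simplified sketch. It assumes add-one smoothed unigram language models as a stand-in for the n-gram models used in practice, whitespace tokenization, and hypothetical helper names; it illustrates the ranking rule only and is not the baseline implementation used in the experiments. Sentences with the lowest in-domain minus general-domain cross-entropy are ranked first.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set

UNK = "<unk>"  # placeholder for tokens outside the vocabulary (assumption)


def train_unigram(sentences: Sequence[str], vocab: Set[str]) -> Dict[str, float]:
    """Add-one smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(tok if tok in vocab else UNK
                     for sent in sentences for tok in sent.split())
    total = sum(counts.values()) + len(vocab)
    return {tok: math.log((counts[tok] + 1) / total) for tok in vocab}


def cross_entropy(sentence: str, logprobs: Dict[str, float]) -> float:
    """Per-token cross-entropy (in nats) of a sentence under a unigram model."""
    tokens = sentence.split()
    if not tokens:
        raise ValueError("cannot score an empty sentence")
    return -sum(logprobs.get(tok, logprobs[UNK]) for tok in tokens) / len(tokens)


def moore_lewis_rank(candidates: Sequence[str],
                     in_domain: Sequence[str],
                     general: Sequence[str]) -> List[str]:
    """Sort candidates by H_in(s) - H_general(s), lowest (most in-domain) first."""
    vocab = {tok for sent in list(in_domain) + list(general) for tok in sent.split()}
    vocab.add(UNK)
    lm_in = train_unigram(in_domain, vocab)
    lm_gen = train_unigram(general, vocab)
    return sorted(candidates,
                  key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen))


in_domain = ["the patient received the dose", "the dose was reduced"]
general = ["click the button to open the file", "open the file menu",
           "the patient received the dose"]
ranked = moore_lewis_rank(["open the file", "the dose was reduced today"],
                          in_domain, general)
print(ranked[0])  # the dose was reduced today
```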

4.1 Methods

We propose two methods for domain data selection with pretrained language models.

Domain-Cosine In this method we first compute a query vector, which is the element-wise average over the vector representations of the sentences in the small in-domain set. We use the same sentence-level average-pooling approach as described in Section 2 to obtain sentence representations. We then retrieve the most relevant sentences in the training set by computing the cosine similarity of each sentence with this query vector and ranking the sentences accordingly.
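A minimal sketch of the Domain-Cosine ranking described above, assuming sentence vectors have already been computed. The function name `domain_cosine_rank` and the toy 2-dimensional vectors are illustrative only.

```python
import math
from typing import List, Sequence


def domain_cosine_rank(in_domain_vectors: Sequence[Sequence[float]],
                       candidate_vectors: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    n = len(in_domain_vectors)
    # Query vector: element-wise average of the in-domain sentence vectors.
    query = [sum(column) / n for column in zip(*in_domain_vectors)]
    query_norm = math.sqrt(sum(x * x for x in query))

    def cosine(vec: Sequence[float]) -> float:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0 or query_norm == 0.0:
            return 0.0  # treat zero vectors as maximally dissimilar
        return sum(q * x for q, x in zip(query, vec)) / (query_norm * norm)

    scores = [cosine(vec) for vec in candidate_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


in_domain = [[1.0, 0.0], [0.8, 0.2]]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
print(domain_cosine_rank(in_domain, candidates))  # [1, 2, 0]
```

Taking the first k indices of the returned ranking gives the selected subset.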

Domain-Finetune It is now common knowledge that pretrained language models are especially useful when fine-tuned for the task of interest in an end-to-end manner (Ruder et al., 2019). In this method we fine-tune the pretrained LM for binary classification, where we use the in-domain sentences as positive examples, and randomly sampled general-domain sentences as negative examples. We then apply this classifier on the general-domain data and either pick the sentences that are classified as positive (i.e. in-domain), or choose the top-k sentences as ranked by the classifier output distribution. This can be seen as an instance of positive-unlabeled learning for document-set expansion; see Jacovi et al. (2019) for a recent discussion and methodology for this task.
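The two selection rules mentioned above can be sketched as follows, given the classifier's positive-class probability for each general-domain sentence. The 0.5 decision threshold and the function names are assumptions for illustration; the fine-tuning step itself is not shown.

```python
from typing import List, Sequence


def select_positive(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Indices of sentences classified as in-domain (probability above threshold)."""
    return [i for i, p in enumerate(scores) if p > threshold]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k sentences with the highest in-domain probability."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy positive-class probabilities for four candidate sentences.
probabilities = [0.9, 0.2, 0.6, 0.4]
print(select_positive(probabilities))  # [0, 2]
print(select_top_k(probabilities, 3))  # [0, 2, 3]
```

The first rule needs no preset selection size, while the second always returns exactly k sentences.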

Negative Sampling with Pre-ranking One problem that may arise when randomly sampling negative examples is that unlabeled in-domain sentences from the general-domain data may be sampled as negative examples, deteriorating the classifier performance. To alleviate this issue, we perform a biased sampling of negative examples. We first rank the general-domain data using the Domain-Cosine method, and then sample negative examples under a certain threshold in the ranking (in our experiments we sampled from the bottom two-thirds). Table 5 shows an ablation for such pre-ranking, measuring precision, recall and F1 for binary classification on a held-out set for each domain. When not using pre-ranking, as the training data for the domain is larger, the precision is lower, since more in-domain examples are drawn as negative samples. Using pre-ranking indeed alleviates this issue, achieving higher F1 scores in all cases. Given the results in Table 5 we always use pre-ranking in the following experiments.

Table 5: Ablation analysis showing precision (p), recall (r) and F1 for binary classification on a held-out set, with and without pre-ranking.
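A sketch of negative sampling with pre-ranking, assuming the general-domain sentences have already been ranked by similarity to the in-domain query (most similar first). The function name, the fixed seed and the integer cut at one third of the ranking are illustrative assumptions; the text only states that negatives are drawn from the bottom two-thirds.

```python
import random
from typing import List, Sequence


def sample_negatives(ranked_indices: Sequence[int],
                     num_negatives: int,
                     seed: int = 0) -> List[int]:
    """Sample negatives from the bottom two-thirds of a similarity ranking.

    ranked_indices must be ordered from most to least similar to the
    in-domain query, so the top third (likely in-domain) is excluded.
    """
    cutoff = len(ranked_indices) // 3  # size of the excluded top third
    pool = list(ranked_indices[cutoff:])
    rng = random.Random(seed)
    # random.sample draws without replacement and raises ValueError
    # if num_negatives exceeds the pool size.
    return rng.sample(pool, num_negatives)


ranking = list(range(9))  # index 0 is the most similar sentence
negatives = sample_negatives(ranking, num_negatives=4)
print(sorted(negatives))
```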

4.2 Experimental Setup

We perform data selection experiments for each domain in the multi-domain dataset. As the small set of monolingual in-domain data we take the 2000 development sentences from each domain. For the general-domain corpus we concatenate the training data from all domains, resulting in 1,456,317 sentences. To enable faster experimentation we used DistilBERT for the Domain-Cosine and Domain-Finetune methods. More technical details are available in the supplementary material. We compare our methods to four approaches:

(1) The established method by Moore and Lewis (2010), (2) a random selection baseline, (3) an oracle which is trained on all the available in-domain data, and (4) the model we train on all the domains concatenated. We select the top 500k examples to cover the size of every specific in-domain dataset. We train Transformer NMT models on the selected data with a similar configuration to the ones trained in the cross-domain evaluation.

4.3 Results

The results are available in Table 6. We can see that all selection methods performed much better in terms of BLEU than random selection. It is also encouraging that all selection methods performed better than using all the available data or the oracle-selected data when averaged across all domains, showing again that more data is not necessarily better in multi-domain scenarios and that data selection is a useful approach. Regarding a comparison of the data selection methods, Moore-Lewis performed better than Domain-Cosine, while Domain-Finetune performed best, showing the benefit of fine-tuning large pretrained models for the data selection task. Using the positively-labeled examples alone (Domain-Finetune-Positive) performed worse than using the top 500k examples but better than Domain-Cosine, while not requiring one to determine the number of selected sentences.

Table 6: SacreBLEU scores for the data selection experiments. Highest scores per column are marked in bold.

4.4 Analysis

We perform an analysis on the selected datasets, where we measure the precision and recall of sentence selection with respect to the oracle selection. The results are available in Table 7. As also reflected in the BLEU scores, the Domain-Finetune method resulted in the highest domain recall with a minimum of 97.5, while Moore-Lewis and Domain-Cosine scored 89.4 and 78.8 respectively. We find these results very appealing given that only 2000 in-domain sentences were used for selection for each domain out of 1.45 million sentences. Also note that we used DistilBERT in these experiments: we believe that using larger, non-distilled models may result in even better selection performance (although at the price of larger computational requirements).

Table 7: Precision (p) and recall (r) for data selection of 500k sentences with respect to the oracle selection.
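Precision and recall with respect to the oracle selection can be computed as in the sketch below, where both the selected set and the oracle set are given as collections of sentence identifiers. The function name and the toy identifiers are assumptions for illustration.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(selected: Iterable[Hashable],
                     oracle: Iterable[Hashable]) -> Tuple[float, float]:
    """Precision and recall of a selected set against the oracle in-domain set."""
    selected_set, oracle_set = set(selected), set(oracle)
    true_positives = len(selected_set & oracle_set)
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(oracle_set) if oracle_set else 0.0
    return precision, recall


# Toy example: 4 sentences selected, 3 truly in-domain, 2 in common.
p, r = precision_recall(selected=[1, 2, 3, 4], oracle=[2, 3, 5])
print(p, r)
```

Because a fixed number of sentences is selected regardless of the true in-domain set size, precision can be low for small domains even when recall is high.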

5 Related Work

Previous works used n-gram LMs for data selection (Moore and Lewis, 2010; Axelrod et al., 2011) or other count-based methods (Axelrod, 2017; Parcheta et al., 2018; Santamaría and Axelrod, 2019). While such methods work well in practice, they cannot generalize beyond the n-grams observed in the in-domain datasets, which are usually small. Duh et al. (2013) proposed to replace n-gram models with RNN-based LMs with notable improvements. However, such methods do not capture the rich sentence-level global context as in the recent self-attention-based MLMs; as we showed in the clustering experiments, autoregressive neural LMs were inferior to masked LMs in clustering the data by domain. In addition, training very large neural LMs may be prohibitive without relying on pre-training.

Regarding domain clustering for MT, Hasler et al. (2014) discovered topics using LDA instead of using domain labels. Cuong et al. (2016) induced latent subdomains from the training data using a dedicated probabilistic model.

Many works used vector-based retrieval for data selection; Ruder and Plank (2017) learn to select data using Bayesian optimization, and explored word2vec for that purpose. Duma and Menzel (2016) create paragraph vectors for data selection in the context of SMT. Wang et al. (2017) use internal representations from the NMT model to perform data selection. Bapna and Firat (2019) propose a mechanism for incorporating retrieved sentences for each instance for domain adaptation in NMT, using representations extracted from a pre-trained NMT model. Farajian et al. (2017) explored instance-based data selection in a multi-domain scenario using information retrieval methods.

Other related works on domain adaptation include Dou et al. (2019a), who adapt multi-domain NMT models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Peris et al. (2017) proposed neural-network based classifiers for data selection in SMT. For more related work on data selection and domain adaptation in the context of MT, see the surveys by Eetemadi et al. (2015) for SMT and more recently Chu and Wang (2018) for NMT.

Unrelated to MT, Ma et al. (2019) used BERT to select data for tasks from the GLUE benchmark. However, they assumed supervision for all the different tasks/domains, while we propose an unsupervised method requiring only a small set of in-domain data. Also in the context of pretrained language models, Gururangan et al. (2020) show the importance of additional pretraining with in-domain data to improve the downstream task-specific performance.

While previous work made important contributions to domain data selection, our work is the first to explore massive pretrained language models for both unsupervised domain clustering and for data selection in NMT.

6 Conclusions And Future Work

We showed that massive pre-trained language models are highly effective in mapping data to domains in a fully-unsupervised manner using average-pooled sentence representations and GMM-based clustering. We suggest that such clusters are a more appropriate, data-driven approach to domains in natural language than simplistic labels (e.g. "medical text"), and that it will improve over time as better and larger pretrained LMs become available. We proposed new methods to harness this property for domain data selection using distance-based ranking in vector space and pretrained LM fine-tuning, requiring only a small set of in-domain data. We demonstrated the effectiveness of our methods on a new, improved data split we created for a previously studied multi-domain machine translation benchmark. Our methods perform similarly or better than an established data selection method and oracle in-domain training across all five domains in the benchmark.

This work just scratches the surface of what can be done on the subject; possible avenues for future work include extending this with multilingual data selection and multilingual LMs (Conneau and Lample, 2019; Wu et al., 2019; Hu et al., 2020), using such selection methods with domain-curriculum training (Zhang et al., 2019), applying them on noisy, web-crawled data (Junczys-Dowmunt, 2018) or for additional tasks (Gururangan et al., 2020). Another interesting avenue is applying this to unsupervised NMT, which is highly sensitive to domain mismatch (Marchisio et al., 2020; Kim et al., 2020). We hope this work will encourage more research on finding the right data for the task, towards more efficient and robust NLP.

A Appendix

A.1 NMT Training

Figure 5 details the hyperparameter configuration we used to train the NMT models. We use Transformer models (Vaswani et al., 2017) in the Base configuration using the implementation provided in Fairseq. For all models we use a joint BPE vocabulary (Sennrich et al., 2016) learned with 32k merge operations over the concatenated corpus in both languages, which makes it possible to tie all the embedding layers (Press and Wolf, 2017). We perform early stopping if the BLEU score on the domain-specific development set did not improve in 10 consecutive checkpoints. We use the ADAM (Kingma and Ba, 2014) optimizer with an initial learning rate of $5 \cdot 10^{-4}$ and a maximum of 4096 tokens per batch. We trained all models on a single NVIDIA GPU. We decode using beam search with a beam size of 5. For pre-processing we used the Moses (Koehn et al., 2007) pipeline including tokenization, normalize-punctuation, non-printing character removal, truecasing and cleaning. We removed examples with sequences longer than 100 tokens from the training data (before subword segmentation).

Table 8 shows details about the overlap between the training, development and test sets for the different data splits of the multi-domain dataset. The overlap was computed using the English part of the corpus.
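The early-stopping rule mentioned in A.1 (stop when the development BLEU has not improved in 10 consecutive checkpoints) can be sketched independently of any toolkit. The function name `should_stop` and the toy BLEU history are hypothetical; this is not Fairseq's implementation.

```python
from typing import Sequence


def should_stop(dev_bleu_history: Sequence[float], patience: int = 10) -> bool:
    """True if none of the last `patience` checkpoints beat the earlier best BLEU."""
    if len(dev_bleu_history) <= patience:
        return False  # not enough checkpoints to judge yet
    best_before = max(dev_bleu_history[:-patience])
    best_recent = max(dev_bleu_history[-patience:])
    return best_recent <= best_before


# Toy history: BLEU peaks at the second checkpoint and then stays below it
# for 10 checkpoints, so training would stop.
history = [10.0, 12.0] + [11.0] * 10
print(should_stop(history))  # True
```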