Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training
Authors
Abstract
We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.
1 Introduction
This work deals with neural language modeling of code-switched language, motivated by an application to speech recognition. Code-switching (CS) is a linguistic phenomenon defined as "the alternation of two languages within a single discourse, sentence or constituent" (Poplack, 1980). Since CS is widely used in spoken platforms, dealing with code-switched language becomes an important challenge for automatic speech recognition (ASR) systems. To get a sense of how an ASR system trained on monolingual data performs on code-switched language, we fed the IBM English and Spanish systems 1 with audio files of code-switched conversations from the Bangor Miami Corpus (see Section 7). The results (examples available in Table 1) exhibit two failure modes:
(1) Words and sentences in the opposite language are not recognized correctly; and (2) Code-switch points also hurt recognition of words from the main language. This demonstrates the need for designated speech recognition systems for CS. A crucial component in such a CS ASR system is a strong CS language model, which is used to rank the bilingual candidates produced by the ASR system for a given acoustic signal.
Language models are traditionally evaluated with perplexity. However, this measure suffers from several shortcomings, in particular strong dependence on vocabulary size and lack of ability to directly evaluate scores assigned to malformed sentences. We address these deficiencies by presenting a new evaluation scheme for LM that simulates an ASR setup. Rather than requiring the LM to produce a probability for a gold sentence, we instead present the LM with a set of alternative sentences, including the gold one, and require it to rank the gold one higher. This evaluation is more realistic since it simulates the role a language model plays in the process of converting audio into text (Section 3). We create such an evaluation dataset for English-Spanish CS, SpanglishEval (Section 4).
Additionally, LM for CS poses a unique challenge: while data for training monolingual LMs is easy to obtain in large quantities, CS occurs primarily in spoken language. This severely limits data availability; thus, small amounts of training data are an intrinsic property of CS LM. A natural approach to overcome this problem is to train monolingual language models (for which we have huge amounts of data) for each language separately, and combine them somehow into a code-switched LM. While this is relatively straightforward to do in an n-gram language model, it is not obvious how to perform such an LM combination in a non-markovian, RNN-based language model. We propose an effective protocol for LSTM-based CS LM training which can take advantage of monolingual data (Section 5).

Table 1: Examples of the output of IBM's English and Spanish speech recognition systems on code-switched audio.
Example 1
Original sentence (audio): no pero vino porque he came to check on my machine
English model output: No but we no longer he came to check on my machine
Spanish model output: Cual provino de que en dicha cámara chino
Example 2
Original sentence (audio): yo le invité pa Easter porque me daba pena el pobre aquí solo sin familia ni nada
English model output: feeling betrayed by eastern lap and I bought a new solo seem funny any now
Spanish model output: y el envite país tampoco nada pero por aquí solo sin familia ni nada
Based on the new evaluation scheme we present, we further propose to learn a model for this ranking task using discriminative training. This model, as opposed to LMs, no longer depends on estimating next-word probabilities for the entire vocabulary. Instead, during training the model is presented with positive and negative examples and is encouraged to prefer the positive examples over the negative ones. This model gives significantly better results (Section 6).
Our contributions in this work are four-fold: (a) We propose a new, vocabulary-size independent evaluation scheme for LM in general, motivated by ASR. This evaluation scheme is ranking-based and also suits CS LM; (b) We describe a process for automatically creating such datasets, and provide a concrete evaluation dataset for English-Spanish CS (SpanglishEval); (c) We present an effective protocol for training CS LM for this ranking task using pretrained monolingual models and show significant improvement over various baselines; (d) We present a model for this new ranking task that uses discriminative training and is decoupled from probability estimation; this model surpasses the standard LMs.
The CS LM evaluation dataset and the code for the model are available at the GitHub repository: https://github.com/gonenhila/codeswithing-lm.
2 Background
Code-Switching Code-switching (CS) is defined as the use of two languages in the same discourse (Poplack, 1980). The mixing of different languages at various levels has been widely studied from social and linguistic points of view (Auer, 1999; Muysken, 2000; Bullock and Toribio, 2009), and has also started getting attention in the NLP community in the past few years (Solorio and Liu, 2008; Adel et al., 2013a; Cotterell et al., 2014).
Below is an example of code-switching between Spanish and English (taken from the Bangor Miami corpus described in Section 7). Translation to English follows:
• "that es su tío that has lived with him like I don't know how like ya several years..." that his uncle who has lived with him like, I don't know how, like several years already...
Code-switching is becoming increasingly popular, mainly among bilingual communities. Yet, one of the main challenges when dealing with CS is the limited data and its unique nature: it is usually found in non-standard platforms such as spoken data and social media, and accessing it is not trivial (Çetinoğlu et al., 2016).
Shortcomings of Perplexity-based Evaluation for LM The most common evaluation measure of language models is perplexity. Given a language model $M$ and a test sequence of words $w_1, \ldots, w_N$, the perplexity of $M$ over the sequence is defined as:
$$2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 M(w_i)}$$
where $M(w_i)$ is the probability the model assigns to $w_i$.
A better model is expected to give higher probability to sentences in the test set, that is, lower perplexity. However, this measure is not always well aligned with the quality of a language model. For example, Tran et al. (2018) show that RNNs capture long-distance syntactic dependencies better than attention-only LMs, despite having higher (worse) perplexity. Similarly, better perplexities often do not translate to better word-error-rate (WER) scores in an ASR system (Huang et al., 2018).
This highlights a shortcoming of perplexity-based evaluation: the method is rewarded for assigning high probability to gold sentences, but is not directly penalized for assigning high probability to highly implausible sentences. When used in a speech recognition setup, the LM is expected to do just that: score correct sentences above incorrect hypotheses.
Another shortcoming of perplexity-based evaluation is that it requires the compared models to have the same support (in other words, the same vocabulary). Simply adding words to the vocabulary, even if no additional change is done to the model, will likely result in higher perplexity for the same dataset. For the same reason, it is also problematic to compare perplexities of word-based LMs and character-based LMs.
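The dependence on the support can be illustrated with a small sketch. It is not the paper's model: it assumes an add-one smoothed unigram LM (chosen here only because its probabilities are easy to check by hand), and the `extra…` vocabulary entries are hypothetical words that never occur in the training data.

```python
import math
from collections import Counter

def perplexity(probs):
    """Perplexity 2^(-(1/N) * sum_i log2 p_i) of per-token probabilities."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

def add_one_unigram(train_tokens, vocab):
    """Add-one smoothed unigram model over an explicit vocabulary (the support)."""
    counts = Counter(train_tokens)
    total = len(train_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

train = "no pero vino porque he came to check on my machine".split()
test = "he came to check".split()

small_vocab = set(train)
# Hypothetical extra words: same training data, larger support.
large_vocab = small_vocab | {"extra%d" % i for i in range(20)}

model_small = add_one_unigram(train, small_vocab)
model_large = add_one_unigram(train, large_vocab)

ppl_small = perplexity([model_small[w] for w in test])
ppl_large = perplexity([model_large[w] for w in test])
print(ppl_small, ppl_large)  # the larger support yields the higher perplexity
```

Nothing about the training data or the test sentence changes between the two models; only the probability mass reserved for unseen vocabulary entries does, which is enough to move the perplexity.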
Problem with WER-based evaluation Evaluating LM models based on the final WER of an ASR system side-steps these two issues: it evaluates the LM on incorrect sentences, and seamlessly compares LMs with different support. However, this makes the evaluation procedure highly dependent on a particular ASR system. This is highly undesirable, as ASR systems are hard to set up and tune and are not standardized. This both conflates the LM performance with performance of other aspects of the ASR system, and makes it hard to replicate the evaluation procedure and fairly compare results across different publications. Indeed, as discussed in Section 9, most previous works on CS LM use an ASR system as part of their evaluation setup, and none of them compares to any other work. Moreover, no standard benchmark or evaluation setup exists for CS LM.
3 An Alternative Evaluation Method
We propose an evaluation technique for language models that is independent of any ASR system, takes incorrect sentences into account, and allows comparing LMs with different support or OOV handling techniques.
We seek a method that meets the following requirements: (1) Prefers language models that prioritize correct sentences over incorrect ones; (2) Does not depend on the support (vocabulary) of the language model; (3) Is independent of and not coupled with a speech recognition system.
To meet these criteria, we propose to assemble a dataset comprising gold sentences, where each gold sentence is associated with a set of alternative sentences, and require the LM to identify the gold sentence in each set. The alternative sentences in a set should be related to the gold sentence. Following the ASR motivation, we take the alternative set to contain sentences that are phonetically related (similar-sounding) to the gold sentence. This setup simulates the task an LM is intended to perform as a part of an ASR system: given several candidates, all originating from the same acoustic signal, the LM should choose the correct sentence over the others.
A New Evaluation metric Given this setup, we propose to use the accuracy metric: the percentage of sets in which the LM (or other method) successfully identified the gold sentence among the alternatives. 2 The natural way of using an LM for identifying the gold sentences is assigning a probability to each sentence in the set, and choosing the one with highest probability. Yet, the scoring mechanism is independent of perplexity, and addresses the two deficiencies of perplexity-based evaluation discussed above.
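A minimal sketch of this accuracy computation follows. The function name `rank_accuracy`, the convention that the gold sentence is stored first in each set, the strict-inequality tie handling, and the toy lexicon-overlap scorer are all assumptions made for illustration; any sentence-level scorer (an LM log-probability or a discriminative score) could be plugged in instead.

```python
from typing import Callable, List, Sequence

def rank_accuracy(sets: Sequence[List[str]], score: Callable[[str], float]) -> float:
    """Percentage of sets whose gold sentence (index 0) scores strictly highest."""
    correct = 0
    for sentences in sets:
        scores = [score(s) for s in sentences]
        gold, rest = scores[0], scores[1:]
        if all(gold > other for other in rest):
            correct += 1
    return 100.0 * correct / len(sets)

# Toy stand-in scorer: fraction of tokens found in a tiny hypothetical lexicon.
LEXICON = {"vamos", "a", "ser", "juntos", "twenty", "four", "seven"}

def toy_score(sentence: str) -> float:
    tokens = sentence.split()
    return sum(t in LEXICON for t in tokens) / len(tokens)

eval_sets = [
    # Gold sentence and two alternatives taken from the paper's Figure 1.
    ["vamos a ser juntos twenty four seven .",
     "vamos a ser juntos when de for saben .",
     "follows a santos twenty for seven ."],
    # Second set: the alternative is invented so that the toy scorer fails.
    ["no pero vino", "a ser juntos"],
]

accuracy = rank_accuracy(eval_sets, toy_score)
print(accuracy)
```

Because only the ranking inside each set matters, the scorer does not need to be a normalized probability, which is what makes the metric usable across models with different vocabularies.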
Our proposed evaluation method is similar in concept to the NMT evaluation proposed by Sennrich (2016). There, the core idea is to measure whether a reference translation is more probable under an NMT model than a contrastive translation which introduces a specific type of error.
4 Evaluation Dataset
We now turn to construct such an evaluation dataset. One method of obtaining sentence-sets is feeding acoustic waves into an ASR system and tracking the resulting lattices. However, this requires access to the audio signals, as well as a trained system for the relevant languages. We propose instead a generation method that does not require access to an ASR system.
The dataset we create is designed for evaluating English-Spanish CS LMs, but the creation process can be applied to other languages and language pairs. 3 The process of alternative sentence creation is as follows: (1) Convert a sequence of language-tagged words (either CS or monolingual) into a sequence of the matching phonemes (using pronunciation dictionaries); (2) Decode the sequence of phonemes into new sentences, which include words from either language (possibly both); (3) When decoding, allow minor changes in the sequence of phonemes to accommodate the differences between the languages. These steps can be easily implemented using composition of finite-state transducers (FSTs). 4 For each gold sentence (which can be either code-switched or monolingual) we create alternatives of all three types: (a) code-switched sentences, (b) sentences in L1 only, (c) sentences in L2 only.

Example:
Gold: vamos a ser juntos twenty four seven .
Alt-CS: vamos a ser juntos when de for saben .
Alt-EN: follows a santos twenty for seven .
Alt-SP: vamos hacer junto sentí for saben .
Figure 1: Example from the English-Spanish evaluation dataset. The first sentence is the gold sentence, followed by a generated code-switched alternative, a generated English alternative, and a generated Spanish one. Blue (normal) marks English, while red (italic) marks Spanish.
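Steps (1) and (2) of the creation process can be sketched without an FST toolkit. The code below is a toy stand-in, not the authors' transducer-based implementation: the pronunciation entries are illustrative assumptions rather than real dictionary lookups, and decoding is done by exhaustive recursive segmentation instead of FST composition. Step (3), the phoneme changes, is deliberately left out here.

```python
# Toy pronunciation dictionaries per language; entries are illustrative only.
LEXICON = {
    "eng": {"see": ("S", "IY"), "four": ("F", "AO", "R"), "for": ("F", "AO", "R")},
    "spa": {"si": ("S", "IY")},
}

def to_phonemes(tagged_words):
    """Step (1): map language-tagged words to one flat phoneme sequence."""
    phonemes = []
    for word, lang in tagged_words:
        phonemes.extend(LEXICON[lang][word])
    return phonemes

def decode(phonemes, langs):
    """Step (2): all ways to segment the phonemes into words of the allowed languages."""
    if not phonemes:
        return [[]]
    results = []
    for lang in langs:
        for word, pron in LEXICON[lang].items():
            n = len(pron)
            if tuple(phonemes[:n]) == pron:
                for tail in decode(phonemes[n:], langs):
                    results.append([(word, lang)] + tail)
    return results

def text(sentence):
    return " ".join(word for word, _ in sentence)

gold = [("si", "spa"), ("four", "eng")]
phonemes = to_phonemes(gold)

all_sentences = decode(phonemes, ["eng", "spa"])                 # either language
english_only = [text(s) for s in decode(phonemes, ["eng"])]      # L1 only
spanish_only = [text(s) for s in decode(phonemes, ["spa"])]      # L2 only
alternatives = [s for s in all_sentences if s != gold]
print(english_only, spanish_only, [text(s) for s in alternatives])
```

In this toy lexicon the Spanish-only decoding is empty because no Spanish entry matches the phonemes of "four"; this is the situation that allowing minor phoneme changes in step (3) is meant to address.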
We created such a dataset for English-Spanish based on the Bangor Miami corpus. Figure 1 shows an example from the dataset, in which the gold sentence is followed by a single code-switched alternative, a single English alternative, and a single Spanish one (a subset of the full set).
4.1 Technical Details
The gold sentences are taken from the Bangor Miami Corpus (Section 7).
When creating code-switched alternatives, we want to encourage the creation of sentences that include both languages, and that differ from each other. This is done with scores determined by some heuristics (such as preferring sentences that include more words from the language that was less dominant in the original one). We create 1000-best alternatives from the FST, re-score them according to the heuristic and keep the top 10.
We discard sets in which the gold sentence has fewer than three words (excluding punctuation), and also sets with fewer than 5 alternatives in one of the three types. We randomly choose 250 sets in which the gold sentence is code-switched, and 750 sets in which the gold sentence is monolingual, both for the development set and for the test set. This percentage of CS sentences is higher than in the underlying corpus in order to aid a meaningful comparison between the models regarding their ability to prefer gold CS sentences over alternatives. The statistics of the dataset are detailed in Table 2 .
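The filtering and sampling rules can be sketched as follows. The function names, the dictionary layout for the three alternative types, and the fixed random seed are assumptions for illustration; only the thresholds (three words, five alternatives per type) and the 250/750 split come from the text.

```python
import random
import string

def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)

def keep_set(gold_tokens, alternatives_by_type, min_words=3, min_alts=5):
    """Keep a set only if the gold sentence has at least `min_words` non-punctuation
    tokens and every alternative type has at least `min_alts` alternatives."""
    words = [t for t in gold_tokens if not is_punctuation(t)]
    if len(words) < min_words:
        return False
    return all(len(alts) >= min_alts for alts in alternatives_by_type.values())

def sample_eval_sets(cs_sets, mono_sets, n_cs=250, n_mono=750, seed=0):
    """Draw the code-switched and monolingual gold sets for one split (dev or test)."""
    rng = random.Random(seed)
    return rng.sample(cs_sets, n_cs) + rng.sample(mono_sets, n_mono)

enough = {"cs": ["alt"] * 5, "l1": ["alt"] * 5, "l2": ["alt"] * 5}
too_few = {"cs": ["alt"] * 5, "l1": ["alt"] * 4, "l2": ["alt"] * 5}

print(keep_set(["vamos", "a", "ser", "."], enough))   # kept
print(keep_set(["oh", "no", "."], enough))            # gold too short
print(keep_set(["vamos", "a", "ser", "."], too_few))  # one type too small
```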
Further details regarding the implementation can be found in Appendix A.1.
5 Using Monolingual Data for CS LM
Data for code-switching is relatively scarce, while monolingual data is easy to obtain. The question then is how do we efficiently incorporate monolingual data when training a CS LM?
We present an effective training protocol (FINETUNED) for incorporating monolingual data into the language model. We first pre-train a model with monolingual sentences from both English and Spanish. This essentially trains two monolingual models, one for English and one for Spanish, but with full sharing of parameters. Note that in this pre-training phase, the model is not exposed to any code-switched sentence.
We then use the small amount of available code-switched data to further train the model, making it familiar with code-switching examples that mix the two languages. This fine-tuning procedure enables the model to learn to correctly combine the two languages in a single sentence.
We show in Section 8 that adding the CS data only at the end, in the described manner, works substantially better than several alternatives.
6 Discriminative Training
Our new evaluation method gives rise to training models that are designed for the ranking task. As our main goal is to choose a single best sentence out of a set of candidate sentences, we can focus on training models that score whole sentences with discriminative training, rather than using the standard probability setting of LMs. The discriminative approach frees us from the burden of estimating probability distributions over all the words in the vocabulary and allows us to simply create representations of sentences and match them with scores.
Using negative examples is not straightforward in LM training, but is essential for speech recognition, where the language model needs to distinguish between "good" and "bad" sentences. Our discriminative training is based on the idea of using both positive and negative examples during training. The training data we use is similar in nature to that of our test data: sets of sentences in which only a single one is a genuine example collected from a real CS corpus, while all the others are synthetically created.
During training, we aim to assign the gold sentence a higher score than the others. For each sentence, we require the difference between its score and the score of the gold sentence to be at least as large as its WER with respect to the gold sentence. This way, the farther a sentence is from the gold one, the lower its score is.
Formally, let $s_1$ be the gold sentence, and $s_2, \ldots, s_m$ be the other sentences in that set. The loss of the set is the sum of losses over all sentences, except for the gold one:
$$\sum_{i=2}^{m} \max\left(0,\ \mathrm{WER}(s_1, s_i) - [\mathrm{score}(s_1) - \mathrm{score}(s_i)]\right)$$
where $\mathrm{score}(s_i)$ is computed by the multiplication of the representation of $s_i$ and a learned vector $w$:
$$\mathrm{score}(s_i) = w \cdot \mathrm{repr}(s_i)$$
A sentence is represented with its BiLSTM representation: the concatenation of the final states of the forward and backward LSTMs. Formally, a sentence $s = w_1, \ldots, w_n$ is represented as follows:
$$\mathrm{repr}(s) = \mathrm{LSTM}(w_1, \ldots, w_n) \circ \mathrm{LSTM}(w_n, \ldots, w_1)$$
where $\circ$ is the concatenation operation.
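The set-level loss can be sketched in plain Python. Two assumptions are made here: WER is taken to be the standard word-level edit distance normalized by the length of the gold sentence (the text does not spell out the normalization), and the BiLSTM scorer is replaced by an arbitrary `score` callable, with toy scorers used only to exercise the loss.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def set_loss(gold, alternatives, score):
    """Sum over alternatives of max(0, WER(gold, alt) - [score(gold) - score(alt)])."""
    gold_score = score(gold)
    return sum(max(0.0, wer(gold, alt) - (gold_score - score(alt)))
               for alt in alternatives)

gold = "vamos a ser juntos twenty four seven ."
alt = "vamos a ser juntos when de for saben ."

def flat_score(sentence):
    """Scorer that cannot tell sentences apart: every margin is zero."""
    return 0.0

def good_score(sentence):
    """Scorer that separates the gold sentence by a margin of 1."""
    return 1.0 if sentence == gold else 0.0

print(wer(gold, alt))
print(set_loss(gold, [alt], flat_score))  # penalized by the full WER
print(set_loss(gold, [alt], good_score))  # margin already exceeds the WER
```

The hinge makes the required margin proportional to how wrong the alternative is, so near-misses only need to be scored slightly below the gold sentence while distant alternatives must be pushed further down.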
Incorporating monolingual data In order to use monolingual data in the case of discriminative training, we follow the same protocol: as a first step, we create alternative sentences for each monolingual sentence from the monolingual corpora. We train a model using this data, and as a next step, we fine-tune this model with the sets of sentences that are created from the CS corpus.
7 Empirical Experiments
Models and Baselines We report results on two models that use discriminative training: CS-ONLY-DISCRIMINATIVE only trains on data that is created based on the small code-switched corpus, while FINE-TUNED-DISCRIMINATIVE first trains on data created based on the monolingual corpora and is then fine-tuned using the data created from the code-switched corpus.
We compare our models to several baselines, all of which use standard LM training: ENGLISH-ONLY-LM and SPANISH-ONLY-LM train on the monolingual data only. Two additional models train on a combination of the code-switched corpus and the two monolingual corpora: the first (ALL:SHUFFLED-LM) trains on all sentences (monolingual and code-switched) presented in a random order. The second (ALL:CS-LAST-LM) trains each epoch on the monolingual datasets followed by a pass on the small code-switched corpus. The models CS-ONLY-LM and FINE-TUNED-LM are the equivalents of CS-ONLY-DISCRIMINATIVE and FINE-TUNED-DISCRIMINATIVE but with standard LM training.
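The three ways of combining monolingual and code-switched data differ only in the order in which sentences reach the model. The sketch below makes that ordering explicit; the function name `epoch_plan`, the epoch counts, and the within-epoch shuffling are assumptions for illustration, and the actual parameter updates are left out.

```python
import random

def epoch_plan(regime, english, spanish, cs,
               pretrain_epochs=2, finetune_epochs=2, seed=0):
    """Return one list of training sentences per epoch for the given regime."""
    rng = random.Random(seed)
    mono = list(english) + list(spanish)
    total_epochs = pretrain_epochs + finetune_epochs

    def shuffled(data):
        data = list(data)
        rng.shuffle(data)
        return data

    if regime == "ALL:SHUFFLED":
        # Monolingual and code-switched sentences mixed in random order.
        return [shuffled(mono + list(cs)) for _ in range(total_epochs)]
    if regime == "ALL:CS-LAST":
        # Every epoch: monolingual data first, then a pass over the CS corpus.
        return [shuffled(mono) + shuffled(cs) for _ in range(total_epochs)]
    if regime == "FINE-TUNED":
        # Monolingual-only pre-training, then CS-only fine-tuning.
        return ([shuffled(mono) for _ in range(pretrain_epochs)]
                + [shuffled(cs) for _ in range(finetune_epochs)])
    raise ValueError("unknown regime: %s" % regime)

english, spanish, cs = ["e1", "e2"], ["s1"], ["c1"]
fine_tuned = epoch_plan("FINE-TUNED", english, spanish, cs)
cs_last = epoch_plan("ALL:CS-LAST", english, spanish, cs)
all_shuffled = epoch_plan("ALL:SHUFFLED", english, spanish, cs)
print(fine_tuned)
```

All three regimes see the same sentences; only FINE-TUNED keeps the code-switched data out of the early epochs entirely and ends with epochs that contain nothing else.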
Code-switching corpus We use the Bangor Miami Corpus, consisting of transcripts of conversations by Spanish-speakers in Florida, all of whom are bilingual in English. 5 We split the sentences (45,621 in total) to train/dev/test with ratio of 60/20/20 respectively, and evaluate perplexity on the dev and test sets.
The dataset described in Section 4 is based on sentences from the dev and test sets, and serves as our main evaluation method.
Monolingual corpora The monolingual corpora used for training the English and Spanish monolingual models are taken from the OpenSubtitles2018 corpus (Tiedemann, 2009), 6 of subtitles of movies and TV series.
We use 1M lines from each language, with a split of 60/20/20 for train/dev/test, respectively. The test set is reserved for future use. For discriminative training we use 1/6 of the monolingual training data (as creating the data results in roughly 30 sentences per gold sentence).
Additional details on preprocessing and statistics of the data can be found in Appendix A.2.
Training We implement our language models in DyNet (Neubig et al., 2017) . Our basic configuration is similar to that of Gal and Ghahramani (2016) with minor changes. It has a standard architecture of a 2-layer LSTM followed by a softmax layer (see Appendix A.3 for details).
Tuning of hyper-parameters was done on the PTB corpus, in order to be on par with state-of-the-art models such as that of Merity et al. (2017). We then trained the same LM on our CS corpus with no additional tuning and got perplexity of 44.06, better than the 52.99 of Merity et al. (2017) when using their default parameters on the CS corpus. 7 We thus make no further tuning of hyper-parameters. When changing to discriminative training, we perform minimal necessary changes: discarding weight decay and reducing the learning rate to 1.
As done in previous work, in order to be able to give a reliable probability to every next-token in the test set, we include all the tokens from the test set in our vocabulary and we do not use the "
8 Results
The results of the different models are presented in Table 3 . For each model we report both perplexity and accuracy (except for discriminative training, where perplexity is not valid), where each of them is reported according to the best performing model on that measure (on the dev set). We also report the WER of all models, which correlates perfectly with the accuracy measure.
Using monolingual data As mentioned above, both in standard LM and in discriminative training, using monolingual data in a correct manner (FINE-TUNED-LM and FINE-TUNED-DISCRIMINATIVE) significantly improves over using solely the code-switching data. In standard LM, adding monolingual data results in an improvement of 7.5 points (improving from an accuracy of 57.9% to 65.4%), and in the discriminative training it results in an improvement of 5 points (improving from an accuracy of 70.5% to 75.5%). Even though both ALL:SHUFFLED-LM and ALL:CS-LAST-LM use the same data as the FINE-TUNED-LM model, they perform even worse than CS-ONLY-LM that does not use the monolingual data at all. This emphasizes that the manner of integration of the monolingual data has a very strong influence.

Table 3: Results on the dev set and on the test set. "perp" stands for perplexity, "acc" stands for accuracy (in percent), and "wer" stands for word-error-rate.
Note that the FINE-TUNED-LM model also improves perplexity. As perplexity is significantly affected by the size of the vocabulary, and to ensure fair comparison, we also add the additional vocabulary from the monolingual data to CS-ONLY-LM (CS-ONLY+VOCAB-LM). Extending the vocabulary without training those additional words results in a 2.37-point loss on the perplexity measure, while our evaluation metric (accuracy) stays essentially the same. This demonstrates the utility of our proposed evaluation compared to using perplexity, as it allows a fair comparison of models with different vocabulary sizes.
In order to examine the contribution of the monolingual data, we also experimented with subsets of the code-switching training data. Table 4 depicts the results when using subsets of the CS training data with discriminative training. The less code-switching data we use, the more significant the effect of using the monolingual data: we gain 8.8, 6.5, 4.2 and 5 more accuracy points with 25%, 50%, 75% and 100% of the data, respectively. In the case of 25% of the data, the FINE-TUNED-DISCRIMINATIVE model improves over CS-ONLY-DISCRIMINATIVE by a relative 17%.
Standard LMs vs. Discriminative Training In the standard training setting, the FINE-TUNED-LM baseline is the strongest baseline, outperforming all others with an accuracy of 65.4%. Similarly, when using discriminative training, the FINE-TUNED-DISCRIMINATIVE model outperforms the CS-ONLY-DISCRIMINATIVE model. Note that using discriminative training, even with no additional monolingual data, leads to better performance than that of the best language model: the CS-ONLY-DISCRIMINATIVE model achieves an accuracy of 70.5%, 5.1 points more than the accuracy of the FINE-TUNED-LM model. We gain further improvement by adding monolingual data and get an even higher accuracy of 75.5%, which is 10.1 points higher than the best language model.
Analysis Table 5 breaks down the results of the different models according to two conditions: when the gold sentence is code-switched, and when the gold sentence is monolingual. As expected, the FINE-TUNED-DISCRIMINATIVE model is better able to prioritize the gold sentence than all other models, under both conditions. The improvement we get is most significant when the gold sentence is CS: in those cases we get a dramatic improvement of 27.73 accuracy points (a relative improvement of 58%). Note that for the standard LMs, the cases in which the gold sentence is CS are much harder, and they perform badly on this portion of the test set. However, using discriminative learning enables us to get improved performance on both portions of the test set and to get comparable results on both parts.
A closer examination of the mistakes of the FINE-TUNED-LM and FINE-TUNED-DISCRIMINATIVE models reveals the superiority of the discriminative training in various cases. Table 6 presents several examples in which FINE-TUNED-LM prefers a wrong sentence whereas FINE-TUNED-DISCRIMINATIVE identifies the gold one. In examples 1-3 the gold sentence was code-switched but FINE-TUNED-LM forced an improbable monolingual one. Example 4 shows a mistake in a monolingual sentence.
While discriminative training is significantly better than the standard LM training, it can still be improved quite a bit. Table 7 lists some of the mistakes of the FINE-TUNED-DISCRIMINATIVE model: in examples 1 and 2, the gold sentence was code-switched but the model preferred a monolingual one, in example 3 the model prefers a wrong CS sentence over the gold monolingual one, and in 4 the model makes a mistake in a monolingual sentence.
9 Related Work
CS Most prior work on CS focused on Language Identification (LID) (Solorio et al., 2014; Molina et al., 2016) and POS tagging (Solorio and Liu, 2008; Vyas et al., 2014; Ghosh et al., 2016; Barman et al., 2016) . In this work we focus on language modeling, which we find more challenging.
LM Language models have been traditionally created by using the n-grams approach (Brown et al., 1992; Chen and Goodman, 1996) . Recently, neural models gained more popularity, both using a feed-forward network for an n-gram language model (Bengio et al., 2003; Morin and Bengio, 2005) and using recurrent architectures that are fed with the sequence of words, one word at a time (Mikolov et al., 2010; Zaremba et al., 2014; Gal and Ghahramani, 2016; Foerster et al., 2017; Melis et al., 2017) .
Some work has also been done on optimizing LMs for ASR purposes, using discriminative training. Kuo et al. (2002), Roark et al. (2007) and Dikici et al. (2013) all improve LM for ASR by maximizing the probability of the correct candidates. All of them use candidates of ASR systems as "negative" examples and train n-gram LMs or use linear models for classifying candidates. A closer approach to ours is used by Huang et al. (2018). There, they optimize an RNNLM with a discriminative loss as part of training an ASR system. Unlike our proposed model, they still use the standard setting of LM. In addition, their training is coupled with an end-to-end ASR system; in particular, as in previous works, the "negative" examples they use are candidates of that ASR system.

LM for CS Some work has also been done specifically on LM for code-switching. In Chan et al. (2009), the authors compare different n-gram language models, while Vu et al. (2012) suggest improving language modeling by generating artificial code-switched text. Li and Fung (2012) propose a language model that incorporates a syntactic constraint and combine both a code-switched LM and a monolingual LM in the decoding process of an ASR system. Later on, they also suggest incorporating a different syntactic constraint and learning the language model from bilingual data using it (Li and Fung, 2014). Pratapa et al. (2018) also use a syntactic constraint to improve LM by augmenting synthetically created CS sentences in which this constraint is not violated. Adel et al. (2013a) introduce an RNN-based LM, where the output layer is factorized into languages, and POS tags are added to the input. In Adel et al. (2013b), they further investigate an n-gram based factorized LM where each word in the input is concatenated with its POS tag and its language identifier. Adel et al. (2014; 2015) also investigate the influence of syntactic and semantic features in the framework of factorized language models. Sreeram and Sinha (2017) also use a factorized LM. They add POS tags and also use a small amount of monolingual data with the help of information about CS points. No standard benchmark or evaluation setup exists for CS LM, and most previous works use an ASR system as part of their evaluation setup. This makes comparison of methods very challenging. Indeed, all the works listed above use different setups and do not compare to each other, even for works coming from the same group. We believe the evaluation setup we propose in this work and our English-Spanish dataset, which is easy to replicate and decoupled from an ASR system, is a needed step towards meaningful comparison of CS LM approaches.
10 Conclusions And Future Work
We consider the topic of language modeling for code-switched data. We propose a novel ranking-based evaluation method for language models, motivated by speech recognition, and create an evaluation dataset for English-Spanish code-switching LM (SpanglishEval).
We further propose an effective protocol for training language models for CS: pre-training on a mix of monolingual sentences, followed by fine-tuning on a code-switched dataset. This protocol significantly outperforms various baselines. Moreover, we show that the less code-switched training data we use, the more effective it is to incorporate the monolingual data.
Finally, we present a discriminative training for this ranking task that is intended for ASR purposes. This training procedure is not bound to probability distributions, and uses both positive and negative training sentences. This significantly improves performance. Such discriminative training can also be applied to monolingual data.
Our proposed evaluation framework and dataset will facilitate such future work by providing the ability to meaningfully compare the performance of different methods to each other, an ability that was sorely missing in previous work.
A Appendix
A.1 Creating Evaluation Dataset -Implementation Details
Our implementation is based on the Carmel FST toolkit. 8 We create an FST for converting a sentence into a sequence of phonemes, and its inverse FST. The words to phoneme mapping is based on pronunciation dictionaries, according to the language tag of each word in the sentence. We use The CMU Pronouncing Dictionary 9 for English and a dictionary from CMUSphinx 10 for Spanish. As the phoneme inventories in the two datasets do not match, we map the Spanish phonemes to the CMU dict inventory using a manually constructed mapping (see Table 8 ). 11 To favor frequent words over infrequent ones, we add unigram probabilities to the edges of the transducer (taken from googlebooks unigrams 12 ). We filter some words that produce noise (for example, single letter words that are too frequent). When creating a monolingual sentence, we use an FST with the words of that language only. As many phoneme sequences in Spanish do not produce English alternatives (and vice versa) we allow minor changes in the phoneme sequences between the languages. Specifically, we create a small list of similar phonemes (such as "B" and "V", see Table 2 ), and generate an FST that for each phoneme allows changing it to one of its alternatives or dropping it with low probability.
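The phoneme-change transducer described above can be approximated by explicit enumeration. This is a toy sketch rather than the Carmel FST used by the authors: only the "B"/"V" pair is taken from the text, while the probabilities `P_SUB` and `P_DROP`, the `max_edits` cap, and the example phoneme sequence are hypothetical. A real FST would compose this step lazily with the decoder instead of enumerating every variant.

```python
from itertools import product

SIMILAR = {"B": ["V"], "V": ["B"]}  # only the B/V pair is named in the text
P_SUB, P_DROP = 0.1, 0.01           # hypothetical low probabilities

def options(phoneme):
    """(output, weight) choices for one phoneme; None means the phoneme is dropped."""
    subs = [(alt, P_SUB) for alt in SIMILAR.get(phoneme, [])]
    keep = 1.0 - P_DROP - P_SUB * len(subs)
    return [(phoneme, keep), (None, P_DROP)] + subs

def perturb(phonemes, max_edits=1):
    """Variants with at most `max_edits` substitutions or deletions, best weight first."""
    variants = {}
    for choice in product(*(options(p) for p in phonemes)):
        edits = sum(out != src for (out, _), src in zip(choice, phonemes))
        if edits > max_edits:
            continue
        seq = tuple(out for out, _ in choice if out is not None)
        weight = 1.0
        for _, w in choice:
            weight *= w
        variants[seq] = variants.get(seq, 0.0) + weight
    return sorted(variants.items(), key=lambda kv: -kv[1])

ranked = perturb(["B", "IY"])
for seq, weight in ranked:
    print(" ".join(seq), round(weight, 4))
```

Keeping the unchanged sequence as the highest-weight variant mirrors the intent that changes are allowed only with low probability, so exact phonetic matches are preferred whenever they exist.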
1 https://www.ibm.com/watson/services/speech-to-text/
2 We chose accuracy over WER as the default metric since in our case, WER should be treated with caution: the alternatives created might be "too close" to the gold sentence (e.g. when only a small fraction of the gold sentence is sampled and replaced) or "too far" (e.g. a Spanish alternative for an English sentence), thus affecting the WER.
3 The requirements for creating an evaluation dataset for a language pair L1, L2 are: access to code-switched sentences where each word is tagged with a language ID (language ID is not mandatory, but helps in cases in which a word is found in the vocabulary of both languages and is pronounced differently), compatible pronunciation dictionaries, and unigram word probabilities for each of the languages.
4 Specifically, we compose the following FSTs: (a) an FST for converting a sentence into a sequence of phonemes, (b) an FST that allows minor changes in the phoneme sequence, (c) an FST for decoding a sequence of phonemes into a sentence, the inverse of (a).
5 http://bangortalk.org.uk/speakers.php?c=miami
6 http://opus.nlpl.eu/OpenSubtitles2018.php http://www.opensubtitles.org/
7 https://github.com/salesforce/awd-lstm-lm