Evaluating Gender Bias in Machine Translation
We present the first challenge set and evaluation protocol for the analysis of gender bias in machine translation (MT). Our approach uses two recent coreference resolution datasets composed of English sentences which cast participants into non-stereotypical gender roles (e.g., “The doctor asked the nurse to help her in the operation”). We devise an automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis (e.g., the use of female inflection for the word “doctor”). Our analyses show that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages. Our data and code are publicly available at https://github.com/gabrielStanovsky/mt_gender.
Learned models exhibit social bias when their training data encode stereotypes that are irrelevant to the task, yet the models pick up these correlations anyway. Notable examples include gender biases in visual SRL (cooking is stereotypically done by women, construction workers are stereotypically men; Zhao et al., 2017), lexical semantics ("man is to computer programmer as woman is to homemaker"; Bolukbasi et al., 2016), and natural language inference (associating women with gossiping and men with guitars; Rudinger et al., 2017).
In this work, we conduct the first large-scale multilingual evaluation of gender bias in machine translation (MT), following recent small-scale qualitative studies which observed that online MT services, such as Google Translate or Microsoft Translator, also exhibit biases, e.g., translating nurses as females and programmers as males, regardless of context (Alvarez-Melis and Jaakkola, 2017; Font and Costa-Jussà, 2019). Google Translate recently tried to mitigate these biases by allowing users to sometimes choose between gendered translations (Kuczmarski, 2018).

Figure 1: An example of gender bias in machine translation from English (top) to Spanish (bottom). English source: "The doctor asked the nurse to help her in the procedure." Spanish translation: "El doctor le pidio a la enfermera que le ayudara con el procedimiento." In the English source sentence, the nurse's gender is unknown, while the coreference link with "her" identifies the "doctor" as a female. On the other hand, the Spanish target sentence uses morphological features for gender: "el doctor" (male) versus "la enfermera" (female). Aligning the source and target sentences reveals that a stereotypical assignment of gender roles changed the meaning of the translated sentence by changing the doctor's gender.
As shown in Figure 1, we use data introduced by two recent coreference gender-bias studies: the Winogender (Rudinger et al., 2018) and the WinoBias (Zhao et al., 2018) datasets. Following the Winograd schema (Levesque, 2011), each instance in these datasets is an English sentence which describes a scenario with human entities, who are identified by their role (e.g., "the doctor" and "the nurse" in Figure 1), and a pronoun ("her" in the example), which needs to be correctly resolved to one of the entities ("the doctor" in this case). Rudinger et al. (2018) and Zhao et al. (2018) found that while human agreement on the task was high (roughly 95%), coreference resolution models often ignore context and make socially biased predictions, e.g., associating the feminine pronoun "her" with the stereotypically female "nurse."
We observe that for many target languages, a faithful translation requires a similar form of (at least implicit) gender identification. In addition, in the many languages which associate biological gender with grammatical gender (e.g., most Romance, Germanic, Slavic, and Semitic languages; Craig, 1986; Mucchi-Faina, 2005; Corbett, 2007), the gender of an animate noun can be identified via morphological markers. For instance, when translating our running example in Figure 1 to Spanish, a valid translation may be: "La doctora le pidio a la enfermera que le ayudara con el procedimiento," which indicates that the doctor is a woman, using a feminine suffix inflection ("doctora") and the feminine definite article ("la"). However, a biased translation system may ignore the given context and stereotypically translate the doctor as male, as shown at the bottom of the figure.
Following these observations, we design a challenge set approach for evaluating gender bias in MT using a concatenation of Winogender and WinoBias. We devise an automatic translation evaluation method for eight diverse target languages, without requiring additional gold translations, relying instead on automatic measures for alignment and morphological analysis (Section 2). We find that four widely used commercial MT systems and two recent state-of-the-art academic models are significantly gender-biased on all tested languages (Section 3). Our method and benchmarks are publicly available, and are easily extensible with more languages and MT models.
2 Challenge Set for Gender Bias in MT
We compose a challenge set for gender bias in MT (which we dub "WinoMT") by concatenating the Winogender and WinoBias coreference test sets. Overall, WinoMT contains 3,888 instances, and is equally balanced between male and female genders, as well as between stereotypical and non-stereotypical gender-role assignments (e.g., a female doctor versus a female nurse). Additional dataset statistics are presented in Table 1.
We use WinoMT to estimate the gender bias of an MT model, M, in target language L by performing the following steps (exemplified in Figure 1):
(1) Translate all of the sentences in WinoMT into L using M , thus forming a bilingual corpus of English and the target language L.
(2) Align between the source and target translations, using fast_align (Dyer et al., 2013), trained on the automatic translations from step (1). We then map the English entity annotated in the coreference datasets to its translation (e.g., align between "the doctor" and "el doctor" in Figure 1).

Table 1: WinoMT corpus statistics (number of instances).

            Winogender   WinoBias   WinoMT
  Male             240       1582     1826
  Female           240       1586     1822
  Neutral          240          0      240
  Total            720       3168     3888
(3) Finally, we extract the target-side entity's gender using simple heuristics over language-specific morphological analysis, which we perform using off-the-shelf tools for each target language, as discussed in the following section. This process extracts the translated genders, according to M, for all of the entities in WinoMT, which we can then evaluate against the gold annotations provided by the original English dataset.
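The steps above can be sketched as follows. The `extract_gender` callback and the per-instance fields are hypothetical stand-ins for the MT API calls and morphological analyzers; fast_align's output format (space-separated "src-tgt" index pairs per sentence) is the tool's actual convention.

```python
def parse_alignment(line):
    """Parse one line of fast_align output, e.g. "0-0 1-2 2-1",
    into a source-index -> target-index mapping (first link wins)."""
    mapping = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        mapping.setdefault(int(src), int(tgt))
    return mapping

def evaluate_instance(src_tokens, tgt_tokens, alignment_line,
                      entity_idx, gold_gender, extract_gender):
    """Map the annotated English entity to its translation and compare
    the morphologically extracted gender against the gold annotation."""
    alignment = parse_alignment(alignment_line)
    tgt_idx = alignment.get(entity_idx)
    if tgt_idx is None:
        return False  # an unaligned entity counts as an error
    return extract_gender(tgt_tokens, tgt_idx) == gold_gender
```

On the running example, the biased Spanish output aligns "doctor" to "el doctor"; a gender extractor reading the masculine article would return "male" against the gold "female", so the instance is counted as an error.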
This process can introduce noise into our evaluation in steps (2) and (3), via wrong alignments or erroneous morphological analysis. In Section 3, we present a human evaluation showing that these errors are infrequent.
3 Evaluation

In this section, we briefly describe the MT systems and the target languages we use, our main results, and their human validation.
3.1 Experimental Setup
MT systems We test six widely used MT models, representing the state of the art in both commercial and academic research: (1) Google Translate, 1 (2) Microsoft Translator, 2 (3) Amazon Translate, 3 (4) SYSTRAN, 4 (5) the model of Ott et al. (2018), which recently achieved the best performance on English-to-French translation on the WMT'14 test set, and (6) the model of Edunov et al. (2018), the WMT'18 winner on English-to-German translation. We query the online APIs of the first four commercial MT systems, while for the latter two academic models we use the pretrained models provided by the Fairseq toolkit. 5
Table 2: Performance of the commercial MT systems (Google Translate, Microsoft Translator, Amazon Translate, and SYSTRAN) on the WinoMT corpus for all tested languages, categorized by family: Spanish, French, Italian, Russian, Ukrainian, Hebrew, Arabic, and German. Acc indicates overall gender accuracy (% of instances in which the translation had the correct gender), ∆G denotes the difference in performance (F1 score) between masculine and feminine scores, and ∆S is the difference in performance (F1 score) between pro-stereotypical and anti-stereotypical gender role assignments (higher numbers in the two latter metrics indicate stronger biases). Numbers in bold indicate the best accuracy for the language across MT systems (row), and underlined numbers indicate the best accuracy for the MT system across languages (column).
Target Languages and Morphological Analysis
We selected a set of eight languages with grammatical gender which exhibit a wide range of other linguistic properties (e.g., in terms of alphabet, word order, or grammar), while still allowing for highly accurate automatic morphological analysis. These languages belong to four different families: (1) Romance languages: Spanish, French, and Italian, all of which have gendered noun-determiner agreement and spaCy morphological analysis support (Honnibal and Montani, 2017) . (2) Slavic languages (Cyrillic alphabet): Russian and Ukrainian, for which we use the morphological analyzer developed by Korobov (2015) . (3) Semitic languages: Hebrew and Arabic, each with a unique alphabet. For Hebrew, we use the analyzer developed by Adler and Elhadad (2006) , while gender inflection in Arabic can be easily identified via the ta marbuta character, which uniquely indicates feminine inflection. (4) Germanic languages: German, for which we use the morphological analyzer developed by Altinok (2018).
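The actual pipeline relies on the full analyzers listed above; the two simplest heuristics it builds on can nonetheless be sketched standalone. The Arabic word-final ta marbuta check follows the paper's description, while the Spanish article check is our own simplification of what spaCy's morphological features provide.

```python
TA_MARBUTA = "\u0629"  # Arabic ta marbuta, a feminine inflection marker

def arabic_gender(word):
    """A word-final ta marbuta uniquely indicates feminine inflection;
    otherwise we fall back to masculine."""
    return "female" if word.endswith(TA_MARBUTA) else "male"

def spanish_gender(tokens, idx):
    """Simplified determiner-agreement heuristic: read the gender off
    the article preceding the noun, if one is present."""
    if idx > 0:
        det = tokens[idx - 1].lower()
        if det in ("la", "las", "una"):
            return "female"
        if det in ("el", "los", "un"):
            return "male"
    return "unknown"
```

For example, the feminine Arabic form of "doctor" ends with ta marbuta and is flagged as female, while in Spanish "la enfermera" and "el doctor" are resolved through their articles.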
3.2 Results

Our main findings are presented in Tables 2 and 3. For each tested MT system and target language, we compute three metrics with respect to their ability to convey the correct gender in the target language. Ultimately, our analyses indicate that all tested MT systems are indeed gender-biased.

First, the overall system Accuracy is calculated as the percentage of instances in which the translation preserved the gender of the entity from the original English sentence. We find that most tested systems across the eight tested languages perform quite poorly on this metric. The best-performing model on each language often does not do much better than a random guess for the correct inflection. An exception to this rule is the translation accuracy on German, where three out of four systems achieve their best performance. This may be explained by German's similarity to the English source language (Hawkins, 2015).
In Table 2, ∆G denotes the difference in performance (F1 score) between male and female translations. Interestingly, all systems, except Microsoft Translator on German, perform significantly better on male roles, which may stem from these being more frequent in their training data.
Perhaps most tellingly, ∆S measures the difference in performance (F1 score) between stereotypical and non-stereotypical gender role assignments, as defined by Zhao et al. (2018), who use statistics provided by the US Department of Labor. 6 This metric shows that all tested systems perform significantly and consistently better when presented with pro-stereotypical assignments (e.g., a female nurse), while their performance deteriorates when translating anti-stereotypical roles (e.g., a male receptionist). For instance, Figure 2 depicts Google Translate's absolute accuracies on stereotypical and non-stereotypical gender roles across all tested languages. Other tested systems show similar trends.

Figure 2: Google Translate's performance on gender translation on our tested languages (stereotypical vs. non-stereotypical accuracy, in %): ES 67/46, FR 80/54, IT 52/30, RU 44/33, UK 46/35, HE 76/38, AR 60/44, DE 69/57. The performance on the stereotypical portion of WinoMT is consistently better than that on the non-stereotypical portion. The other MT systems we tested display similar trends.
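The three metrics can be sketched as follows. The per-instance tuple format is an assumption, and for brevity ∆S is computed here over subset accuracies, whereas the paper reports F1 differences.

```python
def f1(gold, pred, cls):
    """Per-class F1 over parallel gold/predicted label sequences."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def winomt_metrics(instances):
    """instances: list of (gold_gender, predicted_gender, is_stereotypical).
    Returns (Acc, delta_G, delta_S); delta_S uses subset accuracy here."""
    gold = [g for g, _, _ in instances]
    pred = [p for _, p, _ in instances]
    acc = sum(g == p for g, p in zip(gold, pred)) / len(instances)
    delta_g = f1(gold, pred, "male") - f1(gold, pred, "female")
    pro = [(g, p) for g, p, s in instances if s]
    anti = [(g, p) for g, p, s in instances if not s]
    delta_s = (sum(g == p for g, p in pro) / len(pro)
               - sum(g == p for g, p in anti) / len(anti))
    return acc, delta_g, delta_s
```

Positive ∆G indicates better performance on male entities; positive ∆S indicates better performance on pro-stereotypical role assignments.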
3.3 Fighting Bias With Bias
3.4 Human Validation
We estimate the accuracy of our gender bias evaluation method by randomly sampling 100 instances for each translation system and target language, annotating each sample by two target-language native speakers (resulting in 9,600 human annotations). Each instance conformed to a format similar to that used by our automatic gender detection algorithm: human annotators were asked to mark the gender of an entity within a given target-language sentence (e.g., "el doctor" as highlighted in the Spanish sentence in Figure 1). By annotating at the sentence level, we can account for both types of possible errors, i.e., alignment and gender extraction. We compare the sentence-level human annotations to the output of our automatic method, and find that the levels of agreement for all languages and systems were above 85%, with an average agreement of 87% of the annotations. In comparison, human inter-annotator agreement was 90%, due to noise introduced by several incoherent translations.
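The agreement figures above are simple percent agreement between two label sequences; a minimal sketch, assuming labels are compared item by item:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two parallel label sequences agree,
    e.g. human annotations vs. the automatic method's output."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
```

The same function covers both comparisons reported here: automatic output against each annotator, and annotator against annotator.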
Our errors occur when language-specific idiosyncrasies introduce ambiguity into the morphological analysis. For example, gender for certain words in Hebrew cannot be distinguished without diacritics (e.g., the male and female versions of the word "baker" are spelled identically), and the contracted determiner in French and Italian (l') is used for both masculine and feminine nouns. In addition, some languages have only male or female inflections for professions which were stereotypically associated with one of the genders, for example "sastre" (tailor) in Spanish or "soldat" (soldier) in French, which do not have female inflections. See Table 5 for detailed examples.

Table 5: Example translations.

[EN] The janitor does not like the baker because she always messes up the kitchen.
[ES] Al conserje no le gusta el panadero porque ella siempre desordena la cocina.
→ Biased translation, giving "baker" a male inflection, with a mismatched pronoun reference.

[EN] The janitor does not like the pretty baker because she always messes up the kitchen.
[ES] Al conserje no le gusta la panadera bonita porque ella siempre desordena la cocina.
→ Adding a stereotypically female adjective "fixes" the translation.

[EN] The counselor asked the guard a few questions and praised her for the good work.
[FR] Le conseiller a posé quelques questions à la garde et l'a louée pour le bon travail.
→ French uses "garde" for both male and female guards, allowing for a more direct translation from English.
Related work This work is most related to several recent efforts which evaluate MT through the use of challenge sets. Similarly to our use of WinoMT, these works evaluate MT systems (either manually or automatically) on test sets which are specially created to exhibit certain linguistic phenomena, thus going beyond the traditional BLEU metric (Papineni et al., 2002). These include challenge sets for language-specific idiosyncrasies (Isabelle et al., 2017), discourse phenomena (Bawden et al., 2018), pronoun translation (Müller et al., 2018; Webster et al., 2018), and coreference and multiword expressions (Burchardt et al., 2017).
Limitations and future work While our work presents the first large-scale evaluation of gender bias in MT, it still suffers from certain limitations which could be addressed in follow-up work. First, like some of the challenge sets discussed above, WinoMT is composed of synthetic English source-side examples. On the one hand, this allows for a controlled experimental environment; on the other hand, it might introduce some artificial biases in our data and evaluation. Ideally, WinoMT could be augmented with natural "in the wild" instances, with many source languages, all annotated with ground-truth entity gender. Second, as with any medium-sized test set, it is clear that WinoMT serves only as a proxy estimate for the phenomenon of gender bias, and would probably be easy to overfit. A larger annotated corpus could perhaps provide a better signal for training. Finally, even though in Section 3.3 we show a very rudimentary debiasing scheme which relies on an oracle coreference system, it is clear that this is not applicable in a real-world scenario. While recent research has shown that getting rid of such biases may prove to be very challenging (Elazar and Goldberg, 2018; Gonen and Goldberg, 2019), we hope that this work will serve as a first step for developing more gender-balanced MT models.
We presented the first large-scale multilingual quantitative evidence for gender bias in MT, showing that on eight diverse target languages, all four tested popular commercial systems and two recent state-of-the-art academic MT models are significantly prone to translate based on gender stereotypes rather than more meaningful context. Our data and code are publicly available at https://github.com/gabrielStanovsky/mt_gender.
1 https://translate.google.com
2 https://www.bing.com/translator
3 https://aws.amazon.com/translate
4 http://www.systransoft.com
5 https://github.com/pytorch/fairseq