So, now that I have your attention with that title, let’s delve into the real, less anthropomorphizing question: Is neural machine translation (NMT) biased? I was inspired to write this blog post after watching the documentary Coded Bias (available on Netflix) and a follow-up panel discussion entitled “Is AI racist?” (AI = artificial intelligence). Obviously, I borrowed the title for this blog post from that panel discussion. I highly recommend watching the aforementioned movie, if you haven’t already.
As a white Western European female, I don’t have first-hand experience with racism, and am therefore not really qualified to write a blog post about NMT and racist bias. However, as a female with two advanced STEM degrees (Science, Technology, Engineering and Mathematics), I do know a thing or two about gender bias. I was once told at the beginning of a physics lecture at university that
a woman’s place is in the kitchen.
This is, in fact, a verbatim quote, but I won’t mention names or other details to protect the guilty. Given that I am the world’s worst cook, I did not heed that “advice.” (How many other people do you know who have managed to explode an oven while trying to bake a cake?)
Is natural language processing biased?
Natural language processing (NLP, not to be confused with neuro-linguistic programming in psychotherapy, which shares the same acronym) concerns the programming of computers to process natural language data (as opposed to computer languages). Examples include chatbots, machine translation, and virtual assistants such as Alexa, Siri, and Cortana.
The answer to the question is that yes, NLP can be very biased. Take, for example, Tay, Microsoft’s infamous Twitter chatbot that was programmed to “talk” like a female teenager and to “learn” from its interactions with other Twitter users. Tay went live on March 23, 2016. Sixteen hours later, the following happened (warning, rated R!):

Tay turned into Hitler’s reincarnation as a teenage chatbot.
What happened with Tay?
As I pointed out in my previous blog article on the topic, AI is nothing but pattern matching, or, as Meredith Broussard put it, “statistics on steroids.” If you feed certain patterns into a neural network, it will reproduce those patterns in an amplified manner. In other words, garbage in — amplified garbage out. In the case of Tay, the Twitter exchange had been hijacked by some extreme Twitter trolls, and this was the result. Microsoft’s original programming must actually have been quite good, since it enabled the bot to learn this fast. I don’t doubt that the programming itself was unbiased, but after being fed tons of racist, sexist, and otherwise biased data, Tay went on the robot equivalent of a roid rage rampage.
In this post, I’ll be looking at biases in neural machine translation, specifically at gender bias when translating between gendered and non-gendered languages.
An incomplete and non-technical literature review
Because I don’t like to reinvent the wheel, I set out to review the existing literature on the topic. Based on my previous research experience in physics, I expected to have to wade through hundreds or thousands of scientific articles, given that this is an important topic that clearly deserves attention and given that neural networks have been around since the 1940s. I was really quite astonished to find only a few dozen relevant papers, most of them dating from the last 5 or 6 years.
Here is what I found in my literature review:
- Neural machine translation exhibits a significant gender bias. See, for example, [1] and [2].
- Biased training data are one of the reasons for this. To quote [3] on biases in GPT-3, the state-of-the-art AI language model with a whopping 175 billion parameters mentioned in my previous article:
Biases present in training data may lead models to generate stereotyped or prejudiced content.
- Biased training metrics are another reason. A training metric is a measure that tells the neural network how far off its output is from the desired output. BLEU is one such metric that can lead to significant bias [4].
- Aside from training data and training metrics, the structure of the neural networks matters. The word embeddings themselves are also biased [5] (see the short sketch after this list). Recall from my previous post that embeddings are the representations or encodings of words as arrangements of numbers with which the neural network computes. Neural networks don’t handle words, they handle numbers. So, for machine translation, the words in the source language have to be converted into numbers. This conversion can introduce further bias.
- There are remedies [1]; however, their implementation is lacking.
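To make the embedding point a bit more concrete, here is a minimal sketch of the kind of probe anyone can run with off-the-shelf tools. This is my own toy illustration, not the methodology used in the papers cited above; it assumes the Python library gensim and its downloadable GloVe vectors, and simply compares how close a few profession words sit to “he” versus “she” in the embedding space.

```python
# A toy bias probe for pre-trained word embeddings -- my own illustration,
# not the method used in the references above. Assumes the gensim library
# and its downloadable "glove-wiki-gigaword-50" vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe model

professions = ["engineer", "scientist", "secretary", "nurse"]
for word in professions:
    # Cosine similarity of the profession word to "he" vs. "she": a crude,
    # first-order indication of the gender association baked into the vectors.
    print(f"{word:10s}  he: {vectors.similarity(word, 'he'):.3f}"
          f"  she: {vectors.similarity(word, 'she'):.3f}")
```

If the engineering-type professions land systematically closer to “he” and the secretarial ones closer to “she,” the bias is already present in the input representation, before any translation model is even trained on top of it.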
The following figure illustrates the existing gender bias in NMT graphically:
This figure is reproduced from Prates et al., arXiv:1809.02208. It shows the distribution of gender pronouns attributed to various STEM professions when translated into gendered languages by Google Translate. Clearly, the machine prefers male engineers to female engineers. However, this issue is not limited to Google Translate; other examples are studied in the literature referenced above.
In fact, Google has now mitigated the problem for certain languages by introducing a warning that the translation is gender-specific. You can see this for yourself by translating from Turkish, which apparently has no grammatical gender (I don’t speak Turkish), into English.

My own highly unscientific experiment on gender bias in NMT
After reading the sparse, but unambiguous literature, I decided to perform my own highly unscientific experiment with three of the most popular public machine translation engines, Google Translate, Bing Microsoft Translator, and DeepL. I actually thought that I might have to try some more contrived examples to get these popular NMT engines to reveal their gender bias, but I already succeeded with the first sentence I tried to translate between English and German. Given the above literature, this is quite astonishing and indeed very disappointing. The following screenshots are current as of May 20-25, 2021. Since NMT engines are continuously evolving, your results may vary.
Gender-neutral English into gendered German
I input the following English sentence into Google Translate, Bing Microsoft Translator, and DeepL:
The translator talked to the secretary and the engineer.
In English, this sentence is perfectly gender-neutral. What about the German translation? In case you don’t speak German: German has gendered nouns; specifically, there are different words for female and male translators, secretaries, and engineers. Here are the results of the three NMT engines (as of May 20, 2021, 12:30pm US Pacific time zone):

Translation from English to German by Bing Microsoft Translator, Google Translate, and DeepL
As you can see, the results of the translations are all the same. The back translation reads:
The [male] translator talked to the [female] secretary and the [male] engineer.
Not unexpectedly, the secretary and the engineer are stereotyped as female and male, respectively. Both Google Translate and DeepL have the option of offering alternative translations in addition to the most “relevant” translation. Here’s what happens if I use that feature:

Alternative translation suggestion by Google Translate

Alternative translation suggestions by DeepL
Google Translate offers an alternative translation for the whole sentence, which keeps the translator and the engineer male but changes the default female secretary to a male one. DeepL offers alternative translations of individual words, but not of the whole sentence. DeepL’s alternative translations for “translator” are nearly all male, with a gender-neutral fifth option in the dropdown. The alternative options for “secretary” are almost all female, with two gender-neutral exceptions. The synonyms for “engineer” are all male.
Now, given that AI is “statistics on steroids,” these alternative suggestions are partly expected. Of all secretaries in Germany, only 5 percent are male, so this particular gendered translation is not surprising. However, depending on whom you ask (see here and here), it is estimated that roughly 20-25% of all engineers in Germany are female. Evidently, Google and DeepL completely ignore up to a fifth or a quarter of that population in their translations. On the other hand, Google does remember the mere 5 percent of male secretaries in its alternative option.
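To see why a “most likely translation” approach erases that fifth or quarter entirely, consider the following toy sketch. The numbers are made up for illustration; this is not any engine’s actual decoding logic.

```python
# Toy illustration with made-up probabilities -- not any real engine's decoder.
# If the model always picks the single most probable (modal) gendered form,
# a 25% minority option never shows up in the output at all.
probs = {"Ingenieur (male)": 0.75, "Ingenieurin (female)": 0.25}

most_likely = max(probs, key=probs.get)
print(most_likely)  # prints the male form every single time
```

Picking the statistical mode turns a 75/25 split in the real world into a 100/0 split in the output, which is exactly the kind of amplification described above: garbage in, amplified garbage out.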
What about German into English?

Well, it turns out that when I try to translate the female versions into English, i.e. “The female translator spoke with the female engineer,” Google Translate helpfully asks me if I meant the plural, male engineers (to add insult to injury, with incorrect grammar). Google apparently does not recognize the female form “Ingenieurin,” despite the fact that I actually have this title printed on my university diploma. In other words, Google Translate is sending me back to the kitchen, figuratively speaking. At least DeepL and Bing Microsoft Translator don’t offer these kinds of “helpful suggestions.”
Thoughts on gender stars in German and word embeddings
Given that neural machine translation is simply pattern matching or statistics on steroids, the question arises as to why the aforementioned NMT engines that offer alternatives to the most likely translation completely ignore statistically relevant options in their list of alternatives. My suspicion is that perhaps all the allegedly helpful attempts to render the German language more gender neutral aren’t all that helpful when it comes to word embeddings. I’m referring to gender stars, gender colons, and Binnen-Is.
If you are unfamiliar with gender stars and Binnen-Is, these artificial constructs are used to make a gendered noun gender-neutral. For example, “Dolmetscher” refers to male interpreters (singular and plural), whereas “Dolmetscherinnen” refers to female interpreters, here in the plural form. To refer to both male and female interpreters in the plural form, instead of writing “Dolmetscherinnen und Dolmetscher” (female and male interpreters), often “Dolmetscher:innen,” “Dolmetscher*innen,” or “DolmetscherInnen” is used. These constructions are called gender colon, gender star, and Binnen-I, respectively.
As an aside, I personally strongly dislike gender stars, gender colons, and Binnen-Is and similar constructs and usually go out of my way to find gender neutral formulations without these implements. This is because these constructions wreak havoc with screen readers and similar apps. By trying to be inclusive of one group of people (females), another group of people (the visually impaired) is being left out.
Again, recall that the words have to be represented as numbers first, before a neural network can do anything with them, including translate them into another language. This process is known as encoding words into word embeddings. Now, most encoders treat characters that are not letters as word separators; at least this is the case for all the encoders I was taught about in the various NLP courses I attended. Thus, it is entirely conceivable that in the course of the embedding, the machine separates words such as “Übersetzer:innen” or “Übersetzer*innen” into “Übersetzer” and “innen” and discards the latter within the network, because “innen” by itself does not make sense, as illustrated in the sketch below. In short, my theory is that by using these allegedly gender-neutral constructs, the machine input is made even more gender biased, because the machine doesn’t know what to do with these artificial constructions and just chops them off. Of course, I could be wrong, but this definitely warrants further investigation.
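Here is a minimal sketch of what such a naive tokenizer, one that treats every non-letter character as a word separator, does to the various gendered forms. This is my own toy example, not the actual preprocessing of Google Translate, DeepL, or Bing Microsoft Translator, which use subword tokenizers (e.g., byte-pair encoding) that may behave differently.

```python
# My own toy example, not the actual preprocessing of any real NMT engine.
# A naive tokenizer that treats every non-letter character as a word separator.
import re

forms = [
    "Dolmetscherinnen und Dolmetscher",  # spelled-out double form
    "Dolmetscher:innen",                 # gender colon
    "Dolmetscher*innen",                 # gender star
    "DolmetscherInnen",                  # Binnen-I
]

for text in forms:
    # Split on anything that is not a Unicode letter (digits and underscores
    # are also treated as separators here, for simplicity).
    tokens = [t for t in re.split(r"[\W\d_]+", text) if t]
    print(f"{text!r:38} -> {tokens}")
```

On such a naive split, the gender-colon and gender-star forms lose their feminine ending and collapse back to the masculine “Dolmetscher,” while the Binnen-I form, consisting only of letters, survives intact. Whether the subword tokenizers in real NMT engines do something comparable is exactly the kind of question that warrants the further investigation mentioned above.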
The moral of the story for post-editors, NMT users, and programmers
Gender and other biases in AI are real. Neural machine translation is definitely biased when translating between gendered and non-gendered languages. Post-editors of machine translation output need to be aware of this issue. Computer scientists and programmers should also be aware of it and strive to make their neural networks more balanced. Of course, it is probably not possible to make a translation engine completely gender-neutral. But one simple solution that could be implemented immediately would be for public-facing AI applications to inform the uninformed public about these biases, as Google Translate has done, for example, for Turkish to English (see screenshot above). (For some unknown reason, there is no such warning for Turkish to German.) Such a simple statement should suffice to raise awareness among users.
In my opinion, language is not like pie or cake. Cakes are finite and can only be divided into a finite number of pieces. Language, on the other hand, is infinite and infinitely diverse. This should be reflected in its use, in written and translated texts, especially when these texts are produced by machines. Let’s make sure that these machines don’t inadvertently send people back to the kitchen!
References:
[1] Savoldi et al., Transactions of the Association for Computational Linguistics (TACL), 2021.
[2] Caliskan et al., Science 356, 183–186 (2017); Prates et al., Neural Computing and Applications, arXiv:1809.02208; Stanovsky et al., 57th Annual Meeting of the Association for Computational Linguistics, pp. 1679–1684, 2019; and many others.
[3] Brown et al., arXiv:2005.14165.
[4] Roberts et al., 34th Conference on Neural Information Processing Systems (NeurIPS 2020), arXiv:2011.13477.
[5] Costa-jussà et al., arXiv:2012.13176 and Sweeney and Najafian, FAT* ’20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 359–368, 2020.