Recently, claims have surfaced that neural machine translation (NMT) produces the same quality as human translators; that is, several NMT providers claim to have achieved or even exceeded human parity. In this article I want to investigate these claims of human parity. An introduction to neural machine translation can be found here.
What Is a Translation Slam?
A translation slam compares and contrasts two or more translations of the same source text. In such a battle of wits and words, the goal is not to determine a winner, but rather to examine the translation process and linguistic choices.
At the annual conference of the American Translators Association in 2018, the German Language Division held two of these translation slams, one for translation from German into English and one for translation from English into German. You can read reviews of these translation slams here. I managed to obtain the three submissions for the slam from English into German from the participating translators; many thanks to Jutta Diel-Dominique, Maren Mentor, and Eva Stabenow. In the following, I will contrast these wonderfully translated texts with the output of three well-known neural machine translation engines: Bing Microsoft Translator, DeepL, and Google Translate. The source text is an article in The New York Times titled “How to Get More Women to Be C.E.O.s” from July 25, 2017.
How Can Translation Quality Be Measured?
Now, how can you even compare machine and human translation? I wanted to show that claims of human parity for machine translation are greatly exaggerated, so I needed a way to compare these translations not only qualitatively, but also quantitatively. For this purpose I used several well-known metrics that are commonly employed to assess machine translation quality.
BLEU – Bilingual Evaluation Understudy
BLEU is one of the oldest and most frequently used quality metrics. However, BLEU can also be highly misleading, because it compares the translation in question with a reference translation word by word. This means that rephrasings and synonyms are penalized and lead to lower (worse) scores. Below it will become quite obvious why such literal comparisons rarely make sense for neural machine translation output and make no sense at all for human translations.
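To make this word-by-word comparison concrete, here is a minimal sketch using the sacrebleu package (any BLEU implementation would show the same effect); the example sentences are invented purely for illustration.

```python
# Minimal sketch of BLEU's word-by-word comparison, using the sacrebleu
# package (an assumption; any BLEU implementation would show the same effect).
# The example sentences are invented for illustration.
import sacrebleu

reference = ["How to get more women into the executive suite"]

identical = ["How to get more women into the executive suite"]
synonyms  = ["How to bring more women into senior management"]

# corpus_bleu expects a list of hypotheses and a list of reference lists
print(sacrebleu.corpus_bleu(identical, [reference]).score)  # 100.0 for an exact match
print(sacrebleu.corpus_bleu(synonyms, [reference]).score)   # much lower, despite the same meaning
```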
METEOR – Metric for Evaluation of Translation with Explicit ORdering
METEOR is a more advanced metric that takes into account that the word order within a sentence can change while the sentence remains correct in terms of grammar and content. However, METEOR has the same disadvantages as BLEU with respect to rephrasings and synonyms, since it also relies on a verbatim comparison.
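And here is the corresponding minimal sketch for METEOR, using the implementation in NLTK (one of several available implementations); again, the sentences are only illustrative.

```python
# Minimal sketch of a METEOR computation, using NLTK (one of several
# available implementations). Recent NLTK versions expect pre-tokenized
# input and need the WordNet corpus to be downloaded.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference  = "How to get more women into the executive suite".split()
hypothesis = "How to bring more women into senior management".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis
print(meteor_score([reference], hypothesis))  # 0.0 (no overlap) to 1.0 (identical)
```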
BLEU, METEOR, and related metrics have the distinct advantage that they can be automated, and thus no human intervention is necessary. For this reason they are commonly used to gauge the quality of machine translation.
The distinct disadvantage of these metrics is that they do not take into account whether the translation is actually correct in terms of grammar and content. It is entirely possible that a translation with severe grammatical and other errors is rated higher than an error-free translation that uses synonyms of the words in the reference translation, as we will see below. Further, BLEU and METEOR assume the existence of a reference translation. In my opinion, these metrics should be referred to as “similarity metrics” instead of “quality metrics.”
MQM – Multidimensional Quality Metrics
MQM was designed to remedy the aforementioned flaws of existing quality metrics, which, however, necessitates human intervention. MQM requires that human experts assess the translation and manually categorize and classify its errors. In this process, different human experts may rate the severity of an error differently, and such subjective disagreements can lead to significant differences in the ratings of the translation in question. This means that it is not possible to automate MQM. On the other hand, there is no need for a reference text, and the error categories are clearly defined.
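To illustrate which part of MQM can be automated and which part cannot, here is a sketch of an MQM-style score calculation. The error counts and severities must come from a human reviewer; only the final arithmetic is mechanical. The penalty weights and word-count normalization below are illustrative assumptions, not necessarily the scheme behind Table 1 further down.

```python
# Sketch of an MQM-style score. Only this arithmetic can be automated; the
# error counts and their severities have to be assigned by a human reviewer.
# The penalty weights (1/5/10) and the normalization by word count are
# illustrative assumptions; MQM-based scorecards differ in these details.
def mqm_score(minor: int, major: int, critical: int, word_count: int,
              weights=(1, 5, 10)) -> float:
    """Return a percentage score from human-assigned error counts."""
    penalty = minor * weights[0] + major * weights[1] + critical * weights[2]
    return max(0.0, 1.0 - penalty / word_count) * 100

# Hypothetical example: a 550-word text with 2 minor, 3 major, and 1 critical error
print(f"{mqm_score(2, 3, 1, 550):.2f}%")
```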
Translation Slam – Human versus Machine
I have used the aforementioned metrics to pit the three professional human translators against the three big machine translation engines. I do not want to go into too much technical detail here, but you can read the definitions of the quality metrics in the links above. Eva, Jutta, and Maren were kind enough to send me their translations for this analysis, and the source text can be found on the website of The New York Times. I used the public portals of Bing Microsoft Translator, DeepL, and Google Translate at the beginning of August 2021 to translate this source text into German. It should be noted that these results can change over time, because the NMT engines are continuously under development.
With all translations in hand, the question arose of how to apply BLEU and METEOR, since both require a reference translation. Of course, I could have chosen one of the three human translations as the reference text, but this would have been entirely unscientific and arbitrary. Therefore I chose to compare all six translations with each other in pairs, as sketched in the code below. Fig. 1 shows the results of BLEU and METEOR for each of these pairs (in alphabetical order) at the corpus level. Corpus level means that I compared the entirety of each text, around 550 words each.
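In code, this pairwise comparison might look roughly as follows; the texts are placeholders, and whitespace tokenization plus averaging sentence-level METEOR scores over the corpus are my simplifications, since the exact setup is not spelled out here.

```python
# Sketch of the pairwise comparison: each of the six translations serves in
# turn as the "reference" for every other one. The texts are placeholders;
# whitespace tokenization and averaging sentence-level METEOR scores over
# the whole text are simplifying assumptions.
from itertools import permutations
import sacrebleu
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

translations = {                      # one list entry per sentence of the full text
    "Bing":   ["..."], "DeepL": ["..."], "Google": ["..."],
    "Eva":    ["..."], "Jutta": ["..."], "Maren":  ["..."],
}

for ref_name, hyp_name in permutations(translations, 2):
    ref, hyp = translations[ref_name], translations[hyp_name]
    bleu = sacrebleu.corpus_bleu(hyp, [ref]).score                 # 0 to 100
    meteor = sum(meteor_score([r.split()], h.split())              # 0.0 to 1.0
                 for r, h in zip(ref, hyp)) / len(ref)
    print(f"{hyp_name} vs. reference {ref_name}: BLEU {bleu:.1f}, METEOR {meteor:.3f}")
```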

Fig. 1: Results of the BLEU and METEOR quality metrics, comparing each pair out of the 6 translations
The tables are color coded for ease of understanding. BLEU has a maximum score of 100, which is achieved when a text is identical to the reference text. This is of course true if a text is compared with itself, as you can see in Fig. 1. METEOR has a maximum score of 1.0, which is also achieved when a text is identical to the reference text. The more a text differs from the reference, the lower the score for both BLEU and METEOR.
The results in Fig. 1 are clear. All machine translations are very similar to each other, but not identical (colored green, closer to a score of 100 and 1.0, respectively). The human translations are not at all similar to each other or to the machine translations (colored yellow, orange, and red, where the exact color is an artifact of Excel’s algorithm and not relevant). If one interprets these results according to the usual quality criteria, one is led to believe that the human translations are absolutely terrible, since the BLEU scores range between 0 and around 20 and the METEOR scores are below 0.5. In other words, if I take Eva’s translation as the reference text, Maren’s and Jutta’s translations would be assessed as absolutely horrible, as would the NMT outputs, and vice versa. It must be emphasized again that the three human translations are free of errors of any kind; they differ merely in style, not in content.
What Does MQM Say?
Again, the human translations are 100% error-free, so they would be rated as 100% on the MQM scale. The machine translation outputs are a different story, however. Google’s and DeepL’s outputs each contained one critical error and 4 to 5 major errors, all of them mistranslations. Bing’s translation contained three critical errors and about a dozen major errors, again in the category of mistranslation. In addition, all three machine translation engines produced several minor grammatical errors. I should add that I classified errors as less severe whenever I was not entirely sure of the rating; it is therefore conceivable that a different reviewer would rate the NMT outputs even lower than I did. The results are shown in Table 1.
Table 1: MQM scores of the translations

| | Eva | Jutta | Maren | Bing | DeepL | Google |
|---|---|---|---|---|---|---|
| MQM score | 100% | 100% | 100% | 18.89% | 73.85% | 52.26% |
As you can see, Bing’s translation is inadequate, and the quality of DeepL’s and Google’s outputs is questionable as well. The severity of the aforementioned critical errors lies in the eye of the end user, of course.
A Few Translation Examples
Below I list a few examples that contrast human and machine translations so you can gauge the parity of humans and machines for yourself.
The title of the article in The New York Times is: “How to Get More Women to Be C.E.O.s”. The three humans translated this as follows [I include my own somewhat lame literal back translations in brackets for those of you who don’t speak German, though something always gets lost in back translations.]:
Mehr Frauen in der Chefetage: So gelingt’s [More Women Executives: This Is How]
Mehr weibliche Führungskräfte braucht das Land. [The Country Needs More Women Executives]
Mehr Frauen in die Chefetage – aber wie? Ein Wort von denen, die es geschafft haben [More Women Executives: But How? Some Comments by Those Who Succeeded]
Bing translates this entirely literally:
Wie man mehr Frauen dazu bringen kann, C.E.O.s zu werden [How One Can Get More Women to Become C.E.O.s]
DeepL’s German translation has more oomph:
Wie man mehr Frauen in Führungspositionen bringt [How to Get More Women into the Executive Level]
Google is again fairly literal:
Wie man mehr Frauen dazu bringt, C.E.O.s zu werden [How One Gets More Women to Become C.E.O.s]
Hilarity ensues when the machine translation engines encounter the sentence “I’d pick their brain.” The three professionals translate fairly loosely:
Ich habe […] ihnen unzählige Fragen gestellt. [I asked […] them countless questions.]
[…] um von ihnen durch Fragen lernen zu können. [[…] to be able to learn from them by asking questions.]
Ich habe mir bei ihnen Ideen geholt. [I asked them for ideas.]
Bing and Google totally lose the context and switch to brain surgery:
Ich würde ihr Gehirn auswählen. [I would choose their brain.]
DeepL keeps the meaning:
Ich habe mir ihr Wissen angeeignet. [I have acquired their knowledge.]
The varied human translations and their (too) literal machine counterparts of this particular sentence also illustrate quite clearly why metrics such as BLEU, METEOR, or even comparison metrics that take synonyms into account, such as BERTScore, don’t really reflect “translation quality,” unless one defines “quality” as synonymous with “similarity.” The human translators translated the meaning and incorporated it into an idiomatic translation, reshuffling the sentence structure quite a bit. The machines translated the words and missed the meaning completely (Bing, Google) or at least partially (DeepL).
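For the curious, here is a minimal sketch of BERTScore using the bert-score package (the choice of package and language model is mine); it gives partial credit for near-synonyms via embedding similarity, but it still measures closeness to a reference rather than correctness. The two sentences are taken from the examples above.

```python
# Minimal sketch of BERTScore with the bert-score package (package and model
# choice are assumptions). Tokens are matched by embedding similarity, so
# synonyms receive partial credit -- but the result is still a similarity to
# the reference, not a judgment of whether the translation is correct.
from bert_score import score

candidates = ["Ich habe mir ihr Wissen angeeignet."]    # DeepL's rendering from above
references = ["Ich habe mir bei ihnen Ideen geholt."]   # one of the human renderings

P, R, F1 = score(candidates, references, lang="de")     # lang="de" selects a multilingual model
print(F1.mean().item())                                 # higher = more similar to the reference
```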
Of course, these examples are more telling if you understand German. I plan to write a corresponding blog post on translation into English, where such mistranslations will be more obvious to English speakers.
The Upshot: Careful When Extrapolating from Metrics to Translation Quality
Automatic quality metrics do not always measure quality; rather, they measure similarity. Great human translations that are idiomatic and grammatically correct are penalized if they use synonyms of the words in the reference text. Quality metrics that require a human in the loop are more relevant. However, as one can see above, the claim that machine translation has reached or even exceeded human parity is clearly exaggerated. Of course, machine translation quality strongly depends on the subject matter and use case. I am sure that Bing, DeepL, Google, and others are much better at translating repetitive user manuals than quality journalistic articles. Nevertheless, sweeping statements of human parity are a gross misrepresentation, especially when the machine output is compared with the translations of experienced professionals.