If you are in any way connected with the world of translation and interpretation, you have certainly asked yourself at least one of the above questions about neural machine translation (NMT). These questions are by no means easy to answer. If you ask n experts, you’ll likely get n+1 different answers. Let me quote a few experts:
Wikipedia explains (as of April 16, 2021, 13:54 UTC):
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
Jost Zetzsche writes:
Narrow AI [artificial intelligence] is the ability of a machine to non-concurrently process large amounts of data and make predictions exclusively on the basis of that data. That’s what we have today, and computers are incredibly good at it. Much better than we are.
And Edith Vanghelof writes in her summary of the TC42 Conference on Translating and the Computer for the UNIVERSITAS Mitteilungsblatt 1/21 (p. 14):
In the end I did not really learn how neural machine translation works, but I did learn a lot how machine translation is changing the market in which translators work.
These statements are all correct, and you can find many more explanations on- and offline. I myself have written several blog posts on the topic of neural machine translation. So why am I attempting yet another explanation? Because I think I can summarize all these answers to the questions in the headline in two statements:
What is neural machine translation?
Neural machine translation is nothing other than pattern matching.
How does NMT work exactly?
Nobody knows.
Now, these are some very strong statements that need some qualifications and certainly an explanation.
Neural Machine Translation and Pattern Matching
What is Machine Learning?
Let’s step back a bit and look at a few more general terms that are often used in connection with neural machine translation: artificial intelligence and machine learning. What is machine learning? Machine learning is the ability of an algorithm, a machine, to self-adjust some internal parameters to produce a desired output, given a specific input. These internal parameters are not pre-programmed; the machine “learns” them by trial and error. Sometimes this learning by trial and error involves an extremely sophisticated adjustment algorithm, but in the end, it is still trial and error.
Now, what is a parameter? For the non-mathematically inclined, you can think of a parameter as a virtual knob that can be adjusted (see Fig. 1a). This knob can be external (an input or output parameter) or internal to the algorithm.
Perhaps some of you may recall way back in math or science class, when we were given a couple of points distributed in a plane, and we had to find the straight line that best fit these points (see Fig. 1b). Some of us were unfortunate enough to have to calculate this straight line by hand (human learning). The more fortunate ones were allowed to use a computer with Excel or another program to find the best fit straight line. These fortunate ones used machine learning to find that straight line. In the case of a 2D straight line, the machine learned two parameters (slope and location of the line). It is fairly easy to picture this in 3D as well. More graphically talented people than me can perhaps extend this image to four and even five dimensions. But thereafter, things become hard to picture.

Fig. 1a) Adjustment knob

Fig. 1b) Best fit line through points
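For readers who like to see the two knobs being set, here is a minimal sketch of the best-fit line in Python. It uses the standard least-squares formulas; the function name `fit_line` and the sample points are made up for illustration and are not taken from any particular program.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    # The best-fit line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up sample points scattered roughly around y = 2x + 1
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.0)]
slope, intercept = fit_line(points)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The two numbers returned are the two parameters mentioned above: the slope and the location (intercept) of the line.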
How about a whopping 175 billion parameters? This is how many machine learning parameters GPT-3 has, the currently most powerful artificial intelligence language model, which is said to produce “human-like text.” If a human were to adjust one parameter per second, for example, that human would spend an equivalent of 5549 years to adjust all parameters of GPT-3 only once! Even if that human could adjust 175 parameters per second, they’d still spend over 31 years doing nothing but adjusting GPT-3’s parameters. That is a lot of parameters!
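The arithmetic behind these numbers can be checked in a few lines of Python. This is only a back-of-the-envelope sketch; it assumes 365-day years without leap days, and the helper name `years_to_adjust` is made up for illustration.

```python
PARAMETERS = 175_000_000_000            # GPT-3 parameter count quoted above
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumption: 365-day years, no leap days

def years_to_adjust(parameters, per_second):
    """Years needed to touch every parameter once at the given rate."""
    return parameters / per_second / SECONDS_PER_YEAR

one_per_second = years_to_adjust(PARAMETERS, 1)
fast = years_to_adjust(PARAMETERS, 175)
print(f"{one_per_second:.0f} years at 1 parameter per second")
print(f"{fast:.1f} years at 175 parameters per second")
```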
What is Artificial Intelligence? Can Machines Think?
Artificial intelligence (AI) is the generalization of machine learning, whereby the goal is to emulate human thinking capability. There are two categories of AI: general AI and narrow AI. Narrow AI focuses on a specific task or set of tasks, whereas general AI, as the name implies, is more general or universal than that. Current AI systems are all narrow AI systems.
Artificial intelligence systems attempt to model the biochemical processes in the animal or human brain, to imitate neurons in the brain, and to connect these artificial neurons in complicated neural networks that are mutually connected in deep, complex layers. This is why this approach to artificial intelligence is also called deep learning. These complex connections between artificial neurons (also called nodes or units) and the neurons/nodes themselves are the parameters mentioned above.
Neural networks are fed an input; the machine spins its virtual wheels and produces an output. Then the machine adjusts its many, many parameters to get closer to a desired output. This is usually done iteratively, i.e. you feed in some input and the machine produces an output. If the output is too far from the goal, the internal parameters are adjusted and the whole learning process is repeated until the output converges on some desired goal. This is called training. Naturally, with hundreds of millions if not billions of parameters, this process uses up a lot of computing resources and time. You also need a very high volume of training data to be able to adjust all these parameters properly. This is the reason why neural networks didn’t really take off until about a decade ago, although the basic concepts date back to the middle of the last century. Only recently have computers become sufficiently powerful and enough data become available.
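To make the training loop a bit more concrete, here is a toy sketch in Python with only two parameters instead of billions. It uses gradient descent, a common way of deciding in which direction to turn each knob; the function name `train`, the learning rate, the stopping tolerance, and the sample data are all assumptions chosen for illustration, not a description of how any real NMT engine is trained.

```python
def train(points, learning_rate=0.05, max_steps=10_000, tolerance=1e-6):
    """Fit y = slope * x + intercept by repeated small parameter adjustments."""
    slope, intercept = 0.0, 0.0  # arbitrary starting positions of the two knobs
    n = len(points)
    for step in range(max_steps):
        # Produce an output for every input and compare it with the desired output
        errors = [(slope * x + intercept) - y for x, y in points]
        loss = sum(e * e for e in errors) / n  # mean squared error
        if loss < tolerance:
            break  # close enough to the goal: training has converged
        # Gradient of the mean squared error with respect to each parameter
        grad_slope = 2 * sum(e * x for e, (x, _) in zip(errors, points)) / n
        grad_intercept = 2 * sum(errors) / n
        # Turn each knob a little in the direction that reduces the error
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept, loss

# Made-up training data lying exactly on y = 2x + 1
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
slope, intercept, loss = train(training_data)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, loss = {loss:.2e}")
```

Even this tiny example needs a few hundred rounds of feed-in, compare, and adjust, which hints at why training with billions of parameters is so expensive.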
Once trained, the neural network can produce an output, given a specific input. In some cases, so-called adaptive learning is applied and the parameters are retrained almost on the fly. Does that mean the machine is thinking? Certainly not. If you must anthropomorphize the discussion, you could say that the current state of artificial intelligence is in some sense equivalent to animal and human instinct. Given an input, an action is performed based on a pattern that’s inherently coded into the brain, or into the machine. Given a hitherto unseen input that’s not similar to an input that was previously encountered (training data), an action is again performed based on the pre-coded pattern in the brain or the machine, but there is no adjustment of parameters to the novel input (“thinking”).
Does that mean that machines will be able to think at some point? That is a question that currently nobody can answer, because nobody really knows precisely how the biological brain works. Of course, neuroscientists know how the biochemical processes inside the brain work. But nobody can really explain where the biochemical reactions end and the thinking begins.
Neural Machine Translation and Context
Recently, claims have surfaced that NMT engines understand context. Let’s recap: NMT engines do nothing but pattern matching, admittedly extremely complex pattern matching. Older so-called statistical machine translation engines only looked at a very limited set of word clusters next to each other (so-called n-grams), as illustrated in Fig. 2 below. As you can see, this illustration fits on a two-dimensional graphic, showing that statistical machine translation engines did not really take context into account.

Fig. 2: Graphical illustration of an English-German translation process. From G. M. de Buy Wenninger, K. Sima’an, PBML No. 101, April 2014, pp. 43.
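If you have never seen n-grams spelled out, the following Python sketch lists the word clusters a statistical engine would look at for a short sentence. The helper name `ngrams` and the example sentence are made up for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence, split on whitespace
sentence = "the dog chased the cat".split()
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

Each cluster only ever contains immediate neighbors, so a word at the start of a long sentence never appears in the same cluster as a word at the end.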
How about NMT engines with their millions to billions of parameters that are connected in deep, complex layers? You can conceivably draw a three-dimensional embedding, and perhaps even four or even five dimensions, but beyond that, things become hard to picture. Below I’m showing a movie (from https://projector.tensorflow.org) that attempts to display how words are embedded in these high-dimensional models of patterns with hundreds of millions of parameters. You can see that some words are closer to each other, some are further apart. The closer words are together, the more related they are to each other. In other words, these patterns take context into account, at least in some sense of the word. Does that mean the machine “understands” what it is doing? No.
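What "closer together" means can be sketched in a few lines of Python. Cosine similarity is one common way to measure how closely two word vectors point in the same direction. The three-dimensional vectors below are invented purely for illustration; real embeddings are learned during training and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", NOT taken from any real model
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(f"dog vs cat: {dog_cat:.2f}")
print(f"dog vs car: {dog_car:.2f}")
```

With these made-up numbers, "dog" ends up much closer to "cat" than to "car". Note that this is still nothing but arithmetic on patterns of numbers.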
Neural Machine Translation – Unpredictability
Let’s recap again: NMT is pattern matching, extremely complex pattern matching with millions or even billions of parameters. Naturally, the machine will be able to recognize patterns that humans are not consciously aware of, and sometimes the machine recognizes patterns that aren’t even there.
This means that given an input that is the same as or similar to a previously seen input in the training data, or a mixture of previously seen inputs, the machine will be able to produce an output that is very close to or exactly the desired output. If the machine, however, is given an input that is very different from a previously seen input, the output will be unpredictable. Yes, you read correctly, the output is entirely unpredictable even for the people who programmed and trained the algorithm. Of course, one can calculate the general state of the neural network, but with hundreds of millions to billions of interconnected parameters, it is impossible to calculate the precise state of each of these parameters and therefore the precise output.
Figure 4 shows some screenshots of Google Translate from July 2018 (the strange behavior has since been corrected and cannot be reproduced anymore at the time of writing). At that time, when you input the word “dog” several times and insisted on the source language of “Maori,” Google Translate predicted the end of times in English. Of course, the input is entirely nonsense, and a human will tell you so. But a machine will produce an output, and that output is entirely unpredictable.

Fig. 4a) Google Translate, Maori to English, July 2018.

Fig. 4b) Google Translate predicts the end of times, Maori to English, July 2018.
Summary
Neural machine translation, and neural networks in general, are nothing but pattern matching engines. The patterns are highly complex, with interconnected parameters of the order of hundreds of millions to billions. Thus, when a neural network encounters an unforeseen input, the output is equally unpredictable.
This means specifically for machine translation:
- Garbage in – garbage out still holds in the age of (narrow) artificial intelligence. An engine is only as good as its training material. The more parameters the engine has, the better the engine, but the more training material is needed. And the better the training material, the better the engine. Guess where that training material comes from? Yes, humans. Therefore, human experts will always be necessary to train the engine.
- NMT engines will meet or exceed human parity for linguistic tasks that are highly repetitive. Humans make mistakes; machines don’t.
- For certain types of text, NMT will fail. These text types include highly creative texts such as literature and marketing texts (although some advertisements can seem very repetitive), and in general texts that contain real novelties, e.g. inventions that are not merely incremental improvements to existing state of the art.
- NMT output is lexically less varied than texts produced by humans. Even billions of parameters still constitute a finite set, encoding a finite amount of vocabulary. This is exacerbated by certain quality metrics that are used to train and evaluate many machine translation engines. This is a whole blog post by itself, to be addressed in a future article. By contrast, human imagination is virtually unlimited, and so are texts produced by humans.
The above is a high-level explanation of neural machine translation. A deeper but non-technical explanation of neural networks as they are used in machine translation can be found here.
I have now explained why neural machine translation is nothing but pattern matching. But what happens when the machine matches the wrong pattern or a pattern that doesn’t exist? This is a topic for a future blog post. Stay tuned!
Copyright secured by Digiprove © 2021 Carola F Berger