Category: Neural machine translation

What is neural machine translation? How does NMT work? Will it replace human translators? Do machines think? Is the robot apocalypse near?


If you are in any way connected with the world of translation and interpretation, you have certainly asked yourself at least one of the above questions about neural machine translation (NMT). These questions are by no means easy to answer. If you ask n experts, you’ll likely get n+1 different answers. Let me quote a few experts:


Handouts for ATA59 – An Introduction to Neural Machine Translation

I will be giving another presentation at the upcoming ATA Annual Conference in New Orleans, ATA59, jointly in the SciTech and Language Technology tracks. The presentation will give an introduction to neural machine translation. My talk is preliminarily scheduled for the very last time slot on Saturday before the final keynote. I hope to see you there, despite the late hour!

Abstract

“The end of the human translator,” “nearly indistinguishable from human translation” – these and similar headlines have been used to describe neural machine translation (NMT). Most language specialists have probably asked themselves: How much of that is hype? How far can this approach to machine translation really go? How does it work? The presentation will examine one of the available open source NMT toolkits as an illustrative example to explain the underlying concepts of NMT and sequence-to-sequence models. It will follow in the same spirit as last year’s general introduction to neural networks, which is summarized in the accompanying handouts.

Handouts

I have just uploaded the handout for the presentation onto the ATA server. The material is a slightly updated version of my blog post on neural networks, which summarizes my presentation at ATA58. You can download the handout here.

An Introduction to Neural Networks

For the readers who have been wondering whether I have made any progress with my neural machine translation project: indeed, I have. I have successfully installed and run OpenNMT with the default settings as in the tutorial, though the resulting translations were fairly terrible. This was to be expected, since a whole lot of fine-tuning and high-quality training corpora are necessary to obtain a translation engine of reasonable quality. However, as a proof of concept, I am fairly impressed with the results. As the next step, I am planning on tinkering with the various individual components of the NMT engine itself as well as the training corpus to improve the translation quality, as much as one can without some major programming work. But, before going into all the gory details in future blog posts, let’s first have a look at a relatively simple artificial neural network. I presented this example at the 58th Annual ATA Conference; the full slide set can be found here.

Neurons and Units

In the following, unless explicitly stated otherwise, the term “neural network” refers to artificial neural networks (ANNs), as opposed to biological neural networks.

ANNs are not a new idea. The idea has been floating around since the 1940s, when researchers first attempted to create artificial models of the human brain. However, back then, computers were the size of whole rooms and consisted of fragile vacuum tubes. Only in the last decade or so have computers become small and powerful enough to put these ideas into practice.

Like biological brains, which are composed of neurons, ANNs are composed of individual artificial neurons, called units. Their function is similar to biological neurons, as shown in the figures below. Fig. 1 shows a biological neuron, whose precise function is very complicated. Loosely speaking, the neuron consists of a cell body, dendrites, and an axon. The neuron receives input signals via the dendrites. When these input signals reach a certain threshold, an electrochemical process takes place in the nucleus, and the neuron transmits an output signal via the axon.


Fig. 1: Biological neuron. Source: Bruce Blaus, https://commons.wikimedia.org/wiki/File:Blausen_0657_MultipolarNeuron.png



Fig. 2: Unit in artificial neural net

Fig. 2 shows a model of a very simple artificial unit. It also receives inputs (labeled x1 and x2), and an activation function (the white blob in Fig. 2) transmits an output signal according to the inputs. The activation function can be a simple threshold function: the unit stays off until the sum of the input signals reaches a certain threshold, and then transmits an “on” signal once that threshold is exceeded. However, the activation function can also be much more complicated. As described so far, this artificial unit does not do anything particularly interesting. Interesting behavior emerges when the inputs are weighted differently according to their importance. An artificial neural net “learns” by adjusting the weights (labeled w1 and w2 in Fig. 2), that is, the importance of the input signals into each unit, according to some automated algorithm. In Fig. 2, input x1 is twice as important as input x2, as illustrated by the relative thickness of the input arrows.
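For the programmatically inclined, the behavior of such a unit can be sketched in a few lines of Python/NumPy. The weights and the threshold value below are made up purely for illustration:

```python
import numpy as np

def unit(inputs, weights, threshold=1.0):
    """A minimal artificial unit: weighted sum of the inputs, then a simple threshold activation."""
    weighted_sum = np.dot(inputs, weights)
    return 1.0 if weighted_sum >= threshold else 0.0

# Two inputs x1 and x2; w1 is twice as large as w2, as in Fig. 2.
x = np.array([1.0, 1.0])
w = np.array([0.8, 0.4])
print(unit(x, w))  # prints 1.0, since the weighted sum (1.2) exceeds the threshold
```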

Layers and Networks

Similar to biological brains, these units are assembled into a neural network, as shown in Fig. 3. More precisely, the figure shows a so-called feed-forward neural net.


Fig. 3: Artificial neural network. Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg

ANNs generally consist of an input layer, one or more hidden layers, and an output layer. Each layer consists of one or more of the units described above. Neural networks with more than one hidden layer are called “deep” neural nets. Each unit is connected with one or more other units (indicated by the arrows in Fig. 3), and each connection is given more or less importance through an associated weight. In a feed-forward neural net, as shown in Fig. 3, a unit in a specific layer is only connected to units in the next layer, not to units within the same layer or in a previous layer, where “next” and “previous” refer to the order in time in which signals pass through the network. In Fig. 3, this arrow of time flows from left to right. There are also so-called recurrent and convolutional neural networks, where the connections are more complicated; however, the main idea is the same. The middle layer in Fig. 3 is hidden because it has no direct connections to inputs or outputs, whereas the input and output layers communicate directly with the external world.
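As an illustration, a feed-forward pass through such a network takes only a few lines of Python/NumPy. The layer sizes, the random weights, and the sigmoid activation below are illustrative assumptions, not taken from any particular system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights_per_layer):
    """Propagate an input vector through a feed-forward net, one layer at a time."""
    activation = x
    for W in weights_per_layer:
        activation = sigmoid(W @ activation)  # each row of W holds one unit's input weights
    return activation

# Illustrative sizes: 3 input units, 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-1.0, 1.0, size=(4, 3))   # input -> hidden weights
W_output = rng.uniform(-1.0, 1.0, size=(2, 4))   # hidden -> output weights
print(feed_forward(np.array([0.5, -0.2, 0.9]), [W_hidden, W_output]))
```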

Training and Learning

The assembled neural network “learns” by adjusting the various weights, which are typically numbers between -1.0 and +1.0, although other values are of course possible. The weights are adjusted according to a specific training algorithm.

The training of a neural network typically proceeds as follows: A set of inputs is fed into the input layer of the neural network. This input is then propagated through the network in accordance with the weights (connections) and the activation functions. The final output at the output layer is then compared to the desired output according to a specific metric. Finally, the weights throughout the network are adjusted depending on the difference between the actual output and the desired output as measured by the chosen metric. Then the entire process is repeated, usually many thousands or millions of times, until the output is satisfactory. There are many possible algorithms for adjusting the weights, but a description of these algorithms goes beyond the scope of this article.
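To make that loop concrete, here is a crude but runnable sketch in Python/NumPy. It adjusts each weight using a finite-difference estimate of how the metric (here a mean squared error) changes, rather than one of the real training algorithms, and the tiny network and toy data are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Feed one input through a small two-layer network."""
    return sigmoid(W2 @ sigmoid(W1 @ x))

def loss(W1, W2, inputs, targets):
    """The chosen metric: mean squared difference between actual and desired outputs."""
    return np.mean([np.sum((forward(x, W1, W2) - t) ** 2) for x, t in zip(inputs, targets)])

# Made-up toy data: two 3-dimensional inputs with 1-dimensional desired outputs.
inputs  = [np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.5, 0.0])]
targets = [np.array([0.9]), np.array([0.1])]

rng = np.random.default_rng(1)
W1 = rng.uniform(-1.0, 1.0, size=(4, 3))   # input -> hidden weights
W2 = rng.uniform(-1.0, 1.0, size=(1, 4))   # hidden -> output weights

learning_rate, eps = 0.5, 1e-5
for step in range(2000):                              # repeat many times
    for W in (W1, W2):                                # visit every weight in the network
        for idx in np.ndindex(W.shape):
            base = loss(W1, W2, inputs, targets)      # how far off are we right now?
            W[idx] += eps                             # nudge one weight a tiny bit...
            nudged = loss(W1, W2, inputs, targets)    # ...and see how the metric reacts
            W[idx] -= eps                             # undo the nudge
            W[idx] -= learning_rate * (nudged - base) / eps   # adjust the weight downhill

print(loss(W1, W2, inputs, targets))   # the error should be small after training
```

Real training algorithms compute these adjustments far more efficiently, but the overall loop of feeding forward, measuring, adjusting, and repeating is the same.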

An Example

As a concrete example, let’s look at a fairly simple neural network that recognizes handwritten digits. A sample of the inputs is shown in Fig. 4. I programmed this simple feed-forward neural net for Andrew Ng’s excellent introductory course on Machine Learning, which I highly recommend.


Fig. 4: Sample handwritten digits

The architecture of the neural network has the same structure as in Fig. 3, but with 400 input units, since the input picture files have a size of 20 x 20 grayscale pixels (= 400 pixels). There are 25 units in the hidden layer and 10 output units, one for each digit from 0 to 9. This means that there are 10,000 connections (weights) between the input layer and the hidden layer (400 x 25) and 250 connections between the hidden layer and the output layer (25 x 10). In other words, we have 10,250 parameters in total! For the technically interested: the activation function here is a simple sigmoid.


Fig. 5: ANN for handwritten digit recognition
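In code, this architecture boils down to just two weight matrices. The following is a minimal sketch (bias terms are left out so that the count matches the 10,250 parameters mentioned above):

```python
import numpy as np

# Layer sizes as described in the text: 20 x 20 = 400 inputs, 25 hidden units, 10 outputs.
n_input, n_hidden, n_output = 400, 25, 10

W1 = np.zeros((n_hidden, n_input))    # 25 x 400 = 10,000 weights between input and hidden layer
W2 = np.zeros((n_output, n_hidden))   # 10 x 25  =    250 weights between hidden and output layer
print(W1.size + W2.size)              # 10250 parameters in total

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(image_20x20):
    """Flatten a 20 x 20 grayscale image, feed it forward, and let the most active output unit win."""
    x = image_20x20.reshape(n_input)
    output = sigmoid(W2 @ sigmoid(W1 @ x))
    return int(np.argmax(output))     # index of the winning output unit
```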

The training proceeded exactly as described above. I fed in batches of several thousand labeled 20×20 grayscale images, as shown in Fig. 4, and trained the net with an algorithm called backpropagation, which adjusts the weights according to how far the output is from the desired label from 0 to 9. The result was remarkable, especially considering that the whole program amounted to only a couple dozen lines of code.
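For the curious, a single backpropagation step for this architecture can be sketched as follows. This is only an illustration of the idea, assuming a sigmoid activation and a squared-error metric; the actual implementation for the course differs in its details (cost function, bias units, batching):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, learning_rate=0.1):
    """One training step: feed forward, compare with the desired output, push the error back."""
    a1 = sigmoid(W1 @ x)                          # hidden activations (25 values)
    a2 = sigmoid(W2 @ a1)                         # output activations (10 values)
    # Backward pass for the squared-error metric 0.5 * ||a2 - y||^2
    delta2 = (a2 - y) * a2 * (1.0 - a2)           # error signal at the output layer
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)    # error signal pushed back to the hidden layer
    W2 -= learning_rate * np.outer(delta2, a1)    # adjust hidden -> output weights
    W1 -= learning_rate * np.outer(delta1, x)     # adjust input -> hidden weights
    return 0.5 * np.sum((a2 - y) ** 2)

# Hypothetical usage with a single image labeled "3" (random numbers stand in for real pixels):
rng = np.random.default_rng(42)
W1 = rng.normal(0.0, 0.1, size=(25, 400))
W2 = rng.normal(0.0, 0.1, size=(10, 25))
x = rng.random(400)                   # stand-in for a flattened 20 x 20 grayscale image
y = np.zeros(10); y[3] = 1.0          # desired output: only the unit for "3" should fire
for _ in range(100):
    error = backprop_step(x, y, W1, W2)
print(error)                          # the error shrinks as the weights are adjusted
```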

But How Does It Work?

The fact that it works is remarkable and also somewhat unsettling, because all I did was program the activation function, specify how many units are in each layer and how the layers are connected, and specify the metric and the backpropagation; the neural net did all the rest. So, how does this really work?

Autopsy of a neural network


To be honest, even after doing some complicated probabilistic and statistical ensemble calculations, I still did not understand how these fairly simple layers of units with fairly straightforward connections could possibly manage to discern handwritten digits. So I went on to “dissect” the above neural net layer by layer and pixel by pixel. Here is what actually happens to the input after the neural net has been trained successfully.

The first set of weights between the input layer and the hidden layer can be thought of as a set of filters, which essentially filter out important patterns or features. If one plots only this first set of weights, one can visualize a set of 25 “filters,” as shown in Fig. 6. These filters map the input onto the 25 hidden units in the hidden layer. Fig. 7 shows what happens if you map a specific input, in this case a handwritten “0,” onto the hidden layer.


Fig. 6: First set of weights, acting as a “filter.”


Fig. 7: Mapping of 0 to hidden units.
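The filter plots in Fig. 6 can be reproduced with a few lines of Python: each row of the first weight matrix is simply reshaped back into a 20 x 20 image. In the sketch below, random numbers stand in for the trained weights, so it only shows the mechanics of the visualization:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the trained first weight matrix of shape (25, 400).
W1 = np.random.default_rng(7).normal(size=(25, 400))

# Each row of W1 holds the 400 input weights of one hidden unit; reshaped to 20 x 20 pixels,
# it shows which input pattern that unit responds to most strongly, i.e. its "filter".
fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for ax, row in zip(axes.ravel(), W1):
    ax.imshow(row.reshape(20, 20), cmap="gray")
    ax.axis("off")
plt.show()
```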

The output of the hidden layer is then piped through another filter, as shown in Fig. 8, and mapped onto the final output layer via this filter/set of weights. Fig. 8 shows how the input picture with the digit “0” is correctly mapped onto the output unit for the digit “0” (at the bottom, because the program displays the outputs vertically from 1 at the top through 9, with 0 at the bottom).


Fig. 8: Mapping of input, here a “0”, to output via hidden layer

More examples of this filtering or mapping via the internal sets of weights are visualized in my slide set for ATA58 and also in Fig. 9.


Fig. 9: Mapping of input “2” to output

Again, the internal weights act as a sort of filter to pick out the features of interest. Naively, I would have expected these features or patterns of interest to correspond to vertical and horizontal lines, for example for the digits 1, 4, or 7, or to various arcs and circles, for digits like 3 or 8 or 0. However, this is evidently not at all how the network picks out digits, as can be seen from the visualization of the first set of weights in Fig. 6. The patterns and structures that the neural net filters out are clearly much more complex than simple lines or arcs. This is also the reason for the “detour” via the hidden layer. A direct mapping from input to output, even with an internal convolution, would not be sufficient to pick out all the information that is necessary to distinguish one character from another. Similarly, for more complex tasks, more than one hidden layer will be needed. The number of hidden layers and units as well as their connections/weights grows with the complexity of the task.

Summary

This blog post aimed to explain the inner workings of a simple neural net by visualizing the internal process. Neural networks for other applications, including machine translation, work in much the same way. Of course, most of these have more than one hidden layer, possibly pre- and post-processing of input and output data, more sophisticated activation functions, and a more complicated architecture, such as recurrent or convolutional neural nets. However, the basic idea remains the same: the underlying function of an artificial neural net is simply pattern recognition. No more, no less. Well-trained ANNs are extraordinary, and unquestionably better than humans at the pattern-recognition tasks they are trained for, because they never get tired or have lapses of concentration. But one should never forget that they are remarkably ill-suited for anything that goes beyond those tasks: sometimes they detect patterns that are not there, and sometimes the task simply cannot be cast into a pattern, however complicated. In other words, while ANNs certainly exceed their programming, they can never exceed their training. (At least until the so-called technological singularity is upon us.)

Slides for presentation at ATA58, ST-7, “An Introduction to Artificial Intelligence, Machine Learning, and Neural Networks”

You can download the slides for my presentation here.
(© All rights reserved, though I am happy to share a version with higher resolution or give specific permission for reuse of the slides upon request.)

Abstract:

From spam filters to stock trading bots, the applications of artificial intelligence are already omnipresent. This poses important questions such as: Will my autonomous vacuum cleaner go on a rampage and eat the hamster? Do neural networks think like brains? What are the chances of a robot uprising? The presentation will address these questions and give an introduction to artificial intelligence, which is impacting all our lives, perhaps more than most people are aware of. However, the talk will not discuss machine translation and related topics. No knowledge of computer science or advanced mathematics is required to attend.

My Neural Machine Translation Project – Overview of Open-Source Toolkits – updated Dec 2017

Updated: December 2017

Before deciding on a toolkit, I needed to get an overview of the various open-source neural machine translation (NMT) toolkits available at the time of writing (September 2017). In the following, I summarize the features of the various toolkits from my point of view. Note that this summary does not include open-source MT toolkits such as Moses, which is based on a statistical approach. I will mainly summarize the impressions I got after lurking on the various support discussion forums and groups for a while.

The big kahuna – TensorFlow

Provided by: Google (TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.)
Website: https://www.tensorflow.org
Language: Python (main API), with APIs available for C, Java, and Go; however, the latter seem to offer somewhat less functionality
Architecture: Since TensorFlow is a whole framework, both recurrent and convolutional neural networks are available.
White paper: Large-Scale Machine Learning on Heterogeneous Distributed Systems, M. Abadi et al., Nov. 9, 2015
Support: Stack Overflow for technical questions; a Google group (what else?) for higher-level discussions about features etc., although some technical questions are also discussed there; and a blog announcing new features and tutorials
Summary: TensorFlow is a large-scale, general-purpose open-source machine learning toolkit, not necessarily tailored for machine translation, but it does include tutorials on vector word representations, recurrent neural networks, and sequence-to-sequence models, which are the basic building blocks for a neural machine translation system. TensorFlow also provides various other neural network architectures and a vast number of features one could play around with for language learning and translation. Definitely not a plug-and-play system for beginners.

The more user-friendly one – OpenNMT

Provided by: Harvard University and Systran
Website: http://opennmt.net
Language: Lua, based on the Torch framework for machine learning; there exist two “light” versions using Python/PyTorch and C++
Update: As of December 2017, the main Lua version is now accompanied by a full-fledged Python version, based on the PyTorch framework, and a version based on the TensorFlow framework.
Architecture: Recurrent neural network
White paper: OpenNMT: Open-Source Toolkit for Neural Machine Translation, G. Klein et al., Jan 10, 2017
Support: a very active discussion forum (where, among other people, Systran’s CTO is very involved)
Summary: More suited for machine learning beginners, although the choice of the programming language Lua, which is not that widely used, may be a bit of a hurdle. Update December 2017: Since there are now two other versions, based on Python and TensorFlow, this should no longer be an issue. End update. On the other hand, there are lots of tutorials and step-by-step instructions, and some of the questions asked in the forum are indeed quite elementary (and I'm far from an expert!). Thus, if one wants to play around with inputs (that is, well-chosen corpora!) and with various metrics and cost functions for the output, this is the toolkit to choose. In machine translation systems, input and output are just as critical as the architecture itself, if not more so, because for neural networks, and thus also for neural machine translation systems, the old adage "garbage in, garbage out" is particularly true. It may therefore make more sense for linguists and translators to approach the machine translation problem from the angle of the input (corpora) and output (translation "quality" metrics), instead of getting lost in the architecture and the code.

The newer kid on the block – Nematus

Provided by: University of Edinburgh
Website: Not really a website, but the project, along with documentation and tutorials, is here on GitHub.
Language: Python, based on the Theano framework for machine learning
Architecture: Recurrent neural network
White paper: Nematus: a Toolkit for Neural Machine Translation, R. Sennrich et al., Mar 13, 2017
Support: a Google group
Summary: This is the third kid on the block, not as active as the other two above. Like OpenNMT, it is a toolkit only for language translation, as opposed to the general-purpose TensorFlow framework. It uses the better-known Python as opposed to Lua, which would be an advantage, at least for me, over OpenNMT. However, the user base does not seem quite as extensive or active as OpenNMT’s. Thus, at the time of writing, Nematus seems to be an option to keep in mind, but not necessarily the first choice.

The brand new kid on the block – Sockeye

Provided by: Amazon
Website: The main website is http://sockeye.readthedocs.io/en/latest/. In addition, a tutorial on how to use Sockeye has been published on Amazon's AWS (Amazon Web Services) AI blog: https://aws.amazon.com/fr/blogs/ai/train-neural-machine-translation-models-with-sockeye/.
Language: Python, built on the Apache MXNet framework for machine learning
Architecture: Recurrent neural network
White paper: SOCKEYE: A Toolkit for Neural Machine Translation, F. Hieber et al., Dec 15, 2017
Support: Aside from the website with documentation and a FAQ, there is the general AWS Discussion Forum.
Summary: The newest open-source NMT toolkit is geared towards advanced users who are already familiar with the AWS and MXNet setup. On the other hand, as with Google's TensorFlow, many architectures and advanced options are available, and therefore many more possibilities for experimentation.

Another big one – Fairseq

Provided by: Facebook
Website: Not really a website, but the GitHub repository is here: https://github.com/facebookresearch/fairseq, and a tutorial on how to use Fairseq has been published on Facebook's code blog: https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/.
Language: Lua, built on the Torch framework for machine learning
Architecture: Convolutional neural network
White paper: Convolutional Sequence to Sequence Learning, J. Gehring et al., May 8, 2017
Support: A Facebook group (what else?), and a Google group.
Summary: This is another open-source toolkit for advanced users: for one thing, it is also based on Lua, which is more esoteric than Python; for another, the intended user base seems to be advanced researchers rather than curious end users. It is also fairly new.