Blog

Checking your website’s security, sources, and cookies – GDPR preparation part II

A week ago, I discussed the installation of an SSL certificate on an existing WordPress site in preparation for the European General Data Protection Regulation (GDPR), which becomes enforceable in May 2018. Today I want to explain how to check your website’s security, its sources, and whether it sets any first- or third-party cookies. This is important for writing a GDPR-compliant privacy statement, which every website that processes data of European Union citizens needs to include, regardless of whether the website provider is located in the European Union or not.

The easiest way to check your website’s security, sources, and cookies is to download Google Chrome. Once it is installed, open the Developer Tools, as shown in the screenshot below.

Google Chrome Developer Tools

The window will then be split into two parts: one displays the usual browser window, the other a range of developer options. Go to your website, which should now be reachable via https://, and check the Security tab in particular; it will display any insecurely loaded elements that your website may still be loading and that you may have missed when adapting your site to the new SSL certificate. Also check the Sources tab, which lists all data sources loaded by your website, and the Cookies section, which shows any cookies your site may set, both first- and third-party. As you can see from the screenshots below, my site does not load any third-party objects other than the standard fonts, and it also does not set any cookies; at least the homepage does not.

Inspection of website sources

Inspection of website for cookies
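
If you want a quick, scriptable complement to the manual check in the browser, a small script can flag obvious insecure references in a page’s HTML. Below is only a minimal sketch in Python: the URL is a placeholder, the approach only catches references visible in the static HTML source (not resources injected by scripts), and it lists plain http:// links as well, so not every hit is necessarily mixed content.

import re
import urllib.request

# Placeholder URL: replace with a page of your own site.
URL = "https://www.example.com/"

# Fetch the raw HTML of the page.
with urllib.request.urlopen(URL) as response:
    html = response.read().decode("utf-8", errors="replace")

# Find src/href attributes that still point to plain http:// resources.
insecure = re.findall(r'(?:src|href)=["\'](http://[^"\']+)["\']', html)

for link in sorted(set(insecure)):
    print("Insecure reference:", link)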

Now, visit every single page on your site in turn and check these three items. In my case it turned out that I had missed one insecurely loaded element in the SSL adaptation, but none of my pages set any cookies. This means that my GDPR-compliant cookie policy will be fairly straightforward, because there are no cookies to declare.

SSL Installation on WordPress in Preparation for the GDPR

Recently, the Internet has been ablaze with information about the impending deadline for compliance with the European General Data Protection Regulation (GDPR). The GDPR is already in effect; however, the grace period for compliance ends on May 25, 2018. This means that all businesses processing data on EU citizens must comply with the GDPR by that date, regardless of whether they are located in the EU or not. This also means that you are likely impacted if you have a website that is visited by EU citizens and your website stores cookies and/or has a contact form and/or has means for visitors to leave comments, “likes,” etc. (which basically means any blog). The GDPR states that you need to “…implement appropriate technical and organisational measures … in an effective way … in order to meet the requirements of this Regulation and protect the rights of data subjects.”

In practice, in my opinion (I am not a lawyer!), this means, among other things, that for your website you probably need to:

  • Install an SSL certificate on your website, such that all web traffic is encrypted;
  • Update your website to include a disclaimer on your cookie and data protection policy.

Now, ideally, one would install an SSL certificate first and then set up the website, but alas, this is not what I have done with this website, which is based on WordPress. So I had to go through a few extra steps to make the SSL certificate work.

Step 1: Install SSL Certificate

This was the easy part, since all I had to do was to purchase an SSL certificate from my hosting provider, and the installation of the certificate was up to them.

Step 2: Edit the Settings of your WordPress Installation

This step is necessary so that all your permalinks point to https:// instead of http://. This can be accomplished by going to Settings > General and editing the WordPress Address and Site Address to point to https://; see the screenshot below.

https settings in WordPress

However, unfortunately, this was not the whole story, since my site contains quite a few pages and blog posts, complete with lots of images and uploads, which all still pointed to http:// instead of https:// internally. This meant that upon visiting my secured site (https://www.cfbtranslations.com instead of http://www.cfbtranslations.com), the browser didn’t show a nice (green) padlock in the address bar, but instead a broken lock, indicating partially insecure elements on the site.

Secure site indicated by padlock

A broken padlock means that portions of the site (links, images) still point to insecure elements, which means that these elements, for example images, are loaded via http:// instead of https://.

Broken padlock indicating insecure elements on website

Step 3: Change All Internal Links to https

In my case, getting the aforementioned insecure elements to load securely turned out to be the most cumbersome part. There are a number of WordPress plug-ins that claim to accomplish this task with the click of a button. Unfortunately, they all turned out to be incompatible with my theme or with some of the numerous plug-ins I use. If you don’t use any elaborate plug-ins and your theme is compatible, I suggest you simply search for plug-ins related to “SSL” and install the plug-in of your choice. In any case, please make sure you have a backup of your site in case things go awry and you need to restore the site to the condition it was in before installing the plug-in.

If, however, the plug-in of your choice does not accomplish the task, there is a second option. Install and activate the plug-in “Better Search Replace,” and then search for “http://www.yoursite.com” and replace it with “https://www.yoursite.com.” After this step, visiting your site via https:// should show a nice (green) intact padlock with no security warnings.

Step 4: Redirect http:// to https:// in Your .htaccess File

This step is necessary so that all visitors typing www.yoursite.com or yoursite.com without any of the prefixes are redirected automatically to the secure version of your site at https://www.yoursite.com. Now, every hosting provider has their own means to access and edit the .htaccess file in your home directory. Most hosting providers also have a recommended syntax for the https redirect, so please follow the instructions of your hosting provider.

In my case, I had to insert the following lines at the very top of the .htaccess file, before anything else:
# Enable the rewrite engine
RewriteEngine On
# If the request did not come in over HTTPS...
RewriteCond %{HTTPS} off
# ...redirect permanently (301) to the same host and path via HTTPS
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

That did the trick, and all visitors are now redirected to a safe and secure site. The second step to make my website GDPR-compliant is to check which cookies, if any, my site uses (first or third party), and to update my existing cookie and privacy policy page accordingly. However, this is the topic of a future blog post.

The curious case of the “Upwards Arrow With Tip Rightwards” in Trados Studio

A long struggle with a strange character that appeared in Trados Studio just came to a successful conclusion with the help of a wonderful unofficial Trados help group on Facebook. Here is the curious story of the “Upwards Arrow With Tip Rightwards” that had a number of power users stumped.

Once upon a time, I accepted a project to translate a patent from English into German, for which the source document was sent to me in the form of an innocent-looking Word file. However, after importing that document into Trados Studio 2017, the trouble began. The document was riddled with strange-looking symbols everywhere; see the two screenshots below.

Strange arrow appearing in hundreds of places in Trados Studio.


Sample source segment in Trados Studio, text is redacted for confidentiality.

I looked for these strange symbols in the source document, to no avail: they were not shown (and thus could not be searched for and replaced). After some back and forth in the aforementioned user group, I was able to determine that Studio treats these characters as whitespace characters, not as formatting or tags. According to this Wikipedia entry, the symbol itself is a so-called “upwards arrow with tip rightwards,” with the Unicode code point U+21B1. Searching for that code point in Word only resulted in errors (Mac version) or “not found” messages (Windows version). Various transformations, and attempts to save the source document in various other formats, led nowhere. Saving the entire document as plain text and then reimporting it into Word was not an option because of various intricate equations and other formatting that needed to be preserved.

After some more back and forth, thanks to the wonderful colleagues in the user group, we were able to determine that this is Studio’s way of displaying “left-to-right” bidirectionality marks. Such marks are completely superfluous in this document, which is entirely in English, and the overabundant appearance of these marks after every second word is definitely an error. In the Word for Windows version I was finally able to search for these invisible characters with “^h” and replace them with…nothing! (As an aside, the Mac version only produced an error message saying that “^h” is not a valid search term.) Saving the document with the bidirectionality marks thus removed resulted in a clean document. And I translated happily ever after.
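
For colleagues who prefer a scripted cleanup in similar cases, the same result can in principle be achieved outside of Word. The following is only a minimal sketch: it assumes the offending characters really are Unicode left-to-right marks (U+200E), it relies on the third-party python-docx library, and the file names are placeholders. It also only touches ordinary paragraph text, so tables, headers, footers, and text boxes would need extra handling.

# Remove left-to-right marks (U+200E) from the paragraph runs of a Word document.
from docx import Document

LRM = "\u200e"  # left-to-right mark

doc = Document("source.docx")  # placeholder file name
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        if LRM in run.text:
            run.text = run.text.replace(LRM, "")

doc.save("source_cleaned.docx")  # placeholder file name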

The end…

Scam warning: US National Court Interpreters Database (NCID) email scam

The following was sent to me by a colleague, and I am hereby spreading the word.

To: All Individuals Listed in the National Court Interpreter Database
Re: Interpreting-NCID Scam 

The Administrative Office of the United States Courts (AO) was advised on January 22nd, 2018 that interpreters in several states received a scam email requesting money in exchange for being listed in the judiciary’s National Court Interpreter Database (NCID), which is used by federal courts to locate contract interpreters. This message, purportedly sent by AO Director James C. Duff, requested that interpreters wire money through Western Union to an individual in Iowa. The email is fraudulent and not connected with the United States courts or with Director Duff. 

The judiciary has never required that individuals pay a fee to be listed in the NCID. To become listed in the NCID, follow the steps described here: http://www.uscourts.gov/services-forms/federal-court-interpreters/national-court-interpreter-database-ncid-gateway.

The AO is working to make the relevant law enforcement agencies aware of this scam and we ask that you notify us at NCID_Help@ao.uscourts.gov if you receive this fraudulent email.

An Introduction to Neural Networks

For the readers who have been wondering whether I have made any progress with my neural machine translation project: indeed, I have. I have successfully installed and run OpenNMT with the default settings as described in the tutorial, though the resulting translations were fairly terrible. This was to be expected, since a whole lot of fine-tuning and high-quality training corpora are necessary to obtain a translation engine of reasonable quality. However, as a proof of concept, I am fairly impressed with the results. As the next step, I am planning on tinkering with the various individual components of the NMT engine itself as well as the training corpus to improve the translation quality, as much as one can without some major programming work. But, before going into all the gory details in future blog posts, let’s first have a look at a relatively simple artificial neural network. I presented this example at the 58th Annual ATA Conference; the full slide set can be found here.

Neurons and Units

In the following, unless explicitly stated otherwise, the term “neural network” refers to artificial neural networks (ANNs), as opposed to biological neural networks.

ANNs are not a new idea. The idea has been floating around since the 1940s, when researchers first attempted to create artificial models of the human brain. However, back then, computers were the size of whole rooms and consisted of fragile vacuum tubes. Only in the last decade or so have computers become small and powerful enough to put these ideas into practice.

Like biological brains, which are composed of neurons, ANNs are composed of individual artificial neurons, called units. Their function is similar to that of biological neurons, as shown in the figures below. Fig. 1 shows a biological neuron, whose precise function is very complicated. Loosely speaking, the neuron consists of a cell body, dendrites, and an axon. The neuron receives input signals via the dendrites. When these input signals reach a certain threshold, an electrochemical process takes place in the nucleus, and the neuron transmits an output signal via the axon.

Fig. 1: Biological neuron. Source: Bruce Blaus, https://commons.wikimedia.org/wiki/File:Blausen_0657_MultipolarNeuron.png


Fig. 2: Unit in artificial neural net

Fig. 2 shows a model of a very simple artificial unit. It also receives inputs (labeled x1 and x2), and an activation function (the white blob in Fig. 2) transmits an output signal according to those inputs. The activation function can be a simple threshold function: the unit stays off until the sum of the input signals reaches a certain threshold, and transmits an “on” signal once that threshold is exceeded. However, the activation function can also be much more complicated. As described, this artificial unit does not perform any particularly interesting functions. Interesting functions can be achieved by weighting the inputs differently according to their importance. An artificial neural net “learns” by adjusting the weights (labeled w1 and w2 in Fig. 2), or the importance, of the input signals into each unit according to some automated algorithm. In Fig. 2, input x1 is twice as important as input x2, as illustrated by the relative thickness of the input arrows.
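
To make this concrete, here is a minimal sketch of the unit in Fig. 2 in Python. The weights and the threshold are arbitrary illustrative values, not taken from any trained network.

# A single artificial unit: two weighted inputs and a simple threshold activation.
def unit(x1, x2, w1=2.0, w2=1.0, threshold=1.0):
    """Return 1 ("on") if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = w1 * x1 + w2 * x2
    return 1 if weighted_sum >= threshold else 0

# Input x1 counts twice as much as input x2, as in Fig. 2.
print(unit(0.4, 0.1))  # weighted sum 0.9, below the threshold -> prints 0
print(unit(0.6, 0.1))  # weighted sum 1.3, above the threshold -> prints 1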

Layers and Networks

Similar to biological brains, these units are assembled into a neural network, as shown in Fig. 3. More precisely, the figure shows a so-called feed-forward neural net.

Fig. 3: Artificial neural network. Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg

ANNs generally consist of an input layer, one or more hidden layers, and an output layer. Each layer consists of one or more of the units described above. Neural networks with more than one hidden layer are called “deep” neural nets. Each unit is connected with one or more other units (indicated by the arrows in Fig. 3), and each connection is given more or less importance through an associated weight. In a feed-forward neural net, as shown in Fig. 3, a unit in a specific layer is only connected to units in the next layer, not to units within the same layer or in a previous layer, whereby the terms “next” and “previous” refer to the sequence in time. In Fig. 3, the arrow of time flows from left to right. There are also so-called recurrent and convolutional neural networks, where the connections are more complicated. However, the main idea is the same. The middle layer in Fig. 3 is hidden because it does not have direct connections to inputs or outputs, whereas the input and output layers communicate directly with the external world.

Training and Learning

The assembled neural network “learns” by adjusting the various weights, which are, for example, numbers between -1.0 and +1.0 (other values are of course possible). The weights are adjusted according to a specific training algorithm.

The training of a neural network typically proceeds as follows: A set of inputs is fed into the input layer of the neural network. That input is then fed forward through the network, in accordance with the weights (connections) and the activation functions. The final output at the output layer is then compared to the desired output according to a specific metric. Finally, the weights throughout the network are adjusted depending on the difference between the actual output and the desired output as measured by the chosen metric. Then, the entire process is repeated, usually many thousands or millions of times, until the output is satisfactory. There are many possible algorithms for adjusting the weights, but a description of these algorithms goes beyond the scope of this article.
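
The toy sketch below illustrates this loop for the simplest possible case: a “network” consisting of a single weight and no activation function, trained to reproduce the desired behavior y = 2 · x. It is only meant to show the repeated compare-and-adjust cycle, not a realistic training algorithm.

import random

weight = 0.0                                    # initial guess
learning_rate = 0.1
training_data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # toy input/output pairs

for step in range(1000):                        # repeat many times
    x, desired = random.choice(training_data)
    actual = weight * x                         # feed the input through the "network"
    error = actual - desired                    # compare actual and desired output
    weight -= learning_rate * error * x         # adjust the weight accordingly

print(round(weight, 3))                         # should end up close to 2.0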

An Example

As a concrete example, let’s look at a fairly simple neural network that recognizes handwritten digits. A sample of the inputs is shown in Fig. 4. I programmed this simple feed-forward neural net for Andrew Ng’s excellent introductory course on Machine Learning, which I highly recommend.

Fig. 4: Sample handwritten digits

The architecture of the neural network is exactly as shown in Fig. 3, with 400 input units, since the input picture files have a size of 20 x 20 grayscale pixels (= 400 pixels). There are 25 units in the hidden layer, and 10 output units, one for each digit from 0 to 9. This means that there are 10,000 connections (weights) between the input layer and the hidden layer (400 x 25) and 250 connections between the hidden layer and the output layer (25 x 10). In other words, we have 10,250 parameters in total! For the technically interested, the activation function here is a simple sigmoid.

Fig. 5: ANN for handwritten digit recognition
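
To illustrate the architecture, here is a sketch of a single forward pass through such a 400-25-10 network in Python with NumPy. The weights are randomly initialized stand-ins (a trained network would have learned them), and bias terms are omitted to match the 10,250 weights counted above.

import numpy as np

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(25, 400))  # input layer  -> hidden layer (10,000 weights)
W2 = rng.normal(scale=0.1, size=(10, 25))   # hidden layer -> output layer (250 weights)

x = rng.random(400)                         # stand-in for one 20 x 20 grayscale image
hidden = sigmoid(W1 @ x)                    # activations of the 25 hidden units
output = sigmoid(W2 @ hidden)               # one score per digit, 0 through 9

print(int(np.argmax(output)))               # index of the most strongly activated output unit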

The training proceeded exactly as described above. I fed in batches of several thousand 20×20 labeled grayscale images as shown in Fig. 4 and trained the net by an algorithm called backpropagation, which adjusted the weights according to how far the output was from the desired label from 0 to 9. The result was remarkable, especially considering that there were only a couple dozen lines of code.

But How Does It Work?

The fact that it works is remarkable and also somewhat unsettling, because all I did was program the activation function, specify how many units are in each layer and how the layers are connected, and specify the metric and the backpropagation; the neural net did all the rest. So, how does this really work?

Autopsy of a neural network

To be honest, even after doing some complicated probabilistic and statistical ensemble calculations, I still did not understand how these fairly simple layers of units with fairly straightforward connections could possibly manage to discern handwritten digits. So I went on to “dissect” the above neural net layer by layer, and pixel by pixel. Here is what actually happens to the input after the neural net has been successfully trained.

The first set of weights between the input layer and the hidden layer can be thought of as a set of filters, which essentially filter out important patterns or features. If one plots only this first set of weights, one can visualize a set of 25 “filters,” as shown in Fig. 6. These filters map the input onto the 25 hidden units in the hidden layer. Fig. 7 shows what happens if you map a specific input, in this case a handwritten “0,” onto the hidden layer.

Fig. 6: First set of weights, acting as a “filter.”

Fig. 7: Mapping of 0 to hidden units.
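
For readers who want to reproduce this kind of visualization, the sketch below shows the general idea: each of the 25 rows of the first weight matrix is reshaped into a 20 x 20 image and plotted. The weight matrix here is a random stand-in so that the snippet runs on its own; with the actual trained weights it would produce pictures like Fig. 6.

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the trained first weight matrix of shape (25, 400).
W1 = np.random.default_rng(0).normal(size=(25, 400))

fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(W1[i].reshape(20, 20), cmap="gray")  # one 20 x 20 "filter" per hidden unit
    ax.axis("off")
plt.show()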

The output of the hidden layer is then piped through another filter, as shown in Figure 8, and mapped onto the final output layer via this filter/set of weights. Fig. 8 shows how the input picture with the digit “0” is correctly mapped onto the output unit for the digit “0” (at the bottom, because the program displays things vertically from 1 at the top to 9 and then to 0 at the bottom).

Fig. 8: Mapping of input, here a “0”, to output via hidden layer

More examples of this filtering or mapping via the internal sets of weights are visualized in my slide set for ATA58 and also in Fig. 9.

Fig. 9: Mapping of input “2” to output

Again, the internal weights act as a sort of filter to pick out the features of interest. Naively, I would have expected that these features or patterns of interest correspond to vertical and horizontal lines, for example for the digits 1, 4, or 7, or to various arcs and circles, for digits like 3 or 8 or 0. However, this is evidently not at all how the network picks out digits, as can be seen from the visualization of the first set of weights in Fig. 6. The patterns and structures that the neural net filters out are ostensibly much more complex than simple lines or arcs. This is also the reason for the “detour” via the hidden layer. A direct mapping from input to output, even with an internal convolution, would not be sufficient to pick out all the information that is necessary to distinguish one character from another. Similarly, for more complex tasks, more than one hidden layer will be needed. The number of hidden layers and units as well as their connections/weights grows with the complexity of the task.

Summary

This blog post aimed to explain the inner workings of a simple neural net by visualizing the internal process. Neural networks for other applications, including for machine translation, work pretty much the same way. Of course, most of these will have more than one hidden layer, possibly pre- and post-processing of input and output data, more sophisticated activation functions, and a more complicated architecture, like recurrent neural nets or convolutional neural nets. However, the basic idea remains the same: The underlying function of an artificial neural net is simply pattern recognition. Not more, not less. While well-trained ANNs are extraordinary and unquestionably better than humans at the pattern recognition tasks they are trained for, because they don’t get tired or have lapses of concentration, one should never forget that they are remarkably ill-suited for anything that goes beyond the tasks they are trained for. In such cases they can sometimes detect patterns that are not there, and sometimes the task simply cannot be cast into a pattern, however complicated. In other words, while ANNs certainly exceed their programming, they can never exceed their training. (At least until the so-called technological singularity is upon us.)