Word documents sometimes contain a lot of extra (hidden) tags that can really hinder the use of a CAT-tool and stop your translation workflow in its tracks. PDF documents that have been converted to Word are particularly prone to this and can have tags between every single word or even letter! The reason for this overabundance of tags is that Word seems to apply certain formatting settings to individual words or letters instead of to whole paragraphs. To get rid of these tags without destroying the formatting of your source document, follow the procedure below. You’ll most likely have to do some reformatting of the translation no matter what, because I have yet to see a source and target language pair whose words are exactly the same length.
The procedure applies to Word 2010, 2007, and 2003. If you have a different version, or a different word processor, there should be an equivalent procedure with equivalent keyboard shortcuts.
- Open the source document in Word. It is advisable to save a copy of the original in the unlikely case something goes wrong.
- Use CTRL+A to select the entire text.
- Use CTRL+D to open the Font (character formatting) dialog box and go to the “Character Spacing” tab (named “Advanced” in Word 2010, or the equivalent in your Word version) as shown below.
(Image: Font dialog box in Word)
- Set Scale to 100%, Spacing and Position to Normal, and disable Kerning, then click OK.
- Save the file and try opening it in your favorite CAT-tool. The superfluous tags should have disappeared. If not, proceed to the next step.
- If you are in Word 2010 or Word 2007, save the file as a Word 97-2003 document. Converting from .docx to .doc format usually removes all the additional tags. You may have to do some minor reformatting when you save the final translated document in the original .docx format.
- If you are in Word 2003 or the file is already in Word 97-2003 format, try saving it in a newer format, and then save it again in Word 97-2003 format.
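If you have a whole batch of .docx files, the Character Spacing reset from the steps above can also be approximated with a small script. The following is only a rough sketch, not a replacement for the manual procedure. It assumes that the scale, spacing, position and kerning settings are stored as the `w:w`, `w:spacing`, `w:position` and `w:kern` elements inside the `w:rPr` (run properties) blocks of `word/document.xml`, and it simply deletes them so the text falls back to the style defaults. The function names `strip_character_spacing` and `clean_docx` are made up for this example, and headers, footers and footnotes are not touched. Always work on a copy.

```python
import re
import zipfile

# Run-properties blocks; paragraph-level <w:spacing> lives in <w:pPr>
# and is deliberately left alone.
_RPR_BLOCK = re.compile(r"<w:rPr>.*?</w:rPr>", re.DOTALL)

# Assumed counterparts of the Character Spacing tab:
# w:w (scale), w:spacing, w:position, w:kern.
_SPACING_PROP = re.compile(r"<w:(?:w|spacing|position|kern)\b[^>]*/>")


def strip_character_spacing(document_xml: str) -> str:
    """Delete scale, spacing, position and kerning from each <w:rPr> block."""
    def clean(match):
        return _SPACING_PROP.sub("", match.group(0))
    return _RPR_BLOCK.sub(clean, document_xml)


def clean_docx(src_path: str, dst_path: str) -> None:
    """Copy a .docx archive, rewriting only word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                data = strip_character_spacing(text).encode("utf-8")
            dst.writestr(item, data)
```

A call would look like `clean_docx("source.docx", "source_clean.docx")`, with both file names being placeholders. The sketch only removes the properties and does not merge neighbouring runs, so you may still need to open and re-save the cleaned file in Word before the CAT-tool shows fewer tags.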