Translation Memory Maintenance

Volume 16, No. 1
January 2012

Translation Memory Maintenance:

Playing Hide and Seek

by Rafael Guzmán

Introduction

Identifying and fixing terminology errors is one of the most important requirements for keeping translation memories (TMs) in good shape. These errors are normally mistranslations, translation inconsistencies, or simply translations that have become obsolete. The problem is that playing hide and seek with terminology errors is often more challenging than one might expect.

This article describes some of the difficulties involved in accurately detecting terminology errors in TMs. It also discusses the linguistic side-effects of making global terminology changes without the necessary precautions, and what improvements can be made.

The future of translation memories

In recent months, there has been extensive discussion about the future of translation memories. A debate on whether or not their quality levels are sustainable has also been going on for a while. This raises the question of whether there is any point in discussing TM maintenance at all.

In my opinion, the title of an interesting article by Peter Reynolds captures the reality well: 'Reports of my death are greatly exaggerated--Translation Memory' (2011).

In any case, while TMs are still being used, and leaving aside how their functionality will be delivered in the future (stand-alone tools or Software as a Service in the Cloud), what seems to be clear is that TM technology 'has not reached the end of its line of development' (Zetzsche 2011). With this in mind, the next two sections in this article will focus on two particular aspects that need further development in order to enhance TM maintenance.

Detecting terminology errors in a TM

While prevention is better than cure, it must not be forgotten that terminology evolves over time, making today's terminology obsolete tomorrow. Terminology discrepancies may also emerge after company mergers and acquisitions, and they too need to be proactively detected and resolved.

There are several possible approaches to checking for terminology errors. For instance, Table 1 shows the term scanned (past simple), an inflected form of the verb scan (infinitive). Terminology checkers that rely only on exact-match comparisons against the canonical form in which terms are stored in a database (e.g. verbs in the infinitive form; nouns in the singular form) are unlikely to detect that analizado, analizadas and analizó are just Spanish inflected forms of the verb analizar (scan), and that they do not necessarily need to be flagged as terminology errors.

Table 1: Example of potential terminology false positives in Spanish due to inflection

Term to check	English term	Spanish translations
scan = analizar	scanned	analizado, analizadas, analizó

Other terminology checkers support fuzzy matching as an alternative to stemming (e.g. appending a wildcard character such as * to the base form of the term). This solution helps to capture many inflections, but not all. For instance, go* will match go, going, and gone, but not went. Another disadvantage of this approach is that it also opens the door to false positives: go* will match goal, goat, and good.
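The wildcard behaviour described above can be reproduced with a short sketch. It assumes Python's standard fnmatch module as a stand-in for a terminology checker's wildcard matching; the word list is illustrative only.

```python
from fnmatch import fnmatchcase

# Illustrative word list: real inflections of "go" plus unrelated words.
words = ["go", "going", "gone", "went", "goal", "goat", "good"]
inflections_of_go = {"go", "going", "gone", "went"}

# Words matched by the wildcard pattern go*
matched = [w for w in words if fnmatchcase(w, "go*")]

# Inflections the wildcard misses, and unrelated words it wrongly matches.
missed = [w for w in words if w in inflections_of_go and w not in matched]
false_positives = [w for w in matched if w not in inflections_of_go]

print("matched:", matched)
print("missed:", missed)
print("false positives:", false_positives)
```

The irregular form went ends up in the missed list, while goal, goat and good are reported as false positives.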

Ideally, a reliable parser that fully supports stemming would be the best approach, in that it caters for all possible inflections of all the words that make up a term. But this alone may not be enough to prevent terminology checkers from reporting potentially large numbers of false positives.

For example, depending on the context, terminology checkers would need to be able to identify whether scan is functioning as a noun or a verb in a sentence and, based on that, determine which Spanish translation to expect. This could be achieved by providing terminology checkers with the part of speech (POS) value of each term stored in their database. When this is not done, 'misleading results are likely to occur' (Guzmán 2009), increasing the number of false positives.
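One way to picture a POS-aware term base is a lookup keyed by both the lemma and its part of speech. The following is only a sketch: the TERM_BASE structure, the POS labels and the function name are hypothetical, and the noun translation análisis is an assumption added for illustration (only scan = analizar comes from the tables in this article).

```python
# Hypothetical term base keyed by (English lemma, part of speech).
TERM_BASE = {
    ("scan", "VERB"): "analizar",   # from Table 1
    ("scan", "NOUN"): "análisis",   # assumed noun translation, for illustration
}

def expected_translation(lemma, pos):
    """Return the approved Spanish term for a lemma with the given POS, or None."""
    return TERM_BASE.get((lemma.lower(), pos.upper()))

print(expected_translation("scan", "verb"))
print(expected_translation("scan", "noun"))
```

A checker built this way would still need a tagger to supply the POS of each word in the source segment; that part is outside the scope of the sketch.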

In addition, more advanced grammatical intelligence would need to be added to terminology checkers to cater for scenarios like the one shown in Table 2, in which the expected Spanish translation for security level is split by the insertion of an adjective (básico) in the middle of the term: nivel básico de la seguridad, instead of nivel de la seguridad básico. This word order is correct in Spanish because an adjective (basic) precedes the English term, yet terminology checkers are likely to flag it as an error.

Table 2: Example of potential terminology false positives in Spanish caused by word order change

Term to check	English segment	Spanish segment
security level = nivel de la seguridad	basic security level	nivel básico de la seguridad
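As a rough illustration of the Table 2 scenario, a checker could tolerate a word inserted after the head noun of a multi-word term. The sketch below is a crude heuristic and not a substitute for real grammatical analysis: the function name and the max_inserted parameter are hypothetical, and the pattern accepts any inserted word, not only adjectives.

```python
import re

def term_pattern(spanish_term, max_inserted=1):
    """Build a regex that tolerates up to max_inserted extra words after the first word."""
    head, *rest = spanish_term.split()
    gap = r"(?:\s+\w+){0,%d}" % max_inserted
    tail = "".join(r"\s+" + re.escape(word) for word in rest)
    return re.compile(r"\b" + re.escape(head) + gap + tail + r"\b", re.IGNORECASE)

pattern = term_pattern("nivel de la seguridad")

print(bool(pattern.search("nivel básico de la seguridad")))  # term split by an adjective
print(bool(pattern.search("nivel de la seguridad")))         # canonical form
print(bool(pattern.search("norma de la seguridad")))         # different head noun
```

The first two segments are accepted and the third is not, so the adjective in Table 2 would no longer trigger a false positive under this assumption.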

All of this makes it worth considering the possibility of connecting the grammatical intelligence behind rule-based machine translation (RBMT) to TM technologies. The next section describes another area in which this possibility would also be worth exploring.

Making global terminology changes in a TM

When it comes to fixing global terminology errors by making global changes in a TM, it is important to consider the potential collateral damage that can be caused to hundreds, or even thousands, of sentences in a TM if some precautions are not taken. The following are the two main scenarios.

Scenario 1: the old and the new terms share the same gender

Suppose, for instance, that you discover that the term rule (regla) has been translated inconsistently in Spanish as both norma and regla in your TM. Since regla is the more accurate translation, you might decide to make a global change to replace norma with regla.

Because both terms share the same (feminine) gender, it is relatively safe to implement a global change, provided that you enable the Case-sensitive and Match whole word only options beforehand in whatever tool you are using to search for and replace the offending terms. You will then need to make further global changes to capture the same term in all its variants, e.g. lower-case singular, lower-case plural, upper-case singular, and so on.

Table 3 shows one of the main side-effects that are likely to occur if these precautions are not taken. As can be seen, the word normalmente has been damaged after globally replacing all instances of norma with regla. The search string norma has matched the first five letters of every instance of normalmente (normally) and replaced them with regla, resulting in reglalmente, which is not a Spanish word.

Table 3: Example of a possible side-effect after globally changing norma with regla

English	Spanish (before global change)	Spanish (after global change)
normally	normalmente	reglalmente
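The damage shown in Table 3, and the effect of the Match whole word only option, can be reproduced in a few lines. This is a minimal sketch using Python's re module; the sample sentence is made up for illustration, and \b stands in for a tool's whole-word option.

```python
import re

# Made-up sample segment containing both the term and the adverb.
segment = "Esta norma se aplica normalmente."

# Naive substring replacement: also rewrites the "norma" inside "normalmente".
naive = segment.replace("norma", "regla")

# Whole-word, case-sensitive replacement: \b anchors the match at word boundaries.
safe = re.sub(r"\bnorma\b", "regla", segment)

print(naive)  # Esta regla se aplica reglalmente.
print(safe)   # Esta regla se aplica normalmente.
```

Because the whole-word pattern no longer matches the plural normas or the capitalized Norma, each of those variants needs its own pass.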

Finally, before hitting the Replace All button, you need to ask yourself whether you might still be in danger of matching false positives, such as all the instances of norms = normas.

To be on the safe side, you can use a regular expression (regexp) to match only the relevant translation units in which the real errors occur. If your TM tool does not support regexps, you can export your TM to a text editor and apply your regexp there. For example, let us consider a TM containing translation units like this:

<tuv xml:lang="EN-US">
<seg>The following rules are available:</seg>
</tuv>
<tuv xml:lang="ES-MX">
<seg>Las siguientes normas están disponibles:</seg>
</tuv>

Now you need a regexp that matches only the translation units containing the word rules (not norms!) in the English segment and normas in the Spanish segment, and that replaces normas with reglas:

SEARCH: (<seg>.*?rules.*?(?:</seg>|</tuv>)\r\n.*?</tuv>\r\n.*?lang="(?:[A-Za-z][A-Za-z])-(?:[A-Z][A-Z])">\r\n.*?<seg>.*?)normas(.*?(?:</seg>|</tuv>))

REPLACE: $1reglas$2

This regexp is Perl-compatible and has been tested in the UltraEdit text editor. It can be reused for other terms and, with slight adaptation, for verb inflections.
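For readers who prefer a script to a text editor, the same search and replace can be sketched in Python. This is an adaptation under stated assumptions, not the tested UltraEdit expression: line endings are matched with \r?\n so that both Windows and Unix files work, the replacement uses Python's \g<1> syntax instead of $1, and the second translation unit in the sample (norms = normas) is added here only to show that it is left alone.

```python
import re

# Adapted from the SEARCH expression above; "." does not cross line breaks,
# so "rules" must appear on the same line as the English <seg>.
PATTERN = re.compile(
    r'(<seg>.*?rules.*?(?:</seg>|</tuv>)\r?\n'
    r'.*?</tuv>\r?\n'
    r'.*?lang="(?:[A-Za-z][A-Za-z])-(?:[A-Z][A-Z])">\r?\n'
    r'.*?<seg>.*?)normas(.*?(?:</seg>|</tuv>))'
)

tmx = (
    '<tuv xml:lang="EN-US">\n'
    '<seg>The following rules are available:</seg>\n'
    '</tuv>\n'
    '<tuv xml:lang="ES-MX">\n'
    '<seg>Las siguientes normas están disponibles:</seg>\n'
    '</tuv>\n'
    '<tuv xml:lang="EN-US">\n'
    '<seg>The following norms are available:</seg>\n'
    '</tuv>\n'
    '<tuv xml:lang="ES-MX">\n'
    '<seg>Las siguientes normas están disponibles:</seg>\n'
    '</tuv>\n'
)

# Replace normas with reglas only where the English segment contains "rules".
fixed = PATTERN.sub(r"\g<1>reglas\g<2>", tmx)
print(fixed)
```

Only the first translation unit is changed. Note that each pass replaces one occurrence of normas per Spanish segment, so a segment containing the term twice would need the substitution to be run again.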

Scenario 2: the old and the new terms have a different gender

In this scenario, unpleasant surprises are almost 100% guaranteed if global changes are made in a TM, even if the Case-sensitive and Match whole word only options are enabled.

For instance, suppose that you discover that the term job (trabajo) has been translated inconsistently in Spanish as both trabajo and tarea in your TM. Since trabajo is the more accurate translation, you may decide to make a global change to replace tarea with trabajo. Unlike scenario 1, these two terms have a different gender: trabajo is masculine and tarea is feminine. The danger of making a global change in this case is not just matching false positives, but the collateral damage that can be caused in the words surrounding the term that will be replaced in every sentence.

To illustrate this point, the second column in Table 4 shows a sample sentence in which jobs has been incorrectly translated as tareas, instead of trabajos, in Spanish. The third column shows the side-effects of replacing tareas with trabajos.

Table 4: Possible side-effects after globally changing tareas with trabajos

English	Spanish (before global change)	Spanish (after global change)
View scheduled, active, and completed jobs.	Ver las tareas planificadas, activas y completadas.	Ver las trabajos planificadas, activas y completadas.

As can be seen, the article and adjectives surrounding trabajos are still feminine (las, planificadas, activas, completadas), resulting in an ungrammatical sentence; the correct form would be Ver los trabajos planificados, activos y completados. Most importantly, all these grammatical side-effects will now need to be manually identified and fixed, perhaps in hundreds of segments scattered throughout the TM, which can be extremely difficult.
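When a gender change is involved, one modest precaution is to record every segment that the global change touches, so that a native speaker can review agreement afterwards. The sketch below assumes the segments are available as a plain list of strings; the function name and the second sample sentence are hypothetical.

```python
import re

def replace_and_flag(segments, old, new):
    """Whole-word replace old with new; return (updated segments, indexes to review)."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    updated, to_review = [], []
    for index, segment in enumerate(segments):
        new_segment, count = pattern.subn(new, segment)
        updated.append(new_segment)
        if count:
            # Articles and adjectives may no longer agree with the new term.
            to_review.append(index)
    return updated, to_review

segments = [
    "Ver las tareas planificadas, activas y completadas.",
    "Seleccione un archivo.",
]
updated, to_review = replace_and_flag(segments, "tareas", "trabajos")

print(updated[0])  # Ver las trabajos planificadas, activas y completadas.
print(to_review)   # [0]
```

This does not repair the agreement errors; it only narrows the manual review down to the segments that were actually changed.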

Conclusion

Identifying and fixing terminology errors in translation memories is seldom an easy task. In particular, the potential linguistic side-effects caused by global terminology changes can be labor-intensive and expensive to fix. Needless to say, asking a QA engineer who is not a native speaker to implement global terminology changes should always be discouraged.

Since rule-based machine translation (RBMT) is built on semantic, morphological and syntactic rules, why not make use of technologies such as Systran to assist terminology checkers and translation memory tools through an application programming interface (API)? Most, if not all, of the issues described in this article could then be addressed automatically. Hopefully, it will not be necessary to wait for the "magic five years" to achieve this, because RBMT is already here and has been delivering reasonably good results for a good while now. In the meantime, some basic precautions and the use of regular expressions can alleviate the manual effort involved in maintaining translation memories.

References

Guzmán, R 2009, 'Uncontrolled Terminology and MT: The Importance of Making Good Comparisons', Translation Journal, vol. 13, no. 2, viewed 9 May 2011, http://translationjournal.net/journal//48mt.htm.

Reynolds, P 2011, Reports of my death are greatly exaggerated--Translation Memory, viewed 13 April 2011, http://kilgray.blogspot.com/2011/03/reports-of-my-death-are-greatly.html.

Zetzsche, J 2011, 'The New Five-Year-Rule', Translation Journal, vol. 15, no. 2, viewed 20 April 2011, http://translationjournal.net/journal/56mt.htm.
