In the new world of technology, the translation profession, like other disciplines, cannot do without modern tools such as electronic corpora. Recently, large monolingual, comparable and parallel corpora have played a crucial role in solving various problems of linguistics, including translation. In this study we attempt to show the effectiveness of a specialized monolingual corpus in translating various collocations usually found in political texts from English into Persian. The experiment compares the accuracy of collocation translation using a specialized monolingual corpus with that achieved using conventional resources (e.g. monolingual and bilingual dictionaries). The results show how the quality of translation can be improved using corpus-based translation tools.
In recent years computers have increasingly found their way into different branches of science, including the humanities. Language studies are no exception in this respect. In this new world of technology, the translation profession, like other disciplines, cannot do without modern tools such as electronic corpora. Constructing and exploiting different types of corpora are among the computer applications available to researchers in various language fields. Recently, large monolingual, comparable and parallel corpora have played a crucial role in solving various problems of linguistics such as language learning and teaching (Aston, 2000; Leech, 1997; Nesselhauf, 2004), translation studies (Mosavi Miangah, 2006), information retrieval (Braschler & Schauble, 2000), statistical machine translation (Brown et al., 1990) and the like. In this study, we attempt to show the effectiveness of a specialized monolingual corpus in translating various collocations usually found in political texts from English into Persian. The experiment compares the accuracy of collocation translation using a specialized monolingual corpus with that achieved using conventional resources (e.g. monolingual and bilingual dictionaries). The results show how the quality of translation can be improved using corpus-based translation tools.
2. What Is a Corpus?
Generally, a corpus can be defined as a collection of naturally occurring examples of language. A corpus contains no new information about language, but it offers new perspectives for linguistic research and supports processes such as language learning, language teaching and translation.
Depending on the purpose and the form, different types of corpora may be distinguished.
2.1. Specialized corpus
A specialized corpus is a corpus which includes a particular type of text. This specialization has no definite boundaries, but some criteria that specify the type of text in question should be considered. Such corpora may contain texts specialized in terms of a particular timeframe (texts from 1822 to 1876), a particular subject (art, politics, medicine) or some other factor. Some famous LSP (Language for Special Purposes) corpora are the 5-million-word Cambridge and Nottingham Corpus of Discourse in English (CANCODE) and the Michigan Corpus of Academic Spoken English (MICASE).
2.2. General corpus
This is a type of corpus which includes various types of texts, either written or spoken, on a variety of subjects. It is sometimes called a "reference corpus" because it serves as reference material for language learning, translation, etc. Some of the best-known general corpora are the 100-million-word British National Corpus (BNC) and the 400-million-word Bank of English.
2.3. Comparable corpus
A comparable corpus consists of texts of the same type and content in different languages or language varieties (e.g. legal contracts in English and French, or articles about linguistics from English and Persian journals). The ICE corpus (International Corpus of English) is a one-million-word comparable corpus of different varieties of English.
2.4. Parallel corpus
Parallel corpora are those consisting of texts with their translations into two or more languages, e.g. a medical article translated into Spanish, Finnish, and French. They can be of great help to translators and learners in searching for equivalent expressions in each language and in investigating the differences between languages.
2.5. Learner corpus
A learner corpus is a collection of texts (essays, for example) produced by learners of a language (Hunston, S. 2006). Such a corpus is prepared to help find the differences between texts produced by learners and texts produced by native speakers. The International Corpus of Learner English (ICLE) with 20,000 words and the Louvain Corpus of Native English Essays (LOCNESS) are examples of well-known corpora of this kind.
2.6. Pedagogic corpus
A pedagogic corpus is a corpus consisting of all the texts to which a learner has been exposed (Hunston, S. 2006). A pedagogic corpus collected by a teacher or researcher may consist of all the course books, readers, etc. used by a learner and the tapes they have listened to. It thus includes all instances of a word or phrase that learners have encountered in different contexts, and can be used to improve their knowledge of the language.
2.7. Historical and diachronic corpus
This is a corpus which includes texts belonging to various periods of time, to show the development of language over a specified timeframe. The most famous English historical corpus is the Helsinki Corpus, with 1.5 million words.
2.8. Monitor corpus
This is a corpus which consists of texts of the same type to trace the changes in the language by adding to it annually, monthly, even daily. So the texts of one year (month or day) can be compared to those of another, similar, period.
Different types of corpora may be annotated differently in accordance with the needs of researchers. Types of information that can be encoded in a corpus and that are useful in translation tasks include parts of speech (POS), syntactic structure (parsing), word senses, and anaphoric relations.
3. Related Work
In recent years, the importance of corpora in the field of translation has become noticeable to trainers and researchers. Therefore, some researchers believe that the analysis of corpora should be integrated into translator education. There have been a number of studies on monolingual corpora (general and specialized) and various kinds of exploitation of such corpora like extraction of collocations.
The website "Gateway to corpus linguistics on the Internet" at http://www.corpus-linguistics.de/ is a useful reference for obtaining information about many of the best-known corpora and their features, such as their size, content, and accessibility, as well as when and by whom they were compiled.
Most of the latest research in translation knowledge acquisition is based on parallel corpora (Brown et al., 1993). However, since large aligned bilingual corpora are hard to obtain, some researchers have tried to extract translation knowledge from non-parallel corpora such as comparable corpora or monolingual corpora. One of the best-known large-scale monolingual corpora is the British National Corpus (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources. However, despite its large size, the BNC has serious limitations as a translation aid for contemporary specialized texts (Wilkinson, M. 2006).
In a pilot experiment, Bowker (1998) found that learners using a specialized corpus of texts in the target language (their L1) made more correct term choices and produced more idiomatic translations than a matched group using bilingual dictionaries alone. In her study, Bowker determined that a specialized monolingual native-language corpus helps translators improve on two of the most important criteria for producing high-quality translation: subject-field understanding and specialized native-language competence (Bowker, L. 1998).
Bowker & Pearson (2002) provide a good experiment on exploiting such monolingual corpora in translating texts on mechanical engineering. They investigate the term "nut" and its various collocations in the 100-million-word BNC, where they found 670 occurrences of the term. However, they found most of the concordance lines unhelpful, since most of the contexts show "nut" being used in other meanings, such as a food or an eccentric person. Although some of the occurrences describe the type of nuts used in engineering, it takes time to identify them; there is excessive "noise" because "nut" is a homonym (it has various meanings), and so separating the wheat from the chaff is a time-consuming process.
Bowker & Pearson go on to report that a search for the term "nut" in a 10,000-word corpus containing catalogues, product descriptions and assembly instructions from companies in the manufacturing industry generated 49 occurrences. Although this was far fewer than in the BNC search, the findings were far more relevant, since the noise was considerably reduced, and it was easy to spot the many different types of "nut" used in manufacturing (e.g. collar nut, compression nut, flare nut, knurled nut, winged nut), as well as the verbs that collocate with nut (e.g. thread, screw, tighten, loosen).
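The kind of concordance search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool Bowker & Pearson used: the function names `concordance` and `left_collocates` are hypothetical, the sample sentences are invented, and tokenization is a naive lowercase split on ASCII letters.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


def left_collocates(text, keyword):
    """Count the word immediately preceding each hit of keyword."""
    tokens = tokenize(text)
    return Counter(
        tokens[i - 1] for i, tok in enumerate(tokens) if tok == keyword and i > 0
    )


# Invented sample sentences standing in for a specialized corpus.
SPECIALIZED = (
    "Tighten the collar nut by hand. Loosen the flare nut before "
    "removing the pipe. Screw the collar nut onto the shaft."
)

lines = concordance(SPECIALIZED, "nut")
collocates = left_collocates(SPECIALIZED, "nut")
for left, key, right in lines:
    print(f"{left:>25} [{key}] {right}")
print(collocates.most_common())
```

In a specialized corpus nearly every concordance line is relevant, so the left-collocate counts point directly at the nut types in use; run over a general corpus, the same counts would mix in food and slang senses.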
Thus, the role of specialized corpora in translating different types of texts becomes more prominent. Such specialized corpora, which are restricted to the language of a particular specialized field and focus on Language for Special Purposes, are sometimes referred to as LSP corpora (Wilkinson, M. 2006).
4. Compiling and Exploiting Specialized Monolingual Corpora
Nowadays, specialized corpora play a crucial role in translation. However, because ready-made LSP corpora are often unavailable, translators may need to construct their own. To this end, we compiled a specialized monolingual corpus of Persian political texts consisting of over 5 million words (about 150 MB). The texts were mainly extracted from political articles, journals, interviews, etc. found on the Internet and were preprocessed before being entered into the corpus: all tables, pictures, figures and diagrams were deleted, and the texts were converted to an XML format suitable for use on Internet sites. At this stage the texts can be entered into the corpus and used by translators rendering political texts from English into Persian. At present, the Persian monolingual corpus is freely available from the following URL:
Since concepts and terms within a particular field evolve constantly, the corpus needs to remain open so that texts can be added or removed when required. As noted in the definition of a corpus, corpora by themselves are nothing more than collections of examples of language; used alongside other tools, however, they become invaluable and find their place in the translation task.
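The preprocessing step described in this section (cleaning each text and storing it as XML) might look roughly like the following Python sketch. The element names `text` and `p`, the attributes `id` and `source`, and the function `to_corpus_xml` are assumptions made for illustration; the actual schema of the corpus is not given here. The sketch also assumes that tables, pictures and figures have already been removed and that each document arrives as a list of plain-text paragraphs.

```python
import xml.etree.ElementTree as ET


def to_corpus_xml(doc_id, source, paragraphs):
    """Wrap cleaned paragraphs in a small XML document (hypothetical schema)."""
    root = ET.Element("text", attrib={"id": doc_id, "source": source})
    for para in paragraphs:
        cleaned = " ".join(para.split())  # collapse runs of whitespace
        if cleaned:  # skip paragraphs that are empty after cleaning
            ET.SubElement(root, "p").text = cleaned
    return ET.tostring(root, encoding="unicode")


# Invented example input; the identifier and source are placeholders.
xml_doc = to_corpus_xml(
    "pol-0001",
    "example.org",
    ["  First   paragraph. ", "", "Second paragraph."],
)
print(xml_doc)
```

Keeping one XML document per source text makes it straightforward to add or remove individual texts later, which suits a corpus that is meant to stay open.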
4.1. Two main applications of corpora in translation
Two applications of specialized corpora are introduced here to describe their role in producing a high-quality translation.
4.1.1.
Referring to a monolingual corpus in the field of politics (containing about 5 million words), we searched for different collocations that translators frequently encounter. We also used a bilingual dictionary (Aryanpur, English to Persian) in order to compare a bilingual English-Persian dictionary with a monolingual Persian corpus. Consider the noun phrase "pre-emptive war." We first referred to a conventional resource, the bilingual dictionary, in which, as expected, the collocation does not appear as an individual entry. However, suggested equivalents for the two components of the collocation can be found. For the word "pre-emptive" we found three suggested equivalents: پيش گيرانه, پيش دستانه, and بازدارنده. For the word "war" only جنگ is suggested. We then turned to our corpus and found no occurrences of جنگ پيش گيرانه, no occurrences of جنگ پيش دستانه, and 14 occurrences of جنگ بازدارنده. We therefore selected the third equivalent as the most probable translation of this collocation because of its higher frequency in the corpus. In this way, the corpus can help us obtain the most probable translation of other collocates, too.
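The selection procedure just described (pair each dictionary equivalent of one component with each equivalent of the other, count every pairing in the corpus, and prefer the most frequent) can be sketched as follows. This is a minimal illustration, not the search interface of the actual corpus: `rank_collocations` is a hypothetical helper, `TOY_CORPUS` is an invented stand-in string rather than real corpus data, and counting uses plain substring matching, which may over-count when a phrase occurs inside a longer word.

```python
from itertools import product


def rank_collocations(corpus_text, head_candidates, modifier_candidates):
    """Count every head + modifier pairing in the corpus, most frequent first."""
    counts = {}
    for head, modifier in product(head_candidates, modifier_candidates):
        phrase = f"{head} {modifier}"  # Persian order: noun first, then adjective
        counts[phrase] = corpus_text.count(phrase)  # plain substring count
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Invented stand-in for the corpus: three hits of one pairing, none of the others.
TOY_CORPUS = " . ".join(["جنگ بازدارنده"] * 3 + ["جنگ"])

ranked = rank_collocations(
    TOY_CORPUS,
    head_candidates=["جنگ"],
    modifier_candidates=["پيش گيرانه", "پيش دستانه", "بازدارنده"],
)
best_phrase, best_count = ranked[0]
print(best_phrase, best_count)
```

A pairing with zero hits is not necessarily wrong, only unattested in this corpus, so the ranking is best read as evidence of relative naturalness in the field rather than as a strict filter.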
In a parallel move, we considered the more common collocation "increasing relations," and found "توسعه" and "گسترش" suggested by the dictionary as two equivalents for "increasing." A translator may see no difference between using "توسعه" and using "گسترش." Yet the corpus contains 199 occurrences of "گسترش روابط" and only 79 of "توسعه روابط." Thus, even when we believe our decision is right, the corpus can change the picture and reveal actual usage. Some other examples are given in the following table:
Table 1. Corpus decision on certain collocations' equivalents suggested by the dictionary [table body not recoverable; surviving fragments: "رويارويي نظام", "Slow Pace of Negotiations", "Suspension of Uranium"]
4.1.2. Verifying or rejecting decisions based on other tools
When traditional translation tools (such as dictionaries) suggest more than one equivalent, and sometimes improper ones, corpora offer an effective solution. When you are in doubt about which of the equivalents suggested by a dictionary to choose, a corpus is a great tool for verifying or rejecting the suggested translation(s). The equivalents for "trade-off" suggested by the dictionary are as follows: "مبادله," "تهاتر," "پاياپاي بستاني." Their occurrences are illustrated in the following table.
Table 2. Corpus decision on "trade-off" equivalents [table body not recoverable; surviving column header: "Occurrences in Corpus"]
One can also use this strategy in translation criticism, to evaluate the naturalness of a translation. For the word "confidence" in the phrase "confidence-building" the Aryanpur dictionary suggests two equivalents, "اعتماد" and "اطمينان." Because these two words are so similar and both are highly frequent in Persian, it is hard even for a native speaker to choose between the translation built on "اعتماد" and the one built on "اطمينان." But when the two translations of "confidence-building" are searched in our corpus of political texts, it is surprising to find no occurrences of the "اطمينان" variant and 18 occurrences of the "اعتماد" variant.
According to Larson, to translate effectively one must discover the meaning of the source language and use receptor language forms which express this meaning in a natural way (Larson, M. 1984). So, in addition to other conventional translation tools, a translator should use corpora to become more certain that his/her choice is a proper and natural one.
According to the above explanations, corpora can be of great help in finding suitable collocates and in verifying or rejecting the translations suggested by dictionaries. Varantola reports the general comment made by her students about corpus evidence: "This evidence helps translators to be less bound to the source material and feel much more confident when deviating from the way things are expressed in the source material if they feel that the changes are justified." (Varantola, 2003)
Large monolingual as well as bilingual electronic corpora have only recently become available to translators, and this is a good opportunity for them to obtain more precise, natural, and up-to-date information about the senses of words and collocations than before. Open parallel corpora can play a major role in resolving different translation problems. Unfortunately, this invaluable tool has not been widely used by translators in Iran. This may be because they were not exposed to the potential of corpus analysis tools during their college education. The unavailability of ready-made specialized-field corpora may be another reason. We therefore decided to describe the effective applications of a specialized monolingual corpus of Persian in the sensitive task of translating political texts.
We hope to expand this study to cover experiments dealing with other subject fields such as medicine, sports, business, religion, literature, and the like. We also suggest that such experiments be performed with other language pairs to see whether more definitive conclusions can be drawn about the effect of monolingual corpora on the translator's work.
References
Aryanpur, A. and Aryanpur, M. (1991). English-Persian Collegiate Dictionary (Ninth Edition). Amir-Kabir Publication Organization.
Aston, G. (2000). I corpora come risorse per la traduzione e l'apprendimento. In S. Bernardini and F. Zanettin (Eds.), I corpora nella didattica della traduzione (21-29). Bologna: CLUEB.
Bowker, L. (1998). Using specialized monolingual native-language corpora as a translation resource: A pilot study. Meta, 43(4), 631-651.
Bowker, L. and Pearson, J. (2002). Working with Specialized Language: A practical guide to using corpora. London: Routledge.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of machine translation: Parameter estimation. Computational Linguistics.
Braschler, M. and Schauble, P. (2000). Using corpus-based approaches in a system for multilingual information retrieval. Information Retrieval, 3.
Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R. and Roossin, P. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79-85.
Larson, M. L. (1998). Meaning-based translation: A guide to cross-language equivalence. Lanham, MD: University Press of America and Summer Institute of Linguistics.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (Eds.), Teaching and language corpora (1-23). New York: Addison Wesley Longman.
Mosavi Miangah, T. (2006). Applications of corpora in translation. 12, 43-56.
Nesselhauf, N. (2004). Learner corpora and their potential for language teaching. In J. McH. Sinclair (Ed.), How to use corpora in language teaching (125-152). Amsterdam: Benjamins.
Varantola, K. (2003). Translators and disposable corpora. In F. Zanettin, S. Bernardini and D. Stewart (Eds.), Corpora in Translator Education (55-70). Manchester: St Jerome.
Wilkinson, M. (2006). Compiling corpora for use as translation resources. Translation Journal, 10(1).