Volume 14, No. 3 
July 2010


Michael Wilkinson


  Translation Journal

Translators' Tools

Quick Corpora Compiling Using Web as Corpus

by Michael Wilkinson

During the past five years I have written several articles discussing some of the ways in which a monolingual target-language corpus can be a useful performance-enhancing resource in translating. Some of these articles are viewable online (see Wilkinson 2005a; 2005b; 2007a; 2007b). I have also pointed out that, because very few ready-made special-field corpora are at present available (either for free or commercially), translators should be able to compile their own specialized corpora, tailor-made to suit their own requirements, and I have suggested ways of doing this (Wilkinson 2006). Unfortunately, compiling a sufficiently large do-it-yourself corpus is a rather time-consuming process.

The Web as Corpus (WaC) website, launched by Bill Fletcher in 2007, includes a user-friendly freeware concordancer that goes a long way towards solving this problem. Below I provide a simple guide on how to use the tool.

Go to the WaC site at: http://webascorpus.org/

Click on Web Concordancer, which will take you to http://webascorpus.org/searchwac.html, as shown in Figure 1. Select the language you want under "Find web concordances in xxxxx".

Figure 1

Click on the "Source Options" Tab and you will get the view shown in Figure 2:

Figure 2

You can now select things like the amount of context to show, how many web pages to analyse and maximum matches to show per page as well as "source" countries. It's perhaps best to try these alternatives out for yourself to see how they affect your results. In the "Details" section of the tool, Fletcher has various tips about this, such as the following:

  • While the option to retrieve up to 500 pages at a time does exist, that may not be possible as the server terminates the script after 5 minutes.  300-400 matches per time is a good practical limit; you can fetch the rest of the available matches via the "continue this search" (top of page) or "more > >" (bottom) links.

Now click on the "Advanced Query" tab. You need to feed in enough words to make sure that the pages found are restricted to the special field you have in mind. So recently, when I was translating a text about a multi-function conference centre in eastern Finland from Finnish into English, I tried the query shown in Figure 3.

Figure 3

Notice that you must enter each word or phrase on a separate line. If I'd entered "conference facilities" on the same line, it would have been treated as a single phrase, which would have restricted the search excessively. Notice also that I've selected "Match all of these words"; if I'd selected "Match any" I would have got too many irrelevant "hits" (e.g. all kinds of facilities that are unrelated to conferences). You need to enter enough words to limit the field, but not so many that you restrict the finds excessively. Unfortunately, you can't use wildcards in your search words.
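The difference between "Match all" and "Match any" can be illustrated with a minimal Python sketch. This is not how the Web Concordancer or its search engine works internally; the function name `matches`, the sample pages and the simple case-insensitive substring test are all assumptions made purely for illustration.

```python
def matches(page_text, terms, mode="all"):
    """Return True if the page contains all (or any) of the terms.

    Illustrative only: a plain case-insensitive substring test,
    not the matching logic of a real search engine.
    """
    text = page_text.lower()
    hits = [term.lower() in text for term in terms]
    return all(hits) if mode == "all" else any(hits)


# Two invented sample "pages"
pages = [
    "Our conference centre offers modern facilities for 300 delegates.",
    "The sports facilities include a gym and a swimming pool.",
]
terms = ["conference", "facilities"]  # one term per line, as in the query form

match_all = [matches(p, terms, "all") for p in pages]
match_any = [matches(p, terms, "any") for p in pages]
print(match_all)  # [True, False]
print(match_any)  # [True, True]
```

With "all", only the page that mentions both words survives; with "any", the unrelated sports page gets through as well, which is the kind of noise described above.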

The Include / Exclude boxes also help you to focus your search. When I first started using this tool, I compiled a corpus to help me with a translation about seals in the lakes of eastern Finland. My failure to use the exclude-box resulted in a tremendous amount of unwanted material relating to, for example, mechanical devices that prevent leakage, the special operations forces of the US Navy, contract law, emblems, and a well-known British singer. I should have read Fletcher's tips about this:

  • To further focus your search, specify words or phrases that must / may not occur on webpages (include / exclude). These filter terms are not shown in the results.
  • Include search terms are logically ORed (i.e. any one of them matches) and exclude terms are ANDed (i.e. none may be present).
  • For example, a search for bass could include terms like player, music, clef, sing and exclude terms like fish, boat, hook to clarify which homograph is meant.
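Fletcher's include / exclude rule (any include term is enough, no exclude term may be present) can be sketched in a few lines of Python. The helper names `tokens` and `keep_page`, the whole-word matching and the three sample pages are hypothetical and are not taken from the tool itself.

```python
import re


def tokens(text):
    """Lower-cased set of the words on a page (whole words only)."""
    return set(re.findall(r"\w+", text.lower()))


def keep_page(page_text, include=(), exclude=()):
    """Apply the filter rule: include terms are ORed, exclude terms are ANDed."""
    words = tokens(page_text)
    has_included = not include or any(term in words for term in include)
    has_excluded = any(term in words for term in exclude)
    return has_included and not has_excluded


# Invented pages for the "bass" example
pages = {
    "music": "The bass player read the clef and began to sing.",
    "fishing": "He caught a bass from the boat with a single hook.",
    "mixed": "The bass player went out on a boat.",
}
include = ["player", "music", "clef", "sing"]
exclude = ["fish", "boat", "hook"]

kept = [name for name, text in pages.items() if keep_page(text, include, exclude)]
print(kept)  # ['music']
```

The "mixed" page shows why the exclude list is strict: a single excluded word is enough to drop a page even when an include term is present.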

Now press the "Find" button in the top left of the Advanced Query window.

The Web Concordancer occasionally stalls before it reaches the specified number of hits. This is apparently due to a delayed response from Bing (the search engine it uses) or overload on WaC's server. Clicking the "continue stalled search" link usually succeeds in remedying this.

Figure 4

However, you soon get lots of ready-made concordance lines, the first of which are visible in Figure 4. These are quite useful as such, but you can't "manipulate" them in the same way as you can with a corpus analysis tool. At this point you also have the option of excluding any of the files by unticking the "include in download" checkbox after the document data. For example, with the conference + facilities search I got some sites about Italian conference centres that had clearly been translated from Italian into English, so it's handy that those can be deselected so easily.

Next you must press the "text files" option in the menu at the top of the page and save the compressed file in a folder of your choice. You will then need to unzip the files into another folder - after which you can try them out with a corpus analysis tool. If you have selected the "combine textfiles" download option (see Figure 7), you will only need to import a single file into your analysis software.
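If you have downloaded separate text files and would rather merge them yourself, the unzip-and-combine step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_and_combine`, the file names and the assumption that the archive contains UTF-8 `.txt` files are mine, not details documented for the tool. The demonstration at the end builds a tiny stand-in archive so that the example runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_combine(zip_path, out_dir, combined_name="corpus_combined.txt"):
    """Unzip an archive of .txt files and merge them into one file.

    Assumes the text files are UTF-8 encoded (an assumption for this sketch).
    Returns the path of the combined file and the number of files merged.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)

    combined = out_dir / combined_name
    # Collect the parts first, skipping a combined file left by an earlier run
    parts = sorted(p for p in out_dir.rglob("*.txt") if p != combined)
    with combined.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
            out.write("\n")
    return combined, len(parts)


# Demonstration with a small invented archive standing in for the real download
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "download.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("page1.txt", "First page text.")
        archive.writestr("page2.txt", "Second page text.")
    combined, count = extract_and_combine(zip_path, Path(tmp) / "corpus")
    combined_text = combined.read_text(encoding="utf-8")

print(count)  # 2
```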

With my first query I got a corpus of around 150,000 words, which I then expanded by trying other queries with the Web Concordancer - e.g. conference + data projector. So in less than an hour I had a corpus containing over 500,000 words that was very helpful for checking on terminology usage. In Figures 5 & 6 you can see screenshots showing some of the results from a couple of my searches using WordSmith Tools Version 5 (Scott 2008).

Figure 5


Figure 6
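For readers without a corpus analysis program, the basic idea of a keyword-in-context (KWIC) search and a word count can be sketched in Python. This is only a rough illustration and in no way a substitute for WordSmith Tools; the function name `kwic`, the sample text and the simple definition of a "word" as a run of letters or digits are assumptions.

```python
import re


def kwic(text, keyword, width=30):
    """Return simple keyword-in-context lines for every match of keyword."""
    pattern = r"\b%s\b" % re.escape(keyword)
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Invented sample text standing in for a downloaded corpus file
sample = (
    "The conference centre has a data projector in every room. "
    "A second data projector can be hired for larger events."
)

lines = kwic(sample, "data projector", width=25)
for line in lines:
    print(line)

word_count = len(re.findall(r"\w+", sample))
print(word_count)  # 20
```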

The corpora produced are so-called "dirty corpora", meaning that they haven't been tidied up. Some of my students at the University of Eastern Finland have used Web Concordancer in the past academic year to compile Finnish-language corpora as translation aids or for research purposes and have complained that their corpora tend to be very "messy". For example, Iloniemi (2010) points out that "text files are by default encoded as UTF-8. In WordSmith Tools, Finnish text is displayed incorrectly with this encoding, the preferred encoding being ANSI." Thus the letters å, ä and ö split the words and come out as nonsense graphemes. Iloniemi points out that the same shortcoming may also apply to other languages which include unique characters. However, this shortcoming has now been remedied with the latest update of Web Concordancer (24 May 2010). The new release enables conversion from UTF-8 into more widely supported encodings, such as Windows-1252 or ISO-8859-1 for Western European languages. To do this you need to press the "Download Options" tab (see Figure 7).

Figure 7

Under "Convert textfile encoding from UTF to..." you can open a pop-up window with over 40 different text-file encodings to choose from (see Figure 8). The download option also includes other useful alternatives for reformatting your text files - these are explained in detail under the "Explain formats and options" link.

Figure 8
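For files that were downloaded before this option existed, the same kind of conversion can be done afterwards. The sketch below uses Python's standard codecs, where Windows-1252 is called `cp1252`; the function name `convert_encoding`, the file names and the sample words are illustrative assumptions. The last lines show why UTF-8 text read as Windows-1252 produces the "nonsense graphemes" mentioned above.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, source="utf-8", target="cp1252"):
    """Re-encode a text file, e.g. from UTF-8 to Windows-1252.

    Characters that do not exist in the target encoding are replaced
    with '?' rather than raising an error (a choice made for this sketch).
    """
    text = Path(src).read_text(encoding=source)
    Path(dst).write_bytes(text.encode(target, errors="replace"))


# Demonstration with a small invented file
sample = "Itä-Suomi, järvi, Åbo"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "page_utf8.txt"
    dst = Path(tmp) / "page_cp1252.txt"
    src.write_bytes(sample.encode("utf-8"))
    convert_encoding(src, dst)
    converted = dst.read_bytes()

# Each of å, ä, ö takes two bytes in UTF-8 but one byte in Windows-1252
print(len(sample.encode("utf-8")), len(converted))

# Reading UTF-8 bytes as if they were Windows-1252 splits each letter in two
garbled = "ä".encode("utf-8").decode("cp1252")
print(garbled)  # Ã¤
```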

In the space of a few minutes, after selecting an appropriate encoding, I compiled a 100,000-word Finnish-language "conference corpus", and as can be seen in Figure 9 this produced very clean results.

Figure 9

I strongly recommend those translators who use corpus analysis programs as translation aids to experiment with Web Concordancer. In my opinion it is an invaluable tool for those who need to compile a specialised corpus in a short time.




Fletcher, William H. (2007-2010). Web as Corpus Web Concordancer. http://webascorpus.org/searchwac.html.

Iloniemi, Stiina (2010, unpublished). "Key Word Analyses of Text Corpora as Interpreters' Tools". Pro Seminar paper, University of Eastern Finland.

Scott, Mike (2008). WordSmith Tools version 5, Liverpool: Lexical Analysis Software. http://www.lexically.net/wordsmith/index.html

Wilkinson, Michael (2005a). "Using a Specialized Corpus to Improve Translation Quality", in Translation Journal, Volume 9, No 3. Online at: http://translationjournal.net/journal/33corpus.htm

Wilkinson, Michael (2005b). "Discovering Translation Equivalents in a Tourism Corpus by Means of Fuzzy Searching", in Translation Journal, Volume 9, No 4. Online at: http://translationjournal.net/journal/34corpus.htm

Wilkinson, Michael (2006). "Compiling Corpora for use as Translation Resources", in Translation Journal, Volume 10, No 1. Online at: http://translationjournal.net/journal/35corpus.htm

Wilkinson, Michael (2007a). "The corpus analysis tool—an under-exploited translation aid" in Kääntäjä 7/2006. Online at: http://www.lexically.net/wordsmith/corpus_linguistics_links/Wilkinson.doc

Wilkinson, Michael (2007b). "Corpora, Serendipity & Advanced Search Techniques", in The Journal of Specialised Translation, 2007. Online at: http://www.jostrans.org/issue07/art_wilkinson.php