Bowker, Lynne. "Towards a Methodology for Exploiting Specialized Target Language Corpora as Translation Resources." International Journal of Corpus Linguistics 5(1), 2000, 17–52. © John Benjamins Publishing Co.

Revised paper submitted to the International Journal of Corpus Linguistics on 10 November 1999 (Ref: OA-67-99)

Title: Towards a Methodology for Exploiting Specialized Target Language Corpora as Translation Resources
Author: Lynne Bowker
Affiliation: School of Applied Language and Intercultural Studies, Dublin City University, Dublin 9, Ireland
Tel: +353 1 704 5381
Fax: +353 1 704 5527
Email: [email protected]

Towards a Methodology for Exploiting Specialized Target Language Corpora as Translation Resources

LYNNE BOWKER
School of Applied Language and Intercultural Studies, Dublin City University

Specialized target language (TL) corpora constitute an extremely valuable resource for translators, and although no specialized tools have been developed for extracting translation data from such corpora, this paper argues that translators would be remiss not to consult such resources. We describe the advantages of using specialized TL corpora and outline a number of techniques that translators can use in order to extract translation data from such corpora with the aid of generic corpus analysis tools. These advantages and techniques are demonstrated with reference to two translations, one of which was done using only conventional resources and the other with the help of a corpus.

KEYWORDS: specialized target language corpus, translation resource, generic corpus analysis tools, data extraction techniques

1. Introduction

In the words of Burnard (1999:23), "Research needs constantly advance and evolve in an uneasy symbiosis with the capabilities of technology." Burnard goes on to observe that this is particularly true of research involving corpus analysis, which could not be easily accomplished without the help of computers. Nevertheless, despite the obvious need for computers to assist with corpus analysis, an accusation often levelled against technology is that it forces researchers to alter or adapt their research in undesirable ways in order to do what is permitted by the available tools, instead of carrying out the investigations they would ultimately like to conduct. In an ideal world, what researchers want to investigate would be independent of the tools at their disposal, but in reality, what researchers want to do cannot always be neatly separated from what they know how to do, which is in turn often tied to the tools available to them. It has therefore been suggested that perhaps linguists should learn how to program, which would allow them to develop their own tools that would better meet their needs. [i]

Furthering this debate is beyond the scope of this paper, but regardless of whether people feel that linguists should be able to program, the reality of the situation is that, for reasons such as lack of time, interest or ability, many linguists will never be able to develop their own computerized tools. However, instead of letting the tools drive the research, linguists may be able to compromise by coming up with ways to use these tools which may extend beyond the intended purposes envisaged by the software developers, but which nevertheless make it possible for the linguists to carry out the desired investigations. In this paper, we will address the case of corpus-based translation, with a particular focus on how translators themselves can develop a useful methodology which will enable them to use generic corpus analysis tools to exploit specialized target language (TL) corpora as resources to help them in their everyday work.

This paper is organized into five sections. In section 2, we briefly investigate the different types of corpora that have already been integrated into translation studies and evaluate their usefulness as potential translation resources. In section 3, we explore the advantages to be gained by using a new type of corpus, the specialized TL corpus, as a translation resource. In section 4, we outline the basic capabilities of generic corpus analysis tools, using WordSmith Tools as an example. Finally, in section 5, we propose techniques that can be used to extract relevant translation data from a monolingual TL corpus. These techniques are demonstrated using an extended example which shows how an extract from a semi-specialized French text on I2O operating system technology has been translated into English both with and without the help of a specialized TL corpus.

2. Corpora and translation studies

The use of corpora in the discipline of translation is enjoying increasing popularity and we will begin by briefly examining a number of different types of corpora [ii] which have been integrated into translation studies in a number of different ways to date.

One type of corpus is the monolingual comparable corpus, which is defined by Baker (1995:234) as two separate collections of texts in the same language: one corpus consists of original texts in the language in question and the other consists of translations into that language from a given source language (SL). Researchers such as Baker (1995) and Laviosa (1998a/b) have demonstrated the usefulness of monolingual comparable corpora for investigating the phenomenon of translation by carrying out descriptive studies which have resulted in the identification of a variety of universal features of translated texts. While this type of descriptive study is an interesting and important research area, its value is mainly theoretical and the monolingual comparable corpora used for these purposes do not really provide working translators with much needed translation resources.

Another type of corpus often used in translation is an aligned bi- or multilingual corpus. This type of corpus contains original texts in L1 (language 1) alongside their translations into L2 (and possibly L3, L4, etc.). The texts and their translations are generally aligned at sentence level. McEnery and Wilson (1993) and Somers (1993) discuss the potential of aligned corpora in applications such as machine translation and other areas of natural language processing, while Malmkjaer (1998) and Peters and Picchi (1998) explore the potential of aligned corpora as a source of translation equivalents for human translators. The merits of using aligned corpora for machine translation purposes are evident since machines cannot interpret data but must rely on techniques such as pattern matching and statistical probabilities in order to identify potential translation equivalents. Moreover, it is generally accepted that the output of machine translation will not be of as high a quality as texts produced by human translators. With regard to human translation, however, where there is an expectation of high-quality output, we can identify two main drawbacks regarding the usefulness of aligned corpora. Firstly, there are relatively few corpora of this type readily available (certainly not enough to cover the wide variety of text types and subject fields that translators have to deal with), and even if suitable texts can be found, it is difficult for translators, who are constantly working against tight deadlines, to create this type of corpus because it needs to be aligned [iii]. Secondly, professional translators are very wary of relying on translated material as a resource. They typically prefer to consult documents that have originally been written in the TL because such documents are less likely to contain conceptual or terminological errors or unidiomatic constructions (Peters and Picchi 1997:254; Teubert 1996:247).

This has led some researchers, such as Peters and Picchi (1997:254), to suggest that a bi- or multilingual comparable corpus might be a more reliable source for certain types of studies (e.g. studies of how a particular concept is rendered independently in different languages where the starting point is the idea and not an existing text in a given language). This type of corpus consists of sets of texts from multiple languages which have been composed independently in their respective language communities, but which have the same communicative function. Thus they are not translations, but they do share certain features such as topic, date of publication, text type, etc. This type of corpus would be a very valuable resource for translators engaged in building up subject field glossaries, which involves studying concepts independently of specific texts; however, when translators are working on a particular translation, they are, by definition, focusing on one particular source text. Therefore, the task of collecting an entire corpus of texts in the source language is superfluous and consumes precious time. [iv]

We propose that, as a translation resource for a practicing translator, the most appropriate and useful type of corpus is a specialized target language corpus. This type of corpus has already proved useful in the context of terminology research (e.g. Bowker 1998a; Pearson 1996), and we feel that its applications can be usefully extended to apply to translation. The following section outlines some of the advantages to be gained by using such corpora as translation resources.

3. Advantages of using a specialized TL corpus as a translation resource

If we consider a corpus to be a large collection of texts in electronic form that have been selected according to explicit criteria, then we must acknowledge that, in some respects, the use of a specialized TL corpus is not so much a radical departure from approaches currently used by many translators as a refinement of those approaches. The use of such a corpus mirrors to some extent the use of what are known in conventional translation circles as parallel texts; that is, documents that have been produced independently in the target language, but which have the same communicative function as the source text [v]. The value of printed parallel texts as translation resources has already been attested in the literature (Schäffner 1998; Williams 1996).

Although conventional lexicographic resources, such as dictionaries, do have some strengths, they also have a considerable number of drawbacks that parallel texts can help to overcome. Firstly, dictionary entries are decontextualized, and hence users may not comprehend or use a term correctly, whereas parallel texts present terms in authentic contexts, allowing translators to acquire both specialized conceptual and linguistic knowledge. Secondly, in dictionaries, information about frequency or generality of use is not provided in a consistent manner, whereas this information can be obtained somewhat more readily from parallel texts. Finally, because of the length of time required to compile and publish a dictionary, it may not always reflect current usage. Therefore, a translator who uses information from a dictionary may not be getting up-to-date information, even if the publication date of the dictionary is recent. Parallel texts, on the other hand, are typically more current and contain a wider range of terms.

Nevertheless, when consulted in their conventional printed form, parallel texts also present a number of pitfalls. In the first place, the effort required to physically gather together a printed corpus often means spending hours at the library and/or photocopier. Once the corpus is gathered, further hours must be spent consulting the texts, and this generally entails reading a great deal of irrelevant material before stumbling upon a discussion of a pertinent point. Thus acquiring and consulting parallel documentation in printed form has two major disadvantages. The first is that when working manually, the translator typically cannot gather and consult a wide enough range of documents to ensure that all relevant concepts, terms and linguistic patterns will be present. The second is that, as observed by Church et al. (1991:116), manual analysis is inherently error-prone: "The unaided human mind simply cannot discover all the significant patterns, let alone group them and rank them in order of importance."

Gathering parallel texts in the form of an electronic corpus has numerous benefits. Firstly, corpora can increase the scale and speed of a translator's research. It is well known that translators have to work on tight deadlines, but electronic corpora allow them to process a greater number of documents at a faster rate. Secondly, corpora can be searched and manipulated in ways that are simply not possible with other formats. For instance, information that may be difficult to detect by conducting manual searches can be brought to light more readily using corpora (e.g. collocational patterns, frequency data). [vi]

Needless to say, the value of what comes out of a corpus is heavily dependent on the quality of what goes into the corpus. A description of the design of a corpus suitable to be used as a translation/terminology resource is beyond the scope of this paper and has already been discussed in detail elsewhere in the literature (e.g. Bowker 1996 and 1998b; Meyer and Mackintosh 1996; Pearson 1998). However, it is worth mentioning that the criteria which generally need to be considered include size, text type, domain coverage, date of publication, author and language.

4. Corpus analysis tools

As outlined above, the aim of this paper is to investigate how a specialized TL corpus can be exploited as a translation resource (e.g. for subject field knowledge, terminology, phraseology, etc.). One of the main difficulties in using a monolingual TL corpus as a translation resource lies in finding ways in which to access the desired data. When using an aligned bilingual corpus, the strategy used can be much the same as when using a bilingual dictionary: a translator can use the unknown term or phrase in the source language in order to access the potential equivalent in the target language. However, using a monolingual TL corpus is much like using a monolingual TL dictionary: in order to look up a term in a TL dictionary, the translator has to know what term to look for, and if the translator already knows the term, chances are there is no need to look it up. The main challenge then is to devise a method for looking up "unknown" terms in resources where the SL term cannot be used as an access point.

To date, there have been no monolingual corpus analysis tools developed specifically for translation applications. [vii] Many of the corpus analysis tools currently available, such as concordancers, were originally aimed at language teachers and lexicographers. Although it would be desirable to have some corpus analysis tools developed specifically for translators, it would be rather short-sighted to wait until such tools have been developed before starting to use corpora, and hasty to dismiss the results of corpus-based analysis as "dubious" or "irrelevant" simply because the tools that were used were not specially designed for the job. According to Barnbrook (1996:165), "the most important qualities needed by anyone who wants to use a computer to explore and analyse language are imagination, flexibility and, of course, patience." Translators have these qualities in abundance; all they need to do is apply them. Until custom-built tools become available, translators can work on developing methodologies for using existing non-translation-specific tools to meet their own needs. This may be dismissed by some as being overly opportunistic or as resorting to a "bag of tricks", but it is worth pointing out that, as in all types of corpus analysis, the tools are only being used as a means to an end – they will never replace the need for careful analysis by the translator, who is still ultimately responsible for interpreting the data.

Let us now examine some of the basic features provided by generic corpus analysis tools, taking WordSmith Tools as an example, before going on to explore how such tools can be usefully employed by translators in order to extract relevant information from specialized TL corpora.

4.1 WordSmith Tools

WordSmith Tools [viii] is a suite of integrated corpus analysis tools developed by Mike Scott of the University of Liverpool. It contains a number of useful features that allow users to generate and manipulate word lists, concordances and collocations.

The WordList feature is used to create alphabetical and frequency-ranked lists of all the words found in the corpus. Users can thus establish which words seem to be 'important' in the corpus on the basis of frequency [ix] and they can compare the frequencies of different words. WordList also contains a feature called Cluster which can be used to generate strings of words that recur with a specified minimum frequency.

A related feature is KeyWords, which takes as input two word lists: one is the word list generated for the corpus in which the user is interested, and the second is a word list generated from a kind of general reference corpus. Using this input, KeyWords produces a list of words that particularly characterize the corpus under investigation because they appear more (or less) frequently than would be expected in relation to the reference list.

The Concord feature is a monolingual concordancer that retrieves all the occurrences of a particular search pattern in their immediate contexts and displays these in an easy-to-read format (i.e., KWIC – key word in context) with the search pattern highlighted in the centre of the screen. The extent of the context on either side of the search pattern is variable. Moreover, these contexts can be sorted in a variety of ways, such as in order of appearance in the corpus, or alphabetically according to the words preceding or following the search pattern. Concord permits relatively sophisticated search patterns, allowing functions such as case-sensitive vs non-case-sensitive searches, wildcard searches (e.g. print* to retrieve print, printed, printer, printing, prints, etc.), and context searches where another term must appear within a user-specified distance of the search pattern (e.g. contexts where cartridge appears within five words of printer).
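By way of illustration, the core of a word-list function of this kind can be approximated in a few lines of general-purpose code. The following Python sketch is not part of WordSmith Tools; it is a minimal illustration of how alphabetical and frequency-ranked word lists can be produced from a plain-text corpus, and the file name and tokenization rule are assumptions made purely for the example.

    import re
    from collections import Counter

    def build_word_lists(path):
        # Read the corpus and split it into lowercased word tokens.
        with open(path, encoding="utf-8") as f:
            tokens = re.findall(r"[a-zA-Z][a-zA-Z'-]*", f.read().lower())
        counts = Counter(tokens)
        alphabetical = sorted(counts.items())   # alphabetical word list
        by_frequency = counts.most_common()     # frequency-ranked word list
        return alphabetical, by_frequency

    if __name__ == "__main__":
        # 'i2o_corpus.txt' is a placeholder for the user's own corpus file.
        alpha, freq = build_word_lists("i2o_corpus.txt")
        for word, n in freq[:20]:
            print(f"{word}\t{n}")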

Concord also has the ability to compute the collocates of the search pattern. Collocates are regarded as words which "go with" the search pattern (i.e., words that appear within a user-specified distance of the search pattern and with a user-specified minimum frequency within that distance). Once computed, the collocates are displayed in alphabetical or frequency-ranked order in a table that shows the search pattern and the positions to the left and right of the search pattern in which each collocate occurs. The number of instances of each collocate appearing in each of the positions is also provided.

To successfully use a generic corpus analysis tool for extracting translation-related information from a monolingual TL corpus, translators must bring all their conceptual, linguistic, and textual knowledge to bear on the task at hand. Much of the initial work may be based on trial and error, and strategies that work for a particular language pair or a specific text, text type or subject field may not be applicable in other situations. In the following section, we will present a number of strategies and methods, some simple and some more sophisticated, which have proved helpful in allowing translators to identify and extract useful information from specialized TL corpora using WordSmith Tools.

To demonstrate these techniques, we will use a short extract from a semi-specialized French text on operating system technology which has been translated into English by two student translators, one of whom used the corpus and one of whom did not. We must emphasize that this text is being used strictly as an example to demonstrate extraction techniques. We are not claiming that the data could not be found in any other resources, or that the list of strategies outlined here is exhaustive or foolproof. Nor do we intend this list of techniques to act as a recipe which should be followed step by step [x]. Our aim is somewhat more modest: we are simply providing examples of possible search strategies, and we hope that other translators will be inspired to use and adapt these, as well as to develop additional techniques that are suited to the languages, subject fields and text types that they encounter.
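To make the notion of a collocate display described at the start of this section more concrete, the following minimal Python sketch counts the words occurring within a user-specified span of a node word, broken down by position to the left and right. It illustrates the general principle only, not WordSmith's own implementation; the span size, minimum frequency and toy token list are assumptions made for the example.

    from collections import Counter, defaultdict

    def collocates(tokens, node, span=5, min_freq=3):
        """Count words occurring within `span` tokens of `node`,
        broken down by position (negative = left, positive = right)."""
        by_position = defaultdict(Counter)
        totals = Counter()
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            for offset in range(-span, span + 1):
                if offset == 0 or not (0 <= i + offset < len(tokens)):
                    continue
                word = tokens[i + offset]
                by_position[word][offset] += 1
                totals[word] += 1
        # Keep only collocates that reach the minimum overall frequency.
        return {w: dict(by_position[w]) for w, n in totals.items() if n >= min_freq}

    # Toy example; a real corpus would be read from file and tokenized first.
    toks = "the device driver must be written and the device driver tested".split()
    print(collocates(toks, "driver", span=3, min_freq=1))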

5. Techniques for extracting translation information from a specialized TL corpus

Before describing the specific strategies and methods used to extract information from the corpus, we will first introduce the source text and the corpus from which the examples have been taken. The source text is a 180-word extract from a text on a new type of operating system technology known as I2O. This text appeared in the French semi-specialized publication Informatiques magazine in June 1998 (see Appendix).

The specialized TL corpus compiled to assist with the translation of this source text consists of just over 150,000 words extracted from a series of commercially available CD-ROMs called Computer Select [xi]. Each disc contains thousands of English-language articles from several hundred publications dealing with a wide range of computer-related topics. Each article is indexed according to a variety of criteria, including source, publication date, text type, and keywords, which means that suitable texts (i.e., those that have a similar communicative function to the source text) can be extracted for inclusion in the corpus. In our corpus, we included texts that were 1) of a similar text type, 2) from a similar type of publication, 3) of a similar publication date and 4) on the same topic as the source text. In other words, our corpus consisted of feature articles about I2O operating system technology published in semi-specialized computer magazines (e.g. Byte, PC Week) dating from 1997 and 1998. The time required to compile the corpus was approximately 45 minutes.

In the following sections, we will describe specific techniques that can be used to identify and extract information from the corpus. Where relevant, we will refer to the two translations contained in the Appendix. Both translations were done by students on the Graduate Diploma in Translation Studies programme at Dublin City University, neither of whom had any previous experience translating in the subject field of operating system technology. The student who produced translation A used only conventional resources such as mono- and bilingual general and specialized dictionaries and printed parallel texts, while the student who produced translation B had access to both the conventional lexicographic resources and the specialized TL corpus, which was exploited with the help of WordSmith [xii]. Following the translation exercise, both students were interviewed and asked to comment on the translation choices they had made, the strategies they had used, and the value of the resources at their disposal.

5.1 Identifying important terms

When working with conventional parallel texts, a translator can get a feel for important terms in the domain by skim reading one or two of these texts, and she [xiii] may even identify potential translation equivalents during this process. Obviously, the same linear approach cannot be used on a corpus containing tens or hundreds of thousands of words; however, by using a feature such as WordList, a translator can identify the most frequently occurring words in the corpus and thus get a feel for which terms might be common or important in the field.

Of course, the chances of finding a potential translation equivalent using this strategy depend on a number of factors. Firstly, the size of the corpus will largely determine the length of the word list that will be generated. A long word list will be more time-consuming to scan; however, it will inevitably be quicker than scanning running text, which is the method that translators have to resort to if they are working with a printed corpus of parallel texts. The length of the word list can be reduced by using a stop list, which is a list of words to be ignored. Typically, stop lists contain function words, such as conjunctions and prepositions, which have little or no semantic value and are unlikely to be key terms in the domain. Words can also be lemmatized to get a truer indication of relative frequency (e.g. the different inflections of a word can be grouped together). Secondly, the importance of the unknown term and its equivalent within the domain is another factor that will determine whether or not scanning a word list will help to find the equivalent. Often, important words appear frequently in a text, though this is not always the case as the actual word may sometimes be replaced by a pronoun, abbreviation or other referential expression. When faced with a long word list, a translator may decide to scan only a portion of the list (e.g. the top 100 words), so the more common the word is in the domain, the greater the likelihood that it will appear higher up on the list. A final consideration, and one that is very important, is the depth of the translator's subject field knowledge. Sometimes a translator understands the concept being expressed in the source text, but is unable to conjure up the corresponding TL expression off the top of her head. In such cases, perhaps the translator will be able to make the connection when faced with a potential term on the word list, since it is generally accepted that people find it easier to recognize and choose appropriate items when presented with a selection of possibilities, rather than having to dredge everything up from memory [xiv].

Referring to the source text in the Appendix, we see that the term pilote appears. In order to find a potential equivalent, a translator might decide to scan the frequency list, and then to further investigate any plausible candidates in the concordancer. In scanning the 100 most frequent words in the corpus, the translator comes across the term 'driver', which appears in position 86 on an unlemmatized frequency list where no stop list was used, and in position 17 on a frequency list which has been lemmatized and where a stop list has been used. Using a combination of real-world knowledge and conceptual association, a translator may deduce something along the following lines: pilote is generally translated by 'pilot', and a pilot is most commonly used to refer to someone who "drives" an airplane, so a possible translation for pilote may be 'driver'.
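The effect of a stop list and of lemmatization on a frequency list can be sketched in a few lines of Python. The stop-word set and the suffix-stripping rule below are deliberately crude assumptions made for the example; they are not the procedures used in WordSmith Tools.

    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on",
                  "with", "is", "are", "that", "this", "it", "as", "by", "be"}

    def lemma(word):
        # Very crude lemmatization: group plural/verbal -s forms with the base form.
        return word[:-1] if word.endswith("s") and len(word) > 3 else word

    def filtered_frequency_list(tokens):
        # Drop stop words, group inflected forms, and rank by frequency.
        counts = Counter(lemma(t) for t in tokens if t not in STOP_WORDS)
        return counts.most_common()

    toks = "the drivers are written and each driver is tested for the device".split()
    print(filtered_frequency_list(toks))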
A subsequent concordance search on 'driver*' revealed 409 contexts, and a closer examination of some of these, as shown in table 1, allowed the translator to establish a conceptual match between pilote and 'driver'.

    soft or Novell, can write generic    drivers for peripherals. Future revisi
     are spared the pain of writing a    driver for each combination of periphe
      deeply influences the way device   drivers are written. Thus, it is inex
    cause it allows them to write one    driver for the device instead of one f
    s peripheral vendors to write one    driver that runs on the intelligent pr
    ystem requires a different device    driver to be written, tested, and inte
    issue. Currently, a unique device    driver must be written, verified, conf

Table 1. Selected concordances retrieved using the search pattern 'driver*'.
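A concordance search of this kind can also be approximated with generic code. The following Python sketch produces a simple KWIC display for a search pattern with an optional trailing wildcard, along the lines of the 'driver*' search above; the display width and the sample sentence are illustrative assumptions, not output from the I2O corpus.

    import re

    def kwic(text, pattern, width=35):
        """Print a simple KWIC display for a search pattern;
        a trailing '*' matches any word ending (e.g. 'driver*')."""
        regex = re.compile(r"\b" + re.escape(pattern.rstrip("*")) +
                           (r"\w*" if pattern.endswith("*") else "") + r"\b",
                           re.IGNORECASE)
        for m in regex.finditer(text):
            left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
            right = text[m.end():m.end() + width].replace("\n", " ")
            print(f"{left:>{width}} {m.group(0)} {right}")

    sample = ("OS vendors write a single I2O driver for each class of device, "
              "and peripheral vendors are spared the pain of writing a driver "
              "for each combination of peripheral and operating system.")
    kwic(sample, "driver*")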

Furthermore, as described in section 4.1, WordSmith also has a KeyWords feature, which a translator can use to compare the frequency list of the I2O operating system corpus with a list taken from a reference corpus. On the basis of this comparison, KeyWords produces a list of words which characterize the I2O corpus because they appear more frequently than would be expected in relation to the reference list, which in this case was a list generated from a corpus containing 95 million words taken from The Guardian newspaper [xv]. Based on this data, KeyWords produced a list in which 'driver' appears in position 62, which indicates that it is a relatively important term in the corpus in question.

It is worth pointing out that the term 'driver' is in fact a polyseme, which has another meaning in general language (i.e., the operator of a motor vehicle). This word is used quite regularly in general language, particularly in newspapers, which often report on incidents such as traffic accidents. The fact that it is a polyseme and that the reference list has been generated from a newspaper corpus has distorted the importance of 'driver' in the specialized corpus, causing it to appear somewhat farther down the list than it might have done if it were not a polyseme. Nevertheless, despite this distortion, the term still appears high enough up the list (well within the top 100 words) to signal that it might be a potential equivalent and merits further investigation.
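The principle behind such a keyword comparison can also be sketched in code. WordSmith computes keyness using statistical measures; the Python sketch below deliberately uses a much simpler relative-frequency ratio, with invented toy counts standing in for the I2O corpus and the reference corpus, purely to illustrate the idea of ranking words by how much more frequent they are in the study corpus than in the reference corpus.

    from collections import Counter

    def keywords(study_counts, ref_counts, min_freq=5):
        """Rank words by how much more frequent they are in the study corpus
        than in the reference corpus (simple relative-frequency ratio)."""
        study_total = sum(study_counts.values())
        ref_total = sum(ref_counts.values())
        scores = {}
        for word, n in study_counts.items():
            if n < min_freq:
                continue
            study_rel = n / study_total
            ref_rel = (ref_counts.get(word, 0) + 0.5) / ref_total  # smoothing for unseen words
            scores[word] = study_rel / ref_rel
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Toy counts for illustration only.
    study = Counter({"driver": 409, "subsystem": 139, "the": 9000, "of": 6000})
    reference = Counter({"driver": 12000, "subsystem": 150, "the": 5000000, "of": 3000000})
    for word, score in keywords(study, reference):
        print(word, round(score, 1))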

5.2 Investigating a linguistic hunch

One of the most straightforward uses that a translator can make of a TL corpus is as a testbed for investigating a linguistic hunch. For example, a translator may know that the translation of écrire is generally 'to write', but she may want to check that it fits in the context of software development and specifically of drivers. This can be done by simply doing a concordance search on the candidate term and studying the term in context to see if it is being used in a comparable way to refer to the same concept. In the source text, the context in question is: L'Irtos possède non seulement toutes les caractéristiques nécessaires pour écrire facilement des pilotes,… In order to verify the hunch that the translation of écrire may be 'write', the translator can simply enter 'writ*' as the search term in the concordancer. Using the wildcard (*) ensures that most forms of the lemma (e.g. write, writes, writing, written) will be retrieved. By reading through a selection of the concordances retrieved, as shown in table 2, the translator is able to establish that 'write' is indeed an appropriate translation for écrire in this context.

    evelopers of I/O products can    write one driver for any operating syste
    eat number of drivers must be    written, tested, integrated and supporte
     drivers required. OS vendors    write a single I2O driver for each class
    ice, and device manufacturers    write a single I2O driver for each devic
    e independent hardware vendor    writes a separate driver for every opera
    But if a single driver can be    written for all of them, then adapters c

Table 2. Selected concordances retrieved using the search pattern 'writ*'.

Just as importantly, this strategy quickly reveals if a translator's hunch is incorrect and alerts her to the fact that more research must be carried out to find an appropriate equivalent. This can be especially helpful in the case of false friends. For example, a translator may have a hunch that pilote can be translated by 'pilot', but a concordance search on the term 'pilot*' revealed that there are no occurrences of this pattern in the entire I2O corpus. Thus, the translator is made aware of the fact that this is probably not an appropriate translation since it does not appear even once in a corpus of some 150,000 words written on the subject of I2O operating system technology by experts in the field. This level of awareness might not be triggered by a dictionary since it would likely contain an entry for 'pilot', though not necessarily for all senses of the word. In our experience, novice translators often put blind faith in dictionaries, choosing to incorporate proposed equivalents into their translations even when they are not appropriate. This was the approach taken, for example, by student A, who translated one occurrence of pilote in the source text by 'pilot'. In contrast, student B was able to use the "negative" evidence provided by the corpus to raise her level of awareness and avoid making a similar error. Consequently, she broadened her search for an equivalent and eventually located and incorporated the correct term 'driver' into her target text.

5.3 Investigating a conceptual hunch

In addition to investigating linguistic hunches, translators can also use a specialized TL corpus to investigate conceptual hunches. For example, the source text contains the following passage: L'Irtos possède non seulement toutes les caractéristiques nécessaires pour écrire facilement des pilotes, mais il supporte également une API (interface de programmation d'application) pour créer des modules de services intermédiaires nommés ISM… We see that in her translation of this passage, student A has misunderstood the concept behind the term pilotes and has elected to translate it as 'Beta versions'. On the surface, this could be deemed a reasonable error since when we talk of a 'pilot project', we are referring to a trial study, and in the computing field, beta versions are pre-final versions of software that are released for testing by users so that any remaining bugs can be discovered and fixed before the final release. Unfortunately, in the context of the source text, this sense of 'pilot' does not apply.

When using a corpus, a translator can seek to verify such conceptual hunches. For example, a concordance search revealed that there were 19 occurrences of 'beta' in the corpus; however, a search for 'beta' using the context word 'IRTOS' (set to appear within a 25-word span) retrieved no occurrences. Similarly, there were no instances of 'beta' and 'write' or of 'beta' and 'API' appearing in the same 25-word context. The fact that 'beta' never appeared in the proximity of words such as 'IRTOS' or 'write' should alert the translator to the fact that, in this context, it is not likely that 'beta' and pilote are referring to the same concept. This can be confirmed by examining the contexts in which 'beta' does appear, as shown in table 3. Here it can be seen that 'beta' is referring to a different concept than the one referred to by pilote in the source text.

    re 7 (storage, management, and LAN    beta). http://www.data.com/round
    ying it will have a Windows NT 5.0    beta 1 and device driver kit availabl
    ilable in 3Q97. the release of the    beta 1 version of Memphis will be the
    released in what Mr. Stork termed "   beta 1" version in 3Q97. "The main th
    e shared SCSI concept is that even    beta 2 Wolfpack code can't support st
    and is now available in a workable    beta format for Windows NT 4.0 server
    s. PC Week Labs evaluated an early    beta of one of the first, SuperMicro
    ave OSMs available today in either    beta or released form. Furthermore, m
    lready looked at the first, barely    beta products (see PC Week, May 5, Pa
    in a developer's kit as well as a     beta release of its forthcoming Windo
    y both BYTE and NSTL of the second    beta release of Wolfpack. In addition
    has been subjected to an extensive    beta test program comprising a wide c
    gic markets development. Extensive    Beta Test Program Demonstrates Robust
    uction versions of products, never    beta-test versions. Products or solut
    xWare operating systems, is now in    beta testing and due at year's end. I
    d Aurora and due to be released in    beta this fall, the next major releas
    anging photos, or playing games. A    beta version of the new Intel Interne
    integrated with Windows NT, and a     beta version proves that the technolo
    equire users to have Access. * The    beta versions of the Internet Explore

Table 3. Concordances retrieved using the search pattern 'beta'.

As a further step in investigating this concept, the translator can examine extended contexts for IRTOS, as shown in table 4. Here it is revealed that in these contexts, the items which are being 'written' are not 'beta versions', but rather 'drivers'.

IRTOS components typically include an event-driven driver framework, host protocols and modules for configuration and control. Application programming interfaces (API) are necessary so that network operating system (NOS) publishers, such as Microsoft or Novell, can write generic drivers for peripherals. Future revisions of the Intelligent Input/Output (I2O) specification are expected to include specifications for accommodating sophisticated I/O subsystems, including peer-to-peer communications capabilities and management of hierarchical driver modules and intermediate service routines.

Simplifies Writing: The elements of the I2O intelligent real-time operating system (IRTOS) include an event-driven driver framework, host message protocols and executive modules for configuration and control. This design greatly simplifies the writing of basic device drivers, provides NOS-to-driver independence and provides a prioritized multithreaded framework that allows I/O software from multiple vendors to coexist safely.

IRTOS also shields the driver writer from detailed configuration issues, such as the number of interrupts hooked in one peripherals or the number of available DMA channels. The API is designed to hide such details; in effect, IRTOS acts as the BIOS for the IOP.

Table 4. Selection of extended contexts retrieved using the search pattern 'IRTOS'.
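A context-word search of the kind used above ('beta' within a 25-word span of 'IRTOS') can be sketched in a few lines of Python. The function below is an illustrative approximation of the general technique, not WordSmith's Concord, and the token list is an invented toy example.

    def context_search(tokens, node, context_word, span=25):
        """Return positions where `node` occurs with `context_word`
        somewhere within `span` tokens on either side."""
        hits = []
        for i, tok in enumerate(tokens):
            if tok.lower() != node.lower():
                continue
            window = tokens[max(0, i - span):i + span + 1]
            if any(w.lower() == context_word.lower() for w in window):
                hits.append(i)
        return hits

    toks = ("the IRTOS shields the driver writer from configuration issues "
            "while a beta release of the operating system is being tested").split()
    # Positions of 'beta' that co-occur with 'IRTOS' in the same 25-word span.
    print(context_search(toks, "beta", "IRTOS"))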

5.4 Selecting between multiple options

In the course of their work, translators are constantly having to make choices between different possible ways of expressing the source text message. When faced with choices such as orthographic variants, synonyms and word order, translators can look to a specialized monolingual TL corpus for help with making their decisions.

5.4.1 Orthographic variants

There are a number of terms in the source text which might require a translator to choose between different orthographic variants. For example, should the English equivalent of sous-système include a hyphen or not? The corpus revealed that in English the preferred spelling is the unhyphenated form 'subsystem', which appeared 124 times, as opposed to the hyphenated form 'sub-system', which appeared only twice. Student A did not have access to such frequency information and chose to use the hyphenated form, whereas student B was able to take advantage of the frequency information provided by the corpus and chose to use the unhyphenated form.

In another example, the acronyms Irtos and Raid appear in the source text bearing a capital on the initial letter only, and student A has maintained this form in her target text. Student B, however, was able to use the concordancer to discover that in an overwhelming number of cases, these terms are written in English using all upper-case letters. Case-sensitive searches revealed that 'IRTOS' appeared in the corpus 22 times, while 'Irtos' did not appear at all; meanwhile, 'RAID' appeared 200 times, and 'Raid' only twice.
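Counting case-sensitive spelling variants is straightforward to reproduce with generic code; the Python sketch below is an illustration only, and the sample sentence is invented for the example rather than drawn from the I2O corpus.

    import re

    def variant_counts(text, variants, case_sensitive=True):
        """Count whole-word occurrences of each spelling variant."""
        counts = {}
        flags = 0 if case_sensitive else re.IGNORECASE
        for v in variants:
            counts[v] = len(re.findall(r"\b" + re.escape(v) + r"\b", text, flags))
        return counts

    sample = "The RAID controller and the I2O subsystem sit on one sub-system card; RAID is common."
    print(variant_counts(sample, ["subsystem", "sub-system"]))
    print(variant_counts(sample, ["RAID", "Raid"]))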

5.4.2 Synonyms and pseudo-synonyms

Another type of choice that often has to be made by a translator is the choice between two synonyms. The source text contains the term la mémoire système, which according to specialized dictionaries can be translated as either 'system memory' or 'main memory'; however, no advice was given as to when one term should be used over the other. Student A elected to use the translation 'system memory' on the basis that it was most similar in appearance to the SL term; however, student B was able to use the corpus to discover that 'system memory' appeared only 5 times while 'main memory' was used 20 times; hence she chose to use the term 'main memory' in her translation.

Of course, such crude statistical measures cannot always be used as the sole determining factor for choosing between two apparent synonyms. As pointed out by Rogers (1997:219), absolute synonymy (i.e., substitutability in all contexts) is an elusive notion, even in specialized languages. The notion of substitutability introduces another aspect of meaning beyond that of denotative meaning, namely that of syntagmatic relations, or collocability. The source text presents an interesting example of a translator being required to make a choice between two apparently synonymous equivalents for the term périphérique. Specialized dictionaries indicated that in the field of computing, the terms 'peripheral' and 'device' both refer to the same concept; however, the corpus revealed some interesting differences in the usage of these two terms. 'Peripheral' appears in the corpus 92 times and 'device' 263 times. When used as nouns, both 'peripheral' and 'device' do seem to be synonymous and can be freely interchanged; however, when used as adjectives, they collocate with different nouns. As shown in table 5, 'peripheral' tends to collocate with animate nouns (e.g. vendors, manufacturers), while 'device' collocates with inanimate ones (e.g. module, driver [xvi]).

                    peripheral    device
    vendor              14            0
    manufacturer         7            1
    driver               0          111
    module               0           31

Table 5. Collocates for 'peripheral' and 'device'.

Moreover, it was interesting to note that although 'peripheral' in its adjectival form did not collocate with 'driver', the nominal form was regularly found in the same context as 'driver', as shown in table 6. This shows that, although it is not common for experts to speak of 'peripheral drivers', it is acceptable to speak of 'drivers for peripherals'. In contrast, there were no examples of 'drivers for devices', but many for 'device drivers'. This type of collocational information would not likely appear in a dictionary, and it would be difficult to uncover such patterns in printed parallel texts. Consequently, student A did not make use of collocational information when choosing between 'peripheral' and 'device', whereas student B did.

    … simply create generic drivers for each type of peripheral that a user might attach to a system
    … a single driver for each peripheral on the I/O side
    … spared the pain of writing a driver for each combination of peripheral, motherboard and operating system

Table 6. Selection of contexts containing the nominal form of 'peripheral' and 'driver'.

Another interesting revelation was that there were three occurrences of 'peripheral device' and one of 'peripheral device driver' in the corpus. This might indicate that, over time, the term has undergone a form of shortening, as described in Rogers (1997:220), but that different parts of the expanded term have been retained for collocation with different nouns.

5.4.3 Word order

A term such as sous-système disque may also present another difficult choice for the translator: word order. As a non-expert in the field, a translator may be unsure as to whether the term should be translated as 'subsystem disk' or 'disk subsystem'. As a rule of thumb, adjectives follow the noun in French and precede the noun in English, so there is cause for assuming that the correct translation should be 'disk subsystem'; however, given that the larger context reads …pour transférer des données entre son sous-système disque et la carte réseau…, the translator may be tempted to consider the translation 'subsystem disk' since anyone who has used a computer knows that data is transferred to and from various types of disks on a regular basis. Moreover, since in English both elements of the term can be used as either modifiers or head nouns (e.g. system memory/operating system; disk drive/floppy disk), the translator might feel more comfortable verifying the correct word order before committing herself.

A concordance search revealed that both combinations are possible: 'disk subsystem' appeared 12 times, while 'subsystem disk' appeared once. However, a closer inspection of the concordance for 'subsystem disk', as illustrated in table 7, shows that in this context, 'disk' is not a head noun but rather an additional modifier.

    In these peer-to-peer communications, a subsystem disk card can talk to a tape-drive interface and transfer data while the server remains online.

Table 7. A context retrieved using the search pattern 'subsystem disk'.

Meanwhile, the results of a concordance search on 'subsystem', sorted according to the word appearing immediately to the left of the search term, allow the translator to see that 'disk subsystem' (12 occurrences) follows a pattern which has a number of parallels: 'I/O subsystem' (57 occurrences), 'storage subsystem' (14 occurrences), 'external drive subsystem' (2 occurrences). After examining these concordances, the translator can feel more confident in choosing 'disk subsystem' as the equivalent of sous-système disque. This was indeed the choice made by student B, while student A, who did not have access to the corpus, combined the words in the incorrect order.
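Sorting concordance lines by the word immediately to the left of the search term, as in the 'subsystem' example above, can be sketched as follows; the display width and the sample text are assumptions made purely for the illustration.

    import re

    def sort_by_left_word(text, node, width=30):
        """Retrieve concordance lines for `node` and sort them by the word
        immediately preceding the search term."""
        lines = []
        for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, re.IGNORECASE):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            left_word = left.split()[-1].lower() if left.split() else ""
            lines.append((left_word, f"{left:>{width}}{m.group(0)}{right}"))
        return [line for _, line in sorted(lines)]

    sample = ("the disk subsystem moves data while the I/O subsystem handles interrupts; "
              "a storage subsystem can be upgraded separately")
    for line in sort_by_left_word(sample, "subsystem"):
        print(line)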

5.5 Investigating usage

One of the greatest strengths offered by a corpus of running text is that it allows users to investigate appropriate usage patterns. Such queries can be relatively simple, such as an investigation into whether it is appropriate to make an abbreviation plural. For instance, the source text refers to des modules de services intermédiaires nommés ISM, where the expanded term is in the plural, but the abbreviation (which has been retained in English) is in the singular. A translator may wonder whether it would be correct to make the abbreviation plural when writing in English. As shown in table 8, a search on the pattern 'ISM*' revealed that it is indeed acceptable to use this abbreviation in the plural form. Therefore, student B was able to confidently use the plural form in her translation, whereas student A was not.

    ement applications that reside as    ISMs in the transport layer, as shown
      as intermediate service modules    (ISMs) or through peer-to-peer service
    any associated IP security. Other    ISMs could manage HTTP lookups and pro
    b page caching. Thus, through I2O    ISMs and peer-to-peer services, most o
     f RAID's storage integrity. Other   ISMs could perform on-the-fly data com
    formance and reducing cost. Other    ISMs might perform data compression/de
    ement applications that reside as    ISMs in the transport layer, as shown

Table 8. Selection of concordances showing the plural form of the abbreviation ISM.

Another example taken from the source text is the phrase …il [Irtos] supporte également une API. Student A has translated this quite literally as 'it can also support an application programme interface (API)'. However, in the corpus, there are no instances of 'API' appearing within a 25-word context of 'support'. A concordance search revealed that IRTOS commonly 'has', 'comes with' or 'offers' an API, but it never seems to 'support' an API. In her translation, student B has investigated the appropriate usage and has chosen to render supporte by 'comes with'. A similar investigation revealed that although in French it is appropriate to créer des modules de services intermédiaires, in English ISMs are 'developed' (as used by student B) rather than 'created' (as used by student A).

A particularly interesting usage problem which can be investigated with the help of a TL corpus is semantic prosody. Semantic prosody refers to the connotation, be it positive, negative, or neutral, that is expressed by a lexical item in association with its collocates. Partington (1998:77) has identified semantic prosody as an important area of research in translation studies, observing that cognate words in two languages can have very different semantic prosodies. He observes, for example, that the verb 'incite' in English almost always relates to unfavourable phenomena (e.g. racial hatred, moral hazard, violence, etc.), whereas the Italian verb incitare often collocates favourably, with the meaning 'to encourage' (e.g. athletic performance, rational policies, etc.). Partington (1998:78) concludes that "the pitfalls for translators unaware of such prosodic differences are evident."

Looking at the sample source text, we encounter the phrase …une procédure qui utilise des ressources CPU importantes. Here we see that the French term ressources CPU has collocated with the verb utilise. In general, utilise is translated by a form of 'to use', which has a fairly neutral semantic prosody. However, the corpus evidence shows that in English texts, 'CPU resources' are rarely associated with a neutral or positive semantic prosody, but rather with a negative semantic prosody. Typically, as shown in table 9, 'CPU resources' are 'eaten up', 'consumed', 'drained', and 'expended'; they also suffer 'burnout' and consequently need to be 'freed' or the pressure on them 'alleviated' or 'relieved'.

    I/O processing can eat up as much as 25 percent to 35 percent of the main CPU's resources
    TCP/IP processing consumes 26.4 percent of the host CPU's resources
    facilitating encryption and decryption, can become quite a drain on CPU resources
    Today's servers, even quad-Pentium Pro giants, expend an extraordinary amount of CPU resources on mundane housekeeping tasks
    Today's PC servers face ever-daunting demands on their CPU resources
    I2O has been designed to help alleviate the problem of CPU burnout
    the I/O subsystem redirects interrupt-intensive I/O tasks away from the host CPU and thus frees host resources
    designed to improve I/O throughput and system performance by relieving host resources--such as the CPU, memory, and system bus--of interrupt-intensive I/O tasks

Table 9. Selection of contexts illustrating the negative semantic prosody of 'CPU resources'.

As observed by Louw (1993:173), for many years, linguists were not able to deal with semantic prosodies in a satisfactory way because semantic prosody patterns can be difficult to detect with the naked eye. Therefore, while a translator may once have been excused for translating utilise des ressources CPU importantes by something with a neutral semantic prosody, such as 'uses substantial CPU resources' (as proposed by student A), that same translator can now use a corpus to easily verify the semantic prosody of the term and come up with a more appropriate translation, such as 'constitutes an unnecessary drain on CPU resources' (as proposed by student B).

Finally, although it did not prove to be particularly problematic in the short source text used as an example in this paper, register is another area of usage which can be difficult for translators. Again, this is information which can be hard to glean from a dictionary, but which can be investigated more easily using a corpus. For example, a simple set of searches can give the translator an indication of whether or not it is appropriate to use contractions in a given text type, or whether or not it is appropriate to address the reader directly in the text, etc.

5.6 Retrieving possible equivalents for "unknown" terms

As mentioned in section 4, perhaps the most challenging task facing a translator who wants to use a specialized TL corpus as a translation resource is that of finding ways to look up an "unknown" term when the SL term cannot be used as an access point. The remaining sections focus on techniques which can help translators to overcome this hurdle.

5.6.1 Stem + wildcard lookup

Depending on the subject field and languages in question, one useful strategy may be for the translator to do a concordance search using the combination of a word stem and a wildcard [xvii]. For example, in the case of technical subject fields, such as computing, French and English words often share the same stem and differ only in the ending, for example, périphérique/peripheral; serveur/server; technologie/technology; etc. In the source text, the term l'encryptage appears. Student A has not been able to find a translation and has used the term 'encryptage' in her target text, believing it to be a technical term in English also. Meanwhile, student B has used the search pattern 'encrypt*' to retrieve the candidate terms 'encrypt' and 'encryption'. She then examined these terms in their TL contexts, as shown in table 10, to determine if a conceptual match could be established.

    Other ISMs might perform data compression/decompression on a block storage device, encrypt data packets going out over a network, or implement firewall support for a Web server.

    At minimum, the NIC's IOP would execute an ISM that implements the algorithms used to encrypt or decrypt secure data streams, sparing the server's host processors from this computational overhead.

    Although serving up Web pages isn't particularly I/O intensive, other functions, such as facilitating encryption and decryption, can become quite a drain on CPU resources. An architecture called I2O has been developed to help alleviate the problem of CPU burnout.

Table 10. Selection of contexts retrieved using the search pattern 'encrypt*'.

An added advantage of using the stem + wildcard search pattern is that it allows for the possibility of uncovering patterns involving transposition. As defined by Vinay and Darbelnet (1995:94), "transposition consists of replacing one class of words by another without changing the meaning of the message." Different languages often give predominance to different parts of speech; for example, the key role of a noun-based construction in French might be expressed more naturally using a verb-centred construction in English. When translating, the translator can sometimes experience SL interference and may be tempted to transfer the meaning to the TL using the same parts of speech that were employed in the SL, even when transposition might be a more appropriate strategy. By using the stem + wildcard search pattern (keeping in mind that wildcards can be used before the stem as well as after), the translator can retrieve numerous derivations and determine from the contexts which form is most appropriate.

For instance, in the source text, there is a string of noun-based constructions at the end of the following phrase: …pour créer des modules de services intermédiaires nommés ISM adaptés, par exemple, à la technologie Raid, l'encryptage, la compression, ou à tout composant de middleware. Student A has retained this type of construction in her translation: '…to create Intermediate Service Modules (ISM) adaptable to, e.g., Raid, encryptage, compression or any middleware component.' In contrast, student B has used the corpus to discover that transposition is a viable option here, and she has consequently elected to employ verbs instead of nouns, which results in a more natural-sounding text: '…to develop Intermediate Service Modules (ISMs) which are able, for example, to implement RAID technology, to encrypt or compress data, to carry out the functions of any piece of middleware, etc.'
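A stem + wildcard lookup of this kind can be approximated with a regular expression, with the wildcard allowed both before and after the stem. The Python sketch below is an illustration of the technique only; the token list is an invented toy example rather than material from the I2O corpus.

    import re

    def wildcard_lookup(tokens, pattern):
        """Find corpus word forms matching a stem+wildcard pattern,
        e.g. 'encrypt*' or '*crypt*'; '*' stands for any sequence of letters."""
        regex = re.compile("^" + re.escape(pattern).replace(r"\*", r"\w*") + "$",
                           re.IGNORECASE)
        return sorted({t for t in tokens if regex.match(t)})

    toks = ("Other ISMs might encrypt data packets or implement encryption "
            "and decryption for a Web server").split()
    print(wildcard_lookup(toks, "encrypt*"))   # retrieves 'encrypt' and 'encryption'
    print(wildcard_lookup(toks, "*crypt*"))    # also picks up 'decryption'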

5.6.2 Collocates or clusters of known terms or parts of terms

Another strategy that a translator can use when dealing with an unknown term is to investigate terms which she does know and which tend to appear in the proximity of the unknown term. For instance, the term système d'exploitation appears in the source text. A translator could make an educated guess that système would be translated by 'system', but the word exploitation might pose more of a difficulty. As outlined above, a first strategy might be to conduct a concordance search using the pattern 'exploit*'; however, in our corpus this search revealed no occurrences, and this "negative" evidence alerted the translator to the fact that more research was necessary. Another strategy might then be to see what words collocate most often with the "known" part of the term, which in this case is 'system'. The concordancer retrieved 860 occurrences of 'system', which in general language is a fairly generic noun; however, the collocation viewer revealed that, in our corpus, the most common collocate of 'system' is 'operating', which occurred 241 times in the position immediately to the left of 'system'. This means that approximately 28% of the time, 'system' collocates with 'operating' (in the position to the left). The next most common meaningful (i.e., non-function word) collocates in this position are 'embedded' (16 occurrences, or approximately 2%), 'file' (9 occurrences, approximately 1%) and 'computer' (4 occurrences, approximately 0.5%).

In WordSmith, the collocation viewer can be a little awkward to read since it essentially shows a series of numbers in a matrix. Another way of determining which words "go with" the word 'system' is to use the Cluster feature. Clusters are words which are found repeatedly in each other's company, and after performing a concordance search on a word such as 'system', the translator can click on Cluster and this feature will go through the concordance output, seeking the repeated word clusters. The settings here allow the user to choose how many words a cluster should contain (e.g. 2, 3, 4), and how many occurrences of each must be found for the results to be worth displaying (e.g. a minimum of 5). Table 11 shows the results of a two-word cluster search on 'system'.

    operating system    241
    the system           99
    based system         45
    and system           38
    a system             35
    overall system       19
    of system            18
    embedded system      16
    file system           9
    each system           7

Table 11. Two-word clusters containing the term 'system'.

Based on the output of the collocation viewer and the Cluster feature, it would seem that, in the context of I2O, 'operating system' is the most likely potential equivalent for système d'exploitation, and the concordances for 'operating system' can now be examined in more detail to confirm this hypothesis. A similar process can be followed to establish, for example, the English equivalent of serveur de fichiers. First, a concordance can be run on 'serv*', then collocates and/or clusters can be generated to reveal potential equivalents such as 'application server' (29 occurrences), 'web server' (26 occurrences) and 'file server' (19 occurrences). Concordances for each of these terms can then be scrutinized in order to establish a conceptual match.
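The two-word cluster search can likewise be sketched with a simple bigram count; the minimum frequency and the toy token list below are assumptions made purely for the illustration, not WordSmith's own procedure.

    from collections import Counter

    def two_word_clusters(tokens, target, min_freq=2):
        """Count two-word sequences (bigrams) that contain `target`,
        as in the 'system' cluster search shown in table 11."""
        bigrams = Counter(zip(tokens, tokens[1:]))
        clusters = Counter({" ".join(bg): n for bg, n in bigrams.items()
                            if target in bg and n >= min_freq})
        return clusters.most_common()

    toks = ("the operating system loads the file system before the operating "
            "system hands control to the embedded system").lower().split()
    print(two_word_clusters(toks, "system", min_freq=1))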

5.6.3 Syntactic structures

In addition to lexis, syntactic structure can also provide a means of accessing the desired equivalent in the TL corpus. There are many examples of structures in a given language which have a corresponding and predictable structure in another language. Take, for example, the French sentence: Plus je x, plus je y. Even if we do not know the lexical equivalents for x and y, we can hazard a guess that the corresponding English sentence would take the form: ‘The more I x, the more I y’. Translators can use their knowledge of typically corresponding syntactic patterns to interrogate the TL corpus. For instance, the source text contains the French terms un ISM point à point and la transaction point à point. Student A has translated point à point by ‘point-to-point’, which, although it is a valid translation in some contexts, is not the correct translation here. Initially, student B also did a search on ‘point’; however, upon examining the contexts, she soon discovered that ‘point’ is not the appropriate equivalent. Although the lexically based search pattern did not provide the correct equivalent, she went on to use syntactic knowledge to investigate further. The structure x à x in French is often rendered by the structure ‘x-to-x’ in English. As shown in table 12, a concordance search using the search pattern ‘*-to-*’ turned up the following concordances, among others:

March 1997. This version supports peer-to-peer technology, which allows
ffering is the PCI 9080 I2O-ready PCI-to-local-bus accelerator.
ns based on yesterday's slowpoke, hard-to-use standards and move up to
n operational and perform typical day-to-day functions in addition to
embedded-system designers improve time-to-market, system performance
es/second; 2. A second-generation PCI-to-PCI bridge supporting multiple
wed by the PCI specification; and peer-to-peer data transfers without
be equipped with local memory, a PCI-to-PCI bridge, and enough smarts
ack. The board includes multilink point-to-point protocol support to
unique mode of operation, termed peer-to-peer, where devices can commu
we couldn't conduct BYTE's usual apples-to-apples performance compariso
ironments. Also, Octopus works in one-to-one, one-to-many, many-to-one, and

Table 12. Selection of concordances retrieved using the search pattern ‘*-to-*’.

Because this search focused on a structural rather than a lexical pattern, it is very important that the retrieved concordances be lexically sorted as a next step. Some of the occurrences retrieved in table 12 can be ruled out immediately by the translator, either because they are general language terms (e.g. apples-to-apples, day-to-day, one-to-one), or because they are x-to-y rather than x-to-x constructions (e.g. hard-to-use, host-to-LAN, time-to-market). Nevertheless, it is difficult to identify significant patterns with this “noise” (i.e., irrelevant retrievals) getting in the way. Looking at table 12, a translator might initially think that patterns such as ‘point-to-point’, ‘peer-to-peer’ and ‘PCI-to-PCI’ are all worth investigating more closely; however, sorting the data soon reveals that there are only 3 occurrences of ‘point-to-point’, 13 occurrences of ‘PCI-to-PCI’ and 52 occurrences of ‘peer-to-peer’. Furthermore, the sorted concordances clearly revealed that in this corpus, ‘point-to-point’ only collocates with ‘protocol’ (in the context of TCP/IP), while ‘PCI-to-PCI’ always collocates with ‘bridge’. An additional sort on ‘peer-to-peer’ demonstrates that this is a much more generic modifier that can collocate with a wide variety of nouns, as shown in the extract in table 13. Moreover, a closer examination of the extended concordances shows that ‘peer-to-peer’ is being used in the English texts in a way comparable to point à point in the source text.

raight onto a LAN interface. Other peer-to-peer application examples inc
f more smart devices. Finally, the peer-to-peer communications can enabl
pects of I2O is a mechanism called peer-to-peer communications. Champion
ort layer, as shown in the figure "Peer-to-Peer Communications." For exa
owed by the PCI specification; and peer-to-peer data transfers without h
streaming, I2O message passing and peer-to-peer functions. A critical sy
O processor (IOP). For example, a peer-to-peer hardware device module (
firmware to recognize I2O devices. Peer-to-peer implementations won't ar
could be combined on the same NIC. Peer-to-peer operations could be used
m a hard drive to a backup tape. A peer-to-peer OS-specific module (OSM)
aching. Thus, through I2O ISMs and peer-to-peer services, most of the I/
service modules (ISMs) or through peer-to-peer services. For network op
t an intelligent LAN card with our peer-to-peer software and the latest
ns and supervise the transfer. The peer-to-peer specifications did not m
rm." Although the current focus is peer-to-peer technology, she says, th
March 1997. This version supports peer-to-peer technology, which allows
CPU, as well as with each other in peer-to-peer transactions. "I2O is de
rage, SCSI, system management, and peer-to-peer transfers. Extensions fo
ransfers. Finally, in concert with peer-to-peer transfers with a hard dr
a unique mode of operation, termed peer-to-peer, where devices can commu

Table 13. Selection of concordances retrieved for the search pattern ‘peer-to-peer’ which have been sorted alphabetically according to the word appearing immediately to the right of the search pattern.
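A structural search of this kind can also be approximated programmatically. The sketch below is again purely illustrative: the function name, the corpus/*.txt file layout and the simple tokenization are our own assumptions. It retrieves only genuine x-to-x constructions, which removes much of the x-to-y “noise” discussed above, and records the word immediately to the right of each hit so that the output can be sorted as in table 13.

import glob
import re
from collections import Counter

WORD = r"[A-Za-z]+"
# Matches 'x-to-x' constructions in which the same word appears on both sides
# (e.g. peer-to-peer, point-to-point); the back-reference \1 enforces the repetition.
X_TO_X = re.compile(r"\b(" + WORD + r")-to-\1\b", re.IGNORECASE)

def x_to_x_candidates(corpus_glob="corpus/*.txt"):
    forms = Counter()              # frequency of each x-to-x form
    right_collocates = Counter()   # (form, word immediately to the right)
    for filename in glob.glob(corpus_glob):
        with open(filename, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        for m in X_TO_X.finditer(text):
            form = m.group(0).lower()
            forms[form] += 1
            following = re.match(r"\s+(" + WORD + r")", text[m.end():])
            if following:
                right_collocates[(form, following.group(1).lower())] += 1
    return forms, right_collocates

forms, right = x_to_x_candidates()
print(forms.most_common())   # frequencies of forms such as peer-to-peer, PCI-to-PCI, ...
print(sorted(w for (f, w) in right if f == "peer-to-peer"))

Restricting the pattern to a repeated x filters out items such as host-to-LAN or time-to-market automatically, although the translator still needs to read the surviving concordance lines in order to establish a conceptual match.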

5.6.4 Abbreviations

Another strategy that may prove useful is to investigate abbreviations. A common characteristic of French texts in the computing field is that they use English acronyms or abbreviations. Sometimes, the expanded English term may have a French equivalent, but the acronym used in both languages is often the one which reflects the English words and word order. Looking at the source text, we can see a number of examples of terms which have been expanded or explained in French, but which are abbreviated using English acronyms:

…il supporte également une API (interface de programmation d’application) pour créer des modules de services intermédiaires nommés ISM…

If the terms in question were abbreviated according to French word order, we would expect to see the acronyms IPA and MSI respectively. The fact that these acronyms do not reflect the French word order provides the translator with a clue and a means of accessing the TL corpus to find the appropriate expanded term in English. Note that it is often necessary to actually look up the expanded term, as it cannot always be gleaned from the acronym alone. For instance, student A has guessed incorrectly that the abbreviation API expands into the English term ‘application program interface’. Meanwhile, as shown in table 14, student B has used a concordance search on the English abbreviation to reveal contexts which also provide the correct expanded form of the English term, which is ‘application programming interface’.

. An application programming interface (API) is necessary so that networ
mmon application programming interface (API) and I/O interface. Integrat
mmon application programming interface (API) and I/O interface. Other me

Table 14. Selection of concordances showing the expanded form of the abbreviation API.

There are a number of other English abbreviations and acronyms used in the French text without any expanded terms or explanations (e.g. IRTOS, RAID, CPU and IOP). Using the abbreviated forms, translators can access the expanded terms and provide this information to the target audience if they feel it is necessary or beneficial. For example, student B has used the information provided by concordances, such as those shown in tables 15 and 16, to include the expanded forms of IRTOS and IOP in her target text. Meanwhile, student A did not have access to this information and hence could not provide it in her translation.

…turned to the I2O real-time operating system (IRTOS) and corresponding development environment.
…which will run the I2O real-time operating system (IRTOS). This processor …

Table 15. Selection of concordances showing the expanded form of the abbreviation IRTOS.

I2O devices will have a dedicated chip, called an I/O processor (IOP), which will run the I2O real-time operating system (IRTOS).
By adding the I2O architecture and an I/O processor (IOP), designers can build systems that are much more powerful…
I2O is a device- and operating-system-independent standard architecture that frees up the CPU by offloading I/O (input/output) requests to a separate processor.

Table 16. Selection of concordances showing the expanded forms of the abbreviations I/O and IOP.
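The same idea can be approximated with a simple pattern over the corpus: look for the parenthesised acronym and inspect the words immediately preceding it. The sketch below is illustrative only; the function name and file layout are assumptions, and the candidate expansions it returns would still need to be checked in their full contexts, as in tables 14-16.

import glob
import re

def candidate_expansions(acronym, corpus_glob="corpus/*.txt"):
    # Capture up to a few words immediately before '(ACRONYM)'; these words
    # frequently spell out the expanded form of the term.
    max_words = len(acronym) + 2
    pattern = re.compile(
        r"((?:[A-Za-z0-9/-]+\s+){1,%d})\(%s\)" % (max_words, re.escape(acronym)))
    found = set()
    for filename in glob.glob(corpus_glob):
        with open(filename, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        for m in pattern.finditer(text):
            found.add(" ".join(m.group(1).split()))
    return sorted(found)

print(candidate_expansions("API"))    # candidates should end in 'application programming interface'
print(candidate_expansions("IRTOS"))  # candidates should end in 'I2O real-time operating system'

Limiting the capture to a handful of preceding words keeps the output short; it remains the translator, not the script, who decides which candidate is the genuine expansion.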

5.6.5 Product names

Just as abbreviations are often retained in their English form in the field of computing, so too are product names and other proper nouns. Therefore, if an unknown word appears in the vicinity of a product name (e.g. Windows), it might be useful to do a concordance search on the product name and then scan the contexts for potential candidates. For instance, student B was able to do a concordance search on ‘Windows’ and to discover from reading these contexts that Windows was actually used to ‘manage’ applications. A further concordance search on ‘manag*’ revealed that this was an action that was also carried out by drivers. As a result, she was able to translate the passage:

Il correspond à un environnement spécifique au pilote de périphériques, semblable à ce qu’est Windows pour les applications.

by the following:

‘Just as Windows is an environment that focuses on the management of applications programs, IRTOS is specifically concerned with managing device drivers.’

In contrast, student A produced a much more literal and less well-informed translation:

‘It corresponds to a specific environment for peripheral pilots, like Windows is for applications.’
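This kind of chained search can likewise be approximated in a few lines of code. The sketch below is illustrative only (the function name, the corpus/*.txt layout and the context width are our own assumptions): it first gathers the contexts of a product name and then keeps only those that also contain a second pattern of interest, here ‘manag*’.

import glob
import re

def contexts(word, corpus_glob="corpus/*.txt", width=60):
    # Return short contexts around each occurrence of a literal search word.
    hits = []
    regex = re.compile(re.escape(word), re.IGNORECASE)
    for filename in glob.glob(corpus_glob):
        with open(filename, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        for m in regex.finditer(text):
            hits.append(text[max(0, m.start() - width):m.end() + width].replace("\n", " "))
    return hits

# Step 1: read the contexts in which the product name occurs.
windows_contexts = contexts("Windows")
# Step 2: keep only those contexts that also contain a word beginning with
# 'manag' (manage, managing, management, ...).
print([c for c in windows_contexts if re.search(r"\bmanag\w*", c, re.IGNORECASE)])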

5.6.6 Numbers

Numbers can also provide a useful access point into a TL corpus. Although our source text did not contain any numbers, we will illustrate this strategy using another example taken from a French text on optical scanners. A translator may have difficulty finding a translation for the term ppp as it appears in the following extract:

…un scanner capable de numériser jusqu’à 1200 ppp…

Based on a 150,000-word corpus on optical scanning technology, a concordance search on ‘scanner’ retrieved too much data, while concordance searches on ‘capable’, ‘ppp’ and ‘1200’ turned up no occurrences of these patterns. However, a translator familiar with the general field of computing will realize that numbers are always changing as the speed, power and quality of technology increase. Therefore, the key to finding the English equivalent of ppp may actually be to use a number other than 1200. Rather than trying to guess the entire number, a useful strategy may be to use a wildcard and the last digit (e.g. *0, *1, *2, etc.).xviii The results can be a mixed blessing, however, because although this strategy increases the chance of finding a relevant number (and hence a useful context), it also provides the widest scope for retrieving non-relevant data. Nevertheless, scanning through the list of all concordances retrieved is still faster than trying to read through all the text in a linear fashion. In the case of the example above, we used the search term ‘*0’, which retrieved, among others, the hits shown in table 17:

…a 300 dpi scanner…
…scanner with an enhanced resolution of 4000 dpi…
…an 8-bit 400 dpi colour scanner…

Table 17. Selection of candidate terms retrieved using the search pattern ‘*0’.

The abbreviation ‘dpi’ appears to be a potential candidate for the equivalent of ppp, and a further concordance search on ‘dpi’ revealed the following context: ‘…resolution measured in dots per inch (dpi)…’. At this point, the translator may go on to reason (with the help of a dictionary if necessary) that the general translation for ‘dot’ is typically point, that for ‘per’ is par, and that for ‘inch’ is pouce, and hence that ‘dpi’ corresponds to ppp. Of course, the terms can now be verified in other resources or by a subject field expert, and when consulting these other resources, the translator can make use of a number of potential points of access that only came to light through the exploration of the corpus.

Concluding remarks

The aim of this paper was to demonstrate that specialized TL corpora can be valuable resources for translators. Although corpora are not, in essence, very different from conventional printed parallel texts, their machine-readable form means that they have more to offer translators in a number of respects. For example, when coupled with a corpus analysis tool such as WordSmith, corpora allow translators to consult a wider range of documentation and to determine easily how central the patterns found within that range are. The numerous contexts provide translators with information about both subject matter and acceptable linguistic usage, and the corpus can act as a testbed which translators can use to prove or disprove their hypotheses. Specifically, we have outlined here a series of techniques that can be used to extract information from such a corpus with the help of a generic corpus analysis tool such as WordSmith.
It is likely that many of the techniques described were not envisaged by the tool’s developer, but as Burnard (1999:24) observes, “Software systems are among the most plastic forms of art in contemporary culture: their design and the methods they facilitate perhaps have a wider significance than we users and designers are aware of.” Of course, translators need to be educated about both the benefits and the limits of corpus-based tools and resources, just as they need to be educated about the advantages and disadvantages of conventional resources. However, once translators have been taught to use these tools and strategies, they can be incorporated into the translators’ established working practices without any major disruptions.

Clearly, the bottom line is that no resource is likely to eliminate the need for careful data analysis and interpretation by the translator. Nevertheless, in our experience, translators who have used specialized TL corpora to help them with their translation work have found that such corpora are valuable resources which effectively complement other resources such as dictionaries. This can be seen by referring to the two translations in the Appendix. Student A, who used only conventional resources, encountered numerous difficulties, including problems understanding some of the concepts in the source text, trouble identifying the correct TL terminology, and difficulties establishing appropriate TL usage patterns. When she was interviewed following the translation exercise, student A identified the following as some of the drawbacks of the conventional resources at her disposal: 1) many lexical resources (even recently published ones) did not contain the terms she was looking for; 2) the bilingual lexical resources tended to be too general; 3) although the monolingual lexical resources were more specialized, she had no access points into these TL resources; 4) it was time-consuming to read the printed parallel texts; 5) little or no usage information was provided in the lexical resources; and 6) it was difficult to spot usage patterns in the printed parallel texts. In contrast, student B, who used a combination of both conventional lexicographic resources and a specialized TL corpus, was able to overcome many of the problems faced by student A, and her resulting translation was more accurate and more idiomatic. Student B did emphasize the importance of the training that she received prior to the translation exercise, stating that without such training, she felt it would have been difficult to complete the task as it represented a new way of working. She went on to add, however, that with some practice she felt she had become efficient at using the corpus, and she believed it to be an extremely valuable resource.

We hope that the ideas presented here will encourage other translators to develop further strategies and techniques which suit their own translation needs, such as strategies that can be used with other language pairs, subject fields or text types, or techniques that can be applied using annotated corpora. We see this type of activity as one of the first critical steps towards the development of a methodology which will allow translators both to utilize existing corpus-based resources and corpus analysis tools more effectively, and to identify and articulate their needs more clearly so that software developers can design and create the specialized tools which meet these needs. Corpora are such valuable resources that translators would be remiss not to take advantage of what they have to offer. In fact, it may be fair to say that, although corpora and corpus processing tools will not replace translators, translators who use these resources are likely to replace those who don’t!

Acknowledgements

We are grateful to the two students who participated in the translation exercise described here, as well as to the numerous other students at Dublin City University who have participated in other corpus-based translation exercises over the past few years. We also appreciate the constructive comments received from three anonymous reviewers.

References

Baker, M. 1995. “Corpora in Translation Studies: An Overview and Some Suggestions for Future Research”. Target 7(2): 223-243.
Barnbrook, G. 1996. Language and Computers. Edinburgh: Edinburgh University Press.
Bowker, L. 1996. “Towards a Corpus-Based Approach to Terminography”. Terminology 3(1): 27-52.
Bowker, L. 1998a. “Exploitation de corpus pour la recherche terminologique ponctuelle”. Terminologies nouvelles 18: 22-27.
Bowker, L. 1998b. “Using Specialized Monolingual Native-Language Corpora as a Translation Resource: A Pilot Study”. Meta 43(4): 631-651.
Burnard, L. 1999. “The good workman appraises his tools: What kind of software is needed for corpus analysis?” Abstracts from the 2nd International Conference on Practical Applications in Language Corpora (PALC’99), Lodz, Poland, 15-18 April 1999, 23-24.
Church, K., Gale, W., Hanks, P. and Hindle, D. 1991. “Using Statistics in Lexical Analysis”. In Lexical Acquisition: Exploiting Online Resources to Build a Lexicon, ed. by U. Zernik, 115-164. Englewood Cliffs, NJ: Lawrence Erlbaum Associates.
Laviosa, S. 1998a. “The English Comparable Corpus: A Resource and a Methodology”. In Unity in Diversity? Current Trends in Translation Studies, ed. by L. Bowker, M. Cronin, D. Kenny and J. Pearson, 101-112. Manchester: St. Jerome.
Laviosa, S. 1998b. “Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose”. Meta 43(4): 557-570.
Louw, B. 1993. “Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies”. In Text and Technology: In Honour of John Sinclair, ed. by M. Baker, G. Francis and E. Tognini-Bonelli, 157-176. Amsterdam: John Benjamins.
Malmkjaer, K. 1998. “Love Thy Neighbour: Will Parallel Corpora Endear Linguists to Translators?” Meta 43(4): 534-541.
McEnery, T. and Wilson, A. 1993. “Corpora and Translation: Uses and Future Prospects”. Technical Report from the Unit for Computer Research on the English Language (UCREL), University of Lancaster.
Meyer, I. and Mackintosh, K. 1996. “The Corpus from a Terminographer’s Viewpoint”. International Journal of Corpus Linguistics 1(2): 257-285.
Partington, A. 1998. Patterns of Meaning in Text. Amsterdam: John Benjamins.
Pearson, J. 1996. “Teaching terminology using electronic resources”. In Proceedings of Teaching and Language Corpora 1996, ed. by S. Botley, J. Glass, T. McEnery and A. Wilson, 203-216. University of Lancaster: UCREL.
Pearson, J. 1998. Terms in Context. Amsterdam: John Benjamins.
Peters, C. and Picchi, E. 1997. “Reference Corpora and Lexicons for Translators and Translation Studies”. In Text Typology and Translation, ed. by A. Trosborg, 247-274. Amsterdam: John Benjamins.
Peters, C. and Picchi, E. 1998. “Bilingual Reference Corpora for Translators and Translation Studies”. In Unity in Diversity? Current Trends in Translation Studies, ed. by L. Bowker, M. Cronin, D. Kenny and J. Pearson, 91-100. Manchester: St. Jerome.
Rogers, M. 1997. “Synonymy and Equivalence in Special-language Texts: A Case Study in German and English Texts on Genetic Engineering”. In Text Typology and Translation, ed. by A. Trosborg, 217-245. Amsterdam: John Benjamins.
Schäffner, C. 1998. “Parallel Texts in Translation”. In Unity in Diversity? Current Trends in Translation Studies, ed. by L. Bowker, M. Cronin, D. Kenny and J. Pearson, 83-90. Manchester: St. Jerome.
Scott, M. WordSmith Tools: http://www1.oup.co.uk/elt/catalogu/multimed/4589846/4589846.html
Somers, H.L. 1993. “Current Research in Machine Translation”. Machine Translation 7: 231-246.
Teubert, W. 1996. “Comparable or Parallel Corpora?” International Journal of Lexicography 9(3): 239-264.
Vinay, J.-P. and Darbelnet, J. 1995. Comparative Stylistics of French and English: A Methodology for Translation. Translated and edited by J.C. Sager and M.-J. Hamel. Amsterdam: John Benjamins.
Williams, I.A. 1996. “A Translator’s Reference Needs: Dictionaries or Parallel Texts?” Target 8(2): 275-299.

Zanettin, F. 1998. “Bilingual Comparable Corpora and the Training of Translators”. Meta 43(4): 616-630.

Appendix

SOURCE TEXT (extract from: Bel, Bruno. “I2O, nouveau coeur des entrées et sorties,” Informatiques magazine, no. 49 (juin 1998), 79.)

Un système d’exploitation dédié aux périphériques

L’Irtos est le premier système d’exploitation dédié aux entrées et sorties. Il correspond à un environnement spécifique au pilote de périphériques, semblable à ce qu’est Windows pour les applications. L’Irtos possède non seulement toutes les caractéristiques nécessaires pour écrire facilement des pilotes, mais il supporte également une API (interface de programmation d’application) pour créer des modules de services intermédiaires nommés ISM adaptés, par exemple, à la technologie Raid, l’encryptage, la compression, ou à tout composant de middleware. Imaginons, par exemple, un serveur de fichiers très souvent sollicité pour transférer des données entre son sous-système disque et la carte réseau. Sans la technologie I2O, les données doivent passer du disque à la mémoire système, pour être ensuite seulement dirigées vers la carte réseau, une procédure qui utilise des ressources CPU importantes. Avec un IOP sur la carte mère et un ISM point à point chargé dans l’IOP, les données peuvent être transférées directement entre le sous-système disque et la carte réseau. Le système d’exploitation reste responsable de l’initiation de la transaction point à point.

TRANSLATION A: (produced by a student who used only conventional resources)

An operating system designed for peripherals

Irtos is the first operating system dedicated to input/output. It corresponds to a specific environment for peripheral pilots, like Windows is for applications. Irtos not only has all the necessary characteristics to write Beta versions with ease, but it can also support an application programme interface (API) to create Intermediate Service Modules (ISM) adaptable to, e.g., Raid, encryptage, compression or any middleware component. Imagine for example a file server which is very often used to transfer data from its sub-system disk to the network card. Without I2O the data must first pass from the disk to the system memory. From there it must be directed to the network. This process uses substantial CPU resources. With an IOP on the parent card and a point-to-point ISM loaded into the IOP, data can be directly transferred between the subsystem disk and the network card. The operating system is responsible for initiating the point-to-point transference.

TRANSLATION B: (produced by a student who used both conventional lexicographic resources and a specialized TL corpus)

An Operating System for Peripherals

The I2O real-time operating system (IRTOS) is the first operating system designed specifically to manage data input and output (I/O). Just as Windows is an environment that focuses on the management of applications programs, IRTOS is specifically concerned with managing device drivers. Not only does IRTOS have all the features necessary for writing drivers, it also comes with an API (Application Programming Interface) which makes it possible to develop Intermediate Service Modules (ISMs) which are able, for example, to implement RAID technology, to encrypt or compress data, to carry out the functions of any piece of middleware, etc. Imagine, for example, a file server that is often required to transfer data between its disk subsystem and the network card.
Without I2O technology, the data would have to travel from the disk to the main memory, and would then be redirected to the network card. This extra step constitutes an unnecessary drain on CPU resources. With an I/O processor (IOP) on the motherboard, and a peer-to-peer ISM on the IOP, the data can be transferred directly from the disk subsystem to the network card. Nevertheless, the operating system is still responsible for initiating the peer-to-peer transaction.

Notes

i. An interesting debate on this subject took place over the course of several days on the Corpora List beginning on 29 July 1998 with a message from Oliver Mason and Ylva Berglund (http://www.hit.uib.no/corpora/1998-3/0029.html).

ii. The terminology used to refer to different types of corpora is still in a state of flux, and different authors use different terms to refer to the same types of corpora. We have provided explanations for how terms are being used in the context of this paper.

iii. Although the task of aligning texts may seem straightforward, it is actually non-trivial because languages do not always have simple one-to-one correspondence. It is not unusual, for example, to find that a single sentence in the source text (ST) may be rendered by several in the target text (TT), or vice versa. Moreover, the order in which information is presented in the ST may differ from the order in which it is presented in the TT. Of course, aligned corpora may become more readily available as more translators use translation memory tools to build up such resources.

iv. An exception to this may be in the context of translator training as described by Zanettin (1998). In cases where the translators in question are inexperienced, the analysis of a series of texts produced in comparable communicative situations can help them investigate the respective expectations, experience and knowledge of the linguistic communities involved.

v. Note that the term “parallel” as used by translators differs from the use familiar to corpus linguists. For a corpus linguist, the term “parallel” is typically used to refer to a corpus which consists of a source text aligned with its translation, whereas for a translator, “parallel” is used to refer to a TL text which has a similar communicative function to the source text, but which is not a translation of the source text.

vi. A third advantage is that corpora can be enriched with additional information. The possibility of annotating a corpus means that information that was formerly implicit can now be made explicit. Annotation thus makes it quicker and easier to retrieve and analyze specific data. However, further discussion of annotation is beyond the scope of this paper.

vii. Most corpus analysis tools developed specifically for translation have been developed for use with aligned corpora. Some have been developed for use with specific corpora (e.g. the Translation Corpus Explorer (TCE), developed by Jarle Ebeling of the University of Oslo specifically for use with the English-Norwegian Parallel Corpus, and TranSearch, developed for use with the French-English Canadian Hansard Corpus), while others can be used with any bilingual corpora provided they have been aligned in an appropriate way (e.g. MultiConcord).

viii. WordSmith Tools is available from Oxford University Press at: http://www1.oup.co.uk/elt/catalogu/multimed/4589846/4589846.html

ix. Stop lists, which are lists of words to be ignored, can also be used if desired. This could be done, for example, in order to remove common function words such as prepositions or conjunctions.