Machine Translation for Scientific Abstracts: A Case Study on Lexical Customization with Applied Optics

WEI, Yuxiang

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Philosophy in Translation

The Chinese University of Hong Kong July 2017

Thesis Assessment Committee

Professor WONG Wang Chi (Chair) Professor KWONG Oi Yee (Thesis Supervisor) Professor CHAN Wai Kwong Samuel (Committee Member)

Abstract of thesis entitled: Machine Translation for Scientific Abstracts: A Case Study on Lexical Customization with Applied Optics Submitted by

WEI Yuxiang

for the degree of

Master of Philosophy

at The Chinese University of Hong Kong in July 2017

Despite all the criticism of the “poor” quality of Machine Translation (MT), the use of MT for information circulation is not to be ignored. Meanwhile, recent MT systems have become increasingly customizable, allowing users to tailor the software to specific texts for better output. This thesis investigates the output of MT for specialized language and its lexical customization through a case study that uses SYSTRAN to translate abstracts of domain-specific academic articles. The translation of abstracts is not only suitable for MT, but also meaningful given the massive demand for information-oriented translation. Starting from a general discussion of the theoretical perspectives regarding MT usage, translation, and specialized language, the thesis points out a conceptual gap in MT evaluation and takes a functionalist perspective with information assimilation as the primary purpose. The case study then uses the 2014 volume of Applied Optics for an in-depth analysis of the MT system’s raw output and lexical customization. The 1,328 abstracts in the volume are first systematically sampled for a preliminary discussion of the Source Text (ST) features concerning lexical ambiguity, together with SYSTRAN’s disambiguation results. This is followed by further lexical error analysis, which provides the basis for glossary modification in SYSTRAN. The comparison between the initial translation and the customized translation shows significant improvement beyond the lexical issues, and indicates the effectiveness of the customization conducted. Further discussions beyond the sample show that the modified glossary entries are representative of the entire journal, as well as of other similar journals. This is shown via corpus-based investigations of Applied Optics and two other journals of the same kind, Optics Express and Optics Letters.


SYSTRAN’s automatic process of lexical customization is also discussed; the results show considerable inaccuracy but shed light on which words to add to the User Dictionary before translating an ST. These investigations not only argue for a proper perspective on MT use, but also offer implications for practical methods of lexical customization.


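Chinese abstract of thesis entitled: Machine Translation for Scientific Abstracts: A Case Study on Lexical Customization Based on Applied Optics Submitted by

WEI Yuxiang (魏玉祥)

Master of Philosophy thesis, The Chinese University of Hong Kong

July 2017

Although the quality of Machine Translation (MT) has not yet reached a level at which it can fully replace human translation, its role in information exchange should not be underestimated. Meanwhile, as the customizability of MT systems has increased in recent years, users can tailor a system to the specific features of a text and thereby improve translation quality. Through a case study, this thesis examines the output of MT for specialized language and its lexical customization; the case uses SYSTRAN to translate abstracts of domain-specific academic articles. The translation of abstracts is not only suitable for MT but also highly meaningful, given the large demand for information-oriented translation.

The thesis first discusses the relevant theoretical perspectives, covering MT, Translation Studies, and specialized language, and adopts a functionalist perspective for the case. It then uses the 2014 abstracts of Applied Optics to analyse SYSTRAN’s translation output and lexical customization in detail. From the 1,328 abstracts in total, a sample is first drawn by systematic sampling and used to discuss lexical ambiguity in the source text and SYSTRAN’s disambiguation results; the system’s lexical errors are then analysed further as the basis for modifying the SYSTRAN dictionary. By comparing the initial translation with the customized translation, the thesis finds that the improvement in quality is not limited to the lexical level: syntax and the overall conveyance of information also improve significantly.

Discussion beyond the sample shows that the lexical entries modified in this process are representative of Applied Optics as a whole and of other similar journals. This part of the discussion is corpus-based and covers Applied Optics and two journals of the same kind, Optics Express and Optics Letters.

Finally, the thesis discusses SYSTRAN’s automatic lexical customization function. The results show that its accuracy is rather low, but that it offers hints as to which entries should be added during lexical customization.

Through the discussion of these issues, the thesis argues that the use of MT should be viewed from a proper perspective, and it also offers insights into methods of lexical customization.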


Acknowledgements

I am greatly indebted to the many people who have supported me in the process of completing this thesis. My supervisor, Professor Kwong Oi Yee, has given me tremendous guidance not only in my research and thesis writing, but in many life issues as well. The completion of my thesis coincided with the emergence of some personal issues which I had never previously confronted, and these difficulties demanded significant effort to manage while I was working on this thesis. In this period, her support has been crucial, both emotionally and academically. I can never repay my debt to Professor Kwong for her patience, understanding, and most importantly, profound academic advice, without which this thesis could hardly have been completed. Professor Chan Sin-wai, who supervised me until his retirement, has been no less supportive, giving me numerous suggestions at the stage when I was reading extensively, organizing ideas, and collecting initial data for my thesis. Without his comments and encouragement, the completion of this thesis would not have been possible either. Words alone are far from enough to express my gratitude to him. My gratitude also goes to the Department of Translation, for providing the necessary resources for this work. I am thankful to the many friends who have never refused to help when needed. Dr. Duncan James Poupard, with whom I have had many inspiring and encouraging conversations, is both a former teacher and a very good friend, and has been very helpful whenever I found it difficult to maintain focus. The help I have received from Dr. Chou Isabelle Ching has also been essential to the completion of this work. In particular, I have to thank Mr. Xie Qijie for his insightful comments on my understanding of the optical principles that are important for the textual analyses in this case study, an invaluable help for the completion of the thesis as a whole.

Last but not least, my parents, whose support has been the source of my motivation, always deserve my special gratitude.


Table of Contents

Abstract (English)
Abstract (Chinese)
Acknowledgements
Chapter 1 Introduction
  1.1 Motivation
  1.2 Issues, Scope, and Outline
Chapter 2 Theoretical Perspectives
  2.1 The Use of MT
    2.1.1 Useful for “Rough Draft Translations”
    2.1.2 The Usability Gap: Discussion
  2.2 MT for Scientific Abstracts: A functionalist translation perspective
    2.2.1 Equivalence: Problems and paradigm shifts
    2.2.2 Functionalism and MT
    2.2.3 MT for Scientific Abstracts: Functionalist issues
    2.2.4 MT for Scientific Abstracts: Information translation
  2.3 Sublanguage, Ambiguity and MT Customization
    2.3.1 Sublanguage and Controlled Language
    2.3.2 Ambiguity
    2.3.3 Customization
  2.4 Summary
Chapter 3 Case Study
  3.1 SYSTRAN
    3.1.1 Transfer Architecture
    3.1.2 Customization in SYSTRAN
    3.1.3 Investigation of the Output
  3.2 Initial Translation: Lexical ambiguity
    3.2.1 Not Found Words (NFW)
    3.2.2 Categorial Ambiguity
    3.2.3 Polysemy
  3.3 Lexical Error Analysis and Glossary Modification
    3.3.1 Error Analysis
    3.3.2 Glossary Modification in SYSTRAN
  3.4 Customized Translation: Lexical improvement and beyond
    3.4.1 Verbs of Main and Subordinate Clauses
    3.4.2 “-ing” Words
    3.4.3 Attachment
    3.4.4 Conjunction
    3.4.5 Noun Strings
    3.4.6 Terminological Items
    3.4.7 Not Found Words
    3.4.8 Remaining Errors
  3.5 Summary
Chapter 4 Further Discussion: Beyond the sample
  4.1 Concordance Search
    4.1.1 Method
    4.1.2 Results and Discussion
  4.2 Automatic Customization
    4.2.1 Method
    4.2.2 Results and Discussion
  4.3 Summary
Chapter 5 Conclusions
  5.1 Theoretical Perspectives
  5.2 Case Study
  5.3 Limitations and Future Work
Appendix A: Categorial Ambiguity
Appendix B: Polysemy
Appendix C: Lists of Items in Discussion
Appendix D: Translation Output
Appendix E: Search Results of WUD_SC from Applied Optics
Appendix F: Search Results in WUD_SC from All Corpora
Appendix G: Search Results in WUD from Applied Optics
References


Chapter 1 Introduction

1.1 Motivation

Academic Abstracts: Translation demand

Academic researchers frequently read abstracts, whether in an on-line search for papers relevant to a specific research question or in a quick browse of core journals to keep up with the most recent developments in the community. Reading the abstracts is a preliminary step for deciding whether to proceed and retrieve the articles for more extensive reading. While abstracts have always been important for disseminating research, their importance is more obvious now than in the past: the expansion of digital, on-line, and increasingly accessible databases, together with the growth of social network platforms for researchers (Crouzier, 2015), has enabled research outputs to circulate worldwide more easily, more quickly, and on a larger scale. There have been rapid and significant increases over the past few decades in both the number of papers published and the percentage of papers cited (Larivière, Gingras, & Archambault, 2009; Yang et al., 2010); abstracts play a vital role in retrieving important and relevant information from these massive, and frequently updated, sources of material.

However, the benefit from this expansion of information circulation is often confined to the community that shares a common language for academic writing. At the international level, it is English that is generally, and still increasingly, used as the lingua franca; researchers with lower English proficiency, especially in science, seem quite disadvantaged if not excluded, although many of them nevertheless produce very good research in their native, non-English-speaking communities. The various challenges for non-Anglophone academics due to the dominance of English in scientific communication have long been acknowledged and investigated in different countries and regions world-wide (e.g. Benfield & Feak, 2006; Curry & Lillis, 2004; Flowerdew, 1999; Giannoni, 2008; Ives & Obenchain, 2007; Pérez-Llantada, Plo, & Ferguson, 2011).

To overcome the language barrier, translation is largely necessary and beneficial. Access to a wider range of reference materials beyond the scientists’ native language would facilitate their research to a significant extent. This is important not only for those who find it difficult to read English academic writing adequately, but even for those scientists with only minor language problems, since the excessive time and effort needed for lower-proficiency English readers to search for, filter, evaluate, and read the relevant literature would be neither worthwhile nor reasonable if usable, reliable and instantly available translations existed in their native language. Given the increasingly massive and rapid circulation of information in academia, such a demand for translation is bound to be remarkably large, and to continue growing at a significant speed.

Translating Abstracts: Possibilities for Machine Translation

However, in this regard human translation does not suffice, due to the sheer volume of texts to translate, the high cost of a human translator, the lack of speed, and various problems which arise when the translator is not familiar with the subject matter or the terminology involved. It does not seem practical to always resort to humans for such translation.

Machine Translation (MT), on the other hand, seems very practical. This type of Source Text (ST) is considerably restricted, because 1) all the texts are informative, formal and academic; 2) most of the papers that a researcher regularly reads are within a specific area of a certain discipline, with limited language flexibility and ambiguity; and 3) abstracts follow relatively strict patterns of writing. In a way, the ST belongs to a sublanguage, using only a subset of the vocabulary and syntactic constructs of the natural language.

In the context of this translational act, the focus of the reader of an abstract is either 1) to see if the article is relevant and needs to be downloaded for more extensive reading, or 2) to obtain a rough and general idea of what the paper investigates, its methodology, and its conclusion. Though the latter might perhaps require a higher linguistic quality than the former, for both it is the extraction of key information that takes priority, rather than the linguistic presentation. For this type of translation, content matters more than form.

In other words, this scenario is largely consistent with what Somers (1997) described two decades ago as the “two strictly defined conditions” under which fully automatic MT is possible: the ST is restricted, and a relatively lower linguistic quality is accepted. Meanwhile, there has been tremendous progress in MT over the years (e.g. Doherty & O’Brien, 2012; Graham, et al., 2014), to the extent that many MT developers today are beginning to depict themselves as capable of achieving “cost-effective publishable quality translations”¹. Therefore, it seems safe to assume that the translation of scientific abstracts is largely suitable for MT. The above need for translation is perhaps becoming a need for automatic translation which can be dependable, comprehensible, and more importantly, instantly available.

MT: Increasing popularity

Meanwhile, MT has become considerably popular in the recent decade or so, primarily thanks to the development of Natural Language Processing (NLP) techniques in general. Successful examples are not lacking, and a growing body of evidence demonstrates that the use of MT is still on the rise. Among the 1,000 Language Service Providers surveyed in DePalma et al. (2013), 44% were offering MT and post-editing as a service; the review of 1,119 industry participants conducted by Kelly et al. (2012) also reveals the growing demand for expertise in post-editing of MT output in recent years; with a focus on internal processes, Doherty et al. (2013) report that 34% of the 467 translation and localization buyers and vendors surveyed are currently using MT, that 35% of them plan to adopt it in the future, and that 77% acknowledge the increasing demand for MT services which are explicitly requested by their clients. Regarding the general public, the popularity of MT is even more obvious: in Gaspari’s (2007, p. 103) survey, for instance, “the vast majority of the respondents (96.1%) took advantage of free web-based MT for assimilation purposes, which is often accompanied by dissemination tasks”.

¹ SYSTRAN’s description of their product. Retrieved 13 January 2017 from http://www.systransoft.com/systran/


This seems to support the assumption that using MT to meet the above-mentioned needs could in practice be feasible. Therefore, it is worth asking: how usable in practice, especially for translating academic abstracts, is the raw output (i.e. the output from MT without human post-editing) of the systems widely used today?

MT Output Usability: The gap

Not surprisingly, much recent research on MT output has focused on the usability of its raw translation and on “the effort involved” in post-editing (e.g. Anazawa, Ishikawa, & Takahiro, 2013; Doherty & O’Brien, 2012; Doherty & O’Brien, 2014; Gaspari, 2004; Moorkens et al., 2015; Tatsumi, 2010; Turner et al., 2014). But generally speaking, among the large number of investigations in the literature, the usability of the raw MT output is still rarely considered from a translation perspective (Doherty & O’Brien, 2014). Often, such concepts as “functionality” and “usability” in the realm of MT (cf. Hovy, King, & Popescu-Belis, 2002; King, 1997; Quah, 2006) refer to the features of the MT system as a piece of software rather than to what Translation Studies would mean by similar terms. The attention tends to be on the system rather than on the translated output. Toral and Way (2015), for example, even describe MT as lagging far behind the “state-of-the-art theories in Translation Studies”. Translation Studies, on the other hand, rarely goes beyond issues of Computer-Aided Translation, if at all, when it comes to technology. This gap seems apparent and significant, and has now been increasingly recognized by scholars in both communities. As Hauenschild and Heizmann (1997) rightly state, although in principle Translation Studies and Machine Translation “ought to have a natural common interest”, the two fields of research have in fact long “ignored” each other. Therefore, it is worthwhile, if not necessary, to explore the common ground that exists between these two areas of research when investigating the output from MT. At the theoretical level, some perspectives from Translation Studies are perhaps quite helpful for investigations into the usability of machine-translated content.
If these two communities, Machine Translation and Translation Studies, could be regarded as positioned at two extremes, perhaps somewhere along the continuum lies another group relevant to this issue: those in other professions (e.g. medical studies) who are in much need of (domain-specific) translation and interested in investigating MT. Many of their investigations, however, are significantly negative about MT usability, in sharp contrast to the above assumption. While few would deny the usefulness of MT for general-purpose rough translations, the situation changes when it comes to specialized translation. For example, one of the contexts in which usability of the translated content is “most likely to be carried out” is in health and clinics (O’Brien, 2012), but existing investigations in this community frequently reveal considerable pessimism about MT for translating their documents. Anazawa et al. (2013) examine Google Translate with nursing research abstracts, only to find the English-Japanese translation quality “minimally acceptable”. The questionnaire by Anazawa et al. (2012) also reveals a somewhat low “perceived usability” among Japanese nurses: about 60% of the respondents perceive the output as “not very useful”. Zeng-Treitler et al. (2009) use Babel Fish to translate medical records from English into Spanish, Chinese, Russian and Korean, but in each language “the majority of the translations were incomprehensible (76% to 92%) and/or incorrect (77% to 89%)”. Although the findings of these investigations are generally not as hostile as, for example, Yates’ (2006) explicitly disapproving evaluation for legal texts, there still tends to be a considerable lack of trust in the adequacy of MT for their domain-specific translation practice in general. These results seem very disappointing for the above assumption about translating research abstracts with MT.
More importantly, these empirical investigations with real texts of the specialized domains, which are conducted by domain experts, seem to provide much support for the general suspicion of MT which can be found throughout the results of numerous surveys, such as the one by O’Brien and Moorkens (2014), where 56% of the respondents consider MT “problematic”, “still in baby shoes”, or “just horrible”. This general suspicion regarding MT highlights an interesting question: given the pervasive popularity of MT both in professional translation settings and among the general public, why is it that the results of so many investigations are still largely negative? Or, if the quality of MT output is truly that inadequate, how have those systems become so popular as to be almost the “de facto standard” (Turner et al., 2014) for many commercial translation vendors? In other words, why is there such a sharp contrast between the numerous successful examples of MT and the largely pessimistic results from the many investigations of its output?

Curiously enough, such investigations as the above tend to reveal that “MT tools are often better at translating generic text, such as news articles, than translating domain-specific text containing specialized vocabulary” (Turner et al., 2014), while within the MT community the belief is quite the contrary: the machine is in fact more likely to produce better results for domain-specific texts than for generic ones, mainly because of a lesser extent of lexical and structural ambiguity. This highlights one of the many widely held misconceptions of MT and its usage among users of online systems for domain-specific translation. It also highlights the need for customizing an MT system to the specific needs of the user, because usability of MT output is largely dependent on how it is utilized: translating suitable STs for appropriate purposes and properly customizing the system in accordance with the ST and the purpose in question. In the above example of medical translation, perhaps the ST and the purpose are both very suitable, but customization is lacking. From the results of these investigations it can be easily seen that general-purpose MT without customization would result in very poor quality.
What is encouraging, though, is that the domain-specific investigations by these professionals in their own domain, who are also both MT users and Target Text (TT) recipients, are not in much disagreement with the idea that the translation quality would be significantly improved if these general-purpose MT tools were adapted to their specific domains; typically, they emphasize the importance of terminological items (Anazawa et al., 2013; Kiuchi & Kaihara, 1991; Zeng-Treitler et al., 2009) and suggest that “enrichment in technical terms appears to be the key to better usability” (Anazawa et al., 2012). It is worth asking: if the terms are enriched, would the MT systems produce better output for such texts? How much difference would it make? More importantly, would the investigations of customized MT lead to opposite results compared with the largely negative ones mentioned above? Does the “enrichment” of terms introduce noticeable improvement, or otherwise cause more problems, in other aspects, particularly regarding syntax, or does it make little difference other than simply turning the system into a more accurate mechanical dictionary? If these investigations could go one step further and conduct some MT customization, perhaps they would be more interesting and meaningful.

On the other hand, many of today’s MT systems allow a considerable extent of customization from the user, particularly regarding terms and grammar, which makes it convenient to answer these questions. Research in MT customization, however, is a much neglected area (Chan, 2015). The MT usability investigations mentioned above are also representative of the neglect of customization in the practical sense when it comes to MT. This thesis intends to address this issue, and hopefully to bridge the gap to some extent, with a case study. Since most MT software on the market is designed for general use and not for specific domains, it seems worth investigating customization with a narrow focus on the circumstance in which MT is used and the type of text to be translated.

1.2 Issues, Scope, and Outline

This thesis is primarily focused on investigations of the output from customized MT for academic writing. As mentioned above, it intends to address the issue both from the theoretical aspect and with a case study. The relevant theoretical issues are discussed in Chapter 2, followed by the case study in Chapter 3 and Chapter 4.

It is important to note that the theoretical discussions in Chapter 2 provide much guidance for the subsequent case study in many ways. Those discussions at the theoretical level intend to address many of the questions mentioned above, aiming for a relatively realistic, fair and objective investigation of the output of MT regarding specialized language. They reveal some misconceptions, inappropriateness, or lack of certain considerations, which exist in MT output investigations in the different communities concerned (MT, Translation Studies, and the specialized domain for which the translation is used), and explore common ground which can be helpful for the issue at large. The thesis argues for 1) a proper perspective from which such investigations should approach translation in the first place, and for 2) the importance of proper MT usage in these investigations; these are two aspects which are often neglected, if not absent, in each of these communities.

Regarding the former, it is worth mentioning that the results of MT investigations are largely dependent on how one views the intrinsic issues of translation: what it essentially is, what constitutes a good translation, etc. Therefore, the thesis seeks theoretical aspects of Translation Studies that are particularly compatible with MT, with a focus on equivalence and functionalism. Perhaps in this aspect a functionalist definition of translation would be the most suitable to the kind of scenarios where MT investigations are conducted, as will be elaborated in Chapter 2.

As for the latter, this thesis illustrates that a proper use of MT depends on its ST, the purpose of the translational act, and more importantly the customization of the system. The investigations of MT output should put the MT system to its appropriate use: translating suitable STs for appropriate purposes and properly customizing the MT in accordance with the ST and the purpose in question. This will also be elaborated in Chapter 2.

These two aspects are largely related to the gap that exists in MT output investigations among different research communities: MT, Translation Studies, and the specialized domain for which the translation is used. The perspective which this thesis takes is dependent on what is lacking in each of the three communities and what each can offer the others. Therefore, Chapter 2 will include a more detailed discussion on this gap before the elaboration of those two aspects. In addition, many other issues related to the case study are also important, including the definition of “information”, ambiguity, features of academic writing, etc. These issues will be discussed in the latter part of Chapter 2.

Perhaps most important is the issue of MT customization. As mentioned above, many MT systems today allow the user to extensively customize the system to the specific kind of texts to translate. A very good example of this is SYSTRAN, in which lexical customization is both important and convenient.
Chapter 3 illustrates a case study using SYSTRAN as the MT system and a sample of Applied Optics as the ST², adopting the above-mentioned perspectives and investigating the system’s raw output, its lexical and syntactic errors, the features of the ST as a restricted language, the modification of glossary entries, and more importantly the output after the system is customized with a user-defined dictionary. Here, if the customized system results in sharply improved translation compared with its raw output, it would in turn justify the emphasis on proper MT use as mentioned above, since judging MT quality on the basis of an uncustomized system might not be objective for its real use in practice.

² Description of SYSTRAN and Applied Optics is found in Chapter 3.

Meanwhile, the case study also goes beyond the investigated sample and discusses whether the customization can be dependable in a wider scope, as illustrated in Chapter 4. Through concordancing, it aims to test whether the way in which glossary entries are modified in the case study is representative of the usage of these items in the entire volume of journal abstracts. Only if the lexical features of the sample are consistent with the entire corpus of abstracts can such customization be justified.

A practical concern is also important: if the modification conducted in this case study is effective in improving MT output, how do we know what to modify before translating a specific text? In Chapter 3, the case study discovers the items by a trial run in SYSTRAN before an error analysis, but in the practical sense this does not seem a feasible strategy. Therefore, Chapter 4 also intends to investigate whether SYSTRAN’s automatic customization functions can manage to discover the items that are modified in the case study, together with how accurate they are in attaching the relevant lexical information to the items.

In Chapter 5, all the above issues are summarized and concluded, while the limitations of this study, as well as the issues which are worthy of further investigation, are also discussed.

Chapter 2 Theoretical Perspectives

2.1 The Use of MT

2.1.1 Useful for “Rough Draft Translations”

Machine Translation (MT) is nothing new; it dates all the way back to the early years before electronic computers even existed (Hutchins, 1986; Hutchins & Somers, 1992). But despite such a long history, its translation quality has still been widely considered “generally poor”, at least until the last decade or so (Hutchins, 2001, 2003a, 2003b). There is no lack of awareness, either in the research community or among the general public, of how MT today still often fails to meet the expectations of linguistic quality in translating unrestricted texts (i.e. texts containing considerable ambiguity); Wilks (2009), for instance, explicitly acknowledges that there is an obvious “absence of any intellectual breakthroughs to produce indisputably high-quality fully-automatic MT”, a fact which has led some to say it is not even possible at all. The responses to this are twofold: either to resort to empirical, statistical and data-driven methods in the computational design (see Koehn, 2010), or to move to machine-assisted human translation (MAHT). The former response (i.e. statistical machine translation) seems to have become the mainstream technical solution (Goutte, 2009), resulting in significant improvement in the underlying MT technology during the most recent decade (Doherty & O’Brien, 2014), to the extent that MT has become pervasively popular today (e.g. Gaspari, 2007). In comparison, the latter response (i.e. MAHT), typically advocated by Martin Kay (1997), is somewhat pessimistic about MT, although it does provide an insight which led to a number of Computer-Aided Translation systems that have become important tools for translators in the market worldwide. Despite this insight, however, the pessimistic statement that MT “stands no chance of filling actual needs for translation” (Kay, 1997) has not been totally agreed upon, as, among others, Wilks (2009, p. 6) points to the failure of many pessimists to “anticipate the large market (e.g. within the European Commission) for output of the indifferent quality (i.e. about 60% of sentences correctly translated from SYSTRAN, for example) output that full MT systems continue to produce and which is used for rough draft translations.” Indeed, despite the fact that many are still highly critical of the linguistic quality of MT output, surveys and investigations on the practical use of MT systems have often revealed considerable usefulness of these “rough draft translations” for various purposes, as mentioned in Chapter 1. Such translations produced by MT, generally speaking, can be used for (1) post-editing by a professional translator, aiming for publishable quality (known as MT for dissemination), or (2) gisting by the Target Text (TT) recipient, who wishes to obtain a rough understanding of the information contained in the ST (known as MT for assimilation). In the past decade, the improvement of MT technology, together with Natural Language Processing (NLP) techniques in general, has enabled the raw output of automatic translation systems to be not only usable, but largely useful and necessary, for both kinds of MT use (cf. O’Brien & Moorkens, 2014). For the former, abundant successful examples can be found of combining MT with human post-editing, both in the research community and in industry (e.g. Garcia, 2011; Groves & Wicklow, 2008; Guerberof, 2014; Kirchhoff et al., 2011; O’Brien, 2006; Turner et al., 2014), in spite of the apparent presence of “significant translator resistance to the task” (O’Brien & Simard, 2014).
Combining this with the statistics from the many surveys mentioned above (see 1.1), it is not hard to see that the use of MT combined with human post-editing has become a “de facto standard” for many commercial translation vendors (Turner et al., 2014), increasingly incorporated as a useful, if not indispensable, component in the translation workflow, and providing “rough draft translations” for the segments whose fuzzy match to the Translation Memory is below a threshold (e.g. 75% similarity). For the latter (i.e. gisting), the usefulness of the “rough draft translations” is even more widely acknowledged: systems for this purpose “have been in use since the earliest days of MT” (Hutchins, 2005). Decades ago, the survey conducted by Henisz-Dostert, Macdonald and Zarechnak (1979, pp. 147-244) found that 90% of the respondents (i.e. scientists and engineers who had used the Georgetown Machine Translation System during 1963-1973) judged the MT raw output to be “good” and “acceptable” for their purposes, and that 87% of them even preferred to use MT rather than human translation. Today, online MT tools have become pervasive among the general public, providing “rough draft translations” at the click of a button, far faster than a human translator, with reasonable quality for gisting purposes. The survey conducted by Gaspari (2007, p. 103), which is mentioned above (see 1.1), is a good example. The significant growth of MT use in the market prompts a need to rethink its value: rather than being excessively critical of the “generally poor” quality in the linguistic sense, it is perhaps more meaningful to investigate a “usable” quality, rather than “high quality”, of MT. This is not only about redefining translation quality, but also about putting the systems to appropriate use: translating suitable STs for appropriate purposes and properly customizing the MT according to the ST and the purpose in question, as will be illustrated below. More importantly, as mentioned in Chapter 1, the popularity of MT has resulted in many investigations among different communities into the usability of its output in the practical sense.

2.1.2 The Usability Gap: Discussion

As mentioned in Chapter 1, there are generally three communities which have been interested in this issue: researchers or experts in Computer Science (i.e. the MT community), Translation Studies (i.e. the translation community), and the specialized areas which frequently demand the translated output (i.e. domain experts).

MT Community

In the MT community, usability evaluation (Chan, 2004; Hovy et al., 2002; White, 2003) mostly adopts the definition in the ISO/IEC guidelines (ISO/IEC, 2001), or at least a similar one, where the evaluation “tests the usefulness of a system for end-users, involving its utility and users’ satisfaction with it” (Kit & Wong, 2015, p. 217). The focus tends to be on the system rather than on the translated output, as also mentioned previously.

The same is true for such concepts as “functionality” (Hovy et al., 2002; ISO/IEC, 2001; King, 1997; Quah, 2006), which in the MT evaluation arena generally refers to features of the system as a piece of software, rather than the functionality of the translated texts –– somewhat different from the way many theories in Translation Studies nowadays approach the issue. When it comes to the machine-translated texts (i.e. MT output evaluation), attention is generally (and perhaps excessively) paid to notions of “equivalence”, and quality assessment tends to be static, mostly with a central focus on the dual criteria of “fidelity” and “fluency”, or at most to what Nida (1964; 1969) terms “dynamic/functional equivalence”. This seems a bit inadequate from a translation perspective. While it is true that a translation is supposed to be — one way or another — somewhat faithful to the ST (i.e. fidelity) and at the same time intelligible (i.e. fluency), when we try to give content to those notions, various troubles arise, because “only in the very rarest circumstances can fidelity be defined” (King, 1997). The concept of “fidelity” refers to, of course, the notion of equivalence. The paradigm shift in Translation Studies from equivalence to functionalism (or typically, Skopos Theory), which questioned “the validity of the register-inspired equivalence paradigm” (Hatim, 2009) and “set in a practical reaction against the academic detail of extensive linguistic analysis” (Munday, 2009), seems quite neglected. This makes MT evaluation very controversial, and is one of the main reasons, as King (1997) argues, for which evaluating MT has long been problematic, unsatisfactory, and, to use the words of White (2003, p. 211), “at times misleading”. In this sense, perhaps Toral and Way (2015) –– among others –– are right in stating that MT “lags behind” today’s translation theories. 
While Translation Studies has shifted paradigms many times over the past decades and moved far away from, or at least “marginalized” (Munday, 2001), the equivalence paradigm, not to mention “formal equivalence” (cf. Nida, 1964), and while it continues to move further, “the vast majority of research in MT disregards functional and pragmatic aspects and aims to model –– somehow –– formal equivalence” (Toral & Way, 2015). There are, nevertheless, perspectives of MT output evaluation which are indeed functional, extrinsic and context-based, e.g. the FEMTI framework (Hovy et al., 2002), acknowledging that a system can be used for different purposes; however, it is not only the system, but also the translated text, that can have dynamic functions depending on varying communicative situations, as is held in the functionalist theories of Translation Studies. Therefore, the “purpose” here should perhaps refer not only to the purpose of the system, but also to the purpose of the translated text itself, as understood in the functionalist theories of Translation Studies. In addition, although there has always been awareness of differentiating “end-users” from the other stakeholders (e.g. managers, developers, vendors, investors) in MT evaluation (e.g. White, 2003), users of the MT system might not necessarily be the same as users of the translated output, as will be illustrated below. Therefore, some aspects of Translation Studies, seen from a wider perspective, seem quite helpful in investigations into the usability of machine-translated output.

Translation Community

The translation community[3], on the other hand, is sometimes a bit harsh on the linguistic quality of MT and tends to neglect the fact that the system needs to be put to proper use. The evaluation process is also very often oversimplified, and MT is treated entirely as a black box; typically, random texts are input into the system and errors are picked out and criticized, quickly leading to the conclusion that the system is largely unsatisfactory, sometimes “stupid”, and “frequently humorous”. Such words as “problematic”, “still in baby shoes”, and “just horrible” are not uncommon in descriptions of MT. Many are highly critical of even the idea of using MT in the first place. The “general suspicion” of MT that can be found in the survey results of O’Brien and Moorkens (2014), as mentioned above, is representative of the many widely held preconceptions regarding MT. Toral and Way (2015) point out another important bias that frequently exists in this regard, stating:

Note also that literary translation is often selected by human translators not overly well-disposed to MT to demonstrate how useless it is for anything; on any randomly selected online translators’ forum, you don’t have to look too hard to find someone who has selected a section from a book and shown how MT messes up the translation. (p. 241)

Part of this results from the misconceptions about MT that often exist in this community, as will be illustrated in more detail later in this section; but perhaps more important is the apparent “translator resistance” to post-editing (e.g. O’Brien & Moorkens, 2014). Although MT combined with post-editing has become a de facto standard in the professional translation industry, this working style has met with considerable hostility from translators, for many reasons (e.g. fear of or reluctance to change, lack of creativity in the task, sharp decrease in translator income, etc.), and the resistance is largely related to misconceptions about the post-editing task itself. In addition, many translation agencies depict themselves as providing high-quality translation by playing down MT, apparently for commercial considerations. A typical example is the following description (quoted from Sun, 2005):

We never use machine translation. Language is fluid, fluent, varied and diverse. No machine can come close to understanding the nuances of a language, let alone be able to translate those nuances into another language. If you want your company’s documentation to “speak” to your colleagues or your clients, you have to use human translation. (p. 4)

To some extent, this not only results from the misconception of MT among the public, but more importantly reinforces that misconception in turn. In this community of language professionals, however, there seems to be much more acknowledgement of the use of other translation technologies, with an increasing number of translators and agencies actively incorporating into their workflow such tools as Translation Memory, terminology databases, and translation project management systems (generally known as Computer-Aided Translation). When it comes to MT, interestingly, resistance and misconceptions are not hard to find.

[3] It is important to note here that a distinction is often made between the community of translation practitioners and that of translation scholars, since views and focus might be very different between the two. However, this thesis does not make this distinction. The “translation community” here refers to both the practitioners and the scholars.

Domain Experts

As mentioned above, there is another community, other than those in Translation Studies or Machine Translation, who are largely relevant to this issue and whose interest is apparent: the experts of the domain to which the texts belong, or in other words, the text recipients. Strictly speaking they are the “users” of the translated text, but when investigating MT, they also position themselves in the role of system users. This often doubles the problems: in many of these investigations, not only are translation issues approached in a static, oversimplified manner, but MT systems are also evaluated as a black box. Typically, these investigations evaluate freely available, general-purpose online MT (e.g. Google Translate), using mostly restricted, domain-specific texts, without any customization made to the system, as the basis for judgments of the translated output, and of MT usability in general. There is no lack of such investigations in the literature. The examples given in Chapter 1 are already quite abundant and representative, all under the influence of much misconception about the proper use of MT. In addition to the largely negative results regarding the usability of MT output, what seem more interesting are the following two aspects:

1) They tend to conclude, quite inappropriately, that MT tools are more suitable for general language than for specialized domains. These investigations often start from the realization that MT software such as Google Translate is becoming popular among the public, and aim to test whether it is equally satisfactory for professional texts. What is important here is that very few of them acknowledge the fact that the software often used for the investigations is actually intended for general, rather than domain-specific, purposes in the first place, though automatic domain adaptation might be incorporated in the software’s components. There is very little customization either, as mentioned above. In other words, what they genuinely reveal seems to be either that the system’s automatic domain adaptation is not satisfactory, or that general-purpose MT is more suitable for translating general language when used without proper customization. This is also related to many of the misconceptions regarding MT, but in essence it does not contradict the stances taken in this thesis. It seems that these investigations only need to involve a bit of further work: customization.

2) The conclusion that quality would improve if the systems were modified and adapted to the terms used in their specialized domains seems very encouraging.
In a way this is perfectly consistent with the emphasis on customization in this thesis. However, although they highlight the issue of terminology, the overall low usability of the translated output in such investigations is often the result of many other issues as well, particularly in relation to the syntactic or structural ambiguities involved. The general-language words which are considerably ambiguous in their grammatical functions could be much more important for the distortion of the translated meaning, and such issues are often considerably more complicated. This can be seen in more detail in Chapter 3. Despite this, it can be safely assumed that in a specialized domain the language use is considerably restricted (which is often termed “sublanguage”), so that the grammatical functions of these otherwise ambiguous words would be constrained to a significant extent. Therefore, the “enrichment” of these items in the glossary seems equally, if not more, important and effective. On the other hand, it is true that domain-specific texts are dense in terminology, which easily leads to the prominence of terminological errors. In Appendix D such an issue is also prominent, despite the settings in SYSTRAN which have already facilitated its selection of the most appropriate meanings of polysemous words (see Chapter 3). Therefore, perhaps this part of their conclusion can be adjusted to “enrichment of terms and grammatically ambiguous words are the key to better quality”. However, to what extent this assumption justifiably stands remains to be tested, an issue which constitutes a major part of this thesis (see Chapter 3).

Overall Discussion

The above has illustrated the problems of MT output investigations from different perspectives, and revealed that there is in fact much misconception, inappropriateness, or controversy in each of these three communities, one way or another. This is not surprising, for many reasons which this thesis is not able to cover exhaustively, but two of them are especially worth mentioning here. First, the perceived usability of MT is “relative to users’ expectations” (Kit & Wong, 2008). For example, Bowker (2010) discovers in her “recipient evaluation” that the “average” recipients are more open to MT than language professionals. In Morland’s (2002) case study within an international company, those who are fluent in English (i.e. the dominant language, typically used as the Source Language for MT) tend to rate MT to be “of lower quality and less useful” than those for whom English is difficult. The investigation of user-generated content in Mitchell et al. (2014) shows “the community evaluators were more critical in their ratings for fluency than the domain experts”. Combining these findings with the sharp contrast of attitudes between the translation community, who are frequently disapproving of and sometimes hostile to MT, and the scientists surveyed in Henisz-Dostert et al. (1979, pp. 147-244) (see 2.1.1 above), who are generally positive about the output, it is not hard to discover the difference in their focus of attention, together with the tendency that the more one cares about the translated content, the less suspicious one would be about MT. The translators, as bilinguals and language professionals, pay a great deal of attention (sometimes excessive) to the language, while those who need translation, the less-bilingual domain experts, tend to put more emphasis on the content of the translated text. This is perhaps another justification for the need to distinguish system users from translation users (see above). Second, there is often much misconception regarding MT, and a lack of awareness of proper MT use. Many of the negative comments about MT usability have to do with the preconception that human language is too complex and that MT is too mechanical. Yates’ (2006) comments, in her excessively critical evaluation of MT, are quite representative: translation is a sophisticated task which requires analyses on morphological, lexical, syntactical and cultural levels, while machines focus on only the morphological and lexical analyses, with “minimal” syntactical analysis, not to mention cultural understanding. This is largely the basis for the belief that MT has very limited usefulness, not even satisfactory “when measured against the ‘general intent’ standard” (Yates, 2006). Here, it is worth mentioning that MT systems today, as well as NLP in general, are by no means that simple (the “direct” MT approach existed some 60 years ago), especially those systems which combine linguistic rules with statistical methods (known as hybrid systems), which can be customized to the specific features of the ST to be translated, and which propelled the popularization of online MT in the last decade or so.
Also worth mentioning is that MT systems are not the same as online dictionaries, nor should they be used as such, a misconception which seems rather common among the public (Gaspari, 2007). Perhaps more important is that how useful a system is depends “not only on how well it can translate, but also largely on how it is utilized” (Kit & Wong, 2008). Generally speaking, translating less ambiguous texts results in significantly better output from the system (this is equally true for human translation), and in fact, custom-specific or domain-specific MT systems can produce very high quality.

This highlights one of the stances this thesis takes: the system should be used to translate suitable STs for appropriate purposes and properly customized to the ST and purpose in question. In this regard, the above-mentioned bias of making judgements on the basis of texts which are apparently unsuitable for MT (see 2.1.2) is particularly worth mentioning. Although it is true that language is complicated, ambiguous, dynamic, and perhaps often untranslatable, texts which are intended for practical or informative purposes (e.g. user manuals, questionnaires, and perhaps the academic abstracts in this case) tend to be much less expressive, flexible, or ambiguous. This is in fact not only within the capabilities of MT, but more importantly consistent with its advantages over human translation, as will be illustrated later in this section. On the other hand, if one considers the real-world translation market, it is not hard to see how small a proportion expressive texts account for in the entire amount of translated content today, and how fast it continues to decrease. In China, for example, statistics show that this percentage is only about 4% each year (Chan, 2014, p. 330). The issue of customization is perhaps much more important. Although many are aware of the proper circumstances where MT should be used, a rather frequent phenomenon is that the way the system is handled seems considerably inconsistent with the features of the ST or the purpose. The large number of investigations among MT users where domain-specific texts are translated without adapting the system accordingly is in many ways not dissimilar to asking a human translator who specializes in legal documents, for example, to translate without reference resources a highly technical report for a nuclear power plant. Another often neglected fact is that human translators make various mistakes too, only of a different kind from the “stupid” and “frequently amusing” errors of MT systems.
It is not uncommon for human technical translation to be inconsistent or unprofessional in terminology, to contain typos, and to result in all kinds of trouble when the translator is not familiar with the subject domain in which the text is written, not to mention the lack of speed, as well as the insufficient supply of translators for the increasingly globalized world where information circulates fast and massively across languages. These aspects are what MT is mostly good at and should be used for. The use of MT can in fact effectively satisfy the specific kinds of translation demands for which it is suitable. In other words, the applicability of MT depends “not only on the text to be translated but also on the type of translation that we are trying to produce” (Toral & Way, 2015, p. 241).

The apparent “translator resistance” to post-editing, which also comes from a misunderstanding of MT and of the post-editing task itself (O’Brien & Moorkens, 2014), and the somewhat biased descriptions of MT by many translation companies for commercial considerations (Sun, 2005) have often contributed to an unfavourable attitude to using MT at all. A sound understanding of the proper use of MT, though crucially important, is frequently lacking. As will be discussed below, this thesis argues that a proper use of MT depends on its ST, the purpose of the translational act, and the customization. In summary, there seems to be a conceptual gap in all these communities regarding the investigation of MT output: Translation Studies, Machine Translation, and the specific domains in which MT is used. As will be illustrated below, this thesis takes the functionalist stance on translation, where a good translation is one which is functionally appropriate in view of the translational purpose; meanwhile the thesis also emphasizes translating suitable texts and properly customizing the system.

2.2 MT for Scientific Abstracts: A functionalist translation perspective

2.2.1 Equivalence: Problems and paradigm shifts

In much of the theory and practice in translation, the notion of “equivalence” has long been a central issue (Chesterman, 1989; Munday, 2001) and a “conceptual basis” for assessing translation quality (House, 2015). Even to this day, translation is still often defined by “equivalence” in many circumstances, typically guiding some practical aspects of professional translating. This paradigm, with its heyday in the 1960s and 1970s, is largely based on the idea that what one says in one language can, and should, “have the same value (the same worth or function) when translated into another language”, where “value” can be on the level of “form, function, or anything in between” (Pym, 2009, p. 6). Although the idea of equivalence is rather simple, it has unfortunately become “quite complex” both as a term and as a theory (Pym, 2009). The exact meaning of “equivalence”, or what exactly has to be equivalent during translation, has long been considerably controversial, if not the most controversial issue ever discussed in Translation Studies (e.g. House, 2015; Munday, 2009). Nevertheless, most theories in this paradigm presuppose that a translation necessarily aims to be, as much as possible, somewhat equivalent to the ST, whatever level or aspect of equivalence it is. Many of the questions they seek answers to, therefore, are in many ways concerned with the search for “equivalence” in translation. This dates back to the age-old dichotomy of “literal” versus “free” translations, “word-for-word” versus “sense-for-sense” approaches, “foreignizing” versus “domesticating”, etc., with various terminological variants denoting concepts which are essentially related to the same dichotomy.
Regarding direct and explicit notions of “equivalence”, Jakobson (1959) emphasizes the “equivalence in difference” as the “cardinal” issue; Nida (1964) suggests “formal equivalence” and “dynamic equivalence”, giving overriding priority to the latter (i.e. producing equivalent reader-response); Catford (1965) also puts much importance on the concept and distinguishes “formal correspondence” from “textual equivalence”; Koller (1979, 1989) examines more closely the concepts of “equivalence” in contrast to “correspondence”, in much of a parole-versus-langue manner, and describes five different types of equivalence: denotative, connotative, text-normative, pragmatic, and formal (expressive). These are largely the basis for the move towards a “science” of translating, aiming for objectivity, and the same standpoint can also be found in numerous works by many other scholars, e.g. Wilss (1982), Kade (1968) and Neubert (1970, 1985). While all of these discussions are worthy, valuable and profoundly meaningful, the equivalence paradigm has nevertheless been severely and repeatedly criticized over the decades, to the extent that Translation Studies has in fact “marginalized” it (Munday, 2001, p. 50), at least for the time being, with many claiming that the notion is rather unnecessary and illusory (e.g. Hatim & Mason, 1990; Reiss & Vermeer, 1984), or perhaps nothing more than “presumed” equivalence as a “belief structure” (Gutt, 1991; Pym, 1992, 2014; Toury, 1980, 1995). Some reject it completely, for instance Vermeer (1984), Snell-Hornby (1988) and Prunč (2007), to name just a few. In several more recent publications, equivalence has even been denied “any value” or “legitimate status” in translation theory (Baker, 2011, p. 5; House, 2015, p. 6; Munday, 2012, p. 77). One problem with defining translation by “equivalence”, as Colina (2015) explains, stems from the vagueness of the notion itself, as is also shown by the various controversial descriptions above. As mentioned above (see 2.1.2), when one tries to give specific content to this notion, the issue becomes rather tricky. It remains unclear and disputed what counts as equivalence, and yet the designation of different types, aspects or levels of equivalence (e.g. equivalence of meaning, of effect, of function, etc.) would not work either, because they have “little basis in reality” and are “hardly attainable” (Colina, 2015, p. 16). This is also why Machine Translation Evaluation has long been equally controversial and problematic (King, 1997): its evaluation metrics largely define translation quality on the basis of “equivalence”, a notion which is really undefinable in practice.
In addition to issues of evaluation, many of the underlying mechanisms of MT themselves largely aim to model “equivalence”, if not formal equivalence. MT lags behind “state-of-the art theories in Translation Studies” (Toral & Way, 2015). As a matter of fact, “if perfect equivalence was attainable, Machine Translation would have been much more successful than it has been to date” (Colina, 2015, p. 16). Moreover, while this paradigm was initially claimed to be scientific and objective, it has been “seriously challenged by the principle of uncertainty” (Pym, 2014, p. 118) and cannot avoid the problem of equivalence being always subjective (Munday, 2001, p. 49) and inevitably relative (Baker, 2011, p. 5).

The belief underneath the scientific standpoint is that there is a so-called tertium comparationis, an invariant which can be transmitted from the ST to the TT and against which “two text segments can be measured to gauge variation” (Munday, 2001, p. 49). This is perhaps where most contention comes from. The inevitable problems of the notion of “equivalence” have, consequently, led to a few influential paradigm shifts in translation theories (Munday, 2012). For instance, the purpose-oriented paradigm, “functionalism” or “functionalist school” (cf. Munday, 2001), moves closer to the practical aspect of translation, “reducing equivalence to a special case and insisting that translators and their clients negotiate in order to translate” (Pym, 2014, p. 118); Descriptive Translation Studies (Toury, 2012) in turn discusses the “shifts” and “transformations” produced by the translator; the “indeterminist paradigm”, particularly deconstruction, “sets about undoing illusions of equivalence as a stable semantic relation” (Pym, 2014, p. 118); Pym (2014) also considers the notion of “localization” a self-contained paradigm which effectively addresses many problems resulting from the uncertainty of “equivalence”. Regarding these responses, what seems relevant here is whether they provide guidance for the issues of translation technology and machine translation in particular. As Melby and Warner (1995, pp. 157, 165) rightly state, “for scholars who are involved in developing natural-language processing applications, in particular machine translation, a translation theory that could guide computer programming to enable the translations of controlled language texts in restricted subject fields may be needed” (quoted from Quah, 2006, p. 24). Among the above-mentioned paradigms following “equivalence”, perhaps “functionalism” provides much common ground for machine translation (e.g. Quah, 2006).
While others are also relevant to this issue in different aspects, due to the scale of this thesis the focus will be on “functionalism” only.

2.2.2 Functionalism and MT

As mentioned above, the purpose-based, functionalist paradigm has been a straightforward response to the complex, controversial, and sometimes excessively emphasized issues of equivalence. The theories in this paradigm emerged in the 1970s and 1980s as an influential trend which questioned “the validity of the register-inspired equivalence paradigm” (Hatim, 2009) and “set in a practical reaction against the academic detail of extensive linguistics analysis” (Munday,
2009). It moved away from the “static linguistic typologies of translation shifts” (Munday, 2001, p. 73), in turn focusing on satisfying the customer or readership. It simplified translation issues, emphasizing keywords (Newmark, 2009), but is considerably meaningful in the pragmatic aspects of translation. In the relatively radical case of Skopos Theory, it shifted the focus from the source to the target; it has “dethroned” the ST, making it no longer superior and allowing one text to be translated in various ways depending on what the translation is used for, rather than on what is most equivalent or faithful to the source. “Equivalence”, consequently, becomes merely a special case where the Target Text happens to be functioning in the same way as its source. In functionalism, translation is considered primarily a communicative act with a purpose, a function, or the so-called “skopos”. The “translational act” has a purpose, as does any kind of act. Every text, be it translated or original, has a purpose. The “purpose” is specifically defined for a particular circumstance under which the translating is conducted, and it is fulfilling this purpose that decides, overridingly, what strategies to adopt in the process of translating and how a translated product should be assessed. A translation, therefore, is no longer necessarily a reproduction of the original aiming for the closest resemblance, but a new version of the ST for a particular group of audience and for a specific purpose; its quality is then to be judged not by “faithfulness” or different types of equivalence, but instead by whether it is functionally appropriate as a text in its own right regarding the defined target reader. A good translation is one that fulfils this purpose. 
The ST, instead of being somewhat “sacred” or authoritative, or in many ways superior to the TT, becomes merely an offer of information for the process of translating, nothing more than one of the many factors coming into play when a translator is trying to produce a purpose-fulfilling, functionally appropriate text. In a way, the basic message underlying functionalist theories is that the translator is in fact translating “functions”, mostly at the textual level. While the consideration of textual functions is also included in some theories of equivalence (e.g. Nida’s functional/dynamic equivalence), what differentiates this paradigm is the recognition that the TT can, and often should, have functions different from those of the ST. This is especially useful for the translation activities in this day and age, where information circulation is increasingly rapid, massive and technologized. The importance of the perception that every translation is made for a purpose, together with a target-oriented approach, has been further highlighted in the
digital-age translation industry (cf. Odacıoglu & Kokturk, 2015). In this sense, some even claim that functionalism, or typically the Skopos Theory, is “the only approach that truly acknowledges the professional reality of translating and the demands, expectations and obligations of translators” (Byrne, 2006, p. 58). On the one hand, most of the texts to be translated in general tend to be of a pragmatic and informative nature, for which “the purpose or function of the Target Texts is of overriding importance” (Sun, 2005) and in this regard, content matters significantly more than form. It is estimated that in general, about 90% of what circulates in the language industry involves specialized communication (Esqueda & de Jesus, 2015; Kingscott, 2002), and the number is also reflected in the Chinese context: some 96% of the translation activities conducted in China every year are on pragmatic texts (Chan, 2014, p. 330). In technical communication today, the ST is no longer as authoritative as it used to be, and the aim of the translator, therefore, is not so much to “foreground the ST author’s views” in ways advocated by old translation theories (especially the ones before functional theories), but to translate the document “on time” and “accurately”, conforming to the norms of the technical domains in which he/she is translating while “benefiting from so-called artificial equivalence” (Odacıoglu & Kokturk, 2015, p. 1091). The translator facilitates the flow of information or the functioning of the text rather than aiming to be uncompromisingly faithful to the ST author. The “artificial equivalence” here, it is worth mentioning, could be the kind of equivalence which is made to exist in the case of terminology (see 2.2.4), or which is presumed between two aligned segments in the database (i.e. Translation Memory). This view is consistent with the case of translating scientific abstracts.
On the other hand, the indispensable use of computer technology in the professional translation workflow has become a de facto standard, with the incorporation of, for example, Translation Memories, terminology databases, electronic corpora, translation management tools, and now the increasingly popular Machine Translation. This has undermined the binary notion of “source” and “target” texts, because the source is no longer a single text but a database, or among others “source materials (mostly anonymous) and their translations” (Odacıoglu & Kokturk, 2015); translators are no longer “blindly tied” to the ST (Odacıoglu & Kokturk, 2015) but work on a “Start Text”, complemented by “source materials that take shape of authorized Translation Memories, glossaries, terminology bases and Machine Translation feeds” (Pym, 2013, p. 1). The ST, consequently, becomes not the “source”, but rather the “start”, of a process where different kinds of other, perhaps
superior, sources of reference are utilized; this is especially evident when it comes to technical translation. The “Start Text”, again, is now merely an offer of (perhaps partial) information, only one of the many factors coming into play, in the process of producing a Target Text which fulfils its purpose. In institutional translation, sometimes there is no “Source Text” at all: texts are produced simultaneously in a multi-lingual manner and they are in a way all “Target Texts” coming from various sources of reference. For MT, it is equally important to consider such a perspective. As mentioned above, the vague and controversial notion of equivalence causes numerous problems for MT, especially for MT evaluation, an issue which has in effect been “more developed than MT itself” (Wilks, 2009). Based on this, many propose that the evaluation of MT, or perhaps its computational design as well, be based on the context in which it is used, or conducted in a dynamic manner, which seems consistent with the purpose-based theories of Translation Studies. Based on the intended use of the text, the criteria against which the MT output is evaluated would be adjusted accordingly, e.g. post-editing effort, comprehension tests on the reader, automatic evaluation metrics, etc. The relative importance of the specific pieces of information in the text would need to be adjusted as well, as shown in the case study in Chapter 3. In addition, the intended Skopos of the TT will decide, for instance, whether –– and how –– the ST is to be pre-edited (Quah, 2006), a process which Chan (2015) calls “editorial customization” for a computer or computer-aided translation system. The purpose of the text will also decide the post-editing process in accordance with the expectations of quality, and whether to choose human translation or MT in the first place (Quah, 2006).
The following sections will briefly illustrate a few theories in this paradigm, regarding the case of machine-translating scientific abstracts.

2.2.3 MT for Scientific Abstracts: Functionalist issues

Text type and function

Scientific abstracts are perhaps inclined to a type of “informative text”, and in the case of machine-translating abstracts with an information orientation, it is the referential dimension and informative function that are prominent.

Based on Bühler’s (1934) triadic categorization, language functions include Darstellungsfunktion (the informative function), Ausdrucksfunktion (the expressive function) and Appellfunktion (the appellative function) (quoted from Munday, 2001, p. 199). These are also consistent with what Pym (2014, p. 46) describes as “the three linguistic persons”: “I” –– expressive (form-focused), “you” –– operative (appeal-focused), “he”, “she”, “it”, “they” –– informative (content-focused). Reiss (1977) links them with the corresponding language “dimensions” –– referential, aesthetic, and dialogic, together with the text type, the communicative situations, and the respective translation methods. According to Reiss, each of these text types would require a different translation approach, and the translation should be at the textual level with much consideration of the functions of the text. Text types are classified in accordance with the dominant textual function: informative, expressive or operative, though there is often an overlap of these or other functions within a text. The “informative text” (Reiss, 1976) is “content-focused” (Reiss, 1971), e.g. news, scientific or technical texts, whose dominant function is “plain communication of facts” (Reiss, 1977). In transmitting the information, the language dimension used is logical or referential, with the main focus on the content or topic. The translation method should be, therefore, one which transmits the full referential or conceptual content of the ST correctly, with not necessarily more than acceptable form, i.e. in “plain prose” (and sometimes with a certain degree of explicitation, if need be). The “expressive text” (Reiss, 1976), on the other hand, is “form-focused” (Reiss, 1971), e.g. poems and many other kinds of texts with a certain extent of literary nature, functioning predominantly as “creative composition” (Reiss, 1977), where the language dimension used is aesthetic.
The TT should transmit the corresponding form, namely the aesthetic and artistic form. The translator should use the “identifying” method, adopting the standpoint of the ST author. The “operative text” (Reiss, 1976), e.g. advertisements, or “texts of a rhetorical or polemical bent” (House, 2015, p. 15), is “appeal-focused” (Reiss, 1971), and its main function is “inducing behavioural responses” (Reiss, 1977). This function aims at appealing to or persuading the reader to “act in a certain way” (Munday, 2001), where the language dimension is dialogic. Since the focus is appellative, in deciding on the translation strategies the “effect” has priority over both content and form. The TT should first and foremost produce the desired
response in the “receiver” of the translation, employing the “adaptive” method to create an equivalent effect on the TT reader. Normally, a specific text in question falls not into one of the above three types, but rather, somewhere along a continuum between each two of them. These three types are in effect three extremes; most texts are actually hybrid types located somewhere in the middle, somewhere on a continuous, triangular plane, as illustrated below. Its exact location on this plane is dependent on the extent to which each of the three textual functions is dominant.

Figure 2.1 –– Text types (taken from Munday, 2001, p. 74)

With respect to the case study in this thesis, perhaps the academic abstracts can be positioned somewhere close to the text variety “report”, since their expressive value should be similar to that of the operative instructions and lectures, with less operative value and a greater degree of expressiveness and informativeness. On the other hand, the abstracts would have more expressive and operative value compared with reference works. Such a predominantly informative text type is clearly in direct relation to the practical use of Machine Translation. It is believed that one of the functions of MT –– perhaps the “most important” one (Chan, 2004, p. 105) –– is for “information assimilation”, sometimes specifically for “database access” or information “interchange” (Hutchins, 2005, 2007). This is also what Translation Studies calls “information translation” (cf. Chan, 2004): to convey the referential content, but not the style or form, so that the TT reader grasps the message, sometimes merely the gist or essence, of a text which is written in a language he/she does not understand. The linguistic presentation or standard which is required in this regard can be compromised, and may “vary considerably”, ranging “from paraphrase to summary” (Chan, 2004, p. 105). The expressive or operative functions are largely subordinate
compared with the information the text conveys. In this type of translation, MT, especially domain-specific MT systems, can be very effectively used for transmitting correct information in a speedy manner. This seems perfectly consistent with Reiss’s views on translating the “content-focused”, “informative” text type, and works for almost all of Hutchins’s (2005) four types of “translation demands” that MT aims to fulfill (namely, assimilation, dissemination, database access and information interchange). Here, the raw output from MT is largely adequate in many circumstances, though sometimes “light” or “rapid” post-editing may need to be done (Wagner, 1985). Based on the text typology above, Reiss’s (1971) criteria for assessing the adequacy of a translation are listed in a number of aspects, both intralinguistic and extralinguistic. Her instruction includes (quoted from Munday, 2001, p. 75):
1. intralinguistic criteria: semantic, lexical, grammatical and stylistic features;
2. extralinguistic criteria: situation, subject field, time, place, receiver, sender and affective implications (humour, irony, emotion, etc.).
In accordance with the text type, the importance of these criteria varies significantly, ranging from, say, prioritizing semantic aspects for highly informative texts, to emphasizing stylistic features (e.g. metaphors) for texts with a strong expressive nature. In the evaluation of MT output of informative texts, it is also reasonable to adopt the same standpoint, while putting much emphasis on the transmission of information rather than any other language functions, i.e. “information translation” (see 2.2.4 below). For the customization in this case study, such a perspective is also followed.

Translatorial Action

The “translatorial action” (see Figure 2.2) proposed by Holz-Mänttäri (1984) borrows concepts from communication theory and action theory, and can be considered “target-side functionalism” (Pym, 2014, p. 50), viewing translation as a “purpose-driven, outcome-oriented” interaction, with a focus on “message-transmitter compounds” (Munday, 2001, p. 77). The translation activity, not unlike many other kinds of action in society, fulfils a social function. The process
of translating is regarded as the mediating of the communication of messages across languages and cultures, and the message communication itself –– for her as a functionalist theorist –– is of course ruled by the function which the message is to fulfil. A translator’s job is not merely to replicate the ST cross-linguistically; a translator is a “texter”, someone who creates texts, rather than an obedient, loyal and faithful subordinate to the ST author. Attempting to replicate the function of the ST, of course, can be one possible purpose or aim of the translatorial action, but it is completely legitimate for a translator, given his expertise in “enabling functionally oriented communication” (Holz-Mänttäri, 1984; translated by Munday, 2001, p. 77), to create new text functions deemed communicatively adequate for the receiver (as indicated in Figure 2.2 below).

Figure 2.2 –– Translatorial action (taken from Pym, 2014, p. 50)

In the process of “enabling functionally oriented communication” (Holz-Mänttäri, 1984; translated in Munday, 2001, p. 77), there is a series of “roles and players”, including the initiator, the commissioner, the ST producer, the TT producer, the TT user, and the TT receiver. The importance of distinguishing these roles is that each of them would have its specific primary and secondary goals. It is also possible that a translator is an expert in neither the text type nor the specific subject area of the text. In such a case the ST writer would need to provide additional resources regarding the subject-specific knowledge. This is similar to MT systems which need to be customized to the specific features of the ST.

As the translatorial action is focused on the functionality of the communication, the needs of the TT receiver become the overriding factor for what kind of TT should be produced. The form and genre of the translated product would therefore have to be dependent on what is functionally appropriate in the target culture, rather than merely replicating the profile of the ST regarding content and form. More specifically, for example, the highly technical terminology in a user manual would require considerable clarification if translated for a non-technical reader (Holz-Mänttäri, 1984). Consequently, translations would not be judged on the basis of whether they are faithful to the ST, but whether they are functionally communicative for the TT culture. In the scenario of MT use in this case study, the role of the computer is perhaps the “TT producer”, while the intended reader is the “TT receiver”. The roles of the commissioner and the TT user are a bit more complicated, depending on how the system is used. For example, one may suppose the case in which an English-speaking scientist, A, is using SYSTRAN for the translation of several abstracts written in Russian in order to browse the content of a research journal. SYSTRAN would be the TT producer, while A is at the same time the commissioner, the TT user, and the TT receiver. This would be rather simple. However, in a different case, if the same scientist A is now asking a translation agency to translate the abstracts for his/her literature reading, and the agency is one which incorporates Machine Translation into its workflow, the situation would be quite different. SYSTRAN is still the TT producer, and the scientist A is still the TT receiver. But the commissioner for the machine translation task would be the human translator or the translation agency, who “commissions” SYSTRAN to translate the text, while the human translator –– or in many ways the post-editor –– is the TT user.
Here, the difference in the goals –– primary or secondary –– for these roles would be apparent. In both of these scenarios, the primary goal of A might be, for example, to search for a method to apply to one of his/her research projects, to complement his/her literature review of an article, to recommend scientific readings for a Russian friend who does not understand English, or simply to understand the general features
of the Russian journal in question. The secondary goal is to obtain a translated text which fulfills his/her purpose. Since A is simultaneously the commissioner and TT user for SYSTRAN, the text which is expected from the system is consequently the kind of output which fulfills the same purpose. Perhaps A would be using SYSTRAN for cross-linguistic information retrieval, for example. However, in the second case the situation might be different: the purpose A wants fulfilled is still the same, but the translator or translation agency who “commissions” the task to SYSTRAN would be focused more on cutting cost and saving time while fulfilling the contract with A. The post-editor, or TT user, is more focused on the extent to which SYSTRAN’s output can provide ample convenience for the post-editing efforts needed to produce the text which A wants. For each of these stakeholders, the expectations of SYSTRAN would differ considerably. The commissioner would try to balance the cost of the MT system, the language resources needed, and the post-editing. The post-editor would expect an output which needs minimal post-editing distance, or cognitive effort, to be made into a version consistent with A’s expectations. It is important to note that a higher-quality translation in terms of fluency, comprehensiveness, or other aspects might not be the same as the kind of translation which provides more convenience for post-editing. As can be seen, the distinction between the TT user and the TT receiver seems important for expectations of the output from MT.

Skopos and adequacy

As mentioned above, the purpose of a translation or the action of translating is considered the overriding factor in this thesis for MT customization and assessment. This is termed skopos (Reiss & Vermeer, 1984), and Vermeer’s theory prioritizes adequacy over equivalence as the measure of translational action. Adequacy is described as fulfilment of the skopos outlined by the commission; a translation which is consistent with the skopos would therefore be considered functionally and communicatively adequate. In general, the output from MT is believed to be used for two kinds of purposes in the broad sense: for understanding a text written in a foreign language (i.e. information assimilation), or for transmitting a message with a text in a foreign language (i.e. information dissemination). This is not merely about MT, but about the
translation practice in general. As mentioned above, MT has been very popular for both of these purposes, but the latter requires a higher quality and usually includes a post-editing process on the raw output from the system. However, this distinction seems too simple for the case here. Since the needs of the TT receiver are considered the overriding factor in this process, the goals of the roles in the translatorial action change in accordance with the change of the TT receiver’s needs. Using the translated abstract for a search of methods would be largely different from searching for results or relevant statistics, or more sharply, from obtaining an understanding of the publication tendency of topics in a journal. These differences would influence the goals of all those roles. Moreover, as Reiss (1977, p. 114) fully acknowledges, there are occasions in which the function of the TT may differ from that of the ST. The famous example she gives of Gulliver’s Travels is quite explanatory: though initially written to be satirical of the government of the time, the novel is now read and translated widely as “ordinary entertaining fiction”. A text which is primarily operative has been translated into an expressive text. Munday (2001, p. 75) gives another example where “a TT may have a different communicative function from the ST”: translating an election address in a foreign country for analyses of (1) “what policies have been presented” or (2) how these policies are presented by the speech maker. Here, a predominantly operative text becomes (1) informative or (2) expressive in the communication situations concerned. While the functionalist theories of translation emphasize the dynamic characteristics of the translation process regarding text functions, the viewpoints regarding MT are frequently static. Many of the discussions on MT are generally about issues of using MT to translate so-and-so (e.g.
literary/general/domain-specific) text, for assimilation/dissemination purposes (i.e. for reading a foreign text or for publication in a foreign language), rather than using MT to translate the ST into a text of so-and-so (informative/expressive/operative) function. For MT, what seems meaningful is that if a system is developed or customized for, say, informative functions, any text, regardless of its nature, can theoretically be translated by the system if the TT receiver is looking for information alone. The investigations into the usability of its output should accordingly lean
towards how much information, or content, is transmitted as a consequence of the skopos. This is also the case in the translation of abstracts discussed in this thesis. In this case study, the skopos is to translate the core information in a correct and speedy manner for domain experts. Presumably, an abstract can be written in a manner that increases the chance of publication, draws the attention of more readers, or simply summarizes its main points of information in plain language. The abstract can be an advertisement for the research article (Van Bonn & Swales, 2007) or a research retrieval tool (Salager-Meyer, 2009). However, based on the skopos, the textual function to be translated is purely informative, and it is the referential dimension that is prioritized in terms of adequacy. As will be described below (see 2.2.4 and 3.4), the skopos also decides that some pieces of information are relatively less important, e.g. the background of a piece of research, information which is already mentioned elsewhere, or information which can be readily inferred by a domain-specific reader. Meanwhile, the skopos determines how the system is customized, or specifically, what words are considered inappropriate in this case study, as can be seen in 2.3.3.

2.2.4 MT for Scientific Abstracts: Information translation

Information translation

The above has illustrated the perspectives of the functionalist school in Translation Studies, and positioned the case study within this framework. As mentioned, the functionalist perspectives question the validity of the equivalence paradigm and have in a way “dethroned” the ST. However, it can be seen that not every aspect of the equivalence paradigm has been abandoned, and in this sense the remaining parts are also useful for MT in this study. Perhaps in Translation Studies, the basis for most theories is that “translation is first and foremost a communicative act” (Suojanen, Koskinen, & Tuominen, 2015). To translate means to transmit to the target audience a message conveyed by the ST. The translator decodes the message and transcodes it in the target language, changing his/her role from a receiver of the original into a sender of the recoded message (Nida, 1964). In essence, this idea is not inconsistent between the two paradigms
mentioned above. It is whether the ST and TT ought to have the “same value”, be it the message or the function, or whether the ST should be superior to the TT, that constitutes the differences between these paradigms. The intrinsically communicative nature of translation does not appear much of a controversy. In a way, it reflects the thinking in the 1950s and 1960s under the influence of, for example, Shannon and Weaver’s (1949) theory of information, in which communication is modelled mathematically as an “information channel” where the “message” moves from the sender to the receiver through the channel, with or without noise. Translation can be viewed in a very similar manner: the translator decodes the message conveyed by the ST and transcodes it in the Target Language, changing his/her role from a receiver of the original into a sender of the recoded message (Nida, 1964; Nida & Taber, 1969), though the standard of linguistic presentation may vary considerably. In this sense, translators also open an information channel during the process, recreating “a linguistic surface with which readers will retrieve the informativity in the order of the original” (Chan, 2004). This general idea is in fact perfectly justifiable, especially when the texts are of an informative nature and when the purpose and function of translating is to inform, i.e. to transfer the referential content rather than any other language functions (e.g. expressive or persuasive). In technical, domain-specific translation, especially when the translating process focuses only on “information” rather than the style or form (i.e. “information translation” or “informational translation”), part of the equivalence paradigm is largely compatible, because “natural equivalents do exist”, though “rarely in a state of untouched nature” (Pym, 2014, p. 12). As Kade (1968) argues, equivalents of this kind are mostly terminological items, i.e. words which are artificially standardized and “made to correspond to each other” (Pym, 2014). This is obviously true in all specialized fields, where unique, unambiguous and standardized terminologies always exist, in a way creating equivalents in an artificial manner.
In this sense, the “illusion” of symmetry between languages, which “hardly exists” (Snell-Hornby, 1988, p. 22), becomes a reality. This is consistent with the case in this thesis: translating domain-specific academic abstracts, where the texts are dense in terminology and formulaic language. Therefore, as far as specialized terms are concerned, it is equivalence that should be the priority. As for the information translation, what information is prioritized is determined by the skopos.

Implicit Information

It is also important that scientific abstracts contain implicit information which is often beyond the specification or clarification of terms. The implicitness of information in a text can be the result of different factors –– some because of the structure of the source language, some because the same information has been included in other parts of the text, and some because the information is shared in the communication situation (Chan, 2004). These types of implicit information are consistent with Beekman and Callow’s (1974) description, which suggests three resources from which the implicit information can be derived: the immediate context, the remote context, and the cultural context. Regarding the case of abstracts for scientists in particular, the type of implicit information resulting from the structure of the source language is important for MT to explicitate. As will be discussed in Chapter 3, a number of the translation problems in the initial output from MT have to do with the implicit relationship between different constituents of the source sentences. This is largely the result of the features of academic writing, with extensive use of embedded clauses and conjunctions, which easily cause ambiguity without the support of domain-specific knowledge. This is also an important aspect for customization, and as will be shown, explicitation of such information is considerably effective for the overall improvement of the translated output. The second type of implicit information is equally important. As will also be illustrated in Chapter 3, many of the translation problems in the customized output are considered minor because the information can be found elsewhere in the text. When investigating the usability of MT, it does not seem fair to the system if such a type of error is considered significant, especially given that the reader would be reading the text in its entirety rather than as isolated sentences.
Regarding the third, the specific case of abstract translation is perhaps particularly relevant, because the communication takes place only among domain experts who share a remarkable amount of information beyond the text. The abstracts are written by and intended for the same community of experts, and the translated version is also for members of that same community. In addition to the shared information in the subject domain and in the research circle concerned, academic abstracts follow a commonly agreed format, structure, and content. These all constitute shared information, even when the abstract in question does not present them explicitly. Chapter 3 also illustrates that many structural ambiguities depend on extra-linguistic, domain-specific knowledge: they are hard to resolve without such knowledge but can be fairly easy to resolve if the reader is aware of a basic mechanism or principle. In this sense, if the translation from the MT system has lost or slightly distorted certain information of this kind, the problem would not be significant, if not entirely unproblematic, given the specific TT receiver.

Definition of Information

The above has discussed various issues of “information” and their relevance to language communication, translation, MT, and particularly the specific case of translating academic abstracts with MT. However, one question remains to be addressed: what is meant by information? In the mathematical model proposed by Shannon (1948) for information theory, the concept of “information” is defined as the value of entropy, which measures the uncertainty involved in estimating the value that a variable X can take, i.e. the freedom of choice when one chooses a message. However, this definition lacks meaningfulness in the practical sense, especially for the case study in this thesis, because it has no semantic content, although it does indicate the “Inverse Relationship Principle” (Floridi, 2010): an inverse relation between the probability of a proposition, sentence, event or situation and the amount of semantic information conveyed. As mentioned above, for terminologies there is an artificial equivalence, and academic abstracts are dense in terminologies. Terminologies also seem to be the most important part for conveying semantic information. Another important aspect is largely related to the verbs which connect the terminologies, or phrases, to one another. Therefore, information from the micro perspective is defined, in the case study for this thesis, as the terminologies (mostly nouns) and a limited number of verbs typically used in such writing. As demonstrated in the case study (see Chapter 3), the improvement of the customized translation regarding these two aspects has been significant in terms of information transmission. Perhaps more important is the fact that abstracts follow fixed structures of content: research background, topic, methods, results, and contribution, representing the macro propositions of the article, or the so-called “moves” (Swales, 1990) which are typical of scientific research articles (RA). Such a framework is also useful for the definition of information here. From a macro perspective, the background seems the least important in comparison, while methods and results are crucial, consistent with the recommendations of typical writing manuals for abstracts (Andrade, 2011; Mack, 2012).
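The entropy-based notion of information referred to in the definition above can be made concrete with a minimal sketch. The two probability distributions below are hypothetical and chosen only for illustration; the code shows that entropy is highest when the sender has the greatest freedom of choice, and that a less probable message carries more information, which is the quantitative side of the Inverse Relationship Principle. As the text notes, nothing in this calculation refers to semantic content.

```python
import math

def self_information(p):
    """Information, in bits, conveyed by an outcome of probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Shannon entropy H(X) = -sum of p(x) * log2 p(x), in bits."""
    return sum(p * self_information(p) for p in probabilities if p > 0)

# Hypothetical distributions over four possible messages (illustration only).
uniform = [0.25, 0.25, 0.25, 0.25]  # maximal freedom of choice
skewed = [0.97, 0.01, 0.01, 0.01]   # one message is almost certain

print(entropy(uniform))   # uncertainty of the uniform source
print(entropy(skewed))    # lower uncertainty for the skewed source
# A rare message is more informative than a near-certain one.
print(self_information(0.01) > self_information(0.97))
```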


2.3 Sublanguage, Ambiguity and MT Customization

2.3.1 Sublanguage and Controlled Language

As mentioned above, there has been a general consensus that fully automatic, high-quality Machine Translation of unrestricted texts is still largely impractical, and seems likely to remain so for the foreseeable future. This reality was in fact recognized as early as the mid-1980s, and since then the focus has shifted towards developing “usable and useful” MT systems (Somers, 2003, p. 6), even if the more ambitious goal turned out to be impractical. In this sense, many distinguish between the use of MT for assimilation and for dissemination (see above), somewhat lowering expectations of the output from the systems and evaluating their worth more appropriately. Perhaps more useful, for the practical use of MT to meet real-world translation demands, is the idea that the systems can work very well if the text to be translated is “somehow restricted” (Somers, 2003, p. 6, his emphasis). This idea has led to the so-called sublanguage approach and controlled-language approach, both with abundant successful examples that guide the use and development of MT systems to this day. Scientific abstracts are in many ways written in a sublanguage, restricted in terms of subject matter, lexical and syntactic closure, and more importantly, ambiguity. The subject matter, the text type and the communicative situation can affect not only the vocabulary used in a text, but also its style of expression. This phenomenon is sometimes called “register” by sociolinguists, or “language for specific/special purposes”, “specialized language” (Cabré, 2003), “technical English” (Copeck et al., 1997), “scientific English” (Crystal, 1997), or “academic and professional language” (Motos, 2013), to name a few; but the primary issue here is quite similar: it is what the MT community calls “sublanguage” or restricted language.
Sublanguage is viewed as a subset of the general language, a naturally occurring restricted language whose flexibility is significantly less than that of the “full” set, and MT systems are developed with its specific application in mind. Reasonably, in any specified domain the sublanguage concerned is, one way or another, naturally restricted, using only “a subset of the syntactic constructs and vocabulary of the language in question” (Somers & Rutzler, 1996), so that many of the lexical, syntactic and semantic ambiguities are effectively avoided, e.g. the “one-sense-per-discourse” claim (Gale, Church, & Yarowsky, 1992). Linguistic ambiguity has long been a major problem for MT (and often for human translation as well), but genre- and domain-specific texts involve considerably less ambiguity. In the practical sense, since much of the translation demand is for domain-specific pragmatic texts (see 2.2.2 above), the sublanguage approach is very likely to result in systems that are significantly useful. One famous example of successfully machine-translating naturally occurring restricted language, i.e. sublanguage, has been the Météo system (Kittredge & Lehrberger, 1982), which can translate weather bulletins fairly well from English to French, practically replacing humans for a “very tedious” task (Somers, 2003, p. 6). However, although the sublanguage approach has been in place for decades among MT developers, relatively little research has been done from the user’s perspective. Commercial systems designed for general-language purposes can now be customized in many ways for the end-users’ specific STs, but as shown in the previous section, a considerable number of investigations tend to leave the system uncustomized. The benefit of customizing general-purpose systems, rather than developing domain-specific MT in the first place, is twofold: 1) the concept of “domain” is hard to define, and the closure of the “sublanguage” lexicon and syntax tends to be ambiguous and sometimes controversial; and 2) each user has his/her own, unique STs in a certain scenario, with specific linguistic features that might not be simply attributed to a certain text type or subject matter, i.e. text restrictions are sometimes not genre- or discipline-specific. By allowing customizability, the use of MT can be expanded to a wider scope and to a more dynamic use.
On the other hand, since today’s MT systems have seen remarkable improvement in their output quality, it seems a safe assumption that customized systems should be able to give satisfactory output.

Controlled language (CL) is based on the same essential idea: reducing ambiguity and restricting language flexibility in terms of lexicon, grammar and style. While sublanguage is a naturally occurring phenomenon, controlled language tends to be artificial in some sense, conforming to a set of guidelines in the authoring process, so as to produce clearer, more succinct, more consistent and easier-to-understand texts for the target reader. This is in effect a very common practice in technical documentation or institutional translation, be it conventional or computer-assisted. In relation to MT, controlled language rules can be adopted to reduce “negative translatability indicators” and to produce input texts that are MT-friendly, though sometimes not appealing to humans. Adopting CL rules before inputting the text into MT is also known as “pre-processing” or “pre-editing”, and can be considered as one aspect of MT customization: instead of adapting the system to the ST in question, controlled language rules move the input text closer to the system’s capabilities. The CL strategy has been empirically proven effective in improving MT quality and reducing post-editing effort (e.g. O’Brien, 2006; Roturier, 2006), and has been largely implemented in translation workflows. A very successful example of CL in the industry, among many others, is the Perkins Approved Clear English (PACE) adopted by Perkins Engines Ltd. The CL rules of PACE, though not very precise (Nyberg, Mitamura, & Huijsen, 2003, p. 255), have significantly improved user comprehension of the texts, while also effectively facilitating the frequent and rapid production of the company’s documentation in five languages. For the latter, PACE has made the post-editing of machine-translated output “three to four times faster” than the conventional translation methods (Pym, 1990, p. 91). For this thesis, the distinction between the two is hardly necessary, because in essence the sublanguage concerned is often “controlled” in the authoring process, considering the writing norms, if not guidelines, in the corresponding domain and text type. Specifically, scientific research abstracts fall into a type of sublanguage partly because of the conventions for writing them.
Academic writing guidelines tend to instruct researchers to conform to specific norms, primarily to reduce ambiguity and ensure cohesion, which can be considered a degree of controlled authoring, though not MT-oriented. On the other hand, the text type of “abstracts”, its subject domain, and the targeted audience of professional, discipline-specific researchers altogether indicate that the language used in this situation is naturally restricted in its lexicon, syntax and style, regardless of what writing instructions or guidelines are followed. Therefore, the language involved is, in a way, both a sublanguage and a controlled language.

What is relevant to MT quality is the extent to which the language is restricted, and to which ambiguities are avoided. The following section will discuss features of such language and the issue of ambiguity.

2.3.2 Ambiguity

Ambiguity in human language is one of the major problems for MT. Part of this has to do with ill-formed or ambiguous input sentences, i.e. the ST is not clear enough in itself (“real ambiguity”); but a more common and relevant issue is the kind of ambiguity in language which humans can easily resolve in a given communicative situation while the computer cannot, a phenomenon often called “accidental ambiguity” (Hutchins & Somers, 1992). MT systems have to tackle similar linguistic problems which cause ambiguity (O’Brien, 1993). The above has mentioned the expressive function of a text, and it is generally believed that MT finds it more difficult to translate an expressive text than the informative type. However, compared with expressiveness, the issue of ambiguity seems more relevant to the problems for MT, because although some texts are expressive in nature, the structural ambiguity of their syntax might not be prominent. Perhaps the following example, which is an automatically translated sentence from a Kindle⁴, is illustrative:

Figure 2.3 –– Example of MT output

⁴ Kindle is an electronic-book reading device, which uses “Bing Translator” as the translating engine. For information on Bing Translator, see www.bing.com/translator.


It can be seen that the problem of the translation here is more relevant to the syntactic structures than to the expressive features of the sentences. On the other hand, the English text does not seem excessively ambiguous in terms of syntax, at least not as ambiguous as some informative or operative texts. Consider the following largely informative sentence for comparison (example taken from Lehrberger & Bourbeau, 1988):

The function of the priority valve is to restrict fluid flow to the secondary subsystems and to supply fluid on a priority basis for operation of the flight controls. (p. 92)

This sentence is significantly more ambiguous and problematic for MT, despite its highly informative nature. O’Brien (1993) succinctly describes the problem by saying,

it is not clear if “and” is a conjunction of predicates meaning “to restrict... and to supply” or a conjunction of prepositional phrases meaning “to the secondary subsystems and to supply fluid” where “supply fluid” is a noun-noun compound. (p. 86)

Another typical example might be the following sentence from an aircraft maintenance manual (example taken from Lehrberger & Bourbeau, 1988):

Disconnect pressure and return lines from pump.

This, again, is perhaps far more ambiguous than many expressive texts: it is unclear whether the word “return” is an imperative verb, or a noun-modifying noun in the compound “return lines”. Such examples, in fact, frequently occur in texts which are primarily intended for informative functions, including those in the case study in this thesis.

Types of ambiguity and sublanguage restriction

Ambiguity can be divided into three categories: lexical, semantic, and syntactic. All of these types and categories of ambiguities can be reduced significantly when the ST is written in a sublanguage. Lexical ambiguities can be further divided into three types: categorial ambiguities, homographs and polysemy, and transfer or “translational” ambiguities (Hutchins & Somers, 1992).

Categorial ambiguities refer to the commonly seen instances where a word is assigned to more than one grammatical or syntactic category, e.g. “show” as a noun and as a verb. As for homographs and polysemy, the two concepts are sometimes distinguished in linguistics, but the difference between them seems quite irrelevant to MT, because the system processes them in the same manner. Therefore, in this thesis “polysemy” will be used to represent the category where a word has different meanings. An example of this could be “paper” being a piece of academic writing, or the physical paper material. As for translational ambiguities, a word in the source language may be translated into different target language lexical items depending on meanings or subject domains. For example, “guide” as a noun is translated differently in Chinese for a person who guides tours and for an optical device that guides light beams. In the case study in Chapter 3, polysemy also includes translational ambiguity. Syntactic ambiguities have to do with the multiple ways of parsing a sentence caused by “phenomena such as prepositional phrase (PP) attachment, conjunction, ellipsis of articles, embedded clauses and adjective scope” (O’Brien, 1993). Semantic ambiguities are primarily caused by “synonymy, long nominal compounds and referential items whose antecedents are not easily identified” (ibid). It is important to note that categorial ambiguity has a significant influence on syntax. Many of the structural issues regarding syntax are in fact the result of lexical items having more than one grammatical function, as illustrated further in Chapter 3. These types of ambiguities pose significant difficulty for automatic translation. However, in a sublanguage these types of ambiguity are significantly reduced, particularly lexical ambiguity.
For example, in the translation of optics research abstracts, it can be assumed that terminological words such as “light”, “surface”, “wave” and “guide” would be mostly used as nouns referring to their terminological sense, although each of them has multiple categories, meanings and translations: “光”/“轻”/“照亮” for “light” as noun/adjective/verb, “表面”/“浮现” for “surface” as noun/verb, etc. As can be seen in Chapter 3, general-language items are also largely restricted: “can”, for example, is only used in the corpora as a modal verb rather than a noun. The two also correspond to different Chinese translations: “能”/“罐头”.

What seems more important here, other than the Chinese translations of these words, is that the categorial restriction significantly reduces the complexity of parsing and avoids syntactic ambiguities. Many of the parsing errors are related to categorial ambiguity, as can be seen in Chapter 3, and these types of errors will no longer appear if the glossary is modified to restrict the grammatical functions of those words. In the same way, the sublanguage restriction is also related to the “polysemy” issue. The word “wave” above can also mean “a hand wave” in general language, and “guide” can be an instruction book rather than an optical device. Each of these usages has a Chinese translation different from the one corresponding to the terminological sense (e.g. “波”/“挥手”, “波导”/“指南”). In the optics abstracts, it can be assumed that most of these words would be restricted to one meaning only, a terminological one, which corresponds to a fixed Chinese term (“波”, “波导”). However, even in this sublanguage there can be polysemies that remain ambiguous. For example, the word “laser” may refer to either the laser beam or the device that produces a laser beam. These two senses correspond to two different Chinese terms, i.e. “激光”/“激光器”. Although in most cases a domain-specific reader can distinguish them without much difficulty, the word is still often very ambiguous within short contexts, if there are no modifiers that clearly indicate whether it is a device or a beam. For MT, such items seem considerably problematic. Regarding translational ambiguity, the words “linewidth” and “aberration” are illustrative examples: in optics they are very restricted. These items are not ambiguous in terms of category or polysemy, but are often translated differently in different specializations. They can be translated respectively into “行距”/“线宽”, and “变型”/“畸变” depending on the subject matter, but in optics only the latter Chinese terms are used.
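The effect of sublanguage restriction on lexical choice can be sketched as a lookup in which a domain glossary overrides a general-language dictionary. This is a minimal illustration only, not SYSTRAN’s actual data format or lookup procedure: the dictionary structures, the function name `options`, and the category labels are assumptions made for the example, while the words and Chinese translations are those cited above.

```python
# General-language dictionary: each word has several (category, translation)
# options, so the system must disambiguate. Structure is illustrative only.
GENERAL = {
    "light": [("noun", "光"), ("adjective", "轻"), ("verb", "照亮")],
    "surface": [("noun", "表面"), ("verb", "浮现")],
    "wave": [("noun", "波"), ("noun", "挥手")],
    "can": [("modal", "能"), ("noun", "罐头")],
    "laser": [("noun", "激光"), ("noun", "激光器")],
}

# Hypothetical optics glossary: one category and one translation per entry.
OPTICS = {
    "light": ("noun", "光"),
    "surface": ("noun", "表面"),
    "wave": ("noun", "波"),
    "can": ("modal", "能"),
}

def options(word, glossary=None):
    """Return the candidate (category, translation) pairs for a word.

    A glossary entry, when present, replaces all general-language options.
    """
    if glossary and word in glossary:
        return [glossary[word]]
    return GENERAL.get(word, [])

print(options("light"))          # three competing options without a glossary
print(options("light", OPTICS))  # a single option once the glossary applies
print(options("laser", OPTICS))  # still two options: no glossary entry
```

The last call reflects the point made about “laser”: where the sublanguage itself keeps two senses apart, a single fixed glossary entry cannot remove the ambiguity.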

2.3.3 Customization

The reduced ambiguity of a sublanguage discussed above could help MT to produce better translation, but, as mentioned, the system needs to be customized accordingly. As Chan (2015, p. 59) states, “customizing a general-purpose machine translation system is an effective way to improve MT quality”.

To what extent the system should be customized, of course, depends on the goals of translation, the circumstances, and the ST (ibid), consistent with the functionalist stances mentioned above. The customization described by Chan is related not only to Machine Translation, but more broadly to translation technology, including Computer-Aided Translation. It includes editorial, language, lexicographical, linguistic and resource aspects, among which this case study focuses on linguistic customization. Linguistic customization includes two aspects: lexical and syntactical. Lexical customization is directly intended for solving the problems of lexical ambiguity mentioned above. These ambiguities can pose significant problems for MT if it uses the system dictionary alone. The preparation of a customized dictionary as an additional resource, perhaps a preferred one, would therefore remove the uncertainties when the system has to decide on ambiguous words or word combinations. As will be illustrated in the next chapter, this is especially effective for an MT system which adopts a rule-based approach (RBMT) to a certain extent. In this respect, it is important to note that customized dictionaries are not only effective for individual words, but also largely influence issues of syntax, as can also be seen in Chapter 3. This is mainly because of the resolution of categorial ambiguities, but it is also worth mentioning that systems such as SYSTRAN allow a considerable extent of customization on lexical items, including indicating expressions and adding contextual information for each translation option of the word in question. Syntactical customization, accordingly, deals with the problems resulting from issues of syntax. The user can add sentences or phrases to the database to help the system translate future instances. While this is important in many ways, the case study for this thesis focuses on lexical customization.
As mentioned previously, the way the MT system is customized is consistent with the skopos (see 2.2.3). The skopos determines how to judge whether a lexical item’s translation is appropriate in the discussion of polysemy. For example, in the following phrase the translations for “generation”, “second-harmonic” and “quantum cascade laser” are considered by this case study as erroneous in terms of polysemy (see also 3.4.6 below):

AlGaAs guided-wave second-harmonic generation at 2.23 µm from a quantum cascade laser

在 2.23µm的 AlGaAs 引导波浪第二泛音一代从量子小瀑布激光

(Here, the translations for these items should be “生成”, “二次谐波”, and “量子级联激光器”, respectively.)

By contrast, in the case of “parameter” as “参数” or “参量” (as in the sentence below), either translation is considered appropriate because it does not influence proper understanding.

The design prescription is verified by examples showing reconstruction error versus controlled parameter.

In the same manner, the distinction between the translations of the item “show” as “展示” or “显示” is also neglected because it does not convey core information (see below).

The correction results show that ICA is a powerful correction algorithm for static or slowly changing phase aberrations in optical systems, such as solid-state lasers.


2.4 Summary

This chapter has outlined the theoretical perspectives which guide the case study of translating academic abstracts with MT, while arguing for a functionalist, dynamic, and target-oriented stance regarding the concept of translation. After reviewing the viewpoints from three communities (MT, translation, and domain experts), it focuses on the gap between them and seeks aspects of the functionalist paradigm of Translation Studies which can be applied to MT in general, and specifically to machine-translating scientific abstracts. While the equivalence paradigm has been criticized by the functionalist school, the two paradigms in fact share the point that translation is first and foremost a communicative act. The model of information transmission proposed in information theory is the ground for both paradigms, where the concept of information is largely mathematical. In translation, especially if the TT function is informative, information translation is important and can avoid many complicated issues of quality assessment, because the reader needs only the referential content rather than the form. If this is the purpose, or skopos, of the text recipient, the guiding principle should be based on informativity, with which the case of abstract translation for assimilation is perfectly compatible. It is important that for the translation of abstracts in this study, the skopos is not to replicate an equivalent text function, but to produce the informative function regardless of the other functions. While an abstract can have some aspects of expressiveness (as an advertisement for the article), the skopos denotes a change in its function into a purely informative one.
Regarding information translation, the concept of implicit information is also useful because, as will be shown in Chapter 3, the case study considers a translation problem unimportant if the same piece of information can be safely found elsewhere in the TT, or if the information can be readily inferred with some basic knowledge of the subject domain. Information is also discussed and defined. Perhaps an even more important issue is customization. Customizing MT for better output is related to the difficulties it encounters in dealing with human language, where the issue of ambiguity is in many ways crucial. Therefore, this chapter proceeds to a discussion of relevant aspects, including sublanguage and controlled language, both of which contain considerably fewer ambiguities than general language, consistent with the features of academic abstracts. Then the types of ambiguities are illustrated, in terms of lexis, semantics, and syntax. All these discussions provide the basis for customization, in particular lexical customization, because its aim is to resolve ambiguity, while the customization and its evaluation are in line with the general perspectives of functionalism illustrated above.


Chapter 3 Case Study

This chapter illustrates a case study for the issues mentioned above, using SYSTRAN and Applied Optics. SYSTRAN is a renowned MT system which adopts primarily a rule-based transfer approach, but to some extent it has recently been combining RBMT and statistical methods⁵, describing itself as a “hybrid” system. The rule-based approach is reliable and consistent, but can be somewhat rigid, while statistical MT is data-driven and empirical and can generate output of better fluency, but is not always reliable. Although the latter seems to have become the mainstream technical solution in the recent decade or so (Goutte, 2009), the advantage of combining the two has been widely acknowledged. Specifically, SYSTRAN has been among the most advanced and popular systems for decades, particularly in Europe. It is employed by numerous institutions of various kinds: international enterprises such as Symantec, Cisco, and European Aeronautic Defence and Space Company; Internet portals including Yahoo!®, Lycos®, and AltaVista™; and public agencies like the US Intelligence Community and the European Commission (EC) (Tatsumi, 2010, p. 7). Some companies, such as Symantec, even find SYSTRAN “the only practical option to satisfy their translation quality requirement”, despite the increasing popularity of purely statistical MT systems (ibid). As will be discussed below, another reason for using SYSTRAN for this case study is the significant extent of customization which the system allows for the specific features of the ST and the needs of the user, including customized glossaries, language models, etc.

⁵ Statistical MT systems translate the input texts by applying the statistical models which are built on the basis of a large collection of language resources available to the system. A typical example of this is Google Translate, which is perhaps the most popular system to the public.


Here, it is worth mentioning that the core components of SYSTRAN are largely rule-based, and that for the system’s RBMT process, glossary entries and grammatical rules are vital. This study uses SYSTRAN 7 Business Translator, and focuses on glossary entries rather than grammatical rules. What is very convenient about using SYSTRAN in this regard is that user-defined dictionaries for the system are easy to build. “Categories” for an entry can be not only part-of-speech items such as verbs, nouns, adjectives, etc., but also sequences (to describe clusters of words as a segment). As can be safely assumed, one feature of the ST is the highly repetitive occurrence of certain word clusters that are used in an almost fixed and unambiguous manner, with a one-to-one translation correspondence. Adding the clusters into the glossary entries (as “sequences” or other grammatical functions) would reduce the parsing complexity and thus produce correct and consistent translation. Applied Optics is a journal published by the Optical Society of America on a biweekly basis. Its reputation has been widely recognized in the optics research community, with a large number of readers around the world. It is also updated very frequently. This not only indicates the worthiness of this investigation in the practical sense, but also provides abundant text materials for analysis on a larger scale. A corpus containing all the abstracts of this journal in 2014 has been compiled (1,328 abstracts in total), together with another two corpora consisting of abstracts of articles published in the same year in, respectively, Optics Express and Optics Letters, also by the Optical Society of America. Altogether the three corpora contain 4,744 abstracts, 20,675 word types and 822,230 word tokens. As will be illustrated below, a sample of Applied Optics is first used for a small experiment of fully automatic translation in SYSTRAN, involving detailed analysis and customization.
The sample consists of seven abstracts that are selected by systematic sampling: all the abstracts from the 2014 volume of this journal are arranged in alphabetical order according to first-author names (surnames first), and the 2nd, 202nd, 402nd, 602nd, 802nd, 1002nd, and 1202nd abstracts are then chosen as the sample. This process of arrangement and selection is completed with a bibliography management tool named EndNote. The sampling is conducted in this manner in order to avoid biases in the investigation, so that the results would be more representative of the abstracts of this journal in general.
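The selection procedure described above can be sketched in a few lines. This is a minimal illustration, assuming the abstracts are already sorted by first-author surname; the placeholder records and the function name `systematic_sample` are hypothetical, and in the study itself the arrangement and selection were done in the bibliography management tool rather than in code.

```python
def systematic_sample(items, start, step):
    """Return every `step`-th item, beginning with the `start`-th (1-based)."""
    return items[start - 1::step]

# Placeholder records standing in for the 1,328 abstracts of the 2014 volume,
# assumed to be already sorted alphabetically by first-author surname.
abstracts = [f"abstract_{i}" for i in range(1, 1329)]

# Start at the 2nd abstract and take every 200th one thereafter.
sample = systematic_sample(abstracts, start=2, step=200)
print(len(sample))  # 7
print(sample)       # abstract_2, abstract_202, ..., abstract_1202
```

With 1,328 items, the next position after the 1202nd would be the 1402nd, which does not exist, so the procedure yields exactly seven abstracts.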


These selected abstracts are translated in SYSTRAN, with a close examination of which words are assigned to the wrong categories, or translated into the wrong equivalents. This is a discussion of the lexical disambiguation results from the MT system. An error analysis is then conducted accordingly, resulting in a list of words or clusters that are worth adding to the user-defined glossary. The relevant items are entered with corresponding categories and translations, with the priority of each set to “1”, informing SYSTRAN to prioritize the modified glossary and disregard any alternatives other than the modified entries. Presumably, the modified glossary should result in better translation in SYSTRAN. This is subsequently verified by comparing the initial translation and the improved output, with a focus not only on the lexis but, more importantly, on the syntax, to investigate the effectiveness of the lexical customization: whether and to what extent the modified glossary can help produce better MT results. Essentially, this is about translation quality assessment (TQA), a considerably controversial and complicated issue. Regarding the assessment here, two points are of much importance: 1) the process will be for the most part an evaluation of the relative, rather than absolute, quality of the customized MT; 2) the discussion in this part of the case study is focused on lexical and syntactic disambiguation results, together with a perspective which is consistent with the functionalist definition of quality. It is also important to mention that although the sample for the experiment is relatively small in size, the discussion is detailed and exploratory. At the same time, the systematic sampling method is considerably helpful for avoiding biases.
However, in order to truly eliminate the side-effects from the sampling, further analyses are conducted in view of the entire volume of Applied Optics, together with the same analyses in the other two journals and in the three corpora combined, as can be seen in Chapter 4. This answers the question whether the analysis on the sample is dependable in respect of the entire journal as a whole. Here, a corpus-based investigation would, from a wider perspective, help to shed light on the representativeness of the issues discussed on the sample, and to provide justification for the customization conducted in the experiment.


3.1 SYSTRAN

3.1.1 Transfer Architecture

The software used in this case study is SYSTRAN, the core component of which is a rule-based transfer architecture consisting of three modules: analysis, transfer and generation. The transfer approach has proven to be relatively realistic, viable and accurate: much more sophisticated than the naïve word-for-word approach and more practical than the ambitious and challenging interlingual approach (Aranberri Monasterio, 2010). Therefore, during the evolution of MT, it is the transfer architecture that has survived as commercial Rule-based Machine Translation (RBMT) systems (ibid). A classical transfer architecture such as SYSTRAN’s is given below in Figure 3.1 (picture taken from Surcin, Lange, & Senellart, 2007).

While early proposals of MT were largely based on a direct replacement of the ST words with equivalents in a bilingual glossary, it was found to be too simplistic (e.g. Arnold, Balkan, Meijer, Humphreys, & Sadler, 1994). An adequate degree of syntactic analysis is a necessity for a system to really work satisfactorily. In the ideal situation, there would be an abstract and language-independent representation (i.e. interlingua) to facilitate the translation process, generating the target sentence from pure semantic and rhetorical information (i.e. the interlingual approach), but this proved to be far more challenging than practical. Consequently, research in MT has led to a third, more viable approach which involved an intermediary level of abstraction, with much of the syntactic information of the source sentence. This abstraction is then transferred to a TT-dependent intermediary structure (Arnold et al., 1994).
In the transfer architecture (see Figure 3.1), analysis transforms the surface structure of the source sentence into an abstract representation regarding the Source Language (SL), transfer maps this representation to a Target Language (TL)-dependent representation, and generation transforms the mapped representation into the surface structure of the target sentence.
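The three stages described above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions: the `Node` structure, the two lexicons and the function names are hypothetical, the rendering of “optical” as “光学” is an illustrative choice, and nothing here reflects SYSTRAN’s actual internal representation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One word in an intermediate representation (hypothetical structure)."""
    form: str
    pos: str


# Toy lexicons invented for this sketch; they are not SYSTRAN resources.
POS_LEXICON = {"optical": "adj", "structure": "noun"}
TRANSFER_LEXICON = {("optical", "adj"): "光学", ("structure", "noun"): "结构"}


def analyse(sentence):
    """Analysis: SL surface string -> SL-dependent representation."""
    return [Node(w, POS_LEXICON.get(w, "unknown")) for w in sentence.lower().split()]


def transfer(sl_nodes):
    """Transfer: map each SL node to a TL node, keyed on form and category."""
    return [Node(TRANSFER_LEXICON.get((n.form, n.pos), n.form), n.pos) for n in sl_nodes]


def generate(tl_nodes):
    """Generation: TL representation -> TL surface string (Chinese uses no spaces)."""
    return "".join(n.form for n in tl_nodes)


def translate(sentence):
    """Run the three stages in sequence."""
    return generate(transfer(analyse(sentence)))


print(translate("optical structure"))
```

Because the transfer lexicon is keyed on both form and category, a wrong category assigned during analysis propagates to the later stages, which is why the POS errors discussed later in this chapter matter beyond the individual words concerned.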


Figure 3.1 –– Transfer Architecture

Specifically, SYSTRAN divides the generation process into synthesis and rearrangement, where synthesis is dependent on the TL information and rearrangement on information from both the SL and TL (ibid), as shown in the following figure (picture from Surcin et al., 2007).

Figure 3.2 –– SYSTRAN’s Architecture

In the discussion of translation output below (see Section 3.4), it will be apparent that the combination of synthesis and rearrangement in SYSTRAN’s architecture is visibly effective in generating proper TT syntax, especially regarding the order of words or sequences. This is crucially important for translating into a paratactic language like Chinese. It is also effective for the customization, as the modification of the POS of a very small number of items would have a positive effect well beyond the words concerned (see 3.4 below). According to SYSTRAN developers, the analysis module in this architecture takes up 80% of the code, with transfer accounting for 10%, synthesis for 5%, and rearrangement for 5% (Aranberri Monasterio, 2010; Surcin et al., 2007). This not only highlights the importance of ST analysis for adequate translation quality, but also enables the system to yield, as will be shown below, decent results of grammatical disambiguation, particularly regarding part-of-speech (POS), phrasal boundary and clause dependency (see 3.2 and 3.4).

3.1.2 Customization in SYSTRAN

SYSTRAN offers many convenient customization techniques which “help contextualize the system and improve its translation quality” (Aranberri Monasterio, 2010, p. 5). On SYSTRAN’s webpage⁶, for example, its customization methodology is described as combining “the latest linguistic and statistical techniques” to train the software to customer domains, and allowing users to “fine-tune translations to achieve publishable translation quality results”. The customization here mainly refers to leveraging existing language assets (e.g. monolingual/bilingual texts, Translation Memories, terminology data) to create domain-specific or project-specific dictionaries and translation models.

Here, the compilation of user dictionaries (UD) is one important aspect of customizing the MT for the domain or project in question. SYSTRAN’s UD can contain three types of entries: “Multilingual”⁷ terms, “Do Not Translate” (DNT) terms, and “Source Category” terms. DNT entries inform SYSTRAN what should not be translated, and Source Category terms are “used to specify the grammatical category of source terms without specifying their translations”. For Multilingual items, each entry is associated with such information as Source Language, Target Language, Category and Priority. After inserting the terms and relevant information, SYSTRAN automatically codes the UD and generates a confidence score for the user to ensure that errors are avoided. Perhaps more important is the extent of customization which allows the user to add additional linguistic information, including context, and files which contain more sophisticated resources for its disambiguation.

⁶ Retrieved 15 Jan 2016 from http://www.systransoft.com/systran/corporate-profile/translationtechnology/systran-customization-methodology/
⁷ All quoted expressions and sentences in this paragraph are from the user manual for SYSTRAN.

In addition, the process of customization can also be automatic, which includes 1). the creation of dictionaries and 2). the building of translation models, from existing data, i.e. language assets such as monolingual or bilingual texts, Translation Memories, and terminology resources. The dictionaries facilitate the system in properly parsing the ST sentences (based on information of “grammatical categories”⁸ for the entry items) and in choosing the most appropriate translation equivalents for words and expressions, while the translation models would help normalize or “automatically post-edit” the TT before presenting the final output to the user. For 1), a set of bilingual texts can be used by SYSTRAN to automatically align the words and phrases between the two language resources and to extract translation equivalents for words and expressions, resulting in two “Wizard User Dictionaries” (WUD) which are tailored to the specific bilingual resources provided by the user. One of the two WUDs contains what SYSTRAN calls “Source Category” entries, i.e. entries which do not contain translation equivalents in the target language, “serving instead to influence the analysis of source sentences for words with POS ambiguities”, while the other contains all other types of entries, including “Do Not Translate” (DNT) items and “normal” entries. If the resource is monolingual, the system would extract a WUD containing only “Source Category” entries.
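To make the three UD entry types and the role of Priority more concrete, the following is a minimal sketch of how such entries could be modelled. The `UDEntry` class, its field names and the `select_entry` helper are hypothetical and do not reproduce SYSTRAN’s dictionary format; the assumption that a lower Priority number wins follows the description of priority “1” above, and the translation “提供” and the default priority value are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UDEntry:
    """Hypothetical model of one user-dictionary entry."""
    source: str
    category: str
    target: Optional[str] = None   # None for "Source Category" entries
    priority: int = 1              # assumption: 1 is the highest priority
    kind: str = "multilingual"     # "multilingual", "dnt" or "source_category"


def select_entry(word, entries):
    """Return the entry for `word` with the highest priority (lowest number)."""
    candidates = [e for e in entries if e.source == word]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.priority)


entries = [
    # Default-glossary reading of "offer" as a noun (priority value is invented).
    UDEntry("offer", "noun", "提议", priority=5),
    # User entry restricting "offer" to a verb (translation is illustrative).
    UDEntry("offer", "verb", "提供", priority=1),
    # A DNT entry: the term is to be left untranslated.
    UDEntry("AlGaAs", "proper noun", None, kind="dnt"),
]

chosen = select_entry("offer", entries)
print(chosen.category, chosen.target)
```

Under these assumptions, the user entry overrides the default reading, which is the effect the glossary modification in this case study relies on.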

⁸ All quoted expressions and sentences in this paragraph are from the user manual for SYSTRAN.

3.1.3 Investigation of the Output

In this case study, a further convenience of using SYSTRAN is that part of its translation process can be observed, for the following reasons. 1). The software shows the ST-TT alignment at both the sentence and the word levels, as illustrated in Figure 3.3 below, so that each translated Chinese word can be traced back to its source (see the word “structure” and “结构” below), and each auxiliary character which is added by the system (e.g. “的”) can be recognized.

Figure 3.3 –– SYSTRAN’s Output

2). Lexical items which are problematic for the system can be highlighted in different ways, as also shown in Figure 3.3. They include words that are not found in SYSTRAN’s glossary (the red ones in Figure 3.3) and words which are ambiguous (the blue ones) in terms of POS (or grammatical category) and multiple meanings (or multiple translations). Glossary look-up results can also be shown by the side, displaying all the active records regarding a specific word. As Figure 3.3 shows, when we click on the word “structure”, the system shows all the entries of this item in different dictionary files. This would be particularly helpful for analyzing those ambiguous items.

3). Regarding each instance of lexical ambiguity, we can detect which specific decision the system has made, by looking at the context menu after right-clicking on the ambiguous word, as shown in Figure 3.4 below. In the example here, the word “offer” is ambiguous in terms of both POS and translation, as can be seen in the SYSTRAN dictionary on the right side (see Figure 3.4); the context menu reveals that SYSTRAN has chosen “noun” for its category and “提议” for the translation. The context menu works for all the other items that are not highlighted as well. This allows us to partially analyze how each translation is produced in SYSTRAN’s output.

Figure 3.4 –– SYSTRAN’s Lexical Choice

As an example, consider the above sentence highlighted in Figure 3.4:

EN: Nanostructured [n] materials [n] offer [n] great [adj] prospects [n] in [prep] helping [v] solar-energy [n] harvesting [v] devices [n] to [auto] achieve [v] their [prep] envisioned [v] performances [n].

CN: 在 [in] 收获 [harvesting] 设备 [devices] 的 帮助 [helping] 的 太阳能 [solar-energy] 的 Nanostructured*** [Nanostructured] 材料 [materials] 提议 [offer; 提议/聘用] 巨大 [great] 远景 [prospects; 远景/潜在客户] 完成 [achieve] 他们 [their] 的 被构想 [envisioned] 的 表现 [performances; 表现/性能]。


Here, the square brackets in the English sentence indicate the categories that SYSTRAN has assigned to the words; those in the Chinese sentence indicate the source of the TT word, together with the multiple translations in SYSTRAN’s glossary if any. The triple-asterisk “***” means the item is not found in the active glossary files. This would not only be helpful for the lexical analysis in terms of category and polysemy, but largely for the syntactic discussion as well, as the sentence can then be illustrated in the following manner:

Figure 3.5 –– Analysis of the Translation

From this figure, it can be seen where each translation error comes from with regard to semantic and syntactic ambiguity: the main verb in the English source sentence, “offer”, is wrongly recognized as a noun; the prepositional phrases “in helping…” and “to achieve…” are attached to the wrong head words; the “-ing” words (“harvesting” and “helping”) are both recognized as modifiers to the following noun, whereas “helping” is in fact not functioning as a noun modifier. This gives us some insight as to how each error occurs, what measures can be taken to improve SYSTRAN’s translation, and more importantly whether a customized output results in improvement of the translation. For example, the analysis of this particular sentence indicates that restricting the glossary records of “offer” to only a verb might be a solution to the problematic parsing, under the condition that this word is used only as a verb in such texts. As can be seen in Section 3.4, such analysis would also provide much basis for the comparison between the initial output and the customized output.
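The bracket notation used for the English sentence lends itself to mechanical checking. The sketch below, offered only as an illustration, parses the annotated sentence quoted from Figure 3.4 into (word, category) pairs and flags words whose assigned category differs from an expected one. The function names and the `expected` mapping are hypothetical; the annotation string is copied from the example above.

```python
import re

# Annotated source sentence as quoted from Figure 3.4.
ANNOTATED = (
    "Nanostructured [n] materials [n] offer [n] great [adj] prospects [n] "
    "in [prep] helping [v] solar-energy [n] harvesting [v] devices [n] "
    "to [auto] achieve [v] their [prep] envisioned [v] performances [n]."
)

# A word (no whitespace), then whitespace, then a bracketed category label.
TOKEN_RE = re.compile(r"(\S+)\s+\[([^\]]+)\]")


def parse_annotated(text):
    """Return a list of (word, assigned_category) pairs."""
    return TOKEN_RE.findall(text)


def find_mismatches(pairs, expected):
    """List (word, assigned, expected) where the assigned category is wrong."""
    return [
        (word, assigned, expected[word])
        for word, assigned in pairs
        if word in expected and expected[word] != assigned
    ]


pairs = parse_annotated(ANNOTATED)
# Expected category supplied by the analyst: "offer" is the main verb here.
print(find_mismatches(pairs, {"offer": "v"}))
```

A list of such mismatches, collected over the whole sample, is essentially what the error analysis feeds into the glossary modification.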


3.2 Initial Translation: Lexical ambiguity

The following illustrates the seven abstracts selected and the relevant output from SYSTRAN in the initial run. The results can be found in Appendix D, which also shows the ST and the improved translation after the system has been customized with a very limited number of glossary entries (see Section 3.3). It is obvious that for all of these abstracts, the initial translation contains numerous errors and is (like the many MT usability investigations shown in the previous chapter) far from satisfactory. However, what can be seen from these errors are ways of improvement, and an insight into the aspects of restriction in the language. What this case study aims to show is that most of the problems are actually solvable by customizing the system to the ST, and that once customized, the quality can be very high.

3.2.1 Not Found Words (NFW)

Most striking in the MT output is that quite a few words are left untranslated. This is because of the insufficiency of the default technical glossary for the ST, and a closer look at such untranslated words as “nanotube” and “nanostructured” reveals that 1). they are highly technical and discipline-bound, i.e. specific to special texts; and 2). they have fixed Chinese translations (“纳米管”, “纳米结构”). In other words, these are the so-called terminus technicus items (cf. Cabré, 1999) with natural equivalence in translation (see Chapter 2). Therefore, it is safe to assume that the translations and grammatical categories of these items are fixed in the text. The list of such items is shown in List 3.1 and List 3.2 below.⁹

It is worth mentioning that some of the NFW items are hyphenated compounds. SYSTRAN highlights every compound if it as a whole is not included in any of the active dictionaries, but there are items whose components are included. For example, in the phrase “five-dimensional (position-angle-spectra) data cubes”, both of the hyphenated terms, “five-dimensional” and “position-angle-spectra”, are highlighted as NFWs because none of SYSTRAN’s active glossaries includes these terms. However, the words “five”, “dimensional”, “position”, “angle” and “spectra” are all contained in the glossary. What is particularly interesting is that in the glossary, the translation of “dimensional” (“尺寸”) is not consistent with how it should be translated in “five-dimensional” as a whole (“维”), while for all of the others, the translations are exactly consistent. In either case the system’s output gives a translation, one way or another.

For the list below, the hyphenated terms are only listed if SYSTRAN’s translations, or the categories assigned to them, are wrong, or if part of the hyphenated item is not included in the glossary. In other words, those instances where the hyphenated terms are assigned the correct categories and translations are not listed (e.g. “position-angle-spectra”/“位置角度光谱”). People’s names, chemical symbols, abbreviations and units are listed in English.

⁹ The way in which the categories of these terms are designated is described later in Section 4.1.

English               Translation    Category
nanostructured        纳米结构       adjective
nanotube              纳米管         noun
drift-diffusion       漂移-扩散      noun
five-dimensional      五维           adjective
hyperspectral         高光谱         adjective
amplitude-only        纯振幅         adjective
high-earth            远地           adjective
homodyne              同步检波       verb
near-field            近场           adjective
guided-wave           导波           noun
second-harmonic       二次谐波       noun
phase-matching        相位匹配       adjective
proof-of-principle    概念验证       adjective
intracavity           腔内           adjective
reinjection           重新注入       noun
sensorless            无传感器       adjective

List 3.1 –– NFW: Terms


English               Translation       Category
CNT                   CNT               acronym
CGH                   CGH               acronym
AlGaAs                AlGaAs            proper noun
Acket                 Acket             proper noun
mm3                   mm3               proper noun
Lang-Kobayashi        Lang-Kobayashi    proper noun
FM                    FM                acronym
Strehl                Strehl            proper noun
SPGD                  SPGD              acronym

List 3.2 –– NFW: Abbreviations, chemical symbols, units and names
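The treatment of hyphenated compounds discussed in this subsection can be illustrated with a small look-up sketch. It is an assumption-laden toy, not a description of SYSTRAN’s algorithm: the fallback order (user dictionary, whole-compound glossary entry, then component-by-component composition), the function name, and the miniature glossary are all hypothetical. The component translations are taken or inferred from the examples in the text.

```python
# Miniature default glossary; "dimensional" carries the inconsistent
# translation 尺寸 noted in the text.
GLOSSARY = {
    "five": "五",
    "dimensional": "尺寸",
    "position": "位置",
    "angle": "角度",
    "spectra": "光谱",
}


def lookup_compound(term, glossary, user_dict):
    """Return (translation, origin) for a possibly hyphenated term."""
    if term in user_dict:                      # customized entry wins
        return user_dict[term], "user dictionary"
    if term in glossary:                       # whole term is known
        return glossary[term], "glossary"
    parts = term.split("-")
    if all(part in glossary for part in parts):
        # Compose the translation from the components, in source order.
        return "".join(glossary[part] for part in parts), "components"
    return term, "not found"                   # left untranslated


print(lookup_compound("five-dimensional", GLOSSARY, {}))
print(lookup_compound("five-dimensional", GLOSSARY, {"five-dimensional": "五维"}))
print(lookup_compound("position-angle-spectra", GLOSSARY, {}))
```

In this toy setting, composing “five-dimensional” from its parts yields the inconsistent “五尺寸”, whereas a single user-dictionary entry restores “五维”; “position-angle-spectra” composes correctly and therefore needs no entry, which mirrors the inclusion criterion used for List 3.1.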

3.2.2 Categorial Ambiguity

All of the words highlighted by SYSTRAN as “source ambiguities”, together with their frequencies, categories¹⁰, and SYSTRAN’s disambiguation results, are listed in Appendix A. All these listed items are intrinsically affiliated with multiple grammatical categories, mostly as nouns, verbs and adjectives. In case the highlighting of SYSTRAN for “source ambiguities” might not be exhaustive, the list was subsequently complemented by looking at all the words one by one in the sample, and checking, respectively, their grammatical categories in the corresponding context, SYSTRAN’s disambiguation results, and the relevant information contained in the glossary. All of the words in the sample which are recognized as the wrong categories, together with those with multiple category entries in SYSTRAN’s glossary (see Figure 3.6 for examples), are added to the list in Appendix A.

¹⁰ For a detailed description of how the categories of these items are designated, see Section 4.1.


Figure 3.6 –– Examples for category ambiguity

In the finalized list (see Appendix A), the column “Word” refers to the ambiguous items, each followed by a number indicating its frequency in the sample. “Category” indicates the grammatical category as which the corresponding items function in the sample. The number after “Category” in each row, accordingly, refers to the number of times an item functions as such. “Assigned Category” means the category SYSTRAN has assigned to the corresponding item, followed by a remark of “correct” or “wrong”. In the same manner, the numbers here indicate the number of times SYSTRAN has assigned the item as such. The column “All Categories” lists all of the categories associated with the item in SYSTRAN’s glossary. It can be seen from the list that 1). the ambiguities in these abstracts are considerably restricted, and 2). SYSTRAN’s disambiguation seems rather satisfactory. This will be illustrated below in more detail.

ST restrictedness

As the list shows, there are 91 items in total, 33 of them occurring more than once. Some are even highly frequent to a remarkable extent, e.g. “can” and “show” (see Appendix A). Among these multiple-occurrence items, some are used as fixed POS categories, while some, though very few, are not. For example, “can” occurs seven times in the sample, all as verbs (i.e. fixed POS), as shown in Appendix A; on the other hand, “cloak” has 4 occurrences as nouns and 3 occurrences as verbs (i.e. flexible POS).
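The distinction between single-occurrence, fixed-POS and flexible-POS items can be stated as a short classification routine. The sketch below is illustrative only: the function name and the data layout are hypothetical, and the observations cover just the three items whose counts are given in this chapter (“can”, “cloak” and “design”), not the full Appendix A.

```python
from collections import Counter, defaultdict


def classify_items(observations):
    """Group (word, category) observations and label each word.

    Labels: "single-occurrence", "fixed POS" (several occurrences, one
    category) or "flexible POS" (several occurrences, several categories).
    """
    by_word = defaultdict(Counter)
    for word, category in observations:
        by_word[word][category] += 1

    result = {}
    for word, categories in by_word.items():
        total = sum(categories.values())
        if total == 1:
            label = "single-occurrence"
        elif len(categories) == 1:
            label = "fixed POS"
        else:
            label = "flexible POS"
        result[word] = (label, dict(categories))
    return result


# Counts as reported in the text for three items of the sample.
observations = (
    [("can", "verb")] * 7
    + [("cloak", "noun")] * 4 + [("cloak", "verb")] * 3
    + [("design", "noun")] * 3 + [("design", "verb")] * 2
)

for word, (label, counts) in classify_items(observations).items():
    print(word, label, counts)
```

Only items labelled “flexible POS” remain genuinely ambiguous in the sample; the others can be pinned to one category in the glossary.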

The number of single-occurrence items (i.e. items occurring only once in the sample), multiple-occurrence items (with fixed or flexible POS), the subtotals, and the corresponding frequencies are extracted and shown in the following Table 3.1. Regarding this table, it is important to note that the “flexible POS items” and “fixed POS items” refer to those multiple-occurrence items in Appendix A which are used with diversified (flexible) or restricted (fixed) POS in the sample. Thus, despite the denotation of “flexible” or “fixed”, they are all intrinsically (and as indicated in SYSTRAN’s glossary) affiliated with multiple POS categories. What is meaningful here in distinguishing the two is that if a category-ambiguous word appears in the sample only as one, fixed type of POS category, it should not be considered ambiguous in practice, and can be well avoided by restricting the glossary of the system.

Table 3.1 –– Categorial ambiguities

As the table shows, among the 91 ambiguous items in total, only 6 are used with multiple grammatical categories (i.e. genuinely ambiguous items); all of the remaining ones, though intrinsically ambiguous, in effect have fixed parts of speech in the sample texts. In other words, only about 7% (6 out of 91) of those theoretically ambiguous items are, in practice, genuinely ambiguous for these abstracts; the other 93% of the “ambiguous” words become unambiguous because of the sublanguage concerned (see Chart 3.1 below).


Chart 3.1 –– Ambiguous items in terms of word types

Considering the fact that the entire sample contains 437 word types in total, the 91 items that are deemed intrinsically ambiguous account for about 20%, and the 6 genuinely ambiguous items in this particular case are equivalent to as few as about 2% in the holistic sense, at an almost negligible level (see Chart 3.2 below). Ambiguity is significantly little in view of the complete sample.

Chart 3.2 — All items in terms of word types


In the list, 58 items occur only once each (i.e. hapax legomena), and 33 occur at least twice. As mentioned above, some of these 33 items are highly repetitive. What is revealing, with regard to lexical restriction, is that of the 33 multiple-occurrence items, the majority (as many as 27) are each used as only one POS. These 27 items are repetitive, yet repetitively restricted. In terms of occurrence, the 91 ambiguous items make up 174 occurrences in total, while the 6 genuinely ambiguous items correspond to a total of 29 occurrences (see Table 3.1). Counted by occurrence, the proportion of genuinely ambiguous words among the complete list of items therefore rises to 17% (see Chart 3.3 below). Since the sample contains 916 tokens as a whole, the above 20% (for “source ambiguities”) and 2% (for the genuinely ambiguous items in the sample) would correspond, respectively, to 19% and 3% in terms of tokens, which is also significantly low (see Chart 3.4).

Chart 3.3 — Ambiguous items in terms of word tokens


Chart 3.4 –– All items in terms of word tokens

In short, regarding the sample, only 2% of the word types (and 3% of the word tokens) are genuinely ambiguous in terms of grammatical category. This demonstrates that the sublanguage here is significantly constrained in terms of category ambiguity. Given the prevalence of category ambiguities in English (e.g. Hutchins & Somers, 1992), it may be surprising at first glance to see how small a proportion the ambiguous items make up in the sample. But as discussed above (see Chapter 2), scientific abstracts are terminology-dense, and academic writings tend to avoid ambiguity intentionally. Most of the terminological items, presumably, are very straightforward in their grammatical categories (if not all are nouns), which significantly reduces the extent of category ambiguity in the sample.

This section has illustrated that the sample has significantly little ambiguity in terms of the grammatical categories of its lexicon. While almost every lexical item in English can function as more than one category (cf. Hutchins & Somers, 1992), in this sample such ambiguity is reduced to almost a negligible level, largely because of the sublanguage concerned.
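The proportions quoted in this subsection follow directly from the counts reported above, and can be recomputed as a quick check. The snippet is only a worked calculation: the variable names are hypothetical, the counts are those stated in the text, and the results are printed to one decimal place, whereas the text quotes approximate whole percentages.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts reported in the text for category ambiguity.
AMBIGUOUS_ITEMS = 91        # intrinsically ambiguous word types
GENUINE_ITEMS = 6           # items used with more than one category
AMBIGUOUS_TOKENS = 174      # occurrences of the 91 items
GENUINE_TOKENS = 29         # occurrences of the 6 items
SAMPLE_TYPES = 437          # word types in the whole sample
SAMPLE_TOKENS = 916         # word tokens in the whole sample

print("genuine / listed items  :", pct(GENUINE_ITEMS, AMBIGUOUS_ITEMS))
print("listed items / all types:", pct(AMBIGUOUS_ITEMS, SAMPLE_TYPES))
print("genuine / all types     :", pct(GENUINE_ITEMS, SAMPLE_TYPES))
print("genuine / listed tokens :", pct(GENUINE_TOKENS, AMBIGUOUS_TOKENS))
print("listed / all tokens     :", pct(AMBIGUOUS_TOKENS, SAMPLE_TOKENS))
print("genuine / all tokens    :", pct(GENUINE_TOKENS, SAMPLE_TOKENS))
```

Keeping the type-based and token-based denominators apart in this way makes it easier to see which percentage belongs to which chart.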


SYSTRAN’s disambiguation

Appendix A also shows SYSTRAN’s disambiguation results regarding these 91 ambiguous items. The results can be briefly illustrated in the following Table 3.2. As can be seen from Appendix A, most of these categorially ambiguous items have been correctly recognized. Regarding the items which occur as more than one category in the sample, some are recognized correctly regardless of whether the item occurs as a noun, a verb, or another category. For example, “design” occurs 3 times as a noun and twice as a verb, but for all these instances SYSTRAN has successfully recognized the correct category. For some other items, SYSTRAN has partially recognized their categories. An example of this is “produce”, where of the two occurrences one is correct and the other is wrong (see Appendix A).

Among the 58 single-occurrence items, 49 were correctly recognized. Of the 33 items with multiple occurrences, 25 were recognized completely correctly, 3 partially correctly, and 5 completely wrongly. Overall, 74 of the 91 category-ambiguous words are assigned the correct category in every instance, a proportion of 81%. Perhaps more meaningful is the disambiguation result of those multiple-occurrence items that appear with diversified grammatical categories and thus remain ambiguous to a substantial extent. There are 6 items of this kind (see also Table 3.1), among which 5 were completely correctly disambiguated.

Table 3.2 –– Disambiguation results (in terms of number of items)

This shows a significantly high success rate in SYSTRAN’s disambiguation. The following Table 3.3 shows SYSTRAN’s disambiguation result in terms of occurrence. As can be seen, 49 instances of single-occurrence items, together with 94 instances of the multiple-occurrence items, are assigned with the correct POS. In total, they amount to 143 occurrences.

Table 3.3 –– Disambiguation results (in terms of occurrence)

A comparison of these numbers with those in Table 3.1 (see above) reveals that the disambiguation accuracy is 84% for single-occurrence items, 81% for the multiple-occurrence items and 82% in the overall sense. These numbers are very high, but what seems more revealing is that there are very few errors on the “flexible POS” items, as shown in both Table 3.2 and Table 3.3. This means that nearly all of the errors are on items which each appear with only one corresponding category. This also means that by modification of SYSTRAN’s glossary, nearly all these errors can be avoided.

What is particularly interesting here is that while the disambiguation accuracy rates of 84%, 81% and 82% are roughly at the same level, more specific calculations reveal a relatively sharp contrast between the rates of fixed and flexible POS items. As Table 3.3 shows, the disambiguation result for the “flexible POS” items (the genuinely ambiguous ones), in terms of occurrence, is extraordinarily high (96%), whereas the result for “fixed POS” items, which can somewhat be considered unambiguous in this case, is relatively very low (72%). As described above, there are 6 “flexible POS” items, corresponding to 29 occurrences in total (see Table 3.1). 28 out of these 29 occurrences are successfully disambiguated by SYSTRAN (see Table 3.3). This translates into a significantly high percentage: 96%. None of these items is recognized in the completely wrong manner (see Table 3.3). In theory, these items are presumably problematic for MT in this case, but SYSTRAN has handled them particularly well. The genuinely ambiguous items do not seem to pose a significant problem for SYSTRAN.


However, this does not contribute to an equivalently high result for the multiple-occurrence items in the overall sense. Out of the 116 occurrences, 94, or 81%, were correctly disambiguated. The percentage here does not seem large compared with the 96% above. As can be seen, there is a surprisingly low disambiguation accuracy for those “fixed POS” items, items which in theory should be easy to disambiguate. Among the 87 occurrences of such items (see Table 3.1), there are as many as 21 errors, in effect making up the vast majority of all the errors in POS disambiguation.

A close look at the errors reveals that almost all of them are concentrated on instances where the items are in fact considerably restricted. The 5 multiple-occurrence items that were completely wrongly recognized (corresponding to 18 occurrences, see Table 3.2 and Table 3.3) are all “fixed POS” items; out of the 22 instances of errors regarding multiple-occurrence items (see Table 3.3), as many as 21 are in the column for “fixed POS” items, items that are considerably restricted in this sample. This means that these 21 errors are all avoidable by customizing the system to this specific text. In essence, all of the errors that occur on items other than the “flexible POS” ones can be avoided in this case, and such errors amount to a total of 30 occurrences, or 97%, in the context of 31 overall errors of POS disambiguation (see Table 3.3). This means that the vast majority of them are in fact avoidable. Combining this with SYSTRAN’s significant success in the disambiguation of “flexible POS” items (see above), it is reasonable to assume that if SYSTRAN’s glossary entries for these “fixed POS” items are restricted to the corresponding categories as which they function in the text, SYSTRAN will be nearly 100% accurate in the POS disambiguation of this sample.
With the source disambiguation more accurate, it is also not hard to see the potential of much improvement in SYSTRAN’s translation output, though the initial translation is far from satisfactory.
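The accuracy rates and the share of avoidable errors discussed above can be re-derived from the occurrence counts given in the text. The following is a minimal worked calculation under that assumption; the dictionary layout and variable names are hypothetical, and only counts stated explicitly in this section are used.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Occurrence counts as reported in the text (see Table 3.1 and Table 3.3).
single = {"total": 58, "correct": 49}       # single-occurrence items
multiple = {"total": 116, "correct": 94}    # multiple-occurrence items
flexible = {"total": 29, "correct": 28}     # "flexible POS" subset of multiple

overall_total = single["total"] + multiple["total"]
overall_correct = single["correct"] + multiple["correct"]
overall_errors = overall_total - overall_correct

flexible_errors = flexible["total"] - flexible["correct"]
multiple_errors = multiple["total"] - multiple["correct"]
fixed_errors = multiple_errors - flexible_errors

# Errors on anything other than "flexible POS" items are treated as avoidable.
avoidable_errors = overall_errors - flexible_errors

print("single-occurrence accuracy  :", pct(single["correct"], single["total"]))
print("multiple-occurrence accuracy:", pct(multiple["correct"], multiple["total"]))
print("overall accuracy            :", pct(overall_correct, overall_total))
print("flexible-POS accuracy       :", pct(flexible["correct"], flexible["total"]))
print("avoidable share of errors   :", pct(avoidable_errors, overall_errors))
```

The derived totals (143 correct occurrences, 31 errors, 21 of them on “fixed POS” items and 30 avoidable) agree with the figures cited in the discussion.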

Summary

The above description has illustrated mainly three points concerning the selected abstracts:

1). Category ambiguity in the sample is considerably limited (3%),
2). SYSTRAN’s disambiguation result is considerably successful, and
3). It is possible to avoid nearly all of the disambiguation errors in the sample.

3.2.3 Polysemy

As mentioned above, in linguistics polysemy is often distinguished from homonymy; but regarding MT these two kinds of lexical ambiguity are seldom processed differently. Meanwhile, if these items are ambiguous only in their intrinsic semantic specification but not in terms of translation, they do not make much difference either, since the system ultimately outputs the translated text rather than a processing of the information in the ST. Therefore, in this case study, the “polysemous items” refer to all of the words which correspond to multiple translation equivalents.

Given the context of the study, the polysemous items to be discussed are those which have more than one translation in the glossary, or ones whose translations in SYSTRAN are different from the corresponding equivalents into which they should be translated. In essence, almost all of the words in this sample can have multiple translations, but similar to the issues of category ambiguity, this study focuses on items that are difficult for SYSTRAN to disambiguate, i.e. items which are ambiguous for SYSTRAN.

Meanwhile, it is obvious that items which are ambiguous in terms of grammatical category (i.e. those discussed above) might be semantically ambiguous as well. For example, “light” as a noun, adjective, or verb would have different meanings and translations (“光/轻的/点燃”). But some of these items do not really correspond to different translations in Chinese. An example could be “design”, as included in Appendix A. Either as a verb or as a noun, this item corresponds to the same Chinese translation in the glossary (i.e. “设计”). As such, the item is not considered polysemous in this study. Although they do affect SYSTRAN’s syntactic parsing of the source sentences, such items are dealt with in the section for “category ambiguity”.


All of the polysemous items are included in Appendix B. Here, although SYSTRAN can highlight what the system calls “alternative meanings”, this is found to be not exhaustive enough for the discussion here. It is discovered, after examining the highlighted items, that SYSTRAN only highlights those that have more than one translation under the assigned grammatical category. For a more objective and comprehensive discussion, the items were then examined and supplemented.

ST restrictedness

Similar to category ambiguity (see above), the number of polysemous items is rather small. In Appendix B there are 110 items in total, 42 of them occurring more than once. Some frequent items include “phase” and “laser”, together with frequent items which are also listed for category ambiguity (e.g. “show”). Among them, the vast majority are used with fixed meanings, and a very marginal few with flexible meanings (i.e. genuinely polysemous). The relevant figures are shown in the following Table 3.4. Again, regarding this table, the denotations “fixed meaning” and “flexible meaning” refer to those items in Appendix B that are used with one meaning, or with more than one meaning, in the sample. Intrinsically they are all polysemous on their own and in SYSTRAN’s glossary. The ones that are used with fixed meanings should not be much of a problem for MT if the system is adjusted to suit the sublanguage of the ST.

Table 3.4 –– Polysemy

As can be seen from the table, only 5 out of the 110 items in total are used with flexible meanings. All of the remaining 105 items, which are otherwise significantly ambiguous, are in effect very restricted in terms of polysemy. Similar to what is found in category ambiguity, this means that only about 4% of the theoretically ambiguous items are in practice ambiguous for these abstracts, and that 96% of the “ambiguous” words are in fact unambiguous in this sample (see Chart 3.5 below).

Chart 3.5 –– Polysemous items in terms of word types

In view of the entire sample, a total of 437 word types, the 110 items listed in Appendix B account for about 25%, while the 5 genuinely polysemous items are equivalent to only about 1% (see Chart 3.6 below). In the same way as category ambiguity, words that appear in the sample as polysemous are negligibly few.


Chart 3.6 — All items in terms of word types

As can be seen from Table 3.4, 68 items in the list are single-occurrence items while 42 are multiple-occurrence items (some of which are highly repetitive). Among the 42 repeated items the majority, as many as 37, are used with the same meaning each time they occur. No other meaning (of which there could otherwise be many) is involved for these items. They are repetitive, yet repetitively restricted.

In terms of occurrence, the 110 items correspond to a total of 199 occurrences, and the 5 genuinely polysemous items to 24 occurrences (see Table 3.4 above). This means that about 12% of the total occurrences of the listed items belong to genuinely polysemous items (see Chart 3.7).


Chart 3.7 — Ambiguous items in terms of word tokens

Of the 916 word tokens in the sample, the occurrences of the listed polysemous items account for 22%, while the occurrences of the genuinely polysemous items make up only 3%. These percentages are consistent with those for category ambiguity, and equally low (see Chart 3.8 below).

Chart 3.8 — All items in terms of word tokens

In short, only about 4% of the listed items (about 1% of all word types in the sample), corresponding to 3% of the word tokens, are genuinely ambiguous in terms of polysemy. This is consistent with the percentages regarding grammatical category (see above). Similar to category ambiguity, polysemy is equally — if not more — prevalent in English (e.g. Hutchins & Somers, 1992), with a vast majority of words having multiple meanings. This is even more so if translational ambiguities are taken into account: a “guide” would correspond to different Chinese equivalents depending on whether it is in the context of a tour guide (导游), of a device which guides light beams (光导), or of a device guiding waves in general (波导); similarly, the equivalents for “linewidth” differ depending on the domain of the sublanguage (线宽/行距), though in English the two senses are barely distinguishable semantically. There is thus a sharp contrast between how ambiguous the words intrinsically are and how small a proportion the genuinely ambiguous items make up in the sample.

This is largely due to the specific features of the sublanguage concerned — domain-specific academic publications. As can be seen from the list in Appendix B, most of the listed words are terminological items, together with a few general-purpose words which are used with very fixed meanings. Even for the items with flexible meanings (translations), patterns can be tentatively identified: for example, the word “phase” is only translated differently when it appears in the cluster “phase front” (波前), while in all other instances it is not ambiguous at all, referring to the parameter of a light wave (相位).
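The proportions quoted in this subsection can be re-derived from the raw counts reported in the text. The following is a minimal Python sketch of that arithmetic; the variable names and the helper `share` are illustrative and are not part of the study itself.

```python
# Counts reported in the text for the polysemy sample (Table 3.4, Charts 3.5-3.8).
LISTED_TYPES = 110      # items listed in Appendix B
FLEXIBLE_TYPES = 5      # listed items used with more than one meaning
ALL_TYPES = 437         # word types in the sample
LISTED_TOKENS = 199     # occurrences of the listed items
FLEXIBLE_TOKENS = 24    # occurrences of the flexible-meaning items
ALL_TOKENS = 916        # word tokens in the sample


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


proportions = {
    "flexible items / listed items": share(FLEXIBLE_TYPES, LISTED_TYPES),
    "listed items / all word types": share(LISTED_TYPES, ALL_TYPES),
    "flexible items / all word types": share(FLEXIBLE_TYPES, ALL_TYPES),
    "flexible occurrences / listed occurrences": share(FLEXIBLE_TOKENS, LISTED_TOKENS),
    "listed occurrences / all tokens": share(LISTED_TOKENS, ALL_TOKENS),
    "flexible occurrences / all tokens": share(FLEXIBLE_TOKENS, ALL_TOKENS),
}

for label, value in proportions.items():
    print(f"{label}: {value:.1f}%")
```

The unrounded values are close to the rounded figures quoted above; the sketch keeps the type-level and token-level denominators separate, which is the distinction the charts rely on.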

SYSTRAN’s disambiguation

The disambiguation results for these polysemous items in SYSTRAN are shown in the following tables and charts. As discussed below, the results differ sharply from those for category ambiguity (see 3.2.2 above): most of these items were wrongly translated.


Table 3.5 — Disambiguation results (in terms of number of items)

As can be seen from Table 3.5, among the 68 single-occurrence items, as many as 46 were translated into the wrong equivalents. Of the 42 multiple-occurrence items, only 9 were translated completely correctly; 27 were translated completely wrongly and 6 partially wrongly. Overall, only 31 of the 110 polysemous items were translated completely correctly — a proportion of 28%.

Compared with the highly successful disambiguation of category ambiguity (81%, see previous section), the 28% is rather small and problematic. This is even more so when one considers the multiple-occurrence items which have fixed meanings and are thus considerably restricted: only 8 of the 37 such items were translated correctly (see Table 3.5). This is in sharp contrast with the corresponding results for category ambiguity: while the vast majority of the categorially ambiguous items were correctly recognized, the opposite is true for polysemous items.

Table 3.6 shows SYSTRAN’s disambiguation results in terms of occurrence. 22 instances of single-occurrence items and 38 instances of multiple-occurrence items were correctly translated. In total, they amount to 60 correct occurrences (in contrast to a total of 139 errors).

Table 3.6 — Disambiguation results (in terms of occurrence)


In the same manner as in the previous section, a comparison of the numbers in Table 3.6 with those in Table 3.4 (see above) shows that the percentages of correctly translated occurrences are 32% for single-occurrence items, 29% for multiple-occurrence items, and 30% overall. As discussed in 3.2.2, the corresponding percentages for category ambiguity are 84%, 81%, and 82% respectively (see Table 3.3). Set against these, the 32%, 29% and 30% here reveal how prevalent the failure to disambiguate polysemous words is.

As can be seen from Table 3.6, the errors far outnumber the correctly translated items. SYSTRAN appears to have considerable difficulty in choosing the correct meanings for these polysemous items. On the other hand, given the success of SYSTRAN’s POS disambiguation, a modification of these items in the glossary should result in remarkable improvement in its translation.

A comparison among the specific numbers in Table 3.6 reveals a phenomenon similar to that discussed for category ambiguity: the instances where SYSTRAN is most likely to mistranslate the ST are in fact those where the words are significantly restricted. As can be seen in the table, the lowest percentage of correctly translated items is that of the “fixed-meaning items” (27%), though the contrast is not as sharp. Of the 139 errors, 78 involve fixed-meaning items, 15 flexible-meaning items, and 46 single-occurrence items. In other words, more than half of the errors concern fixed-meaning items, which are in effect not ambiguous at all in the ST concerned. This means that a modification of the glossary might avoid the vast majority of the mistranslations.
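As a cross-check of the occurrence-based figures above, the following Python sketch recomputes the share of correct translations per group and each group's share of the errors. The occurrence total for fixed-meaning items is not quoted directly in this passage; it is derived here as the overall total (199) minus the single-occurrence (68) and flexible-meaning (24) occurrences, which is an assumption of the sketch.

```python
TOTAL_OCCURRENCES = 199

# Occurrences per group; the fixed-meaning total is derived, not quoted.
occurrences = {"single-occurrence": 68, "flexible-meaning": 24}
occurrences["fixed-meaning"] = TOTAL_OCCURRENCES - sum(occurrences.values())

# Error counts per group as reported for Table 3.6.
errors = {"single-occurrence": 46, "fixed-meaning": 78, "flexible-meaning": 15}

total_errors = sum(errors.values())

# Percentage of occurrences translated correctly in each group.
accuracy = {
    group: 100.0 * (occurrences[group] - errors[group]) / occurrences[group]
    for group in errors
}

# Percentage of all errors that fall into each group.
error_share = {group: 100.0 * errors[group] / total_errors for group in errors}

overall_accuracy = 100.0 * (TOTAL_OCCURRENCES - total_errors) / TOTAL_OCCURRENCES

for group in errors:
    print(f"{group}: {accuracy[group]:.1f}% correct, "
          f"{error_share[group]:.1f}% of all errors")
print(f"overall: {overall_accuracy:.1f}% correct")
```

Under this derivation the fixed-meaning group comes out at roughly 27% correct and accounts for more than half of the 139 errors, in line with the discussion above.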

Summary

The above description has illustrated three main points concerning the selected abstracts: (1) polysemy in the sample is considerably limited (4%); (2) SYSTRAN’s disambiguation results are considerably disappointing; and (3) it is possible to avoid more than half of the disambiguation errors in the sample.


3.3 Lexical Error Analysis and Glossary Modification

3.3.1 Error Analysis

List C.1 in Appendix C shows all of the disambiguation errors in Appendix A. As can be seen, the 16 items in the list are all fixed-POS items. The list also includes some of the highly repetitive fixed-POS items, e.g. “can” and “show”. All of these items can be modified in SYSTRAN’s glossary and restricted to the corresponding categories, so that each time such items occur, SYSTRAN is able to recognize them correctly.

In the same manner, List C.2 in Appendix C shows the wrongly translated items in Appendix B. The list contains 79 items, the vast majority of which are fixed-meaning ones. For these fixed-meaning items, the glossary can be modified to produce the correct translation. As can be seen at the end of List C.2, however, there are four items involving some degree of ambiguity: “laser”, “source”, “phase”, and “free”. Each of these items occurs in the sample with more than one meaning, but patterns can be identified. A concordance search for these items in the sample reveals how they are used, as illustrated below.

As an example, the following is a search for the word “laser” in the ST sample. This is a very typical case of translational ambiguity: while essentially an acronym (i.e. “Light Amplification by Stimulated Emission of Radiation”), the word is often used to refer either to the amplified light beam or to the device which produces such light (largely because, etymologically speaking, the word ends with “-er”). In Chinese, however, the light beam and the device generally have different names (激光/激光器). In this sample, a concordance search for “laser” reveals that the word is used mostly in phrases, as shown below. In each of the phrases, the translation of “laser” is fixed. This means that this kind of ambiguity can be resolved by entering the entire phrase as an entry in SYSTRAN’s glossary.


“laser”:
quantum cascade laser       量子级联激光器
semiconductor laser         半导体激光器
solid-state laser system    固态激光系统
solid-state laser           固态激光器

Figure 3.7 –– Concordance for “laser” in the sample

It is worth mentioning that the concordance search here covers the ST sample only, which is a small fraction of the entire corpus of Applied Optics. For the patterns to be representative, a search in a larger text source is needed; the search here is merely a preliminary, exploratory step, and a more comprehensive concordance search in the entire corpus is discussed in Chapter 4.

In the same manner as for “laser”, the other three words can also be searched. Patterns can be identified for each of them (see List C.2), even in this very small sample. This means that the errors concerning polysemous items can be largely avoided by modifying the glossary entries in SYSTRAN, just as discussed above concerning category ambiguity.
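A concordance search of the kind described here can be approximated with a simple keyword-in-context routine. The following Python sketch is illustrative only: the function `concordance`, the tokenization rule and the sample sentences are assumptions made for the example, and do not reproduce the concordancing tool or the ST sample used in this study.

```python
import re


def concordance(text: str, keyword: str, width: int = 2) -> list:
    """Return each occurrence of `keyword` with up to `width` words of context
    on either side, with the keyword itself marked by square brackets."""
    # Keep hyphenated terms such as "solid-state" together as one token.
    tokens = re.findall(r"[\w-]+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [f"[{token}]"] + right))
    return lines


# Hypothetical sentences written for this sketch (not taken from the corpus).
SAMPLE = (
    "We report a quantum cascade laser with narrow linewidth. "
    "A solid-state laser system is compared with a semiconductor laser."
)

for line in concordance(SAMPLE, "laser"):
    print(line)
```

Listing the left-hand context in this way makes recurrent clusters such as “quantum cascade laser” or “solid-state laser” easy to spot, which is the basis for entering whole phrases into the glossary.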

3.3.2 Glossary Modification in SYSTRAN

The above analysis has covered items of category ambiguity and items with multiple translations. Presumably, if all the words in List C.1 and List C.2 are added to the user-defined dictionary, informing SYSTRAN how each of them is supposed to be processed and translated, there should be a considerable improvement in the system’s output. The following section investigates the effectiveness of such modification.

The glossary modification of the items in List C.1 and List C.2 can be done either (1) separately, or (2) comprehensively.

For the former, the two lists would result in two files in SYSTRAN’s user-defined dictionary, one corresponding to category ambiguity (where the entries are all “Source Category” items, see Figure 3.8 below) and one corresponding to polysemy (where the entries are all “Multilingual” items, see Figure 3.9). These two files can then be complemented by a third one for the “Not Found Words” (NFW) and “Do Not Translate” (DNT) items (see Figures 3.10 and 3.11). Altogether the three files should enable SYSTRAN to translate correctly in many of the cases where errors occurred. The default priority of the three user-defined dictionaries would then be set to “1”, so that SYSTRAN prioritizes the specified categories and translations in these files whenever the items occur in the ST input. Accordingly, the “priority” column in the following figures is always “1”.

Figure 3.8 — “Source Category” items

Figure 3.9 — “Multilingual” items


Figure 3.10 — “Not Found Words” (NFW)

Figure 3.11 — “Do Not Translate” (DNT) items

However, this would result in much redundancy in the user-defined dictionaries. As can be seen (and as illustrated below in more detail), there is a significant overlap between List C.1 and List C.2. Almost all of the items in List C.1 also appear in List C.2, with only three exceptions. Creating separate user-defined dictionaries from these lists would involve considerable repetition. In this case study, therefore, the items are added comprehensively to a single dictionary (i.e. the latter method above), so that repetition of the glossary entries is kept to a minimum.

Moreover, even the items which do not overlap in the two lists are considerably restricted in their corresponding categories and translations (see below). Managing the items comprehensively when adding them to the user-defined dictionary should also be more effective in avoiding mistakes in SYSTRAN’s parsing and translation. Again, the default priority of the user-defined dictionary is set to “1”, as shown in Figure 3.12.

Figure 3.12 — Settings for User-defined Dictionary

Items in List C.1

The majority (12 of the 15) of the items in List C.1 also appear in List C.2 (and thus in Appendix B). In other words, most of the items whose grammatical categories were wrongly recognized (i.e. List C.1) were also translated into the wrong Chinese equivalents. These 12 items are: can, show, hide, offer, combine, draw, play, lead, apply, light, produce, present. It can be seen from List C.2 (or Appendix B) that none of these overlapping items are flexible-meaning ones. All of them are also fixed-POS items. Therefore, these overlapping items can be added to the glossary with both the correct categories and the correct translations (see Figure 3.13 below), in the same way as those in Figure 3.9.


Figure 3.13

The items which are not included in List C.2 are: influence, search, static. SYSTRAN assigned them the wrong categories, but not the wrong meanings. Strictly speaking, the glossary does not lack translations for these words; but failure to disambiguate their grammatical categories would significantly affect the parsing of the English input. These three items are not included in Appendix B either, which means that in the sample they are not considered ambiguous in terms of polysemy.

This also means that these items can be dealt with in the same way as the other ones in the list. They can be added to the glossary with the corresponding categories and translations. Alternatively, they can be added as “Source Category” items like those in Figure 3.8, specifying only their grammatical categories without specifying their translations. These two ways of modifying the glossary do not make much difference in this regard. For the sake of simplicity, this case study deals with these items in the same manner as with the rest of List C.1.

In short, all of the items in List C.1 are fixed in terms of both POS and translation as they appear in the sample, and they are all added to the user-defined dictionary with the corresponding categories and translations.

Items in List C.2

Regarding List C.2, about half of the items (38 of the 79) overlap with Appendix A. This means that these 38 items are among the categorially ambiguous ones; however, none of them are “flexible-POS” items (see below); all are fixed in terms of grammatical category. Specifically, the items in List C.2 that overlap with Appendix A are: field, passive, can, produce, lead, shear, tilt, fringe, offer, intrinsic, drain, combine, charge, object, hide, scaling, active, backward, plane, applied, cascade, harmonic, impact, step, characteristic, rate, factor, draw, play, index, mirror, solid-state, parallel, light, present, show, phase, free. Among them, 12 are included in List C.1: can, show, hide, offer, combine, draw, play, lead, applied, light, produce, present. These 12 items have been discussed in the previous section. Since the rest of the 38 items are fixed in category as well, all of them can be added to the glossary with the correct categories and translations.

The other half of List C.2 consists of 41 items that are not included in Appendix A, which means that these items are not ambiguous in terms of grammatical category. Specifically, they are: spectral, emission, simulation, harvesting, photovoltaic, behavior, solution, available, amplitude, discrete, aliasing, prescription, variable, geometry, truncation, lying (lie), generation, frequency doubling, second, generate, discuss, conversion, emit, cavity, steady-state, expansion, linewidth, enhancement [factor], analogy, modulation, aberration, correction, investigate, adaptive, deformable, interferometric, convergence, error, series, laser, source.

In short, none of the items in List C.2 is flexible in terms of grammatical category. When they are added to the glossary, the corresponding categories are fixed and straightforward.

The restriction of these items’ translations is also obvious in List C.2. Most are fixed-meaning ones — which is consistent with Appendix B — with some of them being repetitive (e.g. “correction”, “field”, “error”), yet repetitively restricted. Among the 79 items in List C.2, as many as 75 are fixed-meaning ones with one-to-one translation equivalents.


Here, four items are the “flexible-meaning” ones: “laser”, “source”, “phase”, “free”. These four items, as mentioned above, follow patterns in the sample. The word “laser” has been illustrated in Section 3.3.1 as an example, and the other three are shown below. It is worth mentioning that the search here is exploratory and that the patterns identified are hypothetical. More comprehensive searches in a larger corpus are discussed in the following sections.

“source”:
source of error      误差源
(other instances)    源极

Figure 3.14 –– Concordance for “source”

“phase”:
phase front          波前
(other instances)    相位

Figure 3.15 –– Concordance for “phase”

“free”:
free from
(other instances)    自由

Figure 3.16 –– Concordance for “free”

These items can be added to the glossary with their respective translations for the “other instances”, with the exceptional cluster added as a whole. The rest are added to the glossary with the corresponding translations in List C.2. As mentioned above, all of these items are restricted in terms of grammatical category (with some of them wrongly recognized), hence their corresponding categories in the user-defined dictionary entries.
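The idea of entering an exceptional cluster as a whole, alongside a default translation for all other instances, can be sketched as a longest-match lookup. The following Python example is a hypothetical illustration: the greedy longest-match strategy, the `GLOSSARY` structure and the function `match_entries` are assumptions of this sketch and are not a description of how SYSTRAN processes its dictionaries internally. The entries are taken from Figures 3.7, 3.14 and 3.15.

```python
# Phrase entries and single-word defaults, taken from Figures 3.7, 3.14 and 3.15.
GLOSSARY = {
    "phase front": "波前",
    "phase": "相位",
    "source of error": "误差源",
    "source": "源极",
    "quantum cascade laser": "量子级联激光器",
    "solid-state laser system": "固态激光系统",
    "solid-state laser": "固态激光器",
}

# Longest entry length in words, used to bound the search window.
MAX_PHRASE_LEN = max(len(entry.split()) for entry in GLOSSARY)


def match_entries(tokens):
    """Greedy longest-match lookup.

    Returns a list of (source span, translation) pairs; the translation is
    None for tokens that have no glossary entry.
    """
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GLOSSARY:
                result.append((candidate, GLOSSARY[candidate]))
                i += n
                break
        else:
            # No entry starts at this token: pass it through unmatched.
            result.append((tokens[i], None))
            i += 1
    return result


print(match_entries("the phase front of the source".split()))
print(match_entries("the phase of a solid-state laser system".split()))
```

In this sketch “phase” receives its default equivalent everywhere except inside “phase front”, which mirrors the pattern hypothesized from the sample.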

NFW and DNT items

Section 3.2.1 has illustrated the words which SYSTRAN’s active glossaries do not contain and which are thus left untranslated (i.e. Not Found Words, or NFW). As can be seen above, some of these NFWs are acronyms, units, chemical symbols or people’s names, which are generally supposed to remain untranslated. In other words, they should be “Do Not Translate” (DNT) items. In the user-defined dictionary, the entry types for these items are selected accordingly, as shown in Figure 3.11 above. This informs SYSTRAN that such items are not words to be translated. Since the “transliteration” function is unchecked in the system settings (see above), SYSTRAN then inserts the English words directly in the corresponding position of the translation output.

The rest of the items in 3.2.1 are, again, restricted in terms of both grammatical category and polysemy. They can be added to the user-defined dictionary in the same manner as those in List C.1. This complements the glossary created above.

The result is a user-defined dictionary, containing mainly terminology and a very limited number of general-language verbs, to be activated in SYSTRAN not only as an additional resource, but with top priority. In this manner, each time SYSTRAN encounters the items in question, the user-defined dictionary is consulted first, before any other system glossaries. The complete list of items added to this dictionary is shown in Appendix C (List C.3).
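The effect of giving the user-defined dictionary top priority can be pictured as a layered lookup in which the first layer shadows the others. The following Python sketch uses the standard-library `collections.ChainMap` for this purpose. It is a conceptual illustration only: the two dictionaries and their layout are hypothetical and do not reflect SYSTRAN’s actual dictionary format. The equivalents are borrowed from the TT1 and TT2 outputs shown in the examples of Section 3.4.

```python
from collections import ChainMap

# Hypothetical stand-in for the system glossary: the equivalents that
# surfaced in the initial output (TT1) of the examples in Section 3.4.
system_glossary = {
    "can": "罐头",
    "hide": "皮",
    "present": "当前",
    "field": "领域",
}

# Hypothetical stand-in for the user-defined dictionary: the equivalents
# that appear in the customized output (TT2).
user_dictionary = {
    "can": "能",
    "hide": "隐藏",
    "present": "介绍",
}

# ChainMap searches its mappings from left to right, so entries in the
# user-defined dictionary take precedence over the system glossary.
lookup = ChainMap(user_dictionary, system_glossary)

for word in ("can", "hide", "present", "field"):
    print(word, "->", lookup[word])
```

Words absent from the first layer, such as “field” in this sketch, still fall back to the second layer, which corresponds to the situation where only the listed items are overridden.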


3.4 Customized Translation: Lexical improvement and beyond

All of the words were added to the user-defined dictionary as described above. As illustrated below, this results in significant improvement in the translation output, to the extent that the improved translations are good enough for very smooth assimilation of the information in the ST.

On the one hand, many of the syntactic errors in the translation have been corrected because of the glossary modification — or more specifically, because of the restriction on the “category” of each entry in the user-defined dictionary. Linguistically speaking, this leads to more grammatical, accurate and fluent translation output, hence better quality.

On the other hand, the functionality of the translated output is very satisfactory, largely due to the linguistic improvement of the translation quality. If one reads the improved translation alone without referring to the ST (see Appendix D), the information tends to be very explicit and comprehensible — the purpose of the work in each abstract, the methods used, as well as the major results and conclusions, are straightforward and correct. It is therefore reasonable to assume that if a researcher in optics uses the customized system to translate English articles into Chinese, the output can be very dependable for getting the gist of the papers concerned.

The following discussion focuses on the linguistic improvement in the translation after activating the user-defined dictionary created above. As mentioned, issues of category ambiguity are to a considerable extent interconnected with how a sentence is parsed. Many of the structural ambiguities, in fact, “arise from the fact that a single word may serve in a different function within the same syntactic context” (Hutchins & Somers, 1992, p. 89). The POS recognition of the words in an input sentence is vital for the system to correctly analyze the syntactic structure of the ST and to transfer it to a suitable TT syntax representation. The following section aims to show that after modifying SYSTRAN’s glossary information for the items discussed above — especially regarding their categories — structural ambiguities are resolved to a considerable extent.


The above has also mentioned the syntactic restriction of academic writing, and of abstracts in particular (see 2.3). In such texts, some syntactic constructions are largely absent, e.g. tag questions, exclamatory phrases, ellipsis of articles, etc., which significantly simplifies the parsing process; however, other kinds of complexity are exceptionally common, e.g. embedded clauses, conjunction, strings of nouns, and attachment of (typically) prepositional phrases. These kinds of constructions easily cause structural ambiguities, which in turn result in translation errors. Here, the focus is on how these structural ambiguities are resolved by the modification of SYSTRAN’s glossary. It is worth mentioning that the ambiguities discussed in this section include both “real” and “accidental” structural ambiguities (see section 2.3.2), as the two kinds apparently do not differ much for Machine Translation systems in the way the issue is resolved (cf. Hutchins & Somers, 1992, p. 89).

The following sections exemplify the typical aspects of improvement in structural disambiguation in the translation output from SYSTRAN. Specifically, they include main-verb identification, subordinate clauses, “-ing” verbs, modifier attachment, conjunction, noun strings, etc. These are critical elements in natural language processing, and without capabilities for handling them, it is likely to be very difficult for the majority of users to put the software to serious use. Therefore, the improvement in these aspects is meaningful for the practical use of MT, and demonstrates the significance of the customization conducted in this case study.

Table 3.7 shows this improvement in syntax quantitatively, for each of the aspects illustrated in the following sections. The ST sample, as well as the translation versions before and after the lexical customization, was closely examined, and each instance of the above aspects was recorded. For each aspect, the number of instances in the ST, the number of instances that are problematic in the initial translation, and the number of problems remaining in the output after the customization were counted, and are presented in the table (see below). The percentages of problematic instances in each aspect, for the initial and the customized output, are also included.

It is important to note that the “problematic instances” here are relative to the corresponding aspects. For example, the problematic instances for “main verbs” are those where the main verbs are not properly recognized in terms of syntax, those for “attachment” are ones where the constituents are not attached to the appropriate words, and the ones for “noun strings” are instances where the strings are not properly detected in SYSTRAN’s source sentence analysis. These issues are elaborated in detail in the following sections.

                                                  Problematic instances
Aspect                          Total instances   Before customization   After customization
Main verbs                      40                11 (27.5%)             1 (2.5%)
Verbs of subordinate clauses    16                7 (43.8%)              1 (6.3%)
Phrasal verbs                   7                 3 (42.9%)              1 (14.3%)
“-ing” verbs                    30                9 (30%)                2 (6.7%)
Attachment                      76                49 (64.5%)             33 (43.4%)
Conjunction                     27                5 (18.5%)              1 (3.7%)
Noun strings                    91                13 (14.3%)             4 (4.4%)

Table 3.7 –– Translation improvement on syntax

In the examples below, TT1 refers to the translation before the glossary modification and TT2 indicates the output after the modification.
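The counts in Table 3.7 also allow a rough estimate of how many of the initially problematic instances were resolved in each aspect. The following Python sketch computes that share; the measure (resolved instances divided by initially problematic instances) is an illustrative summary introduced here, not a figure reported in the study, and it assumes that the problems remaining after customization are a subset of the initial ones.

```python
# (total instances, problematic before, problematic after), copied from Table 3.7.
TABLE_3_7 = {
    "Main verbs": (40, 11, 1),
    "Verbs of subordinate clauses": (16, 7, 1),
    "Phrasal verbs": (7, 3, 1),
    "-ing verbs": (30, 9, 2),
    "Attachment": (76, 49, 33),
    "Conjunction": (27, 5, 1),
    "Noun strings": (91, 13, 4),
}


def resolved_share(before: int, after: int) -> float:
    """Percentage of the initially problematic instances that no longer
    appear as problematic after customization."""
    return 100.0 * (before - after) / before


resolved = {
    aspect: resolved_share(before, after)
    for aspect, (_total, before, after) in TABLE_3_7.items()
}

# Print the aspects from the largest to the smallest share of resolved problems.
for aspect, value in sorted(resolved.items(), key=lambda item: -item[1]):
    print(f"{aspect}: {value:.1f}% of initial problems resolved")
```

On this measure, main-verb recognition shows the largest relative gain and attachment the smallest, which matches the impression given by the raw percentages in the table.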


3.4.1 Verbs of Main and Subordinate Clauses

Most obviously, modifying the categories of the listed items is effective in helping SYSTRAN to correctly recognize the verbs in the input sentences, which in turn leads to better resolution of more complex issues such as grammatical dependencies, clausal boundaries, subject-predicate establishment, and the distinction of main clauses from their subordinates. As can be seen from List C.1, almost all of the errors in SYSTRAN’s category disambiguation are verbs that are mistakenly recognized as nouns. These verbs are crucial for the proper parsing of the input and for the subsequent transfer and synthesis of the TT. It can also be seen from the following sections on attachment, conjunction, and noun strings that most of these parsing errors are related in one way or another to the wrong recognition of verbs in the respective sentences. Since parsing errors cause significant mistranslation of the entire sentence, correct recognition of these verbs, as shown in the following examples, has resulted in a significant improvement in the translated output.

Main Verbs

Many of the items in List C.1 function as main verbs of the sentences in which they occur. In total, there are 40 instances in the sample where a lexical item functions as the main verb of the sentence, 11 of which are problematic in the initial translation output (see Appendix D). These main verbs are either assigned the wrong POS categories, or assigned the wrong syntactic functions in their corresponding sentences. After the modification, almost all are recognized both as the correct category and as the main verbs in the respective instances: 10 of the 11 errors are corrected, and no new errors are introduced in the customized output. The modification of the glossary has been very effective in helping SYSTRAN to correctly recognize these main verbs.

It is also worth mentioning that, among the 10 errors corrected by the modification of SYSTRAN’s glossary, not all involve items included in List C.1. This means that the correct recognition of these main verbs is not merely the result of the lexical restriction of the verbs themselves, but also an indirect outcome of the modification of other parts of the sentence. There are two such cases, in Abstract 6 (2nd sentence) and Abstract 7 (2nd sentence) of Appendix D. The remaining 8 instances are all due to the restriction on these verbs in the modified glossary.

The improvement in main-verb recognition has, as a consequence, effectively improved SYSTRAN’s recognition of the main clauses in contrast to the subordinate ones, as well as of the syntactic relationships within the clauses. As can be seen below, such improvement in syntactic disambiguation has in turn avoided many of the issues illustrated in the following sections, resulting in better translation.

Example 1 — Abstract 4, 53 (27)

ST: Here we present a new design prescription for precise near-field CGHs based on comprehensive analysis of the spatial bandwidth.

TT1: 这里我们当前一张新的设计处方为精确近领域根据对空间带宽的全面分析的 CGHs。

TT2: 这里我们介绍根据对空间带宽的全面分析的精确近场 CGH 的一个新的设计方法。

In this example, the ST syntactic structure is quite straightforward, which can be illustrated in the following dependency tree:


A look at TT1 reveals that in SYSTRAN’s initial run, the system completely failed to analyze the ST in this way, or at least failed to transfer the syntactic and semantic information properly to the output. TT2, in sharp contrast, is very adequate. This is illustrated in more detail in the following sections on attachment and noun strings (see 3.4.3 and 3.4.5); here it can be seen that the two translations have a major difference — the recognition of “present” as a verb. Since this is the main verb of the entire sentence, a failure to recognize it is fatal to the translation as a whole.

According to the POS checking in SYSTRAN, “present” is recognized as an adjective in TT1, and consequently the only word recognized as a verb is “based”. As a direct result, SYSTRAN translated the chunk before “CGHs” word by word and inserted the respective words in a linear manner, while treating the following lexical item –– “based” –– as a passive-voice verb which leads a relative clause modifying “CGHs”. In other words, SYSTRAN seems to find no main verb at all in the ST, and the output is therefore not even a complete and comprehensible sentence. The failure to recognize the main verb has resulted in very problematic issues that are demonstrated in more detail in the following sections.


In contrast, TT2 has dealt with these issues adequately and is much more comprehensible and accurate, largely due to the correct recognition of the main verb “present”. As can be seen from List C.1 and Appendix D, the only item in this sentence which is customized regarding category is the word “present”, which has directly improved SYSTRAN’s recognition of the sentence’s main verb. This has a significant influence on the rest of the sentence in TT2, resulting in considerable improvement regarding the entire translation.

Verbs of Subordinate Clauses

Some of the verbs are within subordinate clauses, and like the main verbs, these verbs are equally important for the proper recognition of clausal boundaries and relationships. Failure in this aspect is also fatal for outputting a proper translation. As can be seen from Appendix D and Table 3.7, the modification of the glossary has resulted in considerable improvement. There are altogether 16 instances where the verbs are in subordinate clauses, and in SYSTRAN’s initial translation, approximately half of them are problematic. The translation output after the lexical customization shows a substantial improvement in this regard –– there is only one problematic instance (see Table 3.7). The following is an example of such improvement.

Example 1 — Abstract 3, 53 (9)

ST: We demonstrate three amplitude cloaks that can hide very large spatial objects over the entire visible spectrum using only passive, off-the-shelf optics.

TT1: 我们展示在整个可见光谱的罐头皮非常大空间对象使用唯一的波动,现成的光学的三个高度斗篷。

TT2: 我们展示使用唯一的无源,现成的光学,能隐藏在整个可见光谱的非常大空间物体的三个振幅斗篷。

In this example, although the main verb of the sentence (“demonstrate”), as well as its core arguments, has been recognized, the relative clause modifying “cloaks” has been mistranslated and is somewhat incomprehensible. The problems regarding this are illustrated in more detail in the following section on the attachment issue (see 3.4.3 below). These problems are for the most part the result of not recognizing the verbs “can” and “hide”, both of which are treated as nouns in the initial translation. As can be seen from List C.1 and Appendix D, the only items modified for category are “can” and “hide”, and the improvement in TT2 is apparent in this relative clause (see Section 3.4.3).

Phrasal Verbs

Phrasal verbs are also ambiguous for MT in terms of syntactic parsing, and this can be even more problematic if the verbs are mistakenly recognized as nouns or adjectives, as shown in the following example. It can also be seen here that the glossary modification has resulted in improvement in this aspect as well. As demonstrated in Table 3.7, there are in total 7 instances of phrasal verbs in the ST, 3 of which are not properly detected (i.e. not as phrasal verbs). In the output from the customized system, only 1 instance is still problematic in this regard.

Example 1 — Abstract 1, 53 (19) ST:

This can lead to accurate measurements in both spectrum and distance and allows a thorough characterization of the interferometer, as well as adds passive ranging information to hyperspectral images.

TT1: 这对准确测量的罐头主角在光谱和距离和允许干涉仪的一个详尽的描述特性,以及增加被动排列的信息到 hyperspectral 图像。

TT2: 这能导致对在光谱和距离的准确测量并且允许干涉仪的一个详尽的描述特性,以及增加无源排列的信息到高光谱图像。

Here, “can” and “lead” are both verbs but are mistakenly recognized as nouns by SYSTRAN. What is particularly worthy of discussion regarding this sentence is that the phrasal verb “lead to” has been separated by the system with a phrasal boundary, and consequently the words “can lead” are considered as a compound, modified by a prepositional phrase “to accurate measurements”. It is due to this recognition that the complete sequence was translated into “对准确测量的罐头主角”. Such phrasal verbs as “lead to” often pose much difficulty for MT systems, because the structure might be considered mistakenly as either a verb followed by a prepositional phrase attached to another word, or a noun phrase as the one in this example. More detailed discussion on prepositional phrase attachment regarding this sentence and other instances of the same kind can be found in the following section on “attachment” (Section 3.4.3), and here the case is somewhat different from some other examples because of the phrasal verb. It can be seen from TT2 that by restricting the category for “lead” in the glossary the system seems to have tackled the issue fairly satisfactorily. In addition to the distinction of the phrasal verb from a verb followed by a prepositional phrase, this example involves another “deep structure” ambiguity11 which is particularly important for MT. Since in English, and in technical texts in particular, noun compounds are prevalent and category ambiguities regarding verbs and nouns are frequent, such ambiguities “may accumulate” and produce sentences like the one below, which “baffle most readers, until they have identified which words are functioning as verbs” (Hutchins & Somers, 1992, p. 90):

Gas pump prices rose last time oil stocks fell12.

This sentence is of course what Hutchins and Somers (1992) call a “real ambiguity”, but for MT many of the sentences like Example 1 and the previous examples (i.e. regarding “can hide” and “can lead”) are equally puzzling and difficult. More examples of noun strings are illustrated below in Section 3.4.5, where the nouns, as Hutchins and Somers (1992, p. 90) state, can be either “a single constituent” (i.e. compound noun), or “with a constituent boundary in the middle”. Here, the fact that many of these nouns can also be verbs, combined with the presence of phrasal verbs, only amplifies the problem, as shown in Example 1.
Therefore, the glossary modification of these verbs regarding their category and the correct recognition of phrasal verbs are both crucially important and considerably effective concerning such “deep structure” ambiguities.
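The way such category ambiguities “accumulate” can be illustrated with a minimal sketch. The toy lexicon below is a hypothetical assumption made purely for illustration (it is not SYSTRAN’s dictionary or tag set); the code only counts the category sequences that would have to be considered for the sample sentence, before and after two words are restricted to their verbal use.

```python
from itertools import product

# Hypothetical toy lexicon (an assumption for illustration, not SYSTRAN data):
# each word is listed with the grammatical categories it could take.
TOY_LEXICON = {
    "gas": ("noun",),
    "pump": ("noun", "verb"),
    "prices": ("noun", "verb"),
    "rose": ("noun", "verb"),
    "last": ("adjective", "verb"),
    "time": ("noun", "verb"),
    "oil": ("noun", "verb"),
    "stocks": ("noun", "verb"),
    "fell": ("noun", "verb"),
}

SENTENCE = "gas pump prices rose last time oil stocks fell".split()


def category_sequences(words, lexicon, restrictions=None):
    """List every combination of categories for the words.

    `restrictions` maps a word to the only categories allowed for it,
    loosely mimicking a user-defined glossary entry that fixes the category.
    """
    restrictions = restrictions or {}
    options = [restrictions.get(word, lexicon[word]) for word in words]
    return list(product(*options))


unrestricted = category_sequences(SENTENCE, TOY_LEXICON)
restricted = category_sequences(
    SENTENCE, TOY_LEXICON, {"rose": ("verb",), "fell": ("verb",)}
)
print(len(unrestricted), len(restricted))
```

Under these assumptions, fixing the category of only two words removes three quarters of the candidate sequences, which is in line with the observation that restricting a few verbs in the glossary can have a disproportionate effect on parsing.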

11 In terms of the Chomskyan model, such kinds of structural ambiguities involve different “deep structures” of the same “surface structure”.
12 Sample sentence taken from Hutchins and Somers (1992), p. 90.


Lastly, the modification has also resulted in much improvement regarding prepositional phrase attachment (“in both spectrum and distance”), phrasal boundaries, and conjunctions, as will be discussed in the following sections.


3.4.2 “-ing” Words

The disambiguation of “-ing” words is in fact a typical and prominent problem for MT, especially for rule-based systems (cf. Aranberri Monasterio, 2010; Roturier, 2006), because the words can have various grammatical functions causing much structural ambiguity. An “-ing” word can be a (verbal) noun, a gerundive verb, or an adjective (see below). On this basis, many of the MT-oriented controlled language (CL) rules recommend that such words be avoided; but in academic texts such as those in this case study, “-ing” words do occur frequently. Therefore, this type of error seems almost unavoidable in translating academic texts, and in this specific case, improvement is apparent. There are 30 instances of “-ing” words in total in the ST, among which 9 have been recognized inappropriately in SYSTRAN’s initial output, regarding their use as verbal nouns, gerundive verbs, or adjectives. In the output from the customized system, however, there are only 2 instances where the problem remains (see Table 3.7). The following examples illustrate this in more detail.

Example 1 — Abstract 4, 53 (27) ST:

The design prescription is verified by examples showing reconstruction error versus controlled parameters.

TT1: 设计处方由例子陈列重建错误核实对受控参数。

TT2: 设计方法由展示重建误差对受控参数的例子核实。

In the ST, the word “showing” is used as a gerundive verb which leads a relative clause attached to “examples”. The recognition of this structure is important for a transfer-based MT system like SYSTRAN: it is crucial for proper parsing and transfer results, adequate rearrangement of the translated phrasal constituents, and finally delivering correctly and sufficiently the information in the ST. Regarding this specific clausal structure, the recognition of the gerundive verb “showing” is vital: if recognized as either a noun or an adjective, the resulting output would be significantly distorted in both structure and meaning.


Despite the importance, however, this instance poses considerable difficulty for SYSTRAN’s disambiguation. The fact that the words preceding and following this “-ing” verb are all nouns makes it much less likely for the system to discard the possibility of its usage as a noun (forming noun compounds with the other nouns) or as an adjective (modifying “reconstruction error”). By checking the POS labelling results in SYSTRAN, it can be seen that the word “showing” is recognized as a noun; word-alignment checking also reveals that this “-ing” word has been translated into “陈列” in TT1. In the sections on “noun strings” and “attachment” (see below), this study will discuss in more detail the fact that the wrong recognition of this verb alone has resulted in SYSTRAN’s false detection of phrasal boundaries and its mistakenly recognizing the sequence “examples showing reconstruction error” as a noun string. This is a fatal error for TT1. In contrast, TT2 is much more explicit and accurate in this regard. As illustrated above, the modification has restricted the grammatical category of this word to “verb”, and in this sentence, it is the only word modified for grammatical category (see Appendix D and List C.1). Not surprisingly, this small modification has both corrected the translation of the verb “showing”, and more importantly improved the overall syntactic analysis in SYSTRAN and the semantic accuracy of the translation (see following sections). To some extent, this sentence is a typical example of the issue regarding the above sample sentence given by Hutchins and Somers: “Gas pump prices rose last time oil stocks fell” (see Section 3.4.1). For the sequence “examples showing reconstruction error”, the MT system has to recognize the verb (i.e. “showing”) before properly analyzing the ST and outputting an appropriate translation.
Although this structure may not be much of an ambiguity to a human reader, for MT it can be as ambiguous as the sample sentence about “gas pump prices”, mainly because of the noun/verb category ambiguity of the word “showing”.

Example 2 — Abstract 2, 53 (6) ST

Nanostructured materials offer great prospects in helping solar-energy harvesting devices to achieve their envisioned performances.


TT1

在收获设备的帮助的太阳能的Nanostructured材料提议巨大远景完成他们的被构想的表现。

TT2

纳米结构材料提供在帮助太阳能采集设备的巨大远景完成他们的被构想的表现。

In this example, the ST contains two “-ing” words: “helping” and “harvesting”. While either of them can function as a noun, an adjective, or a gerund, in this sentence it is straightforward that “helping” is a gerundive verb with the subsequent noun phrase as a complement, and that “harvesting” is a noun in the compound “solar-energy harvesting devices”. However, for MT this can be considerably confusing: the two “-ing” words appear to be used in very similar manners in a prepositional phrase “in helping solar-energy harvesting devices”. As can be seen in TT1, SYSTRAN had significant problems in recognizing the head of the phrase “helping solar-energy harvesting devices”. This is partly because the two “-ing” words can be both nouns or adjectives, and either of them can also be a verb. In TT1, it seems that SYSTRAN has considered “solar-energy” as the head of the phrase, and “harvesting devices” as a relative clause modifying the phrase “helping solar-energy”. “Helping”, consequently, becomes an adjective modifying “solar-energy”. In other words, the sequence has been considered by SYSTRAN as “the helping solar-energy (which is) harvesting devices”. As none of the words preceding it is recognized as a verb (“offer” is recognized as a noun), it can be seen that SYSTRAN has considered the entire sequence “Nanostructured materials offer great prospects” as a noun-functioning sequence modified by the prepositional phrase “in helping solar-energy harvesting devices”, hence the translation “在 [收获设备的帮助的太阳能] 的 [Nanostructured 材料提议巨大远景]”. In short, SYSTRAN has recognized the two “-ing” words, “helping” and “harvesting”, respectively as a noun-modifying adjective and a participle leading a post-head modifying relative clause. This is an error which has completely distorted the syntactic structure and the meaning of the ST. The resulting translation is apparently undependable and misleading. After the modification, however, improvement in this regard is apparent.
As shown in TT2, the complete sequence “helping solar-energy harvesting devices” is recognized and translated entirely in the correct manner: “帮助太阳能采集设备”.


Contrast this with the initial translation “收获设备的帮助的太阳能” and it could be seen that the improvement has been apparent. As shown in Appendix D and List C.1, the only words modified for POS are “offer” and “harvesting”. Here, it is worthy of mention that “solar-energy harvesting” or “solar-energy harvesting devices” (太阳能采集/太阳能采集设备) can be considered multi-word terminological items, and by intuition the complete sequence can be added into SYSTRAN’s glossary as a whole. Presumably this would make it much easier for the system to parse the input sentence in an appropriate manner, but as shown in the above example the modification of the single word “harvesting” has achieved the same effect. On the other hand, the strategy of modifying the smallest unit in such instances is beneficial in many ways. By adding “harvesting”, the modification works not only for this instance but for many other instances such as “light harvesting”, “the harvesting of diffuse solar radiation”, “enhanced trapping and harvesting of incident FW energy”, etc. Flexible use of terminology can be taken into account, and the customization can be applicable to a wider range of texts of such kinds. In addition, adding the items in smaller units can also reduce the total number of entries in the user-defined dictionary, so that the customization would be more efficient. Here, the only problem with TT2 is the attachment of the PP “to achieve their envisioned performances”, as will also be discussed below (see Section 3.4.8). If attached to the right word the translation would have been completely correct. However, this part of the sentence does not seem to contain information particularly important for the text as a whole, and by reading TT2 a Chinese reader with some basic understanding of nanotechnology would be able to easily extract the main point of the sentence. 
In sharp contrast, without the proper translation of “helping solar-energy harvesting devices” and the recognition of the main verb “offer”, an output such as TT1 can be very confusing and misleading. Therefore, the improvement resulting from the modification can be considered effective in this case.
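The argument for modifying the smallest unit can be made concrete with a small sketch. It assumes, purely for illustration, that a glossary entry applies wherever its tokens occur as a contiguous run in a phrase; the function name and the matching rule are hypothetical and do not describe how SYSTRAN actually looks up user-dictionary entries. The phrases are the ones quoted in the discussion above.

```python
# Phrases quoted in the discussion of "harvesting" above.
PHRASES = [
    "solar-energy harvesting devices",
    "light harvesting",
    "the harvesting of diffuse solar radiation",
    "enhanced trapping and harvesting of incident FW energy",
]


def covered(entry, phrases):
    """Return the phrases that contain the entry as a contiguous token run.

    Hypothetical matching rule: case-insensitive, whitespace-tokenized.
    """
    entry_tokens = entry.lower().split()
    n = len(entry_tokens)
    hits = []
    for phrase in phrases:
        tokens = phrase.lower().split()
        if any(tokens[i:i + n] == entry_tokens
               for i in range(len(tokens) - n + 1)):
            hits.append(phrase)
    return hits


single_word = covered("harvesting", PHRASES)
multi_word = covered("solar-energy harvesting devices", PHRASES)
print(len(single_word), len(multi_word))
```

Under this simplified matching rule, the single-word entry reaches all four phrases while the multi-word entry reaches only the phrase it was copied from, which mirrors the point that smaller units keep the user-defined dictionary short and more widely applicable.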


3.4.3 Attachment

The following illustrates examples of structural ambiguity caused by attachment of prepositional phrases (PP) and relative clauses (RC). For better illustration, some of the sentences are marked with square brackets to indicate the phrases and clauses in question. Not surprisingly, attachment issues are prominent regarding academic texts. Table 3.7 shows that in the sample, there are a total of 76 instances that involve PP or RC attachment issues. Among them, as many as 49 are problematic in the initial translation output from SYSTRAN, either with the constituents attached to inappropriate words or with the phrasal boundaries mistakenly detected. Such instances distort the meaning of the translation substantially, as can be seen from the examples below and from Appendix D. The glossary modification has reduced the problematic instances to 33 (see Table 3.7). Such improvement, as demonstrated by the following examples, has been visibly effective for the overall accuracy and comprehensibility of the translated sentence as a whole.

Example 1 — Abstract 3, 53 (9) ST:

We demonstrate three amplitude cloaks [that can hide very large spatial objects] [over the entire visible spectrum] [using only passive, off-the-shelf optics].

TT1: 我们展示在整个可见光谱的罐头皮非常大空间对象使用唯一的被动,现成的光学的三个高度斗篷。

TT2: 我们展示 [使用唯一的无源,现成的光学],[能隐藏在整个可见光谱的非常大空间物体的] 三个振幅斗篷。

In the ST, there is a defining relative clause (“that… optics”), a prepositional phrase (“over… spectrum”) and a reduced relative clause (“using… optics”). While it may not be much of an ambiguity that the defining relative clause is attached to “cloaks”, the respective attachment of the prepositional phrase and the reduced relative clause is rather ambiguous. Linguistically, the PP “over the entire visible spectrum” could be attached to “hide”, “demonstrate”, or “objects”, and the clause “using only passive, off-the-shelf optics” could be attached to “hide”, “demonstrate”, or “spectrum”.

However, some background knowledge in optical cloaking would reveal that both of these two constituents are in fact attached to the verb “hide” and are part of the defining relative clause. As can be seen from TT1, in SYSTRAN’s initial ST analysis, neither of these constituents are attached to the verb “hide” — largely related to its failure in recognizing this verb. This resulted in a complete loss of the ST meaning in TT1. After the glossary modification, both “can” and “hide” are restricted to their verbal use, and the two relative clauses are properly recognized in TT2. The only problem that remains here is that the PP “over the entire visible spectrum” is attached to a noun “object” in TT2, rather than to the verb “hide”. Technically speaking this is wrong, because the spectrum is in fact not related to the “objects”, but instead to the way in which the cloak “hides” the objects. The objects are cloaked when the optical system bends the light surrounding the object rather than from the object. Since the cloaking only works for light of certain wavelengths, the working condition often corresponds to a range of the spectrum. Therefore, the “spectrum” here refers to the working wavelengths of the cloaks (with the PP attached to “hides”) rather than the wavelength of the light reflected by the objects (with the PP attached to “objects”). Nevertheless, this seems less of a linguistic issue than an ambiguity which requires real-world, technical knowledge in optics to resolve. For sentences like this, even a human translator who is unfamiliar with the mechanisms of the optical cloaks could make the same mistake very easily. A comparison with TT1 reveals that the transmission of information is much more adequate in TT2. 
Given the apparent improvement in terminology accuracy and better arrangement of the two relative clauses in TT2, it is reasonable to assume that a reader with some domain-specific knowledge in optics would be able to understand the main information correctly. Therefore, this case study considers SYSTRAN’s TT2 a satisfactory output.
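The size of the search space behind this example can be sketched as follows. The candidate attachment sites are the ones listed above for the PP and for the reduced relative clause; the code simply enumerates every pairing and is an assumption-laden upper bound, since it ignores structural constraints (for instance, crossing attachments) that a real parser would apply. The names used are illustrative only.

```python
from itertools import product

# Candidate attachment sites named in the discussion of Example 1.
PP_SITES = ("hide", "demonstrate", "objects")          # "over the entire visible spectrum"
CLAUSE_SITES = ("hide", "demonstrate", "spectrum")      # "using only passive, off-the-shelf optics"


def attachment_readings(pp_sites, clause_sites):
    """Pair every PP site with every clause site (an upper bound on readings)."""
    return [{"pp": pp, "clause": clause}
            for pp, clause in product(pp_sites, clause_sites)]


readings = attachment_readings(PP_SITES, CLAUSE_SITES)

# The reading supported by domain knowledge: both constituents modify "hide".
intended = {"pp": "hide", "clause": "hide"}
# The reading that TT2 appears to reflect for the PP (attached to the noun).
tt2_like = {"pp": "objects", "clause": "hide"}

print(len(readings), intended in readings, tt2_like in readings)
```

Only one of the nine pairings corresponds to the reading supported by background knowledge in optical cloaking, which illustrates why an output that misses it on the PP alone can still be regarded as close to the intended structure.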

Example 2 — Abstract 4, 53 (27) ST:

Here we present a new design prescription [for precise near-field CGHs] [based on comprehensive analysis] [of the spatial bandwidth].


TT1

这里我们当前一张新的设计处方 [为精确近领域] [根据对空间带宽的全面分析的CGHs]。

TT2

这里我们介绍 [根据对空间带宽的全面分析] 的 [精确近场CGH] 的一个新的设计方法。

In this sentence, there are two prepositional phrases and a relative clause: “for precise near-field CGHs” (PP1), “based on comprehensive analysis” (RC), and “of the spatial bandwidth” (PP2). Based on linguistic analysis and background knowledge, it is safe to conclude that PP1 and RC are in fact attached to “prescription”, while PP2 is attached to “analysis”. However, for MT they could be very ambiguous. PP1 could be attached to either “prescription” or “present”, and RC could be attached to “CGHs”, “prescription” or “present”. Though PP2 seems easier to handle, its attachment may not be straightforward for the system either. In this regard, SYSTRAN’s output is considerably surprising. As can be seen in TT1, the mistake made by SYSTRAN is even more troubling than the attachment possibilities mentioned above; resulting from its failure in recognizing “present” as a verb, neither PP1 nor RC is attached to the right word. PP1 is not attached to any word or phrase, but seems to be directly inserted into the TT in a linear manner. RC is attached, surprisingly and troublingly, to “CGHs”. Here, the problem goes beyond RC attachment: the constituent boundary is not properly detected. As can be seen from TT1, the MT system has recognized the sequence “for precise near-field” as one constituent (translated as “为精确近领域”), and “CGHs… bandwidth” as another (translated as “根据… 的 CGHs”), separating the noun phrase “precise near-field CGHs” with a constituent boundary. This directly resulted in an incorrect meaning in TT1, and the focus of the sentence has been misleadingly transferred to the wrong element: TT1 can easily lead to the misunderstanding that what is presented are “CGHs” rather than a design prescription. The glossary modification is considerably effective for this sentence. As can be seen from Appendix D, words that are modified include “present” for categorial ambiguity and polysemy, “prescription” and “near-field” for polysemy, and “CGH” for Not Found Words (NFW).
While the modification regarding polysemy and NFW has resulted in more appropriate Chinese terminology, the effect on syntax seems more remarkable: merely restricting the category of “present” to verb results in PP1, RC and PP2 all being attached to the right words (see TT2 here). The noun phrase “precise near-field CGHs” is also properly recognized, avoiding the fatal error in TT1. These effects have resulted in correct and comprehensible output in TT2. In addition, the inclusion of “CGH” for NFW items has effectively assisted SYSTRAN to recognize the plural form in the ST. Therefore, in TT2, the acronym “CGH” is in the correct form in Chinese. This is also the case in other instances of acronyms in the sample abstracts.

Example 3 — Abstract 4, 53 (27) ST:

The design prescription is verified by examples showing reconstruction error versus controlled parameters.

TT1

设计处方由 [例子陈列重建错误] 核实对受控参数。

TT2

设计方法由展示重建误差对受控参数的例子核实。

In this example, the relative clause “showing reconstruction error versus controlled parameters” is attached to “examples”, a structure which is not ambiguous in the real sense; however, as shown in TT1, SYSTRAN has wrongly recognized the structure. The word “examples” is not treated as the antecedent for a clause but as a modifying noun for the subsequent components (see TT1). This is similar to the error in phrasal boundary as in the previous example. As will be illustrated below (see Section 3.4.5), the sequence “examples showing reconstruction error” is mistakenly recognized by SYSTRAN as a compound noun, largely because of the wrong recognition of the gerundive verb “showing”. Another perhaps more relevant issue here is the attachment of the PP “versus controlled parameters”. It is obvious in TT1 that SYSTRAN has failed to find the word to which this phrase is attached. Rather, it seems to have inserted the phrase linearly in the translation output. Regarding this input sentence, it is linguistically ambiguous in terms of the attachment of this PP: either to “reconstruction error” or to “showing”. If attached to the former, semantically the PP would be referring to what is shown by the example; if attached to the latter, it would refer to how the examples show the reconstruction error. Some background knowledge on the content of the experiments reveals that the PP is in fact attached to the verb “showing”, describing the way those examples were demonstrated rather than merely using examples to contrast the reconstruction error to the parameters. Similar to the previous example, this is an ambiguity which requires real-world knowledge and even a human translator could easily make mistakes here. The modified words here include “prescription” for polysemy, and “show” for category ambiguity. Again, the modification of the POS of a single item has resulted in significant improvement in the syntax of the translation. As can be seen from TT2, the clause is not only recognized properly, but also attached to the right word. The PP, however, is still not attached to the right word in TT2, a mistake which this study would consider acceptable for the same reason as above.

Example 4 — Abstract 5, 53 (25) ST:

We demonstrate the frequency doubling [of a quantum cascade laser] [in a multilayered, partially oxidized GaAs/AlOx waveguide].

TT1:

我们展示频率加倍 [在多层,部分地被氧化的 GaAs/AlOx波导] 的 [量子小瀑布激光]。

TT2: 我们展示一 [量子级联激光器] 的双倍频 [在多层,部分地被氧化的 GaAs/AlOx 波导] 的。

In this sentence, there are two PPs: “of a quantum cascade laser” (PP1) and “in a multilayered, partially oxidized GaAs/AlOx waveguide” (PP2). While the attachment of PP1 (to “doubling”) does not appear ambiguous, PP2 is to a considerable extent a real ambiguity which requires much specialized knowledge in optics. Linguistically, PP2 can be attached to “laser”, “doubling”, or “demonstrate”. Any of the three possibilities could be grammatically valid in the general sense: when attached to “laser”, PP2 modifies a compound noun (i.e. “quantum cascade laser”) referring to a laser-producing device positioned in a waveguide; when attached to “doubling” it modifies an NP (i.e. “frequency doubling of a quantum cascade laser”) referring to a phenomenon occurring in a waveguide; if attached to “demonstrate”, the PP would be a post-head modification for a VP (“demonstrate… laser”) showing the way the phenomenon is presented in the experiments. These three structures, though all grammatical, would lead to significant differences in the meaning of the ST, and in the Chinese translation as well. Therefore, different ways of parsing directly result in different meanings in the output. Regarding this sentence, however, background knowledge on optics and on the optical experiments concerned reveals that PP2 should in fact be attached to “doubling”, since neither the lasing device nor the experimental demonstration is within the waveguide. In the real experiments, the laser beams are produced by the quantum cascade laser before being doubled in frequency within the waveguide. Therefore, in the Chinese output the sequence of characters corresponding to PP2 should ideally be a pre-head modification for “双倍频”. Here, neither TT1 nor TT2 is strictly accurate in dealing with PP2. In TT1, this PP in the input sentence seems to be considered as a post-head modification for “quantum cascade laser” (“在多层,部分地被氧化的 GaAs/AlOx 波导的量子小瀑布激光”). TT2 appears to be a rather linear insertion of the complete phrase in the corresponding position of the sentence (i.e. at the end). However, in dealing with PP1, there is a sharp difference between the two versions. In TT1, it seems that SYSTRAN has translated the sentence in accordance with two segments (“We demonstrate the frequency doubling” and “of a… waveguide”) and then combined the two linearly. The latter segment becomes an NP with the “laser” as the head and with PP2 as a modification. PP1 is completely misrecognized, an error which is in many ways fatal for TT1. The emphasis and main content of the ST are largely lost in the translation output. In contrast, TT2 has successfully recognized PP1 as a post-head modification for “frequency doubling”. This directly results in a significant improvement in the accuracy and comprehensibility of the output, with the main content of the TT2 being a demonstration of a laser’s frequency doubling. In this instance, PP1 is wrongly recognized in TT1 and fully corrected in TT2. The improvement regarding PP1 is obvious, and important for the comprehension of the main information in the sentence.
As for PP2, the phrase is initially attached to the wrong word in TT1, resulting in a very misleading translation, while in TT2 it is not attached to any word. The improvement concerning PP2 is also important in avoiding misleading information. In addition, as noted above, the issue of PP2 is a real ambiguity and requires considerable specialized knowledge to resolve, and it is therefore understandable for MT to be unsuccessful in strictly attaching it to the appropriate word. Even with such an output, a reader with ample knowledge of the mechanisms of frequency doubling (which in fact is assumed to be the case in this study) would be able to interpret the relationship between a GaAs/AlOx waveguide and the frequency doubling, as long as the translation does not contain much misleading information. Based on this, the study would consider TT2 a satisfactory version. It is also interesting here that simply changing the translation of “frequency doubling” and “quantum cascade laser” (see Appendix D) has resulted in SYSTRAN’s correct resolution of the accidental ambiguity regarding PP1, and avoided attaching PP2 to the wrong word as in its initial output. Perhaps this is because the two items are added in the user-defined dictionary as multi-word entries, but it remains to be tested whether adding the items individually would lead to different results.

Example 5 — Abstract 4, 53 (27) ST:

We demonstrate that, by controlling two free variables [related to the target image], the designed hologram is free from aliasing and can have minimum error.

TT1: 我们显示出,通过控制二自由可变物 [相关对目标图像],被设计的全息图是从混叠现象解脱,并且罐头有极小的错误。

TT2: 我们显示出,通过控制二个自由变量 [相关对目标图像],被设计的全息图是无混叠并且能有极小的误差。

In this sentence, the relative clause “related to the target image” remains problematic. As can be seen in Appendix D, the words added do not include “related to”. It is perhaps interesting to test whether including this phrase in the modified glossary list would result in the correct resolution in this case.


3.4.4 Conjunction

As mentioned above, the recognition of the scope of conjunctions is a prominent difficulty for MT. Nevertheless, various kinds of conjunctions are in fact very common in academic writing, and differences in the recognition of the coordinated words or phrases can lead to different syntactic parsing and, more importantly, very different meaning. Therefore, resolution of such syntactic ambiguity is important for proper information transmission in the TT. In quantitative terms, there are 27 instances of conjunction in the sample, among which 5 are problematic in the initial output. Regarding the translation from customized SYSTRAN, there is only 1 instance which is not properly detected in terms of scope (see Table 3.7). The following example shows SYSTRAN’s improvement in this aspect after the lexical customization.

Example 1 — Abstract 1, 53 (19) ST:

This can lead to accurate measurements in both spectrum and distance and allows a thorough characterization of the interferometer, as well as adds passive ranging information to hyperspectral images.

TT1

这对准确测量的罐头主角在光谱和距离和允许干涉仪的一个详尽的描述特性,以及增加被动排列的信息到 hyperspectral 图像。

TT2

这能导致对在光谱和距离的准确测量并且允许干涉仪的一个详尽的描述特性,以及增加无源排列的信息到高光谱图像。

This sentence is a typical example of the problem caused by coordinating conjunctions, with three conjunctions coordinating two nouns and three verb phrases in a way that can be very ambiguous for MT. As can be seen in the ST, the first “and” coordinates two nouns (“spectrum” and “distance”) as part of a PP (“in both spectrum and distance”), while the second “and” and the phrase “as well as” both coordinate verb phrases (“can lead to…”, “allows…”, and “adds…”). In TT1, there is not much indication that SYSTRAN has appropriately handled these conjunctions, other than linearly inserting the corresponding constituents in the translation. This is possibly because of the failure in recognizing the verbs “can” and “lead”: as can be seen, the constituent “can lead” is recognized as a noun-noun compound post-modified by a PP “to accurate measurements”. Since SYSTRAN is not able to find a verb before the second “and”, the verb phrase coordination cannot be properly recognized. In contrast, with modification of merely “can” and “lead” for category (see Appendix D and List C.1), TT2 shows significant improvement in this aspect. The coordination of the two nouns and that of the three verb phrases are both appropriately recognized, and the resulting translation is even made more explicit by SYSTRAN’s flexible and largely correct use of Chinese conjunctions: “和”, “并且”, “以及”. This has resulted in a version of TT with substantial improvement in its accuracy, comprehensibility, and fluency, and most importantly, the resulting TT2 is to a considerable extent functionally sufficient for the information transmission from the ST.


3.4.5 Noun Strings

Noun strings are especially prevalent in the sample, as expected. Nearly every sentence in the ST contains some noun-noun compounds, among which most are terminological items. These noun strings amount to 91 in total. In this aspect, SYSTRAN’s initial output is also problematic, as can be seen in Appendix D. These problematic instances include the 13 syntactic problems shown in Table 3.7, as well as inappropriate choice of Chinese terms that occur in nearly all instances. In discussing the other aspects above, this thesis has mentioned briefly the translation improvement on terminologies as a result of the modification on polysemy, but the syntactic structures of the compound nouns are perhaps more important for proper translation output. The 13 problematic noun string instances in terms of syntax have been reduced to as few as 4, due to the lexical customization on SYSTRAN. The following examples aim to illustrate SYSTRAN’s improvement on noun-noun compounds in more detail, with a focus on syntax.
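For ease of comparison, the counts quoted in the running text of Sections 3.4.1 to 3.4.5 can be converted into proportions with a short script. The sketch below only restates figures already given above (total instances, problematic instances before and after customization); the subordinate-clause verbs are left out because the initial figure is given only as “approximately half”. The category labels and function name are illustrative.

```python
# Counts quoted in the running text of Section 3.4:
# (instances in the sample, problematic before, problematic after customization).
COUNTS = {
    "phrasal verbs": (7, 3, 1),
    "-ing words": (30, 9, 2),
    "attachment (PP/RC)": (76, 49, 33),
    "conjunction": (27, 5, 1),
    "noun strings (syntax)": (91, 13, 4),
}


def error_rates(counts):
    """Map each category to (rate before, rate after) as fractions of instances."""
    return {
        name: (before / total, after / total)
        for name, (total, before, after) in counts.items()
    }


rates = error_rates(COUNTS)

if __name__ == "__main__":
    for name, (before, after) in rates.items():
        print(f"{name:24s} {before:6.1%} -> {after:6.1%}")
```

Expressed this way, attachment stands out as the category with the highest share of problematic instances both before and after customization, which is consistent with the qualitative discussion in Section 3.4.3.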

Example 1 — Abstract 1, 53 (19) ST:

Light field camera as a Fourier transform spectrometer sensor: instrument characterization and passive spectral ranging.

TT1

轻的领域照相机作为傅立叶变换分光仪传感器:仪器描述特性和被动鬼排列。

TT2

光场照相机作为傅立叶变换分光仪传感器:仪器描述特性和无源光谱排列。

In this example, the “light field camera” is a compound noun where “light field” is a terminological concept in itself and modifies the phrasal head “camera”. However, it can be seen in TT1 and in the POS labelling results that SYSTRAN has recognized “field camera” as a compound noun modified by an adjective “light”. In other words, “[light field] camera” was recognized as “light [field camera]”. After the modification, the translation “光场照相机” is much better in terms of terminological accuracy. Since “light field camera” is the topic of the sentence and of the entire abstract, it contains core information and should be considered very important for proper information transmission through translating. Therefore, this improvement is crucial for the output sentence as a whole.
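The contrast between “[light field] camera” and “light [field camera]” is the simplest case of a more general bracketing problem. As a minimal sketch (not a description of how SYSTRAN analyses compounds), the function below enumerates every binary bracketing of a noun string; the function name is hypothetical and the enumeration ignores lexical knowledge that would rule most candidates out.

```python
def bracketings(words):
    """Return every binary bracketing of a noun string as nested tuples."""
    if len(words) == 1:
        return [words[0]]
    results = []
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results


three = bracketings(["light", "field", "camera"])
four = bracketings(["Fourier", "transform", "spectrometer", "sensor"])

for tree in three:
    print(tree)
print(len(four))
```

A three-noun string has two possible bracketings and a four-noun string has five, so the number of candidate structures grows quickly with the length of the string; a glossary entry that fixes a unit such as “light field” removes the competing analyses at once.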

The “Fourier transform spectrometer sensor” is translated very accurately by SYSTRAN, as reflected in many other instances as well. It can be seen from Appendix D that, apart from the words discussed above concerning polysemy, category, and NFW, a considerable number of compound nouns are translated very satisfactorily by SYSTRAN’s initial glossaries. This is largely because of the settings of the MT system in this case study. As mentioned above, before the ST was input into SYSTRAN, the system was specifically adjusted to the “technical” profile settings, and the corresponding glossaries to be activated were carefully selected as a preliminary condition for the case study. This process is also part of the MT customization, and the correct and accurate translation of such noun compounds as “Fourier transform spectrometer sensor” illustrates that the settings are effective for facilitating the MT system13. The “passive spectral ranging” is an example of the improvement on polysemy. As noted above, the improvement in its translation is significantly helpful for the reader’s proper understanding of the information in the ST.

Example 2 — Abstract 5, 53 (25) ST:

AlGaAs guided-wave second-harmonic generation at 2.23 µm from a quantum cascade laser.

TT1

在 2.23 µm 的 AlGaAs 引导波浪第二泛音一代从量子小瀑布激光。

TT2

AlGaAs 导波在2.23 μm 的二次谐波生成从量子级联激光器。

In the ST, “AlGaAs guided-wave second-harmonic generation” is a long sequence of nouns, and as can be seen in TT1, SYSTRAN’s initial translation has failed to translate any of the nouns correctly. Rather, the system seems to have picked a translation for each word from its glossary and inserted them into the phrase sequentially. The resulting output is consequently very misleading. Since this compound conveys the core information for this sentence and for the entire abstract, such an error as in TT1 is fatal.

¹³ The effectiveness of such settings can be further investigated by comparison of the output to that from default settings. Due to the scope of this thesis, this is not included in the case study.

Regarding this, TT2 is, to some extent, surprisingly effective in terms of improvement. Although the complete sequence is a compound noun in the ST syntax, some of the components are semantically closer than others: “[AlGaAs guided-wave] [second-harmonic] generation”, or in other words, “generation of second-harmonic for AlGaAs guided-wave”. Therefore, although this sequence in the ST can be considered a compound noun as a whole, the Chinese translation does not necessarily have to be a corresponding sequence of nouns but can (and perhaps should) be a version where the semantically closer nouns are placed together while others are not. TT2 is typical of this. Rather than translating the sequence into “在 2.23 µm 的 AlGaAs 导波二次谐波生成”, a version which is already better than TT1, the resulting output has effectively explicitated the semantic relationship among these nouns. This makes TT2 much better in terms of translation quality. The translation for “quantum cascade laser” is also an effective improvement in terms of polysemy, as illustrated above.

Example 3 — Abstract 4, 53 (27) ST:

Here we present a new design prescription for precise near-field CGHs based on comprehensive analysis of the spatial bandwidth.

TT1: 这里我们当前一张新的设计处方为精确近领域根据对空间带宽的全面分析的 CGHs。

TT2: 这里我们介绍根据对空间带宽的全面分析的精确近场CGH的一个新的设计方法。

In the previous sections, this sentence has already been discussed in terms of the main verb (see 3.4.1) and the attachment of prepositional phrases and relative clauses (see 3.4.3), where it is mentioned briefly that the clause led by “based on” is mistakenly attached to “CGHs” in SYSTRAN’s initial translation, significantly distorting the information both syntactically and semantically. Section 3.4.3 has also mentioned that this is related to SYSTRAN’s detection of constituent boundaries, as the noun phrase “near-field CGHs” has been wrongly separated. It is worth noting that the hyphenated item “near-field” is considered an adjective in this case study, as shown in Section 3.2.1; strictly speaking, therefore, this should not be a noun-compound issue. However, the initial run shows that SYSTRAN’s glossary does not contain the items “near-field” and “CGHs” (see 3.2.1 above for the list of “Not Found Words”), although it does contain both “near” and “field”. Further checking in the system shows that the hyphenated item “near-field”, like “CGHs”, is labelled as a noun in SYSTRAN. In this sense, to the system these words are equivalent to a noun-string ambiguity.

It is also worth emphasizing that noun strings are problematic for MT not only in terms of their terminological translation, but more importantly for the proper parsing of the ST syntax. Since compound nouns are prevalent in English, such kinds of structural ambiguity are “very common” (Hutchins & Somers, 1992, p. 91) and are therefore a particular problem for MT. This is consistent with the following sample sentences given by Hutchins and Somers (ibid.):

(1). The mathematics students sat their examinations.
(2). The mathematics students study today is very complex.

The ambiguity here is about whether the two adjacent nouns “mathematics students” form a single constituent, as in (1), or have a constituent boundary between them, as in (2). Although this is not much of a “real ambiguity”, for MT it can still be very problematic. Regarding the sentence in Example 3, “near-field CGHs” is equivalent to “mathematics students” in these two sentences. The initial run in SYSTRAN seems to have parsed a structure similar to that of sentence (1) as if it had the structure of sentence (2), as can be seen in TT1. In TT2, however, the issue is dealt with fairly satisfactorily, primarily because of the added glossary items concerning “Not Found Words” (see 3.3 above). As shown in Section 3.3 and Appendix D, the only words modified here are “present” for category ambiguity, “prescription” for polysemy, and “near-field” and “CGH” for Not Found Words (NFW). As a result of the modification for these two NFW items, there is a significant improvement in TT2 regarding the noun phrase here.

Sample sentence (2) above involves another problem which is particularly important for the case study here: the words “study” and “today” can be nouns as well. Therefore, in addition to the above-mentioned ambiguity concerning the nouns “mathematics students”, it is possible for an MT system to consider the entire chunk “mathematics students study today” as a series of nouns (as in “Fourier transform spectrometer sensor”) followed by a predicate “is very complex”, though this is a rather unlikely interpretation for a human reader. As emphasized previously, what is straightforward for a human reader might be considerably ambiguous for MT, a phenomenon which is often called “accidental ambiguity” (cf. Hutchins & Somers, 1992, p. 88). Ambiguities like this are in fact very common, including most of the examples in the above sections concerning verbs, such as the one reproduced below.

Example 4 — Abstract 1, 53 (19) ST:

This can lead to accurate measurements in both spectrum and distance and allows a thorough characterization of the interferometer, as well as adds passive ranging information to hyperspectral images.

TT1: 这对准确测量的罐头主角在光谱和距离和允许干涉仪的一个详尽的描述特性,以及增加被动排列的信息到 hyperspectral 图像。

TT2: 这能导致对在光谱和距离的准确测量并且允许干涉仪的一个详尽的描述特性,以及增加无源排列的信息到高光谱图像。

As mentioned in Sections 3.4.1 and 3.4.2, the proper identification of which words are functioning as verbs is crucial, and this type of ambiguity is very common. The prevalence of compounds in English, the high frequency of verb-noun categorial ambiguity, and the fact that English permits omission of relative pronouns mean that the system might have considerable difficulty in properly dealing with a series of words that can all function as nouns. It may wrongly recognize the constituent boundary of the noun string, or mistakenly recognize as a string of nouns a sequence of words that should not be a constituent at all. Wrongly recognizing words of other categories as a string of nouns is problematic not only for the translation of these words, but also for the adjacent constituents or even the entire sentence, as shown in Example 4 and in other examples mentioned above. Therefore, modifying a limited number of words in terms of category and polysemy, as shown above, is meaningful and effective with regard to noun strings when translating the kind of ST in this case study.
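The scale of such accidental ambiguity can be sketched with a few lines of Python. The snippet is an illustration under stated assumptions: the mini-lexicon `LEXICON` and the helpers `tag_sequences` and `longest_noun_run` are hypothetical and have nothing to do with SYSTRAN’s internal glossary or tagger. It counts the category assignments available for sample sentence (2) when each word may take more than one category, and checks how long a run of “nouns” a system could wrongly posit.

```python
from itertools import product

# Hypothetical mini-lexicon (an assumption for illustration only):
# each word is mapped to the categories it could take in isolation.
LEXICON = {
    "the": {"DET"},
    "mathematics": {"N"},
    "students": {"N"},
    "study": {"N", "V"},
    "today": {"N", "ADV"},
    "is": {"V"},
    "very": {"ADV"},
    "complex": {"ADJ", "N"},
}

SENTENCE = "The mathematics students study today is very complex"


def tag_sequences(sentence):
    """Return every combination of categories the lexicon allows."""
    words = sentence.lower().split()
    options = [sorted(LEXICON[word]) for word in words]
    return [tuple(tags) for tags in product(*options)]


def longest_noun_run(tags):
    """Length of the longest uninterrupted sequence of N tags."""
    best = run = 0
    for tag in tags:
        run = run + 1 if tag == "N" else 0
        best = max(best, run)
    return best


candidates = tag_sequences(SENTENCE)
print(len(candidates))
print(max(longest_noun_run(tags) for tags in candidates))
```

Even with this tiny lexicon there are eight candidate assignments, and at least one of them treats “mathematics students study today” as a four-noun string. Fixing the category of a single ambiguous word in a user-defined dictionary removes a whole group of such candidates at once.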


3.4.6 Terminological Items

Another important aspect of improvement in the translation output after activating SYSTRAN’s user-defined dictionary is the accuracy of terminology. This aspect is also apparent in all of the examples shown in the previous sections, and some of them are reproduced here. The examples below illustrate that modifying the glossary in terms of polysemy has significantly improved the terminological translation in the output.

Example 1 — Abstract 1, 53 (19) ST:

Light field camera as a Fourier transform spectrometer sensor: instrument characterization and passive spectral ranging.

TT1

轻的领域照相机作为傅立叶变换分光仪传感器:仪器描述特性和被动鬼排列。

TT2

光场照相机作为傅立叶变换分光仪传感器:仪器描述特性和无源光谱排列。

Example 2 — Abstract 3, 53 (9) ST:

Amplitude-only, passive, broadband, optical spatial cloaking of very large objects.

TT1: 高度,被动,宽频,光学空间掩饰非常大对象。

TT2: 纯振幅,无源,宽频,光学空间掩饰非常大物体。

Example 3 — Abstract 4, 53 (27) ST:

Spatial bandwidth analysis of fast backward Fresnel diffraction for precise computer-generated hologram design.

TT1

对精确计算机生成的全息图设计的快速的落后菲涅耳衍射的空间带宽分析。

TT2

对精确计算机生成的全息图设计的快速的逆向菲涅耳衍射的空间带宽分析。


Example 4 — Abstract 5, 53 (25) ST:

AlGaAs guided-wave second-harmonic generation at 2.23 µm from a quantum cascade laser.

TT1

在 2.23µm的AlGaAs 引导波浪第二泛音一代从量子小瀑布激光。

TT2

AlGaAs 导波在2.23µm的二次谐波生成从量子级联激光器。

Example 5 — Abstract 6, 53 (5) ST:

On the nature of Acket’s characteristic parameter C in semiconductor lasers.

TT1

在 Acket 的典型参量C的本质在半导体激光的。

TT2

在 Acket 的特征参量C的本质在半导体激光器的。

Example 6 — Abstract 7, 53 (1) ST:

The correction results show that ICA is a powerful correction algorithm for static or slowly changing phase aberrations in optical systems, such as solid-state lasers.

TT1

更正结果展示ICA是静止或慢慢地改变的阶段变形的一种强有力的更正算法在光学系统,例如固体激光。

TT2

校正结果展示ICA是静态或慢慢地改变的相位畸变的一种强有力的校正算法在光学系统,例如固态激光器。

As can be seen from the examples above, some of the terms are translated into general-language items or into the kind of terms used in other domains, e.g. field (领域/场)¹⁴, passive (被动/无源), object (对象/物体). Other examples of such terms in Appendix D include aberration (变型/畸变), cascade (瀑布/级联), plane (飞机/平面), phase (阶段/相位), variable (可变物/变量), etc. Some other terms are translated into the right domain, but into the wrong equivalent terms for that particular sense, e.g. laser (激光/激光器), interferometric (干涉测量/相干). Such terms are often ambiguous (i.e. polysemous) even in the specific domain, and the issue has been solved by adding them into the user-defined dictionary together with the adjacent words, as shown above.

¹⁴ Here, the item before the slash is SYSTRAN’s initial translation, while the one after the slash is how that word should be translated. In other words, “field” was initially translated into “领域”, but should be “场”. The same applies to the following examples.

The reason for these problems can be one of the following: 1) the glossary has too many translations for a term, which confuses the system in choosing the right equivalent for it, e.g. field, plane, generation, etc.; 2) the glossary has only one equivalent translation for the term, but not the right or appropriate one, e.g. aberration, cascade, etc.; or 3) the glossary simply does not contain an entry for the term (i.e. NFW items), e.g. CGH, hyperspectral, homodyne, Lang-Kobayashi, etc.

The modification of these glossary entries has largely increased the accuracy of the translation of these terms, and given the density of terminology in such texts, this aspect is crucial for the overall translation improvement. The correction of these terminological items, as can be seen from Appendix D, is considerably effective in improving the comprehensibility of the text as a whole.
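The effect of adding a term “together with the adjacent words” can be sketched as a longest-match lookup in which multi-word user entries take priority over word-by-word glossary equivalents. The Python snippet below is a simplified illustration and not a description of SYSTRAN’s actual dictionary mechanism; `translate_terms`, `BASE_GLOSSARY` and `USER_DICTIONARY` are hypothetical names, and the word-level segmentation of the TT1 strings is an assumption made for the example.

```python
# Hypothetical word-level glossary, reconstructed from the TT1 outputs
# quoted in this section (the segmentation is an assumption).
BASE_GLOSSARY = {
    "light": "轻的",
    "field": "领域",
    "camera": "照相机",
    "quantum": "量子",
    "cascade": "小瀑布",
    "laser": "激光",
}

# Hypothetical user-defined entries keyed by multi-word terms.
USER_DICTIONARY = {
    ("light", "field"): "光场",
    ("quantum", "cascade", "laser"): "量子级联激光器",
}


def translate_terms(tokens, user_dict, base):
    """Translate tokens, preferring the longest matching user entry."""
    output = []
    longest = max((len(key) for key in user_dict), default=1)
    i = 0
    while i < len(tokens):
        # Try the longest possible user entry first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in user_dict:
                output.append(user_dict[key])
                i += n
                break
        else:
            # No user entry: fall back to the base glossary; a word that
            # is not found at all (an NFW) is left untranslated.
            output.append(base.get(tokens[i], tokens[i]))
            i += 1
    return output


print("".join(translate_terms(["light", "field", "camera"], {}, BASE_GLOSSARY)))
print("".join(translate_terms(["light", "field", "camera"], USER_DICTIONARY, BASE_GLOSSARY)))
```

With an empty user dictionary the sketch reproduces the word-by-word rendering seen in TT1 (轻的领域照相机), while the multi-word entry yields the TT2 rendering (光场照相机). The same lookup covers all three problem types listed above, since a user entry can override a wrong choice among several equivalents, replace a single inappropriate equivalent, or supply a missing entry.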


3.4.7 Not Found Words

The Not Found Words (NFWs) are shown in Appendix D as red words in ST and TT1. As can be seen in the TT2 of each abstract, adding these items to the user-defined dictionary has not only avoided the instances where the NFW items are left untranslated, but more importantly affected the syntactic parsing and transfer processes to a considerable extent. Many of the examples in the discussion above involve NFW items, as reproduced below.

Example 1 — Abstract 4, 53 (27) ST:

Here we present a new design prescription for precise near-field CGHs based on comprehensive analysis of the spatial bandwidth.

TT1: 这里我们当前一张新的设计处方为精确近领域根据对空间带宽的全面分析的 CGHs。

TT2: 这里我们介绍根据对空间带宽的全面分析的精确近场CGH的一个新的设计方法。

In this sentence, the two items added as NFW items are “near-field” and “CGH”, the former as an adjective and the latter as a noun (see Section 3.2.1). As mentioned above, hyphenated items are often treated by SYSTRAN as NFWs even when each of the individual words is included in its glossary. In most such cases, the system does generate correct and appropriate translations by combining its choice of equivalents for the words concerned, e.g. “position-angle-spectra” (位置角度光谱), or through a certain extent of syntactic analysis, e.g. “earth-based” (基于地球的). Although some of them are not strictly accurate because of a wrong or inappropriate choice of equivalents, such as “drift-diffusion” (漂泊扩散/漂移-扩散) and “five-dimensional” (五尺寸/五维), these instances do not seem to affect much of the syntactic analysis for the sentence or clause related to the NFWs. Therefore, such items are similar to the issue of “terminological items” in the previous section (see 3.4.6).

However, the sentence in Example 1 here is one of the instances where adding the NFW item has in fact improved SYSTRAN’s disambiguation of syntactic structures. As illustrated in Section 3.4.3 and Section 3.4.5 above, a very serious error in the translation of the sequence “near-field CGHs… bandwidth” is corrected fairly satisfactorily in TT2. This is mainly because “near-field” is categorized as an adjective in the user-defined dictionary, which avoids the kind of attachment and noun-string ambiguities discussed above. The presence of “CGH” in the user-defined dictionary has also helped the system to recognize the plural form, as can be seen in TT2. Since the Chinese word does not have inflections for plural nouns in this case, the corresponding translation for this abbreviation has left out the “s” at the end of the word, which helps to avoid confusing readers about the specific technique for which the prescription is designed. The same is true for other instances of NFW where plural forms are involved.

Regarding the NFW items, perhaps the most important and apparent effect is the improvement where the words are left untranslated in the initial run. These are usually proper nouns, abbreviations, hyphenated items, or other kinds of terminology. This issue has already been discussed in the previous section (see 3.4.6), and it also overlaps with some of the other examples illustrated above, such as the one below:

Example 2 — Abstract 1, 53 (19) ST:

This can lead to accurate measurements in both spectrum and distance and allows a thorough characterization of the interferometer, as well as adds passive ranging information to hyperspectral images.

TT1: 这对准确测量的罐头主角在光谱和距离和允许干涉仪的一个详尽的描述特性,以及增加被动排列的信息到 hyperspectral 图像。

TT2: 这能导致对在光谱和距离的准确测量并且允许干涉仪的一个详尽的描述特性,以及增加无源排列的信息到高光谱图像。
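The distinction drawn in this section between hyphenated NFWs whose parts are known and NFWs that are entirely unknown can be expressed as a simple check against a glossary. The following Python sketch is illustrative only: the glossary contents and the function `classify` are hypothetical, and SYSTRAN’s own NFW detection may well work differently.

```python
# Hypothetical glossary headwords (an assumption for illustration only).
GLOSSARY = {"near", "field", "earth", "based", "passive", "images", "precise"}


def classify(token, glossary):
    """Label a token as found, a hyphenated NFW with known parts, or unknown."""
    word = token.lower()
    if word in glossary:
        return "found"
    parts = word.split("-")
    # A hyphenated item is compositional if every component is a headword.
    if len(parts) > 1 and all(part in glossary for part in parts):
        return "nfw_compositional"
    return "nfw_unknown"


for token in ["precise", "near-field", "CGHs", "hyperspectral"]:
    print(token, classify(token, GLOSSARY))
```

Under these assumptions, “near-field” is an NFW that could still be rendered by combining the equivalents of its parts, whereas “CGHs” and “hyperspectral” would be left untranslated, as in TT1. The sketch does not capture the syntactic effect discussed above, namely that the category assigned to an added entry can also change how the surrounding sentence is parsed.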


3.4.8 Remaining Errors

The above sections have illustrated for the most part the improvement in the translation output after activating the user-defined dictionary. As can be seen from Appendix D, some of the sentences are still problematic despite the considerable improvement discussed above. Most of these problems are related to the attachment of prepositional phrases or relative clauses. It is important to note that many of them are minor and sometimes do not seem to influence at all the proper information assimilation of the ST for a domain-specific reader, as can be seen from the analysis below. Apart from such minor problems, however, there do exist sentences that are still unsatisfactory in terms of structural ambiguity; but as mentioned above, this study puts an emphasis on the relative quality of translation, and is largely focused on the improvement of SYSTRAN’s output resulting from the customization conducted. Despite the problems that remain after the glossary modification, the effect of the customization on relative translation quality is not to be ignored. The following examples illustrate this in more detail.

Example 1 — Abstract 4, 53 (27) ST

Designing near-field computer-generated holograms (CGHs) for a spatial light modulator (SLM) requires backward diffraction calculation.

TT1

设计近领域计算机生成的全息图(CGHs)一个空间光调制器的(SLM) 要求落后衍射演算。

TT2

设计近场计算机生成的全息图(CGH)一个空间光调制器的(SLM) 要求逆向衍射演算。

In this example, the prepositional phrase “for a spatial light modulator” has not been properly analyzed and translated in either TT1 or TT2. Strictly speaking, the corresponding Chinese for this phrase should be placed either before the translation of “computer-generated holograms” (as a noun-modifying PP), or before the translation of the verb “design” (as attached to the verb). The resulting translation versions could be “设计一个空间光调制器的 (SLM) 近场计算机生成的全息图 (CGH)”, or “为一个空间光调制器 (SLM) 设计近场计算机生成的全息图 (CGH)”. In spite of the slightly inaccurate translation for “near-field computer-generated holograms” and the perhaps redundant translation of the article “一个”, both of these versions would be perfectly acceptable, if not comparable to a high-quality human translation. However, this is not the case in TT2. Perhaps only through contextual or extra-linguistic information can a reader of the abstract infer that the holograms are designed for the modulator.

Nevertheless, for a domain-specific reader this inference does not seem difficult; some background knowledge in optics would make the relationship between a computer-generated hologram and a spatial light modulator considerably more explicit, given the properties of this technique. In addition, since in practice the reader would not be reading such an isolated sentence, the contextual information from the entire abstract can also help substantially. A look at Appendix D reveals that the improved text as a whole has made it considerably apparent that the theme of the abstract is the design of CGHs for an SLM, despite the linguistic problem in the above sentence. In other words, with some background knowledge in optics and contextual inference from the entire translation of the abstract, such an error concerning the prepositional phrase seems minor with regard to the reader’s understanding of the information in the text.

It is worth mentioning that apart from the effect of domain-specific knowledge on the CGH and SLM, this PP error is minor primarily because the main information in the ST has been correctly and adequately transmitted to the TT. Regarding this individual sentence, the specialized terms are important; if any of the terminological items were translated into a wrong, misleading equivalent, as in the case of “cascade” (瀑布/级联), “phase” (阶段/相位), or even “laser” (激光/激光器), the above problem would not be trivial at all (see above sections).

On another level, the effect of contextual inference may be more important for making the problem a minor one, and this also results from a relatively successful translation of the core information in the ST. For this abstract, the improvement in the sentence “Here we present a new design prescription for precise near-field CGHs based on comprehensive analysis of the spatial bandwidth” (see 3.4.1 above) is particularly important for the understanding of the other parts of the abstract, as it contains the main topic with which all supplemental information is consistent.

Example 2 — Abstract 2, 53 (6) ST

Nanostructured materials offer great prospects in helping solar-energy harvesting devices to achieve their envisioned performances.

TT1

在收获设备的帮助的太阳能的Nanostructured材料提议巨大远景 完成他们的被构想的表现。

TT2

纳米结构材料提供在帮助太阳能采集设备的巨大远景 完成他们的被构想的表现。

In the discussion of “-ing” verbs above (see 3.4.2), it has been illustrated that in spite of the improvement in TT2, there is still a minor problem with the translation of this sentence: the attachment of “to achieve their envisioned performances”. Although it is supposed to be attached to the verb “helping”, in SYSTRAN’s initial run the PP does not seem to be attached to this word at all. On the contrary, the system may have either attached it to the main verb “offer”, or simply inserted the sequence linearly into the translation (see TT2). Linguistically speaking this is inadequate, but as mentioned above the problem is considered minor, largely because the prepositional phrase in question does not contain crucial information for the text.

At the sentential level, the general points which the ST is aiming to make are not considerably influenced in TT2, even when the PP is not attached explicitly to the right word. If attached to the main verb (i.e. “offer”), the PP would not alter the meaning significantly. In addition, this part of the sentence seems far less important than the rest, as the sequence “Nanostructured materials offer great prospects in helping solar-energy harvesting devices” is already adequate to convey the essential points this sentence is trying to make. In comparison with errors in some other parts of the sentence, such as “helping solar-energy harvesting devices” and the main verb “offer”, where an error in translation could be very confusing and misleading, the influence of the error in question here seems trivial. Regarding the sequence which contains the major information of the sentence (i.e. “Nanostructured… devices”), TT2 is in fact very satisfactory. Therefore, it is reasonable to consider TT2 successful in transmitting the crucial information in the sentence.

At the textual level, this sentence functions as the background of the topic for the entire abstract, which makes it much less important than some other parts of the text where the research question, method, and results are presented. Accordingly, the problems here are less important than problems in those other parts of the text would be; and as an introduction TT2 is very effective in introducing the topic of the research which this abstract summarizes. In this sense, the problem here seems even more trivial, and TT2 largely qualifies as a satisfactory translation of the information in the ST.

In short, even if TT2 may not be strictly appropriate in terms of the linguistic structure of this part of the sentence, the problem does not cause much difficulty or distortion of the main and important information which the abstract is aiming to transmit to the reader. On the other hand, TT2 is rather successful in the parts where core information is conveyed. This means that, albeit not perfect, the translation can be considered satisfactory regarding information translation.

Example 3 — Abstract 5, 53 (25) ST

AlGaAs guided-wave second-harmonic generation at 2.23 µm from a quantum cascade laser.

TT1

在 2.23 µm 的 AlGaAs 引导波浪第二泛音一代从量子小瀑布激光。

TT2

AlGaAs 导波在2.23 μm 的二次谐波生成从量子级联激光器。

In the above sections on noun strings and terminological items, it has already been illustrated that the quality of TT2 is significantly improved compared with TT1 (see 3.4.5 and 3.4.6). However, the prepositional phrase “from a quantum cascade laser” remains problematic to some extent. Ideally, this PP could have been rearranged in the translation output for more fluency, which would improve the quality of TT2 even further.

Similar to the above examples, however, this problem is considered insignificant, because it does not seem to bring much distortion to the information conveyed. On the one hand, even when the PP is placed at the end of the translated sentence in a manner equivalent to the English syntax, the meaning is not changed significantly; on the other, for a reader who has some background knowledge in optics, it would be very straightforward and unambiguous that in this case, the second-harmonic is generated from a laser. At the textual level, this information is also clear when one reads the entire abstract as a whole, as can be seen from Appendix D.

Again, this results from two factors: the improvement of the translation of the other pieces of crucial information in this sentence, and the comprehensive improvement of all other sentences in this abstract. For the former, TT2’s improvement concerning the terminologies and noun strings is considerably effective (see 3.4.5); and for the latter, the first sentence is particularly important: “We demonstrate the frequency doubling of a quantum cascade laser in a multilayered, partially oxidized GaAs/AlOx waveguide” (see 3.4.3).

Example 4 — Abstract 5, 53 (25) ST

We demonstrate the frequency doubling [of a quantum cascade laser] [in a multilayered, partially oxidized GaAs/AlOx waveguide].

TT1

我们展示 | 频率加倍 | 在多层,部分地被氧化的GaAs/AlOx波导 | 的 | 量 子小瀑布激光。

TT2

我们展示 | 一量子级联激光器的双倍频 | 在多层,部分地被氧化的 GaAs/AlOx 波导的。

Section 3.4.3 above has illustrated in detail the structural differences between the two translation versions of this sentence regarding PP attachment, while the improvement in terminology is also covered briefly in 3.4.5 and 3.4.6 with examples of similar phenomena (see above). These aspects of improvement are prominent and seem to far outweigh the problem to be discussed in this section. However, it is not to be neglected that despite the significant improvement in TT2 concerning both grammatical structure and semantic accuracy, the system did encounter some problems in handling the second PP (“in a multilayered… waveguide”). If this phrase were reordered, the quality of the translated sentence would be even better.

Section 3.4.3 (Example 4) has also shown that although PP2 is problematic in both TT1 and TT2, TT2 has in fact avoided misleading the reader in terms of information transmission (see above). Coupled with the fact that PP2 is a real ambiguity requiring domain-specific knowledge to resolve, this should make the problem understandable. However, what seems more important for the acceptability of this issue is that for TT2, as mentioned in 3.4.3, the relationship between a GaAs/AlOx waveguide and the frequency doubling is not hard to gauge. For a reader with some, if not ample, knowledge of the relevant optical mechanisms, this relationship is rather straightforward and unambiguous when reading TT2; but beyond the effect of domain-specific knowledge, the sentence in TT2 itself is to some extent indicative of the relationship between the waveguide and the frequency doubling, largely as a result of the characters “在” and “的”, and of the considerably satisfactory translation for PP1 (i.e. “量子级联激光器的双倍频”).

In other words, the core information intended by the ST seems to have been preserved adequately and correctly in TT2 in spite of the problem with PP2. The issue under discussion here does not distort the original information either. In addition, the (perhaps excessively) implicit relationships between keywords can be gauged without much difficulty in this case. Therefore, the problem here is considered minor, and TT2 is considered satisfactory despite this problem.

Example 5 — Abstract 6, 53 (5) ST

Quasi-static interferometric signals in lasers under feedback arise from slowly varying perturbations of the intracavity electric field resulting from the reinjection of a portion of the emitted field into the cavity.

TT1

在激光的准静态的干涉测量的信号在反馈下 从intracavity电场的慢慢地变化的扰动出现起因于散发的领域的部分的reinjection入洞。

TT2

在激光的准静态的相干信号在反馈下 从腔内电场的慢慢地变化的扰动出现起因于辐射的场的部分的重新注入入腔。

This is a sentence with a number of structural ambiguities, and a typical example of the difficulty for MT when dealing with academic writing. It is long and contains considerable complexity in terms of embedding. The structural ambiguity here involves both real and accidental ones, and in this regard the improvement after the modification is nearly negligible.

As can be seen from TT1 and TT2, the only effect of the glossary modification in this instance is a more accurate translation of the specialized terms involved, such as the “polysemous” items including “interferometric”, “field”, “emit”, and “cavity”, as well as the “NFW” items including “intracavity” and “reinjection” (see also Appendix D, Section 3.2 and Section 3.3). Such kinds of improvement are obvious and effective regarding terminological accuracy, avoidance of information distortion, and the overall understanding and contextual inference for the reader; but other than this, the syntactic structure of the sentence in TT1 remains entirely unchanged.

Here, both TT1 and TT2 have considerable problems in dealing with the real structural ambiguities, distorting the original information to a noticeable extent. The PP “under feedback” should be attached to “lasers”, while the relative clause “resulting from… cavity” should be attached to “perturbations”. Neither is properly managed in the output. It seems that SYSTRAN has attached the PP to the verb “arise” and the relative clause either to “arise” or to no word at all, as can be seen from the translations. These two issues are crucial, as they influence the overall sense of the sentence significantly. Therefore, as an individual sentence the output here is perhaps not satisfactory for information transmission.

However, it is important to note that these two aspects of ambiguity require considerable domain-specific knowledge to resolve. As mentioned above, even for a human translator they are very difficult to disambiguate without adequate extralinguistic information. Besides these, there are many other, accidental ambiguities in this sentence which could be potentially very problematic for MT (e.g. the verb/noun ambiguity of “signals”, etc.); but SYSTRAN seems to have dealt with them perfectly well. It is also important that the sentence here functions as the background for the abstract (see Appendix D), a relatively less important section in terms of information. In this sense, perhaps the proper translation of terminological items is already effective enough for the information translation of the text as a whole.

Example 6 — Abstract 4, 53 (27) ST

To achieve this, we analyze the geometry of the target image, hologram, and Fourier transform plane of the target image to derive conditions for minimizing reconstruction error due to truncation of spatial frequencies lying outside of the hologram.


TT1

要达到此,我们分析目标图像的目标图像、全息图和傅立叶变换飞机的几何获得使减到最小的重建错误的条件由于说谎在全息图外面的空间频率的截。

TT2

要达到此,我们分析目标图像的几何形状,全息图,并且获得使减到最小的重建误差的条件的目标图像的傅立叶变换平面由于空间频率的截断位于在全息图外面。

Similar to the previous example, this sentence is another typical difficulty for MT. The ST is long and contains very complex structural ambiguities, mainly involving PP attachment and conjunction. In this regard, both TT1 and TT2 are problematic. Regarding the conjunction “and” alone, there can already be a number of different interpretations for the ST structure, all grammatical, if the actual mechanisms of the optical system are not considered. For example, a very straightforward interpretation can be as shown in the following dependency structure:

where the coordinated phrases are all objects for the main verb “analyze”. Given that the word “transform” can sometimes (if not commonly) function as a verb, the coordinated sequences might also be the following structure:

where “hologram” is in apposition to “the geometry of the target image”, and “Fourier” is the subject for the verb “transform”. These are but two examples of the many possible interpretations that can be both grammatically valid and semantically acceptable, while TT1 and TT2 provide another two interpretations in addition. A checking of the word alignment in SYSTRAN reveals that in TT1, the MT system seems to have parsed the ST in the following manner:


where, if not considering the actual experiment in the research, this part of the sentence can make very much sense (see also TT1 above). It can be also seen, from checking the word alignment, that in TT2 the structure appears to be as follows:

which is sharply different, and in this instance the issue of the conjunction has interfered with the latter part of the sentence. The sequence “to derive conditions for minimizing reconstruction error” seems to be treated as an attributive modifying the noun phrase “Fourier transform plane” (hence the translation “获得 | 使减到最小的重建误差的条件 | 的 | 目标图像 | 的 | 傅立叶变换平面”). It seems that other than the last one, all the above interpretations of the first half of this ST sentence are grammatically plausible. However, if one takes a closer look at the experiment involved, or considers the relevant optical mechanisms, it becomes apparent that none of them is consistent with the real case¹⁵. In fact, the structure for this part of the sentence should preferably be as follows:

where the object of “analyze” is “geometry”, the conjunction “and” coordinates “image”, “hologram”, and “plane”, the PP “of the target image” is attached to “plane”, and the PP “to derive” is attached to the main verb16.

15 Further reference can be found in: Liang, J., & Becker, M. F. (2014). Spatial bandwidth analysis of fast backward Fresnel diffraction for precise computer-generated hologram design. Applied Optics, 53(27), G84-G94.

16 See, for example, the following sentence taken from the paper:


Regarding the latter part of the sentence, the ambiguity is equally, if not more, problematic. In addition to the attachment of the PP “to derive…”, the phrase “due to…” can also be attached to different words: “analyze”, “derive”, or “error”. Within this phrase the relative clause “lying outside of the hologram” is also considerably problematic for the two versions of translation. In short, this sentence is very complicated in terms of embedding. The structural ambiguity has caused considerable problems for SYSTRAN both before and after the glossary modification, and neither TT1 nor TT2 seems satisfactory in this regard. Given that the sentence illustrates the “method” of the work reported in the abstract, the problem here does not seem as minor as the previous ones. The information distortion resulting from SYSTRAN’s improper syntactic analysis can significantly influence the reader’s understanding of how the “conditions for minimizing reconstruction error” were derived, even with some optics background. This is a piece of information which is very specific to the article concerned. In addition, this sentence is the only one in the abstract which describes the method for deriving the conditions, so there is nowhere else in the entire text where the information can be recovered through contextual inference (see Appendix D). The way in which the authors derive those conditions for minimizing reconstruction error, a rather important piece of information for an abstract, has been considerably, if not completely, lost in the translation. Here, similar to the previous examples, the improvement of TT2 in comparison with TT1 should not be ignored. On the one hand, the terminological items are more accurate, which helps to facilitate information translation. On the other, it is obvious that regarding the phrases coordinated by “and”, TT2 has avoided much of the information distortion which is prominent in TT1.
However, this improvement is not effective for the sentence as a whole, because TT2 has introduced considerable additional information distortion in the latter part (after “and”). This distortion seems more troubling than the improvement in the preceding constituents. Therefore, in its entirety TT2 is still not considered satisfactory for this case study.

16 (continued) “Third, the geometry of the light path from the target image, past the SLM, and onto the FT plane of the target image allows the path of particular spatial frequency components to be traced. We perform this geometrical analysis, which identifies the regions of the target image that experience truncation of some spatial frequency components, and derive conditions that minimize error in the reconstructed images.”

Nevertheless, though TT2 is rather unsatisfactory with respect to the ST, the problem does not seem fatal for the entire abstract. As can be seen in Appendix D, this abstract contains two sentences for the method. The error for the hologram is minimized through “controlling two free variables related to the target image”, and the specific ways of controlling these variables are described by the sentence in this example. Therefore, the information discussed in the ST here is merely part of a larger piece, the rest of which is in fact well preserved in the translation. In addition to the method, the background, results and conclusion are all very satisfactory in the output after modification. If one reads the modified translation alone, all crucial information can be properly assimilated (see Appendix D), and in this sense, the issue here does not seem to be fatal in view of the entire text. It is also important to note that problems like this are not common in the sample, as can be seen in Appendix D. Most of the sentences are either satisfactory or have only minor problems.


3.5 Summary

This chapter has illustrated the case study of lexical customization, using SYSTRAN as the Machine Translation system and the abstracts of Applied Optics as the Source Text. Starting from a general description of SYSTRAN, Applied Optics, and the methods by which lexical customization is analyzed, conducted and evaluated, the chapter proceeds with a detailed discussion of the findings from a sample that has been selected from the corpora via systematic sampling.

In the initial translation with SYSTRAN, various problems arise because of the lack of lexical customization, and considering this chapter as a whole, it is not hard to see that these problems are not confined to the lexis alone: errors in the processing of categorial ambiguity have led to serious issues in syntax, meaning, and the overall quality of the translation. The discussion of the initial translation is focused on lexical ambiguity for the system, including Not Found Words, categorial ambiguity, and polysemy. Other problems of the initial output, particularly at the syntax level, are illustrated in comparison with the output after SYSTRAN’s customization, and are for the most part the result of the failure to resolve the lexical issues above.

In the initial run, it was found that the ST, as expected, is substantially restricted in terms of both category and polysemy; that, for the sample, SYSTRAN seems to be much more capable of resolving categorial ambiguity than of dealing with polysemy; and that by modifying the glossary entries it is possible to avoid nearly all the errors in categorial ambiguity and more than half of the errors in polysemy. An analysis of the errors in the lexical disambiguation, as well as the way in which SYSTRAN’s glossary can be modified, provides the basis for customization.

In the outcome from the customized system, considerable improvement regarding many aspects of syntax, meaning, and terminological accuracy is described. This part of the chapter also addresses how the lexical errors observed above have affected the source syntactic parsing in SYSTRAN and the subsequent transfer and synthesis of the Target Text, in aspects such as main verb identification, subordinate clauses, and modifier attachment problems. The improvement in terminological items and out-of-vocabulary words (i.e. NFWs) is also effective in terms of the overall information transfer of the TT.


Although there are still some remaining problems in the output after the lexical customization, in view of information translation many of them seem to be minor, in that some background knowledge in optics or information elsewhere can help a domain-specific reader to infer and disambiguate the meaning from the translated output. The quantitative results of the analysis on translation output in Table 3.7 indicate the above in a more straightforward manner: the number of problematic instances for each aspect of the syntactic issues in question is sharply reduced. In summary, the lexical customization that has been conducted on the basis of the sample is largely effective in improving the translation from SYSTRAN.


Chapter 4 Further Discussion: Beyond the sample

The above has illustrated that there is a very limited extent of lexical ambiguity in the sample abstracts, and that by modifying the “category” and the translation of each item in the lists, SYSTRAN’s translation would be significantly improved. However, in order for these two aspects to be truly dependable, two questions would still need to be answered:

1). Is the lexical restriction in the sample representative of the entire ST in this case, or of other texts of the same kind? More specifically, the above has illustrated that all of SYSTRAN’s disambiguation errors (i.e. List C.1 and List C.2) are in fact instances where the items are significantly restricted in terms of both category and translation (see above), but beyond the sample, are those items (i.e. List C.1 and List C.2) equally restricted in the entire corpus of Applied Optics, and in other similar texts?

2). Since adding those items to the user-defined dictionary would significantly improve the translation output from SYSTRAN, how do we know what words to add before translating a specific Source Text?

For the first, a comprehensive concordance search for the items discussed above would be meaningful. While it is reasonable to search for all the items in Appendices A and B for solid evidence of the language restriction, the ones which SYSTRAN has failed to disambiguate correctly are in many ways crucial. A discussion of the usage of these items in the entire corpus of Applied Optics, together with the same search in the other two corpora, can be sufficient to justify the rationale of the glossary modification and its effectiveness. Therefore, the scope of this thesis will be confined to those discussed in Section 3.3, i.e. items in List C.1 and List C.2. For the second issue, this thesis investigates the automatic extraction of lexical items in SYSTRAN.


4.1 Concordance Search

Here, it is worth reviewing the items discussed in Section 3.3 above. A close look at List C.1 and List C.2 reveals the following points:

1). List C.1 is mostly composed of items which function as verbs in the sample. These items mostly involve noun/verb ambiguity. This means that SYSTRAN’s category disambiguation errors were largely concentrated on instances where verbs need to be recognized. To some extent, this issue does not seem very surprising, given the flexible usage of verbs in English. The above has illustrated that a considerable number of verbs in English can also be nouns or adjectives, and that the different inflected forms of a verb can have a number of grammatical functions. Appendix A also reveals such a phenomenon: the majority of the categorially ambiguous items in the sample involve noun/verb ambiguity (including adjective/noun/verb ambiguity). This kind of ambiguity, including 55 noun/verb items, 13 adjective/noun/verb items, and 1 adjective/adverb/noun/verb item, makes up 69 of the total of 91 items in Appendix A (see Table 4.1 below). Though merely a little more than one-third of the total number of items, these words are in fact highly frequent in the sample. In terms of the corresponding occurrences, they add up to 124 (102 noun/verb occurrences, 20 adjective/noun/verb occurrences and 2 adjective/adverb/noun/verb occurrences), which is more than 70% of the total (174 occurrences).

Table 4.1 — Types of categorial ambiguity in the sample
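The arithmetic behind the figures quoted for noun/verb-type ambiguity can be checked with a short sketch. The numbers are taken from the text; the variable names are illustrative only and are not part of the study.

```python
# Figures quoted in the text for ambiguity types that involve a noun/verb reading.
items = {
    "noun/verb": 55,
    "adjective/noun/verb": 13,
    "adjective/adverb/noun/verb": 1,
}
occurrences = {
    "noun/verb": 102,
    "adjective/noun/verb": 20,
    "adjective/adverb/noun/verb": 2,
}
total_occurrences = 174  # all categorially ambiguous occurrences in the sample

item_sum = sum(items.values())
occurrence_sum = sum(occurrences.values())
occurrence_share = occurrence_sum / total_occurrences

print(item_sum, occurrence_sum, f"{occurrence_share:.1%}")
```

The sums reproduce the 69 items and 124 occurrences stated above, and the share of occurrences comes out slightly above 70%.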

While the vast majority of these noun/verb items function as nouns in the sample, when they do appear as verbs, SYSTRAN tends to encounter considerable difficulty (see Appendix A). As a result, List C.1 consists of verbs for the most part. Verbs are in many ways crucially important for properly parsing a sentence (as shown in Section 3.4 above), and these items are valuable for discussion in the entire corpus.

2). The items in List C.1 are mostly general-language words, i.e. words which are not particularly discipline-bound.

3). Items in List C.2, in sharp contrast, are mostly nouns, and apart from a few items such as “can”, “show”, “hide”, etc. (most of which overlap with List C.1), the vast majority are terminological items specifically used in optics and related disciplines.

Since terminological items apparently do not involve much ambiguity in a given domain, it is perhaps much more meaningful to search and discuss those items in List C.1, where the words are not discipline-bound and usually more flexible in usage. In addition, as mentioned above, the vast majority of the items in List C.1 overlap with those in List C.2, and the overlapping items make up the majority of the items in List C.2 which are flexible in usage. Therefore, by discussing those items in List C.1, crucial items of both lists are covered. The following sections discuss the concordance search for them in Applied Optics, Optics Letters, and Optics Express.

4.1.1 Method

All of the words in List C.1 were searched in each of the three corpora, using a concordancing tool called AntConc (see below). While the focus of this case study is on Applied Optics, the other two corpora are also useful for supporting the findings in this journal. The three corpora, namely Applied Optics, Optics Express, and Optics Letters, are denoted respectively as AP, OE, and OL. The items were searched one by one, and their usage in the corpora was recorded by manually checking the concordance lines (i.e. Key Word In Context, or KWIC). The search was repeated in all three corpora.


The corpora

In order for the search to be accurate, the corpora were edited so that the search hits would not include too many unwanted items. Author names, publication years, journal titles, etc. were all deleted. Only the abstract titles and the body of the abstracts remain in the corpora, as shown in Figure 4.1 below.

Figure 4.1 — The edited corpora

The texts were encoded in Unicode (UTF-8). Although AntConc is fully Unicode compliant (Anthony, 2014b), the corpora in question contain letters of many different languages, and the issue would be substantially simplified if the file uses an international standard which is “designed to display all characters of the languages of the world in a single encoding” (Anthony, 2014b, p. 8), i.e. UTF-8. In the corpora, any instance of abnormal characters due to encoding was corrected, such as the ones shown in the following Figure 4.2.


Figure 4.2 — Correction of abnormal characters

In this example, the abnormal form of “Maxwell’s” shown in the figure stands for “Maxwell’s”. In other words, the abnormal character sequence represents an apostrophe; but in order to ensure accuracy the sequence was checked and confirmed in the corresponding issue of Optics Express. Then this sequence of characters was searched, checked and confirmed across the entire corpora, before each instance of it could be replaced by the corresponding character (which in this case is an apostrophe). The same process applies to each of the sequences spotted (mostly beginning with “&”) until a search for “&” or “#” does not result in any occurrence in the corpora. Such sequences were present in the text mainly because of the encoding standards in the original Endnote entries.

In many corpus studies, characters like apostrophes, quotation marks and so forth are typically substituted with special labels like the one shown above, so as to avoid ambiguity in the concordance search. While these punctuation marks might be ambiguous for some aspects of data management, they do not seem to pose much of a problem for the concordance search in this section. There is thus not much need for substituting these symbols with specifically designated labels. On the contrary, such labels might be a hindrance to search accuracy. For example, in this study the plural and possessive forms of nouns were searched in addition to the original form. In this sense, such items as show’s need to be included in the search result for show. When the apostrophe is encoded as such a label, it is sometimes hard to include show’s while excluding showcase. This is made even more problematic by the fact that in the original texts, such labels were not always consistent among themselves (due to the different labelling standards in the journal’s database): the apostrophe appears sometimes as an apostrophe, sometimes as the label shown above, and occasionally as another sequence of characters. By replacing every such instance, the concordance search here would be more accurate.

These instances were generally Greek letters, punctuation marks, or mathematical symbols, and for the alphabetical letters of English this was not a problem. Therefore, the above substitution is sufficient for ensuring that the concordance search would yield enough hits of the item in discussion. Given that the corpora have been edited to contain only titles and abstracts, the concordance search would yield no hits other than items in the proper texts. Combined with the above editing of characters, this helps to ensure that the search in this section is adequately accurate.
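The correction described above was carried out by manual search and replacement. As a rough illustration only, the same kind of clean-up could be sketched in Python, on the assumption that the abnormal sequences are HTML-style character references; the sample line and its two references are hypothetical and are not taken from the corpora.

```python
import html
import re

# Hypothetical example line; the actual sequences found in the corpora are not reproduced here.
raw = "Maxwell&#8217;s equations and the &#955;/4 plate"

def decode_references(text):
    """Decode HTML-style character references, repeating until the text is stable
    (this also handles references that were escaped twice)."""
    previous = None
    while text != previous:
        previous = text
        text = html.unescape(text)
    return text

def leftover_references(text):
    """Return any remaining '&...;' sequences so that they can be checked by hand."""
    return re.findall(r"&#?\w+;", text)

clean = decode_references(raw)
print(clean)
print(leftover_references(clean))
```

An empty list from `leftover_references` corresponds to the stopping condition in the text, where a search for “&” or “#” no longer yields any occurrence. Manual confirmation against the published abstracts, as done in the study, would still be needed for sequences that are not standard references.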

Search items

Since most of the items in discussion are verbs, the search includes all inflections of these words (third person singular, past tense, participles) to investigate different ways of usage and ambiguity. If any of these words has more than one form of past tense inflection, all those forms were searched (e.g. lighted/lit). For the nouns, plural17 and possessive18 forms were included as well. The reason for doing so is that inflections might also be ambiguous for MT: “plays” could be either a verb or a plural noun, “applied” could be a passive verb (e.g. “is applied to”) or an adjective functioning as the pre-head modifier for a noun phrase as in “applied optics”, and, in a more complicated case, the “-ing” form of these verbs could be nouns, verbs or adjectives. By searching all of these forms the discussion here would involve a more comprehensive coverage of the ambiguity, and the language restriction, involved in the corpora.

17 The plural forms include not only such words as “shows”, but also those of the “-ing” words like “showings”. Here, some of them might not be commonly seen, e.g. “presentings”, but for the sake of consistency and search accuracy such items were also included.

18 Possessive forms were included via the settings in AntConc (see below), rather than by the search term list.

The items searched are listed below. The italicized words are those items in List C.1, while the ones after the dashes are the search items for each of them.

can — can, could, canning, canned, cans
show — show, shows, showed, shown, showing, showings
hide — hide, hides, hid, hidden, hiding, hidings
offer — offer, offers, offered, offering, offerings
combine — combine, combines, combined, combining, combinings
influence — influence, influences, influenced, influencing, influencings
draw — draw, draws, drew, drawn, drawing, drawings
play — play, plays, played, playing, playings
search — search, searches, searched, searching, searchings
lead — lead, leads, led19, leading, leadings
apply — apply, applies, applied, applying, applyings
light — light, lights, lighted, lit, lighting, lightings
static — static
produce — produce, produces, produced, producing, producings
present — present, presents, presented, presenting, presentings
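For illustration, the search list can be represented as a simple mapping from each item to its search forms, with a reverse index that credits every hit to its item. This is a minimal sketch and not the procedure used in the study, where the terms were entered in AntConc; the names `SEARCH_FORMS`, `FORM_TO_ITEM` and `item_for` are hypothetical.

```python
# Search forms per item, transcribed from the list in the text.
SEARCH_FORMS = {
    "can": ["can", "could", "canning", "canned", "cans"],
    "show": ["show", "shows", "showed", "shown", "showing", "showings"],
    "hide": ["hide", "hides", "hid", "hidden", "hiding", "hidings"],
    "offer": ["offer", "offers", "offered", "offering", "offerings"],
    "combine": ["combine", "combines", "combined", "combining", "combinings"],
    "influence": ["influence", "influences", "influenced", "influencing", "influencings"],
    "draw": ["draw", "draws", "drew", "drawn", "drawing", "drawings"],
    "play": ["play", "plays", "played", "playing", "playings"],
    "search": ["search", "searches", "searched", "searching", "searchings"],
    "lead": ["lead", "leads", "led", "leading", "leadings"],
    "apply": ["apply", "applies", "applied", "applying", "applyings"],
    "light": ["light", "lights", "lighted", "lit", "lighting", "lightings"],
    "static": ["static"],
    "produce": ["produce", "produces", "produced", "producing", "producings"],
    "present": ["present", "presents", "presented", "presenting", "presentings"],
}

# Reverse index: search form -> item.
FORM_TO_ITEM = {form: item for item, forms in SEARCH_FORMS.items() for form in forms}

def item_for(token):
    """Return the item a token belongs to, or None if it is not a search form.
    The all-capitals abbreviation "LED" is excluded, following the footnote on "led"."""
    if token == "LED":
        return None
    return FORM_TO_ITEM.get(token.lower())

print(item_for("Drawn"), item_for("LED"), item_for("showcase"))
```

Keeping the forms in one table makes it easy to confirm that no form is shared by two items, which would otherwise double-count hits.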

AntConc and tool settings

The concordancing program used for this process is AntConc (version 3.4.3, Macintosh OS X 10.7-10.10), a multi-platform toolkit for corpus analysis developed by Anthony (2006; 2013; 2014a). The tool provides various functions such as Key Word In Context (KWIC), Clusters/N-grams, Collocate, Keyword list, etc., and for this section of the case study two main functions were used: KWIC and Clusters. The “Global Settings” of AntConc allows adjustment of “token definition”, i.e. what is to be considered a word in the corpus analysis. The way in which tokens are defined is of crucial importance for the accuracy of the concordance search in the corpus analysis, and for this study, only “Letter Token Classes” were recognized, as shown in Figure 4.3 below.

19 In the search here, “LED” as an abbreviation for “light-emitting diode” was excluded, because it is always in capital letters when used in this sense and could be distinguished by using a “Do Not Translate” entry in the dictionary.


The reason why numbers were excluded in the token definition is that some of the patterns in the corpora might contain numbers within a cluster of tokens, e.g. “wavelength of #nm”. Since there is no space between the number and the unit (nm), including numbers in the definition would influence the accuracy of the trigram list generated in AntConc. “Punctuation Token Classes” were not included in the definition either. This was to ensure that the search for the terms above would include such forms of words as show’s, shows’, show-, etc. A more detailed illustration of this is given in the next section. Symbols and marks do not make much difference in the discussion here, and for the sake of simplicity they were not included in the token definition.
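The effect of a letters-only token definition can be illustrated with a small sketch. The regular expression below is only an approximation of AntConc’s “Letter Token Classes”, and the function names are hypothetical; the numeric values in the examples are arbitrary.

```python
import re

# Approximation of a letters-only token definition: a token is a maximal run of
# Unicode letters; digits, punctuation and symbols act as separators.
LETTER_TOKEN = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Split text into letter-only tokens."""
    return LETTER_TOKEN.findall(text)

def trigrams(tokens):
    """Return all sequences of three adjacent tokens."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(tokenize("wavelength of 532nm"))
print(tokenize("the show\u2019s result"))
print(trigrams(tokenize("wavelength of 633nm")))
```

Because the number is dropped, phrases that differ only in the numeric value fall into the same trigram, and a possessive form such as show’s is split so that a search for the token show still finds it.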

Figure 4.3 — Token definition

As mentioned above, the corpora were encoded in Unicode UTF-8, so AntConc was adjusted accordingly. In the “Tool Preferences” for concordance, AntConc can be adjusted regarding how the KWIC is displayed; Figure 4.4 below shows the relevant settings in this study.


Figure 4.4 — Concordance preferences

The search process

For each item in discussion, all of the corresponding search terms (see above) were inputted into the “Advanced Search” dialogue in AntConc20. The relevant settings are shown in Figure 4.5 below.

20 Alternatively, this can also be done using a lemma file containing the above list of searched terms. For the issues in discussion, inputting the items manually in the search dialogue makes no difference.


Figure 4.5 — Search items

It is important to note that the options for “use search term(s) from list below” and “Words” were both ticked. The former ensures that the searched items would be a list of terms rather than a single one, while the latter is perhaps more meaningful in terms of the search accuracy. The above has mentioned excluding “Punctuation Token Classes” in the definition to cover such word forms as show’s, shows’, and show-. As an alternative, this can also be achieved by unchecking the option for “Words” in Figure 4.5. An example is shown in the following Figure 4.6, where the search yields any sequence of characters beginning with show. As mentioned above, a side-effect of this is that words like showcase, showdown, etc. might also be included in the search results, which would substantially reduce the accuracy of the concordance search. Therefore, for the sake of accuracy the method used here confines the search term to “words”, while excluding punctuation in the token definition.
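The difference between the two options can be mimicked with regular expressions. This is a sketch under the assumption that unchecking “Words” behaves like a prefix match and ticking it behaves like a whole-word match; the sample sentence is invented for illustration.

```python
import re

text = "We show that the showcase device shows gain; the show's result is shown."

# Prefix-style search: any run of word characters beginning with "show".
prefix_hits = re.findall(r"\bshow\w*", text)

# Whole-word search over the listed forms only; the apostrophe counts as a boundary,
# so the possessive "show's" is still found as "show".
word_hits = re.findall(r"\b(?:show|shows|showed|shown|showing|showings)\b", text)

print(prefix_hits)
print(word_hits)
```

The prefix search picks up showcase as an unwanted hit, whereas the whole-word search keeps the possessive form while excluding it, which matches the rationale given above.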


Figure 4.6 — Example: show

The concordancing result was then sorted according to the search items and their adjacent words, so that how each item is used in context can be illustrated. Based on the KWIC, specific categories of the items in question were recorded for further discussion. An example of the word “show” is demonstrated in the following Figure 4.7 and Figure 4.8.


Figure 4.7 — Example: Sorted concordance lines for “show”


Figure 4.8 — Example: Sorted concordance lines for “showed”

As shown above, by sorting the concordancing results the specific usage of each item regarding all its inflected forms can be illustrated. The number of the instances where the item is used as a verb, noun, adjective, or any other grammatical category is then recorded, and this process is conducted for all three journals, resulting in a set of records regarding Applied Optics, Optics Express, and Optics Letters.
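The idea of sorting concordance lines by the search item and its adjacent words can be sketched as follows. The sort order used here (keyword, then first word to the right, then first word to the left) is an assumption for illustration and does not reproduce the AntConc settings of the study; the sample text is invented.

```python
import re

def kwic(text, forms, width=3):
    """Return (left context, keyword, right context) for each whole-word hit."""
    tokens = re.findall(r"[^\W\d_]+", text)
    wanted = {form.lower() for form in forms}
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() in wanted:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines

def sort_lines(lines):
    """Sort by keyword, then by the first word to the right, then by the first word to the left."""
    def key(line):
        left, keyword, right = line
        return (
            keyword.lower(),
            right[0].lower() if right else "",
            left[-1].lower() if left else "",
        )
    return sorted(lines, key=key)

sample = "Results show that gain increases. We showed a new design. Experiments show good agreement."
sorted_lines = sort_lines(kwic(sample, ["show", "showed"]))
for left, keyword, right in sorted_lines:
    print(f"{' '.join(left):>30} | {keyword} | {' '.join(right)}")
```

Once the lines are grouped in this way, instances with the same right-hand neighbour sit together, which makes manual recording of the grammatical category of each instance considerably quicker.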

POS labels

The way these instances should be recorded is largely dependent on how the POS labels are defined21. Here, it is important to note that in this thesis, nouns which function as pre-head modifiers of other nouns are recorded as nouns rather than adjectives, though some MT systems prefer to consider them otherwise in their grammatical rules. Therefore, in this case study such items as “light” in “light absorber” or “laser” in “laser beam” are all labelled as “noun”. For inflected forms of verbs, a certain degree of semantics needs to be considered when assigning them their categories. For example, such noun phrases as “combining frequency”22, “combining efficiency”, and “combining schemes” are all considered in this study as noun-noun compounds, while in similar instances like “leading principal minors”23, the “-ing” word is instead an adjective. The former phrases are in essence semantically different from the latter, referring to “the frequency/efficiency/scheme of the combining process” rather than “the frequency/efficiency/scheme which combines”. It is also important to note that the above examples like “combining frequency” are sharply different from the word used in “a combiner for combining light waves of 635, 532, and 488 nm”24, where “combining light waves” is in fact neither a noun-noun compound nor an adjective-noun-noun phrase, but rather a verb phrase. Instances of “combining” where the word is used as a participle or gerund are labelled as verbs, e.g. “By combining optical design software with optical force simulation tools, a highly efficient optical system was developed”25. Gerunds are also distinguished from pure nouns in this study, as in “spectral beam combining”26 or “image hiding”27, where the item is considered a noun.

21 The description here also applies to the labelling of categories in Chapter 3.

22 An example of this instance can be found in the following sentence: The method is demonstrated by combining two high-power ytterbium fiber lasers with high efficiency from low power to full combined power of 300 W (1.5 kW effective power), while maintaining peak combining efficiency within 0.5%. Drachenberg, D. R., Andrusyak, O., Venus, G., Smirnov, V., & Glebov, L. B. (2014). Thermal tuning of volume Bragg gratings for spectral beam combining of high-power fiber lasers. Applied Optics, 53(6), 1242-1246.

23 As in the following sentence: A general explicit algebraic characterization of Mueller matrices is presented in terms of the non-negativity of a set of leading principal minors of the coherency matrix CA associated with the arrow form MA of a given Mueller matrix M. Gil, J. J., & José, I. S. (2014). Explicit algebraic characterization of Mueller matrices. Optics Letters, 39(13), 4041-4044.

24 As in the following sentence: As an example, a combiner for combining light waves of 635, 532, and 488 nm, which are commonly used as the three primary colors in laser display systems, is designed and demonstrated through the finite-difference time-domain method. Liu, D., Sun, Y., & Ouyang, Z. (2014). Three-visible-light wave combiner based on photonic crystal waveguides. Applied Optics, 53(21), 4791-4794.

25 Kampmann, R., Chall, A. K., Kleindienst, R., & Sinzinger, S. (2014). Optical system for trapping particles in air. Applied Optics, 53(4), 777-784.

26 For example, as in:


Such considerations also apply to the past-tense inflected forms: in such phrases as “partially coherent combined beams”, the word “combined” is labelled as an adjective, while the same item is considered a (passive-voice) verb in sentences like “The demonstrated method can be potentially combined with the coordinate transformation technique in transformation optics for the fabrication of graded photonic devices”28. Similarly, in such phrases as “applied optics” the word “applied” is considered an adjective, while in “can be applied to”29 it is labelled as a passive-voice verb.

In addition, many of the instances in the corpora, including most of those illustrated above, involve some domain-specific knowledge of optics or of the specific experiments and equipment. As another typical example, in the phrase “second harmonic generation” the item “harmonic” is actually a noun rather than an adjective, as “second harmonic” is a term in optics denoting a component frequency of the light wave. Similarly, whether the word “lead” is referring to the metal material and the chemical component (as in “lead zirconate titanate”), a homograph as in “lead angle” and “electrical lead”, or a verb connecting two nouns should also be determined with background knowledge in optics. In short, the search results here are recorded not merely on the basis of the surface form of the concordance lines, but through specific, detailed and one-by-one examination of the context in which the item is used. In deciding the POS labels for each of the items, semantics and extra-linguistic issues are also considered.

26 (continued) Drachenberg, D. R., Andrusyak, O., Venus, G., Smirnov, V., & Glebov, L. B. (2014). Thermal tuning of volume Bragg gratings for spectral beam combining of high-power fiber lasers. Applied Optics, 53(6), 1242-1246.

27 As in: The proposed method might also be used for other potential applications, such as three-dimensional information encryption and image hiding. Gao, Q., Wang, Y., Li, T., & Shi, Y. (2014). Optical encryption of unlimited-size images based on ptychographic scanning digital holography. Applied Optics, 53(21), 4700-4707.

28 Lutkenhaus, J., George, D., Arigong, B., Zhang, H., Philipose, U., & Lin, Y. (2014). Holographic fabrication of functionally graded photonic lattices through spatially specified phase patterns. Applied Optics, 53(12), 2548-2555.

29 One instance of this, as an example, is in the following sentence: In this paper, we propose a new DRPE implementation for incoherent optical systems based on integral photography that can be applied to “encrypted imaging (EI)” to optically encrypt an image before it is captured by an image sensor. Nakano, K., Takeda, M., Suzuki, H., & Yamaguchi, M. (2014). Encrypted imaging based on algebraic implementation of double random phase encoding. Applied Optics, 53(14), 2956-2963.

It is also worth mentioning that the way this section labels the categories of the items in question might not be consistent with some MT systems or grammatical rules, though other ways of labelling may make it more convenient for the system to parse and synthesize sentences. Since the main purpose here is to discuss the genuine issues rather than to improve the system’s quality, it seems more reasonable to label the items in terms of their genuine usage rather than in a manner that makes it easier for MT.


4.1.2 Results and Discussion

The results of the concordance search are shown in Table 4.2 and Table 4.3 below. The items in Table 4.2 are those which function as fixed categories in the entire texts, and the statistics show the relevant numbers for their occurrence in each of the three corpora.

Word       Category     AP      OE      OL      Total
can        verb         927     1866    831     3624
light      noun         637     1492    633     2762
show       verb         407     941     417     1765
produce    verb         113     178     108     399
offer      verb         27      99      44      170
play       verb         24      50      14      88
static     adjective    20      28      14      62

Table 4.2 — Search results in entire corpus: fixed usage (Total occurrences: 8870)

Table 4.3 illustrates the items that do occur with multiple grammatical functions, together with their occurrences as corresponding categories in each corpus30. For each of these items, Table 4.3 also shows the percentage of its occurrence as each category in relation to the item’s total occurrence in the corresponding corpus. The statistics in bold refer to the circumstances where the items are used in the corpora as the same category as how it was added in the glossary (see Section 3.2 and Section 3.3 above).

30 Note that the percentages are calculated with respect to the individual corpus rather than the three corpora combined. For example, in the first row, the percentage in the column “Verb” for “present” in AO is calculated as 420/445 = 94%. On this basis, it can also be interpreted as the conditional probability that “present” functions as a verb, given that the item occurs in AO, i.e. P(verb | “present”, AO).
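The calculation described in this note can be written out as a short sketch, using the counts for “present” in AO from the first row of Table 4.3. The variable names are illustrative only.

```python
# Counts for "present" in AO, taken from the first row of Table 4.3.
counts = {"verb": 420, "noun": 0, "adjective": 25}

subtotal = sum(counts.values())

# Share of each category within this corpus, i.e. an estimate of
# P(category | "present", AO) based on relative frequency.
shares = {category: n / subtotal for category, n in counts.items()}
percentages = {category: round(100 * share) for category, share in shares.items()}

print(subtotal, percentages)
```

The same calculation, applied row by row, yields the percentage columns of Table 4.3; the “Total” rows pool the counts of the three corpora before dividing.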


Word        Corpus   Verb            Noun            Adjective       Subtotal
                     #       %       #       %       #       %       #
present     AO       420     94%     0       0%      25      6%      445
            OL       332     93%     0       0%      26      7%      358
            OE       709     93%     0       0%      51      7%      760
            Total    1461    93%     0       0%      102     7%      1563
apply       AO       183     88%     0       0%      24      12%     207
            OL       111     93%     0       0%      9       8%      120
            OE       295     88%     0       0%      42      12%     337
            Total    589     89%     0       0%      75      11%     664
combine     AO       86      65%     15      11%     31      23%     132
            OL       100     86%     10      9%      6       5%      116
            OE       191     82%     28      12%     14      6%      233
            Total    377     78%     53      11%     51      11%     481
lead        AO       48      84%     6       11%     3       5%      57
            OL       78      95%     3       4%      1       1%      82
            OE       162     96%     4       2%      2       1%      168
            Total    288     94%     13      4%      6       2%      307
influence   AO       14      13%     93      87%     0       0%      107
            OL       7       15%     39      85%     0       0%      46
            OE       27      18%     119     82%     0       0%      146
            Total    48      16%     251     84%     0       0%      299
search      AO       2       13%     14      88%     0       0%      16
            OL       1       20%     4       80%     0       0%      5
            OE       6       26%     17      74%     0       0%      23
            Total    9       20%     35      80%     0       0%      44
hide        AO       4       40%     3       30%     3       30%     10
            OL       4       57%     0       0%      3       43%     7
            OE       9       36%     9       36%     7       28%     25
            Total    17      40%     12      29%     13      31%     42
draw        AO       3       100%    0       0%      0       0%      3
            OL       3       43%     4       57%     0       0%      7
            OE       8       53%     7       47%     0       0%      15
            Total    14      56%     11      44%     0       0%      25

Table 4.3 — Search results in entire corpus: flexible usage (Total occurrences: 3425)

These data show some very important aspects of the items in question and, more importantly, of the MT customization which has been conducted on the basis of the sample of selected abstracts (see above).

1). The vast majority of the items’ occurrences in the corpora function as very restricted categories.

153

As shown above, the total number of occurrences for the items in Table 4.2 is 8,870, compared with 3,425 in Table 4.3. The contrast is sharp. As shown in Chart 4.1 below, as many as 72% of the occurrences of the items in question belong to items that are restricted to a single category. This means that the glossary modification above, which has restricted the items’ categories, is relatively unlikely to be questionable in view of the entire corpora. If the statistics in Table 4.3 are considered in more detail, the proportion of genuinely questionable instances becomes nearly negligible. This will be further illustrated below.

On the other hand, the highly frequent occurrence of the items in Table 4.2 indicates that modifying only a very few items can be considerably efficient for the corpus as a whole. Since these items are very repetitive, it seems reasonable to assume that once they are added to the user-defined glossary, the kinds of improvement illustrated in Section 3.4 should also appear in the translation of texts other than the selected sample.

The above has also illustrated that the highly frequent items in Table 4.2 are very ambiguous for MT. Their considerable ambiguity, together with their high frequency and their apparent restrictedness in the corpora, suggests that the glossary modification concerning those items is not only reasonable, but also very effective and desirable. This seems to provide adequate justification for the above customization with regard to texts beyond the selected sample.


Chart 4.1: Contrast between Table 4.2 and Table 4.3

2). Among the items that do occur with multiple categories, flexibility in usage is still significantly restricted.

As can be seen in Table 4.3, although these items are not strictly fixed in terms of categories, they do exhibit a significant inclination towards certain parts of speech. For many of them, certain categories do not occur at all, as indicated by the many “0%” instances in Table 4.3. For example, “present” functioning as a noun is completely absent from the corpora, although this is in fact a very common usage in general and one where MT has often encountered difficulties in disambiguation. The initial run in SYSTRAN already provides evidence of such a problem (see the first sentence below, a translation of “This paper presents…”):

Figure 4.9 –– Example of “present”


where, not very surprisingly, the item “present” in the ST was wrongly recognized in terms of both category and polysemy. The fact that this item never occurs as a noun in the corpora, as shown by Table 4.3, indicates that the error can be avoided altogether by restricting the categorial information for this item in the glossary, which is exactly what has been illustrated above. It also indicates that ruling out its noun usage is reasonable, as the word does not function as a noun at all in the texts concerned. The same is true for many other items, including “apply” as a noun, and “influence”, “search” and “draw” as adjectives [31] (see Table 4.3).

Other than the absence of certain categories for these items, it can also be seen from Table 4.3 that most of the figures in the “percentage” columns are either very high (some above 95%) or extremely low (even 1% or 2%). Except for “hide” and “draw”, whose occurrences are both negligibly few, all items show a very apparent inclination towards one category. Adding up the occurrences where the items function as their preferred categories (3,015 in total) reveals that 88% of the 3,425 instances can in fact be considered restricted (see Chart 4.2 below). This means that although these items do occur in the corpora as multiple categories, they are in fact almost confined to one of the categories concerned.
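The inclination described above can be checked with a short script. The following is a minimal sketch in Python and is not part of the original study: it copies the “Total” rows of Table 4.3, and the names TABLE_4_3_TOTALS and category_profile are chosen here purely for illustration. For each item it reports the preferred category, that category’s share of the item’s occurrences, and any categories that never occur.

```python
# Total occurrences per category, copied from the "Total" rows of Table 4.3.
TABLE_4_3_TOTALS = {
    "present":   {"verb": 1461, "noun": 0,   "adjective": 102},
    "apply":     {"verb": 589,  "noun": 0,   "adjective": 75},
    "combine":   {"verb": 377,  "noun": 53,  "adjective": 51},
    "lead":      {"verb": 288,  "noun": 13,  "adjective": 6},
    "influence": {"verb": 48,   "noun": 251, "adjective": 0},
    "search":    {"verb": 9,    "noun": 35,  "adjective": 0},
    "hide":      {"verb": 17,   "noun": 12,  "adjective": 13},
    "draw":      {"verb": 14,   "noun": 11,  "adjective": 0},
}


def category_profile(counts):
    """Return (preferred category, its share, categories that never occur)."""
    total = sum(counts.values())
    preferred = max(counts, key=counts.get)  # most frequent category
    absent = [cat for cat, n in counts.items() if n == 0]
    return preferred, counts[preferred] / total, absent


for word, counts in TABLE_4_3_TOTALS.items():
    preferred, share, absent = category_profile(counts)
    print(f"{word:<10} {preferred:<9} {share:6.1%}  absent: {', '.join(absent) or '-'}")
```

On these figures, only “hide” and “draw” have a preferred category that accounts for less than three quarters of their occurrences, which matches the exceptions named in the text.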

Chart 4.2: Data in Table 4.3

31. Note that the items here refer to all inflected forms (see above).

3). It is 97% likely for the categorial restriction to be effective.

It is also worth mentioning that the 88% in Chart 4.2 concerns the items in Table 4.3 alone. If the complete list of items in the two tables is considered, the overall proportion of instances which can be considered restricted is in fact much higher. As can be seen from the statistics, the total number of occurrences for the two tables is 12,295, the number of occurrences for the strictly restricted items (i.e. Table 4.2) is 8,870, and the number of occurrences in Table 4.3 which function as the items’ preferred category is 3,015. Taken together, (8,870 + 3,015) / 12,295 raises the above 88% to 97% for the instances where the items can be considered restricted (see Chart 4.3 below). This means that, overall, an occurrence of any of these items has a 97% chance of belonging to the item’s preferred category in the corpus. It also means that if an item in the list is restricted to the corresponding category in the glossary, there can be 97% confidence that the restriction is consistent with how the item is used in the ST.
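The proportions cited in findings 1) to 3) follow from three totals only. As an illustrative sketch (the variable names are assumptions introduced here, and the figure 3,015 is taken as reported rather than recomputed), the arithmetic can be reproduced as follows:

```python
# Totals as reported in the text and in the captions of Tables 4.2 and 4.3.
FIXED_TOTAL = 8870          # Table 4.2: items with one fixed category
FLEXIBLE_TOTAL = 3425       # Table 4.3: items with several categories
FLEXIBLE_PREFERRED = 3015   # Table 4.3 occurrences in the preferred category (as reported)

all_items = FIXED_TOTAL + FLEXIBLE_TOTAL

# Finding 1: share of occurrences belonging to strictly fixed items.
share_fixed = FIXED_TOTAL / all_items

# Finding 2: share of Table 4.3 occurrences in the preferred category.
share_preferred = FLEXIBLE_PREFERRED / FLEXIBLE_TOTAL

# Finding 3: share of all occurrences that can be considered restricted.
share_restricted = (FIXED_TOTAL + FLEXIBLE_PREFERRED) / all_items

print(f"all items:        {all_items}")
print(f"fixed:            {share_fixed:.0%}")
print(f"preferred (4.3):  {share_preferred:.0%}")
print(f"restricted:       {share_restricted:.0%}")
```

Rounded to whole percentages, the three shares come out as 72%, 88% and 97%, in line with Charts 4.1 to 4.3.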

Chart 4.3: Data in view of the complete list

This aspect is perhaps another justification for the rationale of the above glossary modification.

4). The grammatical functions of these items in the corpora are almost entirely consistent with how they were added to the user-defined glossary on the basis of the sample.


A review of the lists in Section 3.3 above readily reveals that, for Table 4.2, the categories as which the items function in the entire corpus are all consistent with how they were added to the user-defined glossary. For all the recurrent items in Table 4.3, the categories to which they are inclined are also consistent with the ones added to the glossary. These include “present”, “apply”, “combine”, and “lead”, whose preferred categories all match the entries in Section 3.3 above. The percentage statistics in bold for these items are all very high (see Table 4.3), and the “Subtotal” column shows that they are relatively recurrent among the items in this table.

The only items whose categories are inconsistent, or not apparently consistent, with the ones discussed above are “influence”, “search”, “hide”, and “draw”, as shown in Table 4.3. Rather than directly disproving the validity of the glossary modification above, it is important to note that, except for “influence”, these items are extremely rare in the entire corpora. Out of the total of 822,230 tokens in the corpora, they account for merely 44, 42, and 25 respectively (see Table 4.3). They can therefore be considered negligible, or sources of noise, if the issue is investigated in view of the entire corpora of abstracts. With regard to “influence”, this is perhaps one instance of error in the modified glossary, resulting from the inconsistency between the word usage in the sample and that in the corpora. The error occurred because the selected sample happens to be among the minority concerning this item.

5). The probability for any item in the above user-defined glossary to be erroneous, as compared to how it is actually used in the ST, is 5.06%.

The low rate here seems to answer the question of whether the glossary is dependable. The fact that the glossary is about 95% likely to be accurate is perhaps adequate to verify its dependability.
The way this statistic is calculated is as follows. If there is an error in the glossary modification, it would have to concern the items in Table 4.3, since the categories for all items in Table 4.2 are consistent with the glossary entries. Here, the item “present” can be used for illustration. As laid out in Table 4.3, this item can be a noun, an adjective, or a verb, while in the user-defined dictionary that has been compiled previously in this case study, it is considered a verb. Therefore, the scenario in which this item is categorially inconsistent between the glossary entry and the actual usage in the corpora is when it happens to be either a noun or an adjective in the ST input. This probability, denoted as P(present’), can be calculated via

$P(\text{present}') = P(\text{present}_{n}) + P(\text{present}_{adj})$,

where present_n and present_adj refer to the circumstances where the word functions as a noun and as an adjective respectively. Considering the entire input text, the probability for an error to occur on “present” is the probability of this item being categorially inconsistent between the glossary entry and the corpora, under the condition that this item occurs in the input ST. Hence, the probability of error regarding this item is

$P(e_{\text{present}}) = \dfrac{N(\text{present}_{n}) + N(\text{present}_{adj})}{N(\text{All items})}$,

where N(·) denotes a number of occurrences and “All items” refers to the total occurrences of Table 4.2 and Table 4.3 combined, i.e. 12,295. In the same manner, the probability of error for the other items can also be calculated. On this basis, the probability for any item in the user-defined glossary to be inconsistent with its actual usage in the corpora can be calculated via

$P(e) = \sum_{i} P(e_{i})$,

which can be used as a predicted error rate for the glossary. According to the statistics in Table 4.3, the P(e) of the customized glossary in this case study is 5.06%. This means that when the glossary is used for translating any part of the corpora beyond the sample, one can expect about 95% accuracy regarding these items’ categorial disambiguation.

6). The results for AO are consistent with those for OE, OL, and the three corpora as a whole.

As can be seen from the two tables, for each item the results are consistent among the three journals, with only one negligible exception: “draw”. Since its occurrence is negligible in view of the total number of tokens, the exception here can be considered mere noise. Other than this, all statistics show considerable conformity among the three corpora. This means that what is found in this thesis on the basis of Applied Optics is representative of other texts of the same kind. Given such conformity, it seems reasonable to assume that the customization conducted above is replicable.
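The predicted error rate in finding 5) can be approximated with the short sketch below. It assumes that all eight items in Table 4.3 were entered in the glossary as verbs; the text states this explicitly only for “present”, and the assumption is adopted here because it reproduces the reported 5.06%. The counts are the “Total” rows of Table 4.3, and all names in the code are hypothetical.

```python
# "Total" rows of Table 4.3: occurrences of each item per category.
TABLE_4_3_TOTALS = {
    "present":   {"verb": 1461, "noun": 0,   "adjective": 102},
    "apply":     {"verb": 589,  "noun": 0,   "adjective": 75},
    "combine":   {"verb": 377,  "noun": 53,  "adjective": 51},
    "lead":      {"verb": 288,  "noun": 13,  "adjective": 6},
    "influence": {"verb": 48,   "noun": 251, "adjective": 0},
    "search":    {"verb": 9,    "noun": 35,  "adjective": 0},
    "hide":      {"verb": 17,   "noun": 12,  "adjective": 13},
    "draw":      {"verb": 14,   "noun": 11,  "adjective": 0},
}

# ASSUMPTION: every item above is restricted to "verb" in the glossary.
GLOSSARY_CATEGORY = {word: "verb" for word in TABLE_4_3_TOTALS}

# Total occurrences of Table 4.2 and Table 4.3 combined ("All items").
ALL_ITEMS = 8870 + 3425


def item_error(word):
    """Share of all occurrences where `word` is used in a category
    other than the one recorded for it in the glossary."""
    counts = TABLE_4_3_TOTALS[word]
    mismatched = sum(n for cat, n in counts.items()
                     if cat != GLOSSARY_CATEGORY[word])
    return mismatched / ALL_ITEMS


# Items in Table 4.2 contribute no error, so only Table 4.3 is summed.
p_e = sum(item_error(word) for word in TABLE_4_3_TOTALS)
print(f"P(e) = {p_e:.2%}")
```

Under this assumption, “influence” alone contributes 251 of the 622 mismatched occurrences, which is consistent with the text’s remark that this item is the one genuine error in the modified glossary.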


4.2 Automatic Customization

The previous section has illustrated that the items added to SYSTRAN’s “User Dictionary” (UD) in Section 3.3 are not only effective for the sample, but also largely representative of the three corpora as a whole. This section aims to investigate whether these items can be automatically detected by SYSTRAN before translating the given sample.

4.2.1 Method

As mentioned above in Section 3.1.2, SYSTRAN supports automatic statistical processing of language resources, which results in either dictionaries or target language models, to fine-tune its translation. The “Wizard User Dictionaries” (WUD) are applied by SYSTRAN’s translation process in the same manner as “User Dictionaries” (UD), i.e. the glossary items that were added to the system in Section 3.3. As also described above, one of the WUDs contains specific “Source Category” items for disambiguation, while the other contains all other types of entries, including “Do Not Translate” and “normal” items.

In the following investigation, the entire corpus of Applied Optics is used as the source for SYSTRAN’s automatic customization via the “Customization Wizard” tool. The corpus is in the same format as in 4.1.1 (see above), and all the generated resources are saved for later analysis. The items in Section 4.1.1 are then searched for in the generated WUDs, particularly in the WUD for “Source Category”, and the categories assigned to them are compared with the concordance search results in 4.1.2. The items in List C.2 are searched for in the same way. If the searched items are present in the generated WUD, the WUD can be relied upon, with some further lexical modification, for avoiding the lexical errors discussed above. This would in turn result in the translation improvement shown in 3.4.

For a fuller discussion, all three corpora are also put through the same process, to test whether a larger amount of linguistic data leads to an increase in the number of items extracted for the disambiguation errors discussed above.


Meanwhile, if this process has resulted in a satisfactory extraction of the items in question while associating them with accurate categories or other information, it would mean that the lexical customization illustrated in the previous sections can perhaps be completed in an entirely automatic manner, a much more meaningful and effective method for the improvement of MT output.

4.2.2 Results and Discussion

For Applied Optics, the process has generated 29,954 Language Model entries, 20,597 WUD entries and 583 WUD Source Category entries (denoted below as WUD_SC). This is shown in the following figure:

Figure 4.10: Resources generated from Applied Optics

It is important to note that for this case study, only “English to Chinese” resources are relevant. Since the training data is monolingual, there are no Language Model entries for English to Chinese translation. The entries in the WUD include Not Found Words (NFWs) and multiword terms associated with the corresponding headwords and default translations. The WUD_SC contains words and multiword items associated with category information.

Not Found Words and Categorial Ambiguity

For NFWs, all of the items discussed in 3.2.1 are present in the WUD. This means that all the translation problems concerning these items can be avoided by editing the WUD.

For the items in List C.1, only 5 of the 15 items are present. These items are shown in Appendix E together with the relevant information stored in WUD_SC. They are: apply, show, light, produce, present. Among them, “applied” is extracted not as a single-word item but as part of multiword terms, namely “applied voltage” and “Applied Optics” (see Appendix E). This clearly does not cover all the ways the word is used in the corpus, but it does provide some hint on how the item should be restricted in the glossary. The other four items, however, are extracted properly, with their categories consistent with the way they are used in the corpora.

The majority of the items in question, however, are not present in WUD_SC. They are: can, hide, offer, combine, influence, draw, play, search, lead, static. These items would therefore remain problematic if WUD_SC alone were used, resulting in the same translation errors as discussed above (see 3.3 and 3.4). In other words, the WUD_SC is helpful for merely 5 of the 15 items in List C.1, and most of the categorial disambiguation errors discovered in the sample would not be avoided. The previous sections have illustrated how these errors affect SYSTRAN’s translation in many respects; the fact that they would not be avoided means that most of the improvements discussed in 3.4 would not be achieved without further glossary modification. It is also important to note that since many of these items are highly recurrent in the corpus, such errors are bound to be prominent when the system is put to actual use. Therefore, this result appears rather unsatisfactory.

Enlarging the data size does not seem very helpful either. When the three corpora are used as a whole, the system generates 65,397 WUD entries and 3,221 WUD_SC entries, as shown below.

Figure 4.11: Resources generated from all corpora

The details of the items which are included in this WUD_SC are shown in Appendix F. They are: combine, lead, apply, light, present. Compared with the five items recognized in the previous process, “combine” and “lead” are new. However, “show” and “produce”, which were initially recognized, are now missing. This is important because “show” is highly repetitive in the corpus of Applied Optics. Therefore, this result is not considered much of an improvement on the previous one.

Similarly, most of the items searched are not present in WUD_SC, including: can, show, hide, offer, influence, draw, play, search, static, produce. This is not much different from the unsatisfactory results of the previous process. What seems particularly troublesome is the highly repetitive “can”. As shown in the initial translation output, this item causes a number of problems regarding syntax and the overall meaning of the TT. The fact that the item is highly frequent in the corpus means that such errors would be prominent if not dealt with properly. On the other hand, the previous sections have also illustrated how restricted the usage of this item actually is in the entire corpus (see 4.1.1); the failure to include such items in the WUD_SC is therefore largely disappointing.
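The comparison between the two extraction runs amounts to simple set operations on the item lists given above. The sketch below is illustrative only: the sets are typed in by hand from the lists printed in this section rather than read from SYSTRAN’s dictionary files, whose format is not described here, and all names are hypothetical.

```python
# The 15 items of List C.1, as they appear in Tables 4.2 and 4.3.
LIST_C1 = {
    "can", "light", "show", "produce", "offer", "play", "static",
    "present", "apply", "combine", "lead", "influence", "search",
    "hide", "draw",
}

# Items reported as present in WUD_SC for each extraction run.
WUD_SC_AO = {"apply", "show", "light", "produce", "present"}    # Applied Optics only
WUD_SC_ALL = {"combine", "lead", "apply", "light", "present"}   # all three corpora


def coverage(found, wanted):
    """Fraction of the wanted items that the extracted dictionary contains."""
    return len(found & wanted) / len(wanted)


gained = sorted(WUD_SC_ALL - WUD_SC_AO)    # items only in the larger run
lost = sorted(WUD_SC_AO - WUD_SC_ALL)      # items only in the smaller run
still_missing = sorted(LIST_C1 - WUD_SC_ALL)

print(f"coverage (AO):  {coverage(WUD_SC_AO, LIST_C1):.0%}")
print(f"coverage (all): {coverage(WUD_SC_ALL, LIST_C1):.0%}")
print("gained:", gained)
print("lost:", lost)
print("still missing:", still_missing)
```

Run on the lists as printed, both runs cover 5 of the 15 items; the larger run gains “combine” and “lead” and loses “produce” and “show”, so the enlarged data changes which items are found without increasing their number.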


On the other hand, in the WUD from Applied Optics, these items are mostly present. Appendix G lists the items which are included in the WUD as headwords. They are: show, offer, combine, influence, play, search, lead, light, present. However, a more detailed look at these items in Appendix G reveals that the results are in fact very problematic and could easily mislead SYSTRAN into disambiguating them wrongly. Some items are assigned completely wrong categories in all the expressions: show, offer, play, lead, present.

Three other items are also worthy of discussion: “combine”, “influence” and “search”. The item “combine” in Appendix G is only assigned the noun category. While this is not wrong for the expressions listed in this WUD, Table 4.3 in 4.1.2 readily reveals that this usage is in the minority in the texts concerned. A comparison with Appendix C shows that the category is not consistent with how the item is used in the sample either. Therefore, although the extracted expressions might be helpful to some extent, the disambiguation error observed earlier in this case study regarding “combine” would not be avoided, and the associated translation problems would remain.

The results for “influence” and “search” can be considered favorable, because the WUD has avoided the potential problem associated with the glossary modification above. As shown in 4.1.2, the way these two items are used in the sample is not consistent with their preferred use in the entire corpus of Applied Optics. Therefore, the categories assigned in the UD (see Appendix C) might result in errors when the system is translating other abstracts from the same journal. In Appendix G, however, their categories are consistent with the preferred one. This means that the process of extracting the WUD helps to cancel out the side effect of the categorial restriction conducted above.
In summary, the WUD for source categories (WUD_SC) does not seem effective in correcting the disambiguation errors discussed in 3.3. Relying on it would in turn lead to many of the same translation problems found in the initial translation output in Appendix D, if not introduce more errors. Enlarging the size of the training data does not seem helpful either. The other WUD contains all the Not Found Words, many multiword expressions and most of the items to which SYSTRAN’s initial run failed to assign the appropriate categories. However, the associated categories here are mostly wrong.

Polysemous Items

In contrast, the WUD extracted from Applied Optics is very successful in covering the items in List C.2. Other than the items in this list which overlap with List C.1, the remaining ones are mostly terminologies, 78 in total. Among them, 73 are found in the WUD, including: aberration, active, adaptive, aliasing, amplitude, analogy, backward, behavior, cascade, cavity, characteristic, charge, convergence, conversion, correction, deformable, discrete, discuss, emission, emit, enhancement, error, expansion, factor, field, free, frequency doubling, fringe, generate, generation, geometry, harmonic, harvesting, impact, index, interferometric, intrinsic, investigate, laser, linewidth, mirror, modulation, object, parallel, passive, phase, photovoltaic, plane, prescription, rate, scaling, second, series, simulation, solid-state, solution, source, spectral, step, tilt, truncation, variable. These items are extracted together with their typical usage in multiword clusters; the WUD therefore seems considerably helpful for the modification of the terms as conducted above. However, since the WUD contains only SYSTRAN’s default translations for these terms, it would need significant editing to produce the same level of output improvement as shown in 3.4.

Overall Discussion

As can be seen from the above, the WUD_SC seems unsatisfactory in terms of category disambiguation, while the WUD is largely effective regarding NFWs and terms. The fact that category ambiguity significantly influences many other aspects of linguistic analysis in SYSTRAN (see 3.4 above) means that using the WUD_SC for categorial disambiguation is not dependable.

Nevertheless, the WUD and WUD_SC combined include almost all of the items involved in categorial disambiguation errors. This is perhaps adequate for facilitating further lexical customization, since the words to modify can be discovered from the extracted dictionaries before translating the sample, especially given that the items in List C.2 are largely covered in the WUD. For instance, when the system is to translate a group of abstracts from Applied Optics, the entire year’s abstracts of this journal can be used to extract the WUD and WUD_SC. The categories and translations of the extracted items can then be modified, optionally followed by a search for them in the entire corpus in the same manner as in Section 4.1. These steps would, at least, correct most of the lexical and syntactic disambiguation errors discussed in 3.2, 3.3 and 3.4, in turn resulting in many of the aspects of translation improvement shown in 3.3.


4.3 Summary

This chapter has elaborated on the lexical customization conducted in this case study. It has two main objectives: to analyze the representativeness of the findings obtained from the selected sample, and to investigate methods by which the items in the customized User Dictionary (UD) can be identified.

At the beginning of the chapter, the error analysis conducted in Chapter 3 is further elaborated, where it was found that the specific items for modification in the glossary are mostly 1). nouns functioning as discipline-specific terminologies, and 2). a relatively small number of general-language verbs that can be very ambiguous for the system. The former constitute the majority of the polysemy errors in SYSTRAN, and the latter are where the system has problems with categorial ambiguity. Although the items involving categorial ambiguity are small in number, they are considerably repetitive in the ST, and the consequences of failing to disambiguate them are severe for the overall translation. These items are therefore particularly important for discussion. It was also found that, among these items, noun-verb ambiguities make up the majority (70%) of the categorial ambiguity issues relevant to the ST.

In addition, since noun-functioning, discipline-bound terminologies, i.e. those in 1), are generally not ambiguous within a specific domain, this chapter finds it meaningful to focus on the items in 2), i.e. List C.1. This list overlaps with List C.2, i.e. the items in 1), and the overlapping items include almost all of those in List C.2 that are flexible in usage. This means that a discussion of the items in 2), i.e. List C.1, covers the crucial items in both lists.
These items were then searched for and investigated against a much larger text sample consisting of the abstracts in the entire 2014 volumes of three journals, namely Applied Optics, Optics Letters, and Optics Express, both individually and collectively. All inflected forms of these words were included in this process, resulting in the findings listed in 4.1.2. These findings are largely consistent with what was found in the sample, and also confirm that the way in which lexical customization was conducted in the previous chapter is dependable. Towards the end of this section, the predicted error rate for the modified User Dictionary when applied to these corpora is calculated, and the equation used for this calculation may also be applicable for predicting the error rate of this lexical customization method in general.

The second part of the chapter addresses the second objective. Since it is both effective and dependable to modify particular items in the UD, the way in which these items can be identified is a meaningful, practical, and indispensable issue to investigate. To approach this, the second part of the chapter tests the automatic extraction of UD items, which SYSTRAN calls WUD. The WUDs extracted by the system fall into two kinds: the Source Category WUD, which facilitates ST sentence parsing, and the normal WUD, which is relevant to the translation into the TT. These two kinds correspond to the way in which the lexical ambiguity issues are discussed in this thesis: categorial ambiguity and polysemy.

Results show that the WUD specifically for categorial ambiguity, i.e. WUD_SC, is not effective in identifying the ambiguous items, and is thus not dependable for resolving the disambiguation issues discussed in Chapter 3. This remains the case when the data for extraction is enlarged from the single corpus of Applied Optics to the collection of all three corpora, which further confirms the WUD_SC’s lack of dependability. The other WUD, however, contains most of the items with which SYSTRAN has problems concerning categorial ambiguity. When these two WUDs are combined, almost all items involving categorial ambiguity are included. All the Not Found Words, many multiword expressions, as well as almost all of the items discussed for issues of polysemy are included in the second WUD, even when the data for extraction is the corpus of Applied Optics alone. In the same manner, combining the two WUDs would cover all the items for polysemy.
On this basis, it seems that the automatic process of lexical customization in SYSTRAN would be sufficient for identifying the items that need to be modified in the system’s glossary, if the two WUDs are combined. In other words, the issue of what items to modify can be resolved with the support of SYSTRAN’s automatically extracted WUDs. This achieves the second objective of this chapter.

However, although this would help to identify the items for lexical customization, the information contained in the WUDs on these items’ categories and translated terms is mostly wrong or inappropriate. It seems that SYSTRAN is not able to properly customize the glossary on its own with monolingual data, either for categorial ambiguity or for polysemy. This means that the process of lexical customization conducted in this case study cannot be completely automatic. After the WUDs have been extracted, considerable manual editing of the entries is still needed for the system to achieve the kind of improvement illustrated in Chapter 3.

The investigations in this chapter also provide some guidelines for lexical customization regarding not only abstracts in optics, but potentially academic or domain-specific writing in general. The example given at the end of Section 4.2 is particularly meaningful for the method: when one is translating a certain number of abstracts in SYSTRAN, the entire volume of the journal can be used for extracting WUDs, which is potentially sufficient for including the items to modify in the glossary. The specific information contained in the WUDs for each item then needs to be manually edited. This process could in turn lead to substantial improvement in the translation output.


Chapter 5 Conclusions

5.1 Theoretical Perspectives

This thesis investigates the issue of translating scientific abstracts with MT. The significance of such an investigation relates not only to the need for such translation, but also to the shortcomings of many investigations by domain experts. Starting from a general discussion of the theoretical issues, it takes the functionalist definition of translation and seeks the aspects of the functionalist school in Translation Studies which can be applied to MT and to the case of abstract translation. Since this is largely an issue of information translation, relevant concepts regarding information are also investigated.

What can be seen from Chapter 2 is that the investigation of MT usability is controversial because different communities seem to approach the issue from different perspectives, all partial to some extent. Another outcome of this chapter is that while Translation Studies has marginalized the equivalence paradigm, partly because of the complications of equivalence, the area of MT seems to be lagging behind. In this regard, the thesis argues for more consideration of the target-side functionalist theories when investigating MT output.

The concepts of information, sublanguage, and customization are all important for the issues of the case study, indicating that while the investigations mentioned in Chapter 1 generally lead to negative results, a proper perspective is needed. This is meaningful because MT output is rarely investigated from a translation perspective. In addition, the emphasis on customization, together with the support of the results from the case study, seems to explain why, while machine-translating scientific abstracts should be feasible, many domain-specific investigations by the text recipients are surprisingly negative. It is argued here that without customization, or a proper use of MT, the outcome can be largely disappointing; but as shown in the case study, lexical customization results in significant improvement in the translated text. Therefore, when investigating MT output in the practical sense, it might not be fair for a system to be evaluated without some degree of customization.


5.2 Case Study

The case study aims to investigate customization and to support this argument. In the case study, seven abstracts from Applied Optics were selected via systematic sampling and then input into SYSTRAN. The resulting TT was discussed from the perspective of lexical ambiguity, i.e. grammatical categories and polysemy. Results showed that, on the one hand, the ST is considerably restricted in terms of lexical ambiguity, and on the other, SYSTRAN is capable of adequately disambiguating categories while having much difficulty regarding polysemy (Section 3.2). With regard to structural (i.e. syntactic) ambiguity, the system seemed even more problematic, with many problems intrinsically related to lexical issues (as shown in 3.4). At the same time, the results in 3.2 seemed to indicate that it is possible to avoid the vast majority of the disambiguation errors once the MT system is lexically customized.

A more detailed discussion of the lexical errors was then conducted before modifying the system glossary. In addition to what has been illustrated in 3.3, perhaps a more meaningful observation is that in terms of category ambiguity, verbs are particularly problematic for SYSTRAN, while the system is better at recognizing other categories. These verbs are general-language items (as discussed in 4.1), and the source of many of the translation problems (as shown in 3.4). This relates to the question posed in Chapter 1 regarding the MT investigations by, for example, Anazawa et al. (2012): whether the enrichment of terms alone is sufficient for MT improvement. It now seems that, while enriching the terms would certainly lead to some improvement, as shown in 3.4, a short list of general-language verbs is in many ways crucial as well.
This justifies the statement in 2.1.2 that the key to better quality should be enrichment of terms and grammatically ambiguous words, where in the case of SYSTRAN, it is terminologies and a smaller set of typical verbs. When the words in discussion were added into the user-defined dictionary in the proper manner, the customization resulted in significant improvement of the translation.
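The systematic sampling step mentioned above can be illustrated with a minimal sketch. It assumes the 1,328 abstracts are held in an ordered list and that every k-th abstract is taken from a fixed starting position; the function name `systematic_sample`, the starting position and the resulting interval are illustrative assumptions, not the exact procedure used in this thesis.

```python
def systematic_sample(items, sample_size, start=0):
    """Pick `sample_size` items at a fixed interval from an ordered list.

    Hypothetical helper: the interval is len(items) // sample_size and
    `start` is the (assumed) offset of the first item picked.
    """
    if sample_size <= 0 or sample_size > len(items):
        raise ValueError("sample_size must be between 1 and len(items)")
    interval = len(items) // sample_size
    if not 0 <= start < interval:
        raise ValueError("start must fall within the first interval")
    # Take one item from each consecutive interval of the list.
    return [items[start + i * interval] for i in range(sample_size)]


# Abstracts numbered 1..1328 (the size of the 2014 volume), sample of seven.
abstract_ids = list(range(1, 1329))
sample = systematic_sample(abstract_ids, 7)
print(sample)
```

With a different starting offset the same interval yields a different but equally spaced set of seven abstracts.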


The discussion on translation improvement begins with the perspective of syntax, because the above issues of ambiguity are directly related to parsing. The components include verbs (of different kinds), attachment, conjunction, ellipsis, noun strings, terminological items, and Not Found Words. These are typical problems for MT, especially in the case of academic texts, and all were significantly improved by the glossary modification. What is particularly useful here is that improvement in one place also leads to improvement in another. This is apparent both within sentences and at the textual level, and is perhaps a sound justification for the effectiveness of customization.

There are, as illustrated in Chapter 3, sentences which are not perfect linguistically. Most of these problems are related to the attachment of prepositional phrases and relative clauses. However, if examined from the functional perspective with a focus on information translation, many of these problems are in fact very minor. While these minor problems do not much affect the adequacy of the output, resolving them would make the improvement much more apparent.

Also worthy of mention is that Section 3.4 reveals some features of academic writing and the specific problems they pose for MT. Most of the problems are related to verbs and attachment issues. This is because academic writing uses a number of embedded structures that cause much syntactic ambiguity for the system. For other types of texts, it can be assumed, different kinds of problems would be prominent. In this case, verbs and attachment seem to be the most prominent issues, and perhaps more attention can be paid to these two aspects in the development of MT.

The study then extends the scope from the sample to the entire journal, and to similar kinds of journals. Since the sample-based modification has resulted in significant improvement of the output, are these issues representative of the whole? Is the modification truly dependable for the entire corpus? Are the results from Applied Optics reflected in other journals as well? To answer these questions, a concordance search of the terms discussed above was conducted. Results show that the entire corpus is indeed as restricted as the sample, and that apparent patterns can be found.

The above should indicate that the discussion is dependable. But another question remains: given a specific ST, how do we know what words to add for the lexical customization? The investigation here aims to test whether SYSTRAN can automatically complete this task. The entire ST corpus is used as the training data for the system to generate dictionary entries. Results show that although its accuracy is unsatisfactorily low in terms of entry information, the entries do include the items discussed in this case study. This means that before translating an ST, the automatic process in SYSTRAN is in principle feasible for discovering the items to modify in the glossary. This leads to the conclusion at the end of 4.2.2, which provides some insight as to how an MT system can be customized lexically given an ST to translate.

As mentioned at the end of Section 4.3, the customization method suggested by this thesis is in essence not confined to Applied Optics alone, but applies to abstract translation in general. For general users of MT who are not working in the field of optics, the suggested preliminary steps are also applicable for better translation output. Perhaps this would be an important contribution to the use of MT in practice.

Another aspect which users can learn from the work conducted in this thesis has to do with the theoretical issues outlined in the previous chapters. How well an MT system can translate the texts of a domain depends not only on the system itself, but also largely on how it is used. Through the case study, this thesis has demonstrated that proper customization, even at the level of lexis alone, can lead to substantial improvement in the translated output, in sharp contrast to the outcome from a system that lacks such customization. It is also suggested that when customizing the system, some manual effort is necessary and vital, and should be supported by sufficient background knowledge of the domain concerned. The importance of optics-specific knowledge in modifying the glossary entries is evident throughout this thesis, and the same can be assumed to be true for any other domain of specialized translation.

This in turn highlights the importance of customizability of an MT system, because it is the end-users who have the best knowledge of their domain and their texts. As mentioned, the justification for using SYSTRAN in this study is partly related to its customizability, although better-performing systems (e.g. Neural MT) have been emerging at a fast pace in recent years. Developers are advised to pay more attention to the extent of customizability of the system, as popular tools such as Google Translate do not always provide as easy an access as SYSTRAN for general, less computer-literate users to customize the software for their specific text features. For MT researchers, perhaps the way in which systems are evaluated should incorporate more of their customizability, and when the MT output is assessed, the extent of customization should be taken into account as an important variable. For users, more awareness of MT customization is needed. Since MT systems that are customized for specific domains or texts could potentially produce very good translation, the proper use of MT can significantly enhance access to information across languages.


5.3 Limitations and Future Work

While the study is adequate for drawing the above conclusions, this thesis is limited by a number of factors, leaving many issues for future work. First of all, the case study is confined to the scope of optics, with the underlying assumption that other subject domains are largely similar regarding the issues investigated. If more domains were incorporated for parallel comparison, the findings would be applicable to a broader scope, and subtle differences across domains could be gauged.

Due to the scale of this study, the corpus is limited to one year’s publication of the journals, though the amount of text is still very large. The customization is discussed on the basis of a sample consisting of seven abstracts, systematically selected for representativeness and analyzed in detail. The issues in discussion seem justified, but if the sample could be enlarged with more glossary entries, perhaps more findings would emerge.

It is also obvious that the customization in question is limited to the user-defined dictionary. If appropriate parallel corpora were available, other aspects could also be investigated, particularly regarding Translation Memories and the automatic process of extracting items by SYSTRAN. These aspects are important in the sense that manually building a customized glossary database is a hugely laborious task, and these processes might provide insight as to how the efficiency of lexical customization can be improved. In addition, SYSTRAN also allows customization with style sheets and language models. Whether these are equally or more effective than the customization conducted in this study can also be investigated.

In the process of the concordance search for the relevant items in Chapter 4, it was found that most of these items follow very apparent patterns: “results show that”, “can be used to”, “we present a (novel/new/simple) method”, etc. These are typical of research abstracts, and if sequences like these are also added to SYSTRAN’s glossary, the result might be further improved. Therefore, some discussion of lexical bundles, or formulaic language, on the basis of the corpora would be meaningful too.

Another aspect of future work has to do with one of the issues already mentioned: the remaining problems after MT customization. Some are considered minor because of the readers’ background knowledge in optics, and this assumption is worth testing with real readers of optics texts. Empirical investigations in this regard can provide support for the assumption. In addition to the lexical and terminological customization discussed in the case study, exploring the possibility of MT customization at deeper levels would be meaningful for future work as well. Meanwhile, software like SYSTRAN is constantly evolving, and by the time this case study is completed the SYSTRAN company might have developed new statistical methods for its Wizard User Dictionary extraction. Whether this would lead to different results in terms of accuracy is also worth investigating.
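The concordance search and the lexical-bundle counting suggested in this section could be sketched roughly as follows. This is only an illustration under simple assumptions: tokens are lower-cased alphanumeric strings, a bundle is any contiguous n-word sequence, and the three sentences in `texts` are invented examples rather than corpus data. The helper names `tokenize`, `concordance` and `lexical_bundles` are hypothetical and do not correspond to any tool used in the thesis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def concordance(texts, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines


def lexical_bundles(texts, n=3, min_count=2):
    """Count contiguous n-word sequences and keep the recurrent ones."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(bundle, c) for bundle, c in counts.most_common() if c >= min_count]


# Invented example sentences, for illustration only.
texts = [
    "Results show that the method is robust.",
    "Simulation results show that the error is small.",
    "The device can be used to measure tilt.",
]
print(concordance(texts, "show", width=2))
print(lexical_bundles(texts, n=3, min_count=2))
```

Recurrent sequences found this way would be candidates for multi-word glossary entries, subject to the same manual checking as single-word entries.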


Appendix A –– Categorial Ambiguity

The columns from left to right represent, respectively: 1. The item; 2. The item’s number of occurrences in the sample; 3. The grammatical category as which the item functions in the sample; 4. The number of occurrences where the item functions as the category in the previous column; 5. The category with which SYSTRAN has labelled the item; 6. Whether SYSTRAN’s labelling is correct or not; 7. The number of instances where SYSTRAN has labelled the item as the “Assigned Category”; 8. All the grammatical categories for the item in SYSTRAN’s glossary.

Word | # | Category | # | Assigned Category | Correct / Wrong | # | All categories
can | 7 | verb | 7 | noun | Wrong | 7 | noun/verb
show | 5 | verb | 5 | noun | Wrong | 5 | noun/verb
hide | 2 | verb | 2 | noun | Wrong | 2 | noun/verb
offer | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
combine | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
influence | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
draw | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
play | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
search | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
lead | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
help | 1 | verb | 1 | verb | Correct | 1 | noun/verb
object | 5 | noun | 5 | noun | Correct | 5 | noun/verb
phase | 5 | noun | 5 | noun | Correct | 5 | noun/verb
signal | 4 | noun | 4 | noun | Correct | 4 | noun/verb
field | 4 | noun | 4 | noun | Correct | 4 | noun/verb
target | 4 | noun | 4 | noun | Correct | 4 | noun/verb
cascade | 2 | noun | 2 | noun | Correct | 2 | noun/verb
speed | 2 | noun | 2 | noun | Correct | 2 | noun/verb
angle | 2 | noun | 2 | noun | Correct | 2 | noun/verb
condition | 2 | noun | 2 | noun | Correct | 2 | noun/verb
prospect | 1 | noun | 1 | noun | Correct | 1 | noun/verb
task | 1 | noun | 1 | noun | Correct | 1 | noun/verb
structure | 1 | noun | 1 | noun | Correct | 1 | noun/verb
drain | 1 | noun | 1 | noun | Correct | 1 | noun/verb
interface | 1 | noun | 1 | noun | Correct | 1 | noun/verb
charge | 1 | noun | 1 | noun | Correct | 1 | noun/verb
transport | 1 | noun | 1 | noun | Correct | 1 | noun/verb
instrument | 1 | noun | 1 | noun | Correct | 1 | noun/verb
cube | 1 | noun | 1 | noun | Correct | 1 | noun/verb
distance | 1 | noun | 1 | noun | Correct | 1 | noun/verb
shear | 1 | noun | 1 | noun | Correct | 1 | noun/verb
tilt | 1 | noun | 1 | noun | Correct | 1 | noun/verb
fringe | 1 | noun | 1 | noun | Correct | 1 | noun/verb
band | 1 | noun | 1 | noun | Correct | 1 | noun/verb
scaling | 1 | noun | 1 | noun | Correct | 1 | noun/verb
value | 1 | noun | 1 | noun | Correct | 1 | noun/verb
orbit | 1 | noun | 1 | noun | Correct | 1 | noun/verb
size | 1 | noun | 1 | noun | Correct | 1 | noun/verb
range | 1 | noun | 1 | noun | Correct | 1 | noun/verb
impact | 1 | noun | 1 | noun | Correct | 1 | noun/verb
step | 1 | noun | 1 | noun | Correct | 1 | noun/verb
portion | 1 | noun | 1 | noun | Correct | 1 | noun/verb
rate | 1 | noun | 1 | noun | Correct | 1 | noun/verb
factor | 1 | noun | 1 | noun | Correct | 1 | noun/verb
index | 1 | noun | 1 | noun | Correct | 1 | noun/verb
mirror | 1 | noun | 1 | noun | Correct | 1 | noun/verb
function | 1 | noun | 1 | noun | Correct | 1 | noun/verb
cost | 1 | noun | 1 | noun | Correct | 1 | noun/verb
surface | 1 | noun | 1 | noun | Correct | 1 | noun/verb
profile | 1 | noun | 1 | noun | Correct | 1 | noun/verb
applied | 1 | verb | 1 | adjective | Wrong | 1 | adjective/verb
precise | 4 | adjective | 4 | adjective | Correct | 4 | adjective/verb
fixed | 1 | adjective | 1 | adjective | Correct | 1 | adjective/verb
direct | 1 | adjective | 1 | adjective | Correct | 1 | adjective/verb
related | 1 | adjective | 1 | adjective | Correct | 1 | adjective/verb
controlled | 1 | adjective | 1 | adjective | Correct | 1 | adjective/verb
exact | 1 | adjective | 1 | adjective | Correct | 1 | adjective/verb
those | 1 | pronoun | 1 | pronoun | Correct | 1 | adjective/pronoun
these | 6 | adjective | 6 | adjective | Correct | 6 | adjective/pronoun
light | 2 | noun | 2 | adjective | Wrong | 2 | adjective/noun/verb
model | 2 | noun | 2 | noun | Correct | 2 | adjective/noun/verb
paper | 2 | noun | 2 | noun | Correct | 2 | adjective/noun/verb
front | 1 | noun | 1 | noun | Correct | 1 | adjective/noun/verb
plane | 1 | noun | 1 | noun | Correct | 1 | adjective/noun/verb
major | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun/verb
parallel | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun/verb
static | 1 | adjective | 1 | noun | Wrong | 1 | adjective/noun
integral | 1 | noun | 1 | noun | Correct | 1 | adjective/noun
harmonic | 1 | noun | 1 | noun | Correct | 1 | adjective/noun
passive | 5 | adjective | 5 | adjective | Correct | 5 | adjective/noun
solar-energy | 2 | adjective | 2 | adjective | Correct | 2 | adjective/noun
characteristic | 2 | adjective | 2 | adjective | Correct | 2 | adjective/noun
imperialist | 2 | adjective | 2 | adjective | Correct | 2 | adjective/noun
intrinsic | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
broadband | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
solid | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
active | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
minimum | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
optimum | 1 | adjective | 1 | adjective | Correct | 1 | adjective/noun
free | 2 | adjective | 2 | adjective | Correct | 2 | adjective/adverb/noun/verb
due [to] | 2 | adjective | 2 | adjective | Correct | 2 | adjective/adverb/noun
backward | 2 | adjective | 2 | adjective | Correct | 2 | adjective/adverb
result | 4 | noun | 3 | noun | Correct | 3 | noun/verb
 |  | verb | 1 | verb | Correct | 1 | noun/verb
design | 5 | noun | 3 | noun | Correct | 3 | noun/verb
 |  | verb | 2 | verb | Correct | 2 | noun/verb
cloak | 7 | noun | 4 | noun | Correct | 4 | noun/verb
 |  | verb | 3 | verb | Correct | 3 | noun/verb
form | 2 | noun | 1 | noun | Correct | 1 | noun/verb
 |  | verb | 1 | verb | Correct | 1 | noun/verb
produce | 2 | verb | 2 | noun | Wrong | 1 | noun/verb
 |  |  |  | verb | Correct | 1 | noun/verb
fast | 2 | adverb | 1 | adjective | Correct | 1 | adjective/adverb/verb
 |  | adjective | 1 | adjective | Correct | 1 | adjective/adverb/verb
present | 2 | verb | 2 | noun | Wrong | 1 | adjective/noun/verb
 |  |  |  | adjective | Wrong | 1 | adjective/noun/verb
this | 9 | adjective | 7 | adjective | Correct | 7 | adjective/pronoun
 |  | pronoun | 2 | pronoun | Correct | 2 | adjective/pronoun
that | 6 | pronoun | 2 | pronoun | Correct | 2 | adjective/pronoun/conjunction
 |  | conjunction | 4 | conjunction | Correct | 4 | adjective/pronoun/conjunction
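A table of this kind can be summarized as a per-category labelling accuracy, weighted by occurrences. The sketch below uses only a handful of rows copied from Appendix A to keep it short; the tuple layout and the function name `accuracy_by_category` are illustrative assumptions, and the printed figures describe this subset only, not the whole sample.

```python
from collections import defaultdict

# (item, category in the sample, occurrences, SYSTRAN label correct?)
# A small subset of rows taken from Appendix A, for illustration only.
ROWS = [
    ("can", "verb", 7, False),
    ("show", "verb", 5, False),
    ("hide", "verb", 2, False),
    ("help", "verb", 1, True),
    ("object", "noun", 5, True),
    ("phase", "noun", 5, True),
    ("light", "noun", 2, False),
    ("precise", "adjective", 4, True),
    ("static", "adjective", 1, False),
]


def accuracy_by_category(rows):
    """Share of occurrences labelled correctly, per grammatical category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _item, category, count, is_correct in rows:
        total[category] += count
        if is_correct:
            correct[category] += count
    return {category: correct[category] / total[category] for category in total}


for category, accuracy in accuracy_by_category(ROWS).items():
    print(f"{category}: {accuracy:.2f}")
```

Even on this subset the contrast between verbs and the other categories, which the case study discusses, is visible.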

Appendix B –– Polysemy

The columns from left to right represent, respectively: 1. The item; 2. The item’s number of occurrences in the sample; 3. The Chinese word to which the item should be translated in the sample; 4. The number of occurrences where the item should be translated as such; 5. The Chinese word to which SYSTRAN has translated the item; 6. The number of instances where SYSTRAN has translated the item as the “SYSTRAN’s Translation”; 7. All the translations for the item in SYSTRAN’s glossary; 8. Whether SYSTRAN’s translation is correct or not.

Word | # | Translation | # | SYSTRAN’s Translation | # | Translations in Glossary | Correct / Wrong
field | 4 | 场 | 4 | 领域 | 4 | 领域,域;调遣 | Wrong
passive | 5 | 无源 | 5 | 被动 | 5 | 被动 | Wrong
spectral | 3 | 光谱 | 3 | 鬼 | 3 | 鬼;光谱 | Wrong
paper | 2 | 文 | 2 | 文 | 2 | 纸,文件,文,论文;裱糊 | Correct
can | 7 | 能 | 7 | 罐头 | 7 | 装…于罐中;能;罐头 | Wrong
produce | 1 | 产出 | 1 | 产物 | 1 | 生产;产物 | Wrong
cube | 1 | 立方体 | 1 | 立方体 | 1 | 求…的立方;立方体 | Correct
lead | 1 | 导致 | 1 | 主角 | 1 | 带领;导致;主角;线索 | Wrong
measurement | 1 | 测量 | 1 | 测量 | 1 | 测量;评定 | Correct
distance | 1 | 距离 | 1 | 距离 | 1 | 距离;疏远 | Correct
add | 1 | 增加 | 1 | 增加 | 1 | 增加,添加 | Correct
shear | 1 | 平移 | 1 | 剪 | 1 | 剪 | Wrong
tilt | 1 | 倾斜 | 1 | 掀动 | 1 | 掀动 | Wrong
fringe | 1 | 干涉条纹 | 1 | 边缘 | 1 | 装饰;边缘,附加费用 | Wrong
emission | 1 | 辐射 | 1 | 发射 | 1 | 发射 | Wrong
band | 1 | 带 | 1 | 带 | 1 | 结合;带,范围 | Correct
simulation | 2 | 模拟 | 2 | 模仿 | 2 | 模仿,模拟 | Wrong
harvesting | 2 | 采集 | 2 | 收获 | 2 | 收获 | Wrong
offer | 1 | 提供 | 1 | 提议 | 1 | 提议,聘用;提供 | Wrong
task | 1 | 任务 | 1 | 任务 | 1 | 任务;分配 | Correct
photovoltaic | 1 | 光伏 | 1 | 光致电压的 | 1 | 光致电压的 | Wrong
behavior | 1 | 特性 | 1 | 行为 | 1 | 行为,工作情况 | Wrong
structure | 1 | 结构 | 1 | 结构 | 1 | 结构;构造,构建 | Correct
intrinsic | 1 | 本征 | 1 | 内在 | 1 | 内在 | Wrong
drain | 1 | 漏极 | 1 | 流失 | 1 | 排泄;流失 | Wrong
simulator | 1 | 模拟器 | 1 | 模拟器 | 1 | 模拟器,模拟程序 | Correct
combine | 1 | 结合 | 1 | 组合 | 1 | 结合;组合,联合收获机 | Wrong
solution | 3 | 解 | 3 | 解答 | 3 | 解答,解决方法,办法 | Wrong
interface | 1 | 接口 | 1 | 接口 | 1 | 链接;接口,界面,界面 | Correct
charge | 1 | 电荷 | 1 | 充电 | 1 | 充电,费用,罪名 | Wrong

available | 2 | 现有的 | 2 | 可利用的 | 2 | 可利用的,可用的 | Wrong
object | 5 | 物体 | 5 | 对象 | 5 | 反对;对象 | Wrong
amplitude | 1 | 振幅 | 1 | 高度 | 1 | 高度 | Wrong
hide | 2 | 隐藏 | 2 | 皮 | 2 | 掩藏,隐藏;皮,隐藏 | Wrong
Due [to] | 2 | 由于 | 2 | 由于 | 2 | 交付,由于;应得物,到期 | Correct
scaling | 1 | 比例缩放 | 1 | 结垢 | 1 | 结垢,比例缩放;称 | Wrong
value | 1 | 价值 | 1 | 价值 | 1 | 重视;价值,值 | Correct
orbit | 1 | 轨道 | 1 | 轨道 | 1 | 轨道;循轨道运行 | Correct
active | 1 | 有源 | 1 | 活跃 | 1 | 活跃,有效的;激活 | Wrong
fast | 1 | 快速 | 1 | 快速 | 1 | 快速;斋戒 | Correct
backward | 2 | 逆向 | 2 | 落后 | 2 | 落后,方向;向后 | Wrong
precise | 4 | 精确 | 4 | 精确 | 4 | 写…的大意;精确,准确的 | Correct
direct | 1 | 直接 | 1 | 直接 | 1 | 指挥,处理;直接 | Correct
discrete | 1 | 离散 | 1 | 分离 | 1 | 分离 | Wrong
model | 2 | 模型 | 2 | 模型 | 2 | 塑造;模型,设计 | Correct
integral | 1 | 积分式 | 1 | 积分式 | 1 | 积分式;缺一不可的,集成 | Correct
target | 4 | 目标 | 4 | 目标 | 4 | 瞄准,标定;目标 | Correct
aliasing | 2 | 混叠 | 2 | 混淆现象 | 2 | 混淆现象,别名 | Wrong
major | 1 | 主要 | 1 | 主要 | 1 | 主修;主要,专业;少校 | Correct
prescription | 2 | 方法 | 2 | 处方 | 2 | 处方,规定 | Wrong
variable | 1 | 变量 | 1 | 可变物 | 1 | 可变物,变量 | Wrong
geometry | 1 | 几何形状 | 1 | 几何 | 1 | 几何,几何学 | Wrong
plane | 1 | 平面 | 1 | 飞机 | 1 | 飞机,平面;飞行 | Wrong
truncation | 1 | 截断 | 1 | 截 | 1 | 截,截断 | Wrong
lying (lie) | 1 | 位于 | 1 | 说谎 | 1 | 说谎,位于 | Wrong
controlled [parameter] | 1 | 受控 | 1 | 受控 | 1 | 控制;受控 | Correct
applied (apply) [to] | 1 | 应用于 | 1 | 应用的 | 1 | 申请,适用;应用的 | Wrong
generation | 1 | 生成 | 1 | 一代 | 1 | 世代,一代,生成,量 | Wrong
quantum | 3 | 量子 | 3 | 量子 | 3 | 量子,数量 | Correct
cascade | 2 | 级联 | 2 | 小瀑布 | 2 | 小瀑布,级联;落下 | Wrong
frequency doubling | 1 | 双倍频 | 1 | 频率 加倍 | 1 | 频率 加倍 | Wrong
second | 2 | 二次 (second harmonic) | 2 | 第二 | 2 | 其次;支持;秒钟,秒;第二 | Wrong
harmonic | 2 | 谐波 | 2 | 泛音 | 2 | 泛音 | Wrong
generate | 1 | 生成 | 1 | 引起 | 1 | 引起,生成,组建 | Wrong
range | 1 | 范围 | 1 | 范围 | 1 | 排列;范围 | Correct
discuss | 1 | 探讨 | 1 | 谈论 | 1 | 谈论,讨论 | Wrong
impact | 1 | 影响 | 1 | 冲击 | 1 | 冲击,影响 | Wrong
conversion | 1 | 转换 | 1 | 换能 (conversion efficiency) | 1 | 转换;换能 | Wrong
step | 1 | 步骤 | 1 | 步 | 1 | 步,步骤;跨步 | Wrong
characteristic | 2 | 特征 (characteristic parameter) | 2 | 典型 | 2 | 特征,特性;典型 | Wrong
portion | 1 | 部分 | 1 | 部分 | 1 | 部分;分配 | Correct

emit | 1 | 辐射 | 1 | 散发 | 1 | 散发 | Wrong
cavity | 1 | 腔 | 1 | 洞 | 1 | 洞 | Wrong
signal | 4 | 信号 | 4 | 信号 | 4 | 信号;发信号 | Correct
steady-state | 2 | 稳态 | 2 | 稳定 | 2 | 稳定 状态 | Wrong
rate | 1 | 速率 | 1 | 率 | 1 | 率,费率,利率;对…估计 | Wrong
exact | 1 | 确切的 | 1 | 确切的 | 1 | 精确;苛求 | Correct
expansion | 1 | 展开式 | 1 | 展开 | 1 | 扩展,展开 | Wrong
linewidth | 1 | 线宽 | 1 | 行距 | 1 | 行距,行宽 | Wrong
enhancement [factor] | 1 | 增强 (因子) | 1 | 改进 | 1 | 改进 | Wrong
factor | 1 | 因子 | 1 | 因素 | 1 | 因素,系数;析因 | Wrong
draw | 1 | 得出 | 1 | 凹道 | 1 | 凹道;画 | Wrong
analogy | 1 | 类比 | 1 | 比喻 | 1 | 比喻 | Wrong
play | 1 | 起(作用) | 1 | 戏剧 | 1 | 扬州,扮演;戏剧,作用 | Wrong
modulation | 1 | 调制 | 1 | 模块化 | 1 | 模块化,调制 | Wrong
index | 1 | 系数 | 1 | 索引 | 1 | 标注;索引,指数 | Wrong
aberration | 4 | 畸变 | 4 | 变型 | 4 | 变型 | Wrong
correction | 6 | 校正 | 6 | 更正 | 6 | 更正,校正 | Wrong
investigate | 2 | 探讨 | 2 | 调查 | 2 | 调查 | Wrong
adaptive | 1 | 自适应 | 1 | 能适应的 | 1 | 能适应的,可适应的 | Wrong
deformable | 1 | 可变形 | 1 | 可变性的 | 1 | 可变性的 | Wrong
mirror | 1 | 反射镜 | 1 | 镜子 | 1 | 反映;镜子 | Wrong
interferometric | 2 | 相干 | 2 | 干涉测量的 | 2 | 干涉测量的 | Wrong
function | 1 | 函数 | 1 | 函数 | 1 | 作用,功能,函数;发挥作用 | Correct
surface | 1 | 表面 | 1 | 表面 | 1 | 表面;浮出水面,出现 | Correct
profile | 1 | 外形 | 1 | 外形 | 1 | 外形,配置文件;描出 | Correct
solid-state | 2 | 固态 | 2 | 固体 | 2 | 固体 状态 | Wrong
convergence | 2 | 收敛 | 2 | 汇合 | 2 | 汇合 | Wrong
speed | 2 | 速度 | 2 | 速度 | 2 | 速度;加速 | Correct
parallel | 1 | 并行 | 1 | 平行的 | 1 | 平行的,并行 | Wrong
light | 2 | 光 | 2 | 轻的 | 1 | 轻的;点燃;光 | Wrong
 |  |  |  | 光 | 1 | light modulator 光调制器 | Correct
present | 2 | 介绍 | 2 | 礼物 | 1 | 礼物,存在;提出,存在,显示 | Wrong
 |  |  |  | 当前 | 1 | 礼物,存在;提出,存在,显示 | Wrong
error | 4 | 误差 | 4 | 错误 | 3 | 错误,误差 | Wrong
 |  |  |  | 误差 | 1 | 错误,误差 | Correct
series | 2 | 级数 | 1 | 级数 | 1 | 系列,串联,级数 | Correct
 |  |  | 1 | 系列 | 1 | 系列,串联,级数 | Wrong
show | 5 | 展示/显示 | 5 | 陈列 | 2 | 展示,显示;陈列 | Wrong
 |  |  |  | 展示 | 3 | 展示,显示;陈列 | Correct
laser | 6 | 激光器 | 5 | 激光 | 5 | 激光 | Wrong
 |  | 激光 | 1 | 激光 | 1 | 激光 | Correct
source | 3 | 源极 | 2 | 来源 | 2 | 来源,源 | Wrong
 |  | 源 (source of error) | 1 | 源 | 1 | 来源,源 | Correct
cloak | 7 | 斗篷 | 4 | 斗篷 | 4 | 掩饰;斗篷 | Correct
 |  | 掩饰 | 3 | 掩饰 | 3 | 掩饰;斗篷 | Correct
phase | 6 | 波前 (phase front) | 1 | 阶段 前面 | 1 | 阶段;逐步采用 | Wrong
 |  | 相位 | 5 | 阶段 | 5 | 阶段;逐步采用 | Wrong
free | 2 | 无 (free from) | 1 | 从…解脱 | 1 | 释放;任意;自由 | Wrong
 |  | 自(变量) (free variable) | 1 | 自由 | 1 | 释放;任意;自由 | Wrong

Appendix C –– Lists of Items in Discussion

List C.1 –– Disambiguation errors (category ambiguity)

(The meanings of the columns in this List are the same as those in Appendix A.)

Word | # | Category | # | Assigned Category | Correct / Wrong | # | All categories
can | 7 | verb | 7 | noun | Wrong | 7 | noun/verb
show | 5 | verb | 5 | noun | Wrong | 5 | noun/verb
hide | 2 | verb | 2 | noun | Wrong | 2 | noun/verb
offer | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
combine | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
influence | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
draw | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
play | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
search | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
lead | 1 | verb | 1 | noun | Wrong | 1 | noun/verb
applied | 1 | verb | 1 | adjective | Wrong | 1 | adjective/verb
light | 2 | noun | 2 | adjective | Wrong | 2 | adjective/noun/verb
static | 1 | adjective | 1 | noun | Wrong | 1 | adjective/noun
produce | 2 | verb | 2 | noun | Wrong | 1 | noun/verb
 |  |  |  | verb | Correct | 1 | noun/verb
present | 2 | verb | 2 | noun | Wrong | 1 | adjective/noun/verb
 |  |  |  | adjective | Wrong | 1 | adjective/noun/verb

List C.2 –– Disambiguation errors (polysemy)

(The meanings of the columns in this List are the same as those in Appendix B.)

Word | # | Translation | # | SYSTRAN’s Translation | # | Translations in Glossary | Correct / Wrong
field | 4 | 场 | 4 | 领域 | 4 | 领域,域;调遣 | Wrong
passive | 5 | 无源 | 5 | 被动 | 5 | 被动 | Wrong
spectral | 3 | 光谱 | 3 | 鬼 | 3 | 鬼;光谱 | Wrong
can | 7 | 能 | 7 | 罐头 | 7 | 装…于罐中;能;罐头 | Wrong
produce | 1 | 产出 | 1 | 产物 | 1 | 生产;产物 | Wrong
lead | 1 | 导致 | 1 | 主角 | 1 | 带领;导致;主角;线索 | Wrong
shear | 1 | 平移 | 1 | 剪 | 1 | 剪 | Wrong
tilt | 1 | 倾斜 | 1 | 掀动 | 1 | 掀动 | Wrong
fringe | 1 | 干涉条纹 | 1 | 边缘 | 1 | 装饰;边缘,附加费用 | Wrong
emission | 1 | 辐射 | 1 | 发射 | 1 | 发射 | Wrong
simulation | 2 | 模拟 | 2 | 模仿 | 2 | 模仿,模拟 | Wrong
harvesting | 2 | 采集 | 2 | 收获 | 2 | 收获 | Wrong
offer | 1 | 提供 | 1 | 提议 | 1 | 提议,聘用;提供 | Wrong
photovoltaic | 1 | 光伏 | 1 | 光致电压的 | 1 | 光致电压的 | Wrong
behavior | 1 | 特性 | 1 | 行为 | 1 | 行为,工作情况 | Wrong
intrinsic | 1 | 本征 | 1 | 内在 | 1 | 内在 | Wrong
drain | 1 | 漏极 | 1 | 流失 | 1 | 排泄;流失 | Wrong
combine | 1 | 结合 | 1 | 组合 | 1 | 结合;组合,联合收获机 | Wrong
solution | 3 | 解 | 3 | 解答 | 3 | 解答,解决方法,办法 | Wrong
charge | 1 | 电荷 | 1 | 充电 | 1 | 充电,费用,罪名 | Wrong
available | 2 | 现有的 | 2 | 可利用的 | 2 | 可利用的,可用的 | Wrong
object | 5 | 物体 | 5 | 对象 | 5 | 反对;对象 | Wrong
amplitude | 1 | 振幅 | 1 | 高度 | 1 | 高度 | Wrong
hide | 2 | 隐藏 | 2 | 皮 | 2 | 掩藏,隐藏;皮,隐藏 | Wrong
scaling | 1 | 比例缩放 | 1 | 结垢 | 1 | 结垢,比例缩放;称 | Wrong
active | 1 | 有源 | 1 | 活跃 | 1 | 活跃,有效的;激活 | Wrong
backward | 2 | 逆向 | 2 | 落后 | 2 | 落后,方向;向后 | Wrong
discrete | 1 | 离散 | 1 | 分离 | 1 | 分离 | Wrong
aliasing | 2 | 混叠 | 2 | 混淆现象 | 2 | 混淆现象,别名 | Wrong
prescription | 2 | 方法 | 2 | 处方 | 2 | 处方,规定 | Wrong
variable | 1 | 变量 | 1 | 可变物 | 1 | 可变物,变量 | Wrong
geometry | 1 | 几何形状 | 1 | 几何 | 1 | 几何,几何学 | Wrong
plane | 1 | 平面 | 1 | 飞机 | 1 | 飞机,平面;飞行 | Wrong
truncation | 1 | 截断 | 1 | 截 | 1 | 截,截断 | Wrong
lying (lie) | 1 | 位于 | 1 | 说谎 | 1 | 说谎,位于 | Wrong
applied (apply) [to] | 1 | 应用于 | 1 | 应用的 | 1 | 申请,适用;应用的 | Wrong
generation | 1 | 生成 | 1 | 一代 | 1 | 世代,一代,生成,量 | Wrong
cascade | 2 | 级联 | 2 | 小瀑布 | 2 | 小瀑布,级联;落下 | Wrong
frequency doubling | 1 | 双倍频 | 1 | 频率 加倍 | 1 | 频率 加倍 | Wrong
second | 2 | 二次 (second harmonic) | 2 | 第二 | 2 | 其次;支持;秒钟,秒;第二 | Wrong
harmonic | 2 | 谐波 | 2 | 泛音 | 2 | 泛音 | Wrong
generate | 1 | 生成 | 1 | 引起 | 1 | 引起,生成,组建 | Wrong
discuss | 1 | 探讨 | 1 | 谈论 | 1 | 谈论,讨论 | Wrong
impact | 1 | 影响 | 1 | 冲击 | 1 | 冲击,影响 | Wrong
conversion | 1 | 转换 | 1 | 换能 (conversion efficiency) | 1 | 转换;换能 | Wrong
step | 1 | 步骤 | 1 | 步 | 1 | 步,步骤;跨步 | Wrong
characteristic | 2 | 特征 (characteristic parameter) | 2 | 典型 | 2 | 特征,特性;典型 | Wrong
emit | 1 | 辐射 | 1 | 散发 | 1 | 散发 | Wrong
cavity | 1 | 腔 | 1 | 洞 | 1 | 洞 | Wrong
steady-state | 2 | 稳态 | 2 | 稳定 | 2 | 稳定 状态 | Wrong
rate | 1 | 速率 | 1 | 率 | 1 | 率,费率,利率;对…估计 | Wrong
expansion | 1 | 展开式 | 1 | 展开 | 1 | 扩展,展开 | Wrong
linewidth | 1 | 线宽 | 1 | 行距 | 1 | 行距,行宽 | Wrong
enhancement [factor] | 1 | 增强 (因子) | 1 | 改进 | 1 | 改进 | Wrong
factor | 1 | 因子 | 1 | 因素 | 1 | 因素,系数;析因 | Wrong
draw | 1 | 得出 | 1 | 凹道 | 1 | 凹道;画 | Wrong
analogy | 1 | 类比 | 1 | 比喻 | 1 | 比喻 | Wrong
play | 1 | 起(作用) | 1 | 戏剧 | 1 | 扬州,扮演;戏剧,作用 | Wrong
modulation | 1 | 调制 | 1 | 模块化 | 1 | 模块化,调制 | Wrong
index | 1 | 系数 | 1 | 索引 | 1 | 标注;索引,指数 | Wrong
aberration | 4 | 畸变 | 4 | 变型 | 4 | 变型 | Wrong
correction | 6 | 校正 | 6 | 更正 | 6 | 更正,校正 | Wrong
investigate | 2 | 探讨 | 2 | 调查 | 2 | 调查 | Wrong
adaptive | 1 | 自适应 | 1 | 能适应的 | 1 | 能适应的,可适应的 | Wrong
deformable | 1 | 可变形 | 1 | 可变性的 | 1 | 可变性的 | Wrong
mirror | 1 | 反射镜 | 1 | 镜子 | 1 | 反映;镜子 | Wrong
interferometric | 2 | 相干 | 2 | 干涉测量的 | 2 | 干涉测量的 | Wrong
solid-state | 2 | 固态 | 2 | 固体 | 2 | 固体 状态 | Wrong
convergence | 2 | 收敛 | 2 | 汇合 | 2 | 汇合 | Wrong
parallel | 1 | 并行 | 1 | 平行的 | 1 | 平行的,并行 | Wrong
light | 2 | 光 | 2 | 轻的 | 1 | 轻的;点燃;光 | Wrong
 |  |  |  | 光 | 1 | light modulator 光调制器 | Correct
present | 2 | 介绍 | 2 | 礼物 | 1 | 礼物,存在;提出,存在,显示 | Wrong
 |  |  |  | 当前 | 1 | 礼物,存在;提出,存在,显示 | Wrong
error | 4 | 误差 | 4 | 错误 | 3 | 错误,误差 | Wrong
 |  |  |  | 误差 | 1 | 错误,误差 | Correct
series | 2 | 级数 | 1 | 级数 | 1 | 系列,串联,级数 | Correct
 |  |  | 1 | 系列 | 1 | 系列,串联,级数 | Wrong
show | 5 | 展示/显示 | 5 | 陈列 | 2 | 展示,显示;陈列 | Wrong
 |  |  |  | 展示 | 3 | 展示,显示;陈列 | Correct
laser | 6 | 激光器 | 5 | 激光 | 5 | 激光 | Wrong
 |  | 激光 | 1 | 激光 | 1 | 激光 | Correct
source | 3 | 源极 | 2 | 来源 | 2 | 来源,源 | Wrong
 |  | 源 (source of error) | 1 | 源 | 1 | 来源,源 | Correct
phase | 6 | 波前 (phase front) | 1 | 阶段 前面 | 1 | 阶段;逐步采用 | Wrong
 |  | 相位 | 5 | 阶段 | 5 | 阶段;逐步采用 | Wrong
free | 2 | 无 (free from) | 1 | 从…解脱 | 1 | 释放;任意;自由 | Wrong
 |  | 自(变量) (free variable) | 1 | 自由 | 1 | 释放;任意;自由 | Wrong

List C.3 –– User Dictionary

EN | ZH | HEADWORD_EN | CATEGORY | ENTRY TYPE
aberration | 畸变 | aberration | noun | Multilingual
active | 有源 | active | adjective | Multilingual
adaptive | 自适应 | adaptive | adjective | Multilingual
aliasing | 混叠 | aliasing | noun | Multilingual
amplitude | 振幅 | amplitude | noun | Multilingual
amplitude-only | 纯振幅 | only | adjective | Multilingual
analogy | 类比 | analogy | noun | Multilingual
applied | 应用于 | applied | verb | Multilingual
available | 现有的 | available | adjective | Multilingual
backward | 逆向 | backward | adjective | Multilingual
behavior | 特性 | behavior | noun | Multilingual
can |  | can | verb | Multilingual
cascade | 级联 | cascade | noun | Multilingual
cavity |  | cavity | noun | Multilingual
characteristic | 特征 | characteristic | adjective | Multilingual
charge | 电荷 | charge | noun | Multilingual
combine | 结合 | combine | verb | Multilingual
convergence | 收敛 | convergence | noun | Multilingual
conversion | 转换 | conversion | noun | Multilingual
correction | 校正 | correction | noun | Multilingual
deformable | 可变形 | deformable | adjective | Multilingual
discrete | 离散 | discrete | adjective | Multilingual
discuss | 探讨 | discuss | verb | Multilingual
drain | 漏极 | drain | noun | Multilingual
draw | 得出 | draw | verb | Multilingual
drift-diffusion | 漂移-扩散 | diffusion | noun | Multilingual
emission | 辐射 | emission | noun | Multilingual
emit | 辐射 | emit | verb | Multilingual
enhancement | 增强 | enhancement | noun | Multilingual
error | 误差 | error | noun | Multilingual
expansion | 展开式 | expansion | noun | Multilingual
factor | 因子 | factor | noun | Multilingual
field |  | field | noun | Multilingual
five-dimensional | 五维 | five-dimensional | adjective | Multilingual
free from |  | free | adjective | Multilingual
frequency doubling | 双倍频 | doubling | noun | Multilingual
fringe | 干涉条纹 | fringe | noun | Multilingual
generate | 生成 | generate | verb | Multilingual
generation | 生成 | generation | noun | Multilingual
geometry | 几何形状 | geometry | noun | Multilingual
guided-wave | 导波 | wave | noun | Multilingual
harmonic | 谐波 | harmonic | noun | Multilingual
harvesting | 采集 | harvesting | noun | Multilingual
hide | 隐藏 | hide | verb | Multilingual
high-earth | 远地 | earth | adjective | Multilingual
homodyne | 同步检波 | homodyne | verb | Multilingual
hyperspectral | 高光谱 | hyperspectral | adjective | Multilingual
impact | 影响 | impact | noun | Multilingual
index | 系数 | index | noun | Multilingual
influence | 影响 | influence | verb | Multilingual
interferometric | 相干 | interferometric | adjective | Multilingual
intracavity | 腔内 | intracavity | adjective | Multilingual
intrinsic | 本征 | intrinsic | adjective | Multilingual
investigate | 探讨 | investigate | verb | Multilingual
lead | 导致 | lead | verb | Multilingual
light |  | light | noun | Multilingual
linewidth | 线宽 | linewidth | noun | Multilingual
lying | 位于 | lying | verb | Multilingual
mirror | 反射镜 | mirror | noun | Multilingual
modulation | 调制 | modulation | noun | Multilingual
nanostructured | 纳米结构 | nanostructured | adjective | Multilingual
nanotube | 纳米管 | nanotube | noun | Multilingual
near-field | 近场 | field | adjective | Multilingual
object | 物体 | object | noun | Multilingual
offer | 提供 | offer | verb | Multilingual
parallel | 并行 | parallel | adjective | Multilingual
passive | 无源 | passive | adjective | Multilingual
phase | 相位 | phase | noun | Multilingual
phase front | 波前 | front | noun | Multilingual
phase-matching | 相位匹配 | match | adjective | Multilingual
photovoltaic | 光伏 | photovoltaic | adjective | Multilingual
plane | 平面 | plane | noun | Multilingual
play |  | play | verb | Multilingual
prescription | 方法 | prescription | noun | Multilingual
present | 介绍 | present | verb | Multilingual
produce | 产出 | produce | verb | Multilingual
proof-of-principle | 概念验证 | proof | adjective | Multilingual
quantum cascade laser | 量子级联激光器 | laser | noun | Multilingual
rate | 速率 | rate | noun | Multilingual
reinjection | 重新注入 | reinjection | noun | Multilingual
scaling | 比例缩放 | scaling | noun | Multilingual
search | 查寻 | search | verb | Multilingual
second | 二次 | second | adjective | Multilingual
second-harmonic | 二次谐波 | harmonic | noun | Multilingual
semiconductor laser | 半导体激光器 | laser | noun | Multilingual
sensorless | 无传感器 | sensorless | adjective | Multilingual
series | 级数 | series | noun | Multilingual
shear | 平移 | shear | noun | Multilingual
show | 展示 | show | verb | Multilingual
solid-state | 固态 | state | adjective | Multilingual
solid-state laser | 固态激光器 | laser | noun | Multilingual
solid-state laser system | 固态激光系统 | system | noun | Multilingual
solution |  | solution | noun | Multilingual
source | 源极 | source | noun | Multilingual
source of error | 误差源 | source | noun | Multilingual
spectral | 光谱 | spectral | adjective | Multilingual
static | 静态 | static | adjective | Multilingual
steady-state | 稳态 | state | adjective | Multilingual
step | 步骤 | step | noun | Multilingual
stimulation | 模拟 | stimulation | noun | Multilingual
tilt | 倾斜 | tilt | noun | Multilingual
truncation | 截断 | truncation | noun | Multilingual
variable | 变量 | variable | noun | Multilingual

Acket | DNT
AlGaAs | DNT
CGH | DNT
FM | DNT
Lang-Kobayashi | DNT
mm3 | DNT
SPGD | DNT
Strehl | DNT
CNT

Appendix D –– Translation Output

In the ST, the red words refer to Not Found Words, while the blue words refer to categorial ambiguity (“source ambiguities”) and polysemy (“alternative meanings”). In the TT, the red words refer to Not Found Words, and the blue words refer to polysemy (“alternative meanings”).

Abstracts 1 to 7, each presented as ST, Before and After.

Appendix E –– Search Results of WUD_SC from Applied Optics

The columns from left to right refer to: 1. The item; 2. The expression extracted by SYSTRAN where the item is the headword; 3. The Chinese translation of the expression; 4. The labelled category, together with SYSTRAN’s designated probability of the category (in terms of percentage); 5. The priority set by SYSTRAN for that entry; 6. The number of occurrences of the expression in the corpus; 7. Example sentences for the expression concerned.
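As a small illustration of how the NOTE column might be read programmatically, the sketch below splits a value such as “[noun100]” into the category label and the percentage. The function name `parse_note` and the assumption that every note has exactly this bracketed shape (lower-case category followed by up to three digits) are mine, inferred from the entries listed in this appendix rather than from SYSTRAN documentation.

```python
import re

# Assumed shape of a NOTE value: "[" + lower-case category + percentage + "]".
NOTE_PATTERN = re.compile(r"^\[([a-z]+)(\d{1,3})\]$")


def parse_note(note):
    """Split a NOTE value such as "[noun100]" into (category, percentage)."""
    match = NOTE_PATTERN.match(note.strip())
    if match is None:
        raise ValueError(f"unexpected NOTE format: {note!r}")
    return match.group(1), int(match.group(2))


print(parse_note("[noun100]"))
print(parse_note("[verb100]"))
```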


ITEM

applied

show

light

EN

ZH

NOTE

PRIORITY

FREQUENCY

EXAMPLE

applied voltage

NO_MEANING

[noun100]

6

5

Liquid crystal lenses are an emerging technology that can provide variable focal power in response to applied voltage.

Applied Optics

NO_MEANING

[noun100]

6

6

Applied Optics is launching new focus issues to highlight optics research at institutes, including government labs, universities, and industries. The following highlights research taking place at the Georgia Institute of Technology (Georgia Tech).

show

NO_MEANING

[verb100]

6

259

light

NO_MEANING

[noun100]

6

494

yellow light

NO_MEANING

[noun100]

6

5

The obtained higher reflectivity in the yellow light wavelength region will benefit the phosphor-converted LEDs because yellow light backscattered by phosphor particles is reflected upward.

white light

NO_MEANING

[noun100]

6

12

The surface profiles of thin-film specimens were measured under an external magnetic field with white light interferometry.

visible light

NO_MEANING

[noun100]

6

10

Typical situations, which can be met during the process of absolute calibration, are shown in the case of a visible light observation system for the COMPASS tokamak.

stray light

NO_MEANING

[noun100]

6

12

Last, the impact of the stray light on the SD since “first light” is cleanly exhibited in the improved SD degradation result.

spatial light modulator

NO_MEANING

[noun100]

6

24

A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane.

slow light devices

NO_MEANING

[noun100]

6

5

Finally, physical parameters and applied external fields are changed for measuring frequency shift and SDF for coherent population oscillation slow light devices.

polarized light

NO_MEANING

[noun100]

6

24

The angle of incidence $\phi = \phi_{u\,\mathrm{min}}$ of minimum reflectance for incident unpolarized or circularly polarized light at a dielectric-conductor interface is determined for any complex relative refractive index $N = (n, k)$, and contours of constant $\phi_{u\,\mathrm{min}}$ in the $nk$ plane are presented.

light-emitting diodes

NO_MEANING

[noun100]

6

14

Angular color uniformity is a key optical property of white light-emitting diodes and high ACU is strongly demanded in illumination applications.

light-emitting diode

NO_MEANING

[noun100]

6

7

The practical instrumental example chosen to illustrate this method is a rotationally symmetric catadioptric collimator for a light-emitting diode source.

light utilization efficiency

NO_MEANING

[noun100]

6

6

Furthermore, we model the light utilization efficiency, illumination uniformity, and veiling luminance of glare due to one or several LED streetlamps.

light sources

NO_MEANING

[noun100]

6

8

Light emitting diodes (LEDs) are considered next-generation light sources.

light source

NO_MEANING

[noun100]

6

27

Here, a light source for laser cooling of trapped strontium ions is described.

light modulator

NO_MEANING

[noun100]

6

24

A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane.

light intensity

NO_MEANING

[noun100]

6

14

The known 120 Hz autostereoscopic displays with dynamic amplitude parallax barriers have full-screen resolution but are characterized by essential light intensity losses and crosstalk in each of displayed views.

light guide

NO_MEANING

[noun100]

6

6

The controller utilizes the optothermally induced volume increase in the elastomer-encapsulated paraffin wax to produce pneumatic force, which subsequently actuates the cantilever light guide to control the level of frustrated total internal reflection.

light devices

NO_MEANING

[noun100]

6

5

Finally, physical parameters and applied external fields are changed for measuring frequency shift and SDF for coherent population oscillation slow light devices.

light beam

NO_MEANING

[noun100]

6

5

Such a device combines red, green, and blue color beams into one output light beam.

light -

NO_MEANING

[noun100]

6

6

Spectral assemblage using light emitting diodes to obtain specified lighting characteristics.

laser light

NO_MEANING

[noun-100]

6

10

In particular, by matching the plasmon frequency of GNPs to the frequency of the laser light source we have observed a strong luminescence enhancement of the nanocomposite consisting of GNPs coupled with luminescent dye Nile blue 690 perchlorate.

incoherent light

NO_MEANING

[noun100]

6

6

Therefore, in this paper we propose an accurate technique for in-focus plane determination, which is based on coherent and incoherent light.

incident light

NO_MEANING

[noun100]

6

22

Through modulating intensity distribution of incident light, light emitting from the pinhole is capable of containing information on binary aberration coefficients.

cladding light

NO_MEANING

[noun100]

6

5

The results show that sheets of indium are very effective in stripping unwanted cladding light.

produce

produce

NO_MEANING

[verb100]

6

35

present

presents

NO_MEANING

[verb100]

6

73

Appendix F –– Search Results in WUD_SC from All Corpora

The columns from left to right refer to:
1. The item;
2. The expression extracted by SYSTRAN in which the item is the headword;
3. The Chinese translation of the expression;
4. The labelled category, together with the probability SYSTRAN assigns to that category (as a percentage);
5. The priority set by SYSTRAN for that entry;
6. The number of occurrences of the expression in the corpora;
7. An example sentence for the expression concerned.

ITEM

#EN

ZH

NOTE

PRIORITY

FREQUENCY

EXAMPLE

combines

NO_MEANING

[verb-100]

6

55

combine

NO_MEANING

[verb-100]

6

29

method combining

NO_MEANING

[noun-100]

6

5

A highly efficient hybrid method combining physical optics with physical optics is adopted to analyze the electromagnetic scattering from a perfectly electric conducting object situated above the conducting rough surface.

combining efficiency

NO_MEANING

[noun-100]

6

12

The purpose of our study is to decipher the desired effect of nonlinearity on the combining efficiency in two architectures of a globally coupled fiber-laser array:

combined effects

NO_MEANING

[noun-100]

6

5

Precise measurement of aberrations within an optical system is essential to mitigate combined effects of user-generated aberrations for the study of anisoplanatic imaging using optical test benches.

leading

NO_MEANING

[verb-100]

6

90

applied voltage

NO_MEANING

[noun-100]

6

25

The core changes its refractive index by means of partial in-plane to out-of-plane reorientation of ferroelectric domains in bismuth ferrite under applied voltage.

applied electric field

NO_MEANING

[noun-100]

6

6

Recently, the dark conglomerate phase, which is an optically isotropic liquid crystalline state, has been shown to exhibit a large change in refractive index in response to an applied electric field (Δn=0.04).

Applied Optics

NO_MEANING

[noun-100]

6

7

Applied Optics is launching new focus issues to highlight optics research at institutes, including government labs, universities, and industries. The following highlights research taking place at the Georgia Institute of Technology (Georgia Tech).

yellow light

NO_MEANING

[noun-100]

6

5

The obtained higher reflectivity in the yellow light wavelength region will benefit the phosphor-converted LEDs because yellow light backscattered by phosphor particles is reflected upward.

white-light interferometry

NO_MEANING

[noun-100]

6

5

High-resolution wide-dynamic range electronically scanned white-light interferometry.

white light-emitting diodes

NO_MEANING

[noun-100]

6

6

Angular color uniformity is a key optical property of white light-emitting diodes and high ACU is strongly demanded in illumination applications.

combine

lead

apply

light

white light interferometry

NO_MEANING

[noun-100]

6

5

The surface profiles of thin-film specimens were measured under an external magnetic field with white light interferometry.

white light

NO_MEANING

[noun-100]

6

54

Its operation is optimal when using almost monochromatic light but an extremely strong diffractive dispersion occurs when white light is applied.

visible light communication systems

NO_MEANING

[noun-100]

6

6

In this Letter, polarization division multiplexing is proposed and experimentally demonstrated for the first time that we know of, in visible light communication systems based on incoherent light emitting diodes and two orthogonal groups of linear polarizers.

visible light

NO_MEANING

[noun-100]

6

61

Typical situations, which can be met during the process of absolute calibration, are shown in the case of a visible light observation system for the COMPASS tokamak.

ultraviolet light-emitting diodes

NO_MEANING

[noun-100]

6

6

High performance 365 nm vertical-type ultraviolet light-emitting diodes are demonstrated by the insertion of a self-textured oxide mask structure using metalorganic chemical vapor deposition.

ultraviolet light

NO_MEANING

[noun-100]

6

6

Under reverse bias, electrons in the valence band of the p-GaN layer move into the conduction band of the GaZnO layer, through a QW-state-assisted tunneling process, to recombine with the injected holes in the GaZnO layer, for emitting yellow–red and shallow ultraviolet light over the entire mesa area.

transmitted light

NO_MEANING

[noun-100]

6

5

A clear difference in the grating transmitted light due to surface functionalization was observed in presence of TM polarized illumination.

stray light

NO_MEANING

[noun-100]

6

23

Last, the impact of the stray light on the SD since “first light” is cleanly exhibited in the improved SD degradation result.

spatial light modulators

NO_MEANING

[noun-100]

6

23

We demonstrate that two high-speed spatial light modulators, located conjugate to the image and spectral plane, respectively, can code the hyperspectral datacube into a single sensor image such that the high-resolution signal can be recovered in postprocessing.

spatial light modulator

NO_MEANING

[noun-100]

6

97

In contemporary optics, the spatial light modulator is effectively used as a flexible optoelectronic device playing the key role in a number of experiments of science and technology.

slow light devices

NO_MEANING

[noun-100]

6

7

Our results suggest that graphene may be a very promising slow light medium, promoting future slow light devices based on graphene.

silicon spatial light modulator

NO_MEANING

[noun-100]

6

6

A dual-plane in-line digital holographic method is proposed with a liquid crystal on silicon spatial light modulator for recording holograms at two slightly displaced planes.

scattered light

NO_MEANING

[noun-100]

6

26

An iterative model to predict the remote phosphor module power and photon budget, including the recuperation of backward scattered light by a mixing chamber, is introduced.

reflected light

NO_MEANING

[noun-100]

6

8

The four Stokes parameters of the reflected light wave (S0, S1, S2, and S3) are generally estimated by observing the scene, with a CCD sensor, through a polarimeter.

polarized light

NO_MEANING

[noun-100]

6

67

In normal incidence, the oblique lattice, in contrast to square lattice, showed strong asymmetric, non-reciprocal transmission of circularly polarized light.

phase-only spatial light modulator

NO_MEANING

[noun-100]

6

15

We employed a phase-only spatial light modulator to generate several vortex beam traps with one spheroid in each of them.

organic light-emitting diodes

NO_MEANING

[noun-100]

6

6

Low driving voltage blue, green, yellow, red and white phosphorescent organic light-emitting diodes with a common simply double emitting layer (D-EML) structure are investigated.

organic light -

NO_MEANING

[noun-100]

6

6

Self-organized nanoparticle photolithography for two-dimensional patterning of organic light emitting diodes.

nm light

NO_MEANING

[noun-100]

6

7

The experimental results were precisely fitted with a phenomenological model, assuming the simultaneous formation of one absorption grating induced by the 532 nm light and two coupling phase gratings generated from the refractive index changes by recording and auxiliary beams.

liquid crystal spatial light modulator

NO_MEANING

[noun-100]

6

11

A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane.

light-matter interaction

NO_MEANING

[noun-100]

6

8

The method treats the inspected material within its environment locally as a stratified system and describes the light–matter interaction of each layer in a realistic way.

light-emitting diodes

NO_MEANING

[noun-100]

6

73

Angular color uniformity is a key optical property of white light-emitting diodes and high ACU is strongly demanded in illumination applications.

light-emitting diode

NO_MEANING

[noun-100]

6

45

The practical instrumental example chosen to illustrate this method is a rotationally symmetric catadioptric collimator for a light-emitting diode source.

light-emitting devices

NO_MEANING

[noun-100]

6

8

In this work, we conducted studies of tandem organic light-emitting devices based on the connecting structure consisting of n-doped electron-transport layer (nETL)/1,4,5,8,9,11-hexaazatriphenylene hexacarbonitrile (HATCN)/hole-transport layer.

light waves

NO_MEANING

[noun-100]

6

16

In spectral domain interferometry, the interference signal generated by directly reflected waves from the two surfaces of a sample plate under test is greatly enhanced by the blockage of those light waves reflected by the two arm mirrors in the Michelson interferometer.

light utilization efficiency

NO_MEANING

[noun-100]

6

6

Furthermore, we model the light utilization efficiency, illumination uniformity, and veiling luminance of glare due to one or several LED streetlamps.

light transmission

NO_MEANING

[noun-100]

6

9

Optical characterisation of these nuclei is an important first step towards an improved understanding of how light transmission through the retina is influenced by its constituents.

light sources

NO_MEANING

[noun-100]

6

43

Their small footprints, tens of microwatts output powers and sub-milliwatt thresholds introduce such rare-earth-doped microlasers as scalable light sources for silicon-based microphotonic devices and systems.

light source

NO_MEANING

[noun-100]

6

115

pattern which is formed on the reflecting surface by external light and plays as a new light source with intensity profile.

light reflected

NO_MEANING

[noun-100]

6

6

Using field measurements to measure the reflected sunlight of two types of glass curtain walls, the energy distributions of the light reflected from these two different glass curtain walls are determined.

light pulses

NO_MEANING

[noun-100]

6

13

Fourier-transform-limited light pulses were obtained at the laser-plasma interaction point of a 100-TW peak-power laser in vacuum.

light pulse

NO_MEANING

[noun-100]

6

9

We show that a TLS may be excited by an external light pulse whose spectral components are below the absorption line of the TLS.

light patterns

NO_MEANING

[noun-100]

6

5

This paper describes a numerical algorithm to obtain this electric field generated by several relevant light patterns, and uses them to calculate the dielectrophoretic potential acting over neutral, polarizable particles in the proximity of the crystal.

light modulators

NO_MEANING

[noun-100]

6

27

Spatial light modulators with submicron-size pixels are promising devices for use in wide-viewing-angle glasses-free holographic 3D displays.

light modulator

NO_MEANING

[noun-100]

6

97

In contemporary optics, the spatial light modulator is effectively used as a flexible optoelectronic device playing the key role in a number of experiments of science and technology.

light microscopy

NO_MEANING

[noun-100]

6

6

Inspecting biological cells with bright-field light microscopy often engenders a challenge, owing to their optical transparency.

light intensity fluctuation

NO_MEANING

[noun-100]

6

7

In its linear response regime, it demonstrated 33% reduction in light intensity fluctuation in terms of the root-mean-square value.

light intensity

NO_MEANING

[noun-100]

6

49

Through finite-difference time-domain simulation, we demonstrate that the phase changing sensitivity obtained can be 4 orders higher than that by a single graphene under the same input light intensity.

light incident

NO_MEANING

[noun-100]

6

6

This unidirectional transportation property originates from the diffraction of grating to change the direction of light incident into the PC from pseudobandgaps to passbands of the PC.

light guide

NO_MEANING

[noun-100]

6

7

The controller utilizes the optothermally induced volume increase in the elastomer-encapsulated paraffin wax to produce pneumatic force, which subsequently actuates the cantilever light guide to control the level of frustrated total internal reflection.

light fields

NO_MEANING

[noun-100]

6

8

This Letter proposes a novel quantitative phase-imaging approach by optically encoding light fields into a complementary image pair followed by computational reconstruction.

light field microscope

NO_MEANING

[noun-100]

6

6

To implement a 3D live in-vivo experimental environment for multiple experimentalists, we generate elemental images for an integral imaging system from the captured light field with a light field microscope in real-time.

light extraction efficiency

NO_MEANING

[noun-100]

6

16

We think that the fabricated microlenses could be attractive for enhancing the light extraction efficiency of light emitting diodes.

light emitted

NO_MEANING

[noun-100]

6

9

With the developed camera-based system we can quantify the transmitted light scattered by textured samples or the light emitted from light sources in a few second’s time.

light control

NO_MEANING

[noun-100]

6

5

Such a MIM structure can serve as a heater for achieving all-optical light control based on the thermo-optical effect.

light confinement

NO_MEANING

[noun-100]

6

6

Furthermore, a novel fabrication technology is developed to pattern the PMMA into ridge structures by UV lithography in order to provide additional light confinement.

light beams

NO_MEANING

[noun-100]

6

32

A hole-pattern electrode and LC optics with external voltage input were employed to generate a symmetric nonuniform electrical field in the LC layer that directs LC molecules into the appropriate gradient refractive index distribution, resulting in the convergence or divergence of specific light beams.

light beam

NO_MEANING

[noun-100]

6

37

Such a device combines red, green, and blue color beams into one output light beam.

light absorption enhancement

NO_MEANING

[noun-100]

6

5

Surface plasmon polaritons, magnetic plasmon polaritons, localized surface plasmons, and optical waveguide modes were found to participate in the EOT and the light absorption enhancement.

light absorption

NO_MEANING

[noun-100]

6

37

Using the finite-difference time-domain method, the stable omnidirectional light absorption is achieved in the structure inspired from the Papilio ulysses over a wide incident angle range and with various wavelengths.

light absorbing carbon aerosols

NO_MEANING

[noun-100]

6

9

Our studies indicate that the complex morphology of internally mixed light absorbing carbon aerosols must be explicitly considered in climate radiation balance.

light -

NO_MEANING

[noun-100]

6

41

Self-organized nanoparticle photolithography for two-dimensional patterning of organic light emitting diodes.

infrared light

NO_MEANING

[noun-100]

6

9

Spectral purification and infrared light recycling in extreme ultraviolet lithography sources.

incident light

NO_MEANING

[noun-100]

6

63

We find that some nuclei efficiently focus incident light confirming earlier predictions based on comparative studies of chromatin organisation in nocturnal and diurnal mammals.

illumination light

NO_MEANING

[noun-100]

6

6

According to a 24-hour observation of the phototaxis of Poecilia reticulata to evaluate the effectiveness of the proposed light pattern to attract fish, when a fish shoal was habituated to a light source that emitted constant illumination light, it gradually moved away from the intense light zone and hovered around the junction of the light and dark zones.

present

green light

NO_MEANING

[noun-100]

6

7

A multimodality optical imaging setup is developed to validate the advantages of the entropy method based on laser speckle imaging, green light imaging, and fluorescence imaging.

cladding light

NO_MEANING

[noun-100]

6

5

The results show that sheets of indium are very effective in stripping unwanted cladding light.

UV light

NO_MEANING

[noun-100]

6

15

(1) creating liquid bottle-like microcavities along the taper waist of an optical fiber taper under interfacial tension and (2) curing the liquids into solids by UV light irradiation.

Light Source

NO_MEANING

[noun-100]

6

5

The first hard X-ray laser, the Linac Coherent Light Source, produces 120 shots per second.

LED lighting

NO_MEANING

[noun-100]

6

5

We simulate and compare the illuminance, uniformity, and efficiency of metal-halide lamps, white LED light sources, and hybrid light box designs combining sunlight and white LED lighting used for indoor basketball court illumination.

LED light source

NO_MEANING

[noun-100]

6

6

The illumination system consists of an RGB LED light source with a collimator lens group and a mirror with a color filter and a lens array integrator instead of an integrated rod so as to improve the uniformity of the light intensity.

InGaN/GaN light-emitting diodes

NO_MEANING

[noun-100]

6

8

The p-type AlGaN electron blocking layer is widely used in InGaN/GaN light-emitting diodes for electron overflow suppression.

results presented

NO_MEANING

[noun-100]

6

6

The results presented here is promising for applications of AlGaN-based SB-UV detectors.

presented method

NO_MEANING

[noun-100]

6

7

In the presented method of obtaining the reflectance profile of a screen element, the PSF convolves with a modelled reflectance profile of that element.

presented approach

NO_MEANING

[noun-100]

6

5

The good agreement between retrieval results and theoretical values confirms the feasibility of the presented approach.

present work

NO_MEANING

[noun-100]

6

14

In the present work, different varied line space and reflection zone plate gratings are analyzed for their suitability in low-signal femtosecond soft X-ray spectroscopy.

present study

NO_MEANING

[noun-100]

6

9

Accordingly, the present study proposes a robust numerical method for determining caustic surfaces based on a point spread function and the established analytical Jacobian and Hessian matrices of a ray by our group.

present method

NO_MEANING

[noun-100]

6

6

To validate the efficiency of our present method, the EM scattering of the composite model by the hybrid PO-PO method for different polarizations is calculated and compared with those using the conventional method of moments as well as computational time and memory requirements.

Appendix G –– Search Results in WUD from Applied Optics

The columns from left to right refer to:
1. The item;
2. The expression extracted by SYSTRAN in which the item is the headword;
3. The Chinese translation of the expression;
4. The labelled category, together with the probability SYSTRAN assigns to that category (as a percentage);
5. The Chinese headword for the translation of the expression;
6. The priority set by SYSTRAN for that entry;
7. The number of occurrences of the expression in the corpus;
8. An example sentence for the expression concerned.

ITEM

#EN

ZH

NOTE

HEADWORD_ZH

PRIORITY

FREQUENCY

EXAMPLE

show

plane crystal show

飞机 水晶 展示

[noun89]

展示

8

1

The processing results of (001) plane crystal show we can get the best surface roughness (RMS of 0.809 nm) if the directions of cutting and MRF polishing are along the (110) direction.

original image show

原始 的 图象 展示

[noun89]

展示

8

1

The computed values of mean squared error between the retrieved and the original image show the efficacy of the proposed scheme.

laser show

激光 展示

[noun89]

展示

8

1

In the range 20>R>0.05, the laser beam shows enhanced self-defocusing behavior with increasing external electric field, while it shows self-focusing in the range 0.03>R>0.01. Spatial solitons are observed under a suitable reverse external electric field for R=0.025. A theoretical model is proposed to explain the experimental observations, which suggest a new type of soliton formation due to “enhancement” not “screening” of the external electrical field.

laser sensors show

激光 传感器 展示

[noun80]

展示

8

1

Self-mixing laser sensors show promise for a wide range of sensing applications, including displacement, velocimetry, and fluid flow measurements.

image show

图象 展示

[noun89]

展示

8

1

The computed values of mean squared error between the retrieved and the original image show the efficacy of the proposed scheme.

experimental data show

实验 数据 展示

[noun89]

展示

8

1

The model and the experimental data show similar qualitative trends in plasma energy density as the beam power is increased.

data show

数据 展示

[noun89]

展示

8

1

The model and the experimental data show similar qualitative trends in plasma energy density as the beam power is increased.

crystal show

水晶 展示

[noun89]

展示

8

1

The processing results of (001) plane crystal show we can get the best surface roughness (RMS of 0.809 nm) if the directions of cutting and MRF polishing are along the (110) direction.

Studies show

研究 展示

[noun89]

展示

8

1

Studies show that the number of debris in low Earth orbit is exponentially growing despite future debris release mitigation measures considered.

offer

combine

influence

Laboratory tests show

实验室 试验 展示

[noun80]

展示

8

1

Laboratory tests shows strong stability and high precision compared to the classical control.

technique offer

技术 提议

[noun89]

提议

8

1

The technique offers reduced camera noise, automatic background light suppression, and crosstalk levels of typically