Improving Statistical Machine Translation of Informal Language: A Rule-based Pre-editing Approach for French Forums

Thèse présentée à la Faculté de Traduction et d'Interprétation de l'Université de Genève pour obtenir le grade de Docteur en Traitement Informatique Multilingue par

Johanna Gerlach

Jury:
Prof. Pierrette Bouillon, FTI/TIM, Université de Genève (Directeur de thèse)
Prof. Aurélie Picton, FTI/TIM, Université de Genève (Président du Jury)
Dr. Ana Guerberof, Pactera (Juré externe)
Dr. Sabine Lehmann, Acrolinx (Juré externe)
Dr. Emmanuel Rayner, FTI/TIM, Université de Genève (Juré)
Dr. Johann Roturier, Symantec (Juré externe)

Soutenue le 13 mars 2015 à l'Université de Genève

Thèse No 22

Abstract

Forums are increasingly used by online communities to share information about a wide range of topics. While this content is in theory available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written. Machine translation (MT) seems the most practical solution to make this content more widely accessible, but forum data presents multiple challenges for machine translation. The central objective of the thesis is to investigate the possibility of improving the outcome of statistical machine translation of French forum data through the application of pre-editing rules. In particular, our work aims at identifying which transformations are useful to improve translation and whether these transformations can be applied automatically or interactively with a rule-based technology. To evaluate the impact of these rules, we propose a human comparative evaluation methodology using crowdsourcing. Results show that pre-editing significantly improves the machine translation output. To assess the usefulness of these improvements, we perform an evaluation of temporal and technical post-editing effort. Findings show that improvements coincide with reduced effort. Another aspect we consider is whether the pre-editing task can concretely be performed in a forum context. Results of a pre-editing experiment with real forum users suggest that the interactive pre-editing process is accessible, with users producing only slightly less improvement than experts. Finally, to assess the portability of the developed pre-editing process, we perform evaluations with other MT systems, notably rule-based systems, as well as with data from forums from different domains. Findings indicate that, for the most part, the developed pre-editing rules are easily portable.

Résumé

Aujourd'hui les forums internet jouent un rôle de plus en plus important dans le partage d'informations par les communautés. Cependant, bien que les informations qui s'y trouvent soient techniquement accessibles à tous, celles-ci ne seront en réalité utiles qu'aux utilisateurs qui en maitrisent la langue. La traduction automatique (TA) semble une solution intéressante pour aborder cette limitation, mais elle est ici confrontée à des textes présentant de nombreuses difficultés. L'objectif principal de cette thèse est d'investiguer les possibilités d'amélioration de la traduction automatique statistique de textes de forums français par la préédition. En particulier, notre travail a pour but d'identifier quelles transformations sont utiles pour améliorer la traduction, et de déterminer si ces transformations peuvent être effectuées automatiquement ou interactivement en utilisant une technologie à base de règles. D'abord, pour évaluer l'impact de ces règles, nous proposons une méthode d'évaluation humaine comparative utilisant une plateforme participative en ligne. Les résultats montrent que la préédition améliore significativement le résultat de traduction automatique. Pour mesurer l'utilité de ces améliorations, nous évaluons ensuite l'effort temporel et technique de post-édition. Les résultats révèlent que les améliorations coïncident avec un effort de post-édition réduit. Puis nous considérons la faisabilité de la tâche de préédition dans le contexte des forums. Une expérience avec de vrais utilisateurs de forums suggère que la préédition interactive est accessible, les utilisateurs obtenant des améliorations faiblement inférieures à celles obtenues par des experts. Finalement, pour mesurer la portabilité du processus de préédition développé, nous effectuons des évaluations avec d'autres systèmes de TA, notamment des systèmes linguistiques, ainsi qu'avec des données extraites de forums d'autres domaines. Les résultats indiquent que la plupart des règles sont aisément portables.

Acknowledgements

First and foremost, I am deeply grateful to my advisor Pierrette Bouillon, without whom this thesis would never have come into existence. Her experience and enthusiasm have been invaluable for the completion of this work. I am greatly indebted to Sabine Lehmann for taking the time to introduce me to the Acrolinx technology and rule development, as well as for the many thought-provoking discussions. I am also very grateful to the other members of my thesis committee, Ana Guerberof, Manny Rayner and Johann Roturier, for providing valuable comments and suggestions. I would also like to thank Aurélie Picton for accepting the role of president of the jury, and for motivating me over these last years.

Many thanks to the members of the ACCEPT project for providing a stimulating research context. In particular, I would like to thank Victoria Porro, who significantly contributed to rule development and experiment setup. Many thanks also to Liliana Gaspar for setting up the pre-editing experiment with Norton forum users, and to Philip Koehn, who kindly let me use his Amazon Mechanical Turk Requester account for all my evaluations. I would also like to thank Magdalena Freund who took on the specialisation of Systran. I am also very grateful to the translators and AMT workers who completed a countless number of translation evaluations.

I would like to express my gratitude to my colleagues at the FTI/TIM department, Donatella, Lucía, Marianne, Nikos, Silvia and Violeta, who have been an invaluable source of advice and motivation. Special thanks to Claudia and Tobias who shared an office with me and had to deal with all my doubts and complaints.

I would like to acknowledge funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 288769.

Lastly, I would like to thank those closest to me, my partner Simon and the wonderful cats with whom we share our home, for bearing with me through all my ups and downs. I would like to dedicate this thesis to my parents, my mother Silke and my late father Dr. Dieter Gerlach.

Contents

List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Context
   1.3 Objectives
   1.4 Structure of the thesis
   1.5 Published work

2 Pre-Editing
   2.1 Introduction
   2.2 Pre-editing approaches
      2.2.1 Unknown words
      2.2.2 Grammar errors and unknown structures
      2.2.3 Ambiguity and complexity
      2.2.4 Reordering
   2.3 Pre-editing with Acrolinx
      2.3.1 Acrolinx technology
      2.3.2 Using Acrolinx
      2.3.3 Developing rules with Acrolinx
      2.3.4 Summary
   2.4 Pre-editing forum data with Acrolinx
      2.4.1 Input
      2.4.2 MT technology
      2.4.3 Target text
      2.4.4 Pre-editing in the forum context
      2.4.5 Summary
   2.5 Conclusion

3 Rule Development
   3.1 Introduction
   3.2 Developing Acrolinx rules
      3.2.1 Rule formalism
      3.2.2 Development environment
      3.2.3 Development methodology
   3.3 Spelling and grammar rules
      3.3.1 Non-word errors - Acrolinx spelling
      3.3.2 Real-word errors - Acrolinx grammar rules
      3.3.3 Performance of spelling and grammar rules on forum data
      3.3.4 Spelling and grammar - summary
   3.4 Punctuation and spacing rules
   3.5 Informal language rules
      3.5.1 Informal vocabulary
      3.5.2 Informal syntactic structures
   3.6 Controlled language and simplification rules
   3.7 Rules for the machine
      3.7.1 Development methodology
      3.7.2 Reformulation rules
      3.7.3 Clitics
      3.7.4 Reordering rules
      3.7.5 Informal second person
   3.8 Rule application in the forum context
   3.9 Rule precision
      3.9.1 Evaluation
      3.9.2 Results
      3.9.3 Summary
   3.10 Conclusion

4 Rule Evaluation: Impact
   4.1 Introduction
   4.2 Machine translation evaluation
      4.2.1 Human evaluation
      4.2.2 Automatic evaluation
   4.3 Evaluating the impact of pre-editing on SMT of forum data
      4.3.1 Participants
      4.3.2 Evaluation setup
   4.4 Rule by rule evaluation
      4.4.1 Data selection
      4.4.2 Results by rule categories
      4.4.3 Evaluator groups and rater agreement
   4.5 Global evaluation
      4.5.1 Human comparative evaluation
      4.5.2 Automatic evaluation
   4.6 Global evaluation - automatic rules only
   4.7 Conclusion

5 Rule Evaluation: Impact on Post-editing Temporal Effort
   5.1 Introduction
   5.2 Post-editing
      5.2.1 Assessing post-editing effort
      5.2.2 MT output quality and post-editing effort
   5.3 Post-editing effort and pre-editing
   5.4 Post-editing in ACCEPT
   5.5 First experiment - impact of successful pre-editing
      5.5.1 Data
      5.5.2 Participants
      5.5.3 Pre-editing task
      5.5.4 Translation
      5.5.5 Post-editing task
      5.5.6 Post-editing temporal effort
      5.5.7 Edit distance
      5.5.8 Summary
   5.6 Second experiment - impact of all pre-editing
      5.6.1 Experimental setup
      5.6.2 Post-editing task
      5.6.3 Post-editing temporal effort
      5.6.4 Edit distance
      5.6.5 Comparison with first experiment
      5.6.6 Summary
   5.7 Conclusion

6 Rule Evaluation: Portability
   6.1 Introduction
   6.2 Portability to other MT systems (Lucy, Systran)
      6.2.1 Specialising the rule-based systems
      6.2.2 Rule by rule evaluation with Lucy and Systran
      6.2.3 Global evaluation with Lucy and Systran
   6.3 Portability to other forums
      6.3.1 Forum selection
      6.3.2 Rule precision
      6.3.3 Impact on translation
   6.4 Conclusion

7 Rule Evaluation: Usability
   7.1 Introduction
   7.2 Pre-editing in the Norton Community forums
      7.2.1 The ACCEPT pre-editing plugin
      7.2.2 Participants
      7.2.3 Data selection
      7.2.4 Pre-editing scenarios
   7.3 Fully manual vs semi-automatic pre-editing
      7.3.1 Pre-editing activity
      7.3.2 Impact on translation
   7.4 Users against Experts
      7.4.1 Pre-editing activity
      7.4.2 Impact on translation
   7.5 Conclusion

8 Conclusion
   8.1 Achievements
   8.2 Limitations
   8.3 Future work

References

A Research Overview

B Data

C Pre-editing rules ordered by Set

D Rule by rule results
   D.1 Grammar (agreement)
   D.2 Grammar (mood/tense)
   D.3 Grammar (sequence)
   D.4 Homophone confusion
   D.5 Punctuation
   D.6 Informal
   D.7 Simplification
   D.8 Reformulation
   D.9 Informal 2nd person
   D.10 Clitics
   D.11 Reordering

E Pre-editing in the Norton Community forum: instructions for participants

F Post-editing guidelines

List of Figures

2.1 Pre-editing plugin in forum interface
3.1 Example of trigram extraction output
4.1 AMT evaluation interface
4.2 Tool evaluation interface
4.3 % Impact of individual grammar (agreement) rules
4.4 % Impact of individual grammar (mood/tense) rules
4.5 % Impact of individual grammar (sequence) rules
4.6 % Impact of individual homophone rules
4.7 % Impact of individual punctuation rules
4.8 % Impact of individual informal language rules
4.9 % Impact of individual simplification rules
4.10 % Impact of individual reformulation rules
4.11 % Impact of informal second person rule
4.12 % Impact of individual clitics rules
4.13 % Impact of individual reordering rules
5.1 Post-editing interface
5.2 XLIFF output from ACCEPT post-editing portal
6.1 % of sentences improved by grammar (agreement) rules
6.2 % of sentences improved by grammar (mood/tense) rules
6.3 % of sentences improved by grammar (sequence) rules
6.4 % of sentences improved by homophone rules
6.5 % of sentences improved by punctuation rules
6.6 % of sentences improved by informal language rules
6.7 % of sentences improved by simplification rules
6.8 % of sentences improved by reformulation rules
6.9 % of sentences improved by clitic rules
6.10 % of sentences improved by reordering rules
7.1 Pre-editing plugin in forum interface
B.1 Overview of data used
E.1 Pre-editing guidelines page 1 of 2
E.2 Pre-editing guidelines page 2 of 2
F.1 Post-editing guidelines page 1 of 4
F.2 Post-editing guidelines page 2 of 4
F.3 Post-editing guidelines page 3 of 4
F.4 Post-editing guidelines page 4 of 4

List of Tables

3.1 Grammar (agreement) rules
3.2 Grammar (tense/mood) rules
3.3 Grammar (sequence) rules
3.4 Sequences flagged by the wrongSeq rule
3.5 Homophone rules
3.6 Precision and recall of non-word error detection on 500 sentences
3.7 Precision and recall of non-word error correction on 500 sentences
3.8 Distribution of correction types for non-word errors on 500 sentences
3.9 Precision and recall of non-word error correction on 500 sentences, taking into account cases with multiple replacement suggestions
3.10 Precision and recall of real-word error detection on 500 sentences
3.11 Precision and recall of real-word error correction on 500 sentences
3.12 Distribution of correction types for real-word errors on 500 sentences
3.13 Punctuation and spacing rules
3.14 Informal language rules
3.15 Simplification rules
3.16 Reformulation rules
3.17 Clitic rules
3.18 Reordering rules
3.19 Informal second person rule
3.20 Pre-editing rule sets
3.21 Flags and precision of rule sets on 10,000 sentences
3.22 Causes of erroneous flags on forum data
4.1 Distribution of HITs among workers
4.2 Comparative evaluation results for grammar (agreement) rules
4.3 Comparative evaluation results for grammar (mood/tense) rules
4.4 Comparative evaluation results for grammar (sequence) rules
4.5 Comparative evaluation results for combined homophone rules
4.6 Comparative evaluation results for combined punctuation rules
4.7 Comparative evaluation results for combined informal language rules
4.8 Comparative evaluation results for simplification rules
4.9 Comparative evaluation results for reformulation rules
4.10 Comparative evaluation results for informal second person rule
4.11 Comparative evaluation results for combined clitic rules
4.12 Examples for the cliticsPersPron rule
4.13 Comparative evaluation results for combined reordering rules
4.14 Judgement distribution for both evaluator groups
4.15 Unweighted Cohen's Kappa computed over each evaluator pair for both scales
4.16 Contingency table for raters 1 and 2
4.17 Agreement between and within groups
4.18 Flags for each rule category on 1,030 sentences of Norton Community forum data
4.19 Flags by rule sets on 1,030 sentences of Norton Community forum data
4.20 Comparative evaluation results for complete pre-editing sequence
4.21 Document level automatic metric results
4.22 Sentence level correlation between metrics and human judgements
4.23 Comparative evaluation results for automatic pre-editing only
5.1 Flags for each pre-editing rule set
5.2 Comparative evaluation results
5.3 Throughput (words/min) for translations of raw and pre-edited data
5.4 Combined pre- and post-editing times (minutes), 1st experiment
5.5 TER scores computed between MT output and post-edited versions
5.6 Sentences where no post-editing was performed
5.7 Comparative evaluation results for 100 random posts
5.8 Throughput (words/min) based on editing time for translations of raw and pre-edited data, for each of the three cases
5.9 Throughput (words/min) based on thinking time + editing time for translations of raw and pre-edited data, for each of the three cases
5.10 Estimation of global impact of pre-editing on post-editing throughput
5.11 Estimated combined pre- and post-editing times (minutes), 2nd experiment
5.12 TER scores computed between MT output and post-edited versions
5.13 Global throughput in both experiments
5.14 Proportion of sentences where MT output was left unedited for both experiments
6.1 Comparative evaluation of grammar (agreement) rules for three MT systems
6.2 Comparative evaluation of grammar (mood/tense) rules for three MT systems
6.3 Comparative evaluation of grammar (sequence) rules for three MT systems
6.4 Comparative evaluation of homophone rules for three MT systems
6.5 Comparative evaluation of punctuation rules for three MT systems
6.6 Comparative evaluation of informal language rules for three MT systems
6.7 Comparative evaluation of simplification rules for three MT systems
6.8 Comparative evaluation of reformulation rules for three MT systems
6.9 Comparative evaluation of tuVous rule for three MT systems
6.10 Comparative evaluation of clitic rules for three MT systems
6.11 Comparative evaluation of reordering rules for three MT systems
6.12 Comparative evaluation results for complete pre-editing sequence for three MT systems
6.13 Rule precision on 10,000 sentences by categories for the three forums
6.14 Comparative evaluation results for alternate forums
7.1 Pre-editing activity for the two user scenarios
7.2 Edit distance (words) between raw and pre-edited versions for user scenarios
7.3 Comparative evaluation results for Raw vs SemiAuto and AllManual
7.4 Comparative evaluation results for SemiAuto vs AllManual
7.5 Edit distance between raw and pre-edited versions for all scenarios
7.6 Flags rejected by forum users
7.7 Comparative evaluation results for Raw vs Expert and Raw vs Oracle
7.8 Comparative evaluation results for User against Expert
A.1 Research overview
C.1 Set 1 (rules for humans; automatic application)
C.2 Set 2 (rules for humans; interactive application)
C.3 Set 3 (rules for the machine; automatic application)
D.1 Pre-editing rules

1 Introduction

1.1 Motivation

The Web 2.0 paradigm, which has transformed users from passive viewers to active contributors, has brought into existence a new form of textual data: user-generated content (UGC). Forums, blogs and social networks are increasingly used by online communities to share information about a wide range of topics, from technical issues like IT support to all sorts of hobbies, crafts, trades, health and lifestyle topics. UGC now represents a large share of the informative content available on the web. While technically this content is readily available to anyone with internet access, it is in fact accessible only to those users who understand the language in which it was written. To breach the language barrier and make this content more widely accessible, some form of translation is necessary. Since human translation is not an option, due both to the sheer volume of data and to the cost this would engender, machine translation (MT) seems the most practical solution.

However, UGC presents multiple challenges for machine translation. In the context of a forum, where the focus is on solving problems, linguistic accuracy is often not a priority. Spelling, grammar and punctuation conventions are not always respected. Additionally, the language used is closer to spoken language, using informal syntax, colloquial vocabulary, abbreviations and technical terms. This combination of poor linguistic quality and informal language makes this content difficult to translate automatically (Carrera et al., 2009; Roturier & Bensadoun, 2011; Jiang et al., 2012).

To address these issues, two possibilities have been investigated: adapting MT systems to handle this type of data (e.g. Banerjee et al., 2011, who perform a language and translation model adaptation) or pre-processing the data to bring it closer to the traditional text forms that MT systems are designed to handle (e.g. Jiang et al., 2012, who perform pre-processing by means of regular expressions). The latter process, pre-editing, and its application to forum content are the main focus of the present thesis.

Pre-processing text to improve MT is an old topic (e.g. Ruffino, 1981). A fair share of research on machine translation improvement has gone into investigating the input that is being translated, the difficulties it presents for MT systems, and how they can be removed by pre-editing to improve the translatability of text (e.g. Bernth & Gdaniec, 2001). Pre-editing can take on numerous forms, such as spelling and grammar checking, normalisation, the application of controlled languages (CLs), simplification and reordering.

The difficulties of natural language tackled by these pre-editing approaches are by no means exclusive to machine translation. Indeed, they are problematic for many other natural language processing (NLP) tasks. Of the numerous approaches that try to deal with these issues, only a fraction have been developed specifically to improve machine translation. Nevertheless, since many processing steps such as part-of-speech tagging or parsing are common to multiple NLP tasks, the approaches found to facilitate these steps could also be shared across tasks, and thus be beneficial to MT.

Traditionally, pre-editing is called for in authoring situations where high quality content is produced, for example for technical documentation (Huijsen, 1998). It has only recently been associated with community content, and then mostly in the form of normalisation, investigated for the processing of social media data, such as Twitter, to make it more accessible to processes like data mining or sentiment analysis (e.g. Han & Baldwin, 2011; Clark & Araki, 2012; Sidarenka et al., 2013). Few studies, however, have focussed on pre-editing of forum data to improve machine translation (exceptions include Banerjee et al., 2012 and Lehmann et al., 2012).

When applied before machine translation, pre-editing has mostly been associated with rule-based machine translation (e.g. Pym, 1988; Mitamura & Nyberg, 1995; Bernth, 1998). Controlled languages in particular have often been combined with, or even developed specifically for, rule-based machine translation (Nyberg & Mitamura, 1996). This can be explained by the fact that the difficulties encountered by RBMT, such as dealing with specific ambiguities, are well known. It is thus relatively straightforward to pre-process the data in order to reduce them. In the context of statistical machine translation, identifying transformations that will improve translatability is not as straightforward. With the exception of a few studies (e.g. Aikawa et al., 2007; Temnikova, 2010; Lehmann et al., 2012), pre-editing, in particular in the sense of a controlled language, has rarely been associated with statistical machine translation.

1.2 Context

Most of the work described in this thesis was carried out within the context of the ACCEPT (Automated Community Content Editing PorTal) research project (http://www.accept-project.eu/). This project, funded by the European Community's Seventh Framework Programme, brought together the Universities of Edinburgh and Geneva, Acrolinx, Symantec and Lexcelera/Translators Without Borders from 2012 to 2014. It aimed at improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies, thus allowing community content to be shared across the language barrier.

The context given by this project has conditioned our choice of data, pre-editing technology and MT system. Within the project, the forums used are the Norton Community forums (http://community.norton.com/), administered by Symantec, one of the partners in the project. Pre-editing and post-editing are performed with the technology of another project partner, the Acrolinx IQ engine (Bredenkamp et al., 2000). This rule-based engine uses a combination of NLP components and enables the development of declarative rules, which are written in a formalism similar to regular expressions, based on the syntactic tagging of the text. Acrolinx allows both automatic and interactive rule application. The machine translation system, developed by the University of Edinburgh, is a phrase-based Moses system trained using the standard Moses pipeline (Koehn et al., 2007) with Symantec translation memory data, complemented by Europarl and news-commentary data. Forum text was included in the data used to train the language models. The data and technology just listed are thus the main resources used in the present thesis.
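To give a concrete, if simplified, picture of what such a declarative rule does, the sketch below flags one informal French pattern over POS-tagged tokens and proposes a rewrite. This is a minimal Python illustration of the general mechanism only, with an invented rule, tag set and example sentence; it is not the Acrolinx rule formalism, which is introduced in Chapter 3.

```python
# Hypothetical pre-editing rule (illustration only, not an Acrolinx rule):
# flag the colloquial "ya" and suggest the standard "il y a", so that the
# SMT system sees in-coverage input. Tokens are assumed to arrive as
# (surface, POS) pairs from a tagger.

def apply_rule(tagged_tokens):
    """Return a list of (position, original, suggestion) flags."""
    flags = []
    for i, (surface, pos) in enumerate(tagged_tokens):
        # Match the contraction "ya", or a bare "y a" not preceded by "il".
        if surface.lower() == "ya" or (
            surface.lower() == "y"
            and i + 1 < len(tagged_tokens)
            and tagged_tokens[i + 1][0] == "a"
            and (i == 0 or tagged_tokens[i - 1][0].lower() != "il")
        ):
            flags.append((i, surface, "il y a"))
    return flags

# Invented example: "ya un probleme avec le scan"
tokens = [("ya", "X"), ("un", "DET"), ("probleme", "NOUN"),
          ("avec", "PREP"), ("le", "DET"), ("scan", "NOUN")]
print(apply_rule(tokens))   # [(0, 'ya', 'il y a')]
```

In interactive mode, such a flag would be presented to the author as a suggestion to accept or reject; in automatic mode, the replacement would simply be applied before translation.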


1.3 Objectives

The central objective of the thesis is to investigate whether pre-editing can improve the outcome of statistical machine translation of French forum data. To address this question, a number of aspects must be considered, from the definition of a pre-editing process to the means of measuring the improvement of MT, but also the feasibility of pre-editing in a forum context. We shall now present these aspects in further detail and introduce the research questions addressed by this thesis.

Pre-editing can take on many forms. Our aim is to explore various types of transformations to see whether they are applicable to French forum data and whether they improve translation. Rule development is motivated by aspects of the input data and of the SMT system: on the one hand we have error-fraught forum content, on the other a system trained on clean published data, from the same domain but different in terms of register. The developed rules thus include rules for spelling and grammar correction, normalisation, disambiguation, simplification, reformulation and reordering. Our objective is not to produce an extensive set of rules, but rather to investigate different types of transformations to identify which ones should be considered for more extensive development.

An important aspect of rule development is that, besides being problematic for MT, the forum data can be equally challenging for the technology used for pre-editing. Pre-editing resources must therefore be sufficiently robust. Consequently, our objective is to investigate whether a technology such as Acrolinx, relying on a simple declarative formalism in combination with shallow NLP components, is flexible enough to describe the different types of rules while being robust enough to handle forum data. The research question we seek to answer is whether a declarative rule-based formalism such as Acrolinx is suited to pre-editing forum data.

Since our focus is on translation improvement, rule development and selection is driven by the impact of source transformations on MT. A reliable evaluation method to measure this impact is thus essential. We describe a human comparative evaluation framework, which allows us to quickly determine whether the translation of a pre-edited version is better or worse than that of a raw version. This framework serves multiple purposes. First, in the rule development context, identifying successful and unsuccessful transformations is indispensable to the fine-tuning of rules. Second, it serves as a basis for rule selection, by enabling the identification of the most appropriate rules for a given system and data. Ultimately, this framework allows us to establish the impact of the pre-editing process as a whole and thus serves to answer the question of whether pre-editing can improve statistical machine translation of French forum data.

With the objective of improving the efficiency of this human evaluation, we investigate the possibility of crowdsourcing judgements using an online microworking platform (Amazon Mechanical Turk). This raises another question, namely whether a comparative translation evaluation on Amazon Mechanical Turk (AMT) can produce results comparable to evaluations performed with language professionals.

While comparative evaluation allows the identification of improvement, it gives no insight into the usefulness of these improvements. Taking a broader view, where pre-editing is part of a machine translation process which aims at producing a final, usable translation, pre-editing should contribute to bringing the MT output closer to this final translation, thereby reducing post-editing effort, i.e. the human effort involved in the correction of MT output. Therefore, to assess the usefulness of pre-editing, we will investigate how the improvements identified by our comparative evaluation relate to post-editing effort. More precisely, the research question we seek to answer is whether pre-editing that is found to improve MT also reduces post-editing effort.

Another aspect to consider in this context is that pre-editing, when it cannot be done automatically, for example in the case of complex or ambiguous transformations, also requires effort. This raises the question of return on investment, since it would make little sense to invest effort in pre-editing if the resulting changes had little or no impact on post-editing effort. Thus we will address a second research question relating to post-editing, namely how the effort invested in pre-editing relates to the gain in terms of post-editing effort.

The pre-editing approach developed in this study focuses on the translation of IT forum data with a specially trained SMT system, which is only one amongst a multitude of possible cases of machine translation of forum data. To investigate whether this approach can be generalised, we will address two further research questions, namely whether the pre-editing approach defined in this study is portable to other MT systems, notably RBMT systems, and whether it is portable to forums from another domain. Investigating these issues will also provide insights into which rules are specific and which must be specialised.


Finally, we must consider whether the developed pre-editing rules can successfully be implemented in a forum. The best pre-editing rules with high impact on machine translation will be of little use if they cannot be applied reliably. Since part of the rules require interactive application, and in the forum context the pre-editing task will have to be accomplished by the community members themselves, it is necessary to investigate whether these rules are accessible to forum users. Our final research question therefore is whether forum users can successfully perform the transformations required by our pre-editing rules, and thus achieve the desired impact on MT output.

1.4 Structure of the thesis

The body of the thesis is divided into two main parts. The first, consisting of Chapters 2 and 3, presents the pre-editing rules and their development. The second (Chapters 4 to 7) describes the different evaluations of the rules. In more detail, the content of the individual chapters is as follows:

Chapter 2 presents the pre-editing approach. After an overview of common pre-editing practices, such as spell-checking, normalisation and controlled language application, we consider how pre-editing can be applied to forum data.

Chapter 3 describes rule development with Acrolinx. An introduction to the Acrolinx rule formalism is followed by a description of the different resources used for rule development and the related rule development methodologies. We then describe the specific rules developed in this thesis. An evaluation of rule precision addresses the question of the suitability of the Acrolinx formalism for the task of pre-editing forum data.

Chapter 4 presents the evaluation of the impact of pre-editing rules on machine translation. After describing the comparative human evaluation methodology, we provide results for the impact of pre-editing rules on forum data, both on a rule-by-rule basis and as a complete process. This evaluation shows that in our application pre-editing has a significant positive impact on SMT output.


Chapter 5 addresses the question of the impact of pre-editing on post-editing effort. We describe two experiments involving post-editing of translations of raw and pre-edited content. The results show that pre-editing that improved translation quality also significantly reduces post-editing time.

Chapter 6 addresses the question of the portability of the rules to other domains and other MT engines. In the first part of this chapter, we focus on the portability to other MT systems, by evaluating the impact on two rule-based systems: one uses a transfer approach, the other an improved direct approach. In the second part, we evaluate the usefulness of the pre-editing rules for a forum taken from another domain (DIY), by evaluating both rule precision and impact on machine translation.

Chapter 7 addresses the question of rule usability by real forum users. In this chapter we describe an experiment where pre-editing rules are applied by real users of the Norton Community forums. The results show that rule application by forum users is close to that by experts.

Chapter 8 concludes, presents limitations of this work and outlines future work.

Appendix A presents a summary of the different experiments performed in this thesis to address the research questions. An overview of the datasets used for these experiments is provided in Appendix B.

1.5 Published work

The work described in this thesis has been discussed in previous publications. A first evaluation of the pre-editing rules was published in Gerlach et al. (2013b). A subset of the pre-editing rules described in this thesis was used in an experiment investigating hybrid vs rule-based approaches for the correction of homophones (Bouillon et al., 2013). Results of the first post-editing experiment were published in Gerlach et al. (2013a). The pre-editing experiment involving Norton Community forum users was published in Bouillon et al. (2014). A second evaluation of the impact of the pre-editing rules was published in Seretan et al. (2014). Finally, results pertaining to rule portability were published in the project deliverable ACCEPT D9.2.4 (2014).


2 Pre-Editing

This chapter provides background on pre-editing for machine translation, introduces the Acrolinx pre-editing technology and outlines our pre-editing approach for forum data.

2.1 Introduction

In the current research on machine translation improvement, a fair share of attention has gone to the input that is being translated and what difficulties it presents for MT systems. Pre-processing methods have been considered to remove these difficulties, and thus improve the translatability of text (e.g. Bernth & Gdaniec, 2001). These pre-editing methods focus on the different aspects of natural language that are problematic for machine translation.

Difficulties begin at the word level: MT systems are thwarted by unknown words, i.e. tokens that are not covered by the system's resources and therefore cannot be processed. Often, unknown words are the result of misspellings; thus a commonplace form of pre-editing is spell-checking. Unknown words can also result from the use of non-standard tokens, such as the colloquial language used in social media data. An increasing number of approaches, grouped under the label of normalisation, transform these non-standard tokens.

The next problem comes with the way words are arranged into phrases and sentences. If the structures do not match those the MT system can process, based either on its linguistic resources or on its training data, they will not be translated correctly.


This can happen in the case of sentences that do not match the conventions of a language; in these cases, grammar checking would be a useful form of pre-editing. Even in the absence of errors, the ambiguities of natural language, both at the lexical and the syntactic level, can be difficult to handle. Long, complex sentences with multiple clauses and long-distance dependencies are difficult to translate. Thus probably the best-known form of pre-editing is the application of controlled languages (CLs), languages defined by sets of rules designed to reduce the ambiguity and complexity of texts. On the same principles, but for different applications, simplification approaches also aim at reducing complexity. Finally, more specifically for statistical MT as well as for rule-based systems using a direct translation architecture, word order differences between source and target language are also problematic (Niessen & Ney, 2001). This has led to the development of reordering approaches, i.e. transforming word order to improve translatability (Collins et al., 2005).

These difficulties of natural language are by no means exclusive to machine translation. Indeed, they are problematic for many other NLP tasks. Numerous approaches have thus been developed to deal with these issues, only a fraction of them specifically to improve machine translation. Nevertheless, since many processing steps such as POS tagging or parsing are common to multiple NLP tasks, the approaches found to facilitate these steps could also be shared across tasks.

While pre-editing texts to improve human readability or MT performance is an old topic (e.g. Ruffino, 1981), the pre-editing approach we will explore in the present thesis differs from the approaches found in the literature in several respects. Most studies involving pre-editing have focussed on one approach at a time, such as applying a controlled language to improve translation (e.g. Mitamura & Nyberg, 1995) or normalising certain phenomena in social media to facilitate processing (e.g. Sidarenka et al., 2013). In our study we intend to combine aspects of different approaches, including several kinds of transformations that can potentially improve translation. To achieve this, we will use Acrolinx, a content control tool which allows the development of different types of rules, thereby making it possible to include different transformations within a single tool.

Often, pre-editing is called for in authoring situations where high quality content is produced, for example for technical documentation. It has therefore rarely been associated with community content. One exception is normalisation, which has recently been applied to social media data, such as Twitter, to make it more accessible to processes like data mining or sentiment analysis (e.g. Clark & Araki, 2012; Han & Baldwin, 2011; Sidarenka et al., 2013). Few studies, however, have focussed on pre-editing of forum data (Roturier & Bensadoun, 2011).

Pre-editing often involves complex rules, such as controlled language rules, and the application of rules is considered a difficult task (Goyvaerts, 1996). It is generally performed by professional writers. In our context, however, pre-editing will have to be performed by forum users, who present an entirely different profile. Defining an accessible and reliable pre-editing process for this context is a novel question.

Finally, when applied before machine translation, pre-editing has mostly been associated with rule-based machine translation (e.g. Mitamura & Nyberg, 1995; Bernth, 1998; Pym, 1988). Controlled languages in particular have often been combined with, or even developed specifically for, rule-based machine translation (Nyberg & Mitamura, 1996). Since the difficulties encountered by RBMT, such as dealing with specific ambiguities, are well known, it is relatively straightforward to pre-process the data in order to reduce these. In the context of statistical machine translation, identifying transformations that will improve translatability is not as straightforward. Only a few studies investigate pre-editing in association with SMT.

This chapter is organised as follows: we begin with a first section providing the background on different pre-editing approaches and their association with machine translation (2.2). We then introduce the pre-editing technology used in this thesis, Acrolinx (2.3). We conclude by outlining how these approaches and technology are combined in the pre-editing approach we have defined with the objective of improving machine translation of forum data (2.4).

2.2 Pre-editing approaches

Different pre-editing approaches attempt to resolve or diminish the issues of natural language to make text more easily processable, and possibly more translatable. In this section, we will discuss the difficulties natural language presents for machine translation, and the different pre-editing approaches that could reduce these difficulties. We begin this section with the problem of unknown words (2.2.1), continue with grammar issues (2.2.2), followed by ambiguity and complexity (2.2.3), and conclude with word order differences between language pairs (2.2.4).

2.2.1 Unknown words

A very common problem encountered by MT systems of any kind is unknown words, also called out-of-vocabulary items (OOVs). For an RBMT system, these are words that are not in the system's linguistic resources or dictionaries; for SMT, these are words absent from the training data. Their occurrence invariably results in poor translations, where these words are either left untranslated or removed altogether. Worse, these OOVs can disrupt the analysis and the translation of the entire context. OOVs can be of different natures: domain-specific terminology, proper nouns, abbreviations, colloquial language, misspelt words, letter and digit sequences, acronyms or tokens resulting from word boundary infractions. Handling domain terminology and well defined abbreviations or acronyms is generally considered a matter of system specialisation rather than pre-processing, and will not be discussed here. The other issues, however, can benefit from pre-processing to replace them with in-coverage tokens. The task of identifying and replacing such undesirable tokens has led to much research, with spell-checking addressing misspelt words and normalisation handling non-standard tokens. We will now provide some background for these two approaches and discuss their application in the context of MT.
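As a minimal illustration of the problem (not part of the thesis's pipeline), the sketch below checks an invented informal forum sentence against an invented training vocabulary and lists the tokens an SMT system would treat as out-of-vocabulary.

```python
# Illustration only: finding OOV tokens with respect to an SMT system's
# training vocabulary. Both the vocabulary and the sentence are toy examples.

def find_oov(sentence, vocabulary):
    """Return tokens that the MT system has never seen in its training data."""
    return [tok for tok in sentence.lower().split() if tok not in vocabulary]

training_vocabulary = {"mon", "ordinateur", "ne", "démarre", "plus",
                       "après", "la", "mise", "à", "jour"}
forum_sentence = "mon ordi demarre plus apres la maj"   # informal, misspelt forum French

print(find_oov(forum_sentence, training_vocabulary))
# ['ordi', 'demarre', 'apres', 'maj'] -> likely left untranslated or dropped
```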

2.2.1.1 Spell-checking - Correcting non-word errors

Spelling errors that produce OOVs are ones that result in words that do not exist, commonly referred to as non-word errors. Identifying and replacing these is the task of spell-checkers, development of which began in the sixties (Blair, 1960; Damerau, 1964). The first tools identified misspellings in a text by looking up each token in a correct word list (often also referred to as dictionary or lexicon in the literature), and rejecting those that could not be found. This a priori simple task requires correct segmentation and tokenisation of text into words. Depending on the language, identifying word boundaries correctly can be quite complicated: punctuation, spacing characters and casing have to be taken into account, each with their share of exceptions (Fontenelle, 2006). Correct tokenisation becomes even more complicated in the presence of word

12

2.2 Pre-editing approaches

boundary infractions, such as words that are run together (e.g. *tuas fait (youhave done)) or single words that are mistakenly split (e.g. *mon ordina teur (my comp uter)). Besides correct tokenisation, the efficiency of non-word error detectors depends greatly on the word list against which tokens are checked. The larger the word list, the better the coverage, and the lower the chance that valid words will be flagged as incorrect. However, an overly large dictionary can also be problematic, since misspellings that coincide with rare valid words will go undetected. Once non-word errors have been identified, replacement candidates must be found. The most common algorithms to find replacement candidates use some form of word similarity measure (Damerau, 1964; Levenshtein, 1966; Wagner, 1974). A widespread measure is the minimum edit distance, often also referred to as Damerau-Levenshtein distance, the principle of which is to find the minimum number of edit operations (insertions, deletions, substitutions or transpositions) to transform one token into another. Replacement candidates are thus selected among words with a minimum edit distance to the misspelt word. Edit distance is also used to rank replacement candidates, on the assumption that the shorter the distance, the better the candidate. A more sophisticated method to transform misspelt words into a correct equivalent is rule-based correction, which uses transformation rules based on patterns inferred from a set of misspellings (Yannakoudakis & Fawthrop, 1983). A further development to find replacement candidates is the noisy channel model (Shannon, 1948). Widely used in the domain of speech recognition (Jelinek, 1997), this model was first applied to spelling correction in the nineties (Kernighan et al., 1990; Mays et al., 1991). The aim of this model is to find the most likely valid word, given the observed distorted word. Finally, yet another approach is based on the assumption that many errors are caused by confusions. It uses so-called confusion lists, or confusion sets, which are sets of words that are likely to be confused with each other (e.g. Yarowsky, 1994; Pedler & Mitton, 2010). While intuitively the words with the shortest edit distance would seem to be the most obvious suggestion, an important exception is language-specific phonetic confusions. The mathematical approach based on sequences of edit operations is particularly unfavourable for phonographic errors (e.g. the confusion of f and ph), which are frequent in French (Veronis, 1988). Finding the correct replacement for these words is therefore not a result of minimal edits, but of specific replacement based on phonetic
similarity (Fontenelle, 2006). To this end, Brill & Moore (2000) improve a noisy channel model by enhancing the error model to include edit operations composed of multiple letters (such as “ph|f” or “ent|ant”), bringing the model closer to errors humans actually make. In a similar line of reasoning, Toutanova & Moore (2002) incorporate word pronunciation information in a noisy channel model, thus achieving a substantial performance improvement. Another earlier approach using phonographic information is that investigated by van Berkel & De Smedt (1988), which relies on a so-called triphone analysis to correct both orthographical and typographical errors in Dutch. Finally, we should also mention some more language-specific corrections, such as the restoration of diacritics. In the early days of computer technology, when encodings supporting accented characters were not commonplace and standard keyboards were not adapted to typing these characters, accents were often left out. Simard & Deslauriers (2001) describe a method to reinsert missing accents based on a statistical language model. Another phenomenon specific to French is the omission of the apostrophe in the case of an elision, which leads to a simple concatenation of an article or pronoun with the following word (*sinstalle, which should be s’installe). Including grammatical information in the lexicon allows correct insertion of the apostrophe in such cases (Fontenelle, 2006). As we have seen, numerous methods exist to identify and replace non-word errors, and checking spelling has become very simple. Spell-checkers have nowadays become ubiquitous: many tools where users can enter text, from text processors to e-mail clients, or even website forms and search engines, have some kind of checking functionality. As an example, Google Translate’s online interface (https://translate.google.com/) includes on-the-fly spell-checking. The ubiquity of spell-checking probably explains why its impact on machine translation has not been much studied, although some studies (e.g. Banerjee et al., 2012; Liu et al., 2012; Khan et al., 2013) include a spell-checking step as part of a pre-editing process.
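To make the candidate selection described above concrete, the following Python sketch ranks replacement candidates for a non-word by Damerau-Levenshtein distance against a word list. It is a minimal illustration only: the toy lexicon and misspelling are invented, and a real spell-checker would combine edit distance with frequency, phonetic or contextual information, as the studies cited above do.

```python
def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein (optimal string alignment) distance: the minimum
    number of insertions, deletions, substitutions and adjacent transpositions
    needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def rank_candidates(non_word, lexicon, max_dist=2):
    """Return in-lexicon words within max_dist edits of non_word, closest first."""
    scored = [(word, dl_distance(non_word, word)) for word in lexicon]
    return sorted([(w, d) for w, d in scored if d <= max_dist], key=lambda x: x[1])


# Toy example: the transposition in *ordinatuer is one edit away from ordinateur.
print(rank_candidates("ordinatuer", ["ordinateur", "ordinaire", "ordonner", "orientation"]))
```

In practice the candidate search would be constrained, for example with a length filter or a trie, rather than scanning the whole lexicon.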

2.2.1.2 Normalisation - Replacing non-standard tokens

Besides misspelt words, a number of other non-standard tokens, such as letter and digit sequences, acronyms, abbreviations, informal language, URLs etc. can disrupt
machine processing. The process of replacing non-standard tokens by some standard form is referred to as normalisation. It is found to have a positive impact on tasks such as lemmatisation, POS-tagging and assignment of morphosyntactic features (Melero et al., 2012). Most normalisation approaches are closely related to the spell-checking task, with distorted or contracted words being replaced by their “normal” equivalent. A common issue is the expansion of abbreviations, for which both rule-based and machine learning approaches have been investigated. Rowe & Laitinen (1995) describe a semi-automatic technique to expand abbreviations, using a rule-based approach and a large dictionary, which is found to improve the readability of technical text, programs and technical captions. In the medical domain, where many abbreviations and acronyms can stand for different things depending on the medical speciality concerned and the context, Pakhomov (2002) considers abbreviation normalisation as a special case of word sense disambiguation, and resolves it by means of a Maximum Entropy classifier. Besides abbreviations, normalisation also treats unusual lexical elements, such as the informal language used in social media. To normalise English tweets, Clark & Araki (2012) use a manually compiled and verified database of “casual English” words mapped to their normalised English equivalent to replace tokens. Additionally, phrase matching rules are employed to identify misused existing words (e.g. right/rite). This pre-processing is found to improve machine translation by reducing the number of untranslated words. Relatedly, Sidarenka et al. (2013) use a rule-based method to preprocess German tweets, replacing Twitter-specific phenomena with artificial tokens, using regular expressions to check left and right context. This approach is found to improve the accuracy of POS tagging. Another approach to normalise Twitter data uses a confusion set of in-vocabulary normalisation candidates generated for each OOV word (Han & Baldwin, 2011; Han et al., 2013). Candidates are then selected based on multiple factors, such as lexical edit distance, morphophonemic similarity, prefix and suffix substrings, and the longest common subsequence. An extrinsic evaluation of this approach shows improvement of POS tagging accuracy. Considering expressions instead of individual tokens, another approach uses paraphrases to normalise travel conversations (Shimohata & Sumita, 2002). Expressions are replaced by more frequent synonymous expressions extracted from a parallel corpus by grouping sentences that
have the same translation. The effect of these replacements on NLP tasks is not measured. Normalisation approaches often also use noisy channel models, widely employed in spelling correction tools, to find the most probable valid word given a distorted or compressed word (Clark, 2003; Choudhury et al., 2007; Cook & Stevenson, 2009; Beaufort et al., 2010). In these studies, the normalisation process is evaluated intrinsically, and no evaluation of its impact on other NLP tasks is performed. Another approach for text normalisation is inspired by machine translation, where social media language is considered as a foreign language. The normalisation task is thus approached as a translation task. Aw et al. (2006) describe a phrase-based statistical machine translation approach to the task of transforming SMS language into normal English. Kaufmann & Kalita (2010) use a two-step approach for syntactic normalisation of Twitter messages, combining a pre-processor with a Moses machine translation system to convert phrases, in order to remove noise and thus improve readability and facilitate processing with NLP tools. Considering that the language used in social media is similar to spoken language, Kobus et al. (2008) combine a machine translation approach with a system inspired by automatic speech recognition to normalise French SMS. An SMS stream is converted into a phone lattice, which is then decoded into text. While many normalisation approaches are close to spell-checking, some actually include a spelling component or module. Banerjee et al. (2012) use a combination of approaches for the normalisation of technical English user forum data. Tokens such as URLs or paths are replaced by placeholders by means of regular expressions, and spelling errors are treated with an off-the-shelf spell-checker adapted to the domain by including domain-specific terms. This pre-processing is found to improve SMT in terms of BLEU scores. Similarly, Khan et al. (2013) investigate the impact of text normalisation on parsing web data, also replacing URLs and emoticons with placeholders and correcting spelling. Liu et al. (2012) describe a broad-coverage normalisation system integrating three sub-normalisers which each suggest candidates, the first performing letter transformations, the second selecting words on the basis of visual similarity, and the third being a spell-checker. While most of the research on normalisation has not focussed specifically on the improvement of translatability (with some exceptions, e.g. Banerjee et al. (2012); Clark
& Araki (2012)), and most approaches have been tested only intrinsically, we suspect that many of them could also benefit machine translation.
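As an illustration of the placeholder strategy used, for instance, by Banerjee et al. (2012) and Khan et al. (2013), the Python sketch below masks URLs, Windows paths and emoticons before further processing and remembers the originals so they can be restored afterwards. The patterns and placeholder tokens are simplified assumptions for this sketch, not those of any of the cited systems.

```python
import re

# Deliberately simplified patterns; real systems use much larger inventories.
PATTERNS = [
    (re.compile(r"https?://\S+"), "_URL_"),
    (re.compile(r"[A-Za-z]:\\\S+"), "_PATH_"),
    (re.compile(r"[:;]-?[)(DPp]"), "_EMOTICON_"),
]


def normalise(text):
    """Replace non-standard tokens by placeholders; keep the originals so a
    post-processing step can restore them after machine translation."""
    memory = []
    for pattern, placeholder in PATTERNS:
        for match in pattern.findall(text):
            memory.append((placeholder, match))
        text = pattern.sub(placeholder, text)
    return text, memory


masked, memory = normalise("voir http://community.norton.com/ ou C:\\Windows\\System32 :)")
print(masked)   # voir _URL_ ou _PATH_ _EMOTICON_
print(memory)   # [('_URL_', 'http://community.norton.com/'), ('_PATH_', 'C:\\Windows\\System32'), ('_EMOTICON_', ':)')]
```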

2.2.1.3 Summary

The sources of OOVs problematic for machine translation are numerous: misspelt words, domain-specific terminology, proper nouns, abbreviations, colloquial language or non-standard letter and digit sequences, among others. While some of these can be taken care of by system specialisation, others require pre-processing to replace them with in-coverage tokens. Two technologies provide the means of completing this task: spell-checking and normalisation. Spell-checking has been the object of much research and many rule-based and data-driven methods now exist to identify errors and find replacement candidates. Spell-checking as such has no particular association with machine translation, and its impact on MT has hardly been studied. This is unsurprising: since most texts considered for translation were published texts in the traditional sense, which had undergone editing and revision, misspelt words were not an issue. Normalisation, which aims at replacing non-standard tokens to facilitate subsequent processing, has gained increasing interest with the emergence of social media data. Among other applications, normalisation has been found to improve machine translation. In our particular context, we expect OOVs to be an important issue, for several reasons. First, considering the linguistic quality of forum data, misspelt words are bound to be frequent. Spell-checking will therefore be necessary. Second, the technical domain to which our data belong, the IT domain, is known for its complex jargon and use of anglicisms, numerous abbreviations, proper nouns, or unusual tokens such as the names of files or processes. Finally, as no large amounts of bilingual aligned data of the same nature as the forum content we propose to translate were available, the SMT system was trained on more traditional data, mainly Europarl (Koehn, 2005) and Symantec translation memories. While the translation memory data are from the same domain, they consist mostly of text from user manuals, which differs strongly from the forum data on the lexical level. For all these reasons, our pre-editing process will include spell-checking for non-word errors and normalisation-inspired rules to handle abbreviations, non-standard tokens and informal language. We will focus on tokens
that are not covered by the system (i.e. that are absent from the training data), and attempt to replace these with in-coverage alternatives.

2.2.2 Grammar errors and unknown structures

Beyond word-level difficulties, MT systems also struggle with ill-formed structures. This can again be caused by errors, such as incorrect syntax, which disrupt processing steps such as tagging or parsing. Correcting grammar and punctuation before machine translation is found to improve output (Bernth & Gdaniec, 2001; O’Brien, 2005). The issue of ungrammaticality is however considered more problematic for rule-based systems, which require a correct analysis, than for SMT systems, which are more robust in the face of grammar/syntax inaccuracies (Carrera et al., 2009). Correcting these syntax and grammar errors also lies within the scope of spell-checkers, yet these require more sophisticated technologies to identify what are commonly referred to as real-word errors, or context-sensitive errors, since analysis of the context is required for their detection. Errors can be of a syntactic nature (e.g. agreement errors) or at the semantic, discourse structure or pragmatic level (Kukich, 1992). Two main approaches are distinguished for grammar checking: syntax-based approaches, and approaches combining tagging and patterns, often taking the form of regular expressions. Syntax-based approaches use some form of parsing to analyse structures for correctness. These approaches mostly rely on grammars to describe correct structures, and different strategies to allow parsing of ungrammatical content (Carbonell & Hayes, 1983; Thurmair, 1990). The other approaches, which are sometimes referred to as pattern-based approaches since they use pattern-matching to identify errors, perform POS-tagging or other shallow linguistic annotations and then apply rules describing sequences of tokens or tags. The sequences are either developed manually (e.g. Acrolinx (Bredenkamp et al., 2000) or LanguageTool (Naber, 2003)), or derived statistically by data-driven methods.
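To give a concrete, if highly simplified, flavour of the pattern-based family, the Python sketch below scans a POS-tagged sentence for one frequent French real-word confusion, the homophones a (has) and à (to). The toy tag set, the pre-tagged input and the single heuristic are assumptions for illustration only, and the sketch is far cruder than the Trigger/Evidence rules presented in the next chapter.

```python
# Toy tag set for the example: PRON, VERB, VINF (infinitive), DET, NOUN, PREP.

def flag_a_confusion(tagged):
    """Flag the token 'a' when it is directly followed by an infinitive, a
    context in which the preposition 'à' is usually intended
    (e.g. 'il a du mal a comprendre' -> '... à comprendre')."""
    flags = []
    for i, (token, _) in enumerate(tagged):
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if token == "a" and nxt is not None and nxt[1] == "VINF":
            flags.append((i, token, "à"))
    return flags


tagged = [("il", "PRON"), ("a", "VERB"), ("du", "DET"),
          ("mal", "NOUN"), ("a", "PREP"), ("comprendre", "VINF")]
for position, token, suggestion in flag_a_confusion(tagged):
    print(f"position {position}: replace '{token}' with '{suggestion}'")
# -> position 4: replace 'a' with 'à'
```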

2.2.2.1 Summary

Similarly to spell-checking, grammar-checking has not been explicitly associated with machine translation. For rule-based MT architectures, the usefulness of correcting grammar as a pre-editing step seems obvious, since incorrect syntax is bound to cause problems for analysis. The use for statistical machine translation is less obvious, since
these systems do not perform any analysis as such. To our knowledge the impact of grammar checking on SMT has not been the object of much investigation. The forum data presents a wealth of grammar errors, including among others number and gender agreement, wrong verb forms and word confusions. In our particular case, translating from French to English, where a more inflected language is translated into a less inflected language, it is reasonable to suspect that a number of these issues, for example gender agreement within noun phrases, will not have a major impact on machine translation. Other errors however, such as the confusion of words of different categories, are more likely to have a significant impact on MT. Correcting grammar will therefore be part of the pre-editing process developed in this thesis.

2.2.3 Ambiguity and complexity

The highly ambiguous nature of natural language, which can be problematic for humans, is even more problematic for any kind of machine processing, including machine translation. Even the most ordinary words can be ambiguous, if they have more than one meaning or category and the context is insufficient for disambiguation. Additionally, besides lexical ambiguity, the way words are arranged into sentences can also be challenging if different parses are possible. Ambiguity of natural language has long been a problem in the technical domain, where safety-critical instructions for example need to be totally unambiguous. This issue has led to the development of Controlled languages (CL), sub-languages defined by lexical, syntactic and stylistic constraints designed to reduce or eliminate ambiguity (Huijsen, 1998). Although CLs were originally developed to improve human readability of technical documentation, it was soon found that they also improve machine-translatability (Mitamura & Nyberg, 1995; Bernth, 1998; Nyberg et al., 2003; Roturier, 2004). CLs have therefore often been associated with machine translation, mostly with rule-based systems. However, with some exceptions (e.g. Aikawa et al., 2007, Temnikova, 2010 or Lehmann et al., 2012), pre-editing in the sense of controlled language for statistical machine translation remains a mostly unexplored field. As described by Kuhn (2014), there is no universally accepted definition for controlled natural language. Approaches range from “technical languages that are designed to improve comprehensibility” to “languages that can be interpreted by computer”. However, controlled languages have in common the fact that they are defined by sets of
rules, which govern different linguistic phenomena. O’Brien (2003) classifies rules into three categories: lexical, syntactic and textual rules. Lexical rules control word usage, i.e. phenomena such as homography, synonymy, pronoun usage or date formats, among others. Many ambiguities result from the use of homographs. To avoid these, Simplified English (AECMA, 1995), for example, allows only one category for each word. The reduction of lexical ambiguity is one aspect of CLs that was found to have a positive impact on MT (Baker et al., 1994). Syntactic rules define how words can be combined to form phrases and sentences. They mostly deal with structures that can cause ambiguity or reduce understandability, such as long noun clusters, ambiguous attachments or passives (Nyberg & Mitamura, 1996). They also restrict usage of expressions that can only be interpreted correctly in light of another expression in the context, such as ellipses or anaphora. Textual rules control sentence length, text structure or capitalisation. The objective of these rules is mainly to improve clarity and readability by defining a coherent form. The most common rule in this category, which appears in most CLs, is to avoid long sentences (O’Brien, 2003). Splitting long sentences is also a very common approach to improve MT. Goh & Sumita (2011) argue that partitioning sentences into individual clauses prevents word reordering across clauses. Applying the principles of CL to improve MT, Pym (1988) designed ten simplification rules. Five of these tell the writer to “keep it short and simple”, by writing short sentences, avoiding redundancies and taking care with logical constructions, among others. The remaining five rules direct them to “make it explicit”, for example by avoiding ellipses, adhering to a dictionary which allows only one meaning per word and avoiding noun clusters. Application of these ten rules improves raw RBMT output and reduces post-editing time for the translation of workshop manuals. Another CL approach for machine translation is that described by Nyberg & Mitamura (1996), where KANT Controlled English was developed specifically to improve translations by the knowledge-based KANT translation system. This CL is found to greatly reduce the average number of parses per sentence. Not exactly a CL, but on the same principles, Bernth & Gdaniec (2001) aim at improving “MTranslatability” by commercial RBMT systems. The central idea is again to reduce ambiguity, notably by controlling coordination, using explicit post-nominal modifiers introduced by relative pronouns (that, which, who etc.) and avoiding personal pronouns.
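To make the notion of a CL rule more tangible, the following Python sketch checks two toy rules of the kinds just described: a textual rule flagging sentences above a length threshold and a lexical rule flagging disallowed variants of an approved term. The threshold, the variant list and the matching are invented for illustration and are far simpler than any real CL checker.

```python
import re

MAX_WORDS = 20  # textual rule: avoid long sentences
DISALLOWED = {"setup": "install", "set up": "install"}  # lexical rule: one approved term


def check_sentence(sentence):
    """Return human-readable flags for a single sentence."""
    flags = []
    words = re.findall(r"[\w'-]+", sentence.lower())
    if len(words) > MAX_WORDS:
        flags.append(f"textual: {len(words)} words (maximum {MAX_WORDS}); consider splitting the sentence")
    for variant, preferred in DISALLOWED.items():
        if variant in sentence.lower():
            flags.append(f"lexical: use '{preferred}' instead of '{variant}'")
    return flags


print(check_sentence("First do the setup of the antivirus, then restart the computer."))
# -> ["lexical: use 'install' instead of 'setup'"]
```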

Also for RBMT systems, a study by de Preux (2005) suggests that although application of a CL does not decrease the number of translation errors, it decreases their severity. More recently, a study has used eye-tracking measures to evaluate the readability and comprehensibility of MT output, both raw and controlled (Doherty, 2012). A study for the English-German language pair, measuring both post-editing effort and MT output comprehensibility (O’Brien & Roturier, 2007), shows that not all rules have equal impact on MT. Some rules, such as replacing personal pronouns without antecedents, were found to have a high impact, while others, such as the use of parentheses, only had a limited impact. Also investigating the impact of CL rules on MT output quality and post-editing effort, but using a statistical MT system, Aikawa et al. (2007) translate from English to four different languages: Chinese, French, Dutch and Arabic. It was found that rules correcting informal style, spelling and capitalisation had the greatest cross-linguistic impact. Other rules had a more language-specific impact. The question of the portability of rules between domains, MT systems or language pairs is still worthy of research. While controlled languages are traditionally associated with technical documentation, very similar approaches have been investigated for more general texts. The objective of these simplification methods is to reduce grammatical complexity while maintaining meaning and information content. A real-world example of text simplification is Simple Wikipedia (http://simple.wikipedia.org/). Simplification is beneficial for human readers, and has been found useful, among others, for aphasic readers (Carroll et al., 1998), language learners (Petersen & Ostendorf, 2007) as well as poor literacy readers (Candido et al., 2009). Simplification can also be of use for many natural language processing tasks, including machine translation (Chandrasekar et al., 1996). Approaches have been developed for summarisation (Siddharthan et al., 2004), to improve syntactic parser performance for text mining (Jonnalagadda et al., 2009), for semantic role labelling (Vickrey & Koller, 2008) or information retrieval (Beigman Klebanov et al., 2004). While most simplification research has focused on English, some research exists for other languages such as Brazilian Portuguese (Aluísio et al., 2008; Candido et al., 2009), Dutch (Daelemans et al., 2004) or French (Seretan, 2012). Like controlled language rules, simplification rules act both at the lexical and syntactic level, aiming at producing simpler variants. Different methods are used to acquire
and apply rules. Rules can be entirely hand-crafted to target linguistic phenomena that are known to be complex (e.g. Carroll et al., 1998) or learned from corpora (e.g. Petersen & Ostendorf, 2007). For syntactic simplification rules, mostly aiming at splitting long complex sentences, Seretan (2012) describes a semi-automatic rule acquisition process, using linguistic tools to identify sequences that could potentially be simplified. Another approach to simplification is to consider it as a translation task, where an SMT system is trained on a parallel corpus of original and simplified texts (e.g. Specia et al., 2010; Coster & Kauchak, 2011). Finally, an important aspect to bear in mind when considering CLs or simplification is that applying rules can be a difficult task due to the number and complexity of pre-editing rules (Goyvaerts, 1996; Temnikova, 2011). This has led to the development of tools to assist writers in producing controlled or simplified texts. The aim of most of these tools is twofold: 1) check for compliance with rule specifications to ensure the quality of the final output, and 2) facilitate application of rules. Most controlled languages have their own checking tool, often developed in-house, for example the Boeing Simplified English Checker (Wojcik et al., 1990), IBM’s EasyEnglishAnalyzer (Bernth, 1999), the KANT Controlled English Checker (Mitamura & Nyberg, 2001) or Compagnon LiSe for the French security domain (Renahy et al., 2009). For the simplification task, Max (2006) describes a rule-based interactive system for writing simplified text, integrated into a word processor. Likewise, Candido et al. (2009) describe Simplifica, a rule-based authoring system for Brazilian Portuguese. This system performs automatic simplifications, which are then presented for revision in a web-based editor. All these tools are highly specialised, designed for specific sets of rules, types of input text and, for some, specific machine translation systems.

2.2.3.1 Summary

Originally developed to improve comprehensibility of technical texts, controlled languages were soon associated with rule-based machine translation, where they are found to improve the quality of output, decrease the severity of translation errors and reduce post-editing effort. Little research has however gone into the combination of CLs and statistical machine translation. Another approach following similar principles to CL is simplification, employed both to improve human readability and to facilitate diverse NLP tasks, including machine translation. An important aspect of both approaches is the difficulty of implementing the changes required by the rules. While it seems unrealistic to attempt to make forum data comply with a CL, some of the principles of CL and simplification can certainly be adapted to this data. We will therefore include rules following the principles of CL and simplification in our pre-editing process.

2.2.4 Reordering

To complete our overview of pre-editing approaches, we should mention reordering. Unlike the other approaches described here, reordering presents a stronger integration into the machine translation process, as it is applied not only to data that will be processed by MT, but also to the data used to train the MT systems. Differences in word order between languages are one of the main sources of errors in statistical machine translation (Niessen & Ney, 2001). Thus research has turned towards reordering, i.e. moving words or clauses within a sentence to improve translation by SMT systems. Reordering can be performed as pre-processing, generally applied to both training and test data, or within the SMT system itself, in the decoder and in the reranking framework (Feng et al., 2013). Here we will focus on reordering as a pre-processing task. Reordering rules aim at making the sentences in source and target language more similar to each other. Different methods have been investigated to define reordering rules. Earlier approaches rely on linguistically motivated rules, focussing on specific syntactic phenomena that differ between two languages. For example, Berger et al. (1996) focus on a single phenomenon and describe a noun de noun reordering module for translation from French to English. Different transformations are considered, based on the best English translation: in some cases, English keeps the same word order as French, in others inversion of the two nouns and deletion of the preposition de is necessary to obtain the correct translation. To determine whether a given noun de noun requires inversion or not, a maximum entropy model is used. For the same language pair, Brown et al. (1992) describe analysis and synthesis components that, among other annotations and transformations, perform syntactically motivated transformations to increase sentence similarity, such as moving French adjectives or adverbs into positions closer to those usual in English.

The German to English language pair has also been the object of reordering research. Niessen & Ney (2000, 2001) describe manually designed transformations. German detachable verb prefixes, which usually appear at the end of the clause, are moved before the corresponding verb (ich fahre morgen los −→ ich losfahre morgen), thus facilitating word alignment. Collins et al. (2005) describe transformations to the source parse tree to improve translation from German to English with a phrase-based system. The reordering process is divided into six restructuring steps applied in sequence, each performing a linguistically-motivated transformation such as moving the verb to the initial position in the verb phrase, or moving verb particles to precede the verb when found in the same clause. On a similar note, Wang (2007) describes a set of syntactic reordering rules to handle differences between Chinese and English. Rules focus on verb phrases, noun phrases and localiser phrases, which were identified as presenting the most differences in word order for this language pair. As opposed to the approaches described so far, which use linguistically motivated transformations, Xia & McCord (2004) rely on automatically learned reordering patterns. These patterns are learned on parsed and aligned training data, and then applied at runtime to the parse trees of source data. Also learning reordering rules automatically from word-aligned training data, Genzel (2010) describes a reordering approach for eight different language pairs. Beyond language-pair-specific aspects, word order also has an influence on analysis in general, especially for the resolution of long-distance dependencies. Reordering can be used to produce sequences that are easier to analyse, for example by moving adverbs to the end of verb sequences, in order to decrease the distance between subject and verb (Brown et al., 1992). It has also been found that unifying word order between different sentence types, such as undoing question inversion to produce structures similar to the more frequent declaratives, can enable a better exploitation of training data (Niessen & Ney, 2001). Obviously, such transformations require a post-processing phase to reconstruct the correct form.
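As a minimal illustration of what a linguistically motivated reordering rule can look like, the Python sketch below inverts French noun de noun sequences over POS-tagged input (fichier de sauvegarde → sauvegarde fichier), bringing the source order closer to the English compound order (backup file). Unlike Berger et al. (1996), who decide each case with a maximum entropy model, this sketch applies the inversion unconditionally and only shows the mechanics; the tag set is invented for the example.

```python
def reorder_noun_de_noun(tagged):
    """Rewrite NOUN 'de' NOUN as NOUN2 NOUN1 and drop the preposition, so that
    the French order mirrors the English compound order
    (fichier de sauvegarde -> sauvegarde fichier, cf. 'backup file')."""
    out = []
    i = 0
    while i < len(tagged):
        if (i + 2 < len(tagged)
                and tagged[i][1] == "NOUN"
                and tagged[i + 1][0] in ("de", "d'")
                and tagged[i + 2][1] == "NOUN"):
            out.append(tagged[i + 2])  # second noun first
            out.append(tagged[i])      # then the first noun
            i += 3                     # skip the preposition
        else:
            out.append(tagged[i])
            i += 1
    return out


sentence = [("le", "DET"), ("fichier", "NOUN"), ("de", "PREP"),
            ("sauvegarde", "NOUN"), ("est", "VERB"), ("corrompu", "ADJ")]
print(" ".join(token for token, _ in reorder_noun_de_noun(sentence)))
# -> le sauvegarde fichier est corrompu
```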

2.2.4.1 Summary

Reordering, a pre-editing approach used in conjunction with statistical machine translation, consists in transforming the word order of the source data to improve translatability, often by changing the order to bring it closer to that of the target language. Unlike
the other pre-editing processes described so far, these transformations are mostly performed both on training and test data. In some cases, post-processing after translation is required to reverse transformations. Consequently, reordering is generally closely integrated with the SMT process. In our particular context, the pre-editing process must be independent of the machine translation process. Thus we will not be able to perform transformations on both training and test data, with subsequent retraining of the translation models. Post-processing to reverse the effect of specific rules will not be possible either, at least at the present time. We will however see whether some simple transformations, focussed specifically on the French-English language pair, can improve machine translation. After this overview of pre-editing approaches, we will now present Acrolinx, the tool that has been chosen for pre-editing within the ACCEPT project and this thesis.

2.3 Pre-editing with Acrolinx

Acrolinx is a checking and authoring support tool designed to help companies or organisations improve published content, mainly by ensuring grammatical accuracy, terminological consistency and compliance with style guidelines or CL rules. It is based on technology originally developed at the language technology laboratory of the German Research Center for Artificial Intelligence (Deutsches Forschungszentrum für Künstliche Intelligenz, DFKI), performing “Phenomenon-based Language Checking” (Bredenkamp et al., 2000). It also includes technology developed at the University of Geneva, for the morphological analysis component, MMORPH (Petitpierre & Russel, 1995).

2.3.1 Acrolinx technology

Acrolinx uses a phenomena-oriented approach to language checking, using error patterns to scan input text to locate error candidates. A combination of shallow NLP components is used to obtain linguistic annotations for each token in a text: a two-level morphological analyser provides morphological information (MMORPH, Petitpierre & Russel, 1995), and a probabilistic tagger provides part-of-speech tags (TnT, Brants, 2000). Based on these annotations, represented in the system as feature structures, complex linguistic objects can be defined. These objects are then used in a declarative
formalism similar to regular expressions to describe the patterns which will be used to scan the text. The patterns are in turn integrated into a combination of different rules: recall-oriented Trigger rules identify error candidates and precision-oriented Evidence rules refine these results by eliminating erroneous candidates. Once candidates are reliably identified, Suggestion rules allow reordering or replacement of the flagged tokens. The Acrolinx rule formalism will be described in greater detail in the next chapter (3). Originally developed for error detection (Bredenkamp et al., 2000), the formalism used by Acrolinx also lends itself well to the identification of other patterns in text, such as those targeted by many CL or simplification rules. By adding customised rules, Acrolinx can therefore be used to check for compliance with CLs, style guidelines, textual rules etc. In addition to the rule-based approach to real-word error checking, Acrolinx also has a spelling module for non-word error checking, which uses the same morphological resources as the analyser mentioned above. This spelling module can be customised by specialisation of the morphology, as well as by addition of so-called gazetteers, indicating tokens that should not be flagged. Acrolinx also offers terminology management functionalities, which allow bilingual term extraction as well as checking for terminological consistency. These features were not used in the context of this study and will not be described further. While Acrolinx is mostly used as an interactive checking tool, it is also possible to apply rules automatically if they satisfy certain conditions. Thus an entirely automatic pre-editing process is also achievable. Through these different aspects, the Acrolinx technology lends itself particularly well to the pre-editing scenario we envision for the forum context. The shallow analysis process should be sufficiently robust to handle the messy forum data. The high flexibility of the rule formalism will allow us to develop different types of rules, for grammatical phenomena, controlled language, reformulation and even reordering. The possibility of applying rules interactively or automatically permits experimentation to determine which approach is best suited to the forum context. Finally, the simplicity of the rule formalism and the available tools to assist rule development make it relatively simple to implement pre-editing rules.
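The division of labour between recall-oriented Triggers, precision-oriented Evidences and Suggestions can be illustrated with a small Python sketch. This is emphatically not the Acrolinx formalism (which is presented in Chapter 3), but a conceptual analogue under assumed toy patterns, using the sa/ça confusion as the phenomenon.

```python
# Conceptual analogue of the Trigger / Evidence / Suggestion pipeline,
# illustrated on the confusion of 'sa' (his/her) with 'ça' (this/that).

def trigger(tagged):
    """Recall-oriented: flag every occurrence of the surface form 'sa'."""
    return [i for i, (token, _) in enumerate(tagged) if token.lower() == "sa"]


def evidence(tagged, candidates):
    """Precision-oriented: keep a candidate only when 'sa' is followed by a
    verb, a context in which the possessive is unlikely and 'ça' was probably meant."""
    return [i for i in candidates
            if i + 1 < len(tagged) and tagged[i + 1][1] == "VERB"]


def suggest(tagged, confirmed):
    """Propose a replacement for each confirmed flag."""
    return [(i, tagged[i][0], "ça") for i in confirmed]


# "sa marche pas, sa licence": the first 'sa' should be 'ça', the second is correct.
tagged = [("sa", "DET"), ("marche", "VERB"), ("pas", "ADV"),
          (",", "PUNCT"), ("sa", "DET"), ("licence", "NOUN")]
print(suggest(tagged, evidence(tagged, trigger(tagged))))
# -> [(0, 'sa', 'ça')]
```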

2.3.2 Using Acrolinx

In practice, Acrolinx relies on a client-server architecture. A central server hosts the main checking engine and the language resources. The checking engine manages the NLP components used for the linguistic annotations. The language resources consist of rule files, which can be grouped into sets for specific application, or for application in sequence. The server can either be a shared server on a network, making the checking functionalities available to multiple users, or a local server, convenient for rule development and testing. Server settings, such as the selection of specific rules, can be configured by means of a dashboard accessible through a web browser. On the client side, different options are possible. A batch check of multiple files can be done with the Acrolinx Batch Checker. This generates an extensive report in XML format, including all flagged errors with their context, as well as statistics for each rule. These reports are intended to provide a global overview of text quality. We will see in the next chapter (3) how these reports are also very useful for rule development. For interactive rule application, Acrolinx provides plugins for text processors and desktop publishing tools, as well as XML authoring tools and web content management systems. These plugins allow users to trigger the interactive checking process wherever text is being composed. The checking client flags words or phrases and highlights them in the text. For each flagged word or phrase, a contextual menu provides the user with more information about the error as well as correction suggestions, if available. Checking consistency with pre-editing rules thereby becomes as convenient as spelling and grammar checking. Within the ACCEPT project, a plugin has been developed to access these checking resources directly from the forum interface (ACCEPT D5.1, 2012). Similarly to the text editor plugins, this provides highlighting of phenomena, and a contextual menu for correction. The plugin interface is shown in Figure 2.1.

Figure 2.1: Pre-editing plugin in forum interface

Finally, it is also possible to apply rules automatically, by means of the autoApplyClient, a Java executable, which applies all suggestions automatically to a text file. Automatic rule application is interesting in several respects. First, it means pre-editing can be performed without requiring any human effort, which is clearly an interesting feature in the context of forum data, where users might not see the use of pre-editing or might not be motivated to invest any time in this step. Second, it allows application of transformations that might be counter-intuitive to a human, for example reordering that produces incorrect source language, and which therefore do not lend themselves to interactive application.

2.3.3 Developing rules with Acrolinx

Acrolinx provides a set of resources to facilitate rule development, including an IDE to support rule writing, real-time checking with rules during development and visualisation of the linguistic annotations provided by the NLP components on tested data. These resources will be discussed in further detail in the following chapter (cf. 3.2.2) where we describe rule development.

2.3.4 Summary

We have introduced Acrolinx, a pre-editing tool which relies on different shallow NLP components to produce linguistic annotations that are used in declarative rules to describe error patterns. A combination of recall-oriented Trigger rules and precision-oriented Evidence rules allows identification of complex cases, while replacement sequences are defined by means of Suggestion rules. Acrolinx checking functionalities are accessed by means of plugins. Within the ACCEPT project, a plugin has been developed for integration into web environments, allowing forum users to check text directly in the forum interface. The plugin provides highlighting of phenomena, and a contextual menu for correction. Another possibility is automatic rule application. So far we have discussed different pre-editing approaches and their association with machine translation, and have presented the pre-editing technology that we will use in this thesis. We now outline how we plan to combine these factors into a pre-editing process for forum data.

2.4 Pre-editing forum data with Acrolinx

We have seen that pre-editing can take on many forms, most of which are applicable to our data, forum content, and our pre-editing objective, the improvement of machine translation. Our pre-editing process will include spelling and grammar checking, normalisation, simplification/CL-inspired rules as well as some reordering rules. In this endeavour, we need to take into account different constraints resulting from our chosen context: the specificities of the input text, the machine translation system, the quality requirements imposed on the target text, and finally, the pre-editing context itself, or how pre-editing can be performed in the forum environment. We shall now discuss these aspects in further detail.

2.4.1 Input

The forum posts used in this study are extracted from the French Norton Community forums (http://community.norton.com/). These forums are aimed at non-professional end-users of Norton Products discussing issues with these products as well as tips about security solutions. The following example shows an extract of this forum:

(1) je suis en renouvellement automatique mais ma cb était expiré donc mon abo c’est pas renouvellé automatiquement, j’ai voulu racheter 12 mois d’abo ... mais y’a tellement de produits dispo sans vraiement que je sache faire la différence entre les un ou les autres que j’me suis planté de produits quoi ...

This type of writing is typical for French forum content. While it is reasonably understandable for a French native speaker, it presents numerous difficulties for natural
language processing, as it differs from traditional edited content in many aspects. This is the result of multiple factors, including the publishing context, the writers’ aims in contributing content, the writing environment and the writers’ skills. The informal context leads to the use of a language closer to spoken language, using informal syntax, colloquial vocabulary, abbreviations or phonetic substitutions (Subramaniam et al., 2009; Jiang et al., 2012; Melero et al., 2012). Another effect of the forum context is that, while exchanges are not real-time like on a chat, the communication is still fast. Writers are focused on answering a question or providing technical information, and linguistic accuracy is put in second place. As a result, spelling, grammar and punctuation conventions are not always respected (Baldwin et al., 2013). For example, in French, careless writing often leads to the confusion of homophones, which are very frequent. Most of these homophones differ only by single letters or accents, making it easy to confuse a (“has”) with à (“to”), or sa (“his/her”) with ça (“this”). Finally, with the generalisation of tablets and smart phones, forum content can be created using many different input methods. The users’ proficiency with these tools will also condition the quality of the content. The pre-editing process needs to address all these issues. In our context we will focus on modifications that improve MT output. Indeed, while the advantages of improving the source for enhanced readability and clarity seem obvious, our main objective remains translatability. Therefore, rules that improve the source but degrade translation output are not desirable. Another case is that of rules which improve the source but have no impact on translation. If these rules can be applied automatically, requiring no effort from the user, then they can be maintained. If they require human intervention, however, they should be discarded. It should be noted that, besides being problematic for the MT technology, a high density of spelling and grammar issues can be equally challenging for the technology used to correct them. To handle this data, the pre-editing resources must therefore be sufficiently robust. In this respect, using the Acrolinx technology for pre-editing presents several advantages. Since it relies on shallow NLP components (cf. 2.3.1), it is inherently robust. Additionally, as different analysis components provide annotations that we can use to define objects, rules can easily be made more or less robust. This aspect will be discussed in greater detail in the next chapter (3.2.1.1). Obviously, it influences rule precision, which will also be discussed in the next chapter. Finally, as
any number of patterns can be defined, it is possible to tailor Triggers specifically to the language found on forums, for example by taking into account potential errors in the context of another error. We shall see that this is very useful when treating data with a high density of errors.

2.4.2 MT technology

In the context of the ACCEPT project, pre-editing is a standalone process, entirely distinct from the subsequent machine translation process. The advantage of this clear division is that the pre-editing process is independent, making it portable to other systems or other domains, and allowing its development without system retraining. The SMT system will therefore not be changed in any way, e.g. no retraining on pre-edited data will be performed. The only aspect of the SMT system that will be taken into account for pre-editing rule development is the training data, since it is well known that divergences between training and test data are an important source of mistranslation. As no large amounts of bilingual aligned informal data resembling forum content were available, the SMT system was trained on more traditional data, mainly Europarl (Koehn, 2005) and Symantec translation memories. While the translation memory data is from the same domain as the forum data, it consists mostly of text from user manuals, which differs strongly from the forum data, at both lexical and syntactic levels. One objective of pre-editing will therefore be to bring the forum data closer to this training data.

2.4.3 Target text

The aim of the ACCEPT project, namely making UGC available to other language communities, can be understood as “Making the information provided in UGC available to other language communities”. In this sense, the focus is put on the informative content, and not on the way it is provided, i.e. the quality of the language it is conveyed with. This places different requirements on the MT output from the usual accuracy/fluency approach. If the main objective is to convey a message, an accurate but not fluent translation, provided it is understandable, is clearly better than a fluent but not accurate one. Consequently, pre-editing rules should be dedicated primarily to the improvement of accuracy.

In addition to these purely functional considerations, we must also take into account that the translations of forum posts will also be posted on forums. They therefore do not have to be of particularly high quality and some errors or imperfections will be perfectly acceptable. It is thus debatable whether a pre-editing action that brings the MT output from disfluent yet perfectly understandable to slightly more idiomatic is profitable. If the pre-editing action can be performed reliably by an automatic rule, then it is certainly interesting. If performing this action involves human effort, then much less so. As an illustration, the following example shows the effect of an existing Acrolinx rule that adds missing spaces:

(2) Raw: Merci à Norton et Stephen,tout est ok ,j’ai la nouvelle version.Bravo pour votre réactivité −→ Thank you to Norton and Stephen, everything is ok, I have the new version.Bravo for your response

Pre-edited: Merci à Norton et Stephen,tout est ok ,j’ai la nouvelle version. Bravo pour votre réactivité −→ Thank you to Norton and Stephen, everything is ok, I have the new version. Well done for your response

While the pre-edited version does produce a more correct English output, adding the space does not fundamentally improve the message. In both cases, Norton and Stephen will understand that the user is happy with their response. In this particular case, since this rule can be applied automatically, requiring no effort from the forum users, it remains interesting in our context. Another aspect to consider is the impact of the rules on each other. As we shall see in Chapter 4, it can be interesting to apply rules in sequence, on the basis that a first set of rules could correct rough errors (such as punctuation that influences segmentation, for example) and would thereby pave the way for more precise subsequent rules.
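A rule of the kind illustrated in example (2) can be approximated with a single regular expression. The sketch below is an assumption about how such an automatic spacing rule might work, not the actual Acrolinx implementation; it leaves digits alone so that numbers like 3.5 are not broken up, and it presumes that URLs have been masked beforehand (for instance with placeholders, as in the normalisation sketch in 2.2.1.2).

```python
import re

# Insert a missing space after sentence punctuation that is directly followed
# by a letter ("version.Bravo" -> "version. Bravo").
MISSING_SPACE = re.compile(r"([.,;:!?])(?=[A-Za-zÀ-ÿ])")


def add_missing_spaces(text):
    return MISSING_SPACE.sub(r"\1 ", text)


print(add_missing_spaces("Merci à Norton et Stephen,tout est ok,j'ai la nouvelle version.Bravo"))
# -> Merci à Norton et Stephen, tout est ok, j'ai la nouvelle version. Bravo
```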

2.4.4 Pre-editing in the forum context

As observed in many studies, applying pre-editing rules manually is both time-consuming and difficult (Goyvaerts, 1996; Huijsen, 1998; Temnikova, 2011), even if the users are language experts or trained professionals. In our context, the pre-editing task will be performed by forum users directly in the forum interface, before they submit their posts for publication or translation. It is not reasonable to ask these users, whose main objective is obtaining or providing solutions to technical issues, to painstakingly
study pre-editing guidelines and manually edit their text. Therefore pre-editing must be made as easy as possible. We have seen that ACCEPT users can check for compliance with the rules in the forum interface, by means of a plugin that highlights phenomena that require modification. Yet we also need to make sure users will be able to perform the necessary transformations. The best pre-editing rule with high impact on SMT will be quite useless if forum users are unable to apply it. Considering the skills and motivation of users is a fundamental issue in rule development (Helfrich & Music, 2000). Microsoft, who provide spelling and grammar checking to a very broad user base with their Office series, target what they call knowledge-workers with native or near-native language skills. These users have average understanding of grammar issues and expect suggestions from the system (Fontenelle, 2006). On our technical forums, we expect a much broader range of profiles, since anyone can encounter IT problems and post on the forum. We therefore cannot expect users to have any particular language skill. Clear indications of how modifications should be performed must be provided, especially for rules dealing with linguistic phenomena such as negations and passives which users find the most difficult to apply (Temnikova, 2011). This means that rules and error descriptions should be as specific as possible. To further simplify rule application, it is helpful to provide concrete replacement suggestions, instead of instructions. While it is not always possible to provide a unique reliable replacement candidate, it is possible to provide multiple suggestions, relying on the fact that it is often easy for the human to make a choice based on information not available to the machine. Obviously, the most efficient way of reducing the effort involved in the pre-editing task would be to perform it automatically. However, this requires providing unique and reliable replacement suggestions for flagged phenomena, which, as we shall see in the next chapter (3), is not trivial. We have mentioned earlier that the robustness of a checking tool is important when processing messy data like community content. Indeed, users could easily lose patience with erroneous flags or incoherent suggestions produced by a system which is having difficulties analysing this data. Rule precision therefore is also a key factor for the acceptability of the checking tool. We shall discuss this aspect in greater detail in Chapter 7 which investigates the usability of the rules by real forum users.

2.4.5 Summary

Several constraints surround the development of a pre-editing process for the forum context. To handle the messy input data, checking resources must be particularly robust. Since the main objective of technical forum posts is information transfer, pre-editing should primarily improve accuracy. For pre-editing to be successful, it has to be accessible to forum users. Rules must be easy to understand for writers with no linguistic knowledge and, whenever possible, provide concrete suggestions on how to improve flagged elements. Taking this one step further, as many rules as possible should be automatic, requiring no user intervention at all. Lastly, since pre-editing will have to be performed on a voluntary basis, it should be optimised for the effort-reward ratio: rules that only produce minor stylistic improvements without impact on MT quality should be put aside.

2.5 Conclusion

Natural language is fraught with difficulties for machine translation and other NLP tasks. Pre-editing is a common way to address these difficulties, by transforming the input text before further processing. A very common problem is the presence of out-of-vocabulary items, which disrupt most processes. To identify and replace these, different approaches have been developed, including spell-checkers for misspelt words, and normalisation approaches for non-standard items frequently found in social media data. Replacing these items is found to improve machine translation by reducing the number of untranslated words. Another difficulty for machine translation is ambiguity, both lexical and syntactic. A commonly used approach when attempting to reduce ambiguity is the application of controlled language rules. Originally designed to improve human readability of safety-critical technical texts, CLs have also been found to improve machine translation. They are therefore often associated with MT, yet mostly with rule-based systems. Also aiming to lessen the complexity of natural language, an analogous approach is simplification, but this has only occasionally been combined with MT. Finally, differences in word order between languages are also problematic for MT, which has led to a pre-editing approach specific to SMT systems: reordering. Most of these pre-editing approaches seem applicable to the problem of improving the translatability of forum data. In our study, we will implement rules for these different approaches using Acrolinx, a tool with a phenomena-based approach to language checking, which can easily be customised to include pre-editing rules of different types. Rules are written by combining objects in a declarative formalism similar to regular expressions to describe error patterns. Pre-editing in the forum context does however present several difficulties. The data itself is challenging, as the problems it presents for machine translation are also potential hurdles for the technology used for pre-editing. Since the Acrolinx technology relies on shallow NLP components, we expect the tool to be sufficiently robust to handle this data. Another difficulty is that the pre-editing process will have to be performed by the forum users themselves. This raises questions about the complexity of rule application, especially in the case of difficult transformations requiring some linguistic skill, but also about the motivation of the users, who are expected to perform pre-editing on a voluntary basis. To simplify this task, an Acrolinx plugin was developed to allow interactive rule application directly in the forum interface. Acrolinx rules can also be applied automatically, if certain conditions are fulfilled. Both options will be considered for our pre-editing process. In the following chapter, we will present the Acrolinx pre-editing rules for forum data adapted and developed within this thesis.

3 Rule Development

This chapter introduces the Acrolinx rule formalism and describes the pre-editing rules developed and adapted for the forum data.

3.1 Introduction

We have seen in the previous chapter that pre-editing can take on many different forms, with as many approaches as there are applications. In our case, the objective of pre-editing is to improve translation by the ACCEPT SMT system. In view of the low linguistic quality of the forum data, it seems evident that a spelling and grammar checking component is needed. Another problematic aspect of the forum data is the use of informal language and jargon, which could benefit from some form of normalisation. Considering pre-editing approaches that have been found to improve machine translation, simplification rules such as those defining controlled languages could also be beneficial for forum data. Finally, we will also consider reordering approaches. To develop rules for this eclectic pre-editing process, we have chosen to use a single tool, Acrolinx, which allows us to provide a single pre-editing interface for all these approaches. In this chapter we will see how different types of rules can be developed using the Acrolinx formalism. An important aspect to consider in the development of pre-editing rules for forum data is that the challenges this data presents for MT are likely also challenges for the checking technology. In this chapter, we will investigate the question of whether the Acrolinx technology can handle forum data, and in particular whether the developed
rules achieve sufficient precision for our application. Precision of pre-editing rules is a key factor for our pre-editing process, for two reasons: first, high precision means users will not have to deal with many erroneous flags, making the pre-editing process less tedious as well as more reliable, since the risk of wrong corrections being inadvertently accepted is reduced. Second, if rules are very precise, and include reliable unique replacement candidates, it becomes possible to apply them automatically, which is highly desirable in the forum context. We will see in this chapter that different aspects condition the precision of the rules, and that this in turn determines how rules can be applied in the forum context. We begin this chapter by introducing the Acrolinx rule formalism and the rule development process (3.2). We then present the individual rules, regrouped by categories. Section 3.3 presents grammar and spelling rules, followed by a comparison of the performance of these rules with other checking tools. Section 3.4 presents rules for punctuation and spacing, and section 3.5 rules treating informal language phenomena. We then proceed with a set of simplification rules (3.6) and a set of rules for the machine, designed specifically to improve translation with the ACCEPT SMT system (3.7). Finally, we will discuss the concrete application of these pre-editing rules in the forum context, considering automatic vs interactive application as well as application order (3.8).

3.2 Developing Acrolinx rules

As mentioned in the previous chapter, Acrolinx uses a phenomena-oriented approach to language checking. A morphological analyser (MMORPH, Petitpierre & Russel, 1995) is combined with a probabilistic tagger (TnT, Brants, 2000) to obtain linguistic annotations for each token in a text. These annotations are then used in declarative patterns to describe phenomena. The patterns are in turn integrated into different rules: Triggers, Evidences and Suggestions, the combination of which composes a rule file. In this section, we begin by describing the Acrolinx rule formalism in further detail. We then explain how custom rules for specific phenomena can be developed with this formalism.

3.2.1 Rule formalism

The language resources in Acrolinx are subdivided into rule files, with one rule file for each type of error or phenomenon that should be checked. Besides meta data, rule files contain two main sections: object definitions and rules. Objects are used to describe individual tokens based on multiple features. Rules use these objects to describe the patterns that should be found in texts.

Object definitions

Objects are defined at the word level, using the linguistic annotations provided by the back-end components in the form of a feature structure. Object descriptions are partial descriptions of a feature structure and can contain regular expressions, disjunctions or negations. Depending on how many features are defined, objects can be very precise (e.g. a specific inflection of a verb) or very general (e.g. any word identified as a noun by the POS tagger). The following example shows the definition of an object referring to a token that is a feminine noun:

(3) @femNoun ::= [POS "(PR-NOUN|NOUN)" MORPH.READING.INFLECTION.gender "feminine"];

The definition begins with the @ character, followed by the object name, which can be freely chosen. The features are then listed between [] after the ::= symbol. This object uses two features:

POS "(PR-NOUN|NOUN)"  The keyword POS is used to indicate a restriction on the part-of-speech tag, which should be PR-NOUN (proper noun) or NOUN (noun).

MORPH.READING.INFLECTION.gender "feminine"  The keyword MORPH refers to the morphological analysis. To match this object, the morphological analysis of a given token should return the gender value "feminine".

Any number of features can be combined to define an object. In the following paragraphs we describe some of the features used to define objects. This enumeration is not intended to be comprehensive, but should provide an overview of the most frequently used features.


TOK  The most basic keyword is TOK. It can refer to any surface string of the tokeniser output, thus making it possible to describe a specific token without involving the part-of-speech tagger or the morphology. The TOK keyword is followed by a string within quotation marks. This string is a regular expression that can include the usual disjunctions, character ranges, wildcards, etc., providing great flexibility in the definition of surface strings. In the following example, the object refers to the tokens a or A:

(4) @a ::= [TOK "^(a|A)$"];

The beginning and end of the token are referred to by ^ and $, and a disjunction allows for both the upper and lower case variants.

POS  The POS keyword gives access to the category and sub-category tags attributed by the part-of-speech tagger. The following example shows the definition of an object tagged as an adjective:

(5) @adj ::= [POS "ADJ"];

MORPH.LEMMA  Another very useful keyword is LEMMA, which relies on the morphological resources. It is frequently used to refer to words that can be inflected, such as verbs. The following object, for example, will match any inflected form of the French verb faire (to do):

(6) @verbFaire ::= [POS "^VERB" MORPH.LEMMA "^faire$"];

AMBIMORPH  This feature stores morphological alternatives. These are especially useful in cases where the analysis is not entirely reliable, due to incorrect or ambiguous syntax of the input sentence. For example, if the text contains an ambiguous word that by itself could be either noun or verb, and is wrongly tagged by the POS tagger as a verb due to its context, we can fall back on the AMBIMORPH feature to identify it as a potential noun or verb. For community content in particular, where analysis is sometimes impaired by errors in the sentence structure, this feature is very helpful to increase coverage of the rules by creating less restrictive objects.
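To make the relationship between objects and annotations more concrete, the following minimal Python sketch treats an object definition as a partial match over a token's feature structure, with a fallback to alternative analyses for list-valued features. The token representation and feature names are simplified assumptions for illustration only; they do not reflect Acrolinx's internal data structures.

import re

# Hypothetical, simplified token annotation (not the actual Acrolinx structure).
token = {
    "TOK": "version",
    "POS": "NOUN",
    "MORPH.LEMMA": "version",
    "MORPH.READING.INFLECTION.gender": "feminine",
    "AMBIMORPH": ["NOUN"],          # alternative analyses, if any
}

# An object is a partial feature structure: feature name -> regular expression.
fem_noun = {
    "POS": r"(PR-NOUN|NOUN)",
    "MORPH.READING.INFLECTION.gender": r"feminine",
}

def matches(obj, tok):
    """Return True if every feature constraint of the object is satisfied."""
    for feature, pattern in obj.items():
        value = tok.get(feature, "")
        if isinstance(value, list):                     # e.g. AMBIMORPH alternatives
            if not any(re.search(pattern, v) for v in value):
                return False
        elif not re.search(pattern, value):
            return False
    return True

print(matches(fem_noun, token))  # True for this token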


An important aspect to consider when defining objects is the analysis component they rely on. An object defined using a part-of-speech feature, for example, will only be "functional" if the part-of-speech tagger manages to analyse the sentence correctly and attributes the correct tag to the token. An object defined solely by surface strings, on the other hand, will be functional in any case, even if the more sophisticated analyses fail. These considerations will be discussed in further detail later on in this chapter, where we will see how the quality of community content influences analysis and thereby rule development.

One problem that appeared early on in the rule development process was the difficulty encountered by the POS tagger when processing community content. Originally this tagger was trained on an annotated corpus of news data. Over the years, new sentences from different domains have been tagged and added to this corpus. At the very beginning of the ACCEPT project, a batch of 600 sentences of forum content was processed in this way and added to the tagger training data. In order not to introduce cases that might cause ambiguity, these sentences were first corrected in terms of spelling. Throughout rule development, more sentences of forum data were added to the training corpus (as Acrolinx uses a single shared corpus to train the tagger, sentences from other datasets and domains were also added to this corpus during the same period). Contributing forum data to the tagger presents two advantages. First, it added a number of previously unknown words to the resources, informal lexical items for example, allowing the tagger to attribute these to the correct category. Second, it added a number of new tag sequences, specific to the structures used in informal language. Arguably, the tagger specialisation for the forum data would have deserved closer investigation, but the complex question of specialising a probabilistic tagger for data as heterogeneous as that found on forums is a research topic in its own right and goes beyond the scope of this thesis.

3.2.1.2 Rules

To define the patterns that should be found in text, the previously described objects are combined into rules. All rules follow the same regular-expression-based syntax. Besides objects, they can also include fixed string elements and markers for sentence boundaries (BOS and EOS). The specification language uses three types of rules:


Triggers identify (flag) an initial set of error candidates; they are recall oriented.
Evidences confirm or discard candidates as errors by putting further constraints on the linguistic context of the identified candidate; they are precision oriented.
Suggestions transform the error into a correct replacement.

A rule file can contain any number of Triggers, Evidences and Suggestions. Each of these rules is assigned a confidence measure. When a sentence is checked and an error candidate is identified, the confidence measures of all applicable positive and negative rules in the rule file are combined to determine whether the identified candidate is indeed an error. If it is, the appropriate replacement rule is applied. The next sections describe the individual rule types in more detail.
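The exact way in which confidence measures are combined is not detailed here; as a purely illustrative Python sketch, the following function assumes an additive scheme in which Trigger confidences count positively, Negative Evidence confidences count negatively, and a fixed threshold decides whether the candidate is kept. Both the scheme and the threshold are assumptions, not the actual Acrolinx mechanism.

# Hypothetical combination of rule confidences for one error candidate.
# The additive scheme and the threshold are assumptions for illustration.
def is_error(trigger_confidences, negative_evidence_confidences, threshold=50):
    """Decide whether a flagged candidate is kept as an error."""
    score = sum(trigger_confidences) - sum(negative_evidence_confidences)
    return score >= threshold

# A Trigger(80) fires, but a Negative Evidence with confidence 40 also matches:
print(is_error([80], [40]))      # False: the candidate is discarded
# The same Trigger with no matching Negative Evidence:
print(is_error([80], []))        # True: the candidate is flagged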

Triggers specify patterns that should be flagged. Patterns are a sequence of tokens described by objects (or fixed strings) that can include optional elements, multiple occurrences of one or more elements, as well as alternatives defined by disjunctions. The following example shows a pattern that will match a sequence composed of 1) a token that is either a personal pronoun (@pronPerso) or a noun (@noun), followed by 2) a token that matches the @verbPlusInf object, optionally followed by 3) an adverb (@adv), and concluded by 4) an infinitive verb form (@verbInf).

(7) {@pronPerso|@noun} @verbPlusInf [@adv]? @verbInf

This pattern will therefore match the following sentences:

(8) Je crois avoir raison.
J'espère bien voir plus des vidéos liées produits Norton bientôt.
Plus sécurisant, 80 % des adolescents déclarent également utiliser aussi Internet pour faire leurs devoirs.

In each Trigger, objects (or fixed strings) can be instantiated, thus making the instances of these objects available as named variables for the subsequent rules such as Evidences and Suggestions. The process of instantiation is analogous to the definition of groups by means of parentheses in a regular expression. Instantiation is done by marking the objects in the Trigger with the caret character followed by an incremental integer. The following example shows a Trigger rule with two instantiated objects:


(9) Trigger(80) == {‘que’|‘comment’}^1 @verbInf^2
    -> ($que, $inf)
    -> { mark: $que, $inf; }

This Trigger will find all occurrences of the pattern and instantiate two tokens: {‘que’|‘comment’} and @verbInf. The second line of the Trigger attributes names to these instances (names are always preceded by ‘$’), which can then be used in Evidences and Suggestions. In Acrolinx, instantiation also enables highlighting of the corresponding tokens in the text when the rule is used for checking. The third line of the Trigger indicates which tokens should be highlighted. By definition, Triggers are recall oriented and are mostly designed to match as many potential errors as possible. Precision is then achieved by adding Negative Evidences to refine the results.
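Since instantiation is compared above to regular-expression groups, the following Python snippet shows the analogous mechanism with named groups. The pattern is a rough, hypothetical approximation of example (9) applied to plain text (only -er infinitives), not a reimplementation of the Acrolinx rule.

import re

# Named groups play the role of instantiated objects ($que, $inf).
# "\w+er\b" is a crude stand-in for the @verbInf object.
pattern = re.compile(r"(?P<que>que|comment)\s+(?P<inf>\w+er)\b", re.IGNORECASE)

match = pattern.search("Je ne sais pas comment installer Norton.")
if match:
    # The groups can be reused downstream, e.g. for highlighting or suggestions.
    print(match.group("que"), match.group("inf"))  # comment installer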

Negative Evidences are used to cancel the effect of Trigger rules. The patterns specified in these Evidences overlap with those described in Triggers, but are more specific. Negative Evidences can consist of a pattern that includes one or more objects instantiated by Triggers, surrounded by context described either with objects or with strings. The following example shows a Negative Evidence that cancels the effect of the Trigger shown in example 9, in the case where the two instantiated objects $que and $inf are preceded by either of the strings ‘moins’ or ‘plus’.

(10) NEG EV(40) == {‘moins’|‘plus’} $que $inf;

Besides context specifications, Negative Evidences can also include conditions on instantiated objects, i.e. the Evidence is only triggered if the object fulfils the condition. The following example shows a case (again for the same Trigger) with a specific context and an additional condition on one of the instantiated objects:

(11) NEG EV(40) == ‘et’ {?=$que} ‘comment’ $inf;

For this Evidence to be triggered, the instantiated object $que has to match the string ‘comment’.


Suggestion Triggers are used to define the replacement candidates that the user will be shown in a drop-down contextual menu during the checking process, or, in the case of automatic application, the unique replacement token or sequence. Suggestions consist of a pattern that describes the context in which the rule should be applied, followed by replacement indications for each instantiated object. As for Evidences, the pattern to match can contain instances of objects (including conditions if required), objects, strings and EOS/BOS markers. The Suggestion Trigger then defines how each of the instantiated objects should be replaced or modified. The following example shows a simple suggestion that inserts the string dois-je before the instantiated infinitive verb:

(12) SUGGEST(10) == $que $inf []* [‘?’]+
     -> { suggest: $que -> $que, $inf -> ‘dois-je ’ $inf; rank: 1; }

Suggestion Triggers also allow reordering of instantiated objects, as well as some transformations, such as casing and inflection modifications. The following rule inverts the order of the instantiated objects and removes the initial upper case of the $pron instance:

(13) SUGGEST(10) == $pron $verb
     -> { suggest: $pron -> $verb ‘-’ $pron/initialLowerCase, $verb -> ‘’; rank: 1; }
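To make the effect of rule (13) concrete, here is a small Python sketch that performs the analogous surface transformation (inversion plus lowercasing of the pronoun) on a pre-identified pronoun–verb pair. The function and its inputs are illustrative assumptions; they do not correspond to Acrolinx internals.

def invert_pronoun_verb(pron: str, verb: str) -> str:
    """Turn a 'pronoun verb' question form into 'verb-pronoun', lowercasing the pronoun."""
    # e.g. ("Tu", "peux") -> "peux-tu", mirroring $verb '-' $pron/initialLowerCase
    return f"{verb}-{pron[0].lower() + pron[1:]}"

print(invert_pronoun_verb("Tu", "peux"))    # peux-tu
print(invert_pronoun_verb("Vous", "avez"))  # avez-vous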

By combining Objects, Triggers, Evidences and Suggestions, it is possible to develop elaborate rules to deal with complex cases. However, rule files quickly become rather long and sometimes interactions between rules can cause unexpected results. To facilitate the developer’s task, Acrolinx have developed a set of tools and plugins, which we shall now briefly present.


3.2.2 Development environment

Acrolinx rules are developed in Eclipse (https://eclipse.org/), using a specific plugin, the Acrolinx Integrated Development Environment (Acrolinx IDE), which offers functionalities to support rule development and testing. In the development environment, the IDE provides syntax highlighting, automatic completion and error-checking functionalities, to facilitate rule writing in compliance with the rule formalism. To test the impact of rule modifications during development, the IDE communicates directly with a local Acrolinx server. Provided that the rule is valid, it can be loaded onto the local server at any time, and be used directly for checking, either with one of the text-editor plugins, with the Batch Checker, or with the checking environment of the IDE.

The integrated checking environment provides an interface where rules can be activated individually to be tested against a corpus or test-suite. Once a check has been performed with one or more rules, each flag in the corpus can be analysed individually. By selecting an error in the text, the developer can view the corresponding Triggers, Evidences and Suggestions, thus facilitating debugging of rules. Furthermore, in this interface the developer can access a tree-view of the complete analysis result (POS tagging, morphology, etc.) for any token of the checked corpus, containing all the features that can be used to define objects. A dedicated tab of the IDE allows developers to verify the tagger results for every checked sentence, and correct them if necessary. Corrected sentences are then periodically added to the training corpus of the tagger.

3.2.3 Development methodology

Various resources were used for rule development, yet the overall process to develop a new rule is mostly the same: a set of sentences with the phenomena to be treated is gathered manually, and Triggers are written to flag these phenomena, with a low level of restrictions. Once the rule is able to flag all the selected examples, it is applied to a small development corpus using the Acrolinx Batch Checker, which produces an XML checking report. We then apply a Perl script developed by Acrolinx to extract from this report the data pertaining to a specific rule, producing a spreadsheet with the flagged words and their contexts in a convenient tabular view. These can then be sorted into true and false positives, which are compiled into test-suites that are used to refine the rules. False positives are then systematically eliminated, either by adding constraints to the Triggers, or by writing new Evidences for particular cases, taking care not to cancel any of the true positives. This process can be repeated any number of times with new development corpora, to further improve precision. The test-suite containing only contexts with correct flags can also be used for regression analysis, when a rule is modified at a later point in time, for example for domain adaptation.

We have so far talked of rule files (for a given type of phenomena) regrouping multiple rules (Triggers, Evidences and Suggestions). Henceforth, the designation rule will apply to complete rule files designed to treat a given type of phenomena, and not to the individual components. It should be mentioned here that rules in this respect can be very precise, such as for the confusion of the homophones des and dès, or quite broad, such as for agreement within noun phrases. The number of flags produced by individual rules therefore differs greatly. How many similar phenomena are grouped in a rule is left to the choice of the developer. For our study, we have preferred developing many individual rules (e.g. for homophone confusions, we have one rule per homophone) to obtain a fine-grained view of the impact of individual changes on machine translation. It would however be possible to regroup all these in a single rule.

For the most part, the rules used in this study were developed from scratch, in order to address the particularities of the source text as well as the impact that source modifications have on SMT output. However, some of the pre-editing rules were adapted from an existing Acrolinx rule set for French, which was designed for technical documentation. To develop the rules, we used data extracted from the French Norton Community forums. In a first stage, we used a corpus of 10,000 representative sentences on which to develop and test the rules; to ensure the developed rules would focus on frequent phenomena, these sentences were selected for their representativeness of the entire available Norton forum corpus, using a script that sorts sentences by a weighted sum of n-gram frequencies. Once a first set of rules was defined, a second corpus of 10,000 representative sentences was extracted from unseen data and used for rule refinement. An overview of the datasets used throughout the thesis is provided in Appendix B.
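As a rough illustration of the weighted n-gram selection idea, the following Python sketch scores each sentence by a weighted sum of the corpus frequencies of its n-grams and keeps the highest-scoring sentences. The per-order weights and the length normalisation are arbitrary assumptions; the original selection script may differ.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_representative(sentences, k, orders=(1, 2, 3), weights=(1.0, 2.0, 4.0)):
    """Rank sentences by a weighted sum of corpus n-gram frequencies (illustrative only)."""
    tokenised = [s.lower().split() for s in sentences]
    counts = {n: Counter(g for toks in tokenised for g in ngrams(toks, n)) for n in orders}

    def score(toks):
        total = sum(w * counts[n][g] for n, w in zip(orders, weights) for g in ngrams(toks, n))
        return total / max(len(toks), 1)   # normalise by sentence length

    ranked = sorted(zip(sentences, tokenised), key=lambda p: score(p[1]), reverse=True)
    return [s for s, _ in ranked[:k]]

corpus = ["merci pour votre aide", "merci pour votre réponse", "probleme d installation de norton"]
print(select_representative(corpus, k=2))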


Now that we have introduced the Acrolinx rule formalism and development methodology, we continue with a description of the different rules developed and used in this study, regrouped by category. We begin with spelling and grammar rules (3.3), followed by punctuation and spacing rules (3.4), normalisation rules treating informal language phenomena (3.5), simplification rules (3.6) and reordering and reformulation rules for the machine (3.7).

3.3 Spelling and grammar rules

As mentioned previously, community content is subject to many spelling and grammar issues, including both non-word and real-word errors. As these errors are bound to be problematic for machine translation, this is the first issue we will attempt to solve by means of pre-editing. In this section, we will see how non-word errors and real-word errors can be identified and corrected with Acrolinx.

Obviously, it would also be possible to take care of these issues with a commercial or open-source spelling and grammar checker. In our particular application context, however, this would be disadvantageous, as it would require the users to submit their texts to multiple checking processes: once for spelling and grammar, and once for the other pre-editing rules. We have therefore chosen to specialise the existing Acrolinx spelling and grammar resources to perform this task, in order to provide forum users with a single integrated checking resource. To ascertain whether checking spelling and grammar with Acrolinx is comparable to checking with a standalone tool, we have evaluated precision and recall on a forum data corpus, and compared the scores with those of one open-source and two commercial systems.

We begin this section with a description of Acrolinx resources for the identification and correction of non-word errors. This is followed by rules for real-word errors. Finally, we present the comparative evaluation of checking performance.

3.3.1 Non-word errors - Acrolinx spelling

Non-word errors are flagged by means of a spelling module. Initially, we applied the existing spelling module to our forum data. As with any unspecialised spell-checker, this resulted in numerous flags of words absent from the general language and thus absent from the system's default lexicon. These included technical terms such as antiphishing or widget, as well as proper nouns such as Firefox or Verisign. While it seems desirable to flag (and suggest replacements for) some of these words, others such as proper nouns should be ignored by the checking process. We therefore had to customise the spelling module by adapting the resources it relies on to our domain.

There are two main possibilities for adding unknown words to the Acrolinx resources. Words that should simply be ignored during the checking process can be added to a so-called gazetteer; these words will then no longer be flagged. This option is particularly suited to proper nouns such as product names or other invariable words. The second possibility is to add words to the morphological resources. Here words are stored with morphological information, which allows the system to recognise all inflected forms and their misspellings. Words added to the morphology will also be used for replacement suggestions, which is not the case for those added to the gazetteer. The list of words added to these resources was compiled from two sources: the Symantec terminology, which includes Symantec product names, and a manually compiled list of IT domain terms identified by analysing flagged data. A special case, very common in the forum data and in the IT domain in general, are anglicisms, which can be words of any category. Despite these not being correct French, they were added to the morphological resources, which allows Acrolinx to treat verbs such as upgrader or unzipper, as well as all their inflected forms.

This specialisation has enabled us to reduce the number of inadequate spelling flags. Some remain, however, as it is not possible to list all proper nouns that may be used on the forums. In particular, the forum users' aliases are a never-ending source of new unknown proper nouns, which we did not add to the gazetteer: while adding the aliases extracted from our development data would have reduced the number of unwanted flags on this same data, every new batch of data would have a new set of users. The high frequency of these unknown words may nevertheless cause problems for efficient pre-editing. As mentioned by Fontenelle (2006), a checker that flags named entities and proper nouns can create an over-abundance of flags that the user will get used to ignoring, with the inevitable consequence that real flags will be overlooked too. To deal with this problem, the ACCEPT pre-editing plugin allows users to add words to a user dictionary.
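As a simple illustration of how a gazetteer or user dictionary can suppress unwanted spelling flags, the following Python sketch filters a list of flagged tokens against an ignore list. The data structures and entries are hypothetical and unrelated to the actual plugin implementation.

# Hypothetical ignore lists combining product names and user-added entries.
gazetteer = {"firefox", "verisign", "norton"}
user_dictionary = {"misstigry"}          # e.g. a forum user's alias

def keep_flag(token: str) -> bool:
    """Discard spelling flags for tokens known to the gazetteer or user dictionary."""
    return token.lower() not in gazetteer | user_dictionary

flags = ["Misstigry", "instalation", "Verisign"]
print([t for t in flags if keep_flag(t)])   # ['instalation']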


In addition to flagging non-word errors, the Acrolinx spelling module suggests replacement candidates. These are selected from the words covered by the morphology resources, based on edit distance. At present, edit distance is the only criterion for filtering and sorting suggestions, a fairly weak information source, especially for the homophone confusions which are frequent in French (Veronis, 1988). Human intervention is thus necessary to select the right candidate. Additionally, suggestions can only be words that have been added to the morphology, so proper nouns that are not part of the lexicon cannot be suggested. In those cases, human intervention is also required. The spelling module must consequently be applied manually.
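To illustrate edit-distance-based candidate ranking, the following Python sketch uses the standard library's difflib as a stand-in for a Levenshtein computation and ranks lexicon entries by their surface similarity to a misspelled token. The tiny lexicon and the choice of difflib are assumptions for illustration, not the Acrolinx mechanism.

import difflib

# Hypothetical mini-lexicon drawn from the morphological resources.
lexicon = ["période", "périodique", "périmée", "pérore"]

def rank_candidates(misspelling: str, words, k: int = 3):
    """Rank lexicon words by surface similarity to the misspelled token."""
    scored = [(difflib.SequenceMatcher(None, misspelling, w).ratio(), w) for w in words]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(rank_candidates("periode", lexicon))   # 'période' should rank near the top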

3.3.2 Real-word errors - Acrolinx grammar rules

To correct real-word errors, Acrolinx uses grammar rules which describe error patterns. The existing Acrolinx resources for French included a number of such rules, many of which were also relevant for forum content. These rules had to be adapted to our data, mainly by adding Triggers for new error patterns found in the forum data and Negative Evidences to cover particular cases also found in this data. We complemented these rules with a number of rules for homophone confusions, which are very frequent in the forum data. The grammar rules thus include rules for agreement, verb mood and tense, sequences and homophone confusions. We will now describe the rules of each of these four categories.

3.3.2.1 Correcting agreement

Two rules handle agreement errors. The NounPhrAgr rule flags number and gender agreement errors within complex noun phrases. The SubjVerbAgr rule handles subject-verb agreement. Due to the ambiguity of some cases, these rules do not always include unique suggestions; therefore, human intervention is required to correct the flagged errors. Table 3.1 recapitulates the agreement rules and provides an example for each. The third column of the table shows the number of flags produced by each rule on an unseen corpus of 10,000 sentences of data extracted from the Norton Community forums, as well as rule precision on this same data (in parentheses). We will describe the precision evaluation methodology in detail later in this chapter (cf. 3.9), but provide the results in each of the recapitulation tables for a better overview.


Table 3.1: Grammar (agreement) rules

Rule name   | Function (example)                                                                          | Freq. (P)
NounPhrAgr  | incorrect agreement within noun phrase (Bonjour, quel version de norton 360 avez-vous ?)    | 250 (84%)
SubjVerbAgr | incorrect subject-verb agreement (Tout d’abord je vous fait part de ma configuration:)      | 189 (86%)

With error detection precisions of 84 and 86%, these two rules are among the least precise of our rules. Since causes of lower precision are not specific to rule categories, we will discuss these for all rules together later in this chapter (3.9).

3.3.2.2 Correcting verbs - mood and tense

Verb endings are an important source of errors in French, and several rules deal with these. Most of the errors are caused by confusion of forms with similar pronunciations. Table 3.2 provides an overview of these rules. The avoidCond rule flags cases where the conditional mood is used instead of the indicative imparfait after the conjunction si. We can see that this is not a particularly frequent error. The useImperative rule flags the erroneous use of indicative instead of imperative verb forms. In French, for the second person singular, these two forms are homophones, which often leads to confusion. The same goes for the future and conditional verb forms in the first person singular, which are also often confused due to their similar pronunciation. These are flagged by the futureCond rule. Both rules provide suggestions to correct these errors. The confParticipIsUs rule flags errors in past participles, which are very common in French, since the agreement rules are rather complex and, again, many endings are homophones (e.g. -i and -is). Also considering verb endings, the wrongVerbForm rule flags inconsistencies in verb phrases (e.g. *tu peux trouvé, *afin de trouvez, *aider-moi). Finally, the useSubjunctive rule flags cases where the indicative is used instead of the subjunctive.

Table 3.2: Grammar (tense/mood) rules

Rule name        | Function (example)                                                                                                                        | Freq. (P)
avoidCond        | incorrect use of conditional verb form (Il serait très utile si vous pourriez temporairement restaurer les serveurs.)                     | 5 (60%)
confParticipIsUs | incorrect past participle (Je vous dirai si c’est résolut)                                                                                | 12 (100%)
futureCond       | confusion of future and conditional verb forms (Je souhaiterai savoir s’il existe une solution)                                           | 12 (100%)
useImperative    | confusion of indicative and imperative verb forms (Quand tu démarres le PC, regardes en bas si tu vois "Boot...)                          | 4 (75%)
useSubjunctive   | missing subjunctive (Bien que je ne comprends pas ce que tu as fait.)                                                                     | 15 (67%)
wrongVerbForm    | general rule that identifies incorrect verb endings (-er, -ez, -é etc.) (Et aussi, quand que tu avais installer norton, as tu eu des problemes d’installations?) | 190 (100%)

3.3.2.3 Correcting sequences

A more general approach to correcting real-word errors in Acrolinx is to describe sequences that should not appear in correct language. Two existing Acrolinx rules function on this principle, and a new rule was created for a phenomenon identified in the forum data. An overview of these rules is provided in Table 3.3. The wrongSeq rule is an existing Acrolinx rule which identifies sequences of words of different categories that are not possible in correct French syntax, for example an article followed by a preposition, e.g. *le du (*the of). The sequences flagged by this rule are recapitulated in Table 3.4. Most of the Triggers in this rule are very general, using objects based on POS tags and morphological analysis to describe categories. For some specific Triggers, however, lexical elements are used.


Table 3.3: Grammar (sequence) rules

Rule name        | Function (example)                                                                                                     | Freq. (P)
impossibleTagSeq | impossible sequence of POS tags (mais le je trouve pas)                                                                | 55 (74%)
missingPas       | negation missing pas or plus (comme je ne sais à quoi sert cette différence)                                           | 100 (86%)
misspelledExpr   | impossible sequence of tokens (La licence de ton produit est quand à elle sauvegardée sur ton compte Norton en ligne.) | 19 (63%)

As opposed to verb ending or agreement errors, the origin of these errors is often not clearly identifiable (e.g. typos, omission of words or copy/paste errors); it is therefore not possible to provide a reliable correction suggestion, thus placing this rule among those requiring human intervention. The wrongExpr rule is similar, but focusses more on ill-written expressions, using lexical Triggers. Many of these errors are also homophone confusions. It flags 26 expressions such as *Quand à moi which should be Quant à moi, *nul par instead of nulle part, etc.

3.3.2.4 Correcting homophone confusions

Aside from the errors flagged by these rules, we observed other homophone confusion related errors in the forum data. Among others, we can frequently observe a (has) instead of à (to), sa (his) instead of ça (this) or du (of) instead of dû (had to). We have therefore chosen to write additional rules to treat these homophone errors, an overview of which is provided in Table 3.5.

Table 3.5: Homophone rules

Rule name      | Function (example)                                                                                                                                                                       | Freq. (P)
aÀ             | confusion of the verb a (has) with the preposition à (to) (Avez-vous réussi a faire l’installation de NIS 2010 en français?)                                                             | 309 (96%)
çaSa           | confusion of the demonstrative pronoun ça (this) with the possessive sa (his/her) (Merci beaucoup, je fais sa de suite)                                                                  | 24 (100%)
ceSe           | confusion of the masc. sing. demonstrative adjective ce (this) with the 3rd pers. reflexive pronoun se (himself/herself) (C’est se qui s’est passé il y un an lorsque j’ai changé mon ordinateur.) | 15 (100%)
ciSi           | confusion of the adverb ci (here) and the conjunction si (if) (Cela permet, même si on si prend à la dernière minute, de recevoir les mises à jour.)                                     | 0 (n/a)
desDès         | confusion of des (contraction of the preposition de (of) and the definite article les (the)) with the preposition dès (as soon as) (Dés que je le peux, je fais ce que tu me conseilles.) | 8 (100%)
divHomophone   | miscellaneous homophones (NIS 2013 et le gestionnaire de taches)                                                                                                                         | 50 (100%)
duDû           | confusion of the preposition du (of) and the past participle of the verb devoir, dû (Tu as du mettre une protection par mot de passe, ça ne se met pas par défaut.)                      | 19 (100%)
laLà           | confusion of the definite article la (the) and the adverb là (Par contre, la je ne vois plus vraiment de solution pour ce souci ...)                                                     | 20 (100%)
maM’a          | confusion of the possessive ma (my) and the elided m’a (indirect object pronoun me (to me) + auxiliary verb avoir (to have)) (j’ai fait comme tu ma dit plus haut j’ai enlevé Norton 360 et j’ai réinstallé Norton 360.) | 2 (100%)
niN’y          | confusion of the conjunction ni (neither) and the elided n’y (negation ne + adverb y (there)) (j’ai desinstaller et reinstaller norton mais rien ni fait, mon widget norton a disparu.)  | 0 (n/a)
nounVerbConf   | confusion of noun/verb homophones, e.g. clic (a click) and clique (click (imperative)) (Quand je clic sur ton lien j’arrive à une page qui m’indique norton internet security 2011)      | 26 (88%)
onOntSont      | confusion of the 3rd pers. sing. pronoun on and the 3rd pers. plur. present indicative of avoir (to have) (mais alors ils on dû se tromper)                                              | 7 (100%)
ouOù           | confusion of the conjunction ou (or) and the adverb où (where) (d’ou ma question...)                                                                                                     | 7 (100%)
qu’elleQuelle  | confusion of the elided qu’elle and the interrogative pronoun quelle (which) (Qu’elle est la différence et peut on charger l’une ou l’autre?)                                            | 0 (n/a)
quelQueQuelque | confusion of quel que (whichever) and the pronoun quelque (some) (La licence reste “attaché” à un utilisateur, quelque soit le PC.)                                                      | 3 (100%)
soiSoisSoit    | confusion of the pronoun soi (oneself), the 2nd pers. sing. subjunctive of être (to be) and the conjunction soit (either) (il ne peut plus faire de sauvegarde soit disant que mon Backup serait configuré à 0) | 0 (n/a)
surSûr         | confusion of the preposition sur (on) with the adjective sûr (safe) (Et je suis sur d’avoir vu au début :)                                                                               | 5 (80%)
tesT’es        | confusion of the possessive tes (your) and the incorrect elision t’es (pronoun tu (you) + verb être (to be))                                                                             |
tousTout       | confusion of the pronoun tous (all) and the adjective or adverb tout (all) (tous ceci me fait un peut peur)                                                                              | 20 (100%)

Most of these rules have two types of Triggers: the first work with general patterns, the second with specific lexical contexts. The Trigger in the following example, extracted from the duDû rule, illustrates a general pattern:

(14) Trigger(80) == @aux @du^1 @verbInf
     -> ($spell1)
     -> { mark: $spell1; suggest: $spell1 -> ‘dû’; }

This is one of several Triggers designed to identify cases where the circumflex accent has been forgotten, resulting in the preposition du instead of the past participle dû of the verb devoir.

Table 3.4: Sequences flagged by the wrongSeq rule

Pattern                   | Example
subord. conj. + aux/modal | Complétez votre posts si avez des questions
pronoun + noun            | je vous envoi en fichier txt comme demandé
det + preposition         | d’autre part désactivé le par feu windows
det + pronoun             | Une fois de plus la je viens juste d’avoir une croix rouge
pronoun + det             | [...] mais ils le support n’ont pas trouvé
det + det                 | les utilisateurs sous XP pourront charger une la nouvelle version
det + not-noun            | Bloquer tout le trafique réseau.
det + verb                | NIS ne sait pas le trouvé dans son analyse
prep + verb fin           | il ma tél il à réussit à se connecter au pc
pronoun + pronoun         | Cependant, sur cette page je on remarque que [...]

It defines the following sequence: auxiliary verb + du + infinitive verb, and will be able to find error candidates such as *J’ai du oublier or *vous avez du réinstaller. In order to allow this pattern to cover more cases, such as *J’ai du tout oublier, *vous n’avez pas du réinstaller or *il a du s’installer tout seul, optional elements have been added, such as adverbs and pronouns:

(15) Trigger(80) == @aux [{@adv|‘pas’|‘plus’}]* @du^1 [@adv]* [@pron]* @verbInf
     -> ($spell1)
     -> { mark: $spell1; suggest: $spell1 -> ‘dû’; }
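The widening effect of optional elements can be mimicked with an ordinary regular expression. The following Python sketch is a rough, lexically limited approximation of the pattern in (15), with hand-picked auxiliary forms, adverbs and pronouns; all of these lists are assumptions for illustration and the sketch is not the actual Trigger.

import re

# Crude stand-ins for @aux, optional @adv/@pron, and @verbInf (only -er infinitives).
aux = r"(?:ai|as|a|avons|avez|ont)"
adv = r"(?:pas|plus|bien|tout|jamais)"
pron = r"(?:se|s'|me|te)"
pattern = re.compile(
    rf"\b{aux}\s+(?:{adv}\s+)*du\s+(?:{adv}\s+)*(?:{pron}\s*)?\w+er\b",
    re.IGNORECASE,
)

for s in ["J'ai du oublier", "vous n'avez pas du réinstaller", "il a du s'installer tout seul"]:
    print(bool(pattern.search(s)), s)   # True for all three error candidates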

This example illustrates two difficulties of Trigger development. The first is the necessity of increasing coverage of Triggers to produce rules with reasonable recall. This can be done either by including optional elements, as shown above, or by multiple Triggers describing different patterns. In our development process, the first patterns considered for Triggers were mostly motivated by real errors found in our data. We then complemented these Triggers by analysing correct sentences, i.e. locating correct occurrences of dû and using the surrounding patterns to write Triggers to flag incorrect occurrences of du. While we thereby identified many more potential error patterns than those effectively occurring in our test data, this process by no means ensures comprehensive coverage of all possible error patterns.

The second difficulty lies in the reliability of the context. To make sure the Trigger flags only valid error candidates, we rely on the context. If we describe this context using objects that depend on analysis results, such as POS tags or morphological information, the effectiveness of the Trigger depends on the success of the underlying analysis steps. The Trigger above, for example, relies on the correct identification of the two verbs surrounding du. If the verb preceding du contains a typo (*aevz du réinstaller) or another spelling error (*il à du s’installer), it will not be identified as an auxiliary verb, and therefore the Trigger will not be activated. While this is not a major issue in regular written texts, where errors are only occasional and it can be assumed that no two or three adjacent words would each contain an error, it is problematic for forum data, where the density of errors is higher. In some cases, we have written Triggers specific to error contexts, for example including a instead of à to accommodate the common homophone confusion.

In addition to the general Triggers, the homophone rules also contain series of Triggers that are more specific to one given expression, and use mostly lexical objects. These are very efficient for fixed expressions, such as can be found with the preposition à. The aÀ rule contains close to 150 of these Triggers, of which an extract is shown in the following example:

(16) Trigger(80) == @verbFaire ‘face’ ‘a’^1 -> ($a) -> mark: $a;
     Trigger(80) == ‘eu’ ‘égard’ ‘a’^1 -> ($a) -> mark: $a;
     Trigger(80) == ‘a’^1 ‘bon’ ‘compte’ -> ($a) -> mark: $a;

While all the homophone rules correct errors, the aÀ rule has one particularity: it also adds missing accents on upper case A, although it is not exactly an error to write the preposition à (to) without an accent when capitalised. While both accented and unaccented capitals are arguably correct French (AST, 2000), we have chosen to add accents wherever possible, since they reduce ambiguity for the machine, and bring the text closer to the SMT training data, which is formal French where capitals are usually accented.

As most of these homophone confusions involve only two words, providing a unique replacement candidate is straightforward. Nearly all of the homophone rules could therefore be applied automatically, given sufficient precision. As automatic application is desirable in our context, development of these rules has been geared towards high precision, at the expense of recall. In a different pre-editing scenario, where more interactive pre-editing could be asked of users, we might have constrained the Triggers less, favouring recall over precision.

We have so far described all the rules that correct spelling or grammar issues. We expect the combination of these rules with the spelling module to perform more or less the same corrections as a standard spelling checker. Ideally, the total effect should be similar to that achieved if a forum user ran their post through a spell-checker before publishing it on a forum. To verify this, we will now compare the performance of these rules with other spell-checking tools.

3.3.3 Performance of spelling and grammar rules on forum data

For our study, we have chosen to use Acrolinx for pre-editing as it allows us to develop rules of many different types, all in one single tool, thereby making the process more accessible to forum users. While correcting spelling and grammar is not the sole objective of our pre-editing process, it is still a non-negligible component. To ascertain whether this component achieves a performance comparable to current state-of-the-art spelling and grammar checking tools available for French, we will measure precision and recall of the Acrolinx checking resources on forum data, and compare this with two commercial checkers (Antidote 6.1, http://www.antidote.info/, and the integrated checker of Microsoft Word 2010) and one open-source checker (LanguageTool (LT), https://languagetool.org, accessed 10.2014). For the three checkers, default settings were used, in order to produce results most similar to a scenario where a forum user would apply these tools without particular knowledge.

It should be noted that this is a comparison between very different systems. Neither Word nor Antidote has been tuned for the forum domain, but these systems have been developed over many years specifically to check spelling and grammar. On the other hand, the adaptation and development of Acrolinx rules for ACCEPT was based on errors identified in forum data, but these rules have only been tested and fine-tuned on a limited amount of data. Before this specialisation, the principal focus of the existing Acrolinx resources for French was style checking and language control. At this stage, our objective is not to outperform the other systems, but merely to ascertain that forum users will be able to achieve comparable results when checking text with ACCEPT as with other systems.

For this experiment, we extracted 500 random sentences from unseen Norton Community forum data, for a total of 5,269 words. These sentences were checked with the four tools, and in parallel errors were annotated manually by a native French speaker, differentiating non-word and real-word errors. The corpus was found to contain 83 non-word errors and 79 real-word errors.

For all systems, we have computed distinct scores for error detection and error correction. For error detection, we have considered as correct all flags where the highlighted words or sequences were indeed incorrect (Starlander & Popescu-Belis, 2002). The matter of error correction is less straightforward, as systems provide different levels of correction. This can consist of 1) a unique replacement candidate, 2) a list of replacement candidates, 3) an error description (e.g. check agreement between subject and verb), or 4) a simple indication that there is an error, with no further information as to its nature. In our context, where we want pre-editing to be either automatic or to require as little effort and skill as possible, unique replacement candidates are clearly the best option. Consequently, for the evaluation of error correction, only cases with correct unique suggestions were considered as true positives. We will now present the results of this evaluation, beginning with non-word errors.
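For readers who want the exact formulas used throughout this evaluation, the following Python sketch computes detection precision, recall and f-measure from true positive, false positive and false negative counts; the example counts are invented for illustration.

def detection_scores(tp: int, fp: int, fn: int):
    """Precision, recall and (balanced) f-measure for error detection."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Invented counts: 72 correct flags, 0 wrong flags, 11 errors missed.
print(detection_scores(tp=72, fp=0, fn=11))  # (1.0, ~0.87, ~0.93)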

3.3.3.1 Non-word errors

For the annotation of non-word errors, we chose to ignore words that were in fact correct, but absent from the systems' lexicons, such as proper nouns or domain terminology, on the assumption that these could be added to the systems' resources in some form of user dictionary to avoid flagging. It should be noted, however, that, unlike the other systems, LT does not provide this possibility.

Error detection  The precision of error detection is defined as the proportion of real errors (true positives) among all the potential errors identified by the system (true positives + false positives). Table 3.6 shows the precision and recall scores for the four tested systems, as well as the f-measure for error detection of non-words.

Table 3.6: Precision and recall of non-word error detection on 500 sentences

            | ACCEPT | Antidote | Word | LanguageTool
total flags | 75     | 84       | 75   | 79
precision   | 1.00   | 1.00     | 1.00 | 1.00
recall      | 0.87   | 0.98     | 0.87 | 0.92
f-measure   | 0.93   | 0.99     | 0.93 | 0.96

We observe that Antidote produces the most flags (84), followed by LT (79). ACCEPT and Word produce slightly fewer flags (75). All systems have a precision of 1.00, indicating that all flagged tokens were indeed non-word errors. However, the systems differ in terms of recall. Recall is defined as the proportion of errors correctly flagged (true positives) among the total number of errors in a text (true positives + false negatives). For non-word errors, all four systems perform well in terms of recall, with ACCEPT pre-editing performing similarly to MS Word, and slightly less well than Antidote and LT.

A closer inspection of the 8 errors that were not found by ACCEPT pre-editing shows that 3 are words with a ré- prefix. Acrolinx allows composition of words by combination of certain prefixes and existing words (the Acrolinx morphological resources for French include a list of 57 prefixes, such as archi, inter, poly, etc.; an exclusion list contains incorrect combinations), which explains why these were not flagged. However, in French the prefix ré should not be used with a hyphen, which is why these three words were flagged by both other tools. It would be possible to configure Acrolinx not to allow composition of words with prefixes. In our particular context, however, this would only be productive if it improved machine translation, which was not the case for the few examples encountered in our corpus and the ACCEPT SMT system. To confirm this, we could perform a quantitative analysis of the impact on MT of prefix usage, with or without hyphenation, which could be the object of future work. The same composition mechanism led Acrolinx not to flag the French non-word periode, as it was identified as the combination of the prefix per with the masculine singular noun iode. As this word was used in a context which did not allow identification of the gender of the noun (*renouvellement apres periode d essai), it was not identified as a potential misspelling of the feminine période. Had it appeared in a context indicating a feminine noun (*la periode), it would have been identified as an error. While allowing free prefix-word combinations increases the coverage of a system, it can have some adverse effects, as in this case, where an allowed combination results in a misspelling of another existing word.

Another three of the errors that were not flagged can be attributed to words with missing accents on capital letters, such as Etes or Etat, which Acrolinx does not flag with the current correction setup. The presence or not of accents on capital letters does not seem to have a major impact on the ACCEPT SMT system, probably because both versions are present in the training data. Acrolinx could be configured to flag these cases, if necessary.

Error correction  As mentioned above, for error correction we have considered only correct unique suggestions. Precision of error correction is defined as the proportion of correct unique replacement suggestions among the number of correctly identified errors. Recall of error correction is defined as the proportion of correct unique replacement suggestions among the total number of errors in the text. Table 3.7 shows the results for the four systems.

Table 3.7: Precision and recall of non-word error correction on 500 sentences

          | ACCEPT | Antidote | Word | LanguageTool
precision | 0.45   | 0.75     | 0.17 | 0.00
recall    | 0.40   | 0.73     | 0.15 | 0.00
f-measure | 0.42   | 0.74     | 0.16 | 0.00


Regarding correction, there are large differences between the systems, with Antidote surpassing the other three systems both in terms of precision and recall. ACCEPT is second best, and Word achieves a very low score. LT does not provide correction suggestions at all for non-word errors, hence its scores of 0. A closer analysis of the types of corrections for each true positive flag reveals different distributions for the systems. Results are shown in Table 3.8. For ACCEPT, corrections are more or less equally distributed among multiple and unique replacements. Antidote, on the other hand, provides far more unique replacements, while Word more often provides multiple replacements.

Table 3.8: Distribution of correction types for non-word errors on 500 sentences

                          | ACCEPT | Antidote | Word | LanguageTool
total correct error flags | 75     | 84       | 75   | 79
unique suggest.           | 34     | 63       | 13   | 0
multiple suggest.         | 31     | 5        | 54   | 0
error description         | 0      | 0        | 0    | 0
none / wrong suggest.     | 10     | 16       | 8    | 79

If we take into account multiple suggestions for the precision and recall scores (considering these as correct if the correct solution is among the list of replacement candidates), results are more balanced between the three systems that offer corrections, as shown in Table 3.9.

Table 3.9: Precision and recall of non-word error correction on 500 sentences, taking into account cases with multiple replacement suggestions

          | ACCEPT | Antidote | Word | LanguageTool
precision | 0.87   | 0.81     | 0.89 | 0.00
recall    | 0.76   | 0.79     | 0.78 | 0.00
f-measure | 0.81   | 0.80     | 0.83 | 0.00

Here the three systems are very close in terms of f-measure. However, this calculation does not take into account the number of suggestions provided in the multiple-suggestion case, nor does it consider the position of the correct suggestion in the list of suggestions, two aspects which clearly influence the usefulness of a list of suggestions. It does, however, give us an indication of the proportion of cases where the user is provided some form of help to solve an issue, which turns out to be similar for Antidote, Word and ACCEPT.

3.3.3.2 Real-word errors

Error detection  Precision and recall scores for the detection of real-word errors are shown in Table 3.10.

Table 3.10: Precision and recall of real-word error detection on 500 sentences

            | ACCEPT | Antidote | Word | LanguageTool
total flags | 50     | 53       | 34   | 47
precision   | 0.86   | 0.87     | 0.94 | 0.91
recall      | 0.54   | 0.58     | 0.41 | 0.54
f-measure   | 0.67   | 0.70     | 0.57 | 0.68

In terms of precision, Word performs better than the other systems (0.94); however, it should be noted that this system also produced considerably fewer flags than the other systems, which is reflected in the lower recall score. The other three systems are close in terms both of precision and recall, and thus also achieve close f-measures. In comparison to non-word errors, recall is low for all of the systems, with 0.58 recall for the best performer, Antidote.

Interestingly, the different systems fail on different issues. The most frequent source of real-word errors are homophone confusions such as the a/à confusion, of which we annotated 34 in total. While the Acrolinx rules flag 25 of these, Antidote flags 17, and Word flags 20. These results are very satisfactory, since one of our focuses in rule development was the correction of homophone confusions (cf. 3.3.2.4). Real-word errors also include bad verb endings or noun-verb confusions, of which there are 19 in total. Our rules flag only 7 of these, against 13 for Antidote, 6 for Word and 10 for LT. Identifying these is difficult, since the context is often very messy, as in the following examples:

(17) C’est soit un problème de windows 7 ou de norton, puisque ca affect plussieurs applications au même temps.
Mais franchement, je n’aimerais pas que le continu à me prendre pour une bille.

In the first example, both adjacent words are misspelled (*ca instead of ça and *plussieurs instead of plusieurs), thereby making it difficult to identify that affect is misspelled too (here the homophone noun affect is used instead of the verb affecte). The second example illustrates omission of a word, in this case the subject of the misspelled verb continu. The absence of a subject makes it impossible to identify the misspelled verb. None of the four tested systems identified either of these errors.

The lower performance on real-word errors is unsurprising since, as we have seen in Chapter 2, these are far more complicated to identify. Locating these errors requires some form of analysis of the context, which will obviously be disturbed by the messy context in forum data. Recall could perhaps be improved by a carefully designed sequential rule application, which could reduce the problem of messy contexts.

Error correction  For correction of real-word errors, we have again taken into account only unique correct suggestions. Precision and recall scores are shown in Table 3.11.

Table 3.11: Precision and recall of real-word error correction on 500 sentences

          | ACCEPT | Antidote | Word | LanguageTool
precision | 0.65   | 0.91     | 0.88 | 0.37
recall    | 0.35   | 0.53     | 0.35 | 0.20
f-measure | 0.46   | 0.67     | 0.50 | 0.26

Similarly to the results for non-word errors, Antidote outperforms the other systems in terms of precision and recall. Again, this results from Antidote's ability to provide correct unique suggestions. Table 3.12 shows the distribution of corrections. We observe that neither Antidote nor Word provides error descriptions, while these represent over half of the corrections provided by LT, and a fifth of those provided by ACCEPT. For Antidote and Word, most corrections are unique suggestions. Despite the large share of correct unique suggestions, Word is again penalised in terms of recall due to the smaller number of errors identified.

Table 3.12: Distribution of correction types for real-word errors on 500 sentences

                          | ACCEPT | Antidote | Word | LanguageTool
total correct error flags | 43     | 46       | 32   | 43
unique suggest.           | 28     | 42       | 28   | 16
multiple suggest.         | 3      | 1        | 4    | 0
error description         | 8      | 0        | 0    | 18
none / wrong suggest.     | 4      | 3        | 0    | 9

In summary, regarding non-word error detection, ACCEPT performs as well as Word and slightly less well than LT and Antidote. For this task, all systems achieve perfect precision and high f-measures (>0.90). Non-word error correction, on the other hand, shows large differences between the systems, with ACCEPT achieving a lower score than Antidote, but a better score than Word, and an f-measure of 0.42. As expected, real-word errors are more challenging for all of the systems. For error detection, ACCEPT performs similarly to Antidote and LT (f-measure between 0.67 and 0.70), and better than Word, which has comparatively low recall. Finally, for correction, there are again large differences between the systems, with Antidote clearly outperforming the others. ACCEPT has lower precision than Antidote and Word, but the same recall as Word.

3.3.4 Spelling and grammar - summary

In this section we have presented the Acrolinx resources for spelling and grammar checking. These include a spelling module for non-word error checking, and a number of rules for real-word error checking. The spelling module was customised for the forum data by addition of domain-specific lexical items to the morphological resources, used both for error identification and as a source of replacement candidates. Rules for real-word errors are a combination of existing Acrolinx rules that were adapted to the forum data, and new rules mainly for the correction of homophone confusions.

A comparison with two commercial (Antidote and Word) and one open-source (LT) spell-checker has shown that spelling and grammar correction with our rules is comparable. In terms of error detection, our rules perform similarly to the other systems. Error correction, when considering only unique replacements, could still be improved, as many rules only provide multiple replacements or error descriptions. Despite this, in terms of spelling and grammar correction, considering an interactive checking process, we expect that our rules would produce results similar to what can be achieved with commercial tools. However, spelling and grammar are not the only issues with community content that might impair machine translation. We now continue our description of the pre-editing rules and proceed to those dealing with punctuation and spacing.

3.4 Punctuation and spacing rules

The original Acrolinx resources contain a number of punctuation and spacing rules. While punctuation in general might not seem to be the most striking issue when considering the multitude of errors in forum data, it plays an important role in text segmentation and tokenisation, and is therefore of interest for any subsequent processing. Certain punctuation marks, in particular hyphens, are also important due to their capacity to create compound words. We therefore chose to apply the punctuation rules to our data. These rules treat punctuation issues such as missing commas, hyphens and elisions, as well as spacing issues. Another rule has been included which corrects capitalisation. An overview of these rules is provided in Table 3.13. The third column of the table shows the number of flags produced by each rule on an unseen corpus of 10,000 sentences of data extracted from the Norton Community forums, as well as rule precision on this same data (in parentheses).

Table 3.13: Punctuation and spacing rules

Rule name      | Function (example)                                                                                                                                                              | Freq. (P)
addComma       | insert a comma before coordinating conjunctions or after certain expressions (e.g. pourtant, si oui, etc.) (je ne comprend pas ce qui s’est passé mais j’ai trois comptes Norton.) | 469 (100%)
addCommaPP     | insert comma after prepositional phrase (Dans mon registre je n’ai aucune clé qui commencent par SMRxxxx)                                                                       | 79 (100%)
addElision     | elide articles and pronouns (Est-ce que il s’agit ici de Norton Online Backup ou du backup de Norton 360?)                                                                      | 93 (92%)
addHyphen      | insert missing hyphens, both in words (e.g. parefeu) and in expressions (e.g. dois-je) (c’est à dire ?)                                                                         | 244 (86%)
addSpaceWords  | insert a space between concatenated words or tokens (J’ai essayé d’uploader un fichier de 4ko seulement, et le problème est le même.)                                           | 185 (100%)
addSpacePunct  | correct spacing around punctuation (e.g. insert space before colon) (Comme l’a stipulé Misstigry, il s’agit de services.)                                                       | 2552 (100%)
addSpacePunct2 | correct spacing around punctuation (e.g. insert space after EOS punctuation) (je pense que je n’ai pas dû activer la clé du produit lors de mon reformatage.comment le faire?.) | n/a
avoid(char)    | do not use characters between parentheses (e.g. (s) for optional plural) (Elément(s) de données du Registre infecté(s):)                                                        | 5 (100%)
avoidCapSent   | do not write entire sentences in upper case (NORTON SECURITY POUR MAC EN FRANCAIS SVP)                                                                                          | 303 (94%)
avoidPunct     | remove superfluous punctuation such as commas (Reviens nous voir si tu as d’autres questions, et tiens nous au courant du résultat !)                                           | 230 (100%)
checkEOS/BOS   | check absence of punctuation at potential sentence boundary (J’espère avoir pu mieux vous éclairer Il est vrai que l’évolution vers le cloud a obligé à quelques changements.)  | n/a
removeSpace    | remove superfluous space (je n’ai pas envie  de tout refaire !!)                                                                                                                | n/a
wrongCap       | avoid capitalisation in the middle of the sentence (Norton Ghost est plus complet et Te permet de faire un image sur un disque dur ou sur un DVD.)                              | n/a
wrongElision   | remove incorrect elisions (Le “DiskPart” d’XP ne permet pas)                                                                                                                    | n/a

In principle, it would have been interesting to compare the performance of these punctuation rules with that of the other checkers, as we did for the spelling and grammar rules. This was not done for the following reason. Our data are extracted from web forums, and in their original state they contain various HTML markup, mainly defining formatting such as line breaks, bullets, etc. The markup in the extracted posts was neither complete nor compliant with a rigorous HTML standard. At the beginning of the ACCEPT project, to simplify processing, it was therefore decided to transform the data into plain text files, and all markup was stripped by means of a Python script. Unfortunately, this process sometimes produced artefacts, for example introducing multiple spaces when replacing consecutive formatting tags, or concatenating strings that should not have been concatenated. We therefore cannot be entirely sure whether the phenomena identified by our rules, especially those regarding spaces, are real errors committed by the forum users or artefacts introduced by the pre-processing. Computing precision and recall on these data would therefore not produce representative results. These rules will be discussed individually and in further detail in the following chapter (4.4.2.5), in the context of the evaluation of their impact on machine translation.
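The cleaning script itself is not reproduced in this thesis; the following minimal sketch, which assumes a simple regex-based approach rather than the actual implementation, merely illustrates how naive markup stripping can produce the two kinds of artefacts mentioned above.

```python
import re

def strip_markup_naive(html):
    """Remove tags by deletion: adjacent text runs may end up concatenated."""
    return re.sub(r"<[^>]+>", "", html)

def strip_markup_spaced(html):
    """Remove tags by replacing each with a space: consecutive tags yield multiple spaces."""
    return re.sub(r"<[^>]+>", " ", html)

post = "ligne 1<br><br>ligne 2<ul><li>point</li></ul>fin"
print(repr(strip_markup_naive(post)))   # 'ligne 1ligne 2pointfin'        (unwanted concatenation)
print(repr(strip_markup_spaced(post)))  # 'ligne 1  ligne 2  point  fin'  (extra spaces)
```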

3.5 Informal language rules

As we have seen, a very frequent issue in user-generated content is informal language; since this type of language does not appear in the SMT training data, with the exception of the small corpus extracted from the forums and used to tune the ACCEPT SMT system, it is nearly always translated badly. Two aspects of informal language are problematic: 1) colloquial words, anglicisms and abbreviations that do not appear in the training data, and 2) informal structures that, although they use words present in the training data, combine them in unusual sequences, grammatically correct or not, that do not appear as such in the training data. We have therefore developed different rules for informal language, focussing on both lexical and syntactic phenomena, which we shall now describe in further detail. The rules are recapitulated in Table 3.14.

Table 3.14: Informal language rules

Rule name | Function | Freq. (P)

avoidAnglicism | replace anglicisms by French equivalents | 96 (98%)
  Example: je vais rebooter et voir si c'est toujours le cas.
avoidDirQuest | avoid questions without subject-verb inversion | 28 (100%)
  Example: Tu as lu le tuto sur le forum ?
avoidEstCeQue | avoid questions with est-ce que | 8 (100%)
  Example: Est ce que je dois désinstaller et réinstaller Norton après coup ou pas?
avoidFAbbrev | avoid abbreviations | 468 (100%)
  Example: Je suis allé sur le site, j'ai téléchargé l'appli.
avoidInformal | avoid informal vocabulary | 63 (100%)
  Example: Voilà deux jours que je galère à essayer de créer ce SRD personnalisé sans succès.
missingNe | do not write negations without ne | 100 (86%)
  Example: je vais relire ce que tu m'as dit sur les antivirus pour être sure de pas avoir de problèmes!

3.5.1 Informal vocabulary

In forums, and perhaps even more so in technical forums, the use of jargon, anglicisms and more or less creative abbreviations is very common. To identify these, we used a list of the out-of-vocabulary (OOV) items of the SMT system, which are a good indicator of the data that is not covered by the training set (Banerjee et al., 2012). From these lists, we extracted colloquial expressions, anglicisms and abbreviations, which we used to customise existing Acrolinx rules for these three phenomena. Whenever possible, these rules replace such tokens with correct French words and expressions. In this sense, applying these rules is akin to normalisation.

The avoidInformal rule flags colloquial words or expressions such as paumer (to lose), plouc (bumpkin) or bof (meh). The rule treats 19 verbs and 34 nouns, based on the tokens identified in the training corpus. In order to treat verbs correctly, the informal verbs were added to the morphological resources. This allows the rule to match all conjugated forms of these verbs and identify their inflections, so that they can be replaced with a correctly inflected alternative. Besides nouns, adjectives and verbs, this rule also flags interjections such as grrrr, Eeeehh, bof, etc. These interjections do not follow a standard spelling and are often creatively adapted by multiplying letters. To cover as many variations as possible, the rule uses an object based on a regular expression (a sketch of such a pattern is given at the end of this subsection). For most words, a suggestion of a correct equivalent is provided. Occasionally, there is no one-to-one equivalent and the entire phrase needs to be changed accordingly, in which case no suggestion can be given. For interjections, no suggestion is provided, as we assume that they can simply be removed without major impact on the informative content of a sentence.

The rule that treats anglicisms, avoidAnglicism, is very similar to avoidInformal. It replaces 73 nouns, 22 verbs and 3 adjectives. Finding replacements for anglicisms is not as simple as for colloquial language, especially in a technical domain. The pervasiveness of English in the information technology domain has led to the adoption of countless anglicisms, and in some cases the anglicism is far more common than its French equivalent. An example is the French word courriel (email). In our training corpus, courriel appears only 14 times, while its anglicism counterparts are more frequent, with 64 occurrences in total (40 mail, 20 email and 4 e-mail). Although this suggests that anglicisms may be more common in general usage, they remain infrequent in the more formal training data and therefore need to be replaced.

Finally, we have a rule to replace abbreviations, avoidFAbbrev. It covers both common French abbreviations such as qqn for quelqu'un (someone) and more forum-specific ones such as màj for mise à jour (update). Some abbreviations are ambiguous and need human intervention to choose the correct word based on context. The abbreviation perso, for example, could mean personnage (character), personnel (personal) or personnellement (personally), depending on the context.

Resolving colloquial words, anglicisms and abbreviations does not take care of all the OOV items. A large number of words that fit into none of these categories remain: informal second person verb forms and pronouns. These differ from the informal language discussed above insofar as they are perfectly correct French. They are problematic in our context because of a register mismatch between the forum data and the SMT training data, which we shall discuss later in this chapter (3.7.5) in the context of the rules for the machine. Besides these lexical issues, informal writing also presents syntactic issues, which we will discuss in the next section.
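As announced above, the interjection matching can be pictured with the pattern below. This is a minimal, hypothetical approximation in Python: the actual object is defined in the Acrolinx rule formalism, and the interjection stems listed here are purely illustrative.

```python
import re

# Hypothetical stems; the actual rule's pattern is not reproduced here.
INTERJECTION = re.compile(r"\b(?:gr+|e+h+|bo+f+|arg+h*|pf+f*)\b", re.IGNORECASE)

for token in ["grrrr", "Eeeehh", "bof", "pff", "super"]:
    print(token, bool(INTERJECTION.fullmatch(token)))
# grrrr True, Eeeehh True, bof True, pff True, super False
```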

3.5.2 Informal syntactic structures

The tendency to write as one speaks leads to some stylistically unconventional structures, such as informal types of questions: statement questions or the use of est-ce que. While both types of questions are not wrong per se, they have little chance of occurring in the more formal training data and might therefore be mistranslated. We have created rules for each of these, to replace them with more formal equivalents.

A typical error of spoken language is the incomplete or incorrect negation. While a correct negation follows the ne ... pas or ne ... plus construction, dropping the ne is very common in spoken language, and thus also very frequent on the forums. The following example shows an incomplete negation:

(18) *et en y réflichissant bien c'est pas plus mal
→ et en y réflichissant bien ce n'est pas plus mal

The incomplNeg rule corrects these errors. A second rule, wrongNeg, corrects the opposite case, where pas or plus is dropped, but those cases are far less frequent.

In French spoken language, it is common to express questions using the same structure as a declarative, i.e. without inversion of subject and predicate, with the intonation as the only indicator that the utterance is a question. In informal writing this translates into a declarative structure where nothing but the question mark at the end indicates the interrogative nature of the sentence. The avoidDirQuest rule corrects this by introducing the inversion, as shown in the following example:

(19) Raw: Tu n'as que ce produit Norton d'installé ?
Pre-edited: N'as-tu que ce produit Norton d'installé ?

Another common way of asking questions in French is to add est-ce que in front of a statement. While this structure is correct French, it is seldom used in formal written French. Such questions are therefore mostly absent from the training corpus and translated badly (Rayner et al., 2012). Similarly to the avoidDirQuest rule, the avoidEstCeQue rule replaces them with subject-predicate inversions, as shown in the following example:

(20) Raw: Est-ce que tu as fait la mise à jour ?
Pre-edited: As-tu fait la mise à jour ?

The example shown in Table 3.14 for the avoidEstCeQue rule illustrates another issue with rule development for forum data. In this example, the hyphen in Est-ce has been omitted. To enable our rule to find this structure anyway, we have two possibilities: either we apply a rule correcting hyphenation before the avoidEstCeQue rule, or we include a Trigger with the hyphen-less pattern in our rule. In our particular context, as sequential application had not yet been determined at the time of rule development, we included this additional pattern (a rough sketch is given at the end of this section).

Overall, the informal language rules have a high precision. With the exception of avoidAnglicism (98%) and incomplNeg (86%), all rules achieved a precision of 100%.
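The hyphen-tolerant trigger referred to above can be pictured as a plain regular expression; this is only an illustrative sketch, since the actual Acrolinx Trigger is written over token and morphology annotations rather than surface strings.

```python
import re

# Matches "est-ce que" questions whether or not the hyphen was typed.
EST_CE_QUE = re.compile(r"\b[Ee]st[ -]ce qu(?:e\b|')")

for s in ["Est-ce que tu as fait la mise à jour ?",
          "Est ce que je dois désinstaller Norton ?",
          "Est-ce qu'il s'agit de Norton 360 ?"]:
    print(bool(EST_CE_QUE.search(s)), s)  # True in all three cases
```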

3.6 Controlled language and simplification rules

A common form of pre-editing is text simplification, whether for a human reader or for machine processing. When performed systematically, this resembles some controlled natural languages. While we have no ambition to fully control forum data, we have developed several simplification rules, which are recapitulated in Table 3.15.

Table 3.15: Simplification rules

Rule name | Function | Freq. (P)

avoidAdverbs | avoid unnecessary adverbs, e.g. vraiment (really) | 154 (100%)
  Example: J'ai vraiment besoin d'aide ! → J'ai besoin d'aide !
avoidCeQui | split sentences by replacing ce qui with Ceci | 21 (100%)
  Example: Je passe la souris (sans cliquer) sur la bannière en question, ce qui me donne en bas de l'écran le nom de la société de publicité. → Je passe la souris (sans cliquer) sur la bannière en question. Ceci me donne en bas de l'écran le nom de la société de publicité.
avoidCleftSent | avoid cleft sentences | 16 (100%)
  Example: C'est Norton qui l'a supprimé ? → Norton l'a supprimé ?
avoidConjBOS | avoid using conjunctions at BOS | 202 (100%)
  Example: Et Merci à qui prendra la peine de me conseiller. → Merci à qui prendra la peine de me conseiller.
avoidNon | avoid using non at the end of interrogative sentences | 4 (100%)
  Example: Je suis devenue plus rassurée, c'est déjà un pas vers la sérénité, non? → Je suis devenue plus rassurée, c'est déjà un pas vers la sérénité?
avoidPresPart | use constructions with comme or puisque instead of the present participle | 140 (100%)
  Example: Effectuez un test avec une autre suite de sécurité ayant les mêmes fonctionalités que NIS → Effectuez un test avec une autre suite de sécurité qui a les mêmes fonctionalités que NIS

The most common simplification rule, which appears in most controlled languages, is the rule that advocates the use of short, simple sentences (O'Brien, 2003). Acrolinx has a rule to flag long sentences, which can be customised to define the threshold beyond which a sentence is considered too long. A number of forum sentences would certainly benefit from splitting, yet we decided not to use this rule. While asking a professional technical writer to split a sentence seems reasonable, asking a forum user to do so seems more difficult, unless further information is provided as to how the sentence could or should be split. The following example shows such a long sentence, which would not be easy to split:

(21) J'ai bien du tenter la connection à mon compte Norton une quarantaine de fois avant de pouvoir inscrire le produit sur mon compte et ainsi recevoir le serial après achat.

We therefore chose to investigate rules which would go beyond merely identifying long sentences, by identifying structures where they could be split. By analysing long sentences from our development corpus, we identified one structure that lends itself very well to splitting, as shown in the following example:

(22) Raw: Ce qu'il serait intéressant de savoir, c'est si vous pouvez installer NIS 2013 avec votre licence 2012, ce qui est possible avec une version en cours 2012 valable 1 an.
Pre-edited: Ce qu'il serait intéressant de savoir, c'est si vous pouvez installer NIS 2013 avec votre licence 2012. Ceci est possible avec une version en cours 2012 valable 1 an.

Here the second part of the sentence, a subordinate clause introduced by ce qui (which), can easily be split off to become a sentence of its own, linked to the previous sentence by the pronoun ceci (this). Since the replacement is straightforward, we have developed a rule that performs it automatically, avoidCeQui. Other similar patterns might be found to split sentences, though the disparate nature of the forum data is not favourable to finding recurrent patterns.

Another simplification rule that we developed specifically for forum data is avoidCleftSent. This rule flags cleft sentences, i.e. sentences where a specific constituent is emphasised by being placed in a separate clause, as shown in the following example:

(23) C'est un crash qui peut être dû à un problème matériel.

While these sentences are perfectly correct French, they are unnecessarily complex. Nested constructions with multiple clauses are difficult to process automatically, especially when they cause structural ambiguity (Koehn, 2010). We therefore created a rule to transform these sentences into a simpler equivalent:

(24) Ce crash peut être dû à un problème matériel.

While the emphasis is lost, the constituents remain and the resulting sentence is simpler.
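To give a concrete, if simplified, picture of the avoidCeQui and avoidCleftSent transformations, the following sketch approximates them with plain regular expressions; the actual Acrolinx rules operate on linguistic annotations and cover many more contexts, so this is illustrative only.

```python
import re

def avoid_ce_qui(s):
    # "..., ce qui X" -> "... . Ceci X"
    return re.sub(r",\s*ce qui\s+", ". Ceci ", s)

def avoid_cleft(s):
    # "C'est X qui Y" -> "X Y"
    return re.sub(r"^C'est\s+(.+?)\s+qui\s+", r"\1 ", s)

print(avoid_ce_qui("Je passe la souris sur la bannière, ce qui me donne le nom de la société."))
# Je passe la souris sur la bannière. Ceci me donne le nom de la société.
print(avoid_cleft("C'est Norton qui l'a supprimé ?"))
# Norton l'a supprimé ?
```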

We also used one existing Acrolinx style rule that flags present participles, avoidPresPart. In French, the present participle can be used in three ways: in a gerund (preceded by the preposition en (by), e.g. en faisant ceci (by doing this)), as a verb with a complement, mostly expressing causality, e.g. l'ordi ne démarrant pas (as the computer does not boot), or as a verbal adjective, e.g. un bug énervant (an annoying bug). Our rule focuses on the second case, replacing the present participle with an explicit expression of causality, using the conjunctions comme or puisque (l'ordi ne démarrant pas → comme l'ordi ne démarre pas). To make this rule easier for forum users to apply, we have added reformulation suggestions for the cases where this is possible.

Finally, we have a set of rules that remove unnecessary words such as adverbs or conjunctions at BOS. The objective of these rules, some of which are adapted from existing Acrolinx rules, is to remove tokens that do not contribute to the informational content of a sentence.

By correcting spelling and grammar, reducing informality and performing some simplifications, we have treated the most obvious difficulties for machine translation of our data. In a further step, we decided to investigate source transformations that specifically improve translation with the ACCEPT SMT system, which we shall now describe.

3.7 Rules for the machine

As opposed to the previously described rules, these transformations are machine-oriented and have no ambition to improve the source. Their only aim is to improve translatability, even if this means degrading the source by producing structures that are not correct French. As a consequence, applying these rules is problematic: we cannot expect users to apply them, as the transformations would make little sense, especially to forum users. Automatic application, with the high precision and suggestion reliability this entails, therefore had to be taken into account during rule development. We begin this section by describing the development methodology for these rules, since it differs from that of the previously described rules. We then present the developed rules.

3.7.1 Development methodology

To develop these rules, we needed a method to identify badly translated sequences or phenomena that could benefit from a transformation. Post-edited data would have been a useful resource for this, since it would have allowed us to pinpoint badly translated phenomena by analysing the edits. Unfortunately, no post-edited data was available at the time of pre-editing rule development. We therefore chose to exploit a known source of potential translation errors: the mismatch between training and test data (Haddow & Koehn, 2012). Words or sequences that are infrequent or entirely absent from the training data have a higher probability of being translated badly, so we decided to focus our investigation on these sequences.

To identify them, we worked with bigram and trigram frequencies. In a first step, all trigrams were extracted from a corpus of 10,000 random sentences of forum data. For each of these trigrams, we counted the frequency of occurrence in a subset of 78,000 sentences of forum data and in a quarter of the French Europarl corpus used for training the ACCEPT SMT system. These frequencies allow us to identify mistranslation candidates, namely those trigrams that are very infrequent in or absent from the training data, and to sort them by relevance, based on their frequency in the forum data. For each of the trigrams in the resulting sorted list, we extracted up to five forum sentences in which they occurred and translated these with the ACCEPT baseline. The result of this selection, translation and sorting process was a list of trigrams, sorted according to their relative frequency in the two corpora (beginning with those occurring frequently in the forum data and rarely or not at all in the training data), with example context sentences and their translations. Figure 3.1 shows an example of extracted trigrams: for each trigram, the first line shows the tokens, followed by the frequency in the forum data and the frequency in the SMT training data; the second line shows an example sentence in which the trigram occurs; the third line shows its translation by the SMT system.

This output could then be used to quickly identify badly translated sequences, i.e. candidates that could benefit from a transformation to improve translation. Using the extracted sentences as a starting point, we tried replacing these trigrams, or larger n-grams where necessary, with other words or expressions, without changing the meaning of the sentence. Although the candidate extraction began with bigrams or trigrams, the resulting rules often apply to more complex sequences or non-contiguous tokens. As the objective was to investigate transformations for the machine, replacements were not necessarily correct French. We then translated the transformed sentences to determine whether the replacements led to better translations.
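The extraction scripts themselves are not included here; the following minimal sketch of the candidate-selection step assumes that the two corpora are available as plain-text files with one sentence per line, and the file names and thresholds are illustrative.

```python
from collections import Counter

def trigrams(tokens):
    return zip(tokens, tokens[1:], tokens[2:])

def count_trigrams(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(trigrams(line.lower().split()))
    return counts

# Illustrative file names: a forum corpus and a subset of the SMT training data.
forum = count_trigrams("forum_fr.txt")
train = count_trigrams("europarl_fr_subset.txt")

# Candidates: trigrams frequent in the forum data but rare or absent in the training
# data, sorted by forum frequency so the most relevant phenomena are inspected first.
candidates = sorted(
    (tri for tri, n in forum.items() if n >= 5 and train.get(tri, 0) <= 1),
    key=lambda tri: forum[tri],
    reverse=True,
)
for tri in candidates[:20]:
    print(" ".join(tri), forum[tri], train.get(tri, 0))
```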


Figure 3.1: Example of trigram extraction output


If a successful replacement was found, we wrote a basic Acrolinx rule with a minimally constrained Trigger to flag the identified words or sequences. By applying this rule to a larger corpus of forum data, we extracted a larger number of affected sentences, which served as a test suite for refining the rule. While this is in fact not much different from bilingually analysing MT output and testing alternatives for badly translated phrases, it has the advantage of increased efficiency: by extracting trigrams instead of random data, we work with data that should have a higher density of interesting phenomena, and sorting by frequency in the forum data ensures that the considered phenomena are indeed relevant. Despite this, the entire process remains relatively labour-intensive and time-consuming, as each trigram and translated sentence has to be evaluated manually.

As a refinement of this approach, having observed that specific ambiguous words were frequently translated incorrectly, we extended the trigram extraction method to allow selection of a specific word in the trigram. We began by selecting ambiguous words that can have different translations depending on the context, such as plus, which can be translated either as more (Il est plus cher → it is more expensive) or as a negation (Je n'ai plus de laptop → I no longer have a laptop). These words were identified by manual translation analysis, searching for badly translated words. The extracted bigrams and trigrams containing these specific words were then sorted and processed exactly as described above.

The replacements investigated to improve translatability can be split into four groups: reformulation or elimination of specific expressions, replacement of clitics, reordering of existing words or expressions, and replacement of the informal second person. We will now present each of these categories in further detail.

3.7.2 Reformulation rules

By analysing the bigrams and trigrams, we identified expressions that were translated badly, such as sans doute (no doubt), si jamais (in case) or pas mal de (a lot of). Many of these expressions belong to the informal register and are therefore absent from the training data. Often, they are translated literally, resulting in meaningless English. By experimentation, we found replacements that maintain the same meaning as the source. A recapitulation of the rules is provided in Table 3.16.

Table 3.16: Reformulation rules

Rule name | Function | Freq. (P)

addDoisJe | in questions, insert dois-je between Comment and an infinitive verb | 21 (100%)
  Example: Comment faire pour en supprimer une? → Comment dois-je faire pour en supprimer une?
avoidExpression | replace certain expressions (miscellaneous) | 92 (100%)
  Example: Tu as quand même fait ta mise à jour en version 2012 comme je l'ai expliqué plus haut ? → Tu as pourtant fait ta mise à jour en version 2012 comme je l'ai expliqué plus haut ?
avoidGreeting | use formal greetings (e.g. Bonjour) instead of informal ones (e.g. rebonjour) | 19 (100%)
  Example: Hello ElChikito, et bienvenue sur le Forum!! → Bonjour ElChikito, et bienvenue sur le Forum!!
avoidIlFautQue | use tu dois + infinitive instead of il faut que tu + subjunctive | 5 (80%)
  Example: Il faut que je vérifie à nouveau. → Je dois vérifier à nouveau.
avoidPlus | use aucun (none) instead of plus aucun (none anymore) | 15 (100%)
  Example: Non, il n'y a plus aucun Norton installé. → Non, il n'y a aucun Norton installé.
avoidPourQue | use afin que instead of pour que | 40 (100%)
  Example: J'ai crée un sujet indépendant pour que ce soit plus facile de faire le suivi. → J'ai crée un sujet indépendant afin que ce soit plus facile de faire le suivi.
avoidQue | use the less ambiguous seulement instead of que | 6 (100%)
  Example: Je n'ai que la version d'évaluation. → J'ai seulement la version d'évaluation.
avoidSiEtQue | repeat the conjunctions si, quand and lorsque for each clause they apply to instead of using que | 12 (100%)
  Example: Lorsque je fais clic droit dans le bandeau internet et que je selectionne Norton Toobar [...] → Lorsque je fais clic droit dans le bandeau internet et lorsque je selectionne Norton Toobar [...]
avoidSoit | use [...] ou [...] ([...] or [...]) for disjunctions instead of soit [...], soit [...] (either [...] or [...]) | 15 (100%)
  Example: Installer soit avec CD, soit en téléchargeant depuis le Compte Norton → Installer avec CD ou en téléchargeant depuis le Compte Norton
removeNeExpl | remove the expletive ne | 11 (100%)
  Example: Aucun module NORTON ne peut s'activer ! → Aucun module NORTON peut s'activer !
wrongAbbrev | use another abbreviation for this word | 39 (100%)
  Example: iPod Touch (3e génération ou ultérieure) → iPod Touch (3ème génération ou ultérieure)

The avoidQue rule, for example, replaces the ambiguous que (that/only) with the unambiguous seulement, as shown in the following example:

(25) Raw: L'extension n'est que pour quelques minutes. → The extension is that for a few minutes.
Pre-edited: L'extension est seulement pour quelques minutes. → The extension is only for a few minutes.

It should be noted that for some expressions we created individual rules, while a number of others were grouped in a single rule, avoidExpression. From a technical point of view, it would have been possible to treat all expressions within a single rule, but we chose to create individual rules since this is more convenient for evaluation purposes.

3.7.3 Clitics

In some cases, French and English use divergent syntactic structures. An example is the use of clitics, which are placed to the left of the verb in French, as shown in the following example for the verb télécharger (to download):

(26) French: Tu peux le télécharger
English: You can download it

To produce a correct result, the complement needs to be moved during the translation process. In this particular case, the pronoun le is also highly ambiguous, as by itself it can be both a pronoun and a definite article. We therefore developed a rule to replace the pronominal verb complement with a direct complement ça (this) placed after the verb (cliticsÇa).

Another, similar issue is that of the first person clitic personal pronouns, which were often mistranslated. However, they are just as often translated correctly. Experimentation showed that in some cases translation can be improved by simply removing the clitic:

(27) Raw: Norton removable tool ne me propose pas la désinstallation de NIS. → Norton Removable tool I do not propose the uninstall of NIS.
Pre-edited: Norton removable tool ne propose pas la désinstallation de NIS. → Norton Removable tool does not propose the uninstall of NIS.

In other cases, replacing the clitic with a prepositional phrase placed after the verb (me propose → propose à moi) or with a direct object (je me répète → je répète moi-même) produces an improvement. Finally, some cases are translated correctly without any modification. A rule that replaces all clitics with the same alternative structure would thus break as many examples as it improves. To specialise the rule to account for these differences, we extracted sentences with clitic pronouns from a development corpus, applied each of the replacement options, and evaluated the translation results for the different options. We then adapted the suggestions in the rule to keep only the most successful replacement. This was done by restricting the Suggestion rules to a list of verbs (using the MORPH.LEMMA feature, cf. 3.2.1.1). For example, if we found that je répète moi-même (direct object) produced a better translation (I repeat myself) than any of the other options, clitic (je me répète → I again), prepositional phrase (je répète à moi → I repeat me) or null (je répète → I repeat), the Suggestion rule was adapted so that for any instance of the verb répéter the clitic is replaced by a direct object. All the clitic rules developed on this principle are recapitulated in Table 3.17.
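The per-verb selection just described can be pictured as a small dispatch table. The sketch below only illustrates this logic: the verb entries and option names are hypothetical, and the real selection is encoded in Acrolinx Suggestion rules restricted via MORPH.LEMMA, not in Python.

```python
# Hypothetical mapping from verb lemma to the replacement option that translated best
# during development: DROP = remove the clitic, DIRECT = direct object after the verb,
# PREP = prepositional phrase after the verb, KEEP = leave the clitic unchanged.
BEST_OPTION = {
    "proposer": "DROP",    # ne me propose pas -> ne propose pas
    "répéter": "DIRECT",   # je me répète -> je répète moi-même
}

def clitic_option(lemma):
    """Return the replacement strategy for a first person clitic attached to this verb."""
    return BEST_OPTION.get(lemma, "KEEP")

print(clitic_option("répéter"))    # DIRECT
print(clitic_option("installer"))  # KEEP (no evidence that a rewrite helps)
```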

Table 3.17: Clitic rules

Rule name | Function | Freq. (P)

cliticsÇa | replace a singular clitic with ça | 292 (100%)
  Example: Je te le donne en pièce jointe. → Je te donne ça en pièce jointe.
cliticsCeux-ci | replace a plural clitic with ceux-ci | 72 (94%)
  Example: Je suis un fidèle des produits de la Symantec et les conseille autour de moi. → Je suis un fidèle des produits de la Symantec et conseille ceux-ci autour de moi.
cliticsPersPron | replace or remove first person singular clitic personal pronouns | 172 (100%)
  Example: Vous pouvez cliquer sur le lien qui vous renverra à votre sujet. → Vous pouvez cliquer sur le lien qui renverra à votre sujet.

3.7.4 Reordering rules

The reordering rules perform transformations to produce a word order similar to that of the target language, English. These rules are very specific, focussing on particular words rather than categories, and therefore produce only very few flags. Table 3.18 recapitulates these rules, which will be discussed in further detail in the next chapter in the context of their impact on machine translation.

Table 3.18: Reordering rules

Rule name | Function | Freq. (P)

reorderJamaisV | move jamais in front of the verb | 6 (100%)
  Example: Si Norton a été désinstallé et que tu n'avais jamais fait de sauvegarde des identifiants, c'est perdu. → Si Norton a été désinstallé et que tu jamais avais fait de sauvegarde des identifiants, c'est perdu.
reorderRienInf | move rien after the infinitive verb to match English word order | 9 (100%)
  Example: Comment faire une image à froid avec le Symantec Recovery Disk sans rien modifier? → Comment faire une image à froid avec le Symantec Recovery Disk sans modifier rien?
reorderTout | move tout after the verb | 18 (100%)
  Example: Si tu as installé NAV et que tu as tout laissé par défaut, je ne comprends pas. → Si tu as installé NAV et que tu as laissé tout par défaut, je ne comprends pas.

3.7.5 Informal second person

In French, there are two second person pronouns: the formal vous and the informal tu. Accordingly, verbs are inflected differently depending on which pronoun is used. The casual discussion context of the forums leads users to address each other with the informal tu rather than vous. While this is not a problem in itself, it is a serious issue for SMT. Most of the available training data, such as the Europarl corpus (Koehn, 2005), belongs to the formal register, in which the informal pronouns and verb forms hardly occur. Perfectly correct French words thus become OOV items, because they belong to another register.

A study by Rayner et al. (2012), also performed in the context of the ACCEPT project, investigated methods to resolve this issue. Rewriting rules were developed using Prolog and lexical information from the MMORPH system (Petitpierre & Russel, 1995), the same morphological analyser as used by Acrolinx (cf. 2.3.1), to transform formal pronouns and verb forms into informal ones and vice versa. In a first experiment, Rayner et al. (op. cit.) applied these rules to the French half of the Europarl corpus to create artificial informal training data, which was used to train an SMT system. This system was then used to translate raw forum data. In a second experiment, Rayner et al. (op. cit.) used the rules to replace informal tokens in forum data, which was then translated with a system trained on regular (formal register) data. Both approaches were successful and improved SMT output, with the pre-editing approach performing better than the training data approach.

In order to make these rewriting rules available in the ACCEPT pre-editing environment, we created an Acrolinx rule to perform the same transformations as the Prolog rules, replacing all informal second person pronouns and verb forms with their formal equivalents. The data used for the development of the Prolog rules, a corpus of 200 segments presenting informal second person items, was made available to us. A second version of this corpus, processed with the Prolog rules, was also available.


This allowed us to compare the performance of the new Acrolinx rule, tuVous, recapitulated in Table 3.19, with that of the Prolog rule.

Table 3.19: Informal second person rule

Rule name | Function | Freq. (P)

tuVous | avoid the informal second person | 688 (92%)
  Example: Qu'as-tu fait avec ton fichier trop volumineux ? → Qu'avez-vous fait avec votre fichier trop volumineux ?

The main difference is a lower recall for the Prolog rule (67%) than for the Acrolinx rule (96%). This low recall is mostly due to the fact that the Prolog rule does not replace possessive pronouns (ton, tes or tien) and imperative verb forms. Regarding precision, both rules perform well with a precision of 97.5% for the Prolog rule and 99.6% for the Acrolinx rule on the training data. We can therefore assume that our tuVous rule will perform at least as well as the Prolog rule in terms of impact on the ACCEPT SMT system.
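For illustration, the pronoun side of such a rewrite can be pictured as a simple lookup. The sketch below is a deliberately simplified assumption: the mappings shown are illustrative, and the verb forms handled by both rules would require morphological generation, which is not shown.

```python
# Illustrative tu -> vous mapping for pronouns and possessives only.
PRONOUN_MAP = {
    "tu": "vous", "te": "vous", "t'": "vous", "toi": "vous",
    "ton": "votre", "ta": "votre", "tes": "vos",
    "tien": "vôtre", "tienne": "vôtre", "tiens": "vôtres", "tiennes": "vôtres",
}

def formalise_pronouns(tokens):
    return [PRONOUN_MAP.get(t.lower(), t) for t in tokens]

print(" ".join(formalise_pronouns("tu as fait ta mise à jour".split())))
# vous as fait votre mise à jour  (the verb "as" would still need to become "avez")
```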

We have now described all the rules adapted and developed for forum content. In the following section, we will discuss how these rules with their different characteristics can be applied in practice, in the forum context.

3.8 Rule application in the forum context

By definition, the developed rules fall into two categories: 1) rules for humans and 2) rules for the machine (Huijsen, 1998). While the former improve the source text and should have a favourable effect on MT, the latter degrade the source, possibly introducing structures that are incorrect in the source language but improve translatability by the ACCEPT SMT system. Rule application in the forum scenario must therefore proceed in at least two stages, since we do not wish users to see the changes made by the rules for the machine. Rules for humans would be applied first, with or without human interaction, and rules for the machine would be applied automatically just before translation. The rules were thus split into two sets.


To reduce the effort involved in the pre-editing task, we considered automatic application of the rules for humans. Unfortunately, not all rules are equally suitable for automatic application. Some rules have too low a precision, with the risk of corrupting words or structures that are in fact correct. Other rules, such as those addressing style issues, still need human intervention despite excellent precision, because it is impossible to provide complete and unique suggestions for all flagged cases. Nevertheless, a reasonable number of rules offered both high precision and reliable unique suggestions and could very well be applied automatically. We therefore chose to split the rules for humans into two individual sets, one for automatic application and a second for interactive application. The automatic set would be applied before the interactive set, as this step already improves the text, meaning that the pre-editors can deal with a cleaner input. The three rule sets are recapitulated in Table 3.20.

Table 3.20: Pre-editing rule sets

Type | Appl. | Set | Included categories
for humans | auto. | Set 1 | grammar (mood/tense), grammar (sequences), homophones, punctuation
for humans | man. | Set 2 | grammar (agreement), grammar (mood/tense), grammar (sequences), homophones, punctuation, informal, simplification
for the machine | auto. | Set 3 | reformulation, clitics, reordering, informal second person

Besides the time gain for pre-editors, who have fewer rules to apply, applying these sets in sequence presents an additional benefit. As we have seen earlier in this chapter, obtaining high recall is difficult when the density of errors is high, because it is impossible to describe every imaginable error in the rule contexts. By pre-correcting the text, chances are that we transform a context that had not been considered during rule development into one that had, and that is therefore covered by a Trigger. We thereby increase the recall of the rules. The following example illustrates this idea:

(28) Raw: La impossible de m'inscrire lol
Pre-edited: L'impossible de m'inscrire lol


Here the addElision rule produces a wrong correction based on the context. Had the laLà rule been applied before, La would have been corrected to Là. It would then have been correctly identified as an adverb, and the elision rule would not have triggered. Of course, the same effect could be achieved if checking were performed on the fly, i.e. if each modification by an automatic or interactive rule triggered a recheck of the segment. At present, on-the-fly checking has not been implemented in the ACCEPT pre-editing portal. However, considering the potential for improving rule recall, and also rule precision, since contexts with many errors are responsible for most of the incorrect flags, this would certainly be a valuable future enhancement.

During rule fine-tuning, we observed that grouping our rules for humans into two sets has one major drawback. Rules were assigned to the manual or automatic set based on precision and reliability of suggestions; most of the homophone confusion rules, for example, were assigned to the automatic set. Once a rule had been attributed to an automatic set, further effort was put into making it as reliable as possible, favouring precision over recall and thus reducing the number of issues flagged. For the tousTout confusion rule, for example, which was placed in the first automatic set, many Triggers had to be removed because they occasionally produced erroneous flags. Consequently, all the correct flags produced by these Triggers were also lost, reducing the total number of potential corrections. In this regard, it would probably have been more effective to split each rule into two sub-rules: a high precision sub-rule applied automatically and a high recall sub-rule for interactive application. Obviously, the effectiveness of interactive rules depends largely on the ability of users to apply them correctly, and it is thus not certain in our context that splitting the rules would have achieved the desired effect, as forum users might not have had sufficient linguistic skills. This aspect will be discussed in further detail in Chapter 7, where we describe a pre-editing experiment involving real users from the Norton Community.

In the final pre-editing setup, the rules are thus grouped into three sets, designed for application in sequence: a first automatic set of rules for humans, a second interactive set of rules for humans, and a final automatic set of rules for the machine. The correction of non-word errors with the Acrolinx spelling module is integrated into the second set, as this step also requires human interaction. In a real-life scenario, forum users would thus apply the interactive rules from the second set and the spelling module to input that has already been automatically pre-edited with the first set. The third set of rules would then be applied automatically just before machine translation. A recapitulation of the rules, sorted by set, is provided in Appendix C.
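The sequence just described can be summarised schematically as follows; the function names are placeholders and do not correspond to the actual ACCEPT portal or Acrolinx API.

```python
# Placeholder functions standing in for the real checking components.
def apply_auto(text, rule_set):
    """Apply a set of automatic rules (Set 1 or Set 3) and return the corrected text."""
    return text  # the real implementation calls the Acrolinx checker

def apply_interactive(text, rule_set):
    """Let the forum user resolve flags from Set 2 and the spelling module."""
    return text  # the real implementation involves the pre-editing portal UI

def preedit(post):
    text = apply_auto(post, rule_set="set1")         # 1) automatic rules for humans
    text = apply_interactive(text, rule_set="set2")  # 2) interactive rules + spelling module
    text = apply_auto(text, rule_set="set3")         # 3) automatic rules for the machine
    return text                                      # passed on to the SMT system
```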


We have now described all the pre-editing rules adapted and developed for forum data, as well as their mode of application in the forum context. In the last section of this chapter, we will evaluate the precision of these rules.

3.9 Rule precision

Two aspects greatly affect the usability and efficiency of checking tools: precision and recall. High precision saves users the thankless task of dealing with false alarms. Besides the obvious time gain due to reduced processing of erroneous flags, an additional benefit is the decreased risk that numerous incorrect flags cause important correct flags to be overlooked or ignored. In the forum context, where we expect users to perform the checking task, precision takes on an even greater importance, for several reasons. These users create content on a voluntary basis, and while they are probably interested in posting a text of reasonable quality, they might be deterred by a tedious checking process with numerous unproductive and time-consuming steps. Furthermore, because these users are mostly not professional writers, it cannot be excluded that, whether as a consequence of varied grammar and spelling skills or through lack of confidence, they might not identify erroneous flags correctly. The pre-editing process could then introduce errors into the text. Precision is thus essential for the success of the pre-editing approach. As mentioned in the previous section, some of the rules were designed for automatic application; for these rules high precision is vital, as no human intervention will discard wrong flags.

To evaluate our pre-editing rules, we proceeded differently depending on the mode of application. For rules from Sets 1 & 3, which are designed for automatic application, we considered the precision of the correction, i.e. whether the replacement provided by the rule was correct. For rules from Set 2, since they are applied interactively (either to select a replacement among several suggestions or, in the absence of suggestions, to correct errors manually), we considered only the flag itself: if the flagged phenomenon was indeed an error or a sequence requiring transformation, the flag was counted as correct. The correctness of the replacement suggestion, if available, was not taken into account.

Evaluating the recall of the pre-editing rules is another matter. Unlike spelling and grammar errors, which can be annotated systematically, the phenomena treated by the simplification or reformulation rules are not as well demarcated, and it seems difficult to count the number of cases they should apply to in a corpus. We therefore did not compute recall scores for these rules.

3.9.1 Evaluation

To evaluate rule-by-rule precision, we used a corpus of unseen Norton forum data, from which we extracted 10,000 sentences without duplicates. In order to respect the intended sequence of rule application described earlier in this chapter, each rule needs to be evaluated on a corpus pre-processed with the appropriate rule set. Rules from Set 1 were evaluated on a raw corpus; rules from Set 2 on a corpus pre-edited with Set 1; rules from Set 3, which should ideally be evaluated on a corpus pre-edited with Sets 1 & 2, were applied to a corpus pre-edited with Set 1 only, as no large corpus pre-edited manually with Set 2 was available.

Data extraction for this experiment was performed with the Acrolinx Batch Checker. As described in the previous chapter (cf. 2.3), this tool produces detailed XML checking reports, including all the sentences flagged by a given rule. We thus applied each rule set to the corresponding corpus by means of the Batch Checker and extracted the flagged sentences for each rule from the reports, using the Perl script also described in the previous chapter. The number of flags per rule varies greatly, ranging from 2 (niN'y) to 2552 (punctSpacing) flags on 10,000 sentences. Two thirds of the rules yielded 50 flags or fewer. For the evaluation of precision, we chose to keep up to 50 flags per rule. Since the Perl script outputs an ordered list of examples, sorted by the tokens flagged by the rule, we randomised the order of these lists before extracting the 50 examples, in order to obtain a variety of examples.
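The sampling step can be summarised as in the sketch below, which assumes that the flagged sentences have already been parsed out of the reports into a mapping from rule name to sentence list; the function shown is illustrative and is not the script that was actually used.

```python
import random

def sample_flags(flags_by_rule, k=50, seed=0):
    """Keep at most k randomly ordered flagged sentences per rule for manual evaluation."""
    rng = random.Random(seed)
    sample = {}
    for rule, sentences in flags_by_rule.items():
        pool = list(sentences)
        rng.shuffle(pool)        # randomise the ordered list produced by the extraction script
        sample[rule] = pool[:k]  # keep up to k examples
    return sample
```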

3.9.2 Results

The individual results by rule can be found in each of the rule recapitulation tables in this chapter (Tables 3.1, 3.2, 3.3, 3.5, 3.13, 3.14, 3.15, 3.16, 3.17, 3.18 and 3.19). Table 3.21 shows the combined results for the three rule sets.

Table 3.21: Flags and precision of rule sets on 10,000 sentences

Rule set | Total flags | Evaluated | Incorrect | Precision
Set 1 | 4095 | 498 | 23 | 95%
Set 2 | 2692 | 963 | 46 | 95%
Set 3 | 1436 | 430 | 8 | 98%
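As a quick arithmetic check, the precision figures in Table 3.21 are consistent with the evaluated and incorrect flag counts (precision = (evaluated - incorrect) / evaluated):

```python
for name, evaluated, incorrect in [("Set 1", 498, 23), ("Set 2", 963, 46), ("Set 3", 430, 8)]:
    precision = (evaluated - incorrect) / evaluated
    print(f"{name}: {precision:.1%}")
# Set 1: 95.4%, Set 2: 95.2%, Set 3: 98.1%
```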

Of the 498 flags evaluated for the first set, the rules achieve a global precision of 95%, with 15 of the 20 rules achieving a precision of 100%. Only three rules have precisions lower than 80%: avoidCond (60%), useImperative (75%) and wrongExpr (63%). For the second set, on the 963 flags evaluated, the rules achieve a global precision of 95%, and 20 of the 30 rules achieve a precision of 100%. Only one rule has a precision lower than 80%: wrongSeq (74%). For the third set, the global precision is 98%, with 16 of the 19 rules achieving 100% precision; none of these rules has a precision below 80%.

Overall precision is good, yet 5% erroneous flags, especially for automatic rules, is not negligible. A closer investigation of these flags reveals different causes: noisy input data and insufficient rule restriction, but also limitations of the checking technology. Table 3.22 recapitulates the different causes of erroneous flags. We will now discuss these issues and potential solutions.

Table 3.22: Causes of erroneous flags on forum data

Cause | Cases
Noisy input data | 43%
Tagger | 8%
Insufficient rule restriction | 8%
Wrong suggestion | 8%
Missing lexicon | 6%
Compounds | 6%
Long distance dependencies | 5%
Other | 16%


3.9.2.1 Noisy input data

By far the most frequent cause (43%) of wrong flags is errors or unusual structures in the input data. The problem of high error density, resulting in erroneous contexts that mislead our Trigger rules, is a recurring issue when processing community content with Acrolinx. We have seen that the error patterns in the rules rely on the linguistic annotations produced by the morphological analyser and the POS tagger. However, these annotations are not infallible and can be disrupted by non-words or unusual structures.

Unknown words

Non-words, or words absent from the Acrolinx morphological resources, are problematic since they cannot be attributed any morphological features. This in turn disrupts the tagger, as no category information is available for these words. The following example, a wrong flag of the wrongExpr rule, illustrates this issue:

(29) *La j'ai peur de le retelecharger

As the verb *retelecharger is misspelt and thus absent from the morphology, its category remains unknown. In this case, it has been incorrectly tagged as a noun, with the preceding le tagged as a determiner. The sequence de le is consequently flagged by the wrongExpr rule, which suggests replacing the preposition and determiner with the contracted du. This is incorrect here, since le is a pronoun, not a determiner. If we correct the spelling of *retelecharger, it is tagged correctly and no flag occurs.

Non-words are not only problematic for the analysis modules, but also for specific lexical negative evidences. In the following example, as a consequence of a typo, the relevant negative evidence is not triggered:

(30) *Ensuite je clique sur centre de sauvegarde en ligne et rentre mon mot de passe et ça se passe plus ou moin bien (temps d'attente un peu long)

The se passe plus sequence is flagged by the incomplNeg rule, which suggests correcting the incomplete negation by adding the missing ne. Normally, if plus is followed by ou moins, a negative evidence cancels the effect of the Trigger rule, as this is clearly not a context where plus is used as a negation. In the present example, as *moin is misspelt, it does not match the definition of the negative evidence, which is thus not triggered, resulting in a faulty flag.


These non-words would not be a problem if the spelling module were applied before the other rules. However, as this module requires human intervention to select replacement candidates, it could not simply be applied automatically before Set 1, but would require an additional manual pre-editing step, which does not seem desirable in our context.

Unusual style

Another common problem for the tagger is the telegraphic style sometimes used on forums, where words are left out for the sake of brevity. This is illustrated by the following example, where the subject of the second clause is omitted:

(31) *Je retrouve pas le tien, manque la référence complète.

This sentence is flagged by the tuVous rule, which identifies the verb manque as a misspelt imperative form and suggests replacing it with its formal equivalent, which is certainly wrong here. The real issue is the missing subject. If we complete the sentence by adding an appropriate subject (il manque), it is not wrongly flagged. Another frequent omission in informal language is that of prepositions, as in the following example:

(32) *Erreur envoi message avec Windows Live Mail

Erreur envoi is flagged by the nounVerbConf rule, which identifies homophone confusions and in this case suggests that envoi may be a verb misspelt as a noun, since it is uncommon in French for a noun to directly follow another noun. If this compound had been written correctly (Erreur d'envoi), this would not be an issue.

Punctuation

The omission or misuse of punctuation is also problematic, as shown in the following example, wrongly flagged by the useImperative rule:

(33) *Penses tu que ce soit la solution ?

Here the omission of the hyphen in the inverted verb-pronoun sequence Penses-tu lets the rule falsely identify the verb form as a potential imperative. If the hyphen were there, the rule would not flag this verb. In fact, the addHyphen rule also flags this sequence, adding the appropriate hyphen. However, as both rules are in the same set and are not applied one after the other, this does not benefit the useImperative rule.


As with the previously mentioned non-words, correcting some errors before others might improve rule precision. One approach would be to apply the rules in a finer-grained sequence, making sure that the issues that could disrupt the triggers of one rule have been eliminated by a previous rule. It is, however, still unclear how rules could be developed to work in such a sequence, e.g. which phenomena would have to be treated first. Sequential rule application deserves further investigation and will be the object of future work.

Although all these flags are incorrect, since they do not flag exactly the phenomena the rules were designed for, they still indicate that something is wrong with the flagged sequence or its near context. Obviously, identifying sequences that are wrong is easier than identifying what exactly is wrong in them. Yet even roughly delimited errors such as these might be perfectly sufficient in the context of interactive checking by highly qualified writers. The following example illustrates such a case:

(34) *Le tecknicien norton ne ma pas rappeler et aujourd'hui effectivement il n'y à personne.

Here the pas rappeler sequence is flagged by the incomplNeg rule, which suggests adding ne in front of pas rappeler. Although the ne is present in the sentence, the sequence does not match any of the patterns defined in the rule, due to the misspelling of m'a as ma. The error here is not the negation, so suggesting that the user add ne is evidently wrong, yet chances are that a user whose attention is drawn to this sequence might decide to correct the adjacent ma instead. This, however, requires the user to take a broader view of flags, which is not necessarily realistic in a forum context. For less qualified users, such flags might only be confusing or frustrating. Needless to say, such imprecise flags are of no use for automatic correction.

Tagger errors

In a small number of cases, errors are caused by incorrect POS tagging despite a correct sentence structure. This is the case for 8% of our erroneous flags, nearly all of which can be attributed to the wrongSeq rule, a generic rule designed to identify wrong sequences of categories, relying mainly on tagger output. As mentioned earlier, additional tagger training might resolve this issue.


3.9.2.2 Insufficient rule restriction - missing negative evidences

We also observe incorrect flags caused by insufficiently constrained rules: 8% of the erroneous flags could be avoided by adding negative evidences to cover specific cases that were not encountered during rule development. Most likely, a certain number of such cases will be found each time the rules are applied to a new data set.

3.9.2.3 Wrong suggestions

8% of the erroneous flags are cases where automatic rules correctly identify errors, but provide an incorrect replacement. This happened for the wrongExpr rule, which flagged several instances of the preposition ` a followed by a definite article le or les: (35) *Le produit que je cite ci-dessus :: ` a :: les mˆemes fonctions que N360 pour l’ordinateur. The ` a les sequence is definitely incorrect in French, however it can result from different errors: either the preposition-determiner contraction (to aux ) was neglected, or a homophone confusion has occurred between the preposition ` a and the verb a. Suggestions should thus be refined to distinguish between these two cases, yet this can prove difficult as the close context is not always sufficient. In both cases (preposition ` a or verb a), the right hand side context would be a noun phrase, while the left hand side would be a verb phrase in the case of the missing contraction, or another noun phrase in case of the homophone confusion. Depending on the complexity of the phrases in question, writing patterns to identify them could prove rather difficult. 3.9.2.4

3.9.2.4 Missing lexicon

A number of errors are due to tokens missing from the Acrolinx morphological resources, which are thus not categorised correctly. These errors could be avoided by completing the resources accordingly.

3.9.2.5 Compounds

Another problem is compounds which are not identified as such. This is particularly problematic for agreement rules, which have to identify noun phrases. In the following example, the subjVerbAgr rule flags the verb correspondent and suggests changing it to the singular form, to agree with the singular noun défaut, but the correct subject is the plural réglages par défaut:


(36) Les choix [...] de faire des réglages par défaut qui correspondent au plus grand nombre ne sont pas une fatalité !

As our checking technology functions on the word level, the only way to handle compounds is by describing them token by token in the patterns. Extending the rules to cover all sorts of compounds would mean multiplying Triggers, Evidences and Suggestions wherever a noun phrase could be a compound. However, as only 6% of our wrong flags result from this issue, this might not be worth the effort.

3.9.2.6 Long distance dependencies

A similar problem is that of long-distance dependencies, which are difficult to handle with Acrolinx, as there is no means of defining clauses, except by enumeration of all the elements composing them. Again, this only affects a minimal number of cases (5%).

3.9.3 Summary

We have evaluated the precision on a rule-by-rule basis; most of the rules perform well, with three quarters achieving 100% precision. The greatest source of erroneous flags is errors or unusual structures in the checked text. Issues such as non-word errors, informal style or wrong punctuation affect the analysis by the Acrolinx linguistic modules, resulting in incorrect annotations. As most of the rules rely on these annotations, this results in wrong flags. Another problematic aspect is the high density of errors, producing an overlap of phenomena where one error can only be identified correctly once another has been corrected, while that other error cannot be identified reliably unless the first is corrected. To improve precision, a more elaborate sequential application of the rules might prove interesting, and will be the object of future work. Another source of errors is the difficulty of identifying complex clauses and handling long-distance dependencies with the Acrolinx formalism.

The high precision achieved by the rules means that we can safely apply some of the rules automatically and that users will not be bothered by numerous wrong flags in the context of interactive checking.


3.10 Conclusion

In this chapter, we have presented the Acrolinx rule formalism and the rules developed and adapted for the forum context. This formalism has allowed us to develop different types of rules. On the one hand, we have pre-editing rules that improve the source, by correcting spelling and grammar, transforming or removing informal language and performing simplifications. Among others, these include rules for the very common homophone confusions, as well as for forum-specific use of anglicisms and jargon. On the other hand, we have rules for the machine, which were designed specifically to improve translation of forum data by the ACCEPT SMT system. These rules perform reformulations or reordering transformations.

Regarding the correction of spelling and grammar, a comparison with two commercial (Antidote and Word) and one open-source (LT) spell-checker has shown that in terms of error detection, our rules perform similarly to the other systems. Error correction, when considering only unique replacements, could still be improved, as many rules only provide multiple replacements or error descriptions. Despite this, we expect that our rules would lead to results similar to those that can be achieved with commercial tools, with the additional benefit of an integrated checking process, combining correction of spelling and grammar with the application of other pre-editing rules in a single interface.

An evaluation of the other pre-editing rules on a corpus of unseen data has shown that most of the rules achieve a high precision (95-98%). This confirms that a checking approach such as that used by Acrolinx, relying on declarative patterns using linguistic annotations provided by shallow analysis components, is well suited to the task of pre-editing forum data.

For the pre-editing scenario in ACCEPT, the rules were grouped into three sets designed for sequential application. A first automatic set contains high-precision rules for humans, a second interactive set contains rules for humans with lower precision or with multiple suggestions, and a final automatic set contains all the rules for the machine.

In the following chapters, we will evaluate different aspects of these rules. We begin by evaluating their impact on the ACCEPT SMT system in Chapter 4. The next chapters deal with the impact of pre-editing on the post-editing task in the ACCEPT scenario (5), the portability of these rules to other MT systems and domains (6) and finally their usability by real forum users (7).
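Purely as an illustration of this three-stage sequence, the sketch below reduces a rule to a plain string-rewriting function. This is not how the Acrolinx rules work (they are declarative patterns evaluated by the checking engine), and the interactive set would of course involve a user vetting each suggestion; the example only shows how the three sets are chained.

    import re
    from typing import Callable, List

    # A pre-editing "rule" is reduced here to a callable that rewrites a sentence.
    Rule = Callable[[str], str]

    def fix_a_accent(s: str) -> str:
        # stand-in for an automatic Set 1 rule (hypothetical, regex-based)
        return re.sub(r"\bpas a\b", "pas à", s)

    def pre_edit(sentence: str, set1: List[Rule], set2: List[Rule], set3: List[Rule]) -> str:
        for rule in set1:        # automatic rules for humans (high precision)
            sentence = rule(sentence)
        for rule in set2:        # interactive rules for humans (a user would confirm these)
            sentence = rule(sentence)
        for rule in set3:        # automatic rules for the machine
            sentence = rule(sentence)
        return sentence

    print(pre_edit("N'hésitez pas a me signaler si je peux faire quelque chose",
                   [fix_a_accent], [], []))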


4 Rule Evaluation: Impact

This chapter presents the evaluation of the impact of pre-editing on SMT of Norton forum data, both on a rule-by-rule basis and as a global pre-editing process.

4.1 Introduction

We have developed pre-editing rules for forum data, with the objective of improving its translatability. In the present chapter, we will investigate whether these rules can indeed fulfil this objective, by testing their impact on the translation of Norton forum data with the ACCEPT project's SMT engine. Our aim is to answer several questions, namely how the impact of the rules can best be evaluated, which rules have the most impact, and finally, whether the whole pre-editing process set up in this thesis improves machine translation of forum data.

Many methods have been used to evaluate machine translation, most of which follow one of two trends: manual evaluation, involving human annotation, or automatic evaluation by means of metrics that rely on comparisons with reference translations. Both approaches have their advantages and are widely used in the literature. The choice of an evaluation methodology is conditioned both by what it is expected to reveal and by how it can be performed in practice. In our particular case, evaluation should allow us to determine whether the translation of pre-edited forum data is better than that of raw data. Our focus will therefore be on comparative rather than absolute translation quality. Regarding the implementation of evaluation, since rule development is an ongoing process, we require an approach that allows frequent re-evaluation to assess the evolution of the rules, as well as evaluation on new data.

We have seen in the previous chapter that our pre-editing process combines approaches and includes rules of different types, which should all to a certain extent have an impact on translation. To ascertain this, we will perform a rule-by-rule evaluation. This will allow us to determine which rules are the most effective for our particular data and MT technology. In a second stage, we will perform an evaluation of the complete pre-editing process, applying all the rules in the designed sequence. This evaluation will provide insights into the global effectiveness of pre-editing for forum data.

We begin this chapter with a brief overview of machine translation evaluation in general, introducing the most common methodologies (4.2). We continue with a description of the approach chosen for our application, as well as the experimental setup (4.3). Next, we present the results of our evaluation, beginning with a rule-by-rule evaluation (4.4), followed by an overall evaluation with all rules combined (4.5). A short section then presents results of an evaluation including only automatic rules (4.5.2). Finally, we present our conclusions regarding the evaluation results and methodology (4.7).

4.2 Machine translation evaluation

Whatever the application of machine translation, from the black-box usage of a commercial MT engine, to the fine-tuning of new algorithms or translation models, there is no way around MT evaluation. End-users need an objective means to decide whether MT is of sufficient quality for a given purpose, or to make a choice between different MT engines. System developers need a clear indicator of system improvement, to assess the benefits of new models or technologies, as well as to identify weaknesses. Different evaluation methodologies have evolved to fulfil these various needs. Traditionally, we distinguish two main approaches: human evaluation and automatic evaluation. The first relies on human judgements of translation quality, generally involving several judges and different scales or annotation schemes. The second relies on statistical measures, determining the similarity of MT output with reference translations. In this section, we propose to give a brief overview of the different approaches.


4.2.1 Human evaluation

Human evaluation, also called manual evaluation, consists of asking humans to judge the quality of MT output, generally on the sentence level. With the exception of short sentences, or highly specialised systems, a machine translation is seldom perfectly correct. Hence a strict two-point scale (translation correct or incorrect) is not particularly helpful. Human evaluation therefore mostly relies on different criteria, such as language, accuracy, completeness, etc., to classify MT output (Hovy et al., 2002; White, 1992-1994). Depending on the criteria, human evaluations can be performed either bilingually, evaluating the target in relation to the source, or monolingually, considering only the quality of the target. In addition to the use of multiple criteria, human evaluations are made more fine-grained by collecting judgements on graded scales, typically 3, 5 or 7 point Likert scales. Another human evaluation approach, focussing on the extrinsic quality of machine translation, is task-based evaluation. Finally, yet another approach involving human annotation is error analysis. We will now describe these different approaches in more detail.

4.2.1.1 Fluency, Adequacy and Ranking

The two most widely used criteria are Fluency and Adequacy. Fluency judges the language of the translation, with no regard to the meaning conveyed. This criterion covers grammaticality, syntax and word choices, but also the style of the MT output. Evaluators have to consider all these aspects to assign a score (flawless, good, disfluent, etc.) to each sentence. By its nature, this criterion is not entirely objective, since individual preferences or expectations of the evaluators will influence the scores. While fluency is a good indicator of MT quality, it is insufficient by itself, as it reveals nothing about the information content of a translation.

Fluency is therefore usually combined with adequacy, which is sometimes referred to as fidelity. Adequacy measures how well the meaning conveyed by the translation matches the meaning of the source text. It is affected by distortions, omissions and insertions, and typical scores reflect the degree of meaning preservation (all information present, some information present, no information present, etc.). The evaluation of adequacy requires more resources than that of fluency, since evaluators must be bilingual, or at least understand both source and target language well enough to observe differing meanings. In the absence of such evaluators, it is possible to use a reference translation to which evaluators who are not familiar with the source language can refer, but this can introduce bias, caused by evaluators penalising translations that are very different from the reference (Koehn, 2010). Another possibility is to use a post-edited version of the MT output as reference (Aikawa et al., 2007). Fluency and adequacy are usually evaluated individually, but as Koehn (2010) points out, they are not entirely independent: a sentence with low fluency will be difficult to understand, and will accordingly tend towards a low adequacy score.

Although the combination of these two criteria is a good indicator of overall MT quality, they have the drawback of being quite subjective, leading to weak inter-annotator agreement or requiring a large number of evaluators to obtain reliable results (Arnold et al., 1994). As quite often the objective of evaluation is to find the best system amongst several, or the best results amongst different versions of a same system, researchers commonly fall back on an alternative approach: ranking. Instead of assigning a score to an individual translation, evaluators are asked to rank different translations of a same source against each other. Studies have shown that ranking tasks tend to give better agreement, and therefore more reliable results (e.g. Callison-Burch et al., 2007). However, while ranking is useful to evaluate the improvement of MT, it hardly provides any useful information about strengths or weaknesses of systems, sources of error, or specific aspects that are improved or degraded.

4.2.1.2 Task-based/use-oriented evaluation

While fluency and adequacy focus on the intrinsic qualities of translations, another evaluation approach focuses on extrinsic aspects instead: task-based evaluations gauge translation quality by measuring to what extent the translation can fulfil a particular purpose. While quantitative criteria are indispensable in the MT development cycle, task-based assessment is the only way to find out if MT does in fact address the end-user's requirements. In order to perform a task-based evaluation, a translation-related task needs to be defined (King et al., 2003; The EAGLES MT Evaluation Working Group, 1996). Many studies use some variation of Taylor & White's hierarchy of text-handling tasks (Taylor & White, 1998), which classifies tasks such as filtering (discarding irrelevant documents), extraction (capturing key information), publishing, etc. according to their level of tolerance of translation errors. Translations can then be ranked according to the level of the tasks they are sufficient for (Doyon et al., 1999).

However, while task-based evaluations certainly give very meaningful results, they are money- and time-consuming. They are mostly performed in conjunction with automatic metrics, in order to find whether these can be used to predict task results (Voss & Tate, 2006).

4.2.1.3 Error analysis

Although it may not be entirely correct to include error analysis with human MT evaluation, since it is more of a classification than an evaluation task, and is by no means limited to the evaluation of MT, we chose to present it here as it undeniably has a human component: the manual analysis and annotation of text. Error analysis, also referred to as error annotation or classification, is another approach which aims at more objectivity. It relies on models to quantify the linguistic quality of MT output (Fishel et al., 2012). These models establish a consistent standard against which translation quality can be measured as objectively as possible. Models classify errors both by precise categories and by importance ratings.

A frequently used model is the LISA QA model (http://web.archive.org/web/20110101195746/http://www.lisa.org/LISA-QA-Model-31.124.0.html), developed by the Localization Industry Standards Association to help service providers involved in localised product testing manage the quality assurance process. This model uses 39 categories and subcategories in combination with three ratings (minor, major, critical). Another widely used model is the SAE J2450 (http://www.apex-translations.com/documents/sae_j2450.pdf), developed by SAE International, a global association developing standards for the aerospace and automotive industry. This model has a much smaller number of categories and is specifically designed for accurate information transfer in technical texts.

These models developed for human translation are often not ideally suited to the evaluation of machine translation, since the error boundaries are not as clear-cut, and their causes are rarely apparent. Flanagan (1994) proposes an error classification specific to MT output, where errors are not only classified by language-pair dependent categories (verb inflection, article, pronoun, word selection, etc.) but are additionally ranked according to their impact on intelligibility. A wrong word selection or rearrangement will for example be considered more damaging than an elision or capitalisation error. Vilar et al. (2006) propose another error classification scheme with a hierarchical structure, with five large classes, each with a number of subclasses, which is based on a classification proposed by Llitjós et al. (2005). In a study of Spanish-Catalan SMT, Farrús et al. (2010) compare two systems by performing both a traditional human ranking evaluation and a linguistic classification evaluation, where errors are classified into five categories: orthographic, morphological, lexical, semantic and syntactic. While results for the two evaluations are consistent as to which of the systems performs better, the difference is far larger in the linguistic than in the ranking evaluation. This tends to indicate that human comparative evaluations may be influenced more by some linguistic errors than by others, although fluency judgements are found to correlate more closely with the number of annotated errors (Elliott et al., 2004).

4.2.2 Automatic evaluation

Human judgements are expensive and time-consuming. Thus much research has been invested to find techniques to evaluate MT quality automatically, without human annotation. Especially in the context of MT system development, a fast method that can be repeated any number of times during system optimisation is essential to quantify progress. This has led to the development of numerous automatic metrics, which generate scores for MT output by comparing a candidate translation with one or more reference translations. These scores are in fact similarity measures, the underlying principle being that the closer the candidate is to a reference translation, the greater the chance it has of being correct.

The simplest approach to computing text similarity is by measuring precision and recall on the word level. Precision is the number of correct words divided by the number of words in the MT output; recall is the number of correct words divided by the number of words in the reference. Although both these metrics provide useful information, the results can be misleading, as they do not take into account two important factors: sentence length and word order. A high precision score only indicates that nearly all words of the candidate sentence are correct, but it gives no indication of how many words the translation actually contains. Half the words of the sentence could be missing, making the sentence quite useless, but precision would still be high. The same goes for recall: besides containing many correct words, resulting in a high recall score, the sentence could contain numerous other irrelevant words, again making the sentence useless. To overcome these issues, precision and recall are combined into a metric called the f-measure, which is the harmonic mean of the two scores:

    \text{f-measure} = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2}    (4.1)
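A minimal sketch of these word-level measures, assuming the common convention that a candidate word counts as correct at most as many times as it occurs in the reference (the thesis does not spell out this detail):

    from collections import Counter

    def precision_recall_fmeasure(candidate: str, reference: str):
        """Word-level precision, recall and f-measure as in equation (4.1)."""
        cand, ref = candidate.split(), reference.split()
        overlap = sum((Counter(cand) & Counter(ref)).values())   # clipped word matches
        precision = overlap / len(cand) if cand else 0.0
        recall = overlap / len(ref) if ref else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (precision * recall) / ((precision + recall) / 2)    # harmonic mean
        return precision, recall, f

    print(precision_recall_fmeasure("do not hesitate has me",
                                    "do not hesitate to tell me"))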

Although the f-measure solves the issue of sentence length, it does nothing for the word order problem. A totally garbled sentence, insofar as the correct words were present, would still obtain a misleadingly high score. Conversely, the Word Error Rate (WER), a widely used metric borrowed from speech recognition, is highly sensitive to word order. This metric is based on the number of editing steps necessary to transform a candidate translation into the reference translation. The editing steps (insertions, deletions and substitutions) are given by the Levenshtein distance (Levenshtein, 1966), which is the minimal number of edits required to match two sequences of words. To compute WER, the total number of edits is divided by the number of words in the reference translation:

    \text{WER} = \frac{\text{insertions} + \text{deletions} + \text{substitutions}}{\text{reference word count}}    (4.2)
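A straightforward dynamic-programming sketch of equation (4.2), computing the word-level Levenshtein distance and normalising by the reference length:

    def wer(candidate: str, reference: str) -> float:
        """Word Error Rate: word-level edit distance divided by reference length."""
        c, r = candidate.split(), reference.split()
        # dp[i][j] = edits needed to turn the first i candidate words
        # into the first j reference words
        dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
        for i in range(len(c) + 1):
            dp[i][0] = i                          # deletions
        for j in range(len(r) + 1):
            dp[0][j] = j                          # insertions
        for i in range(1, len(c) + 1):
            for j in range(1, len(r) + 1):
                sub = 0 if c[i - 1] == r[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution or match
        return dp[len(c)][len(r)] / len(r) if r else 0.0

    print(wer("so we can not asked the technicians",
              "so we cannot ask the technicians"))     # -> 0.5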

WER has the disadvantage of attributing bad scores to correct translations that use entirely different formulations than the reference. Despite being perfectly correct, they will have high edit distances if they have few words in common, or phrases ordered differently.

An improvement on these metrics is the BLEU score (BiLingual Evaluation Understudy) (Papineni et al., 2002). This is a precision-based metric which considers n-gram matches between a candidate translation and one or more references. It is defined as "the geometric mean of the modified n-gram precisions, p_n, using n-grams up to length N and positive weights w_n summing to one":

    \text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)    (4.3)

Modified n-gram precision is a formalisation of the intuition that "a reference word should be considered exhausted after a matching candidate word is identified" (Papineni et al., 2002). By combining different n-gram sizes, the BLEU score reflects different aspects of translation: while the 1-gram precision is an indicator of adequacy (accounting for correct word choice), the precision of longer n-grams is an indicator of fluency (accounting for correct word order). The maximum length of n-grams to be matched can be defined by the user; typically 4 is used, and the metric is then called BLEU-4. Additionally, BLEU introduces a brevity penalty to reduce the scores of too short sentences, by this means penalising omissions. The brevity penalty is defined as:

    \text{BP} = \min\left( 1, \frac{\text{output-length}}{\text{reference-length}} \right)    (4.4)
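A minimal sketch of sentence-level BLEU-4 with a single reference, following equations (4.3) and (4.4) with uniform weights. Real implementations usually work at corpus level and smooth zero n-gram counts; neither refinement is shown here:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
        """Unsmoothed sentence-level BLEU with one reference."""
        cand, ref = candidate.split(), reference.split()
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            clipped = sum((cand_counts & ref_counts).values())  # "exhausted" reference n-grams
            total = max(sum(cand_counts.values()), 1)
            if clipped == 0:
                return 0.0                                      # any zero precision gives 0 without smoothing
            log_precisions.append(math.log(clipped / total))
        bp = min(1.0, len(cand) / len(ref))                     # simplified brevity penalty (4.4)
        return bp * math.exp(sum(log_precisions) / max_n)       # uniform weights w_n = 1/N

    print(round(bleu("do not hesitate to tell me if I can do something",
                     "do not hesitate to let me know if I can do something"), 3))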

While BLEU is clearly an improvement on previous metrics, and is generally agreed to correlate well with human judgements (Coughlin, 2003; Doddington, 2002), Callison-Burch et al. (2006) argue that "the machine translation community is overly reliant on the Bleu machine translation evaluation metric". Indeed, the shortcomings of BLEU, notably its inability to address important aspects such as the overall coherence of sentences or the relevance of different words, have given rise to lively debate about the appropriate and inappropriate uses of this automatic metric. Another reproach is the absence of meaning of the scores themselves, which are virtually useless unless they can be compared with another score obtained with the exact same setup (Koehn, 2010). However, despite its flaws, BLEU remains the most widely used metric in the field since it is both fast and easy to apply, if reference translations are available.

BLEU has led to many variations and extensions, of which we will mention the frequently used METEOR (Lavie & Agarwal, 2007; Lavie et al., 2004). This metric was designed with the aim of improving correlation with human judgements at the segment level. Like BLEU, METEOR evaluates a translation by matching it against a reference. It creates a word alignment between the two sentences, using a sequence of word-mapping modules. These modules seek either exact matches, i.e. identical words, or words that have the same stem (using the Porter Stemmer (Porter, 2001) to stem the words), but also words that are synonyms (based on WordNet (Fellbaum, 1998) synsets). By allowing all these near matches, this metric is more flexible and tolerant of small differences that do not affect the meaning of the translation. METEOR's main disadvantage is that it relies on complex language-specific resources, and cannot therefore be ported easily to arbitrary language pairs.


Another interesting variation on BLEU is the NIST score (Doddington, 2002). It also compares MT output with reference translations based on the number of shared n-grams, but implements two enhancements: n-grams are weighted according to their information value, and the brevity penalty is modified to reduce the impact of small variations.

Finally, we should mention TER (Translation Edit Rate) and HTER (Human-targeted TER) (Snover et al., 2006, 2009), metrics which measure the number of edits necessary to change a candidate translation into a reference translation. In principle, TER is similar to WER, counting insertions, deletions and substitutions, but with the addition of a shift operation, which allows the movement of contiguous sequences of words. HTER has the particularity of using targeted references, generated by fluent speakers of the target language on the basis of the system output and one or more pre-determined references.

As we have seen, all these automatic metrics are similarity measures, and therefore rely on reference translations. Although the application of the metrics themselves is generally fast and cheap, the creation of references still requires human intervention, which is one of the critical limitations of all these metrics.

4.2.2.1 Quality estimation

The issue of reference translation creation has led to investigations in an entirely different direction, namely predicting the quality of machine translation instead of evaluating it. The aim of this task, which is mostly referred to as quality estimation (QE) (Specia et al., 2010), is to provide an indicator of the quality of unseen translations, at phrase, sentence or document level. It does not rely on reference translations, but instead uses features extracted automatically from source and target sentences to compute a score. A large variety of features, both language-dependent and language-independent, are used to build classifiers or regression algorithms to predict quality. A quality estimation framework, combining a feature extraction module and a machine learning module, is now available online (Specia et al., 2013).

As we have seen in this section, numerous methods of evaluation are available, each with their advantages and disadvantages. In the next section, we will proceed to describe which approach was chosen within this thesis.
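To make the feature-based idea concrete, here is a deliberately toy sketch with a handful of shallow, language-independent features and an off-the-shelf regressor; the sentence pairs and scores are invented, and nothing here reproduces the much richer feature set of the framework cited above:

    from sklearn.linear_model import LinearRegression

    def features(source: str, target: str):
        s, t = source.split(), target.split()
        return [len(s),                          # source length
                len(t),                          # target length
                len(t) / max(len(s), 1),         # length ratio
                sum(w.isdigit() for w in s),     # numbers in source
                sum(w.isdigit() for w in t)]     # numbers in target

    # Hypothetical training data: sentence pairs with human quality scores (1-5).
    pairs = [("bonjour à tous", "hello everyone", 5.0),
             ("j'ai utilise toutes les cles", "I have uses all cles", 2.0)]
    X = [features(s, t) for s, t, _ in pairs]
    y = [score for _, _, score in pairs]

    model = LinearRegression().fit(X, y)
    print(model.predict([features("merci pour votre aide", "thank you for your help")]))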


4.3 Evaluating the impact of pre-editing on SMT of forum data

In our particular context, the aim of evaluation is to determine whether pre-editing can improve translation, and thereby make the output more useful for community users. Our focus is on the improvement of the translations, and not the quality of the translations per se. To this end, a comparative evaluation, contrasting the translation of a raw sentence against that of a pre-edited sentence, seems the most appropriate. Although this will not give us any information about the global quality of the translations, it will allow us to determine whether pre-editing is indeed useful. The observation of community content reveals that it is often wanting in terms of linguistic accuracy, yet this does not seem to be a major issue for the communication process, except in extreme cases. It seems reasonable to assume that the translation of such content need not be perfect either, as long as it is understandable. Our evaluation should therefore place greater emphasis on the transmission of a message than on linguistic quality.

In our rule development process, we require an evaluation methodology that allows frequent re-evaluation of the rules, to assess their evolution. As we have seen in the previous chapter, the process of rule development involves applying the rules to a corpus, improving precision and recall on these data, before moving on to another corpus. Repeatedly creating references for these corpora, in order to test impact on MT, was not a realistic approach. We have therefore chosen a human evaluation methodology to evaluate the impact of our rules. We will see later in this chapter how these evaluations were set up for increased efficiency. As references became available later during the project, we have also at a later stage performed automatic evaluations; these will be described in detail in section 4.5. As a complement, we have performed an alternative evaluation, using post-editing temporal effort as a criterion, but this will be the object of chapter 5. The current section will focus on human evaluation.

For our human evaluation, given the relatively low linguistic quality of both source and target in our data, we opted against a separate evaluation of fluency and adequacy of raw and pre-edited versions. We expected scores for both versions to be in a similar low range, which would not have provided sufficient distinctions to perform a conclusive comparison. Furthermore, as pre-editing mostly effects only small changes, a score reflecting the quality of entire segments would not necessarily be fine-grained enough for our purpose. This is all the more problematic in the case of a rule-by-rule evaluation, where to assess the impact of a single rule, only one change is performed in each evaluated segment. The following example illustrates this issue:

(37) Raw: Maintenant, il me la demande à nouveau, mais sur le site norton dans mon compte il dit que j'ai utilise toutes les cles.
→ Now, I demand again, but on the site norton in my account it says that I have uses all cles.
Pre-edited: Maintenant, il me la demande à nouveau, mais sur le site norton dans mon compte il dit que j'ai utilisé toutes les cles.
→ Now, I demand again, but on the site norton in my account it says that I have used all the cles.

In this segment, an example for the wrongVerbForm rule, only one word is changed by pre-editing (*utilise → utilisé). In the resulting translation, only two words are changed. Considered on segment level, this might not be considered a major improvement in terms of fluency or accuracy. However, the phrase I have used all the cles is clearly better than I have uses all cles, thus pre-editing was successful, and that is what we wish to ascertain with our evaluation. Instead of segment-level scores, we have therefore chosen to perform a simple ranking evaluation, where participants were asked to rank the translations of raw and pre-edited segments against each other. This should allow local yet important changes as in example 37 to be taken into account. We will see later in this chapter that the comparatively small changes effected by pre-editing are also difficult to catch with automatic scores (cf. 4.5.2).

We opted for a bilingual setup, to ensure that the evaluation would take into account adequacy, i.e. the correct transfer of the meaning of the source. In the evaluation, we present each translation pair with a single source, either the raw or pre-edited version. The source is chosen according to the evaluated rules: for rules of Sets 1 and 2, which regroup what we call rules for humans (cf. 3.8), pre-editing also improves the readability of the source (in most cases; for rules from Set 1, which are applied automatically, precision is not 100%, so for a small proportion of sentences pre-editing will introduce errors into the source), therefore we presented the pre-edited source. For rules of Set 3, where pre-editing potentially degrades the source, we presented the raw source. Results have shown that this choice of source is not innocuous, and has considerable impact on the evaluation of several rules, notably those where elements that are not considered essential to the message are removed by pre-editing (e.g. 4.4.2.7).


Another possibility would have been to perform an error analysis of the target sentences, as this would have given us a precise view of the errors corrected by pre-editing. We decided against this approach for several reasons. Error analysis requires highly skilled linguists to produce consistent and reliable annotations, and such resources were not available for this study. Furthermore, even when performed by experts, annotation of MT output is not without difficulties. Especially for community content, where the source itself is of poor linguistic quality, MT output is bound to contain numerous unclear errors, increasing the difficulty of annotation and thereby reducing the reliability and reproducibility of results. Finally, since error annotation is far more complex than comparative evaluation, it would have taken much more time. We will now describe in more detail how our human comparative evaluation was set up and carried out.

4.3.1 Participants

Choosing the right evaluators is probably the most important aspect of every human evaluation. Multiple factors must be taken into account, such as linguistic skills and background knowledge, but also more practical criteria such as availability and remuneration. In our case, real end-users would be the ideal evaluators, since they could judge the actual usefulness of a translation. However, assuming that the end-users do not understand the source language, which is in fact one of the main assumptions of the ACCEPT project, these end-users would not be qualified to evaluate the accuracy of the translation.

Since end-users were not an option, we recruited two groups of evaluators with different profiles for our evaluations. The first group are translators. Translators are frequently chosen to evaluate MT, as they have the required linguistic skills. Near-professional translators are readily available at FTI, thus we easily found three native English-speaking students in the final year of the Masters program, who had French as a working language and some professional experience in the French-English language pair. They had no special technical knowledge, and no prior experience with Norton products. These participants were paid for the task on a per-sentence basis. This group will henceforth be referred to as the "translators".

Due to their background, translators tend to be very critical towards a translation. We therefore decided to recruit a second group of evaluators that would hopefully not have this bias. This second group was recruited on AMT (Amazon Mechanical Turk, http://mturk.com). Besides having different profiles, which might be closer to those of the forum users, AMT workers have the advantage of being faster and cheaper than translators. In the next section, we will provide some further information about AMT.

4.3.1.1 Using AMT for evaluation tasks

AMT is an online microworking platform. Microworking could be considered a subspecies of crowdsourcing (the concept of outsourcing a job to an undefined group of people over the internet). Large projects are split into a series of small, simple tasks that can be distributed over the internet and completed by many people. Tasks generally require little knowledge or specific training, yet do depend on human intervention, hence the AMT term "Human Intelligence Task" (HIT) to describe these tasks.

The work-flow on AMT is very simple. Individuals or companies who distribute tasks, called requesters on AMT, post batches of HITs, with any number of assignments per HIT (i.e. the number of instances of a given task, which will each be completed by a different person). These HITs are then made available over the web to the AMT workers, who can choose to complete any number of HITs they are qualified for. Once the tasks are completed, the requester can view the results, and decide to reject or accept the work, at which point the workers receive their payment.

Using AMT for simple evaluation or annotation tasks has become quite common in the scientific community (Fort et al., 2011). Large quantities of data can be obtained quickly and at low expense. AMT also has the advantage of providing access to a diverse workforce, which can deliver different results from those obtained from the language professionals that are more commonly recruited for NLP related annotation tasks (Callison-Burch & Dredze, 2010). While the data obtained is noisier than data produced by experts, reliability can be achieved by aggregation of individual annotations (Sheng et al., 2008). A study investigating five natural language processing tasks showed that a small number of non-expert annotators can produce results comparable to those of one expert annotator (Snow et al., 2008). As reliability varies between workers, different methods have been investigated to correct the bias introduced by individual workers. If both expert and non-expert data are available, it is possible to model and correct for the bias of individual workers (Snow et al., 2008).


To allow requesters to recruit the best workers for specific tasks, Amazon provides functionalities to restrict tasks to specific worker groups. Criteria are location (restriction to one or several countries, or exclusion of countries) and HIT Approval Rate (the individual worker's work record). Unfortunately, no restriction based on language skills is provided. In our case, since we needed native English speakers with knowledge of French, we chose to recruit workers from a bilingual country, Canada. By itself, this is clearly not a sufficient restriction, as not all Canadians can be assumed to be bilingual. We thus added an indication in the HIT interface that the tasks should only be completed by native speakers of English with good knowledge of French.

To enforce this, and reduce the chance of non-qualified workers completing our tasks, we also added a restriction on HIT Approval Rate: the tasks were only available to workers with an overall approval rate of at least 80%, who had completed at least 50 HITs on the platform. The idea behind this was to discourage workers lacking the skills from attempting to complete HITs. Since requesters have the possibility of rejecting work, they do not have to pay for incomplete or badly completed tasks. For the worker, a rejected HIT immediately affects his or her HIT Approval Rate, reducing access to further work. As a consequence, serious workers are generally careful not to work on HITs they cannot complete satisfactorily and that might therefore be rejected. Although the selected thresholds (80% and 50) are quite low compared to the defaults on AMT, they were successfully used in a previous experiment by Rayner et al. (2012), which led us to maintain them.

One difficulty on AMT, which distinguishes it from traditional human evaluations, is the impossibility of providing detailed task instructions. Since work is split up into very small tasks, for which workers receive a correspondingly small payment, the tasks have to be completed quickly. Workers cannot therefore be expected to read long and complex instructions. In our experiment, the instructions provided were simply the following:

(38) In each of the following groups, look at the original French sentence and two possible translations. Mark one of the translations as better, or mark them as about the same if you can't decide.
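The experiment described here was set up through the AMT requester web interface. Purely as an illustration of the worker restrictions just described, the same constraints could be expressed programmatically with Amazon's current boto3 MTurk client roughly as follows; the title, reward and durations are invented, the qualification type IDs are Amazon's built-in system qualifications, and running this would additionally require AWS credentials and a valid Question XML:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    qualification_requirements = [
        {   # restrict to workers located in Canada
            "QualificationTypeId": "00000000000000000071",   # Worker_Locale
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "CA"}],
        },
        {   # overall HIT approval rate of at least 80%
            "QualificationTypeId": "000000000000000000L0",   # Worker_PercentAssignmentsApproved
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [80],
        },
        {   # at least 50 approved HITs completed on the platform
            "QualificationTypeId": "00000000000000000040",   # Worker_NumberHITsApproved
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [50],
        },
    ]

    question_xml = "..."   # HTMLQuestion XML built from the HIT template (omitted here)

    mturk.create_hit(Title="Compare two French-to-English translations",
                     Description="Pick the better of two machine translations",
                     Reward="0.20",
                     MaxAssignments=3,                 # three different workers per HIT
                     LifetimeInSeconds=7 * 24 * 3600,
                     AssignmentDurationInSeconds=1800,
                     Question=question_xml,
                     QualificationRequirements=qualification_requirements)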


In our experiment, each evaluation task was posted on AMT with three assignments, ensuring that three different workers would complete it. As opposed to the three translators, who each evaluated the complete set of data, workers only completed as many tasks as they wished or had access to. In this experiment, 12 different AMT workers completed our tasks. Table 4.1 shows the distribution of HITs among workers. The most prolific worker completed 70 of the 71 available HITs, while the three least active workers completed only a single HIT each.

Table 4.1: Distribution of HITs among workers

    worker           HITs completed
    1                70
    2                53
    3                24
    4                19
    5                14
    6 and 7          11
    8 and 9          4
    10, 11 and 12    1

In summary, we have collected two sets of three judgements for each sentence: three by the same three translators and three by three random AMT workers among the twelve who worked on our HITs. In the next section, we will describe how the evaluations were carried out.

4.3.2 Evaluation setup

In our evaluation, we want our participants to judge whether the translation of a pre-edited sentence is better than that of the corresponding raw sentence. Obviously, we do not want our evaluators to know which sentence is which, so raw and pre-edited versions were presented in randomised order. Evaluators then simply had to give a judgement about which of the two, first or second, was better. An additional judgement option, about the same, was offered for cases where translations were perceived to be of similar quality.


Preliminary experiments with a simple three-point scale, {first better, about the same, second better}, showed that participants had difficulties assigning judgements to bad output. If both translations were globally bad, the tendency was to rate them as about the same, although one might have been slightly better than the other (ACCEPT D9.2.2, 2013). Participants reported having misgivings about attributing a judgement with a positive connotation such as better to an altogether bad translation. Given that in this evaluation it is precisely the slight differences between translations that we are interested in, we performed another experiment, this time on a five-point scale, {first clearly better, first slightly better, about the same, second slightly better, second clearly better}. Adding the slightly option greatly increased the number of non-equal judgements, providing more useful results, so we proceeded with this scale for our evaluations. In the results presented throughout the remainder of this chapter, the slightly and clearly categories are again collapsed into one.

To simplify the evaluator's task, differences between the two translations were highlighted in colour. Highlighting differences is debatable. On one hand, if the differences are so slight that they would be overlooked without highlighting, it can be argued that they are not relevant, and can well be ignored in the evaluation. On the other hand, evaluation is a tedious task, and when confronted with longer, potentially garbled sentences, it is often time-consuming to spot the differences. Minor differences might also be overlooked due to inattention. Since we are interested even in slight changes produced by the pre-editing rules, we opted for highlighting.
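A minimal sketch of the aggregation used in the result tables later in this chapter: the five-point judgements are collapsed into three categories, and the majority judgement of the three evaluators is kept (the label strings are illustrative, not the ones stored by the tools):

    from collections import Counter

    COLLAPSE = {"first clearly better": "first better",
                "first slightly better": "first better",
                "about the same": "about the same",
                "second slightly better": "second better",
                "second clearly better": "second better"}

    def majority_judgement(judgements):
        """Return the collapsed category chosen by at least two of three evaluators."""
        counts = Counter(COLLAPSE[j] for j in judgements)
        label, n = counts.most_common(1)[0]
        return label if n >= 2 else "no majority"

    print(majority_judgement(["first slightly better",
                              "first clearly better",
                              "about the same"]))      # -> "first better"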

4.3.2.1 Interface

Ideally, both groups of evaluators should have completed the tasks using the same interface, but unfortunately Amazon's restrictions on the creation of worker accounts, which no longer allow workers who are not US residents, prevented our translators from creating accounts and completing the evaluation on the AMT platform. We therefore developed a C# Windows Forms Application that would allow our translators to perform the tasks in conditions as similar as possible to AMT.

On AMT, workers were presented with HITs of 20 evaluation units each, a unit being composed of a source and two target sentences. The target sentences to be compared were presented above each other, with a column of radio buttons on the right-hand side to select the appropriate judgement. Technically, this was done by creating a project on AMT, where we defined an HTML template with fields for the data and controls for the responses. Data was then uploaded to the project in the form of a CSV file, which was combined with the HTML template to produce the final HITs. Figure 4.1 shows a HIT as displayed on AMT. Once the HIT has been completed, the results can be downloaded in the AMT CSV format, and processed to extract the judgements.

Figure 4.1: AMT evaluation interface

For the translators, we developed a C# Windows Forms Application to reproduce the evaluation conditions on AMT. The interface of this tool also displays two target sentences above each other, and a column of judgement radio buttons on the right-hand side. In analogy to AMT, it takes a CSV file with the data as input, and writes the evaluation results into another CSV file. The main difference is that in the case of AMT, all 20 evaluation units are presented in sequence on an HTML page, while for the local tool users are shown one unit at a time, and have to navigate using buttons to go from one example to the next. Figure 4.2 shows the tool interface. The tool was distributed as an executable file over the Moodle learning platform (https://moodle.unige.ch/), to which all FTI students have access.

Figure 4.2: Tool evaluation interface

Having explained our comparative evaluation methodology, we will now proceed to describe the evaluations performed with this setup. In our study, we have two evaluation objectives: 1) evaluate the impact of individual rules, to investigate which rules are most useful or detrimental, and 2) evaluate the global impact of pre-editing, to ascertain whether pre-editing as a process is useful for community content. We have therefore performed two distinct evaluations, the first on a rule-by-rule basis with specifically selected sentences, the second with a complete pre-editing process on a randomly selected corpus.
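As an illustration of the CSV round trip just described, the sketch below writes evaluation units to an input file with the raw and pre-edited translations in randomised order; the column names are hypothetical and do not reproduce the actual template fields used in the experiment:

    import csv
    import random

    def write_hit_csv(units, path, seed=0):
        """units: list of (source, raw_translation, pre_edited_translation) tuples."""
        random.seed(seed)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["source", "translation_a", "translation_b", "a_is_raw"])
            for source, raw_mt, pre_mt in units:
                a_is_raw = random.random() < 0.5      # hide which version is which
                a, b = (raw_mt, pre_mt) if a_is_raw else (pre_mt, raw_mt)
                # a_is_raw is kept only for decoding the results; the template
                # would not display it to the evaluator
                writer.writerow([source, a, b, int(a_is_raw)])

    write_hit_csv([("Maintenant, il me la demande à nouveau",
                    "Now, I demand again",
                    "Now, it asks me for it again")], "hits.csv")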

4.4 Rule by rule evaluation

4.4.1 Data selection

The aim of this first evaluation is to ascertain the effectiveness of individual rules. We have therefore evaluated sentence pairs where the only difference between the two resides in the application of one single rule. To select the data, we used a corpus of 10,000 unseen sentences extracted from the Norton forum data. We processed these sentences with the Acrolinx Batch Checker (described in section 3.2.2 of the previous chapter) to extract all flagged sentences for each rule. We then randomly selected up to 50 sentences per rule and applied the changes required by the rule to produce the pre-edited version. For the rules intended for automatic application (Sets 1 and 3, cf. 3.8), we kept all examples, including those where flags were incorrect or pre-editing produced an error. For the rules intended for manual application, we kept only the correct flags. We then translated both the raw and the pre-edited sentences with the project baseline SMT (ACCEPT D4.1, 2012). All sentences where the translations of raw and pre-edited were different were evaluated.

In the evaluation, each translation pair was presented with a single source, either the raw or pre-edited version, according to the evaluated rules: for the rules for humans, we presented the pre-edited source; for rules for the machine, we presented the raw source.
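A sketch of this selection step, under the assumption that flagged sentence pairs are grouped by rule; translate() stands in for the baseline SMT system and the names are illustrative:

    import random

    def select_evaluation_units(flagged, translate, per_rule=50, seed=0):
        """flagged: dict mapping rule name -> list of (raw, pre_edited) sentence pairs.

        Samples up to per_rule pairs per rule and keeps only those whose raw and
        pre-edited translations actually differ."""
        random.seed(seed)
        units = {}
        for rule, pairs in flagged.items():
            sample = random.sample(pairs, min(per_rule, len(pairs)))
            units[rule] = [(raw, pre, translate(raw), translate(pre))
                           for raw, pre in sample
                           if translate(raw) != translate(pre)]
        return units

    demo = select_evaluation_units(
        {"wrongVerbForm": [("j'ai utilise toutes les cles",
                            "j'ai utilisé toutes les cles")]},
        translate=lambda s: s.replace("utilise ", "uses ").replace("utilisé", "used"),
        per_rule=50)
    print(demo)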

4.4.2 Results by rule categories

In this section we will present the results of the rule-by-rule evaluation. For a better overview of the treated phenomena, the rules are grouped into categories instead of the sets they have been attributed to (cf. 3.8). Since the results obtained with the two groups of evaluators are very similar, to avoid overburdening this section with tables, we will present only the results obtained with the translators. We will compare the results for the two groups in the next section (4.4.3). Complete results for both groups and for each rule are available in Appendix D.

4.4.2.1 Grammar (agreement)

Agreement errors are some of the most blatant errors in French, which most readers will notice and probably find disturbing. Yet they are not so disturbing for machine translation. Results for the two agreement rules (cf. 3.3.2.1) are shown in Table 4.2. The No impact column indicates the number of examples where pre-editing had no impact, i.e. where the translation of the pre-edited version was identical to that of the raw version. As mentioned earlier, the five point scale used in the evaluation setup is reduced to a three point scale, collapsing the clearly and slightly judgements. The judgements counted are majority judgements, i.e. judgements on which at least two of the evaluators agree. If the evaluators failed to reach a majority judgement, the example is considered as no majority; these are counted in the No majority column. For nearly a third of the sentences, correcting agreement has no impact on the translation. A large share of these can be attributed to the rule that corrects agreement within the noun-phrase (nounPhrAgr), since 42% of the sentences pre-edited with this rule have identical translations. Closer analysis of these sentences reveals that for a large share, the rule corrects gender agreement, e.g. *le tablette corrected to la tablette,

115

4. RULE EVALUATION: IMPACT

*ma question r´esolu corrected to ma question r´esolue. We suspect that the absence of gender inflection in English could account for the low impact of these corrections, since le and la will both be translated with the. Table 4.2: Comparative evaluation results for grammar (agreement) rules

Sentences

No impact

Raw better

About the same

Pre-edited better

No majority

100

31 (31%)

19 (19%)

9 (9%)

39 (39%)

2 (2%)

Individual results per rule are shown in Figure 4.3. The number in parentheses next to the rule names shows the number of flags evaluated. 0p nounPhrAgrR(50) subjVerbAgrR(50)

20p

40p

60p

38p 40p

80p

100p

Pre-editedRbetter AboutRtheRsame RawRbetter NoRmajority NoRImpact

Figure 4.3: % Impact of individual grammar (agreement) rules

The subject-verb agreement rule (subjVerbAgr) produces a high proportion of degraded sentences (26%). A closer investigation shows that these sentences have subjects that are the informal second person “tu”, as shown in the following example:

(39) Raw: N’h´esite pas ` a demander plus d’infos si tu le souhaite. −→ Do not hesitate to ask for more info if you wish. Pre-edited: N’h´esite pas ` a demander plus d’infos si tu le souhaites. −→Do not hesitate to ask for more info if you the souhaites. As so often on the forums, misspellings are caused by homophone confusions. In the case of verb endings, the omission of an s, or use of t instead of s, mostly results in a correct French verb form, albeit not the correct one for agreement with the subject. In the above example the wrong form souhaite could be either the first or third person singular, translated by wish and wishes respectively. By chance, the first is used here, resulting in a correct translation of the raw version. Inserting an s to produce the correct informal second person verb form souhaites to agree with the subject tu leads

116

4.4 Rule by rule evaluation

to mistranslation, leaving the French verb untranslated. This particular verb form is probably unknown to the SMT system as a consequence of the scarcity of informal second person forms in the training data, as discussed in Section 3.7.5 of the previous chapter, resulting in mistranslation. Based on these results, this rule could seem detrimental in the case of informal second person subjects. However, in this setup we are considering rules individually, which does not reflect the real use case for which they were designed. In the proper sequence, this rule would be applied before the tuVous rule, therefore the corrected verb forms would again be transformed, into the formal second person, which is not subject to the training data scarcity issue. On the whole, the agreement rules have a moderate positive impact, improving twice as many sentences as they degrade. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3,N=69) = 45.0, p < 0.001. A non-negligible part of the corrections probably only fail to have an effect due to the less inflected target language. A certain number of sentences are deteriorated as a result of the informal second person issue. Despite these drawbacks, correcting agreement remains useful for SMT. 4.4.2.2

Grammar (mood/tense)

This category contains six rules which deal with tense/mood confusions (cf. 3.3.2.2). Results for all rules combined are shown in Table 4.3. Overall, the impact of the verb rules is mostly positive, with 60% of sentences improved against 16% degraded. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3,N=77) = 77.4, p < 0.001. Table 4.3: Comparative evaluation results for grammar (mood/tense) rules

Sentences

No impact

Raw better

About the same

Pre-edited better

No majority

87

10 (11%)

14 (16%)

8 (9%)

52 (60%)

3 (3%)

117

4. RULE EVALUATION: IMPACT

The rule by rule results1 in this category are very heterogeneous, as shown in Figure 4.4. One of the least successful rules is useSubjunctive, with twice as many degraded as improved sentences. This can be explained by the relative scarcity of subjunctive verb forms as opposed to present or imperfect forms in the training data. By replacing a frequent token by a less common one, this rule degrades translation, despite correcting an obvious error in the source. The following example illustrates this phenomena, where a particularly rare, yet correct, imperfect subjunctive remains untranslated, while the incorrect imperfect in the raw version produces a correct translation: (40) Raw: Bien que ce logiciel fonctionnait parfaitement, c’´etait lui le coupable ! −→ Although this software worked perfectly with the other versions of NIS, it was the culprit! Pre-edited: Bien que ce logiciel fonctionnˆ at parfaitement avec les autres versions de NIS, c’´etait lui le coupable ! −→ Although this software fonctionnˆ at perfectly with the other versions of NIS, it was the culprit! 0w avoidConduF3-

20w

40w

80w

100w

0w

confParticipeIsUsuF9-

89w

futureConduF2-

100w

useImperativeuF2-

useSubjunctiveuF21-

60w

50w

PreNeditedubetter Aboututheusame Rawubetter Noumajority NouImpact

14w

wrongVerbFormuF50-

76w

Figure 4.4: % Impact of individual grammar (mood/tense) rules

On the other hand, the wrongVerbForm rule has as very good impact, improving three quarters of the sentences. Most of the errors corrected by this rule are the very frequent confusions between verb forms ending in e, where the past-participle is used instead of the infinitive, or vice-versa (38 of the 50 evaluated sentences fall into this category), as in the following example: (41) Raw: donc on peut pas demand´e aux techniciens −→ So we can not asked the technicians 1

The results shown for the confParticipeIsUs rule, which was developed at a later stage, were obtained in a later evaluation performed on AMT, using the same data and setup.

118

4.4 Rule by rule evaluation

Pre-edited: donc on peut pas demander aux techniciens −→ So we can not ask the technicians Since these verb forms are all very common, we do not have the same problem as with the subjunctive, and correction nearly always leads to an improvement. The remaining rules, avoidCond, confParticipIsUs, futureCond and useImperative produce too few flags to lead to significant results. Nevertheless, the impact of these rules is mostly positive, with the exception of the avoidCond rule. The confParticipeIsUs rule, which corrects specific past participle endings (i /is/it and u/us) seems promising, with 89% of sentences improved. At the time of writing this thesis, this rule is still in development to include other endings. With the exception of the subjunctive rule, which produces tokens that are not in the training data, the rules correcting verb forms have a positive impact. 4.4.2.3

Grammar (sequences)

We have two rules which we call sequence rules, since they identify sequences of words that are incorrect instead of focussing on specific grammatical issues (cf. 3.3.2.3). Evaluation results for these rules are shown in Table 4.4 and Figure 4.5. Table 4.4: Comparative evaluation results for grammar (sequence) rules

Sentences

No impact

Raw better

About the same

Pre-edited better

No majority

67

15 (22%)

11 (16%)

9 (13%)

29 (43%)

3 (4%)

Overall for this category we observe not quite three times as many improved as degraded sentences. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3,N=52) = 28.9, p < 0.001. While the wrongSeq rule performs well, with 50% of sentences improved against 14% degraded, the wrExpr is less satisfactory, degrading as many sentences as it improves (24%). Closer inspection shows that of the four degraded sentences, three are incorrectly flagged cases involving the expression bien sˆ ur, where sur is wrongly replaced by sˆ ur. The unsatisfactory impact results can thus be attributed to insufficient



Figure 4.5: % Impact of individual grammar (sequence) rules

The unsatisfactory impact results can thus be attributed to insufficient rule precision rather than to negative impact. To resolve this issue, we could move the wrExpr rule from the automatic Set 1 to the interactive Set 2, thereby excluding the imprecision cases.
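As an illustration of how these goodness-of-fit statistics can be reproduced, the following minimal sketch recomputes the value reported above for the sequence rules from the four judgement counts, assuming the scipy library is available; it is only an illustration of the test, not the script actually used for the thesis.

from scipy.stats import chisquare

# Judgement counts for the sequence rules, excluding the "No impact" category:
# Raw better, About the same, Pre-edited better, No majority.
observed = [11, 9, 29, 3]

# One-sample goodness-of-fit test against a uniform distribution over the
# four categories (expected count N/4 = 13 for each).
statistic, p_value = chisquare(observed)
print(round(statistic, 1), p_value < 0.001)   # 28.9 True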

4.4.2.4 Homophone rules

This category regroups all the rules that treat homophones or word confusions (cf. 3.3.2.4), for a total of 18 rules. Results for all rules combined are shown in Table 4.5. A total of 302 sentences were included in the evaluation of homophone rules. As seen in the previous chapter, some errors are very infrequent. Many of these rules produced only few flags on the dataset, some as few as one or two. Their occurrence in our data has led to the development of corresponding triggers in the rules, but since the rules are very specific, it is unsurprising that they produce few flags.

Table 4.5: Comparative evaluation results for combined homophone rules

Sentences   No impact   Raw better   About the same   Pre-edited better   No majority
302         57 (19%)    27 (9%)      47 (16%)         164 (54%)           7 (2%)

Overall, the homophone rules improve a five times higher proportion of sentences (55%) than they degrade (9%). A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=245) = 242.9, p < 0.001. With the exception of nounVerbConf (D.4), all the homophone rules are part of Set 1, which we intend to apply automatically. As we have seen in the previous chapter, although the precision of these rules is high, it is not perfect.


Thus the evaluated sentences include some sentences where the pre-editing is incorrect, either flagging something that is not an error, thereby introducing an error in the source, or producing an incorrect suggestion, thereby replacing an error with another. Of the 25 sentences where translation is degraded by these rules, 13 are cases where pre-editing is incorrect. More than half of the negative impact of these rules would therefore fall away if these rules were not applied automatically.


Figure 4.6: % Impact of individual homophone rules

Figure 4.6 shows the results for the individual rules. Clearly the most frequent homophone error is the confusion between the verb a (has) and the preposition à (of/to). The aÀ rule that corrects these has a positive impact, improving 50% of sentences while degrading only 4%. Without the accent, a is invariably translated as has, as shown in the following example: (42) N'hésitez pas a me signaler si je peux faire quelque chose −→ Do not hesitate has me if I can do something N'hésitez pas à me signaler si je peux faire quelque chose −→ Do not hesitate to tell me if I can do something

The results shown for the divHomophones, niN’y and onOntSont rules, which were developed at a later stage, were obtained in a later evaluation performed on AMT, using the same data and setup.


Changing only a single token greatly improves the translation. We also observe that for this rule, we have a high proportion of unchanged sentences, where the translation is not impacted by pre-editing (32%). Closer investigation of these cases does not reveal a pattern that might explain the absence of impact. However, since this confusion is very common, it might also occur in the SMT training data, which could explain why in some expressions both versions are translated identically. Among the other reasonably frequent confusions, ceSe, duDû, laLà and nounVerbConf, duDû is the most successful, with 77% of sentences improved. Similarly to aÀ, this is a confusion between a preposition (du - of) and a verb (dû - past participle of devoir, to have to do sth). As shown in the following example, restoring an important word such as a verb noticeably improves the translation: (43) je pense qu'il y a du avoir un problème avec mon LiveUpdate −→ I think there is having a problem with my LiveUpdate je pense qu'il y a dû avoir un problème avec mon LiveUpdate −→ I think there has to be a problem with my LiveUpdate The remainder of the rules produce too few flags to provide significant results, yet on the whole, they improve more sentences than they degrade. Correcting homophones is clearly successful in terms of translation improvement. As many of these errors are confusions of categories, such as verbs and function words, correcting them often resolves important issues. Although only few words are changed, the impact on the understandability of the sentence is high.

4.4.2.5 Punctuation

This category contains thirteen rules that treat punctuation and spacing. Besides rules that treat spaces around punctuation, it includes rules for comma placement, elision, hyphenation, parenthesis use and capitalisation. We evaluated a total of 467 sentences for this category. Results for the entire category are shown in Table 4.6. For 34% of the sentences, pre-editing had no impact on the translation. We also observe a high proportion of sentences for which no majority judgement was reached (6%). We will see later that this can be attributed to specific rules. Despite this, overall the impact of the punctuation rules is positive, with 40% of sentences improved against 10% degraded.


A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=149) = 141.2, p < 0.001.

Table 4.6: Comparative evaluation results for combined punctuation rules

Sentences   No impact    Raw better   About the same   Pre-edited better   No majority
467         159 (34%)    48 (10%)     45 (10%)         189 (40%)           26 (6%)


Figure 4.7: % Impact of individual punctuation rules

Figure 4.7 shows the individual results for each rule. The most striking aspect of these results is the low impact of four of these rules. Two of them are rules that handle spaces: addSpaceBtwnWords, which inserts spaces between tokens, and addSpacePunct, which corrects spacing around punctuation. The lack of impact of these rules is due to the fact that these changes, while they improve the correctness of the text and may improve human readability, are mostly lost during the pre-processing steps of most MT engines. Multiple spaces tend to disappear during tokenisation, and correct spacing between tokens in the output is frequently reconstructed with the language model.


There are, however, cases where adding a blank creates a clear separation between two tokens that would otherwise be considered as one, and translated incorrectly, as shown in the following example: (44) Raw: J’ai fait une mise a jour de NORTON il y a 1mois. −→ I have done a update of NORTON there 1mois. Pre-edited: J’ai fait une mise a jour de NORTON il y a 1 mois. −→ I have done a update of NORTON there is 1 month. The other pair of rules with a low impact are those correcting elisions: both adding missing elisions (addElision) and removing incorrect ones (wrongElision) leave the translation unchanged in 70% and 60% of sentences respectively. However, when they do have an impact, it is mostly positive. In some cases, changing the punctuation in the source affects nothing other than punctuation in the target, which is often the case for added or removed commas. While these changes improve the readability of the source, and arguably improve the readability of the target, they do little to improve the sentence globally, as shown in the following example: (45) Raw: J ai utiliser les outils que norton a mis a ma disposition Norton Power eraser mais c est sans effet −→ I have to use the tools that norton has made my available Norton Power eraser but that is no effect Pre-edited: J ai utiliser les outils que norton a mis a ma disposition Norton Power eraser, mais c est sans effet −→ I have to use the tools that norton has made my available Norton Power eraser, but that is no effect The addition of a comma may not be considered relevant in the face of a “word salad” translation. Evaluation results indicate that improvement or degradation through the absence or presence of a comma is a controversial issue. Our translators fail to reach majority judgements for many of these cases. Two rules treat commas: addComma and avoidPunct. With an average of 16% (12% and 20% respectively), the percentage of sentences with no majority is much higher for these rules than for other rules. On the entire dataset, after removing these two rules, the percentage of sentences with no majority is 3.6%. This suggests that these cases are difficult to evaluate. Of the no-majority sentences, two-thirds can be attributed to cases where the only difference between the two translations was a comma.


Despite these partly ambiguous results, inserting a comma is clearly useful in a number of cases: a closer investigation of the first rule reveals that there are sentences where inserting a comma after a specific expression changes the translation of that expression. The following example shows a case where insertion of the comma after par contre has a clearly positive effect: (46) Raw: par contre sous ie9 , j'ai pareil que toi. −→ By against under IE9, I have the same as you. Pre-edited: par contre, sous ie9 , j'ai pareil que toi. −→ However, under IE9, I have the same as you. Cases like this represent 12 of the 50 evaluated sentences for this data set. Despite the large proportion of cases with mixed impact, 36% of sentences were judged as improved by the addComma rule, which is three times the number of sentences degraded. The avoidPunct rule is less successful, with 26% positive against 30% negative. However, considering the large proportion of unclear results, we suspect the error margin on these figures is relatively high. With 78% of sentences improved, the avoid(char) rule, which flags characters between parentheses (commonly used to indicate an optional plural), is one of the most successful rules, followed closely by the rule detecting sentence boundaries without punctuation, checkEOS/BOS (76% of sentences improved). This rule identifies missing sentence boundaries, mainly based on expressions and capitalisation, and inserts periods at the appropriate place. In most cases, this insertion doesn't affect the translation much, apart from adding a period, as shown in the following example: (47) Raw: Désabonnement Bonjour, il y a peu j'ai décidé de me désabonner à Norton Mobile Security. −→ Unsubscribe Hello, there is little I decided to unsubscribe to Norton Mobile Security. Pre-edited: Désabonnement. Bonjour, il y a peu j'ai décidé de me désabonner à Norton Mobile Security. −→ Unsubscribe. Hi, there is little I decided to unsubscribe to Norton Mobile Security. It is however perceived very positively by the evaluators, both translators and AMT, probably because it improves readability. In some cases, this rule has a more important impact on the translation, as the following example illustrates:


(48) Raw: Merci de votre aide Bien cordialement. −→ Thank you for your help Asset sincerely. Pre-edited: Merci de votre aide. Bien cordialement. −→ Thank you for your help. Best regards. We have mentioned in the previous chapter that the rules were applied to text files, after automatic removal of HTML tags. This pre-processing step may have introduced artefacts, combining two sentences separated by a newline character or tag into one single sentence. Some of the cases flagged by this rule may be such artefacts, which would not exist if the rule were applied directly in a forum text box, where the original segmentation markers would prevent erroneous concatenation of two sentences. Only a further study investigating usage of the rules in a real use case, in the forum interface itself, would reveal how the pre-processing caused by our “offline” development and evaluation approach has influenced results. Finally, the addHyphen rule is one of the more successful rules, with 50% of sentences improved. Expressions like “ci-joint”, “c'est-à-dire”, “moi-même” are mistranslated without the hyphens, as illustrated in the following example: (49) Raw: Ci joint mon fichier de l'historique. −→ Ci attached my file history. Pre-edited: Ci-joint mon fichier de l'historique. −→ Enclosed my file history. Besides expressions, the rule also corrects missing hyphens in the verb-pronoun constructions used in interrogative sentences: (50) Raw: Avez vous une solution pour désinstaller votre logiciel −→ Have you a solution to uninstall your software Pre-edited: Avez-vous une solution pour désinstaller votre logiciel −→ Do you have a solution to uninstall your software Overall, the punctuation rules are useful, despite a relatively low impact compared to other rule categories. Of all categories, this one has the lowest negative impact, with less than 10% of sentences degraded. Many of these rules have a high precision and can be applied automatically, as they have unambiguous, unique replacement suggestions.
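To make the behaviour of some of these automatic rules more concrete, the sketch below gives rough regular-expression approximations of addHyphen, avoid(char) and addSpacePunct. The actual rules are Acrolinx checking rules with linguistic context; the patterns, the short verb list and the function names here are illustrative assumptions only.

import re

def add_hyphen(text):
    # addHyphen: insert the missing hyphen in inverted verb-pronoun
    # constructions, e.g. "Avez vous" -> "Avez-vous".
    return re.sub(r"\b(Avez|Pouvez|Devez|Voulez)\s+(vous)\b", r"\1-\2", text)

def avoid_optional_plural(text):
    # avoid(char): remove characters between parentheses used to mark an
    # optional plural, e.g. "fichier(s)" -> "fichier".
    return re.sub(r"\((?:s|es)\)", "", text)

def add_space_punct(text):
    # addSpacePunct: collapse repeated spaces and drop a space wrongly
    # placed before a period or comma.
    text = re.sub(r" {2,}", " ", text)
    return re.sub(r" +([.,])", r"\1", text)

print(add_hyphen("Avez vous une solution pour désinstaller votre logiciel"))
# Avez-vous une solution pour désinstaller votre logiciel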


4.4.2.6 Informal language

We have six rules for informal language: three that focus on lexical aspects and three that treat syntactic and stylistic issues. Combined results are shown in Table 4.7. The global impact of these rules is positive, with clearly more improved than degraded sentences. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=249) = 156.9, p < 0.001. Figure 4.8 shows the individual results by rule, with the three lexical rules first.

Table 4.7: Comparative evaluation results for combined informal language rules

Sentences   No impact   Raw better   About the same   Pre-edited better   No majority
291         42 (14%)    54 (19%)     38 (13%)         144 (49%)           13 (4%)


Figure 4.8: % Impact of individual informal language rules

The avoidAnglicism rule has no impact on translation in 40% of cases, and only 28% of sentences are improved. The absence of impact is a logical consequence of our language pair. Since our target language is English, an Anglicism, if left untranslated, has a reasonably high probability of being an acceptable word in the target language; the replacement inserted by our rule, if then translated correctly, might well result in the exact same English word, as shown in the following example: (51) Raw: est-on protéger de ce malware ? −→ Are we protect of this malware? Pre-edited: est-on protéger de ce logiciel malveillant ? −→ Are we protect of this malware?


Obviously, this rule would lead to entirely different results for another target language. The avoidFAbbrev rule reduces the occurrence of untranslated words in the MT output by replacing unknown words or tokens with words that are in the training corpus. While some abbreviations such as info are understandable both in French and English, others such as bdr (base de registre - registry) or dd (disque dur - hard disk) need to be transformed for correct translation. With 70% of sentences improved against 2% degraded, this is one of the most successful rules. The avoidInformal rule has a positive impact, with 54% of sentences improved. As shown in the following example, the colloquial expression me sortir (to output something) is translated incorrectly, while its less colloquial equivalent me donner (to give something) is translated correctly: (52) Raw: [...] et il me sort le message suivant : −→ [...] and I fate the following message: Pre-edited: [...] et il me donne le message suivant : −→ [...] and it gives me the following message: This rule also treats onomatopoeias such as grrr, pfff, bof, etc. that express emotions. These expressions are difficult to translate, since they are mostly absent from the training data, resulting either in the insertion of the unchanged source word in the target sentence, or in a strange translation. Assuming that these expressions are not essential to the message, since they do not convey any technical information necessary to understand a question or describe a solution, they are simply removed, as shown in the following example: (53) Raw: heu je n'ai pas cette version norton. −→ H I do not have this version norton. Pre-edited: je n'ai pas cette version norton. −→ I do not have this version norton. In this case a possible correct translation for heu would have been the English umm, but since it does not provide any useful information, it can be removed without major loss of meaning.
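Both lexical rules essentially map a closed list of tokens to replacements, or to nothing in the case of onomatopoeias. The sketch below mimics this with a small dictionary; only the entries cited above are included, and the function and data structure are illustrative assumptions rather than the actual Acrolinx resources.

# Illustrative subset of the avoidFAbbrev and avoidInformal replacement lists;
# an empty string means the token is simply deleted.
REPLACEMENTS = {
    "bdr": "base de registre",   # registry
    "dd": "disque dur",          # hard disk
    "heu": "",                   # onomatopoeias are removed
    "pfff": "",
    "grrr": "",
    "bof": "",
}

def normalise(sentence):
    # Whole-token replacement on a naive whitespace tokenisation.
    tokens = (REPLACEMENTS.get(t.lower(), t) for t in sentence.split())
    return " ".join(t for t in tokens if t)

print(normalise("heu je n'ai pas cette version norton."))
# je n'ai pas cette version norton.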


The first rule treating informal interrogatives, avoidDirQuest, transforms statement questions, very frequent in spoken language, into their inverted equivalent. Since this type of non-inverted question is also acceptable in conversational English, in our context a translation maintaining this structure without inversion would also seem acceptable. This mostly seems to be the case with the ACCEPT SMT system. The evaluation shows that this rule degrades more sentences than it improves. For most of the cases with degradation, the subject is the informal second person tu. Here inversion leads to a construct with -tu, which is nearly always translated badly: (54) Raw: Tu parles de Mon Norton ? −→ You speak of My Norton? Pre-edited: Parles-tu de Mon Norton ? −→ Parles-tu My Norton? This phenomenon accounts for 15 of the 24 degraded sentences, and we expect it might not be a problem if the tuVous rule were applied afterwards. This still leaves us with 9 sentences where inversion clearly degrades translation. Niessen & Ney (2001) do the exact opposite, removing question inversion, and their results suggest that “harmonizing word order between [...] different sentence types enables a better exploitation of the bilingual training data for statistical machine translation”. Producing a less common word order, such as inversion, may in fact degrade translation. The other rule treating a fairly informal type of question, avoidEstCeQue, has a positive impact on translation, improving more than half of the sentences. Est-ce que is not identified as a unit, and its components are translated individually, as illustrated in the following example: (55) Raw: Est-ce que tu as pu résoudre ton problème ? −→ Is it that you have been able to resolve your problem? Pre-edited: As-tu pu résoudre ton problème ? −→ Have you been able to resolve your problem? Replacing these structures by an inverted structure, which is both unambiguous and indivisible, improves the translation. In an experiment involving data from the same forums, Rayner et al. (2012) performed the opposite transformation to create artificial informal training data by replacing inverted structures with est-ce que sequences. However, this artificial data was not found to improve translation.


Finally, the rule correcting incomplete negations (incomplNeg), i.e. the very frequent omission of ne in spoken French and on forums, has an overall positive effect, with 48% of sentences improved. We also observe that for 22% of sentences, the rule has no impact. A closer inspection reveals that the negations in these translations are complete and correct, despite the missing ne. Overall the rules treating informal language have a positive impact, but performance differs greatly between individual rules. Replacing anglicisms has little impact, but this is a direct consequence of our target language. Replacing abbreviations and informal language has a high positive impact, with 70% of sentences improved, making these some of our most effective rules. The rules treating informal questions also have a positive impact, although in the present evaluation the absence of sequential rule application leads to artificially bad results, since one rule creates structures that would improve translation only if the tuVous rule were applied afterwards, which was not done here.

4.4.2.7 Simplification

Overall the six simplification rules have a positive impact on translation, improving 65% of sentences. Table 4.8 shows combined results for these rules. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=255) = 252.9, p < 0.001.

Table 4.8: Comparative evaluation results for simplification rules

Sentences   No impact   Raw better   About the same   Pre-edited better   No majority
260         5 (2%)      54 (21%)     22 (8%)          170 (65%)           9 (3%)

Figure 4.9 shows individual results per rule. Several of the simplification rules remove words that are not essential to the meaning of the sentence: avoidAdverbs, avoidConjBOS and avoidNon. Their evaluation is not straightforward, since our evaluation methodology introduces a bias in certain cases.



Figure 4.9: % Impact of individual simplification rules

For two thirds of the sentences evaluated for the avoidAdverbs rule, removing an adverb from the source affects the translation no further than also removing the adverb, as in the following example: (56) Raw: effectivement c'est pareil sous chrome avec norton 360 −→ Actually it is the same under chromium with norton 360 Pre-edited: c'est pareil sous chrome avec norton 360 −→ It is the same under chromium with norton 360 In such cases, as observed for the avoidInformal rule, evaluators generally find the translation that is closer to the source to be better. Thus results are influenced more by the choice of the presented source than by the quality of the translation itself. Arguably, from a simplification perspective, if we assume that removing adverbs improves readability, then translations where these adverbs are also removed can indeed be considered better. In the remaining third, removing adverbs changes the translation of the near context, as shown in the following example: (57) Raw: Je suis, bien entendu, allé sur Malekal's site... −→ I am, of course, gone on Malekal' s site... Pre-edited: Je suis allé sur Malekal's site... −→ I went on Malekal' s site... For these cases only 16 of the 33 sentences are improved. For avoidConjBOS, 8 of the 50 evaluated examples have translations that are changed beyond the mere absence of the conjunction. For four of those, the change was considered positive. The avoidNon rule produced only 3 correct flags on the corpus, so results are not significant, yet all three have a positive impact.


The rule that cuts long sentences at ce qui (avoidCeQui) has an overall positive impact, improving 50% of sentences. Yet it also degrades 25% of sentences, and shows a high proportion of sentences where no majority judgement was reached (19%). For 9 of the 16 sentences, we observe an identical impact on MT, with which replaced by This, as in the following example: (58) Raw: Il peuvent prendre ton PC en assistance, ce qui permettra de voir ce qu'a supprimé NU. −→ It can take your PC in assistance, which will see what has removed NU. Pre-edited: Il peuvent prendre ton PC en assistance. Ceci permettra de voir ce qu'a supprimé NU. −→ It can take your PC in assistance. This will see what has removed NU. In these cases, since pre-editing does not result in a direct improvement of the translation, but in a transformation of a stylistic nature, evaluation is more subjective. This is reflected in the results: for only 5 of these 9 cases, the pre-edited version was judged to be better. Results are disparate, with no unanimous judgements, 2 cases with no majority, and many cases where judges strongly disagree (i.e. judgements at both ends of the scale). This suggests that changes of a purely stylistic nature are very difficult to judge. Regarding the remaining 7 cases, 3 are improved, 3 degraded, and one has no majority. For all cases, we again observe strong disagreement between the judges. Two of the degraded sentences are similar, illustrating an interesting aspect of pre-editing for SMT. The following example shows one of these sentences: (59) Raw: En fait, ce virus remplace le “shell” de Windows et se substitue à lui, ce qui fait que le PC est totalement bloqué dès le moment où Windows devrait se lancer. −→ In fact, this virus replaces the “shell” Windows and replace it, which means that the PC is completely blocked when Windows should begin. Pre-edited: En fait, ce virus remplace le “shell” de Windows et se substitue à lui. Ceci fait que le PC est totalement bloqué dès le moment où Windows devrait se lancer. −→ In fact, this virus replaces the “shell” Windows and replace it. This is that the PC is completely blocked when Windows should begin.


Here our pre-editing rule breaks the ce qui fait que sequence, which is correctly translated. The sequence produced by our rule (Ceci fait que), on the other hand, produces an incorrect translation. While it is often suggested that splitting long sentences improves translatability (e.g. Bernth & Gdaniec, 2001; Underwood & Jongejan, 2001), and we will see in Chapter 6 that this is indeed the case for rule-based systems, the impact of this transformation on SMT systems is unpredictable. Since no sentence-level analysis is performed, the length of sentences appears less relevant than the sequences of words produced. Despite these issues, we can still consider this rule as positive, since it successfully splits source sentences, which in turn produce correctly split target sentences. While this cannot be considered an improvement of translation per se, it still is an improvement of the target (on the assumption that shorter sentences are more readable) through pre-editing of the source. Another rule of the simplification category is the avoidCleftSent rule, which replaces cleft sentences with simpler constructions, as shown in the following example:

(60) Raw: Ce que vous pouvez faire pour le moment, c'est d'utiliser un autre gestionnaire de mot de passe. −→ What you can do for the moment, that is to use a different password manager. Pre-edited: Pour le moment vous pouvez utiliser un autre gestionnaire de mot de passe. −→ For the moment you can use another password manager. Not all cleft sentences are translated badly: in fact, 16 of our 32 examples are translated correctly. However, we have only two sentences that were judged as equal. With this rule, we are again in a situation where the displayed source will impact judgement. Although the meaning is more or less the same, the cleft structure alters the emphasis, as in the following example: (61) Raw: Ce que j'ai besoin, c'est un cd/dvd bootable −→ What I need is a bootable cd / dvd Pre-edited: J'ai besoin d'un cd/dvd bootable −→ I need a bootable cd / dvd If we show the pre-edited source (as we have done here), the second translation might be considered better; if we show the raw version, the first might be preferred. This uncertainty is also reflected in our results: only 4 of the 32 examples received unanimous judgements.


Finally, in this category we have the avoidPresPart rule, which suggests replacing present participles either by a relative clause (une erreur nécessitant un redémarrage −→ une erreur qui nécessite un redémarrage) or by a subordinating conjunction (Mon abonnement arrivant à expiration −→ Comme mon abonnement arrive à expiration). The overall impact is positive, although we have a relatively high proportion of sentences where both versions are judged to be adequate (20%).

Overall the simplification rules have a positive impact. However, results reveal that our evaluation methodology is not ideal for these rules. Indeed, since many of these rules do not improve translation in the traditional sense - producing a correct translation where the raw input produced a garbled one - but introduce simplifications in the source that are successfully passed on to the target, the comparative evaluation is not well suited to measuring these improvements. Especially for those rules that remove non-essential tokens, the outcome of the comparative evaluation is strongly influenced by the presented source. For these rules, an evaluation based on specific criteria, such as readability, would seem more appropriate.

4.4.2.8 Reformulation

The reformulation rules transform structures and remove certain expressions in order to improve translatability. They belong to the group we call rules for the machine, since they do not improve the source and are not designed to improve human readability. These rules were developed using the bigram/trigram approach described in 3.7.1. All these rules replace expressions that the SMT translates incorrectly with expressions that produce better translations, as shown in the following examples: (62) Raw: Merci de nous tenir au courant −→ Thank you keep us informed Pre-edited: Veuillez nous tenir au courant −→ Please keep us informed (63) Raw: Je vous remercie comme je n'ai que les connaissances minimum sur le net. −→ I thank you as I have that the minimum knowledge on the net. Pre-edited: Je vous remercie comme j'ai seulement les connaissances minimum sur le net. −→ I thank you as I have only the minimum knowledge on the net.


Combined results for these rules are shown in Table 4.9; detailed results by rule are shown in Figure 4.10.

Table 4.9: Comparative evaluation results for reformulation rules

Sentences   No impact   Raw better   About the same   Pre-edited better   No majority
289         26 (9%)     28 (10%)     22 (8%)          204 (71%)           9 (3%)


Figure 4.10: % Impact of individual reformulation rules

Overall, the reformulation rules have a highly positive impact on the quality of the MT output, improving 71% of the sentences, while degrading only 10%. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=263) = 390.5, p < 0.001. The addDoisJe rule has a very high positive impact, with 86% of sentences improved. It is closely followed by avoidExpression, avoidPourQue and wrongAbbrev.

As a consequence of rule merging, the total number of examples evaluated for the avoidExpression rule exceeds 50. Originally, several rules were developed for different expressions. At the time of the evaluation, examples were selected for each of these rules. Ultimately these individual rules were merged into the single avoidExpression rule, resulting in a higher total number of evaluated examples for this rule.


The avoidIlFautQue and avoidSiEtQue rules also have high proportions of improved sentences, yet more degraded sentences as well. A clear outlier is the avoidPlus rule, which degrades more sentences than it improves. This rule performs several transformations regarding the token plus, which is ambiguous since it can mean either more or no longer/not any more. To develop this rule, trigrams including plus were extracted and analysed. It was found that in most cases, plus is simply translated by more, which is not a correct translation. The rule performs different transformations, depending on the context. In some cases, where plus is used to emphasise rien or aucun, the rule simply removes it. While this does skew the meaning, it generally improves translation, as shown in the following example:

(64) Raw: je ne comprends plus rien −→ I do not understand more Pre-edited: je ne comprends rien −→ I do not understand Another transformation performed by this rule is replacing plus with the unambiguous pas when it follows a negated verb: (65) Raw: mon ancien norton antivirus 2011 ne s'ouvre plus. −→ my old norton antivirus 2011 does more. Pre-edited: mon ancien norton antivirus 2011 ne s'ouvre pas −→ my old norton antivirus 2011 does not open. In this case, we also have a slight distortion of the meaning, as the ideal translation would be no longer opens, yet the achieved translation is still better than that of the raw sentence. This distortion of the meaning produces the same issue that we observed for the simplification rules: when the translations of both the raw and the pre-edited versions are correct, the version that is closer to the source is preferred. For this particular rule, this factor was unfavourable, since - as it belongs to the rules for the machine - it was evaluated against the raw source, and thus the raw version was preferred in the cases where both translations were correct. Overall, it is unsurprising that the reformulation rules perform well, since they have been developed specifically for these data and this SMT system. We do however observe that they produce relatively few flags, as they are mostly lexical and thus very specific.
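The bigram/trigram methodology referred to above (cf. 3.7.1) can be approximated as follows: collect the trigrams of the forum corpus that contain a problematic token, here plus, and inspect how the most frequent ones are machine translated. The snippet is a minimal sketch assuming a plain-text corpus file with one sentence per line; the file name and the naive tokenisation are placeholders, not the actual ACCEPT tooling.

from collections import Counter

def trigrams_containing(path, token):
    counts = Counter()
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            words = line.lower().split()       # naive whitespace tokenisation
            for i in range(len(words) - 2):
                trigram = tuple(words[i:i + 3])
                if token in trigram:
                    counts[trigram] += 1
    return counts

# Hypothetical corpus file; the most frequent trigrams around "plus" would
# then be translated with the SMT system and inspected manually.
for trigram, freq in trigrams_containing("norton_forum_fr.txt", "plus").most_common(20):
    print(freq, " ".join(trigram))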


4.4.2.9 Informal second person

Among the rules for the machine, we also have the tuVous rule, which replaces informal second person pronouns and verb forms, similarly to what was done by Rayner et al. (2012). This rule produces nearly 2,000 flags on 10,000 sentences and has a high positive impact. Table 4.10 shows the results for this rule and Figure 4.11 shows a graphical representation of these results.

Table 4.10: Comparative evaluation results for informal second person rule

Sentences   No impact   Raw better   About the same   Pre-edited better   No majority
50          20 (40%)    3 (6%)       4 (8%)           21 (42%)            2 (4%)

A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=30) = 32.6, p < 0.001. Unsurprisingly, the proportion of cases with no impact is large: these represent the cases where the informal second person verb form is translated correctly, resulting in the same translation as the formal form. Despite this, the global impact of this rule is highly positive, with only very few sentences degraded (6%).


Figure 4.11: % Impact of informal second person rule

4.4.2.10 Clitics

The rules for clitics were developed specifically for the ACCEPT SMT system, also using the bigram/trigram methodology (3.7.1). Combined results for these rules are shown in Table 4.11, detailed results by rule are shown in Figure 4.12. The overall impact of these rules is more positive than negative. A one-sample chi-square test of goodness of fit was performed to test the difference in proportions between the four categories, which was found to be significant, χ2(3, N=118) = 60.4, p < 0.001.
[Figure content: an extract of an XLIFF file exported from the portal, showing the processing phases (mt_baseline machine translation and bilingual post-editing on the ACCEPT Portal), per-segment count elements such as x-keys, x-delete-keys, x-white-keys, x-nonwhite-keys and x-editing-time, and a trans-unit containing the French source, the post-edited target and the baseline MT output as an alt-trans element.]

Figure 5.2: XLIFF output from ACCEPT post-editing portal

The XLIFF files exported from the portal record process information such as editing time and keystrokes for each segment. Figure 5.2 shows an extract of such an XLIFF file. For our study, we have performed two post-editing experiments using the ACCEPT portal, with the aim of assessing the impact of pre-editing on post-editing activity. We will now describe these experiments.
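Since the markup of Figure 5.2 did not survive conversion of the manuscript, the snippet below only sketches how such per-segment process information might be read back from an exported file; the count-type value x-editing-time corresponds to the one visible in the export, but the parsing code itself is an assumption and not part of the ACCEPT tooling.

import xml.etree.ElementTree as ET

def editing_times(xliff_path):
    # Collect the x-editing-time count element of each phase, matching tags
    # on their local name so that XML namespaces can be ignored.
    times = {}
    root = ET.parse(xliff_path).getroot()
    for elem in root.iter():
        if elem.tag.split("}")[-1] == "count" and \
                elem.get("count-type") == "x-editing-time":
            times[elem.get("phase-name")] = float(elem.text)
    return times

# Hypothetical file name; returns a mapping from phase name to seconds.
print(editing_times("accept_post_editing_export.xlf"))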

5.5 First experiment - impact of successful pre-editing

Our first experiment focuses on pre-editing that improves translation. We are mainly interested in two aspects, namely 1) whether the translation improvements produced by pre-editing result in less post-editing effort, and 2) how the advantage gained in the post-editing task relates to the effort invested in pre-editing. For both these aspects, we have chosen to focus on temporal effort, for several reasons.


First, since in the ACCEPT project we expect forum users to perform the post-editing task, the time factor is an essential component. As for any other task completed voluntarily by unpaid participants, the necessary time investment is a determining factor in people's willingness to participate. Second, temporal effort is straightforward to observe and can easily be recorded on the ACCEPT Portal. To complement the time measurements, we will also compute edit distances as another indicator of post-editing effort.

5.5.1 Data

The data used for this study were extracted from the French Norton forums. In order to create a representative corpus, we selected 684 sentences based on n-gram frequency, keeping the same proportion of sentences of each length. Sentence lengths range from 6 to 35 words, for a total of 9,741 words. As a result of this selection process, all sentences were out of context.

Since only a limited amount of data could be processed, it was decided to select segments based on their representativeness of the entire Norton corpus. A script was designed to sort sentences based on a weighted sum of n-gram frequencies.
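One plausible reading of this selection script is sketched below: each sentence is scored by the corpus frequency of its n-grams (bigrams here), so that sentences built from common n-grams rank as more representative. The exact weighting used for the thesis is not reproduced; the function names and the choice of n are assumptions.

from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rank_by_representativeness(sentences, n=2):
    # Corpus-wide n-gram counts over a naive whitespace tokenisation.
    freq = Counter(g for s in sentences for g in ngrams(s.split(), n))
    # Average n-gram frequency of a sentence: sentences made of common
    # n-grams rank as more representative of the corpus.
    def score(sentence):
        grams = ngrams(sentence.split(), n)
        return sum(freq[g] for g in grams) / max(len(grams), 1)
    return sorted(sentences, key=score, reverse=True)

Sentences would then be sampled from the top of such a ranking while keeping the original distribution of sentence lengths.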

5.5.2 Participants

For both the pre-editing and post-editing tasks, we recruited translation students in the second year of the MA program at the Faculty of Translation and Interpreting (FTI) of the University of Geneva. For the pre-editing task, we recruited a native French speaker. For the post-editing task, we recruited three native English speakers who had French as a working language. None of the participants had any specific technical knowledge. While translators are not the target audience for either of the tasks, they present several advantages for this experiment, as opposed to real forum users. For the pre-editing task, employing a translator ensured that rules would be applied exhaustively and correctly. While this might not produce the same result as if a real forum user had performed the task, it ensures that the output is the best obtainable with our pre-editing rules, which is what we wish to evaluate here. We will see in Chapter 7 how rule application by real forum users influences the pre-editing results. Regarding the post-editing task, in these experiments we chose to investigate bilingual post-editing, and thus also required bilingual participants.


Employing translators ensures that this condition is fulfilled. Since our experiment focuses on a comparison of temporal effort for different cases, and not on absolute times, the observed trends should be transposable to user groups with different profiles.

5.5.3 Pre-editing task

As described in 3.8, the pre-editing task was divided into three steps. First, we applied the rules from Set 1 automatically, using the Acrolinx autoApply client (cf. 2.3.1). In a second step, we had the French translator manually apply the rules from Set 2 using Acrolinx's MS Word plugin. The translator also corrected spelling errors flagged by the Acrolinx spelling module. The pre-editor was asked to treat all correct flags by selecting a suggestion or by performing manual corrections or transformations in the absence of suggestions. Incorrect flags were to be ignored. We also advised the pre-editor to treat only flagged phenomena and to leave all other errors untouched. During this process, we logged the keystrokes, mouse clicks and time. In a third step, we applied Set 3 automatically, using the same method as for Set 1. 456 of the original 684 sentences were affected by pre-editing, i.e. had one or more changes. The flags reported at each step are summarized in Table 5.1.

Table 5.1: Flags for each pre-editing rule set

Set     non-word errors      real-word errors,    simplification, reformulation,
        (spelling module)    punctuation          reordering
1       362                  87                   7
2       -                    74                   115
3       -                    -                    191
total   362                  161                  313

At the time of this experiment, the most recent checking resources were not yet available on the ACCEPT pre-editing portal, which is why we chose to use a local Acrolinx server and the MS Word plugin. Since, aside from some purely graphical differences, the checking process is identical, this should not have influenced the results.


Pre-editing Effort

The pre-editor spent 53 minutes processing the entire corpus of 684 sentences, making 334 keystrokes, 576 left-clicks and 542 right-clicks. This process changed 567 tokens in the corpus and affected 456 sentences (representing a total of 6,923 words). The large proportion of right-clicks in relation to keystrokes can be attributed to the pre-editor selecting suggestions in the contextual menu providing replacement candidates for each flag, instead of typing corrections manually. The pre-editor found the rules straightforward to apply and the pre-editing process globally quite easy, except for some terminology issues related to the unfamiliar domain.

5.5.4 Translation

The 456 sentences affected by pre-editing were then translated into English using the same ACCEPT SMT system used in the evaluations in the previous chapter. For 319 sentences, the translation of the pre-edited version was different from that of the raw version. In this experiment, since our focus is on pre-editing that improves translation, we chose to retain only sentences matching this condition. We therefore submitted the translation results to a comparative evaluation, performed by three bilingual judges with a translation background and using the same principle as in the previous chapter. The same five-point scale was used, and as in our previous evaluations, the clearly better and slightly better judgements for each category were merged. The results of the comparative evaluation are shown in Table 5.2. When considering the majority judgements, the pre-editing rules have a significant positive impact on translation quality. In 46% of cases, translation was improved, while degradation was observed in only 9% of cases.
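For clarity, the aggregation of the three judgements into the categories reported in Table 5.2 can be sketched as follows; the label strings are ours, and the snippet merely illustrates the merging of clearly/slightly better judgements and the majority rule described above.

from collections import Counter

# Five-point scale collapsed to three labels per judge (clearly/slightly
# better are merged); the label strings are illustrative.
MERGE = {
    "raw clearly better": "raw",
    "raw slightly better": "raw",
    "about the same": "same",
    "pre-edited slightly better": "pre-edited",
    "pre-edited clearly better": "pre-edited",
}

def categorise(judgements):
    # judgements: the three judges' five-point labels for one sentence pair.
    votes = Counter(MERGE[j] for j in judgements)
    label, count = votes.most_common(1)[0]
    if count == 3:
        return label, "unanimous"
    if count == 2:
        return label, "majority"
    return None, "no majority"

print(categorise(["pre-edited clearly better",
                  "pre-edited slightly better",
                  "about the same"]))
# ('pre-edited', 'majority')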

Table 5.2: Comparative evaluation results

                         total sents  No impact  Raw better  About the same  Pre-edited better  No majority
                         456          137 (30%)  43 (9%)     63 (14%)        209 (46%)          13 (3%)
Unanimous judgements:                            11          24              158


For data selection, we only considered unanimous judgements: only those sentences where all three judges considered that pre-editing had had a positive impact on the translation were retained for the post-editing task. This selection had the additional benefit of removing problematic sentences, as we had noticed that judges often fail to reach a unanimous judgement when the presented sentences are difficult to understand, due to bad segmentation or poor language or translation quality, as in the following example: (71) Raw: Hmmm... je veux bien mais : −→ Hmmm . . . well , but I want to : Pre-edited: Hmmm... Je veux bien mais : −→ Hmmm . . . I am well but : This final selection resulted in a set of 158 sentences.

5.5.5 Post-editing task

Since processing time is bound to be influenced by each editor's skill and speed, we can only compare times for each individual editor, and not from one editor to the next. Consequently, to compare editing times, we needed each participant to post-edit translations of both raw and pre-edited data. In order to obtain comparable time measurements for these two cases, two sets of comparable data were required. To achieve this, we could have assembled two equivalent batches of sentences, based on different features such as sentence length, type-token ratio, etc., as was done in Mitchell et al. (2013). Another possibility, which we opted for, was to have post-editors process each sentence twice, once raw and once pre-edited, making the times obtained for each version directly comparable. While this second approach ensures that the times are comparable, it does present one major drawback: the impact of processing order. Indeed, post-editors will spend more time reading and post-editing the first instance of a pair of sentences, thus introducing a bias in favour of whichever version is processed second. When the second instance appears, the post-editor has already spent time processing the meaning of the source and will therefore probably spend less time post-editing it. Consequently, comparing times for individual sentences will not be possible, since we have no way to compensate for this bias. However, this advantage can be distributed equally between the two versions (raw and pre-edited), by having an equal number of sentences from each version processed first, and by comparing times at the document level.


Applying this to the data selected as described above, our participants were asked to post-edit the machine translation output of both the raw source and its pre-edited counterpart. This added up to a total of 276 sentences (of the 158 sentences selected by the comparative evaluation, 20 had to be eliminated due to an error in the setup, which left us with 138 sentences), the order of which was randomised before post-editing, while making sure to have the same number of sentences for each case (raw processed first and pre-edited processed first). For our experiment, the sentences to post-edit were distributed in tasks of 20 segments each, which were formatted in the required JSON format and uploaded to the ACCEPT portal. We chose a bilingual setup, where post-editors had access to the source for each segment. The target was the machine translation of either the raw or the pre-edited sentence. Post-editors were then invited to complete tasks on the portal. We provided the post-editors with minimal written guidelines, asking them to produce a grammatically correct target sentence which conveyed the same meaning as the original, while using as much of the raw MT output as possible. The guidelines emphasized the fact that style issues were not important, unless they affected the sentence meaning. A glossary for the domain covered by the data was provided in the form of an Excel spreadsheet. Although no time limit was given for these tasks, we asked our participants to complete each individual task without taking any breaks, in order to record editing times as reliably as possible. All participants were paid for the task on a price-per-segment basis, thus encouraging post-editors to be as efficient as possible.
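The counterbalancing described above can be sketched as follows: all items are first put in a random order, and the order of the two members of a pair is then adjusted so that exactly half of the pairs present the raw translation first. The function and data structures are illustrative assumptions, not the script used to prepare the tasks.

import random

def counterbalanced_order(pairs, seed=0):
    # pairs: list of (raw_mt, preedited_mt) tuples; each sentence is
    # post-edited twice, once in each version.
    rng = random.Random(seed)
    items = [(i, version) for i in range(len(pairs)) for version in ("raw", "pre")]
    rng.shuffle(items)
    # Choose which half of the pairs should have the raw version seen first.
    raw_first = set(rng.sample(range(len(pairs)), len(pairs) // 2))
    position = {}
    for idx, (i, version) in enumerate(items):
        position.setdefault(i, {})[version] = idx
    for i, pos in position.items():
        raw_is_first = pos["raw"] < pos["pre"]
        if raw_is_first != (i in raw_first):
            # Swap the two occurrences of this pair to flip their order.
            items[pos["raw"]], items[pos["pre"]] = items[pos["pre"]], items[pos["raw"]]
    return items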

5.5.6 Post-editing temporal effort

Table 5.3 reports the results obtained for our three post-editors in terms of throughput (words per minute) for the two versions, the translation of the raw and the translation of the pre-edited source. As in any experiment involving the measurement of the time humans need to complete a task, it cannot be excluded that participants were momentarily distracted, switching windows to another task or reacting to some outside event, etc., resulting in overly long times. We have therefore chosen to remove outliers from our data.

We apply one of the common definitions of outliers based on the interquartile range (IQR): any value lower than the first quartile minus 1.5*IQR or greater than the third quartile plus 1.5*IQR is considered an outlier.
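A minimal sketch of this outlier criterion, assuming per-sentence editing times in seconds and using numpy for the quartiles (the actual analysis scripts are not reproduced here):

import numpy as np

def remove_outliers(times):
    # Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(times, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in times if lower <= t <= upper]

print(remove_outliers([12, 15, 14, 16, 13, 300]))   # the 300 s value is dropped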


This removed 13/16/19 outliers from each of the sets of 138 sentence pairs post-edited by our three participants. We observe that post-editors differ greatly in terms of throughput: PE 1 was much faster than the other two. Despite these differences, we observe a similar ratio from raw to pre-edited for all three post-editors. On average, throughput is increased by 98% (sd = 12%) for the pre-edited version.

Table 5.3: Throughput (words/min) for translations of raw and pre-edited data

            PE 1                     PE 2                     PE 3
        raw   preEd   diff.      raw   preEd   diff.      raw   preEd   diff.
w/min    49      90   +83%        30      62   +107%       28      57   +103%

Overall, results show that pre-editing clearly has a positive impact on post-editing effort. Yet we still have to ascertain whether pre-editing is worth the effort, or, more precisely, whether the productivity gained through pre-editing outweighs the effort invested in this step. While the absolute pre- and post-editing times may not be directly comparable, due to the different number of sentences processed and to the possibly artificially low post-editing times caused by the double processing mentioned above, it remains interesting to combine these times. Due to our data selection process, not all pre-edited sentences were post-edited. We thus have to estimate the pre-editing time for the effectively post-edited sentences. Because we removed outliers from the post-editing times, the subset of sentences is different for each post-editor. To estimate pre-editing time, we reduced the effective pre-editing time in proportion to the number of words in the sentences retained for each post-editor. As a consequence, times cannot be compared across post-editors. Combined results are shown in Table 5.4. For each post-editor, the table shows the number of words taken into account (after removal of outliers), the approximate pre-editing time for these data and the effective post-editing time for the raw and pre-edited versions of these data. The total shows combined times for the two scenarios, without and with pre-editing. We observe that the post-editing time gained by using a pre-edited source outweighs the time invested in the pre-editing process itself. For our fastest post-editor, PE 1, the time gain is minimal; for the other two, total time is reduced by nearly 30%.


Table 5.4: Combined pre- and post-editing times (minutes) 1st experiment

                PE 1                  PE 2                  PE 3
                Raw    PreEd          Raw    PreEd          Raw    PreEd
Words           1856                  2019                  1919
Pre-editing     -      16             -      15             -      14
Post-editing    41     22             64     31             66     32
Total           41     38 (-7%)       64     46 (-28%)      66     46 (-30%)

As the pre-editing time is a very rough estimate, these figures are only indicative, yet they clearly suggest that the time gained through pre-editing outweighs the temporal effort involved in this task. When comparing these times, we should not forget that for an equal time investment, the pre-editing effort is "cheaper" than the post-editing effort, as 1) it is a monolingual process, thus requiring less qualification from the user, and 2) it is semi-automatic, as most of the rules have suggestions and can be applied by selecting an item in a list. Thus even if the total time is not much reduced, the total effort invested can be considered lower. An additional aspect to consider is the potential benefit in the case of translation into multiple languages, where pre-editing would need to be performed only once but could benefit several translations, as was found by Aikawa et al. (2007). This was not investigated in our case, since no equivalent SMT system was developed for translation into a language other than English.

5.5.7 Edit distance

As another indicator of post-editing effort in terms of the number of edit operations, we computed the document-level Translation Error Rate (TER) (Snover et al., 2006) for each of the two MT outputs (raw and pre-edited), using the corresponding post-edited versions as references. Table 5.5 shows the scores for the raw and pre-edited versions. We observe a lower score for the pre-edited version, indicating a lower number of edits, which concurs with the reduced time invested in performing edits.

Scores were computed using the Asiya online tool, http://asiya.cs.upc.edu/ (Gonzàlez et al., 2012).
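For reference, TER as defined by Snover et al. (2006) is the minimum number of edits (insertions, deletions, substitutions and shifts of word sequences) needed to turn the MT output into the reference, normalised by the average number of reference words:

\mathrm{TER} = \frac{\text{number of edits}}{\text{average number of reference words}}

In our setting the post-edited versions serve as references, so a lower TER directly reflects fewer post-editing operations.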


Table 5.5: TER scores computed between MT output and post-edited versions

        raw     pre-edited
TER     22.22   9.66

A closer investigation of the post-editing output at sentence level shows that for both versions, a number of sentences were left unedited. Table 5.6 shows the counts for each version and each post-editor. For all post-editors, a larger number of sentences was left unedited for the pre-edited versions.

Table 5.6: Sentences where no post-editing was performed

              PE 1        PE 2        PE 3
Raw           2 (1%)      9 (7%)      6 (4%)
Pre-edited    29 (21%)    45 (33%)    23 (17%)

These counts show considerable differences between post-editors, with PE 2 editing far fewer sentences than the other two. This possibly indicates a different appreciation of what is acceptable and what requires editing, as illustrated by the following example: (72) MT output: Sorry, but I cannot answer you on something vague that I do not see. PE 1: Sorry, but I cannot answer you about something vague that I can not see. PE 2: Sorry, but I cannot answer you on something vague that I do not see. PE 3: Sorry, but I cannot give you an answer on something vague that I do not see. PE 2 has left the sentence unchanged, while the other post-editors have performed minor changes, one replacing a preposition (answer on −→ answer about), the other replacing the verb (answer you −→ give you an answer). All three versions are understandable, however, and the original MT output would be acceptable in a forum context.

5.5.8 Summary

In this first experiment, we have found that pre-editing that improves translation is also beneficial for the subsequent post-editing task, reducing post-editing time by almost half, independently of the inherent speed of each post-editor.


More time is gained through the improvements produced by pre-editing than is invested in this step, indicating that the combination of pre- and post-editing is a valid approach. Lower TER scores for reaching the final post-edited version also indicate that fewer edits are necessary to produce a correct final version when working with translations of pre-edited output. Finally, pre-edited versions are also more often acceptable without any post-editing.

5.6 Second experiment - impact of all pre-editing

Our first experiment has focused on a particular outcome of pre-editing, namely where it improves translation, and thus does not tell us whether pre-editing as a whole will effectively reduce post-editing effort. As we have seen in Chapter 4, our pre-editing rules, by improving a large proportion of sentences, have a significant positive impact on translation quality. In some cases however, pre-editing has no major impact on the translation, or even degrades it. While these cases are infrequent, they are still unavoidably part of the process, and thus also require evaluation. In this second experiment we propose to perform a global evaluation of the impact of pre-editing on post-editing temporal effort, considering all cases: translations improved by pre-editing, translations changed but not clearly improved by pre-editing, and translations degraded by pre-editing.

5.6.1 Experimental setup

For this experiment, we again extracted data from the Norton forums, selecting 500 random posts. Pre-editing was performed in the same way as for the first experiment, applying the three sets in sequence, with the only difference that in this case, we did not measure the time necessary for this task. We then translated the sentences impacted by pre-editing, and submitted these to a comparative evaluation against their raw counterparts. The bilingual judges and the evaluation context were the same as for the first experiment. We selected sentences for the three cases (improved, equal or degraded) based on unanimous judgements, as was done in the first experiment. For this experiment, to evaluate the three outcomes of pre-editing (translation improved, same or degraded), we sought to extract an equal number of examples for each case. This required an adaptation of the evaluation methodology, due to the comparative rarity of sentences for the last case, where the translation of raw is better than pre-edited. Table 5.7 shows the results of the evaluation of a first batch of 100 posts: in these data, 339 sentences underwent one or more changes as a consequence of pre-editing and for 220 of these, pre-editing had an impact on translation.

Table 5.7: Comparative evaluation results for 100 random posts

                        total sents   No impact    Raw better   About the same   Pre-edited better   No majority
All judgements              339       119 (35%)      22 (6%)       53 (16%)          141 (42%)          4 (1%)
Unanimous judgements                                   11              19                111

Considering the resulting unanimous judgements, this evaluation only yielded 11 suitable examples for the case where pre-editing degrades translation. In order to obtain more examples for this case, while performing as few evaluations as possible, we adapted the evaluation process. Instead of having three evaluators judge all the sentences of a given batch, we had two evaluators each evaluate half of the batch, then from each of these halves extracted only those sentences which had received a raw better judgement and submitted them to a second judge. From the resulting sentences with two judgements each, we extracted those that had received two raw better judgements, and submitted these to a third judge. From this final set of sentences, we only retained those where the third judge agreed with the first two. We thereby greatly reduced the number of sentences submitted to evaluation. As in all other evaluations, the order of sentences was randomised; evaluators thus did not know which of the two sentences was the translation of raw input and which that of pre-edited input. Evaluators were not informed of the special selection process. We therefore believe this process to have produced results similar to those that would have been produced by a complete evaluation. From the evaluated data, we kept three groups of about 50 sentences each to use for the post-editing task: translations of pre-edited better (50), henceforth referred to as PrEB, translations about the same (50), henceforth referred to as AS, and translations of raw better (43), henceforth referred to as RawB. Despite the adapted evaluation methodology, we did not obtain 50 examples for this last case.
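The staged selection described above can be summarised as a simple filtering over per-sentence judgements. The sketch below is a schematic reconstruction under assumed data structures (a mapping from sentence ids to verdict lists, and judge functions returning a verdict); it is not the actual ACCEPT evaluation pipeline.

    # Schematic reconstruction of the sentence selection (hypothetical data
    # structures, not the actual ACCEPT evaluation pipeline). A verdict is one
    # of "raw", "same" or "pre-edited".

    def unanimous(judgements, verdict, min_judges=3):
        """Sentences for which every judge gave the same verdict."""
        return [sid for sid, votes in judgements.items()
                if len(votes) >= min_judges and all(v == verdict for v in votes)]

    def staged_raw_better(sentence_ids, judges):
        """Adapted process for the rare 'raw better' case: a sentence is only
        passed on to the next judge if all previous judges voted 'raw'."""
        kept = list(sentence_ids)
        for judge in judges:                  # judge: sentence id -> verdict
            kept = [sid for sid in kept if judge(sid) == "raw"]
        return kept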


5.6.2 Post-editing task

As for the previous experiment, post-editors processed each sentence twice, once in the raw and once in the pre-edited version. Sentences were distributed across two batches so that, for each case (PrEB, AS and RawB), sentences were split equally between the two processing orders (raw processed first or pre-edited processed first), ensuring that results at the case level would be minimally influenced by processing order. The two batches were processed at a two-week interval. The post-editing task was performed on the ACCEPT post-editing portal, with the same three participants, the same guidelines, and the same payment as for the first experiment.

5.6.3 Post-editing temporal effort

Before discussing the results obtained with this second experiment, we should mention that the ACCEPT post-editing portal was upgraded between the two experiments, to provide a more fine-grained measurement of post-editing time. In the original version of the portal, time recording focussed on the time spent editing a segment. Time recording was triggered when the user began to edit a segment, by performing a keystroke for example, and ended when the user left this segment. This timespan will hereafter be referred to as editing time. While this was a reliable measure of the time required to perform edits on the MT output, it failed to take into account the time necessary for the post-editor to read this output, read the corresponding source, and decide which changes to make. Since MT output can be confusing and difficult to interpret at first glance, the time spent deciphering the output to edit is not negligible. In the case of messy forum data, apprehending the exact meaning of the source is not a negligible task either. Therefore, the portal was modified to record an additional time span, beginning when the user switches to a new segment, and ending when the user performs the first action in this segment. Since this should cover time spent on reading and mentally processing source and MT output, we shall hereafter refer to this timespan as thinking time.

We are aware that this distinction is quite artificial, and the denomination thinking time is misleading, since the post-editor's activity cannot be split into a single thinking period, followed by a single editing period. It seems far more plausible that these two activities are mingled, and that what we call editing time will in fact contain a certain amount of thinking activity too. Needless to say, the distribution of thinking and editing activity will also differ from individual to individual, and would require a far more sophisticated system to be apprehended correctly. We do however believe that recording the time spent on the first observation of a segment could provide useful insights about the readability of MT output. We will first present results of this second experiment in terms of editing time, which is measured on the same basis as the editing time reported for the first experiment, making these data comparable. We will then present results including the new timespan measure.
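As an illustration of these two time spans, the sketch below derives them from a hypothetical per-segment event log; the event names and log structure are assumptions made for the example, not the actual ACCEPT portal format.

    # Illustration of the two time spans, computed from a hypothetical
    # per-segment event log (event names and structure are assumptions, not
    # the actual ACCEPT portal format). Times are in seconds.

    def segment_times(events):
        """events: list of (timestamp, event_type) tuples for one segment,
        with event types 'enter', 'first_action' and 'leave'."""
        t = {etype: ts for ts, etype in events}
        if "first_action" not in t:                   # segment left untouched
            return t["leave"] - t["enter"], 0
        thinking = t["first_action"] - t["enter"]     # reading and deciding
        editing = t["leave"] - t["first_action"]      # performing the edits
        return thinking, editing

    # The post-editor opens a segment at t=0, types a first character at t=12
    # and moves on at t=47: 12 seconds of thinking time, 35 of editing time.
    print(segment_times([(0, "enter"), (12, "first_action"), (47, "leave")]))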

5.6.3.1 Editing time

Similarly to the first experiment, we have eliminated items with outlying editing and/or thinking times (outliers were removed on the same principle as for the first experiment, based on the interquartile range, IQR), thus removing 31/15/37 outliers from the sets of 143 sentences post-edited by each of our participants. These are far more outliers than in the first experiment, a consequence of including thinking time, which produces more outliers than editing time.
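A minimal sketch of this IQR-based filtering is given below; it assumes the common 1.5 x IQR cut-off, which is not explicitly stated for these experiments.

    import statistics

    # Minimal sketch of IQR-based outlier removal, assuming the common
    # 1.5 x IQR cut-off (the exact threshold used in the experiments is not
    # restated here). Each item is a (sentence, editing_time, thinking_time) tuple.

    def iqr_bounds(values, k=1.5):
        q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
        iqr = q3 - q1
        return q1 - k * iqr, q3 + k * iqr

    def remove_outliers(items):
        for col in (1, 2):                              # editing, then thinking time
            low, high = iqr_bounds([it[col] for it in items])
            items = [it for it in items if low <= it[col] <= high]
        return items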

Table 5.8: Throughput (words/min) based on editing time for translations of raw and pre-edited data, for each of the three cases

              PE 1                    PE 2                    PE 3
        raw   preEd   diff.     raw   preEd   diff.     raw   preEd   diff.
RawB     73     54    -26%       68     38    -45%       64     40    -38%
AS       57     52     -8%       62     56     -9%       48     56    +18%
PrEB     34     52    +50%       33     54    +62%       24     40    +68%

Table 5.8 shows the throughput computed based on editing time for the three post-editors, for each of the three cases (PrEB, AS and RawB). We observe that for RawB, throughput is lower for the pre-edited version, decreased by 36% on average. For PrEB (cases which are comparable to those selected in our first experiment), all participants were faster when post-editing the pre-edited version, with a throughput increased on average by 60%. Finally, for AS, results were mixed: one post-editor was slightly faster working with the pre-edited version, and two were slightly slower. These mixed results were expected, since the translations of raw and pre-edited had been judged as about the same.

Based on these editing time results, we can conclude that post-editing time is clearly related to translation quality as judged by comparative evaluation. We also observe that the negative impact of pre-editing that degrades translation is lower than the positive impact of pre-editing that improves translation. Thus even if the distribution among these three cases were even (which it clearly is not, according to our comparative evaluations), pre-editing would still globally reduce post-editing temporal effort. We will now see how the measured thinking time influences these results.

5.6.3.2 Complete time

Overall, the measured thinking time is not negligible, representing between 28% and 39% of the time spent on the post-editing task. Thus it seems reasonable to take this time into account to assess post-editing effort. Considering this timespan on its own, we have found no obvious relationship between translation judgement and thinking time common to all three post-editors. While PEs 1 and 3 were clearly faster when apprehending cases where translations were improved by pre-editing, PE 2 was not. We also observed that PE 1 spent less time on this initial thinking phase than the other two post-editors. Many aspects could influence this initial phase. Obviously, the quality of the MT output and the time required to figure out its meaning play an important role. The intelligibility of the source will also have an influence, as will the post-editors' skill in the source language. The presence of terminology that post-editors might want to look up must also be considered. Finally, different post-editors will have different strategies, which will also determine how much time is spent before performing the first edit.

Since by itself, thinking time is inconclusive, we have combined the measured thinking and editing times to get a better idea of the total effort. The resulting throughput values are shown in Table 5.9.

Table 5.9: Throughput (words/min) based on thinking time + editing time for translations of raw and pre-edited data, for each of the three cases

              PE 1                    PE 2                    PE 3
        raw   preEd   diff.     raw   preEd   diff.     raw   preEd   diff.
RawB     46     38    -19%       37     23    -38%       37     25    -33%
AS       41     34    -16%       37     34     -6%       32     37    +18%
PrEB     26     39    +46%       23     30    +30%       18     27    +55%

For these combined times, throughputs follow the same trend as that observed for editing time alone, with pre-editing that improves MT increasing throughput (by 44% on average), and pre-editing that degrades MT decreasing throughput for all three post-editors (by 30% on average). We do however observe considerable differences between post-editors in the impact of the different cases.

So far we have considered the impact of each of the three outcomes of pre-editing individually. Yet these three outcomes are not represented equally in our data, thus the global outcome of pre-editing cannot be considered as an equal combination of the three. Table 5.10 shows an estimation of the global impact of pre-editing, where we take into account the proportional importance of each case. We use the proportions obtained with unanimous judgements in the comparative evaluation (cf. Table 5.7). In this estimation, we consider only cases where pre-editing has an impact on translation, resulting in 8% raw better, 13% about the same and 79% pre-edited better.

Table 5.10: Estimation of global impact of pre-editing on post-editing throughput

                                     PE 1    PE 2    PE 3
Estimated throughput raw               30      26      21
Estimated throughput pre-edited        38      30      28
Difference                           +27%    +16%    +35%
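The figures in Table 5.10 can be reproduced, up to rounding, under the assumption of a simple weighted average of the per-case throughputs of Table 5.9, with weights taken from the unanimous judgements of Table 5.7; the sketch below makes this arithmetic explicit.

    # Weighted-average reconstruction of Table 5.10 (an assumption about the
    # exact computation, but it reproduces the published figures up to rounding).
    # Weights come from the unanimous judgements of Table 5.7: 11 raw better,
    # 19 about the same, 111 pre-edited better.

    weights = {"RawB": 11 / 141, "AS": 19 / 141, "PrEB": 111 / 141}  # ~8%, 13%, 79%

    # Throughputs (words/min) from Table 5.9, per post-editor and case.
    raw = {"PE 1": {"RawB": 46, "AS": 41, "PrEB": 26},
           "PE 2": {"RawB": 37, "AS": 37, "PrEB": 23},
           "PE 3": {"RawB": 37, "AS": 32, "PrEB": 18}}
    pre_edited = {"PE 1": {"RawB": 38, "AS": 34, "PrEB": 39},
                  "PE 2": {"RawB": 23, "AS": 34, "PrEB": 30},
                  "PE 3": {"RawB": 25, "AS": 37, "PrEB": 27}}

    def estimate(throughputs):
        return {pe: round(sum(weights[c] * t[c] for c in weights))
                for pe, t in throughputs.items()}

    print(estimate(raw))         # {'PE 1': 30, 'PE 2': 26, 'PE 3': 21}
    print(estimate(pre_edited))  # {'PE 1': 38, 'PE 2': 30, 'PE 3': 28}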

Results show that even when pre-editing is not always successful in improving translation, as is the case for our process and data, the global impact on post-editing throughput is positive, for all three post-editors. Here again, we observe considerable differences between post-editors.

In the first experiment, we have combined pre-editing and post-editing times to determine how the time gained through pre-editing relates to the time invested in this step. Results of the same combination for the second experiment are shown in Table 5.11. As pre-editing time was not measured for this second experiment, we use the pre-editing time measured in the first experiment to estimate pre-editing time for a fictitious set of 2,000 words. The corresponding post-editing times are estimated on the basis of the throughputs reported in Table 5.10.

Table 5.11: Estimated combined pre- and post-editing times (minutes), 2nd experiment

                      PE 1                  PE 2                  PE 3
                  Raw     PreEd         Raw     PreEd         Raw     PreEd
Words            2000      2000        2000      2000        2000      2000
Pre-editing         -        15           -        15           -        15
Post-editing       67        53          78        67          95        70
Total              67   68 (+1%)         78   82 (+6%)         95   86 (-10%)

For this experiment, the trade-off is less favourable than for the first. Total time is reduced for only one of the three post-editors, while for the other two it is very slightly increased. As the times given in this table are estimations, these results are necessarily only indicative. Although here pre-editing produces no overall time gain, these results still suggest that for an equal time investment, part of the complex bilingual post-editing work can be replaced by easier monolingual pre-editing work.
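The arithmetic behind Table 5.11 can be sketched as follows: post-editing time is taken as 2,000 words divided by the estimated throughput of Table 5.10, and the 15 minutes of pre-editing measured in the first experiment are added in the pre-edited scenario. Small deviations from the published table come from rounding of the throughputs.

    # Sketch of the arithmetic behind Table 5.11 (reconstructed from the
    # throughputs of Table 5.10; small differences with the published figures
    # are due to rounding).

    WORDS, PRE_EDITING_MIN = 2000, 15
    throughput = {"PE 1": (30, 38), "PE 2": (26, 30), "PE 3": (21, 28)}  # (raw, pre-edited)

    for pe, (raw_tp, pre_tp) in throughput.items():
        total_raw = WORDS / raw_tp
        total_pre = PRE_EDITING_MIN + WORDS / pre_tp
        change = 100 * (total_pre - total_raw) / total_raw
        print(f"{pe}: raw {total_raw:.0f} min, pre-edited {total_pre:.0f} min ({change:+.0f}%)")
    # PE 1: raw 67 min, pre-edited 68 min (+1%)
    # PE 2: raw 77 min, pre-edited 82 min (+6%)
    # PE 3: raw 95 min, pre-edited 86 min (-9%)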

5.6.4 Edit distance

As in the first experiment, we have computed the edit distance between raw MT output and post-edited versions by means of TER scores for the three cases, using the post-edited versions as references. The scores are shown in Table 5.12.

Table 5.12: TER scores computed between MT output and post-edited versions

                 TER
case        raw     pre-edited
RawB      13.24          15.86
AS        11.08           8.34
PrEB      14.79           7.43


For the PrEB case, we observe the same two to one relationship between the TER scores for the raw and pre-edited versions as in the first experiment, indicating a similar diminution of necessary edits as a consequence of pre-editing. In the RawB cases, TER is slightly higher for the pre-edited version, suggesting that this version requires more editing. The difference is smaller than for pre-editing that improves translation, which fits with the results obtained in terms of throughput. For cases where translations were judged as equivalent, edit distance is smaller for the pre-edited version, which might suggest that although the translation was not improved, it was still brought closer to the final version.

5.6.5 Comparison with first experiment

Comparing the results for the third case (pre-edited better) with those of the first experiment (for this comparison, we consider only the measured editing times, since we do not have thinking time measurements for the first experiment), we observe that in terms of total throughput, the post-editors were slower in the second experiment than in the first. Table 5.13 shows the combined throughput for all sentences of the first experiment, and for the comparable PrEB sentences of the second experiment. While PE 2 has a very similar throughput, PEs 1 and 3 were much slower in the second experiment. This non-negligible difference is difficult to explain. That the post-editors worked more slowly in the second experiment, despite the practice acquired in the first, is unexpected. We also observe a lower increase in throughput for the pre-edited version: for an average increase of 98% in the first experiment (cf. Table 5.3), we only have an average increase of 60% in the second experiment.

Table 5.13: Global throughput in both experiments

                   PE 1    PE 2    PE 3
1st experiment       63      40      38
2nd experiment*      41      41      29

* Including only the PrEB subset, which is comparable to the data selected for the 1st experiment


Finally, the results of the two experiments also differ in terms of the number of sentences left unedited. Proportions for the PrEB subset of data are shown in Table 5.14. The table also includes the results obtained in the first experiment as reference. In comparison with the first experiment, in total a larger proportion of sentences was left unedited in the second experiment, for all three post-editors, indicating either a higher quality of the MT output or a difference in the post-editors' appreciation of what needed editing.

Table 5.14: Proportion of sentences where MT output was left unedited for both experiments

                    PE 1               PE 2               PE 3
                 raw   pre-ed.      raw   pre-ed.      raw   pre-ed.
1st experiment    1%     21%         7%     33%         4%     17%
2nd experiment*   9%     37%        12%     49%         2%     28%

* Including only the PrEB subset, which is comparable to the data selected for the 1st experiment

The decreased throughput in combination with the lower proportion of sentences edited is difficult to explain. The data used in the two experiments were very similar, extracted from the same forum, with the same average sentence length. We must however note that the effect of pre-editing is by no means homogeneous. Our selection process, relying on comparative evaluation, tells us whether the pre-edited version is better than the other, but not to what extent. This is illustrated by the following two examples taken from the pre-edited better data:

(73) Raw: Je vous est envoyer un MP, merci de le lire ...
     MT output: I you is send a MP, thank you for the read...
     Pre-edited: Je vous ai envoyé un message privé, merci de lire ça ...
     MT output: I have sent you a private message, please read it...

(74) Raw: T'auras juste à contacter le service à la clientèle de Norton pour ton décompte.
     MT output: You can just to contact customer service of Norton to tone count.
     Pre-edited: T'auras juste à contacter le service à la clientèle de Norton pour votre décompte.
     MT output: You can just to contact customer service of Norton for your count.

For the first example, three phenomena are treated by pre-editing (wrong verb form est envoyer, abbreviation MP and clitic le) and the translation is greatly improved, requiring no post-editing. For the second example, only one phenomenon is treated (informal second person pronoun ton), and the translation is not much improved. While for the first sentence a large difference in post-editing effort is to be expected, for the second the difference between the two versions is minor. Both sentences are part of the PrEB group, however. Differences such as these, and their distribution among the sentences selected for the two experiments, may explain the differences both in throughput and in sentences left unedited. Finally, it cannot be excluded that our participants, who were the same for the two experiments, might have developed a tolerance for the MT output, and might have been less inclined to perform non-essential changes in the second experiment than in the first.

5.6.6 Summary

In this second experiment we have evaluated all three outcomes of pre-editing, i.e. pre-editing that improves, degrades or has no major impact on translation quality as judged by comparative evaluation. Results show that post-editing throughput is clearly related to translation quality as judged by comparative evaluation. We also observe that pre-editing that degrades translation has proportionally less impact on throughput than pre-editing that improves translation. This is corroborated by the edit distance results, where TER scores differ less for the subset of sentences where pre-editing degrades translation. The additional measure of thinking time does not reveal a clear pattern common to all three post-editors. When these thinking times are combined with editing times however, the global pattern is the same as for editing time alone, indicating the same relationship between post-editing throughput and translation quality. Overall, taking into account the proportional importance of each outcome in our data, pre-editing has a positive impact on post-editing throughput. Considering the complete scenarios, from raw source to final post-edited target, pre-editing produces no overall time gain, yet results still suggest that for an equal time investment, part of the complex bilingual post-editing work can be replaced by easier monolingual pre-editing work.

5.7 Conclusion

To assess the usefulness of pre-editing, we have investigated its impact on temporal post-editing effort. Two experiments were performed with Norton forum data on the ACCEPT post-editing portal. Our first experiment focussed exclusively on pre-editing that improves translation. Results show that in this case, pre-editing also has a positive impact on post-editing throughput, with post-editors working nearly twice as fast. Edit distance scores confirm that fewer edits are required when post-editing the translations of pre-edited input. The combination of pre-editing and post-editing seems promising, as total processing time is reduced, and some of the challenging bilingual post-editing work can be avoided through easier monolingual pre-editing work.

While in most cases pre-editing is found to improve translation, in some cases it degrades it or produces no significant improvement. We therefore performed a second experiment investigating all possible outcomes of pre-editing. Results show that pre-editing that degrades translation also adversely impacts post-editing throughput, yet to a lesser extent than the positive impact of pre-editing that improves translation. The measurement of edit distance between MT output and final translations concurs with these results, showing smaller differences in the case of pre-editing that degrades translation. For cases judged equal, no coherent impact on post-editing throughput was found, as results differ between post-editors. An estimation of the global impact of pre-editing on post-editing throughput, taking into account the distribution of the three cases among our data, shows an overall positive impact. It is however difficult to quantify this impact.

For both experiments, we observe non-negligible differences between post-editors, both in terms of absolute throughput and in terms of throughput increase/decrease. We have also observed a relatively large proportion of outliers among our data, a large share of which can be attributed to the measured thinking time. Finally, the degree of improvement of throughput is clearly different between our first and second experiments, despite identical setups and comparable data. Post-editing time could be influenced by many factors, and these observations suggest that, by itself, it cannot be used to precisely quantify the impact of pre-editing on post-editing effort. However, the additional information provided by TER scores concurs with the temporal observations, confirming the validity of the observed trends. Despite these reservations, on the whole the two experiments satisfactorily demonstrate that there is a relationship between translation quality as evaluated by comparative evaluation and post-editing temporal effort, and that globally, pre-editing contributes to reducing this effort.


6 Rule Evaluation: Portability

This chapter presents two evaluations of the portability of the pre-editing rules: to other MT systems and to other domains.

6.1 Introduction

We have seen in the previous chapters that the pre-editing rules are useful to improve translation of the Norton forums by the ACCEPT SMT system, yet this does not tell us whether the pre-editing approach is useful for MT of community content in general, i.e. for other forums or other MT systems. In this chapter, we investigate the portability of these rules. Our objective is to determine which rules are specific to our original data and system, which rules are more generally applicable and which rules would require specialisation for other domains or MT engines.

Most studies involving pre-editing focus on specific combinations of MT engines and data (e.g. Nyberg & Mitamura, 1996; Temnikova, 2010). With the exception of O'Brien & Roturier (2007), little research has focused on the portability of rules. However, many pre-editing approaches include similar rules and transformations (O'Brien, 2003), thus it does not seem unreasonable that approaches could be applicable to other texts than those they were developed for. Portability to other systems is more complicated, especially when considering statistical vs rule-based systems, as the sources of translation errors for the two approaches are largely different. They do however have some common difficulties, mainly in treating tokens or structures that are not covered by their resources, i.e. the training data for statistical systems, and the linguistic resources for rule-based systems. A pre-editing process that focuses on such issues could therefore be beneficial to both.

This chapter is organised around two main topics: portability to other systems, and portability to other domains. We begin this chapter by evaluating the impact of the rules on translation with two rule-based machine translation engines (6.2). We then continue with an analysis of portability to other community content, applying the rules to data extracted from other forums, both from the same domain and from a totally different domain (6.3). Finally, we present our conclusions regarding the portability of the different pre-editing rules developed in this study and their specialisation potential.

6.2 Portability to other MT systems (Lucy, Systran)

In Chapter 4 we have seen that many of the pre-editing rules developed within this study have a positive impact on the ACCEPT SMT system. We will now investigate whether these rules are also useful when translating forum data with a rule-based technology. To this end, we have performed both a rule by rule evaluation and a global evaluation with two commercial RBMT systems, Lucy LT and Systran.

Lucy LT, formerly called Comprendium (Alonso & Thurmair, 2003), is a commercial transfer-based system (Hutchins, 1986, chapter 12). Lucy performs translation in three phases: analysis, transfer and generation. For a given language pair, Lucy LT uses two grammars and three lexicons: a source grammar and lexicon with resources for analysis to produce a source language tree, a transfer lexicon with resources to transform the source language tree into the target language tree, and finally a target language lexicon and grammar to generate the target sentence from the target tree. The three lexicons can be specialised for a given domain, which is what we will do for the present study. In this study, we used Lucy LT Desktop Power 2.1 (translation engine) and LT Lexshop 2.7 (dictionary editor).

The second system we will use, Systran, was originally developed in the seventies using a direct translation approach (Hutchins, 1986, chapter 19). Systran has been continually improved over the years and now uses a more modular approach, integrating features of the transfer approach (Surcin et al., 2007). Similarly to Lucy, Systran can be customised for particular domains by the addition of user dictionaries (Senellart et al., 2003). More recently, Systran has been upgraded to a hybrid system. In our study however we have used the purely rule-based system. We used Systran version 7.3.5.13.

We begin this section by describing how these two systems were specialised for Norton data (6.2.1). We will then present two evaluations performed with these systems: a rule by rule evaluation (6.2.2) and a global evaluation of the entire pre-editing process (6.2.3).

6.2.1 Specialising the rule-based systems

The ACCEPT SMT system was specialised for the Norton forum data. To establish a comparable setup with the rule-based systems, we need to achieve similar domain adaptation. Since these systems are fundamentally different, it is difficult to determine what would count as a similar specialisation. As the aim of this study is not to compare SMT and RBMT, but only to compare the impact of pre-editing on these different systems, differences in specialisation should not be a major issue.

We have different resources available to specialise the linguistic systems. The ACCEPT SMT system used for the evaluations in Chapter 4 was specialised for our domain using two main resources (ACCEPT D4.1, 2012). The first was an existing Symantec translation memory containing data from Symantec user manuals and product documentation. The second resource was a set of 500 sentences extracted from the French Norton Community forums, translated into English. For the specialisation of the linguistic systems, we have used these two resources, with the addition of a Symantec terminology glossary. We will now briefly describe how these data were processed to create system dictionaries for Lucy and Systran.

6.2.1.1 Symantec glossary

We began specialisation with the bilingual Symantec glossary. This contained 5322 entries of different categories: a large share of proper nouns, some nouns, verbs and a few adjectives. Most of the entries were names of products or processes. About a quarter of these entries were untranslated, i.e. had identical surface forms in both languages. A large share of the entries were compounds consisting of two or more words.


Lucy
For Lucy, new lexical entries have to be integrated in all three lexicons, to allow analysis, transfer and generation. Entries can be imported in a specific CSV format. As the Lucy lexicon format is quite complex, manual validation of a number of entries was necessary. Compounds are a special case, for which Lucy provides two options: 1) so-called multiwords, which allow decomposition into a head and a string or variable, allowing inflection of the components, or 2) fixed expressions that do not inflect. The compounds from the Symantec list were reviewed individually and coded according to their inflection requirements. Thus when the head word was a French word (for example administrateur in the compound administrateur Enterprise Vault), compounds were imported as multiwords (thereby allowing a plural such as administrateurs Enterprise Vault).

Systran
Importing lexicon entries into Systran is slightly easier than for Lucy, as only a single bilingual lexicon is required. Systran has a specific feature for words which have to remain untranslated, which was very convenient for the numerous product names.

6.2.1.2 Symantec translation memory

The translation memory used to train the ACCEPT SMT system contained 1.6 million aligned French-English segments. In order to leverage these data for the RBMT systems, we used both systems' functionalities to identify unknown words.

Lucy
We processed the French half of the translation memory using Lucy's Create New Word List functionality, which extracts all the words or compounds unknown to the system. This resulted in a list of around 60,000 “unknown words” (this list also includes words that are in Lucy's French resources but could not be analysed correctly in the context in which they appeared; these were not considered for importation). To refine the list of words to import, we computed the frequency of each of these “unknown words” in the French translation memory data, and retained the 500 most frequent. These words were translated by looking up the English equivalents in the translation memory, and imported into the three lexicons.
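The frequency-based selection of the 500 most frequent unknown words can be sketched with a simple counter; the file names and the format of the exported word list below are hypothetical, not Lucy's actual output format.

    from collections import Counter

    # Sketch of the frequency-based selection of "unknown words" (file names
    # and word-list format are hypothetical).

    with open("lucy_unknown_words.txt", encoding="utf-8") as f:
        unknown = {line.strip() for line in f if line.strip()}

    counts = Counter()
    with open("symantec_tm.fr.txt", encoding="utf-8") as f:   # French half of the TM
        for line in f:
            counts.update(tok for tok in line.split() if tok in unknown)

    most_frequent = [word for word, _ in counts.most_common(500)]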


Systran
Apart from minor technical differences, the process for Systran was identical to that for Lucy. Systran found 90,000 “unknown words” in the data. Once sorted by frequency, the top 500 showed considerable overlap with the corresponding list produced by Lucy. These words were then translated and imported into Systran.

6.2.1.3 Forum data

The corpus of 500 representative sentences extracted from the forum data was processed similarly to the translation memory data, by means of the unknown word extraction functionalities. For Lucy, this produced a list of 331 unknown words, about two thirds of which were misspellings of existing words, incorrect product names or non-standard abbreviations, which were obviously not included for system specialisation. The remaining words were translated and imported accordingly. The same methodology was applied for Systran.

By this specialisation process, each of the systems acquired about 5,000 new lexicon entries. It should be noted that this specialisation has focused mostly on Symantec terminology and domain-specific vocabulary, and not on forum language. Indeed, with the exception of the 500 sentences of forum data, none of the resources used were specific to community content. It would of course have been possible to try to specialise the systems further, by extensive testing on forum data and adjustment of resources to handle the lexical and syntactic phenomena found on forums. This was not done, as we wished the systems to be comparable to the ACCEPT SMT system.

6.2.2 Rule by rule evaluation with Lucy and Systran

To evaluate the impact of the rules on the two RBMT systems, we chose to perform the same rule by rule evaluation as presented in Chapter 4, comparing translations of raw and pre-edited forum data. The data used are the same sentences selected for the previous rule by rule evaluation (cf. 4.4.1). Since results obtained with AMT workers were comparable to those obtained with translators (4.4.3.2), we chose to use only AMT workers for this second evaluation. The evaluation setup was identical in all respects to that of the previous evaluation. In the following sections we will analyse the results sorted by rule category. Complete results by rule are provided in Appendix D.


6.2.2.1 Grammar (agreement)

Table 6.1 shows combined results for the two agreement rules (cf. 3.3.2.1), nounPhrAgr and subjVerbAgr. Results obtained with the ACCEPT SMT system are included as reference. For Lucy, three quarters of the sentences are improved. With only half of the sentences improved, Systran seems less impacted. For both systems we observe only a low proportion (6-7%) of degraded sentences. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=69) = 234.5, p < 0.001 (Lucy) and χ2(3, N=69) = 93.5, p < 0.001 (Systran).

Table 6.1: Comparative evaluation of grammar (agreement) rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy           100       14 (14%)       6 (6%)          3 (3%)           76 (76%)          1 (1%)
Systran        100       38 (38%)       7 (7%)          3 (3%)           50 (50%)          2 (2%)
ACCEPT         100       31 (31%)      16 (16%)         8 (8%)           41 (41%)          4 (4%)
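The significance test used throughout this chapter is a one-sample chi-square goodness-of-fit test over the four judgement categories. The sketch below shows such a test with SciPy, assuming a uniform expected distribution (the default null hypothesis of scipy.stats.chisquare); the counts are placeholders, not the exact counts behind the statistics reported here.

    from scipy.stats import chisquare

    # One-sample chi-square goodness-of-fit test over the four judgement
    # categories (placeholder counts; scipy.stats.chisquare compares them
    # against a uniform expected distribution by default).

    observed = [10, 5, 80, 5]   # raw better, about the same, pre-edited better, no majority
    stat, p_value = chisquare(observed)
    print(f"chi2(3, N={sum(observed)}) = {stat:.1f}, p = {p_value:.3g}")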

For Systran, we observe a large proportion of sentences where rule application has no impact (38%), which is even higher than for the SMT system. Analysis of these sentences reveals that, similarly to what we observed for the SMT system, the main reason for the lack of impact is words that have the same translation in English, for example le and la, which are both translated by the. In the case of a less inflected target language, as we have here, wrong gender or number agreement is not an issue for MT systems relying mostly on a direct translation approach, like Systran. Lucy on the contrary has more difficulties with incorrect agreement. This is a consequence of the syntactic analysis performed in the transfer approach, which has a higher probability of being disrupted by a token that does not agree in number or gender as expected in grammatically correct French. The following example shows a sentence where the correction of le to la has no impact on either Systran or the ACCEPT SMT system, but where the translation by Lucy is clearly affected:

(75) Raw: Il me semble que je rencontre ce problème depuis le mise à jour de mon produit. → It seems to me that I meet this problem for the one updated from my product.
     Pre-edited: Il me semble que je rencontre ce problème depuis la mise à jour de mon produit. → It seems to me that I meet this problem from the updating of my product.

While the second translation is not particularly good, it is more understandable than the first.

[Figure 6.1: % of sentences improved by grammar (agreement) rules; bars for nounPhrAgr (50) and subjVerbAgr (50), for Lucy, Systran and ACCEPT]

Figure 6.1 shows the proportion of sentences improved by each rule of this category. For both linguistic systems, correcting subject-verb agreement has a slightly higher positive impact than correction of agreement within the noun phrase. These evaluation results show that correcting agreement is clearly beneficial for translation with rule-based systems, especially for ones like Lucy which rely on a transfer approach.

6.2.2.2 Grammar (mood/tense)

Table 6.2 shows the combined results for the six rules correcting mood and tense of verbs (cf. 3.3.2.2). These rules improve a large proportion of sentences for both RBMT systems, while degrading only a few. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=77) = 112.3, p < 0.001 (Lucy) and χ2(3, N=77) = 134.3, p < 0.001 (Systran).

Table 6.2: Comparative evaluation of grammar (mood/tense) rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy            87       11 (13%)      10 (11%)         3 (3%)           59 (68%)          4 (5%)
Systran         87       18 (21%)       6 (7%)          1 (1%)           61 (70%)          1 (1%)
ACCEPT          87       10 (11%)      15 (17%)         6 (7%)           55 (63%)          1 (1%)

As we can see in Figure 6.2, the positive impact of these rules is not equally distributed among rules. The useSubjunctive rule, which we have seen was not useful for the SMT system, is no better for the linguistic systems. For both Lucy and Systran, the number of sentences where this rule has no impact is very high (43% and 76% respectively). The wrongVerbForm rule has a very high positive impact on all systems, especially Lucy. The four remaining rules (avoidCond, confParticipIsUs, futureCond and useImperative) produce too few flags to clearly quantify their impact.

[Figure 6.2: % of sentences improved by grammar (mood/tense) rules; bars for avoidCond (3), confParticipIsUs (9), futureCond (2), useImperative (2), useSubjunctive (21) and wrongVerbForm (50), for Lucy, Systran and ACCEPT]

6.2.2.3 Grammar (sequences)

Table 6.3 shows the combined results for the three sequence rules (cf. 3.3.2.3). Here again, positive impact is much higher for Lucy than for the other two systems, with 83% of sentences improved. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=58) = 193.7, p < 0.001 (Lucy) and χ2(3, N=58) = 94.1, p < 0.001 (Systran).

Table 6.3: Comparative evaluation of grammar (sequence) rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy            76        6 (8%)        5 (7%)          2 (3%)           63 (83%)          0 (0%)
Systran         76       22 (29%)       8 (11%)         1 (1%)           45 (59%)          0 (0%)
ACCEPT          76       18 (24%)      11 (14%)         6 (8%)           38 (50%)          3 (4%)

[Figure 6.3: % of sentences improved by grammar (sequence) rules; bars for impossibleTagSeq (50), missingPas (9) and misspelledExpr (17), for Lucy, Systran and ACCEPT]

Figure 6.3 shows that while all three rules have a high positive impact on translation with Lucy, the misspelledExpr rule in particular improves twice as many sentences as for the other systems. A closer look at the analysis performed by Lucy explains why small changes like the ones performed by this rule can have such a large impact. For the following sentence, for example, the misspelling of nul severely disrupts Lucy's analysis, with part identified as a noun phrase, resulting in an incorrect translation:

(76) Raw: En effet, je ne sais pas quelle version j'ai installé, l'année ne figure nul part. → Indeed, I do not know what release I have set up, the year does not depict any part.
     Pre-edited: En effet, je ne sais pas quelle version j'ai installé, l'année ne figure nulle part. → Indeed, I do not know what release I have set up, the year does not appear anywhere.

Results for these sequence rules confirm that the more complex the analysis performed by an MT system, the more improvements can be achieved by correcting syntax errors.

6.2.2.4 Homophone rules

Table 6.4 shows the results obtained with both linguistic systems over the 302 examples selected for the 18 homophone rules (cf. 3.3.2.4). The homophone rules have a positive impact on all systems. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=245) = 455.2, p < 0.001 (Lucy) and χ2(3, N=245) = 240.8, p < 0.001 (Systran). Again, the highest positive impact is found for Lucy (71%). Systran on the other hand has less positive impact (54%) and a relatively high proportion of degraded sentences (21%).

Table 6.4: Comparative evaluation of homophone rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy           302        18 (6%)      22 (7%)        42 (14%)          213 (71%)          7 (2%)
Systran        302        51 (17%)     62 (21%)       18 (6%)           162 (54%)          9 (3%)
ACCEPT         302        57 (19%)     33 (11%)       26 (9%)           177 (59%)          9 (3%)

A large share of these degraded sentences can be attributed to the laLà rule, which in our examples replaces la with là. Systran does not handle là well, and in most cases translates this word literally with there, which is usually not correct. On the other hand, the incorrect la tends to simply disappear, making the raw version less incorrect.

For Lucy, we observe a lower rate of sentences where pre-editing has no impact on translation (6%) than for Systran (17%) or the SMT (19%). This is not surprising, since most of the homophones are function words indicating grammatical relationships, which are important for the complete analysis performed by Lucy. A closer look at the results per rule shows that, for Lucy, 17 of the 18 sentences where pre-editing had no impact on translation belong to the aÀ rule, and are all examples where a capitalised A at the beginning of a sentence has been pre-edited to À. We recall from Chapter 4 that the SMT translated 9 of these examples badly, often interpreting the unaccented capital A as a verb. For the rule-based systems, these cases should not present any ambiguity, since from a linguistic point of view, an A at the beginning of a sentence, unless it is part of the interrogative inversion a-t-il, must be a preposition. Consequently, adding the accent should not greatly change analysis and translation, which is confirmed by the results obtained with Lucy. Similarly, with Systran, 14 of the 17 sentences have identical translations for A and À.

Figure 6.4 shows the percentage of sentences improved by each rule, for each of the systems. Among the other rules, we observe a high percentage of sentences with no impact for the tousTout rule for Systran (54%). As in the following example, these sentences are all cases where the singular adjective tout is used instead of the plural tous, or vice versa. Since the English any or all do not inflect, the translations are identical.


[Figure 6.4: % of sentences improved by homophone rules; one bar per homophone rule (aÀ, çaSa, ceSe, ciSi, desDès, divHomophone, duDû, laLà, maM'a, niN'y, nounVerbConf, onOntSont, ouOù, qu'elleQuelle, soiSoisSoit, surSûr, t'esTes, tousTout), for Lucy, Systran and ACCEPT]


(77) Raw: En tous cas, moi je suis ravie → In any case, me I am delighted
     Pre-edited: En tout cas, moi je suis ravie → In any case, me I am delighted

In comparison, Lucy is far less tolerant of these confusions, since only one sentence out of 13 has the same translation, and the pre-editing rule has a high positive impact, with 77% of sentences improved.

One rule shows a striking divergence between systems: nounVerbConf. For the ACCEPT SMT system, it improves 40% of sentences, for Systran 65% and for Lucy 100%. Besides the proportion of improvement, we also observe a difference in the number of unchanged sentences: 25% for the baseline, 0% for Lucy. A closer look at the data leads to several observations. The SMT seems to be less sensitive to noun-verb homophones where both tokens are translated to the same token, for example:

(78) Raw: Je rappel que le problème est le même → I recall that the problem is the same
     Pre-edited: Je rappelle que le problème est le même → I recall that the problem is the same

Since both the noun rappel and the verb rappeler conjugated in the first person singular are translated by the English token recall, the translations are identical. For the transfer-based system, being confronted with a noun instead of a verb disrupts analysis, producing a bad translation.

Agreement between annotators is globally good for the homophone rules. For all three systems, only 3-4% of sentences did not lead to a majority judgement, indicating that these cases are not ambiguous. Regarding negative impact of the rules, we observe a higher percentage for Systran than for the other two systems. However, overall, the homophone rules have a significant positive impact on all three systems. The relationship between improved and degraded sentences differs between systems: for Lucy, there are more than 8 improved sentences for each degraded sentence, for the baseline 5 to 1, and for Systran barely more than 2 to 1.

6.2.2.5 Punctuation

Table 6.5 shows the combined results for the punctuation rules (cf. 3.4). The global impact is slightly higher for the linguistic systems than for the SMT, with up to 60% of sentences improved. As for the SMT, the punctuation rules have the highest proportion of sentences with no impact of all the evaluated rules. As mentioned in Chapter 4, this is not unexpected, since several of the rules produce changes that only affect spaces, which are mostly lost during the translation process as a consequence of tokenisation. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=308) = 637.1, p < 0.001 (Lucy) and χ2(3, N=308) = 588.2, p < 0.001 (Systran).

Table 6.5: Comparative evaluation of punctuation rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy           466       115 (25%)     28 (6%)         35 (8%)          278 (60%)         10 (2%)
Systran        466       128 (27%)     34 (7%)         17 (4%)          268 (58%)         19 (4%)
ACCEPT         466       158 (34%)     47 (10%)        33 (7%)          211 (45%)         17 (4%)

Figure 6.5 shows the proportion of sentences improved by individual rules for each system. We observe that the addElision rule has far more positive impact on Systran than on the other two systems. A closer look at the sentences reveals that Systran is incapable of dealing with elisions where the apostrophe has been omitted, as shown in the following example:

(79) Raw: car je suis obligé d effectuer moi même la défragmentation avec l'outil Windows → because I am obliged D to carry out to me even the defragmentation with the Windows tool
     Pre-edited: car je suis obligé d'effectuer moi même la défragmentation avec l'outil Windows → because I am obliged to carry out to me even the defragmentation with the Windows tool

For Lucy and ACCEPT on the other hand, this rule produces close to 70% of sentences with no impact on translation.

Similarly to what we observed in Chapter 4 for the SMT, avoidPunct and addComma have high proportions of sentences where no majority judgement was reached (ranging from 6% to 22%), confirming that the use of commas is subject to debate. For the SMT, the insertion or deletion of a comma produced more changes in the translation than just the insertion or deletion of a comma for about a quarter of the evaluated sentences. This is far less frequent for the RBMT systems, affecting 8% of sentences for Lucy and 13% for Systran. In the remaining cases, the only difference between raw and pre-edited is the insertion of the comma.

Overall, the punctuation rules have a positive impact on the RBMT systems, resulting in slightly more improved and slightly fewer degraded sentences than for the ACCEPT SMT system.

[Figure 6.5: % of sentences improved by punctuation rules; bars for addComma (50), addCommaPP (50), addElision (50), addHyphen (50), addSpaceBtwnWords (50), addSpacePunct (50), addSpacePunct2 (2), avoid/charg (9), avoidCapSent (50), avoidPunct (50), checkEOS/BOS (50) and wrongElision (5), for Lucy, Systran and ACCEPT]

6.2.2.6 Informal language

Results for the informal language rules (cf. 3.5) are recapitulated in Table 6.6. These rules have far less impact on the RBMT systems than on the SMT, in particular for Systran where the translation of nearly one half of the sentences is not affected by pre-editing. Yet overall, these rules have a positive impact on the RBMT systems. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3,N=249) = 183.0, p < 0.001 (Lucy) and χ2(3,N=249) = 108.6, p < 0.001 (Systran).


Table 6.6: Comparative evaluation of informal language rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy           291       114 (39%)     42 (14%)         3 (1%)          127 (44%)          5 (2%)
Systran        291       141 (48%)     45 (15%)         5 (2%)           88 (30%)         12 (4%)
ACCEPT         291        42 (14%)     64 (22%)        11 (4%)          164 (56%)         10 (3%)

[Figure 6.6: % of sentences improved by informal language rules; bars for avoidAbbrev (50), avoidAnglicism (50), avoidDirQuest (50), avoidEstCeQue (50), avoidInformal (41) and missingNe (50), for Lucy, Systran and ACCEPT]

As shown in Figure 6.6, results differ greatly between the rules. The rule that expands abbreviations (avoidAbbrev) has a very high positive impact on both Lucy and the SMT, but results in only 30% improvement for Systran. For 60% of the sentences, this rule has no impact. A closer investigation shows that in most of these sentences the expanded abbreviations are info or ordi, which Systran by default translates with the non-abbreviated information and computer, thus producing the same result as pre-editing. This mechanism could be applied to other abbreviations too: instead of expanding abbreviations through pre-editing, they could also be added directly to the systems' user dictionaries.

The two rules handling informal questions, avoidDirQuest and avoidEstCeQue, which improve over 50% of sentences for the SMT system, are of little use for the RBMT systems, since different forms of questions produce the same analysis result. For Lucy these rules have no impact on translation in, respectively, 74% and 72% of cases; for Systran 58% and 63% of sentences have identical translations.

Similarly, both RBMT systems are less affected by incomplete negations. The missingNe rule does not affect the translation of 50% of sentences for Lucy, and as many as 70% for Systran. In most of the cases, the negation is present in the translation of sources with and without ne, as shown in the following example taken from the Systran data:

(80) Raw: Hésite pas à reprendre contact avec le Support par le Chat en direct. → Do not hesitate to renew contact with the Support by the live Cat.
     Pre-edited: N'hésite pas à reprendre contact avec le Support par le Chat en direct. → Do not hesitate to renew contact with the Support by the live Cat.

These results have shown that several of the informal language issues treated by our rules are not problematic for RBMT systems. Rules correcting purely lexical issues however, such as anglicisms and colloquialisms, are equally useful for all three systems.

6.2.2.7 Simplification

The simplification rules, which perform modifications similar to those often found in controlled language rules (cf. 3.6), have a very high positive impact on translation by both linguistic systems. It should however be noted that, as discussed in 4.4.2.7, the evaluation results are strongly influenced by the presented source (raw or pre-edited).

Table 6.7: Comparative evaluation of simplification rules for three MT systems

           total sents   No impact    Raw better   About the same   Pre-edited better   No majority
Lucy           253        1 (0%)       18 (7%)          7 (3%)          217 (86%)         10 (4%)
Systran        253        0 (0%)       18 (7%)          4 (2%)          223 (88%)          8 (3%)
ACCEPT         253        5 (2%)       45 (18%)        16 (6%)          176 (70%)         11 (4%)

A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ2(3, N=248) = 511.1, p < 0.001 (Lucy) and χ2(3, N=248) = 550.6, p < 0.001 (Systran).

Figure 6.7 shows the proportion of sentences improved by individual rules. The avoidCeQui rule, which splits long sentences with a clause introduced by ce qui, performs much better for both Lucy and Systran than for the SMT. By default, Lucy translates ce qui by that and Ceci by This, thus the pre-edited version is always better, as in the following example:

(81) Raw: Il peuvent prendre ton PC en assistance, ce qui permettra de voir ce qu'a supprimé NU. → Him can to take your PC in assistance, what will allow seeing what NU has deleted.
     Pre-edited: Il peuvent prendre ton PC en assistance. Ceci permettra de voir ce qu'a supprimé NU. → Him can to take your PC in assistance. This will allow seeing what NU has deleted.

The highly positive impact on Systran is less simple to explain, as in most cases both raw and pre-edited versions produce correct translations, yet evaluators seemed to largely prefer the split sentences. As already mentioned in 4.4.2.7, the evaluation of the simplification rules is difficult. Since our evaluation methodology is focussed on comparing two translations of a given source, rather than comparing two target sentences in terms of readability, it does not provide the means of distinguishing cases where both translations are correct, but one has been successfully simplified.

[Figure 6.7: % of sentences improved by simplification rules; bars for avoidAdverb (102), avoidCeQui (16), avoidCleftSent (32), avoidConjBOS (50), avoidNon (3) and avoidPresPart (50), for Lucy, Systran and ACCEPT]

6.2.2.8 Reformulation

Unlike the simplification rules, which reformulate complex or ambiguous sequences, the reformulations rules focus specifically on sequences that were found to be problematic for the ACCEPT SMT system (cf. 3.7.2). We thus did not expect these rules to perform well for other MT systems. Table 6.8 shows the combined results for the rules of this category. The rules perform slightly better for Lucy (53% of sentences improved) than for Systran (47%), but both systems have a high proportion of sentences where the rules

205

6. RULE EVALUATION: PORTABILITY

Table 6.8: Comparative evaluation of reformulation rules for three MT systems

                     Lucy        Systran     ACCEPT
total sents          289         289         289
No impact            74 (26%)    78 (27%)    27 (9%)
Raw better           38 (13%)    58 (20%)    56 (19%)
About the same       10 (3%)     9 (3%)      16 (6%)
Pre-edited better    152 (53%)   135 (47%)   179 (62%)
No majority          15 (5%)     9 (3%)      11 (4%)

A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ²(3, N=262) = 211.7, p < 0.001 (Lucy) and χ²(3, N=262) = 172.1, p < 0.001 (Systran). Figure 6.8 shows the proportion of improved sentences for individual rules. The addDoisJe rule, which had a high positive impact on the ACCEPT SMT system, is far less positive for the rule-based systems, especially for Systran, where 72% of sentences are degraded. This rule applies to interrogative sentences following the Comment + infinitive verb phrase or Que + infinitive verb phrase structure. It inserts dois-je before the infinitive. In the case of Systran, dois-je is translated by do I have, as shown in the following example, which the judges did not prefer.

(82) Raw: comment contacter la direction ? −→ how to contact the direction?
Pre-edited: comment dois-je contacter la direction ? −→ how do I have to contact the direction?

For Lucy, these interrogative structures are often translated badly, and the inserted dois-je is systematically translated by must I, as shown in the following example:

(83) Raw: Comment résoudre ce problème ? −→ How do remediate this problem?
Pre-edited: Comment dois-je résoudre ce problème ? −→ How must I remediate this problem?

While the translation is not perfect, it is arguably better. However, the results show 19% of sentences where no majority judgement could be reached, indicating that the improvement is not clear.


The avoidExpression rule, which replaced expressions that were found to degrade translation by the ACCEPT SMT system, performs very well for both rule-based systems, improving 80% of sentences for Lucy and 74% for Systran. As in the evaluations with the SMT, these high proportions of improved sentences are mostly due to the evaluators’ tendency to prefer the translation that is closest to the presented source, in this case the pre-edited source.

The avoidGreeting rule, which replaces à toutes et tous with à tout le monde, has a surprisingly positive impact on both linguistic systems, far higher than for the SMT. A closer investigation of the individual sentences for Lucy reveals that the original expression is indeed translated badly. The replacement, although better, is not ideal, as shown in the following example:

(84) Raw: Bonjour à toutes et tous −→ Hello at all of them and everyone
Pre-edited: Bonjour à tout le monde −→ Hello to everybody

Results for Systran are similar, with the replacement resulting in an improvement, but not the ideal translation. These results suggest that such reformulation rules are interesting for different system architectures, but that replacements may have to be customised according to the system. Obviously, as this would require testing, it is only interesting for frequent phenomena.

6.2.2.9 Informal second person

As expected, the tuVous rule, which was developed specifically to compensate for the register mismatch between the forum data and the SMT training data (cf. 3.7.5), is not particularly useful for the RBMT systems. Table 6.9 shows the evaluation results for this rule. Since it replaces informal second person pronouns and verbs by their formal equivalents, which are translated identically in English, it is unsurprising that a high share of raw and pre-edited sentences have identical translations with the rule-based systems. Incidentally, this rule does present one advantage for the RBMT systems. The cases where translation was different, nearly all of which obtained a “pre-edited better” judgement, are mostly sentences with imperative verb forms, as in the following example:


Figure 6.8: % of sentences improved by reformulation rules (Lucy, Systran, ACCEPT; rules: addDoisJe (36), avoidExpression (87), avoidGreeting (19), avoidIlFautQue (10), avoidPlus (23), avoidPourQue (41), avoidQue (30), avoidSiEtQue (20), avoidSoit (7), wrongAbrev (16))

Table 6.9: Comparative evaluation of tuVous rule for three MT systems

                     Lucy       Systran    ACCEPT
total sents          50         50         50
No impact            43 (86%)   41 (82%)   20 (40%)
Raw better           1 (2%)     1 (2%)     3 (6%)
About the same       0 (0%)     1 (2%)     2 (4%)
Pre-edited better    6 (12%)    7 (14%)    24 (48%)
No majority          0 (0%)     0 (0%)     1 (2%)

(85) Raw: Essaye une réinstallation à l’aide de l’outil NRnR (Norton Remove and Reinstall). −→ Tries a reinstallation with the aid of the tool NRnR (Norton Remove and Reinstall).
Pre-edited: Essayez une réinstallation à l’aide de l’outil NRnR (Norton Remove and Reinstall). −→ Try a reinstallation with the aid of the tool NRnR (Norton Remove and Reinstall).

Imperatives are ambiguous, since they have the same surface form as some indicative forms: for instance, essaye could be either the informal second person imperative (translated by try) or the first or third person indicative (translated by try and tries respectively). Thus if essaye is analysed incorrectly, a bad translation may occur. In the case of the formal second person however, essayez can also be either indicative or
imperative, yet both have the same translation, try; thus even if the word is analysed incorrectly, the translation of this particular word will always be the correct one. Replacing informal by formal forms can thus be beneficial, even without the training data issue. Although it improves only a few sentences, the tuVous rule has far more positive than negative impact on the rule-based systems. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ²(3, N=30) = 20.9, p < 0.001 (Lucy) and χ²(3, N=30) = 18.8, p < 0.001 (Systran). In view of these results, the tuVous rule is interesting even for rule-based systems, for this particular language pair.

6.2.2.10 Clitics

The clitic rules (cf. 3.7.3), which perform reordering and replacements producing incorrect French, were developed on the basis of the bigram/trigram frequencies extracted from the forum data and the SMT training data. They are thus very specific to this system. Results for these rules are shown in Table 6.10. We observe a very high proportion of sentences degraded by these rules (70-77%) for Lucy and Systran. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ²(3, N=118) = 295.4, p < 0.001 (Lucy) and χ²(3, N=118) = 222.9, p < 0.001 (Systran). The negative impact of these rules is unsurprising, since they produce sequences that are not correct French, and can therefore not be analysed by the resources of the rule-based systems.

Table 6.10: Comparative evaluation of clitic rules for three MT systems

                     Lucy        Systran     ACCEPT
total sents          150         150         150
No impact            5 (3%)      5 (3%)      32 (21%)
Raw better           116 (77%)   105 (70%)   24 (16%)
About the same       5 (3%)      11 (7%)     11 (7%)
Pre-edited better    15 (10%)    21 (14%)    75 (50%)
No majority          9 (6%)      8 (5%)      8 (5%)

Figure 6.9 shows the proportion of sentences improved by the individual clitic rules. All rules perform badly for the rule-based systems, improving only a few sentences.


Figure 6.9: % of sentences improved by clitic rules (Lucy, Systran, ACCEPT; rules: cliticsÇa (50), cliticsCeux-ci (50), cliticsPersPron (50))

A closer analysis of the individual sentences shows that Lucy handles clitic pronouns very well. Replacing the clitic pronoun by a postposed ça, as the utilisezÇa rule does, invariably produces that instead of it, thereby not improving the translation. The same happens for Systran, as shown in the following example.

(86) Raw: Tu ne le vois plus parce que c’est mis en résolu. −→ You do not see it anymore because it is put in remediated.
Pre-edited: Tu ne vois ça plus parce que c’est mis en résolu. −→ You do not see that anymore because it is put in remediated.

The cliticsCeuxCi rule, which replaces plural clitic pronouns by ceux-ci placed after the verb, does not improve translation by the rule-based systems either. In most cases, the only change in the translation is them replaced by those, as shown in the following example:

(87) Raw: Dès que j’ai des nouvelles, je les posterai ici. −→ As soon as I have news, I will post them here.
Pre-edited: Dès que j’ai des nouvelles, je posterai ceux-ci ici. −→ As soon as I have news, I will post those here.

The last clitic rule, which replaces personal pronouns, is hardly more successful, although results are more disparate since this rule does not always perform the same replacements. Overall, the clitic rules cannot be considered useful for rule-based systems. Since for most of the examples processed in these experiments clitics are translated correctly in the first place, using pre-editing to replace them is not useful. For Systran, for example, for 128 of the 150 examples the raw version with the clitic pronoun produces a correct translation.


6.2.2.11 Reordering

Similarly to the clitic rules, the reordering rules (cf. 3.7.4) were developed specifically to improve translation by the ACCEPT SMT system. We therefore do not expect them to perform particularly well for the rule-based systems. Table 6.11 shows the combined results for the three rules. With about half of the sentences degraded, Systran does not benefit from these rules. Results for Lucy are slightly better, although nearly a quarter of the sentences are degraded. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ²(3, N=59) = 28.1, p < 0.001 (Lucy) and χ²(3, N=59) = 54.5, p < 0.001 (Systran).

Table 6.11: Comparative evaluation of reordering rules for three MT systems

                     Lucy       Systran    ACCEPT
total sents          71         71         71
No impact            24 (34%)   6 (8%)     12 (17%)
Raw better           17 (24%)   34 (48%)   20 (28%)
About the same       1 (1%)     1 (1%)     4 (6%)
Pre-edited better    25 (35%)   27 (38%)   32 (45%)
No majority          4 (6%)     3 (4%)     3 (4%)

Figure 6.10, which shows the proportion of sentences improved by individual rules, reveals great disparities between the rules of this category. The reorderTout rule, which moves the pronoun tout after the verb, has a very high positive impact on Systran, improving 91% of the sentences. By default, Systran does not perform the required inversion, as shown in the following example:

(88) Raw: [...] si tu n’as pas déjà effectué une sauvegarde tu vas tout perdre. −→ [...] if you did not already carry out a backup you all will lose.
Pre-edited: [...] si tu n’as pas déjà effectué une sauvegarde tu vas perdre tout. −→ [...] if you did not already carry out a backup you will lose all.

Of the three reordering rules, reorderTout also has the highest positive impact on Lucy, improving 44% of sentences, while degrading only 13%. reorderTout is thus the most successful of the reordering rules for all three systems. The other two reordering rules, reorderJamaisV and reorderRienInf, prove less successful for both rule-based systems.


Figure 6.10: % of sentences improved by reordering rules (Lucy, Systran, ACCEPT; rules: reorderTout (23), reorderJamaisV (8), reorderRienInf (40))

6.2.2.12 Summary

We have performed a rule by rule evaluation with two rule-based MT systems, Lucy and Systran. Results show differences in the impact of the different rule categories for the two translation approaches. The transfer-based system, Lucy, is strongly affected by phenomena that disrupt its analysis, such as syntax errors or real-word spelling errors. Thus the grammar and homophone rules have a high positive impact on this system. On the other hand, Systran, with its direct approach to translation, shows results that are closer to those of the SMT. Agreement errors or ungrammatical input are less problematic, and thus the pre-editing has less impact. The punctuation rules perform equally well for both linguistic systems, improving more sentences than for the SMT. The syntactic informal language rules, since they treat issues that are not problematic for the RBMT systems, have less impact on both rule-based systems than on the SMT. The simplification rules on the other hand have a very high positive impact on Lucy and Systran, but we must remember that, for these rules, results are strongly influenced by our evaluation methodology. Unsurprisingly, the rules developed specifically to improve translation by the ACCEPT SMT system do not perform well with the rule-based systems. Of the reformulation rules, some have a high positive impact on RBMT, while others have hardly any impact. Results suggest that the reformulation approach is interesting, but that reformulations should be customised to the systems. The clitic rules on the other hand degrade nearly three-quarters of the sentences. Results show that both RBMT systems handle clitics very well, producing mostly correct translations, thus replacing them is not useful. The reordering rules are equally unsuccessful, with the exception of one inversion which is beneficial for Systran. This suggests that, similarly to reformulation, specific cases may benefit from reordering, but this requires specialisation of the rules
for the MT system. Finally, the tuVous rule, while it is mostly ineffective, does improve some cases by removing the ambiguity of imperative verb forms. This evaluation has shown that while not all individual rules have a positive impact, a large proportion of the pre-editing process is beneficial to translation with RBMT systems. We will now proceed with a global evaluation of this process.

6.2.3 Global evaluation with Lucy and Systran

As was done for the SMT system in 4.5, we have also performed a global evaluation of the pre-editing process for both RBMT systems. For this evaluation, the three rule sets were applied in sequence, with the addition of the spelling module in the interactive step. This evaluation was performed using the same data and the same comparative evaluation setup on AMT as in 4.5. Table 6.12 shows the evaluation results. Results obtained for the ACCEPT SMT system are provided as reference.

Table 6.12: Comparative evaluation results for complete pre-editing sequence for three MT systems

                     Lucy        Systran     ACCEPT
total sents          665         665         665
No impact            139 (21%)   182 (27%)   132 (20%)
Raw better           55 (8%)     59 (9%)     58 (9%)
About the same       22 (3%)     2 (0%)      31 (5%)
Pre-edited better    431 (65%)   403 (61%)   431 (65%)
No majority          18 (3%)     19 (3%)     13 (2%)

The overall impact is highly positive, both for Systran and Lucy, with over 60% of sentences improved, against less than 10% degraded. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for both systems, χ²(3, N=526) = 915.8, p < 0.001 (Lucy) and χ²(3, N=483) = 893.9, p < 0.001 (Systran). In light of the results obtained in the rule by rule evaluation, especially the degradations introduced by the rules for the machine, these positive results might seem surprising. We must however take into account that the different rules are not applied with the same frequency, i.e. some very specific reformulations are far less common than some homophone confusions, for example.
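As an aside, the one-sample chi-square goodness-of-fit test reported throughout this chapter can be reproduced with standard tooling. The sketch below uses SciPy's chisquare function on made-up category counts; the counts and the resulting statistic are illustrative placeholders only and do not correspond to the Ns reported above, which exclude certain judgement categories from the test.

```python
# Minimal sketch of a one-sample chi-square goodness-of-fit test, as used for
# the comparative evaluations in this chapter. The observed counts below are
# illustrative placeholders, not the thesis data.
from scipy.stats import chisquare

# Hypothetical counts for the four judgement categories entering the test:
# raw better, about the same, pre-edited better, no majority.
observed = [20, 10, 200, 20]

# By default, chisquare tests against a uniform distribution over the categories.
statistic, p_value = chisquare(observed)
print(f"chi2({len(observed) - 1}, N={sum(observed)}) = {statistic:.1f}, p = {p_value:.3g}")
```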


On the whole, the pre-editing process developed in this study can be applied to improve translation by RBMT systems. An easy way to further improve results is to remove the “rules for the machine” developed specifically for the ACCEPT SMT system. After this analysis of portability to other MT systems, we will now continue with an investigation of portability to other data, in this case, other forums.

6.3 Portability to other forums

The range of topics covered by internet forums is extremely wide, ranging from technical topics like IT support (the Norton Community forums we have studied here), to all sorts of hobbies, crafts, trades, health and lifestyle issues, etc. With general internet availability and the increasing accessibility of forum interfaces, any subject that interests a community of users, for either professional or leisure purposes, will probably have at least one dedicated forum. In developing our pre-editing rules, we have focussed exclusively on the Norton Community forums. We will now investigate whether these rules can also be of use for French forum data in general. We therefore chose to apply them to different forum data by selecting two other French forums, one very close to the Norton Community forums, the other from an entirely different domain. In the following sections we begin by describing the two selected forums and the data preparation process (6.3.1). We then present the results of an evaluation of rule precision (6.3.2), and continue with an evaluation of impact on machine translation (6.3.3).

6.3.1 Forum selection

For our first forum, we decided to use a forum that would be as close to the Norton Community forums as possible, with similar user profiles and topics. We selected the French CNet forum (http://forums.cnetfrance.fr/), on which users exchange help about issues with personal computers. This forum has several boards, one of which is devoted specifically to security issues (http://forums.cnetfrance.fr/forum/51-securite/, accessed 07.2014). We therefore exported posts only from this board.
Aside from the larger diversity of product names (these forums are not specific to one particular supplier of security software), we expected this content to be reasonably close to the Norton data. This forum will be referred to as CNet.

For our second forum, we shifted to an entirely different domain: DIY. We chose a popular French DIY forum, Bricoleur du dimanche (http://www.bricoleurdudimanche.com/). The objective of the forum is the same as for the computer forums, with users writing descriptions of problems, and the remainder of the community suggesting solutions. Discussions are grouped in boards by domain, for example electricity, plumbing, heating, woodwork, etc. For our experiments, we chose to use the threads from the plumbing board (http://www.bricoleurdudimanche.com/forums/forums-bricolage/forum-plomberie.html, accessed 07.2014). Threads on this board treat topics like water pressure, blocked drains, pipes and fixtures, boilers, etc. As with the IT forums, users range from people who have no experience at all with plumbing to well-informed specialists. Accordingly, the language used is variable, ranging from simple descriptions to specific terminology and abbreviations. This forum will be referred to as Brico.

For both forums, we downloaded discussion threads using WinHTTrack Website Copier (www.httrack.com/). We extracted individual posts from the resulting HTML files, keeping only post content. The posts were then processed with a slightly adapted version of the script used for the Norton data to remove HTML markup. Finally, the data was split into individual sentences using the NLTK punkt tokenizer (http://www.nltk.org/api/nltk.tokenize.html). After removing duplicate lines, we selected 10,000 lines from each of the forums. In the following sections, we will see how these data were used to investigate two aspects of rule portability: rule precision (6.3.2) and impact on MT (6.3.3).
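A minimal sketch of this preparation pipeline is given below, assuming the posts have already been downloaded and read into memory. The regular-expression markup removal and the final sampling step are simplified assumptions; the thesis used a dedicated script adapted from the Norton data processing, with the NLTK punkt tokenizer for sentence splitting.

```python
# Sketch of the corpus preparation: strip markup, split into sentences with the
# NLTK punkt tokenizer, remove duplicate lines, and keep 10,000 lines per forum.
# The markup stripping and the random sampling are simplified assumptions.
import random
import re

import nltk

nltk.download("punkt", quiet=True)
sentence_splitter = nltk.data.load("tokenizers/punkt/french.pickle")

def strip_markup(html_text):
    """Very crude removal of HTML tags (the real script kept post content only)."""
    return re.sub(r"<[^>]+>", " ", html_text)

def prepare_corpus(posts, target_size=10000, seed=1):
    sentences = []
    for post in posts:
        text = strip_markup(post)
        sentences.extend(s.strip() for s in sentence_splitter.tokenize(text) if s.strip())
    unique = list(dict.fromkeys(sentences))   # drop duplicate lines, keep order
    rng = random.Random(seed)
    return rng.sample(unique, min(target_size, len(unique)))
```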

6.3.2 Rule precision

To evaluate precision, we used the exact same methodology as described in Section 3.9 for the ACCEPT SMT system. We used the Acrolinx Batch Checker to apply the three rule sets in sequence to suitably pre-processed corpora. We then extracted the flagged sentences from the XML report produced by the Batch Checker. For each rule, we evaluated up to 50 randomly selected examples. Complete results by rule can be found in Appendix D.
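The per-rule sampling can be illustrated as follows. The sketch assumes that the flags have already been extracted from the Batch Checker XML report into (rule name, flagged sentence) pairs; the report format itself is not reproduced here, and the example flags in the usage comment are invented.

```python
# Minimal sketch: group flags by rule and draw up to 50 random examples per rule
# for manual precision evaluation.
import random
from collections import defaultdict

def sample_flags(flags, per_rule=50, seed=42):
    """flags: iterable of (rule_name, flagged_sentence) pairs."""
    by_rule = defaultdict(list)
    for rule, sentence in flags:
        by_rule[rule].append(sentence)
    rng = random.Random(seed)
    return {
        rule: rng.sample(sentences, min(per_rule, len(sentences)))
        for rule, sentences in by_rule.items()
    }

# Hypothetical usage:
# flags = [("caSa", "il faut que sa marche"), ("tuVous", "tu peux essayer"), ...]
# samples = sample_flags(flags)
```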


Table 6.13 shows the precision results by rule category. Results for the Norton data are included as reference.

Table 6.13: Rule precision on 10,000 sentences by categories for the three forums

                    CNet                      Brico                     Norton
                    Total flags   Precision   Total flags   Precision   Total flags   Precision
agreement           728           88%         1246          84%         439           85%
verbs               281           93%         330           93%         223           96%
sequences           157           81%         247           79%         87            76%
homophones          695           83%         887           86%         517           98%
punctuation         1843          93%         2834          99%         1623          95%
informal            1035          93%         1203          87%         763           97%
simplification      417           100%        836           98%         539           100%
reformulation       309           88%         521           97%         271           100%
clitics             610           98%         744           95%         488           98%
reordering          68            100%        98            100%        33            100%
tuVous              2683          98%         1813          98%         688           92%
total               8826          92%         10759         93%         5671          96%

Examining the total number of flags for the 10,000 sentences, we have more flags on both the CNet and the Brico forums than we observed for Norton, the difference being larger for Brico, which has nearly twice as many flags as Norton. A closer look at the flags by category reveals that this higher number of flags is not distributed equally. A large share of the extra flags can be attributed to the rules correcting grammar, punctuation and informal language. This suggests that the users of this forum either take less care in writing or are less skilled than the users of the IT forums. Both CNet and Brico have far more flags for the tuVous rule than Norton. It is intriguing that these forums should have a much higher frequency of informal second person pronouns and verb forms than the Norton forum. A simple explanation would be different levels of formality within different communities on different forums. To yield better information on this question, a comparative analysis of different forums would be necessary, but this goes beyond the scope of the present study.


Regarding precision, most rule categories perform equally well for CNet and Brico as for Norton. We observe several exceptions. The homophones category has about 10% lower precision for the two new forums. For CNet, the main culprits are the çaSa, ceSe, desDès and surSûr rules (with precision ranging from 45 to 74%). All these rules rely heavily on the tagger, since they need context information to flag errors. A tagging error on an ambiguous noun/verb will invariably lead to a wrong flag. Since the tagger was not specialised for these new forums, it is possible that it performed less well, hence the errors. For Brico, the ceSe, laLà, onOntSont and surSûr rules perform less well (with precision ranging from 54 to 84%), again mostly as a consequence of tagging errors. It should be noted that the precision of the surSûr rule is low for the Norton forums too. The nounVerbConf rules also perform badly, with a precision of 46% for CNet and 54% for Brico.

Another category that presents differences between the forums is the informal category, which works less well for Brico. A closer look at the rules shows that two rules are problematic. The first is the avoidFAbbrev rule, which is part of Set 2 and asks the user to replace specific abbreviations. On both IT forums, it has a precision of 100%, while on the plumbing forum, precision goes down to 72%. The results show that the use of abbreviations differs from domain to domain. For example, on the computer forums, mm is commonly used to abbreviate même, while on the plumbing forums, it is mostly used for the more common millimètre. The avoidFAbbrev rule would thus require some domain adaptation to be truly reliable. This also confirms the choice of keeping this rule among those requiring human intervention, since abbreviations are not unique and therefore ambiguous.

The second rule that performs differently for the plumbing domain is the avoidAnglicism rule, which produces only 2 flags on 10,000 sentences, against 98 for Norton and 137 for CNet. This can be explained by several facts. First, since the anglicisms the rule looks for were extracted from Norton forum data, it is probable that they are specific to the IT domain, and therefore have little chance of being used in relation to plumbing. Another possibility is that the use of anglicisms is not so common in a more traditional trade like plumbing, which is less subject to international influence, and thus to intrusions of the English language. The two anglicisms flagged on the Brico corpus are topic and soft. While topic is a typical forum anglicism and its meaning should not be influenced by the domain, soft is another matter. As opposed to the
IT domain, where soft is commonly used as a short version of software, here it is used synonymously with gentle, describing a cleaning method. The suggestion provided by our rule (logiciel) is therefore not applicable. As for abbreviations, this is a rule that would require domain adaptation.

Finally, we also observe some interesting phenomena with the avoidInformal rule. This rule flags colloquial words or expressions. For the sake of simplicity, it flags mostly words instead of complete expressions. This presents the advantage of being less sensitive to tagger errors. For example, on the Norton forums we encountered the informal French expression être une buse (be stupid). When used on a forum where the word buse has little chance of appearing, as on the IT support forums, this word can safely be flagged as informal language. On the plumbing forums however, where buse may appear more frequently (with the meaning of nozzle/duct), it cannot systematically be considered informal.

We also observe a comparatively low score for the reformulation rules on CNet. This is caused by a bug in a single rule, avoidGreeting. On the Norton forums, it was common to encounter the short pas de quoi as a stand-alone expression, instead of the correct complete expression il n’y a pas de quoi (you are welcome). Since this was problematic for the SMT system, a rule was created to replace it automatically with vous êtes le bienvenu, which, while it is not idiomatic French, produces the correct translation. The rule however was insufficiently restricted and also replaced the pas de quoi sequence when it appeared within the il n’y a pas de quoi expression, producing errors. Correction of this bug brings the precision of this rule to 100%.

Overall, study of the rule by rule results shows that some rules have a very high precision, whatever the forum: the purely lexical rules, avoidAlors or avoidCeQui for example. Since these rules simply flag tokens and do not rely on complex patterns taking the context into account, they do not rely on information provided by the tagger, and are thereby mostly immune to the tagging errors that affect other rules.

These results show that precision for the two forums is only slightly inferior to that obtained with the Norton forum, suggesting that our rules are sufficiently robust to be used on other data. To improve results, a specialisation of the mainly lexical rules dealing with abbreviations, informal language and anglicisms would be necessary. We
will now continue with an evaluation of the impact of pre-editing on machine translation of these other forums.

6.3.3 Impact on translation

To evaluate the impact on translation, we used a subset of 1,000 sentences of the data used to evaluate precision. The data was pre-edited according to the defined sequence: automatically with rules from Set 1, manually with rules from Set 2 and with the spelling module, and automatically with rules from Set 3. For the CNet data, 697 of the 1,000 sentences underwent one or more changes through pre-editing, for Brico 751 of 1,000. Since the ACCEPT SMT system was specialised for Norton forum data, we chose not to employ it for the other forums. Instead, we translated with Google Translate (https://translate.google.com/, accessed 07-08.2014), which is probably what most forum users would do. In order to obtain comparable results, we also translated a set of 1,000 sentences from the Norton forums with this free online translator. The data used for Norton are the same as used in Section 4.5.2 for the global evaluation with the ACCEPT SMT system. After translation, we subjected the resulting sentence pairs to a comparative evaluation. As we have found that judgements collected on AMT are about as reliable as those obtained from translators (cf. 4.4.3.2), we performed this evaluation on AMT. The setup was identical in all respects to the evaluation performed in 4.5 for the Norton data, collecting three judgements on a five-point scale for each sentence pair. Results of these evaluations for the three forums are shown in Table 6.14. For reference purposes, the results obtained on Norton data with the ACCEPT SMT system are also provided.

Table 6.14: Comparative evaluation results for alternate forums

                  total sents   No impact    Raw better   About the same   Pre-edited better   No majority
CNet              697           211 (30%)    60 (9%)      16 (2%)          390 (56%)           20 (3%)
Brico             751           132 (18%)    136 (18%)    44 (6%)          408 (54%)           31 (4%)
Norton (Google)   665           131 (20%)    115 (17%)    17 (3%)          376 (57%)           26 (4%)
Norton (ACCEPT)   665           132 (20%)    58 (9%)      31 (5%)          431 (65%)           13 (2%)
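The categories reported in Table 6.14, and in the comparative tables throughout this chapter, are derived from three crowdsourced judgements per sentence pair on a five-point scale. The sketch below shows one possible aggregation into majority categories; the numeric coding of the scale and the tie-handling are assumptions made for illustration, the exact procedure being defined in Chapter 4.

```python
# Minimal sketch of aggregating three 5-point judgements into a majority
# category. The scale coding is an assumption:
#  -2 raw clearly better, -1 raw slightly better, 0 about the same,
#  +1 pre-edited slightly better, +2 pre-edited clearly better.
from collections import Counter

def categorise(judgements):
    signs = Counter("raw" if j < 0 else "pre-edited" if j > 0 else "same"
                    for j in judgements)
    category, votes = signs.most_common(1)[0]
    if votes < 2:                      # no two judges agree on a direction
        return "No majority"
    return {"raw": "Raw better", "same": "About the same",
            "pre-edited": "Pre-edited better"}[category]

# One judge prefers raw, one sees no difference, one prefers pre-edited:
print(categorise([-1, 0, 1]))          # -> "No majority"
print(categorise([1, 2, 0]))           # -> "Pre-edited better"
```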


For CNet, we observe that the complete pre-editing process has a positive impact, improving 56% of sentences, while degrading only 9%. Pre-editing also has a positive impact on the data from the Brico forum, with 54% of sentences improved. For this forum, we observe a higher proportion of degraded sentences, representing 18%. This might be due to the precision issues for this forum discussed in the previous section, which could have produced mistranslations. Results obtained for Brico are very close to those obtained with the Norton data translated by Google, with 57% of sentences improved, and 17% degraded. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant for all three forums, χ²(3, N=486) = 800.8, p < 0.001 (CNet), χ²(3, N=619) = 594.9, p < 0.001 (Brico) and χ²(3, N=534) = 631.3, p < 0.001 (Norton (Google)).

In comparison with our previous evaluations, we observe a slightly higher proportion of cases where no majority judgement was reached. This might indicate that the data was more difficult to judge. Inter-annotator agreement, computed with Fleiss' Kappa (Fleiss, 1971), is 0.4648 (moderate) for CNet, 0.3855 (fair) for Brico, and 0.4305 (moderate) for the Norton data. As discussed in Chapter 4, the specificities of our data, in particular the skewed distribution of judgements across categories, lead to artificially low kappa scores. As a complementary measure, we have again computed the Intraclass Correlation Coefficient (ICC) (McGraw & Wong, 1996). We obtain the following scores: CNet 0.801 (excellent), Brico 0.720 (good), Norton 0.808 (excellent). For CNet and Norton, agreement scores are very close to those obtained for our previous evaluations on AMT, but the Brico data produces lower agreement. A closer investigation of the 31 cases where judges do not reach an agreement for Brico sentences shows that for 20 of them, the judgement distribution is 1 raw slightly better, 1 about the same and 1 pre-edited slightly better, which suggests cases with only slight differences that are difficult to judge.

Overall, the impact of our pre-editing process is positive for both alternate forums. A relatively high proportion of degraded sentences for the Brico forum suggests that there still is room for improvement. It must be noted however that in these experiments we have not only used other data, but also a different MT system, which has a non-negligible impact on the results, as shown by the results for the Norton data translated with the same system.
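Fleiss' Kappa can be computed with statsmodels, as sketched below on a toy judgement matrix (rows are sentence pairs, columns the five answer options, cells the number of the three judges choosing each option). The matrix is invented for illustration and is not the thesis data; the ICC reported above would require a separate implementation or package.

```python
# Minimal sketch of Fleiss' kappa on a made-up judgement matrix; each row sums
# to 3 because three AMT judgements were collected per sentence pair.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

counts = np.array([
    [0, 0, 0, 2, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 3],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 3, 0],
])

print(f"Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.3f}")
```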


6.4 Conclusion

In this chapter we have investigated two aspects of rule portability by evaluating the impact of our pre-editing rules on different MT systems and on data extracted from other forums.

We performed a rule by rule evaluation with two RBMT systems specialised for our domain and data, one transfer-based (Lucy) and one using an improved direct approach to translation (Systran). This evaluation has shown that many of the pre-editing rules developed in this study are also useful for translation with RBMT systems. The two MT approaches, transfer and direct, are affected differently by errors, and thus also benefit differently from pre-editing. Rules correcting language, such as the agreement, sequence and homophones rules, have a positive impact on both, yet are clearly more useful for the transfer-based system, as they solve issues that affect analysis. The direct system, like the SMT system, is less affected by these issues. As expected, the rules developed specifically to improve translation with the ACCEPT SMT system, like the clitic and reordering rules, are detrimental to the RBMT systems, as they mostly replace structures that these systems can handle correctly. Among the other rules for the machine, the reformulation rules perform much better. Results show that the phenomena treated by these rules are also problematic for other MT engines. However, in some cases the different systems require different reformulations, thus these rules would have to be specialised.

To investigate portability to other data, we have applied our pre-editing rules to posts extracted from two other forums: a very similar IT forum treating topics from the security domain, and a forum from an entirely different domain, plumbing. An evaluation of rule precision shows results only slightly inferior to those obtained with data from the Norton forums, suggesting that our rules are sufficiently robust to be used on other data. The only rules with precision issues are those that handle domain-specific lexical aspects like abbreviations, informal language and anglicisms. To improve results, a domain specialisation of these rules would be necessary. We have also performed an evaluation of the impact of the pre-editing process as a whole on machine translation of these two forums. For this evaluation, we used the Google translation engine, since the ACCEPT SMT system was specialised for the Norton data and would not have provided comparable results on other data. Results
for the two new forums are very similar to those obtained with Norton data with this engine, suggesting that the pre-editing process can successfully be applied to improve machine translation of forum data in general. Both experiments give promising results for rule portability but also clearly show the necessity of excluding certain rules and specialising others, either to perform transformations focused on specific difficulties or shortcomings of the systems, or to treat domain-specific lexical elements. An evaluation framework like the comparative evaluations we have performed on AMT seems well suited to identify these rules. An important aspect that we have not investigated is the portability to other target languages. We have seen that some of our rules, such as the agreement rules, have little impact due to the less inflected target language. Translation to other languages would certainly reveal other aspects of the impact of pre-editing, and will be the object of future work.


7 Rule Evaluation: Usability

This chapter investigates rule usability by real forum users.

7.1 Introduction

While pre-editing community content is useful to improve machine translation, one question still remains: Can pre-editing be implemented successfully in a forum? In the experiments presented in the previous chapters, the pre-editing rules have always been applied carefully and systematically by language professionals, for the most part native speakers with a translation background. In contrast, in the real-life scenario targeted by the ACCEPT project, the pre-editing task will have to be accomplished by the community members themselves. As these pre-editors have very different profiles from those recruited for previous experiments, it cannot be excluded that there will be differences in the pre-editing activity and results, and consequently in the final impact on MT output. In this chapter we will focus on pre-editing by real users of the French Norton forum.

Pre-editing by forum users gives rise to several issues, related to both the motivation and linguistic competence of non-professional pre-editors. As these users probably have little knowledge of the functioning of an SMT engine or the consequences of pre-editing, the importance of making certain changes to the source might not be obvious to them. Furthermore, applying pre-editing rules might prove difficult for users with varied linguistic knowledge, as it can involve complex modifications. Finally, if pre-editing requires too much time or effort, users will be less inclined to complete this
step. Several approaches can be considered to resolve these issues. Probably the most obvious is to automate the pre-editing process as much as possible in order to reduce the user effort. Considering the characteristics of the pre-editing rules described in the previous chapters, this is not possible for all the rules. A number of rules, however, have a high precision and unique suggestions, and thus can be applied automatically. As seen in 3.8, these were grouped in a specific set, Set 1, while the remainder was grouped in Set 2, for interactive application. Although the precision of the Set 1 rules is high, it is not perfect, and this raises the question of whether it is preferable to have an automatic, yet not completely reliable process (due to a small proportion of wrong flags or inadequate suggestions) or a manual process which is also not completely reliable (due to user performance).

For those rules where automatic application is not possible, pre-editing can be made more accessible to users by means of a robust and user-friendly checking tool. This was implemented within the ACCEPT project by the development of a checking plug-in that can be integrated into the forum interface where content is created (ACCEPT D5.1, 2012), to allow users to check for rule compliance directly during the authoring process. It remains to be seen whether this plugin will allow forum users to achieve a pre-editing performance comparable to that of the experts.

To investigate these two aspects, we have performed, in collaboration with our project partner Symantec, a pre-editing experiment with real users of the Norton Community forums. In this experiment, we have set up two pre-editing scenarios involving the rules from Sets 1 and 2 (rules for humans, cf. 3.8), one semi-automatic and one entirely manual. The objective of this experiment is to collect two types of data, namely 1) user interactions with the pre-editing plugin, which will give us an insight into how forum users apply pre-editing rules, and 2) forum data pre-edited by real forum users, which we will compare with the data pre-edited by experts.

This chapter is organised as follows: we begin by presenting the experiments performed on the Norton Community forums (7.2), followed by a discussion of automatic vs manual rule application by forum users (7.3). We then examine how data pre-edited by forum users differs from that produced by experts (7.4).


7.2 Pre-editing in the Norton Community forums

In this section, we present the pre-editing experiments performed with users of the Norton Community forums. We will first describe the ACCEPT pre-editing plugin in more detail, then describe the participant and data selection.

7.2.1 The ACCEPT pre-editing plugin

The ACCEPT pre-editing plugin allows users to check their text and apply the rules directly in the browser window when writing a forum post (ACCEPT D5.1, 2012). This plugin uses the ACCEPT pre-editing API, which functions as a wrapper over the Acrolinx services by combining the main Acrolinx features in an independent REST (Representational State Transfer) API. By means of this API, the plugin can access the Acrolinx checking technology hosted on a server. The plugin functions similarly to other checking tools, by underlining potential errors or doubtful structures in the text. Depending on the rules, it provides different feedback when the cursor is hovered over the underlined words or phrases. In the case of a rule with suggestions, a contextual menu displays the possible replacement candidates, which the user can accept with a mouse click. In cases where the rule does not have any suggestions, the user is shown a tool-tip with the description of the error, and has to complete modifications by directly editing the text. The plugin is shown in Figure 7.1.

Besides these two main interactions, users can also choose to “learn words”, i.e. add a given token to the system so that it will not be flagged again, or “ignore rules”, i.e. completely deactivate a given rule. Both actions are stored within the user profile and remain active for all subsequent checking sessions. By means of a properties window, users can view learned words and ignored rules, which can be reverted at any time.

Since the plugin and API were developed for research purposes, besides making pre-editing available to users, they also allow analysis of pre-editing activity. To this end, the API logs user actions during the pre-editing process. The logged data includes all the flags produced by the pre-editing rules on the input text, as well as user interaction with these flags (accept suggestion, show tooltip, ignore rule, etc.). The logs can be retrieved in JSON format.
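A sketch of how such JSON logs could be summarised is shown below. The log structure assumed here (a flat list of events with "rule" and "action" fields) is a hypothetical simplification; the actual format is defined by the ACCEPT pre-editing API and is not documented in this thesis.

```python
# Minimal sketch of summarising a (hypothetical) JSON interaction log:
# count actions overall and flags per rule.
import json
from collections import Counter

def summarise_log(path):
    with open(path, encoding="utf-8") as f:
        events = json.load(f)                 # assumed: a list of event dicts
    actions = Counter(e.get("action") for e in events)
    flags_per_rule = Counter(e.get("rule") for e in events
                             if e.get("action") == "flag")
    return actions, flags_per_rule

# Hypothetical usage:
# actions, flags = summarise_log("user1_semiauto.json")
# print(actions["acceptSuggestion"], "suggestions accepted")
```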


Figure 7.1: Pre-editing plugin in forum interface

7.2.2 Participants

To recruit users willing to participate in our study, an open call for participation was made in the French-speaking Norton forum. No specific profile was required, and the only condition was to be a French native speaker. Seven users showed their willingness to participate and were contacted, but only two eventually completed all the tasks.

7.2.3 Data selection

The amount of data we could reasonably expect volunteer forum users to process being limited, we chose to create a corpus of about 2,500 words for this study. From an initial corpus of 10,000 forum posts, only posts of 250 words or less were selected to ensure that the final corpus would contain posts with a diversity of writers and topics. Given the relatively low volume of data, to maximise the number of user interactions with the pre-editing rules and thus give us more insight into the difficulties encountered with each rule category, we chose to focus on posts with a high occurrence of errors and structures to pre-edit. To this end, we processed our corpus with the Acrolinx Batch Checker to produce XML reports summarising all the errors found for each post. Based on the error counts in these reports, we selected only those posts that had at least 6 flags produced by our pre-editing rules. In this count, we did not include flags produced by the Acrolinx spelling module (cf. 3.3.1), as it functions like any other spelling checker that most users are familiar with, and the application of these flags is thus less relevant to our study. Among these posts, we performed a second selection to keep only those with the highest error/word ratio, resulting in a set of 25 posts, for a total of 2,274 words.

The selected posts were made available to participants in the forum itself. Specific forum sections were created for each participant and automatically populated with the selected posts. As a consequence of our data and user selection, users were asked to edit texts that they had not necessarily authored, which would not be the case in a real-life scenario.
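The selection procedure can be sketched as follows, assuming the flag counts per post (excluding spelling flags) have already been read from the Batch Checker reports. The 250-word limit, the 6-flag threshold and the ranking by flag/word ratio follow the description above; everything else is an assumption.

```python
# Minimal sketch of the post selection: filter by length and flag count,
# then keep the posts with the highest flags-per-word ratio.
def select_posts(posts, max_words=250, min_flags=6, target=25):
    """posts: list of (post_text, flag_count) pairs."""
    candidates = []
    for text, flags in posts:
        n_words = len(text.split())
        if 0 < n_words <= max_words and flags >= min_flags:
            candidates.append((flags / n_words, text))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:target]]
```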

7.2.4 Pre-editing scenarios

We defined two pre-editing scenarios which differ in the method of rule application. In the first, which matches the rule application we have performed in the other experiments in this thesis (cf. 4.5, 5.5.3, 6.2.3 and 6.3.3), rules from Set 1 were applied automatically. After this, the corpus was submitted to the forum users, who applied the rules from Set 2 and the spelling module interactively using the ACCEPT plugin. This scenario will be referred to as SemiAuto. In the second scenario, forum users applied both Sets 1 and 2, as well as the spelling module, interactively to the raw corpus using the ACCEPT plugin. This scenario will be referred to as AllManual. The two scenarios were performed by the same users with an intervening interval of one week. Users were provided with instructions on usage of the ACCEPT plugin (cf. Appendix E). For both setups, pre-editing activity was recorded using the ACCEPT plugin. This included recording the number and type of errors flagged by the rules and the actions performed during the process (accepted suggestions, displayed tooltips, ignored rules and words learned).

In addition to this user data, we collected the following two pre-edited data sets to serve as reference:

• Expert: Rules from Set 1 were applied automatically, followed by interactive application of the rules from Set 2 and the spelling module by a native French-speaking language professional.
7.3

Fully manual vs semi-automatic pre-editing

In this section, we wish to determine whether it is detrimental to the pre-editing result to apply part of the rules automatically in order to reduce the pre-editing effort required of the forum users.

7.3.1

Pre-editing activity

The data logged using the ACCEPT plugin provided information about the number of flags and actions performed to correct the text in both user scenarios (SemiAuto vs AllManual). Table 7.1 recapitulates the logged results. Table 7.1: Pre-editing activity for the two user scenarios

total flags flags with suggestions processed

SemiAuto

AllManual

430 333 (75%)

642 539 (84%)

(user 1)

(user 2)

(user 1)

(user 2)

347 81%

327 76%

501 78%

512 79%

As expected, users had to deal with more flags in the AllManual scenario (642) than in the SemiAuto (430) because they had to apply rules from both sets (1 and 2)
interactively. In both scenarios, users applied on average slightly less than 80% of all available flags, suggesting that the number and density of flags did not deter the users. We should mention that we did not expect users to apply 100% of the flags, since the precision of the rules is not perfect (cf. Chapter 3), and thus a number of flags are incorrect. To quantify the number of changes performed during the two pre-editing scenarios, we computed the edit distance (Levenshtein distance) between the raw and pre-edited versions. Results for both scenarios and users are shown in Table 7.2.

Table 7.2: Edit distance (words) between raw and pre-edited versions for user scenarios

                  SemiAuto                AllManual
                  user 1      user 2      user 1      user 2
automatic         141         141         -           -
interactive       326         407         465         480
raw → final       449         527         465         480

For the SemiAuto scenario, 141 tokens were changed by the automatic application of Set 1 to the raw original corpus. This scenario then required 326 more changes from user 1 when applying Set 2 manually, and 407 from user 2. In the second scenario, where there was no automatic application of rules, more changes were performed during the interactive phase: 465 tokens were changed by user 1 (+39% in comparison to SemiAuto) and 480 by user 2 (+17%). Considering the total number of changes performed in the two scenarios, i.e. the edit distance between the original raw corpus and the final pre-edited version (cf. the last line of Table 7.2), we observe on average a very similar number of changes (472 against 488), suggesting that, in the two versions, the data must have undergone similar corrections. So far, the two scenarios produce similar results, in terms of the number of applied flags and the number of changed tokens. Based on these results, we could assume that the SemiAuto scenario, which requires less effort from the forum users, is comparable to the entirely manual process. To ascertain this, we will now see how these two versions perform in terms of impact on translation.
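The word-level edit distance used here (and again in Section 7.4.1) can be computed with the standard dynamic-programming recurrence, as sketched below; tokenisation by simple whitespace splitting is a simplifying assumption.

```python
# Minimal sketch of a word-level Levenshtein distance between a raw and a
# pre-edited version of the same text.
def word_edit_distance(raw, edited):
    a, b = raw.split(), edited.split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost   # substitution
                            ))
        prev = curr
    return prev[-1]

print(word_edit_distance("essaye une reinstallation", "essayez une réinstallation"))  # 2
```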


7.3.2 Impact on translation

To evaluate the impact on translation, we have used the exact same comparative setup as for the evaluations performed in Chapter 6. Three judgements were collected for each sentence pair, using AMT judges. In a first evaluation, we compared the translation of the pre-edited versions (for each scenario and each user) against the translation of the corresponding raw sentences. In the frequent cases where two or more of the four pre-edited versions were identical, each differing translation was submitted only once to evaluation; besides reducing expenses by eliminating redundant evaluations, this increases the coherency of the results. Table 7.3 presents the results of this evaluation.

Table 7.3: Comparative evaluation results for Raw vs SemiAuto and AllManual

                      No impact   Raw better   About the same   Pre-edited better   No majority
SemiAuto   user 1     42%         7%           2%               45%                 4%
           user 2     41%         4%           1%               50%                 3%
AllManual  user 1     43%         6%           2%               47%                 3%
           user 2     44%         2%           2%               50%                 2%

For both scenarios and users, the translations of nearly half of the sentences are improved by pre-editing, with a low proportion of degraded sentences. A one-sample chi-square test of goodness of fit shows that, for each of the scenarios and users, the difference in proportions between the four categories is significant: χ²(3, N=122) = 183.2, p < 0.001 (SemiAuto user 1), χ²(3, N=103) = 182.8, p < 0.001 (SemiAuto user 2), χ²(3, N=123) = 239.9, p < 0.001 (AllManual user 1), χ²(3, N=118) = 257.7, p < 0.001 (AllManual user 2). We will see in Section 7.4.2 how these results compare to pre-editing by experts. Results for the two scenarios are very similar. It should be noted that, due to our setup, the relationship between impact on translation and rule application is not linear. Since the evaluation is sentence-based, but most of the sentences had several flags, the comparison with the raw version might still be positive even if the pre-editing was incomplete (as a consequence of users not applying all of the available flags), provided at least one flag had been applied. Therefore, to obtain a more precise indication
of how close the two versions produced in the two scenarios are, we have performed a second evaluation, comparing the versions from the two scenarios against each other. Results are shown in Table 7.4.

Table 7.4: Comparative evaluation results for SemiAuto vs AllManual

User     Identical    SemiAuto better   About the same   AllManual better   No majority
user 1   113 (72%)    13 (8%)           9 (6%)           21 (13%)           0 (0%)
user 2   108 (58%)    33 (18%)          12 (6%)          29 (16%)           3 (2%)

This evaluation reveals that for two thirds of the sentences on average, there is no difference between the two versions in terms of impact on MT, i.e. they produce identical translations. For both users, the remainder of the sentences are more or less equally distributed among the SemiAuto better and AllManual better categories, suggesting that neither version is better than the other. These results indicate that, to reduce user effort without diminishing the outcome of pre-editing, the rules from Set 1 can safely be applied automatically. We will now move on to our second question, namely how pre-editing performed by forum users compares to that performed by experts.

7.4 Users against Experts

In this section we will compare the results obtained with the forum users with those from the Expert and Oracle versions.

7.4.1 Pre-editing activity

Since the Expert version was not pre-edited using the same interface as the user versions, we have no flag application statistics for this version. To compare pre-editing activity, we have therefore computed the Levenshtein distance between the raw and pre-edited versions. Results are shown in Table 7.5. For the users, we include results for the SemiAuto scenario only, since this is directly comparable with the Expert scenario (automatic application of Set 1 followed by interactive application of Set 2).


Table 7.5: Edit distance between raw and pre-edited versions for all scenarios

                 User (user 1)   User (user 2)   Expert   Oracle
Edit distance    449             527             582      694

Overall, the users performed fewer edits than the Expert (-22% for user 1, -9% for user 2). This concurs with the pre-editing activity reported in 7.3, where we saw that users applied about 80% of available flags. Our last pre-editing scenario, Oracle, shows an additional 112 words changed (694 against 582) compared to the Expert version. This suggests that our pre-editing rules already correct a large share of the errors in the text.

We have mentioned in 7.2 that the plugin gives users the option of rejecting flags, either by adding tokens to a user dictionary, or by deactivating rules entirely. We suspect the distinction between these two options might not have been entirely clear to the users, since they made some odd choices, for example deciding to ignore the rule that corrects subject-verb agreement, or adding words like present participles to the dictionary. We have therefore chosen to consider both options simply as “rejected corrections”. Table 7.6 shows the number of rejected flags for both users and scenarios.

Table 7.6: Flags rejected by forum users

                 SemiAuto              AllManual
                 user 1    user 2      user 1    user 2      Total
style            0         8           4         5           17
spelling         2         3           4         5           14
anglicisms       2         4           3         4           13
grammar          1         1           8         1           11
casing           1         2           2         2           7
abbreviations    0         3           0         3           6
punctuation      0         1           1         1           3
total            6         22          22        21          71

The most frequently rejected flags are those for style rules (17), such as
avoidConjBOS and avoidPresPart, probably because the rationale behind these suggestions was not obvious. Corrections of spelling (non-word errors) and anglicisms were also rejected often. It should be noted that the ignored spelling flags include flags of abbreviations, anglicisms and proper nouns, all very common in a technical forum, and not always incorrect. Three of these flags were also rejected by the Expert. Words such as “boot”, “Trojan” or “software” are frequent in French techie speak, and users might not see the point of replacing them with less common French equivalents. We also observe that a number of grammar flags were rejected, for which we have no explanation; a manipulation error cannot be excluded. We will now see how these different pre-editing levels affect translation.

7.4.2 Impact on translation

In a first evaluation, we compared the translations of the Expert and Oracle versions with those of the raw versions. Results are shown in Table 7.7. Again, we include only results for the SemiAuto user scenario, since these are comparable.

Table 7.7: Comparative evaluation results for Raw vs Expert and Raw vs Oracle

          No impact   Raw better   About the same   Pre-edited better   No majority
Oracle    29%         6%           2%               60%                 3%
Expert    32%         6%           4%               52%                 5%
user 1    42%         7%           2%               45%                 4%
user 2    41%         4%           1%               50%                 3%

The results obtained with the Expert version concur with those obtained in Chapter 4, showing that correct application of the pre-editing rules has a significant positive impact on translation quality of this data set. For 52% of the sentences, the translation of the pre-edited version is better, while the translation is degraded for only 6% of the sentences. A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant, χ²(3, N=127) = 184.5, p < 0.001. The Oracle version produces only slightly better results than the Expert version (60% of sentences improved). A one-sample chi-square test of goodness of fit shows that the difference in proportions between the four categories is significant, χ²(3, N=133) = 268.3, p < 0.001.
268.3, p < 0.001. This suggests that our pre-editing rules, in their current state, can produce high-quality results not far from those obtained with the Oracle. In comparison with these two versions, the user versions perform well, improving slightly less sentences than the Expert version. The main difference between the Expert and user versions is a higher proportion of sentences with no impact for the user version. Again, this concurs with the slightly lower flag application. These results do not tell us if pre-editing by the users produced as good a result as pre-editing by the Expert. It cannot be excluded that, while they were judged as better than the Raw version, some of the improved sentences are still of lesser quality than the Expert version. For this reason, we decided to compare the user versions against the Expert version. Results are shown in Table 7.8. Table 7.8: Comparative evaluation results for User against Expert

                 user 1   user 2
Identical           65%      60%
User better          5%      13%
About the same       2%       4%
Expert better       25%      19%
No majority          3%       3%

For both users, pre-editing by the user and pre-editing by the Expert produced identical translations for more than half of the sentences (60-65%). For the remainder of the sentences, results differ between the two users: for user 1, the ratio of Expert better to User better is 5:1, while for user 2 it is 3:2. This suggests that there might be differences between users; with only two users, however, we have insufficient data to investigate this question in detail. When considering these results, we must keep in mind that the forum users were allowed to discard flags or rules they did not consider necessary, while the expert pre-editor had instructions to systematically apply all the rules.
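The one-sample chi-square tests of goodness of fit reported in this section are straightforward to reproduce with standard statistical tooling. The following minimal sketch uses SciPy on invented judgement counts (not the actual evaluation data) to illustrate the computation over four outcome categories.

```python
# Minimal sketch of a one-sample chi-square goodness-of-fit test, as used above.
# The counts below are invented for illustration; they are not the thesis data.
from scipy.stats import chisquare

# Hypothetical sentence-level judgements for one pre-editing scenario:
# [raw better, about the same, pre-edited better, no majority]
observed = [8, 5, 66, 6]

# Under the null hypothesis, the four categories are equally likely.
statistic, p_value = chisquare(observed)

print(f"chi2({len(observed) - 1}, N={sum(observed)}) = {statistic:.1f}, p = {p_value:.3g}")
```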

7.5

Conclusion

We have performed an experiment in the Norton Community forums where our pre-editing rules were applied by two forum users. Two aspects of rule application were investigated.

In a first stage, we compared two different pre-editing scenarios: the first matching the pre-editing process used in other experiments of this study, combining automatic and interactive rule application; the second entirely interactive. Both scenarios produce similar results in terms of edit distance between the raw and pre-edited versions and in terms of impact on machine translation. It is thus possible to reduce the pre-editing effort required of users by applying some rules automatically, without detrimentally affecting the results. In the forum, users had the possibility of ignoring flags. Analysis of the logged rejections has shown that neither user was particularly inclined to replace anglicisms, abbreviations or misspelt proper nouns. Rules correcting style, such as avoiding conjunctions at the beginning of sentences, were also frequently ignored. These seem to be cases where the rules were deliberately not applied because their usefulness was not obvious, as opposed to cases where the rules were difficult to apply. This aspect of rule application deserves further investigation with more forum users.

In a second stage, we compared the pre-editing performance of forum users with that of experts, i.e. language professionals as employed in our previous experiments to perform the pre-editing task. Results show that forum users perform slightly fewer edits than experts, and that the resulting impact on machine translation is slightly lower. Most of the difference is probably due to a lack of awareness of the importance of the rules, rather than to an inability to apply them. Overall, the users perform well, indicating that our pre-editing process is accessible to forum users.


8

Conclusion

This chapter concludes the thesis by summarising achievements, discussing limitations and presenting future work.

8.1

Achievements

The key achievements of this study are the following:

• For the majority of the phenomena studied, it is possible to write pre-editing rules that have high enough precision to be useful, using only regular expressions, morphological analysis and part-of-speech tagging (a minimal sketch of such a rule follows this list).

• The impact of these rules can be evaluated quickly and objectively using crowdsourcing methods.

• Improved sentences according to this evaluation methodology coincide with reduced post-editing effort.

• Pre-editing can successfully be carried out by forum users, who perform only slightly less well than experts.

• Despite inconclusive improvements in BLEU scores, pre-editing has a high impact on post-editing effort.

• Rules developed for a specific domain and MT system are easily portable.
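As an illustration of the first point above, the following minimal sketch shows the general shape of a lightweight, regular-expression-based pre-editing rule with an interactive (flagging) mode and an automatic (replacing) mode. The rule names, patterns and replacements are invented for illustration; the actual rules in this work are written in the Acrolinx declarative formalism and also draw on morphological analysis and part-of-speech tagging.

```python
import re

# Illustrative regex-based pre-editing rules for French forum text.
# Patterns, names and replacements are invented examples; they are not the
# actual rules developed in this work.
RULES = {
    "abbreviation_pb":        (re.compile(r"\bpb\b", re.IGNORECASE), "problème"),
    "abbreviation_bcp":       (re.compile(r"\bbcp\b", re.IGNORECASE), "beaucoup"),
    "missing_apostrophe_jai": (re.compile(r"\bjai\b", re.IGNORECASE), "j'ai"),
}

def flag(sentence):
    """Interactive mode: return (rule name, matched text, suggestion) triples."""
    flags = []
    for name, (pattern, replacement) in RULES.items():
        for match in pattern.finditer(sentence):
            flags.append((name, match.group(0), replacement))
    return flags

def apply_all(sentence):
    """Automatic mode: accept every suggestion without user interaction."""
    for pattern, replacement in RULES.values():
        sentence = pattern.sub(replacement, sentence)
    return sentence

if __name__ == "__main__":
    example = "jai un pb avec bcp de fichiers"
    print(flag(example))
    print(apply_all(example))  # j'ai un problème avec beaucoup de fichiers
```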


Pre-editing rules for forum data

Taken as a whole, the pre-editing process defined in this work has a significant positive impact on machine translation with the ACCEPT SMT, as measured by a sentence-level comparative human evaluation. Two-thirds of the sentences are improved by pre-editing, while less than 10% are degraded. The impact is not equally distributed among the different types of rules. Correcting spelling and grammar issues was found to have a significant positive impact on translation, with, for example, the correction of errors such as homophone function words and wrong verb forms having more impact than the correction of agreement. Normalising informal language, with its abbreviations and out-of-vocabulary items, was found to clearly improve translation. Improving informal style had less impact on translation for the language pair studied here, as some structures are acceptable in both languages. Simplification rules, akin to controlled language, also have a positive impact on SMT. The reformulation rules designed specifically to improve translation of forum data by the ACCEPT SMT, on the basis of the known mismatch between training and test data, have a high positive impact. Finally, we have also investigated reordering rules and rules replacing clitics, which were found to improve a high proportion of cases, yet also to degrade a non-negligible proportion of sentences.

Declarative rule formalism

Overall, the declarative formalism used by Acrolinx has proven equal to the task of developing different types of rules for forum data. It was possible to write high-precision rules for the majority of the phenomena studied, confirming that a checking approach based on part-of-speech tagging, morphological analysis and regular expressions is suited to the task of pre-editing forum data.

Evaluating impact on translation

To evaluate the impact of pre-editing, we have developed a comparative human evaluation approach to quickly and reliably distinguish between improved and degraded translations without the necessity of creating large numbers of reference translations. The evaluations were performed at the sentence level, where translations of raw and pre-edited input were ranked against each other, in relation to the source, to measure accuracy. This approach was tested with two groups of evaluators, translators and AMT workers, and was found to produce consistent results: judgements obtained from AMT workers showed a high correlation with those obtained from translators, and both groups presented similar inter-annotator agreement scores. Thus, at least in this context and for the considered language pair, AMT judges can be used instead of translators without substantially reducing the validity of the results obtained. A sentence-level evaluation of the same data with several common automatic metrics was inconclusive, revealing no impact of pre-editing and showing no correlation with the human evaluation. Closer analysis has revealed that, in our particular case, the nature of the reference translations used may have played an important role.
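Inter-annotator agreement scores of the kind mentioned above can be computed with standard libraries. The sketch below shows pairwise Cohen's kappa on invented sentence-level judgements; it is an illustration only, not the exact agreement computation reported in this work.

```python
# Minimal sketch: pairwise Cohen's kappa between two annotators.
# The label sequences below are invented for illustration, not evaluation data.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pre-edited better", "raw better", "pre-edited better",
               "about the same", "pre-edited better", "raw better"]
annotator_b = ["pre-edited better", "about the same", "pre-edited better",
               "about the same", "pre-edited better", "raw better"]

print(f"Cohen's kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```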

Impact on post-editing effort

Besides improving machine translation, the pre-editing process also has a positive impact on the post-editing process. Specifically, pre-editing that improves translation is found to increase post-editing speed. The measurement of technical effort by means of edit distance scores concurs with these findings, showing that fewer edits are required in the case of pre-edited input. On the other hand, pre-editing that degrades translation adversely impacts post-editing throughput, yet to a lesser extent than the positive impact observed for pre-editing that improves translation. Again, edit distance scores concur with these results. For cases judged equal, no coherent impact on post-editing throughput was found, as results differ between post-editors. These results show that the human comparative evaluation methodology used in this study is a good indicator for identifying rules which reduce post-editing effort. An estimation of the global impact of pre-editing on post-editing throughput, obtained by combining the results for the three possible outcomes of pre-editing according to their frequency in the data, has shown an overall reduction of post-editing time. Taking into account the time invested in pre-editing, the bottom line is that the total processing time from raw input to final post-edited translation is neither increased nor reduced when the pre-editing step is included. One can reasonably argue that this is a win, since a substantial proportion of the challenging bilingual post-editing work is transformed into much easier monolingual pre-editing work.
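The technical effort mentioned here is typically approximated by an edit distance between the machine translation output and its post-edited version. The following sketch implements a generic word-level Levenshtein distance as an illustration of the idea; it is not the specific edit distance score used in the experiments.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Word-level Levenshtein distance: insertions, deletions and substitutions
    needed to turn the MT output (hypothesis) into the post-edited text (reference)."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = distance between hyp[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(hyp)][len(ref)]

if __name__ == "__main__":
    mt_output = "the computer restart itself every day"
    post_edited = "the computer restarts itself every day"
    print(word_edit_distance(mt_output, post_edited))  # 1 substitution
```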

Pre-editing accessibility

A pre-editing experiment in the Norton Community forums has demonstrated that the pre-editing process is accessible to forum users. Results show that by applying some rules automatically, it is possible to reduce the pre-editing effort required of users without adversely affecting the outcome. A comparison of the pre-editing performance of forum users with that of experts (language professionals as employed in the other experiments described in this thesis) suggests that forum users perform slightly fewer edits than experts, and that the resulting positive impact on machine translation is slightly lower. Forum users were not particularly inclined to replace anglicisms, abbreviations or misspelt proper nouns, and also frequently ignored style rules. These appear to be cases where the rules were deliberately not applied because their usefulness was not obvious, as opposed to cases where the rules were difficult to apply.

Portability to other data

Rule precision on two other forums (a very similar IT forum treating topics from the security domain, and a forum from an entirely different domain, plumbing) is only slightly lower than on the Norton forums, suggesting that the rules are sufficiently robust to handle other data. Exceptions are the rules that handle domain-specific lexical aspects such as abbreviations, informal language and anglicisms, which require domain specialisation. Regarding the impact of the pre-editing process as a whole on machine translation of these two forums, results were very similar to those obtained with Norton Community data, suggesting that the pre-editing process can successfully be applied to improve machine translation of forum data in general.

Portability to other systems

Many of the pre-editing rules developed in this study are also useful for translation with RBMT systems. The two MT approaches investigated, transfer and direct, are affected differently by errors. Rules correcting language have a positive impact on both systems, yet are clearly more useful for the transfer-based systems, as they solve issues that affect analysis. The direct system, like the SMT system, is less affected by these issues. As expected, most of the rules developed specifically to improve translation with the ACCEPT SMT are detrimental to the RBMT systems, as they replace structures that these systems can handle correctly. However, a number of these rules could prove useful, as some of the treated phenomena are also problematic for other MT engines, but the replacements provided by these rules would have to be specialised according to the MT system.


8.2

Limitations

The wide-ranging approach to pre-editing adopted in this study, exploring many different types of rules, has limited the attention given to individual rule types; thus the developed rules can by no means be considered a complete and final set. Indeed, the amount of post-editing necessary to transform the MT output of pre-edited data into a correct final translation suggests that the pre-editing rules do not address all the issues that degrade translation.

While the comparative evaluation methodology used throughout this work was efficient for most of the rules, evaluating the impact of the simplification rules has proven difficult. For these rules, translation was often not improved in the traditional sense, but simplifications of the source were successfully passed on to the target. An evaluation based on specific criteria, for example readability, as performed in Porro et al. (2014) for post-editing rules, would seem more appropriate for this type of transformation.

Two issues cause the results obtained for post-editing effort to be merely indicative. For practical reasons, the post-editing task was performed by translators and not by forum users; additional experiments would be necessary to confirm that the observed trends also apply to forum users. Furthermore, measuring post-editing time has proven problematic, as evidenced on the one hand by the relatively large proportion of outliers among the data, and on the other hand by the similar yet unequal results obtained in the two experiments. This suggests that time measurement is affected by many factors and is not, by itself, a sufficient indicator of post-editing effort.

Probably the most important limitation of our work is the use of a single target language. The main reason for this was the absence of a comparable SMT system for other target languages, as this was not considered in the ACCEPT project. We have seen that some of the rules, such as the agreement rules, have little impact due to the less inflected target language. Translation into other languages would certainly reveal other aspects of the impact of pre-editing, and will be the object of future work.

Finally, the usability of the rules by forum users has not been addressed extensively in this study. While results are promising in terms of the accessibility of the pre-editing process, the question of user participation in a real-life scenario remains largely unexplored. In particular, the issue of conveying the importance of pre-editing to forum users who have other priorities is worthy of analysis.


8.3

Future work

Important directions for future work are the improvement of the pre-editing rules, which could be tested against the baseline provided in this work, as well as the investigation of new rule development methodologies. In our evaluation of rule precision and recall, we have seen that a number of issues cannot be addressed due to the word-level approach necessitated by the rule formalism. Being able to describe phrases, for example noun phrases in the case of agreement rules, might allow more efficient rule development and improve coverage. This raises several questions, in particular how this could be implemented in a tool like Acrolinx, and whether it would in fact improve the performance of the rules.

Building on the bigram/trigram extraction method used in this work to quickly identify sequences with low translatability, an interesting extension would be to extract sequences based on linguistic annotations rather than surface forms (a minimal sketch of the surface-based variant is given below). This could be performed with the same formalism as that used for rule development. By this means we could, for example, identify unusual sequences specific to informal language.

Another question worthy of further investigation is the implications of sequential rule application. For practical reasons, in this study we have split the pre-editing rules into three sets to be applied in sequence. A side effect of this sequential rule application was that the correction of certain issues by the first set of rules improved the precision or recall of subsequent sets. The potential of sequential rule application has only been superficially considered in this study, but it would be interesting to take this aspect into account for further rule development and to investigate to what extent it can improve both recall and precision.

Certain rules produce a non-negligible number of degraded sentences. In our current development methodology, negative evidences were added manually to rule out these cases based on evaluated data. As these evidences are mostly simple lexical constraints on contexts, an interesting development would be to see how they could be generated automatically on the basis of negative examples. To ensure that these generated evidences do not cancel valid flags, some form of regression testing would have to be implemented.

We have seen that, in terms of language correction, the pre-editing rules perform comparably to existing spelling and grammar checking systems. As the correction of language issues represents an important share of the transformations performed by the pre-editing rules, this raises the question of how pre-editing with these existing tools would perform in terms of translation improvement. Lastly, as a result of the post-editing experiments, we now have a corpus of post-edited data which could be leveraged for the identification of MT errors, which could in turn be used to develop further pre-editing rules.

Another direction for future work is the investigation of stronger integration with the SMT system. So far we have considered pre-editing as a stand-alone process. A first step would be to see whether translation can be improved by applying the pre-editing rules to the training data, as is done in most cases when pre-editing is associated with SMT. This would obviously include only the rules performing transformations, such as the simplification, reordering or clitics rules, and not the rules correcting language issues. For a number of our rules, automatic application is not possible due to ambiguous cases where they cannot provide unique suggestions. Instead of having pre-editors perform the selection in an interactive checking process, a possibility would be to pass the multiple replacement candidates on to the SMT system and use its resources to choose the appropriate replacement. This approach was investigated using a weighted graph model in a study focussing on homophone confusions (Bouillon et al., 2013), where it was found to perform comparably to the purely rule-based approach. This hybrid approach could be applied to other phenomena, for example the issue of clitic personal pronouns, where we have seen that replacements must be selected on a case-by-case basis, which is difficult to describe comprehensively in a rule.

Finally, in this study we have focussed on pre-editing to improve translation. However, many of the transformations performed in this context also improve the readability of the source, e.g. the simplification rules. This raises a number of questions to be answered by future work, notably, as an extension of the work presented here, the effect that source improvement could have on bilingual post-editing effort.
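The following minimal sketch illustrates the surface-level version of the n-gram extraction idea mentioned above: counting frequent word bigrams over a handful of invented forum-style sentences. It is an illustration only; the extraction method used in this work targets sequences with low translatability, and the proposed extension would operate over linguistic annotations rather than surface forms.

```python
from collections import Counter

def frequent_bigrams(sentences, top_n=5):
    """Count surface word bigrams; frequent ones are candidates for inspection
    (e.g. as potentially hard-to-translate sequences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(top_n)

if __name__ == "__main__":
    # Invented forum-style sentences, for illustration only.
    corpus = [
        "mon pc ne démarre plus depuis la mise à jour",
        "le pc ne démarre plus après la mise à jour",
        "la mise à jour a planté mon pc",
    ]
    for bigram, count in frequent_bigrams(corpus):
        print(count, " ".join(bigram))
```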

Developing an effective and feasible pre-editing process to improve machine translation of forum data is a complex task, involving multiple parameters. In this study we have shown that pre-editing can produce a significant improvement of machine translation output, thereby reducing post-editing effort, while being accessible to forum users. While the details of the pre-editing approach that we have discussed apply specifically to French Norton Community forums translated with the ACCEPT SMT, many aspects could be generalised to other forums, MT systems and languages.


References

AST (2000). Guide du Typographe romand. Groupe de Lausanne de l’Association suisse des typographes, Heliographia. 57

Aw, A., Zhang, M., Xiao, J. & Su, J. (2006). A phrase-based statistical model for sms text normalization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL ’06), 33–40, Sydney, Australia. 16

ACCEPT D4.1 (2012). Baseline machine translation systems. Public deliverable, The ACCEPT Project (FP7/2007-2013) grant agreement n° 288769. 115, 191

Aziz, W., Castilho, S. & Specia, L. (2012). PET: a tool for post-editing and assessing machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 3982–3987, Istanbul, Turkey. 161

Baker, K.L., Franz, A.M., Franz, E.M., Jordan, P.W., Mitamura, T. & Nyberg, E.H. (1994). Coping with ambiguity in a large-scale machine translation system. In Proceedings of the 15th International Conference on Computational Linguistics (COLING ’94), 90–94, Kyoto, Japan. 20

ACCEPT D5.1 (2012). Browser-based client prototype used to access Acrolinx IQ server. Restricted deliverable, The ACCEPT Project (FP7/2007-2013) grant agreement n° 288769. 27, 224, 225

Baldwin, T., Cook, P., Lui, M., MacKinlay, A. & Wang, L. (2013). How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 356–364, Nagoya, Japan. 30

ACCEPT D9.2.2 (2013). Survey of evaluation results – version 1. Public deliverable, The ACCEPT Project (FP7/2007-2013) grant agreement n° 288769. 112

ACCEPT D9.2.4 (2014). Survey of evaluation results – version 2. Public deliverable, The ACCEPT Project (FP7/2007-2013) grant agreement n° 288769. 7

Banerjee, P., Naskar, S.K., Roturier, J., Way, A. & van Genabith, J. (2011). Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In Proceedings of MT Summit XIII , 285– 292, Xiamen, China. 2

AECMA (1995). AECMA Simplified English – a guide for the preparation of aircraft maintenance documentation in the international aerospace maintenance language. 20

Banerjee, P., Naskar, S.K., Roturier, J., Way, A. & van Genabith, J. (2012). Domain adaptation in SMT of usergenerated forum content guided by OOV word reduction: Normalization and/or supplementary data? In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT), 169–176, Trento, Italy. 2, 14, 16, 68

Aikawa, T., Schwartz, L., King, R., Corston-Oliver, M., & Lozano, C. (2007). Impact of controlled language on translation quality and post-editing in a statistical machine translation environment. In Proceedings of MT Summit XI , 10–14, Copenhagen, Denmark. 3, 19, 21, 100, 160, 161, 163, 174

Beaufort, R., Roekhaut, S., Cougnon, L.A. & Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing sms messages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL ’10), 770–779, Uppsala, Sweden. 16

Alabau, V., Bonk, R., Buck, C., Carl, M., Casacuberta, F., García-Martínez, M., González, J., Koehn, P., Leiva, L., Mesa-Lao, B., Ortiz, D., Saint-Amand, H., Sanchis, G. & Tsoukala, C. (2013). Casmacat: An open source workbench for advanced computer aided translation. In The Prague Bulletin of Mathematical Linguistics, 101–112. 161

Beigman Klebanov, B., Knight, K. & Marcu, D. (2004). Text simplification for information-seeking applications. In On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE , vol. 3290 of Lecture Notes in Computer Science, 735–747, Springer Berlin Heidelberg. 21

Alonso, J. & Thurmair, G. (2003). The Comprendium translator system. In Proceedings of MT Summit IX , New Orleans, USA. 190

Berger, A.L., Pietra, V.J.D. & Pietra, S.A.D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71. 23

Aluísio, S.M., Specia, L., Pardo, T.A., Maziero, E.G. & Fortes, R.P. (2008). Towards Brazilian Portuguese automatic text simplification systems. In Proceedings of the Eighth ACM Symposium on Document Engineering (DocEng ’08), 240–248, New York, NY, USA. 21

Arnold, D., Balkan, L., Humphreys, R.L., Meijer, S. & Sadler, L. (1994). Machine translation: An introductory guide. NEC Blackwell. 100

Bernth, A. (1998). EasyEnglish: Addressing structural ambiguity. In D. Farwell, L. Gerber & E. Hovy, eds., Machine Translation and the Information Soup, vol. 1529 of Lecture Notes in Computer Science, 164–173, Springer Berlin Heidelberg. 2, 11, 19

Artstein, R. & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34, 555–596. 143

Bernth, A. (1999). Controlling input and output of MT for greater acceptance. In Proceedings of the 21st ASLIB Conference, Translating and the Computer , London. 22


Bernth, A. & Gdaniec, C. (2001). MTranslatability. Machine Translation, 16, 175–218. 2, 9, 18, 20, 133

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C. & Schroeder, J. (2007). (meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT ’07), 136–158, Prague, Czech Republic. 100

Blain, F., Senellart, J., Schwenk, H., Plitt, M. & Roturier, J. (2011). Qualitative analysis of post-editing for high quality machine translation. In Proceedings of MT Summit XIII , 164–171, Xiamen, China. 162

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C. & Schroeder, J. (2008). Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation (StatMT ’08), 70–106, Columbus, Ohio. 145

Blair, C. (1960). A program for correcting spelling errors. Information and Control, 60–67. 12

Blanchon, H., Boitet, C., Huynh, C.P. et al. (2009). A web service enabling gradable post-edition of pre-translations produced by existing translation tools: Practical use to provide high-quality translation of an online encyclopedia. In Proceedings of the MT Summit XII Beyond Translation Memories: New Tools for Translators Workshop, 20–27, Ottawa, Ontario, Canada. 161

Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R. & Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation (StatMT ’12), 10–51, Montréal, Canada. 151

Campbell, S. (2000). Choice network analysis in translation research. Research Models in Translation Studies I. Textual and Cognitive Aspects, 29–42. 162

Bouillon, P., Gerlach, J., Germann, U., Haddow, B. & Rayner, E. (2013). Two approaches to correcting homophone confusions in a hybrid machine translation system. In Proceedings of Second Workshop on Hybrid Approaches to Translation (HyTra), Sofia, Bulgaria. 7, 243

Candido, A., Jr., Maziero, E., Gasperin, C., Pardo, T.A.S., Specia, L. & Aluisio, S.M. (2009). Supporting the adaptation of texts for poor literacy readers: A text simplification editor for Brazilian Portuguese. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications (EdAppsNLP ’09), 34–42, Boulder, Colorado. 21, 22

Bouillon, P., Gaspar, L., Gerlach, J., Porro, V. & Roturier, J. (2014). Pre-editing by forum users: a case study. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland. 7

Carbonell, J.G. & Hayes, P.J. (1983). Recovery strategies for parsing extragrammatical language. Computational Linguistics, 9, 123–146. 18

Brants, T. (2000). Tnt: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLC ’00), 224–231, Seattle, Washington. 25, 38

Carl, M., Dragsted, B., Elming, J., Hardt, D. & Jakobsen, A.L. (2011). The process of post-editing: a pilot study. In Proceedings of the 8th international NLPSC workshop. Special theme: Human-machine interaction in translation, 131–142, Copenhagen Business School. 162

Bredenkamp, A., Crysmann, B. & Petrea, M. (2000). Looking for errors: A declarative formalism for resource-adaptive language checking. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece. 3, 18, 25, 26

Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22, 249–254. 143, 144

Brill, E. & Moore, R.C. (2000). An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL ’00), 286–293, Hong Kong. 14

Carrera, J., Beregovaya, O. & Yanishevsky, A. (2009). Machine translation for cross-language social media. http://www.promt.com/company/technology/pdf/machine translation for cross language social media.pdf. 1, 18

Brown, P., Della Pietra, S.A., Della Pietra, V.J., Lafferty, J. & Mercer, R. (1992). Analysis, statistical transfer, and synthesis in machine translation. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, 83–100. 23, 24

Carroll, J., Minnen, G., Canning, Y., Devlin, S. & Tait, J. (1998). Practical simplification of English newspaper text to assist aphasic readers. In Proceedings of the AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, 7–10, Madison, Wisconsin. 21, 22

Byrt, T., Bishop, J. & Carlin, J.B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46, 423 – 429. 143

Chandrasekar, R., Doran, C. & Srinivas, B. (1996). Motivations and methods for text simplification. In Proceedings of the 16th International Conference on Computational Linguistics - Volume 2 (COLING ’96), 1041–1044, Copenhagen, Denmark. 21

Callison-Burch, C. & Dredze, M. (2010). Creating speech and language data with Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (CSLDAMT ’10), 1–12, Los Angeles, California. 109

Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S. & Basu, A. (2007). Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, 10, 157–174. 16

Callison-Burch, C., Osborne, M. & Koehn, P. (2006). Reevaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL ’06), 249–256. 104

Cicchetti, D.V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284–290. 147


Clark, A. (2003). Pre-processing very noisy text. In Proceedings of Workshop on Shallow Processing of Large Corpora, Corpus Linguistics, Lancaster. 16

Elliott, D., Hartley, A. & Atwell, E. (2004). A fluency error categorization scheme to guide automated machine translation evaluation. In R.E. Frederking & K. Taylor, eds., AMTA, vol. 3265 of Lecture Notes in Computer Science, 64–73, Springer. 102

Clark, E. & Araki, K. (2012). Two database resources for processing social media English text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. 2, 11, 15, 16

Farrús, M., Costa-jussà, M.R., Mariño, J.B. & Fonollosa, J.A.R. (2010). Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT), 167–173. 102

Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20, 37. 143

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Bradford Books. 104

Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin. 143

Feng, M., Peter, J.T. & Ney, H. (2013). Advancements in reordering models for statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL ’13), 322–332, Sofia, Bulgaria. 23

Collins, M., Koehn, P. & Kučerová, I. (2005). Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL ’05), 531–540, Ann Arbor, Michigan. 10, 24

Fishel, M., Bojar, O. & Popović, M. (2012). Terra: a collection of translation error-annotated corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. 101

Cook, P. & Stevenson, S. (2009). An unsupervised model for text message normalization. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity (CALC ’09), 71–78, Boulder, Colorado. 16

Flanagan, M. (1994). Error classification for MT evaluation. Proceedings of the First Conference of the Association for Machine Translation in the Americas, 65–72. 101

Coster, W. & Kauchak, D. (2011). Simple English Wikipedia: A new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (HLT ’11), 665–669, Portland, Oregon. 22

Fleiss, J. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382. 143, 145, 220

Fontenelle, T. (2006). Les nouveaux outils de correction linguistique de Microsoft. In TALN06 – Verbum ex machina – Actes de la 13ème Conférence sur le traitement automatique des langues naturelles, 3–19, UCL, Presses Universitaires de Louvain. 12, 14, 33, 48

Coughlin, D. (2003). Correlating automated and human assessments of machine translation quality. In Proceedings of MT Summit IX, New Orleans, USA. 104

Daelemans, W., Höthker, A. & Sang, E.T.K. (2004). Automatic sentence simplification for subtitling in Dutch and English. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 1045–1048, Lisbon, Portugal. 21

Fort, K., Adda, G. & Cohen, K.B. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37, 413–420. 109

Genzel, D. (2010). Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING ’10), 376–384, Beijing, China. 24

Damerau, F.J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM , 7, 171–176. 12, 13

Gerlach, J., Porro, V., Bouillon, P. & Lehmann, S. (2013a). Combining pre-editing and post-editing to improve smt of user-generated content. In Proceedings of the MT Summit XIV Workshop on Post-editing Technology and Practice (WPTP-2). 7

de Preux, N. (2005). How much does using controlled language improve machine translation results? In Proceedings of the Twenty-seventh International Conference on Translating and the Computer . 21

Gerlach, J., Porro, V., Bouillon, P. & Lehmann, S. (2013b). La préédition avec des règles peu coûteuses, utile pour la TA statistique des forums ? In Proceedings of TALN 2013 (Volume 2: Short Papers), 539–546. 7

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research (HLT ’02), 138–145, San Diego, California. 104, 105

Goh, C.L. & Sumita, E. (2011). Splitting long input sentences for phrase-based statistical machine translation. In Proceedings of the 17th annual meeting of the association of Natural Language Processing, 802–805. 20

Doherty, S. (2012). Investigating the effects of controlled language on the reading and comprehension of machine translated texts: A mixed-methods approach. PhD thesis, Dublin City University. 21

Gonzàlez, M., Giménez, J. & Màrquez, L. (2012). A Graphical Interface for MT Evaluation and Error Analysis. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL ’12), Jeju Island, Korea. 151, 174

Doyon, J., White, J.S. & Taylor, K.B. (1999). Task-based evaluation for machine translation. In Proceedings of MT Summit VII, 574–578, Kent Ridge Digital Labs, Singapore. 101


Goyvaerts, P. (1996). Controlled English, curse or blessing? a user’s perspective. In Proceedings of CLAW 1996 , 137–142. 11, 22, 32

Kaufmann, M. & Kalita, J. (2010). Syntactic normalization of twitter messages. In International conference on natural language processing, Kharagpur, India. 16

Green, S., Heer, J. & Manning, C.D. (2013). The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13), 439–448, Paris, France. 160

Kernighan, M.D., Church, K.W. & Gale, W.A. (1990). A spelling correction program based on a noisy channel model. In Proceedings of the 13th International Conference on Computational Linguistics - Volume 2 (COLING ’90), 205–210, Helsinki, Finland. 13

Guerberof, A. (2009). Productivity and quality in mt postediting. In Proceedings of the MT Summit XII Beyond Translation Memories: New Tools for Translators Workshop, Ottawa, Ontario, Canada. 160

Khan, M., Dickinson, M. & Kuebler, S. (2013). Does size matter? text and grammar revision for parsing social media data. In Proceedings of the Workshop on Language Analysis in Social Media (LASM 2013), 1–10, Atlanta, Georgia. 14, 16

Haddow, B. & Koehn, P. (2012). Analysing the effect of out-ofdomain data on SMT systems. In Proceedings of the Seventh Workshop on Statistical Machine Translation (StatMT ’12), Montreal, Canada. 75

King, M., Popescu-Belis, A. & Hovy, E. (2003). FEMTI: creating and using a framework for MT evaluation. In Proceedings of MT Summit IX , 224–231, New Orleans, USA. 100

Hallgren, K.A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8, 23–34. 143

Han, B. & Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT ’11), 368–378. 2, 11, 15

Kobus, C., Yvon, F. & Damnati, G. (2008). Normalizing sms: Are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1 (COLING ’08), 441–448, Manchester, United Kingdom. 16

Han, B., Cook, P. & Baldwin, T. (2013). Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4, 5:1–5:27. 15

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X , 79– 86, Phuket, Thailand. 17, 31, 81

Helfrich, A. & Music, B. (2000). Design and evaluation of grammar checkers in multiple languages. In Proceedings of the 18th International Conference on Computational Linguistics - Volume 2 (COLING ’00), 1036–1040, Saarbrücken, Germany. 33

Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edn. 73, 100, 104

Koehn, P. & Germann, U. (2014). The impact of machine translation quality on human post-editing. In Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation, 38–46, Gothenburg, Sweden. 160, 161

Hovy, E., King, M. & Popescu-Belis, A. (2002). An introduction to MT evaluation. In Workbook of the LREC 2002 Workshop on Machine Translation Evaluation: Human Evaluators Meet Automated Metrics, 1–7, Las Palmas de Gran Canaria, Spain. 99

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R. et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL ’07), 177–180, Prague, Czech Republic. 3

Huijsen, W. (1998). Controlled language: An introduction. In Proceedings of the Second Controlled Language Application Workshop (CLAW 1998), 1–15. 2, 19, 32, 83

Hutchins, W.J. (1986). Machine Translation: Past, Present, Future. John Wiley & Sons, Inc., New York, NY, USA. 190

Koponen, M. (2012). Comparing human perceptions of postediting effort with post-editing operations. In Proceedings of the Seventh Workshop on Statistical Machine Translation (StatMT ’12), 181–190, Association for Computational Linguistics, Montreal, Canada. 162

Jakobsen, A.L. (1999). Logging target text production with translog. Probing the Process in Translation. Methods and Results. (Copenhagen Studies in Language 24), 9–20. 161

Jelinek, F. (1997). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA. 13

Koponen, M., Aziz, W., Ramos, L. & Specia, L. (2012). Postediting time as a measure of cognitive effort. In Proceedings of the AMTA Workshop on Postediting Technology and Practice, 11–20. 161, 162

Jiang, J., Way, A. & Haque, R. (2012). Translating usergenerated content in the social networking space. In Proceedings of AMTA-2012, the Tenth Biennial Conference of the Association for Machine Translation in the Americas. 1, 2, 30

Krings, H.P. (2001). Repairing texts: Empirical investigations of machine translation post-editing process. The Kent State University Press, Kent, OH. 160, 161, 162

Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Sage commtext series, Sage Publications. 143

Jonnalagadda, S., Tari, L., Hakenberg, J., Baral, C. & Gonzalez, G. (2009). Towards effective sentence simplification for automatic processing of biomedical text. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (NAACL-Short ’09), 177–180, Boulder, Colorado. 21

Kuhn, T. (2014). A survey and classification of controlled natural languages. Computational Linguistics, 40, 121–170. 19


Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24, 377– 439. 18

Melero, M., Costa-Jussà, M.R., Domingo, J., Marquina, M. & Quixal, M. (2012). Holaaa!! writin like u talk is kewl but kinda hard 4 NLP. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. 15, 30

Lacruz, I., Shreve, G.M. & Angelone, E. (2012). Average pause ratio as an indicator of cognitive effort in post-editing: A case study. In AMTA 2012 Workshop on Post-Editing Technology and Practice (WPTP 2012), 21–30. 161

Mitamura, T. & Nyberg, E.H. (1995). Controlled English for knowledge-based MT: experience with the kant system. In Proceedings of TMI-95 . 2, 10, 11, 19

Landis, J.R. & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159– 174. 145, 148

Mitamura, T. & Nyberg, E.H. (2001). Automatic rewriting for controlled language translation. In Proceedings of the NLPRS 2001 Workshop on Automatic Paraphrasing: Theory and Application. 22

Lavie, A. & Agarwal, A. (2007). Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT ’07), 228–231, Prague, Czech Republic. 104

Mitchell, L., Roturier, J. & O’Brien, S. (2013). Communitybased post-editing of machine-translated content: monolingual vs. bilingual. In Proceedings of the MT Summit XIV Workshop on Post-editing Technology and Practice (WPTP2), Nice, France. 166, 171

Lavie, A., Sagae, K. & Jayaraman, S. (2004). The significance of recall in automatic metrics for MT evaluation. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004). 104

Morado Vazquez, L., Rodriguez Vazquez, S. & Bouillon, P. (2013). Comparing forum data post-editing performance using translation memory and machine translation output: a pilot study. In Proceedings of MT Summit XIV , Nice, France. 160

Lehmann, S., Gottesman, B., Grabowski, R., Kudo, M., Lo, S., Siegel, M. & Fouvry, F. (2012). Applying cnl authoring support to improve machine translation of forum data. In T. Kuhn & N. Fuchs, eds., Controlled Natural Language, vol. 7427 of Lecture Notes in Computer Science, 1– 10, Springer Berlin Heidelberg. 2, 3, 19

Naber, D. (2003). A Rule-Based Style and Grammar Checker. Master’s thesis, Technische Fakultät, Universität Bielefeld. 18

Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10. 13, 103

Niessen, S. & Ney, H. (2000). Improving SMT quality with morpho-syntactic analysis. In Proceedings of the 18th International Conference on Computational Linguistics - Volume 2 (COLING ’00), 1081–1085, Saarbrücken, Germany. 24

Light, R.J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76(5), 365–377. 145

Niessen, S. & Ney, H. (2001). Morpho-syntactic analysis for reordering in statistical machine translation. In Proceedings of MT Summit VIII , 247–252, Santiago de Compostela, Spain. 10, 23, 24, 129

Lin, C.Y. & Och, F.J. (2004). ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04), Geneva, Switzerland. 151

Nyberg, E., Mitamura, T. & Huijsen, W. (2003). Controlled language for authoring and translation. In H. Somers, ed., Computers and Translation: A Translator’s Guide., 245– 281, John Benjamins Publishing Company, Amsterdam. 19

Liu, F., Weng, F. & Jiang, X. (2012). A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (ACL ’12), 1035–1044, Jeju Island, Korea. 14, 16

Nyberg, E.H. & Mitamura, T. (1996). Controlled language and knowledge-based machine translation: Principles and practice. In Proceedings of the First International Workshop on Controlled Language Applications (CLAW-96), 74–83. 2, 11, 20, 189

Llitjós, A.F., Carbonell, J.G. & Lavie, A. (2005). A framework for interactive and automatic refinement of transfer-based machine translation. In Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT), 30–31, Budapest, Hungary. 102

O’Brien, S. (2003). Controlling controlled English: An analysis of several controlled language rule sets. In Proceedings of EAMT-CLAW-03 , 105–114, Dublin, Ireland. 20, 72, 189

Max, A. (2006). Writing for language-impaired readers. In A. Gelbukh, ed., Computational Linguistics and Intelligent Text Processing, vol. 3878 of Lecture Notes in Computer Science, 567–570, Springer Berlin Heidelberg. 22

O’Brien, S. (2004). Machine translatability and post-editing effort: how do they relate? In Proceedings of the Twenty-sixth International Conference on Translating and the Computer. 160, 161

Mays, E., Damerau, F.J. & Mercer, R.L. (1991). Context based spelling correction. Information Processing & Management, 27, 517 – 522. 13

O’Brien, S. (2005). Methodologies for measuring the correlations between post-editing effort and machine translatability. Machine Translation, 19, 37–58. 18, 162

McGraw, K.O. & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46. 143, 146, 220

O’Brien, S. (2006a). Machine-translatability and post-editing effort: an empirical study using translog and choice network analysis. Ph.D. thesis, Dublin City University. 164


O’Brien, S. (2006b). Pauses as indicators of cognitive effort in post-editing machine translation output. Across Languages and Cultures, 7, 1–21. 161

Renahy, J., Devitre, D., Dziadkiewicz, A. & Thomas, I. (2009). Controlled language norms for the redaction of security protocols: finding the median between system needs and user acceptability. In Proceedings of the 11th International Symposium on Social Communication, Cuba. 22

O’Brien, S. (2011). Towards predicting post-editing productivity. Machine translation, 25, 197–215. 162

Roturier, J. (2004). Assessing a set of controlled language rules: Can they improve the performance of commercial machine translation systems? Proceedings of the 26th Conference of Translating and the Computer (ASLIB 2004), 1– 14. 19

O’Brien, S. & Roturier, J. (2007). How portable are controlled languages rules? a comparison of two empirical MT studies. In Proceedings of MT Summit XI , 105–114, Copenhagen, Denmark. 21, 189

Roturier, J. (2006). An investigation into the impact of controlled English rules on the comprehensibility, usefulness and acceptability of machine-translated technical documentation for french and german users. PhD thesis, Dublin City University. 165

Pakhomov, S. (2002). Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02), 160–167, Philadelphia, Pennsylvania. 15

Roturier, J. & Bensadoun, A. (2011). Evaluation of mt systems to translate user generated content. In Proceedings of MT Summit XIII , 244–251, Xiamen, China. 1, 11

Papineni, K., Roukos, S., Ward, T. & Zhu, W.J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02), 311–318, Philadelphia, Pennsylvania. 103

Roturier, J., Mitchell, L., Grabowski, R. & Siegel, M. (2012). Using automatic machine translation metrics to analyze the impact of source reformulations. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA). 155

Pedler, J. & Mitton, R. (2010). A large list of confusion sets for spellchecking assessed against a corpus of real-word errors. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. 13

Roturier, J., Mitchell, L. & Silva, D. (2013). The ACCEPT post-editing environment: a flexible and customisable online tool to perform and analyse machine translation postediting. In Proceedings of the MT Summit XIV Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France. 161, 165, 166

Petersen, S.E. & Ostendorf, M. (2007). Text simplification for language learners: a corpus analysis. In Proceedings of the SLaTE Workshop on Speech and Language Technology in Education, Farmington, Pennsylvania USA. 21, 22

Rowe, N.C. & Laitinen, K. (1995). Semiautomatic disabbreviation of technical text. Information Processing and Management: an International Journal, 31, 851–857. 15

Petitpierre, D. & Russel, G. (1995). Multext deliverable report, chap. Mmorph - the multext morphology program. 25, 38, 81

Ruffino, J.R. (1981). Coping with machine translation. In Practical experience of machine translation. Proceedings of a conference, 57–60. 2, 10

Plitt, M. & Masselot, F. (2010). A productivity test of statistical machine translation post-editing in a typical localisation context. Prague Bulletin of Mathematical Linguistics, 93, 7–16. 160, 161

Scott, W.A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321–325. 143

Popović, M., Avramidis, E., Burchardt, A., Vilar, D. & Uszkoreit, H. (2013). What can we learn about the selection mechanism for post-editing? In Proceedings of the MT Summit XIV Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France. 163

Senellart, J., Yang, J. & Rebollo, A. (2003). Systran intuitive coding technology. In Proceedings of MT Summit IX, New Orleans, USA. 190

Seretan, V. (2012). Acquisition of syntactic simplification rules for French. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). 21, 22

Porro, V., Gerlach, J., Bouillon, P. & Seretan, V. (2014). Rulebased automatic post-processing of SMT output to reduce human post-editing effort. In Proceedings of Translating and the Computer 36 (TC36). 241

Seretan, V., Bouillon, P. & Gerlach, J. (2014). A large-scale evaluation of pre-editing strategies for improving usergenerated content translation. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC). 7

Porter, M. (2001). The Porter stemming algorithm. http://tartarus.org/~martin/PorterStemmer/. 104

Pym, P.J. (1988). Pre-editing and the use of simplified writing for MT: an engineer’s experience of operating an MT system. In Translating and the Computer 10: The translation environment 10 years on, 80–96. 2, 11, 20

Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423. 13

Rayner, M., Bouillon, P. & Haddow, B. (2012). Using sourcelanguage transformations to address register mismatches in SMT. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA). 71, 81, 110, 129, 137

Sheng, V.S., Provost, F.J. & Ipeirotis, P.G. (2008). Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 614–622. 109


Shimohata, M. & Sumita, E. (2002). Automatic paraphrasing based on parallel corpus for normalization. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). 15

Subramaniam, L.V., Roy, S., Faruquie, T.A. & Negi, S. (2009). A survey of types of text noise and techniques to handle noisy text. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data (AND ’09), 115– 122, Barcelona, Spain. 30

Shrout, P. & Fleiss, J. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin. 143

Surcin, S., Lange, E. & Senellart, J. (2007). Rapid development of new language pairs at SYSTRAN. In Proceedings of MT Summit XI , 443–44, Copenhagen, Denmark. 190

Sidarenka, U., Scheffler, T. & Stede, M. (2013). Rule-based normalization of German twitter messages. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, GSCL 2013 . 2, 10, 11, 15

Tatsumi, M. (2009). Correlation between automatic evaluation metric scores, post-editing speed, and some other factors. In Proceedings of MT Summit XII , 332–339, Ottawa, Ontario, Canada. 160, 161, 163

Siddharthan, A., Nenkova, A. & McKeown, K. (2004). Syntactic simplification for improving content selection in multidocument summarization. In Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04), Geneva, Switzerland. 21

Tatsumi, M. & Roturier, J. (2010). Source text characteristics and technical and temporal post-editing effort: What is their relationship ? In Second Joint EM+/CNGL Workshop “Bringing MT to the User: Research on Integrating MT in the Translation Industry” JEC 2010 . 160, 162, 164

Siegel, S. & Castellan, N. (1988). Nonparametric statistics for the behavioral sciences. McGraw–Hill, Inc., 2nd edn. 143

Taylor, K. & White, J. (1998). Predicting what MT is good for: User judgements and task performance. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup (AMTA ’98), 364–373, Langhorne, Pennsylvania, USA. 100

Simard, M. & Deslauriers, A. (2001). Real-time automatic insertion of accents in French text. Natural Language Engineering, 7, 143–165. 14 Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Weischedel, R. (2006). A study of translation error rate with targeted human annotation. In In Proceedings of the Association for Machine Transaltion in the Americas (AMTA 2006). 105, 174

Temnikova, I. (2010). Cognitive evaluation approach for a controlled language post–editing experiment. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). 3, 19, 160, 161, 162, 163, 189

Snover, M., Madnani, N., Dorr, B.J. & Schwartz, R. (2009). Fluency, adequacy, or HTER?: Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation (StatMT ’09), 259–268, Athens, Greece. 105

Temnikova, I. (2011). Establishing implementation priorities in aiding writers of controlled crisis management texts. In RANLP, 654–659. 22, 32, 33 The EAGLES MT Evaluation Working Group (1996). EAGLES Evaluation of Natural Language Processing Systems. Final Report. EAGLES Document EAG-EWG-PR.2. Center for Sprogteknologi, Copenhagen. 100

Snow, R., O’Connor, B., Jurafsky, D. & Ng, A.Y. (2008). Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), 254–263, Honolulu, Hawaii. 109

Thurmair, G. (1990). Parsing for grammar and style checking. In Procedings of the 13th International Conference on Computational Linguistics, 365–370. 18

Specia, L. & Farzindar, A. (2010). Estimating machine translation post-editing effort with hter. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas workshop Bringing MT to the User: MT Research and the Translation Industry, Denver, Colorado, USA. 161, 163

Toutanova, K. & Moore, R.C. (2002). Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02), 144–151, Philadelphia, Pennsylvania. 14

Specia, L., Raj, D. & Turchi, M. (2010). Machine translation evaluation versus quality estimation. Machine Translation, 24, 39–50. 22, 105

Underwood, N.L. & Jongejan, B. (2001). Translatability checker: A tool to help decide whether to use MT. In Proceedings of MT Summit VIII , 363–368, Santiago de Compostela, Galicia, Spain. 133

Specia, L., Shah, K., de Souza, J.G. & Cohn, T. (2013). QuEst - a translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL ’13), 79–84, Sofia, Bulgaria. 105

van Berkelt, B. & De Smedt, K. (1988). Triphone analysis: A combined method for the correction of orthographical and typographical errors. In Proceedings of the Second Conference on Applied Natural Language Processing. 14

Starlander, M. & Popescu-Belis, A. (2002). Corpus-based evaluation of a French spelling and grammar checker. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain. 58

Veronis, J. (1988). Correction of phonographic errors in natural language interfaces. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’88), 101–115, Grenoble, France. 13, 49

251

REFERENCES

Vickrey, D. & Koller, D. (2008). Sentence simplification for semantic role labeling. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technology Conference (ACL-08: HLT), 344–352, Columbus, Ohio, USA. 21

White, J.e.a. (1992-1994). ARPA workshops on MT (series of four workshops on comparative evaluation). Technical report, PRC Inc., McLean, Virginia. 99 Wojcik, R., Hoard, J. & Holzhauser, K. (1990). The Boeing Simplified English checker. In Proceedings of the International Conference, Human Machine Interaction and Artificial Intelligence in Aeronautics and Space, 43–57. 22

Vilar, D., Xu, J., D’Haro, L.F. & Ney, H. (2006). Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, 697–702, Genoa, Italy. 102, 162

Xia, F. & McCord, M. (2004). Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of the 20th International Conference on Computational Linguistics (COLING ’04), Geneva, Switzerland. 24

Voss, C.R. & Tate, C.R. (2006). Task-based evaluation of machine translation (MT) engines. Measuring how well people extract who, when, where-type ents in MT output. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Oslo, Norway. 101

Yannakoudakis, E.J. & Fawthrop, D. (1983). An intelligent spelling error corrector. Information Processing & Management, 19, 101–108. 13

Wagner, R.A. (1974). Order-n correction for regular languages. Communications of the ACM , 17, 265–268. 13

Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in spanish and french. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL ’94), 88–95, Las Cruces, New Mexico. 13

Wang, C. (2007). Chinese syntactic reordering for statistical machine translation. In In Proceedings of EMNLP, 737– 745. 24

252

Appendix A

Research Overview

This appendix recapitulates the research questions addressed in this thesis. Table A.1 provides an overview of the experiments carried out to answer each question.

Table A.1: Research overview

Research question: Is a declarative rule-based formalism such as Acrolinx suited to pre-edit forum data?

Aim: Evaluate rule precision
Data: 10,000 randomly selected sentences, unseen data; examples for each rule extracted with the Acrolinx Batch Checker; 50 examples per rule included in the evaluation
Participants: 1 French native translator
Measure: Precision of error detection (interactive rules) and error correction (automatic rules)
Chapter: 3.9.1

Aim: Evaluate non-word and real-word correction in comparison with state-of-the-art spelling and grammar checkers (Antidote, MS Word 2010, LT)
Data: 500 randomly selected sentences, unseen data; 5,269 words
Participants: 1 French native translator
Measure: Precision of error detection and error correction
Chapter: 3.3.3

Research question: Can pre-editing improve SMT of French forum data?

Aim: Evaluate the individual impact of pre-editing rules on translation with the ACCEPT SMT system
Data: 10,000 representative sentences of unseen Norton Community forum data; up to 50 examples per rule included in the evaluation, 1,390 examples in total
Participants: 3 translators
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 4.4

Aim: Evaluate the combined impact of pre-editing rules
Data: 1,030 randomly selected sentences, unseen data; 11,159 words
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 4.5

Aim: Evaluate the combined impact of pre-editing rules with automatic metrics
Data: same as above
Participants: -
Measure: Automatic evaluation metrics
Chapter: 4.5.2

Aim: Evaluate the combined impact of automatic pre-editing rules only
Data: same as above
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 4.6

Research question: Can a comparative translation evaluation on Amazon Mechanical Turk (AMT) produce results comparable to evaluations performed with language professionals?

Aim: Assess the reliability of user ratings
Data: same 1,390 examples as for the evaluation of the individual impact of pre-editing rules (above)
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 4.4.3

Research question: Does pre-editing that improves MT reduce post-editing effort?

Aim: Measure technical and temporal post-editing effort for translations of raw and of pre-edited data where pre-editing improves translation
Data: 158 representative sentences, unseen data
Participants: 3 translators
Measure: Post-editing time; edit distance
Chapter: 5.5

Research question: What is the relationship between the effort invested in pre-editing and the gain in terms of post-editing effort?

Aim: Measure technical and temporal post-editing effort for translations of raw and of pre-edited data, considering improved, degraded and indifferent cases
Data: 500 posts, unseen data; selection of 50 sentences per case
Participants: 3 translators
Measure: Post-editing time; edit distance
Chapter: 5.6

Research question: Is the pre-editing approach portable to other MT systems?

Aim: Evaluate the individual impact of pre-editing rules across paradigms (Lucy LT, Systran, Google Translate)
Data: 10,000 randomly selected sentences of unseen Norton Community forum data; examples for each rule extracted with the Acrolinx Batch Checker; 50 examples per rule included in the evaluation
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 6.2.2

Aim: Evaluate the combined impact of pre-editing rules across paradigms (Lucy LT, Systran, Google Translate)
Data: 1,030 randomly selected sentences, unseen data; 11,159 words
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 6.2.3

Research question: Is the pre-editing approach portable to forums from another domain?

Aim: Evaluate rule precision across domains
Data: 10,000 randomly selected sentences each from 2 forums: CNet France and Bricoleur du dimanche
Participants: 1 French native translator
Measure: Precision of error detection (interactive rules) and error correction (automatic rules)
Chapter: 6.3.2

Aim: Evaluate the combined impact of pre-editing rules across domains
Data: 1,000 randomly selected sentences each from 2 forums: CNet France and Bricoleur du dimanche; translated with Google Translate
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 6.3.3

Research question: Can forum users successfully perform the transformations required by the pre-editing rules, and thus achieve the desired impact on MT output?

Aim: Assess pre-editing by forum users: compare rule application in two scenarios (semi-automatic and all manual)
Data: 25 posts, unseen data, selected for a high error/word ratio; 2,274 words
Participants: 2 Norton forum users
Measure: Pre-editing activity
Chapter: 7.3.1

Aim: Evaluate the impact of pre-editing performed by forum users in two scenarios (semi-automatic and all manual) on translation with the ACCEPT SMT system
Data: same as above
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 7.3.2

Aim: Assess pre-editing by forum users: compare rule application by forum users against rule application by experts
Data: same as above
Participants: 2 Norton forum users
Measure: Pre-editing activity
Chapter: 7.4.1

Aim: Evaluate the impact of pre-editing by forum users against pre-editing by experts on translation with the ACCEPT SMT system
Data: same as above
Participants: 3 AMT judgements
Measure: Human comparative evaluation, 3 judgements per example
Chapter: 7.4.2
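Several of the experiments listed above collect three comparative judgements per example (from translators or AMT workers) and reduce them to a single verdict. The snippet below is a minimal sketch of one plausible aggregation scheme, assumed here purely for illustration: a simple majority vote over the comparative labels used in Appendix D, with ties mapped to "unclear"; examples whose raw and pre-edited translations are identical are treated as "no impact" before any judging. The actual procedure is described in the evaluation chapters, not here.

    from collections import Counter

    JUDGE_LABELS = {"raw better", "pre-ed better", "about same"}

    def aggregate(judgements):
        """Reduce three comparative judgements to one verdict by majority vote.

        Examples without a strict majority are labelled 'unclear'
        (an assumption made for this sketch).
        """
        counts = Counter(j for j in judgements if j in JUDGE_LABELS)
        if not counts:
            return "unclear"
        label, freq = counts.most_common(1)[0]
        return label if freq >= 2 else "unclear"

    print(aggregate(["pre-ed better", "pre-ed better", "raw better"]))  # pre-ed better
    print(aggregate(["raw better", "about same", "pre-ed better"]))     # unclear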

Appendix B

Data

This appendix recapitulates the data used for development and testing.


Norton forum data: 1st batch (symc_bip_12_fr_forum)
10,886 posts; 78,000 lines (1) (54,000 unique)
  2 x 10,000 representative sentences → rule development (1st 10,000), rule fine-tuning (2nd 10,000) (cf. Section 3.2.3)
  all → n-gram extraction for rules for the machine (cf. Section 3.7.1)
  25 posts (2,274 words), selected for a high error/word ratio → pre-editing experiment with Norton forum users (cf. Section 7.2.3)

Norton forum data: 2nd batch (symc_bip_22_fr_forum2)
6,004 posts; 120,000 lines (80,000 unique)
  10,000 representative sentences → rule-by-rule evaluation with the ACCEPT SMT system (cf. Section 4.4); rule-by-rule evaluation with other MT systems (cf. Section 6.2.2)
  1,030 random sentences → global evaluation with the ACCEPT SMT system (cf. Section 4.5); global evaluation with automatic rules only and the ACCEPT SMT system (cf. Section 4.6); global evaluation with other MT systems (cf. Section 6.2.3)
  684 representative sentences → first post-editing experiment (cf. Section 5.5)
  5 x 100 random posts (from 2nd and 3rd batches) → second post-editing experiment (cf. Section 5.6)

Norton forum data: 3rd batch (symc_bip_23_fr_forum3)
2,389 posts; 16,000 lines (11,000 unique)
  500 random sentences → performance of spelling and grammar rules on forum data (cf. Section 3.3.3)
  10,000 random sentences → evaluation of rule precision (cf. Section 3.9)

CNet France forum data and Bricoleur du dimanche forum data
  10,000 random sentences (from each forum) → evaluation of rule precision (cf. Section 6.3.2)
  1,000 random sentences (from each forum) → global evaluation with Google Translate (cf. Section 6.3.3)

(1) Lines are counted after removing HTML tags and inserting line breaks between sentences using the Python NLTK utility.

Figure B.1: Overview of data used
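The footnote to Figure B.1 states that line counts were obtained after stripping HTML tags and splitting sentences with NLTK. The following is a minimal sketch of such a counting step, assuming the French Punkt sentence tokenizer distributed with NLTK; the tag-stripping regular expression and the input format are illustrative simplifications, not the exact preprocessing used for the thesis data.

    import re
    from nltk.tokenize import sent_tokenize  # requires: python -m nltk.downloader punkt

    def count_lines(raw_posts):
        """Count sentence lines (total and unique) in a list of raw forum posts."""
        lines = []
        for post in raw_posts:
            text = re.sub(r'<[^>]+>', ' ', post)      # crude HTML tag removal (illustrative)
            text = re.sub(r'\s+', ' ', text).strip()  # normalise whitespace
            # one line per sentence, using the French Punkt model
            lines.extend(sent_tokenize(text, language='french'))
        return len(lines), len(set(lines))

    # Example: two small posts yield 3 lines, 2 of them unique
    total, unique = count_lines(["<p>Bonjour. Merci d'avance.</p>", "<p>Bonjour.</p>"])
    print(total, unique)  # 3 2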


Appendix C

Pre-editing rules ordered by set

This appendix recapitulates the pre-editing rules by set.

Table C.1: Set 1 (rules for humans; automatic application)

Rule | Category
avoidCond (D.2) | grammar (mood/tense)
futureCond (D.2) | grammar (mood/tense)
useImperative (D.2) | grammar (mood/tense)
misspelledExpr (D.3) | grammar (sequence)
aÀ (D.4) | homophone confusion
çaSa (D.4) | homophone confusion
ceSe (D.4) | homophone confusion
ciSi (D.4) | homophone confusion
duDû (D.4) | homophone confusion
laLà (D.4) | homophone confusion
maM'a (D.4) | homophone confusion
niN'y (D.4) | homophone confusion
onOntSont (D.4) | homophone confusion
ouOù (D.4) | homophone confusion
qu'elleQuelle (D.4) | homophone confusion
quelQueQuelque (D.4) | homophone confusion
soiSoisSoit (D.4) | homophone confusion
tousTout (D.4) | homophone confusion
addComma (D.5) | punctuation
addElision (D.5) | punctuation
addHyphen (D.5) | punctuation
addSpaceBtwnWords (D.5) | punctuation
addSpacePunct (D.5) | punctuation
avoidPunct (D.5) | punctuation
wrongAbbrev (D.8) | reformulation

Table C.2: Set 2 (rules for humans; interactive application)

Rule | Category
nounPhrAgr (D.1) | grammar (agreement)
subjVerbAgr (D.1) | grammar (agreement)
confParticipIsUs (D.2) | grammar (mood/tense)
useSubjunctive (D.2) | grammar (mood/tense)
wrongVerbForm (D.2) | grammar (mood/tense)
impossibleTagSeq (D.3) | grammar (sequence)
missingPas (D.3) | grammar (sequence)
desDès (D.4) | homophone confusion
divHomophone (D.4) | homophone confusion
nounVerbConf (D.4) | homophone confusion
surSûr (D.4) | homophone confusion
t'esTes (D.4) | homophone confusion
addCommaPP (D.5) | punctuation
addSpacePunct2 (D.5) | punctuation
avoid(char) (D.5) | punctuation
avoidCapSent (D.5) | punctuation
checkEOS/BOS (D.5) | punctuation
wrongCap (D.5) | punctuation
wrongElision (D.5) | punctuation
avoidAbbrev (D.6) | informal
avoidAnglicism (D.6) | informal
avoidDirQuest (D.6) | informal
avoidEstCeQue (D.6) | informal
avoidInformal (D.6) | informal
missingNe (D.6) | informal
avoidSoit (D.8) | reformulation
avoidAdverb (D.7) | simplification
avoidCeQui (D.7) | simplification
avoidCleftSent (D.7) | simplification
avoidConjBOS (D.7) | simplification
avoidNon (D.7) | simplification
avoidPresPart (D.7) | simplification

Table C.3: Set 3 (rules for the machine; automatic application)

Rule | Category
addDoisJe (D.8) | reformulation
avoidExpression (D.8) | reformulation
avoidGreeting (D.8) | reformulation
avoidIlFautQue (D.8) | reformulation
avoidPlus (D.8) | reformulation
avoidPourQue (D.8) | reformulation
avoidQue (D.8) | reformulation
avoidSiEtQue (D.8) | reformulation
removeNeExpl (D.8) | reformulation
tuVous (D.9) | informal 2nd person
cliticsÇa (D.10) | clitics
cliticsCeux-ci (D.10) | clitics
cliticsPersPron (D.10) | clitics
reorderJamaisV (D.11) | reordering
reorderRienInf (D.11) | reordering
reorderTout (D.11) | reordering

Appendix D

Rule by rule results

This appendix presents precision, impact and portability results for each rule. Table D.1 lists the rules in alphabetical order of rule name. The remainder of the appendix is organised by rule category. It should be noted that the results in this annex have been regrouped from the different experiments, which were performed over a longer time period. During this time, some rules have undergone debugging, some have been split or merged with others, and some have been moved from one set to another. Therefore not all experiments have included all rules, and the results for some rules are incomplete.

Table D.1: Pre-editing rules

Rule name | Category | Set
aÀ (D.4) | Homophone confusion | Set 1
addComma (D.5) | Punctuation | Set 1
addCommaPP (D.5) | Punctuation | Set 2
addDoisJe (D.8) | Reformulation | Set 3
addElision (D.5) | Punctuation | Set 1
addHyphen (D.5) | Punctuation | Set 1
addSpaceWords (D.5) | Punctuation | Set 1
addSpacePunct (D.5) | Punctuation | Set 1
addSpacePunct2 (D.5) | Punctuation | Set 2
avoid(char) (D.5) | Punctuation | Set 2
avoidAbbrev (D.6) | Informal | Set 2
avoidAdverb (D.7) | Simplification | Set 2
avoidAnglicism (D.6) | Informal | Set 2
avoidCapSent (D.5) | Punctuation | Set 2
avoidCeQui (D.7) | Simplification | Set 2
avoidCleftSent (D.7) | Simplification | Set 2
avoidCond (D.2) | Grammar (mood/tense) | Set 1
avoidConjBOS (D.7) | Simplification | Set 2
avoidDirQuest (D.6) | Informal | Set 2
avoidEstCeQue (D.6) | Informal | Set 2
avoidExpression (D.8) | Reformulation | Set 3
avoidGreeting (D.8) | Reformulation | Set 3
avoidIlFautQue (D.8) | Reformulation | Set 3
avoidInformal (D.6) | Informal | Set 2
avoidNon (D.7) | Simplification | Set 2
avoidPlus (D.8) | Reformulation | Set 3
avoidPourQue (D.8) | Reformulation | Set 3
avoidPresPart (D.7) | Simplification | Set 2
avoidPunct (D.5) | Punctuation | Set 1
avoidQue (D.8) | Reformulation | Set 3
avoidSiEtQue (D.8) | Reformulation | Set 3
avoidSoit (D.8) | Reformulation | Set 2
çaSa (D.4) | Homophone confusion | Set 1
ceSe (D.4) | Homophone confusion | Set 1
checkEOS/BOS (D.5) | Punctuation | Set 2
ciSi (D.4) | Homophone confusion | Set 1
cliticsÇa (D.10) | Clitics | Set 3
cliticsCeux-ci (D.10) | Clitics | Set 3
cliticsPersPron (D.10) | Clitics | Set 3
confParticipIsUs (D.2) | Grammar (mood/tense) | Set 2
desDès (D.4) | Homophone confusion | Set 2
divHomophone (D.4) | Homophone confusion | Set 2
duDû (D.4) | Homophone confusion | Set 1
futureCond (D.2) | Grammar (mood/tense) | Set 1
impossibleTagSeq (D.3) | Grammar (sequence) | Set 2
laLà (D.4) | Homophone confusion | Set 1
maM'a (D.4) | Homophone confusion | Set 1
missingNe (D.6) | Informal | Set 2
missingPas (D.3) | Grammar (sequence) | Set 2
misspelledExpr (D.3) | Grammar (sequence) | Set 1
niN'y (D.4) | Homophone confusion | Set 1
nounPhrAgr (D.1) | Grammar (agreement) | Set 2
nounVerbConf (D.4) | Homophone confusion | Set 2
onOntSont (D.4) | Homophone confusion | Set 1
ouOù (D.4) | Homophone confusion | Set 1
qu'elleQuelle (D.4) | Homophone confusion | Set 1
quelQueQuelque (D.4) | Homophone confusion | Set 1
removeNeExpl (D.8) | Reformulation | Set 3
reorderJamaisV (D.11) | Reordering | Set 3
reorderRienInf (D.11) | Reordering | Set 3
reorderTout (D.11) | Reordering | Set 3
soiSoisSoit (D.4) | Homophone confusion | Set 1
subjVerbAgr (D.1) | Grammar (agreement) | Set 2
surSûr (D.4) | Homophone confusion | Set 2
t'esTes (D.4) | Homophone confusion | Set 2
tousTout (D.4) | Homophone confusion | Set 1
tuVous (D.9) | Informal 2nd person | Set 3
useImperative (D.2) | Grammar (mood/tense) | Set 1
useSubjunctive (D.2) | Grammar (mood/tense) | Set 2
wrongAbbrev (D.8) | Reformulation | Set 1
wrongCap (D.5) | Punctuation | Set 2
wrongElision (D.5) | Punctuation | Set 2
wrongVerbForm (D.2) | Grammar (mood/tense) | Set 2

D.1 Grammar (agreement)

D.1.1 nounPhrAgr

Category: Grammar (agreement)
Set: Set 2
Description: Agreement error in noun phrase
Example: "Bonjour, quel version de norton 360 avez-vous ?" → "Bonjour, quelle version de norton 360 avez-vous ?"

Precision (on 10,000 sentences):
Norton: 250 flags, 50 evaluated, 8 incorrect, precision 84%
CNet: 394 flags, 50 evaluated, 3 incorrect, precision 94%
Brico: 717 flags, 50 evaluated, 14 incorrect, precision 72%

Impact on MT (for 50 sentences of 196 flagged in 10,000; Norton forum data):
ACCEPT (FTI): no impact 21 (42%), raw better 6 (12%), about same 3 (6%), pre-ed better 19 (38%), unclear 1 (2%)
ACCEPT (AMT): no impact 21 (42%), raw better 4 (8%), about same 1 (2%), pre-ed better 21 (42%), unclear 3 (6%)
Lucy: no impact 10 (20%), raw better 4 (8%), about same 1 (2%), pre-ed better 35 (70%), unclear -
Systran: no impact 19 (38%), raw better 4 (8%), about same 2 (4%), pre-ed better 23 (46%), unclear 2 (4%)
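The precision figures reported throughout this appendix are consistent with the ratio of correct flags to evaluated flags. The short sketch below makes that arithmetic explicit, using the nounPhrAgr counts above; it is an illustration of how the reported percentages can be reproduced, not the evaluation code used in the thesis.

    def precision(evaluated, incorrect):
        """Share of evaluated flags that were judged correct."""
        return (evaluated - incorrect) / evaluated

    # nounPhrAgr, precision on 10,000 sentences per forum (cf. the table above)
    for forum, evaluated, incorrect in [("Norton", 50, 8), ("CNet", 50, 3), ("Brico", 50, 14)]:
        print(f"{forum}: {precision(evaluated, incorrect):.0%}")
    # Norton: 84%  CNet: 94%  Brico: 72%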

D.1.2 subjVerbAgr

Category: Grammar (agreement)
Set: Set 2
Description: Error in subject-verb agreement
Example: "Tout d'abord je vous fait part de ma configuration:" → "Tout d'abord je vous fais part de ma configuration:"

Precision (on 10,000 sentences):
Norton: 189 flags, 50 evaluated, 7 incorrect, precision 86%
CNet: 334 flags, 50 evaluated, 9 incorrect, precision 82%
Brico: 529 flags, 50 evaluated, 2 incorrect, precision 96%

Impact on MT (for 50 sentences of 237 flagged in 10,000; Norton forum data):
ACCEPT (FTI): no impact 10 (20%), raw better 13 (26%), about same 6 (12%), pre-ed better 20 (40%), unclear 1 (2%)
ACCEPT (AMT): no impact 10 (20%), raw better 12 (24%), about same 7 (14%), pre-ed better 20 (40%), unclear 1 (2%)
Lucy: no impact 4 (8%), raw better 2 (4%), about same 2 (4%), pre-ed better 41 (82%), unclear 1 (2%)
Systran: no impact 19 (38%), raw better 3 (6%), about same 1 (2%), pre-ed better 27 (54%), unclear -

D. RULE BY RULE RESULTS

D.2 D.2.1

Grammar (mood/tense) avoidCond

Category Set Description Example

Precision

Grammar (mood/tense) Set 1 Incorrect use of conditional tense Il serait tr`es utile si vous pourriez temporairement restaurer les ::::::: serveurs DNS d’Orange. −→ Il serait tr`es utile si vous pouviez temporairement restaurer les serveurs DNS d’Orange.

(on 10,000 sentences)

Norton CNet Brico

Impact on MT

Flags

Evaluated

Incorrect

Precision

5 14 11

5 14 11

2 0 4

60% 100% 64%

(for 3 sentences of 3 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

1 1 1 1

(33%) (33%) (33%) (33%)

Raw better 2 2 2 2

(67%) (67%) (67%) (67%)

268

About same

Pre-ed better

Unclear

-

-

-

D.2 Grammar (mood/tense)

D.2.2

confParticipIsUs

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (mood/tense) Set 2 Confusion of past participle word endings (-i, -is, -it) voila ¸ca va durer un bon moment je vous t´el quand c’est :::: finit ... −→ voila ¸ca va durer un bon moment je vous t´el quand c’est fini ...

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

12 18 14

12 18 14

0 0 1

100% 100% 93%

(for 9 sentences of 9 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a -

n/a -

n/a 1 (11%) -

n/a 8 (89%) 9 (100%) 9 (100%)

n/a -

269

D. RULE BY RULE RESULTS

D.2.3

futureCond

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (mood/tense) Set 1 Use of future tense instead of conditional Je ::::::::::: souhaiterai savoir s’il existe un moyen de r´ecup´erer les param`etres de sauvegarde ? −→ Je souhaiterais savoir s’il existe un moyen de r´ecup´erer les param`etres de sauvegarde?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

12 13 27

12 13 27

0 3 2

100% 77% 93%

(for 2 sentences of 2 flagged in 10,000; Norton forum data) No impact

Raw better

About same

-

-

-

270

Pre-ed better 2 2 2 2

(100%) (100%) (100%) (100%)

Unclear

-

D.2 Grammar (mood/tense)

D.2.4

useImperative

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (mood/tense) Set 1 Use of indicative instead of imperative (homophones) CD ins´er´e, quand tu d´emarres le PC, regardes en bas si tu vois :::::::: “Boot... −→ CD ins´er´e, quand tu d´emarres le PC, regarde en bas si tu vois “Boot...

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

4 21 13

4 21 13

1 2 1

75% 90% 92%

(for 2 sentences of 2 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

1 (50%) -

1 (50%)

1 (50%) 1 (50%) 1 (50%)

1 (50%) 2 (100%) -

-

271

D. RULE BY RULE RESULTS

D.2.5

useSubjunctive

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

Grammar (mood/tense) Set 2 Use subjunctive Bien que je ne ::::::::::: comprends pas ce que tu as fait quand tu ´ecris : −→ Bien que je ne comprenne pas ce que tu as fait quand tu ´ecris :

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

29 n/a n/a

29 n/a n/a

8 n/a n/a

72% n/a n/a

(for 21 sentences of 29 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

8 (38%) 8 (38%) 9 (43%) 16 (76%)

Raw better 7 8 4 3

(33%) (38%) (19%) (14%)

272

About same

Pre-ed better

Unclear

3 (14%) 1 (5%) -

3 (14%) 3 (14%) 7 (33%) 1 (5%)

1 (5%) 1 (5%) 1 (5%)

D.2 Grammar (mood/tense)

D.2.6

wrongVerbForm

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (mood/tense) Set 2 Wrong verb form Et aussi, quand que tu avais installer norton, as tu eu des prob::::::: lemes d’installations? −→ Et aussi, quand que tu avais install´e norton, as tu eu des problemes d’installations?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

190 215 265

50 50 50

0 3 0

100% 94% 100%

(for 50 sentences of 158 flagged in 10,000; Norton forum data) No impact

Raw better

About same

1 (2%) 1 (2%) 1 (2%)

5 (10%) 5 (10%) 4 (8%) -

3 (6%) 4 (8%) 2 (4%) -

273

Pre-ed better 38 40 41 49

(76%) (80%) (82%) (98%)

Unclear

3 (6%) 3 (6%) -

D. RULE BY RULE RESULTS

D.3 D.3.1

Grammar (sequence) impossibleTagSeq

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (sequence) Set 2 Incorrect word sequence Il suffit de d´emarre ton PC avec le CD ( attention , logiciel en ::::::::::: anglais , mais simple d’emploi ). −→ Il suffit de d´emarrer ton PC avec le CD ( attention , logiciel en anglais , mais simple d’emploi ).

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

55 109 188

50 50 50

13 9 16

74% 82% 68%

(for 50 sentences of 76 flagged in 10,000; Norton forum data) No impact

Raw better

About same

9 (18%) 9 (18%) 3 (6%) 7 (14%)

7 (14%) 8 (16%) 3 (6%) 6 (12%)

7 (14%) 3 (6%) 1 (2%) 1 (2%)

274

Pre-ed better 25 29 43 36

(50%) (58%) (86%) (72%)

Unclear

2 (4%) 1 (2%) -

D.3 Grammar (sequence)

D.3.2

misspelledExpr

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

Grammar (sequence) Set 1 Spelling error in common expression La licence de ton produit est de quand `a elle sauvegard´ee sur :::::::::::: ton compte Norton en ligne. −→ La licence de ton produit est quant ` a elle sauvegard´ee sur ton compte Norton en ligne.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

19 21 34

19 21 34

7 7 6

63% 67% 82%

(for 17 sentences of 17 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

6 (35%) 6 (35%) 11 (65%)

Raw better 4 3 2 2

(24%) (18%) (12%) (12%)

275

About same

Pre-ed better

Unclear

2 (12%) 1 (6%) 1 (6%) -

4 (24%) 5 (29%) 14 (82%) 4 (24%)

1 (6%) 2 (12%) -

D. RULE BY RULE RESULTS

D.3.3

missingPas

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Grammar (sequence) Set 2 Negation missing “pas” je ne sais ` a quoi sert cette diff´erence ? ::::::::::: −→ je ne sais pas ` a quoi sert cette diff´erence ?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

2 9 5

2 9 5

0 3 0

100% 67% 100%

(for 9 sentences of 9 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a 3 (33%) 3 (33%) 4 (44%)

n/a -

n/a 2 (22%) -

n/a 4 (44%) 6 (67%) 5 (56%)

n/a -

276

D.4 Homophone confusion

D.4 D.4.1

Homophone confusion ` aA

Category Set Description Example

Precision

Homophone confusion Set 1 Confusion of the homophones “a”, “as” and “`a” Avez-vous r´eussi a: faire l’installation de NIS 2010 en fran¸cais? −→ Avez-vous r´eussi ` a faire l’installation de NIS 2010 en fran¸cais?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

309 277 500

50 50 50

2 0 1

96% 100% 98%

Norton CNet Brico

Impact on MT

(for 50 sentences of 349 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

16 16 17 18

(32%) (32%) (34%) (36%)

Raw better 2 2 1 12

(4%) (4%) (2%) (24%)

277

About same 5 (10%) 3 (6%) 5 (10%) 1 (2%)

Pre-ed better 25 28 27 18

(50%) (56%) (54%) (36%)

Unclear

2 (4%) 1 (2%) 1 (2%)

D. RULE BY RULE RESULTS

D.4.2

¸caSa

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “¸ca” and “sa” Merci beaucoup, je fais sa de suite :: −→ Merci beaucoup, je fais ¸ca de suite

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

24 23 29

24 23 29

0 6 7

100% 74% 76%

(for 16 sentences of 16 flagged in 10,000; Norton forum data) No impact

Raw better

1 (6%) 1 (6%) -

2 (13%) 1 (6%) 2 (13%)

278

About same 6 3 2 3

(38%) (19%) (13%) (19%)

Pre-ed better

Unclear

7 (44%) 11 (69%) 14 (88%) 10 (63%)

1 (6%)

D.4 Homophone confusion

D.4.3

ceSe

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “ce” and “se” C’est se qui s’est pass´e il y un an lorsque j’ai chang´e mon ordi: nateur. −→ C’est ce qui s’est pass´e il y un an lorsque j’ai chang´e mon ordinateur.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

15 20 30

15 20 30

0 5 3

100% 75% 90%

(for 43 sentences of 43 flagged in 10,000; Norton forum data) No impact

Raw better

About same

9 (21%) 9 (21%) -

3 (7%) 8 (19%) 8 (19%) 7 (16%)

10 (23%) 2 (5%) 6 (14%) 2 (5%)

279

Pre-ed better 19 23 27 31

(44%) (54%) (63%) (72%)

Unclear

2 1 2 3

(5%) (2%) (5%) (7%)

D. RULE BY RULE RESULTS

D.4.4

ciSi

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “ci”, “si” and “s’y” Cela permet, mˆeme si on :: si prend `a la derni`ere minute, de recevoir les mises ` a jour. −→ Cela permet, mˆeme si on s’y prend `a la derni`ere minute, de recevoir les mises `a jour.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

0 0 0

-

-

-

(for 1 sentences of 1 flagged in 10,000; Norton forum data) No impact

Raw better

-

-

About same 1 1 1 1

280

(100%) (100%) (100%) (100%)

Pre-ed better

Unclear

-

-

D.4 Homophone confusion

D.4.5

desD` es

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 2 Confusion of the homophones “des”, “d`es” and “d´es” D´ es que je le peux, je fais ce que tu me conseilles. :::: −→ D`es que je le peux, je fais ce que tu me conseilles.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

8 8 18

8 8 18

0 3 5

100% 63% 72%

(for 4 sentences of 4 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

4 (100%)

-

-

4 (100%) 4 (100%) 4 (100%) -

-

281

D. RULE BY RULE RESULTS

D.4.6

divHomophone

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 2 Miscelaneous homophone confusions NIS 2013 et le gestionnaire de :::::: taches −→ NIS 2013 et le gestionnaire de tˆ aches

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

50 154 67

50 50 50

0 0 0

100% 100% 100%

(for 43 sentences of 43 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a 17 (40%) 17 (40%)

n/a 3 (7%) 1 (2%) 2 (5%)

n/a 2 (5%) 8 (19%) -

n/a 21 (49%) 34 (79%) 24 (56%)

n/a -

282

D.4 Homophone confusion

D.4.7

duDˆ u

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “du” and “dˆ u” Tu as ::: du mettre une protection par mot de passe, ¸ca ne se met pas par d´efaut. −→ Tu as dˆ u mettre une protection par mot de passe, ¸ca ne se met pas par d´efaut.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

19 23 33

19 23 33

0 0 1

100% 100% 97%

(for 31 sentences of 31 flagged in 10,000; Norton forum data) No impact

Raw better

About same

-

1 (3%) 2 (7%) 1 (3%) 4 (13%)

5 (16%) 3 (10%) 1 (3%) 1 (3%)

283

Pre-ed better 24 26 26 25

(77%) (84%) (84%) (81%)

Unclear

1 (3%) 3 (10%) 1 (3%)

D. RULE BY RULE RESULTS

D.4.8

laL` a

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “la” and “l`a” Par contre, :: la je ne vois plus vraiment de solution pour ce souci ... −→ Par contre, l` a je ne vois plus vraiment de solution pour ce souci ...

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

20 47 61

20 47 50

0 4 8

100% 91% 84%

(for 50 sentences of 54 flagged in 10,000; Norton forum data) No impact

Raw better

About same

4 (8%) 4 (8%) -

9 (18%) 10 (20%) 7 (14%) 26 (52%)

11 (22%) 10 (20%) 15 (30%) 7 (14%)

284

Pre-ed better 25 22 27 16

(50%) (44%) (54%) (32%)

Unclear

1 4 1 1

(2%) (8%) (2%) (2%)

D.4 Homophone confusion

D.4.9

maM’a

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “ma” and “m’a” j’ai fait comme tu ma dit plus haut j’ai enlev´e Norton 360 et j’ai :: r´einstall´e Norton 360. −→ j’ai fait comme tu m’as dit plus haut j’ai enlev´e Norton 360 et j’ai r´einstall´e Norton 360.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

2 5 0

2 5 -

0 0 -

100% 100% -

(for 1 sentences of 1 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

-

1 (100%) 1 (100%) -

1 (100%) 1 (100%)

-

285

D. RULE BY RULE RESULTS

D.4.10

niN’y

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “ni” and “n’y” j’ai desinstaller et reinstaller norton mais rien :: ni fait, mon widget norton a disparu. −→ j’ai desinstaller et reinstaller norton mais rien n’y fait, mon widget norton a disparu.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

2 2 3

2 2 3

0 0 0

100% 100% 100%

(for 1 sentence of 1 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a -

n/a -

n/a 1 (100%) 1 (100%)

n/a 1 (100%) -

n/a -

286

D.4 Homophone confusion

D.4.11

nounVerbConf

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 2 Confusion of verb/noun homophones Quand je :::: clic sur ton lien j’arrive `a une page qui m’indique norton internet security 2011 −→ Quand je clique sur ton lien j’arrive `a une page qui m’indique norton internet security 2011

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

26 72 24

26 50 24

3 27 11

88% 46% 54%

(for 20 sentences of 27 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

5 (25%) 5 (25%) 4 (20%)

4 (20%) 5 (25%) 3 (15%)

1 (5%) -

9 (45%) 8 (40%) 20 (100%) 13 (65%)

1 (5%) 2 (10%) -

287

D. RULE BY RULE RESULTS

D.4.12

onOntSont

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “on” and “ont” mais alors ils :: on dˆ u se tromper −→ mais alors ils ont dˆ u se tromper

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

7 2 13

7 2 13

0 1 3

100% 50% 77%

(for 2 sentences of 2 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a -

n/a 1 (50%)

n/a -

n/a 2 (100%) 2 (100%) 1 (50%)

n/a -

288

D.4 Homophone confusion

D.4.13

ouO` u

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

Homophone confusion Set 1 Confusion of the homophones “ou” and “o` u” d’ou ma question... :::: −→ d’o` u ma question...

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

7 14 25

7 14 25

0 0 0

100% 100% 100%

(for 8 sentences of 8 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

-

Raw better 1 1 2 1

(13%) (13%) (25%) (13%)

289

About same -

Pre-ed better

Unclear

7 7 6 6

1 (13%)

(88%) (88%) (75%) (75%)

D. RULE BY RULE RESULTS

D.4.14

qu’elleQuelle

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “qu’elle”, “quel” and “quelle” Qu’elle est la diff´erence et peut on charger l’une ou l’autre? ::::::: −→ Quelle est la diff´erence et peut on charger l’une ou l’autre?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

0 0 1

1

0

100%

(for 6 sentences of 6 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

1 (17%)

-

6 (100%) 6 (100%) 6 (100%) 5 (83%)

-

290

D.4 Homophone confusion

D.4.15

quelQueQuelque

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “quelque”, “quel que” and “quelle que” La licence reste “attach´e” `a un utilisateur, :::::::: quelque soit le PC. −→ La licence reste “attach´e” `a un utilisateur, quel que soit le PC.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

3 5 1

3 5 1

0 0 0

100% 100% 100%

(for 0 sentences of 0 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

-

-

-

-

291

D. RULE BY RULE RESULTS

D.4.16

soiSoisSoit

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “soit”, “sois” and “soi” il ne peut plus faire de sauvegarde soit disant que mon Backup :::::::::: serait configur´e a` 0 −→ il ne peut plus faire de sauvegarde soi disant que mon Backup serait configur´e `a 0

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

0 0 0

-

-

-

(for 2 sentences of 2 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

1 (50%) 2 (100%)

1 (50%) -

2 (100%) 2 (100%) -

-

292

D.4 Homophone confusion

D.4.17

surSˆ ur

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 2 Confusion of the homophones “sur” and “sˆ ur” Et je suis sur d’avoir vu au d´ebut : ::: −→ Et je suis sˆ ur d’avoir vu au d´ebut :

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

5 20 58

5 20 50

1 11 19

80% 45% 62%

(for 9 sentences of 9 flagged in 10,000; Norton forum data) No impact

Raw better

About same

1 (11%)

1 (11%)

1 (11%) 1 (11%) -

293

Pre-ed better

Unclear

8 8 8 7

1 (11%) -

(89%) (89%) (89%) (78%)

D. RULE BY RULE RESULTS

D.4.18

t’esTes

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 2 Confusion of the homophones “tes” and “t’es” tu es sˆ ur que tu tes pas tromp´e? ::: −→ tu es sˆ ur que tu t’es pas tromp´e?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

0 0 0

-

-

-

(for 2 sentences of 2 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

-

1 (50%) 1 (50%) 2 (100%) 2 (100%)

1 (50%) -

1 (50%) -

294

D.4 Homophone confusion

D.4.19

tousTout

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Homophone confusion Set 1 Confusion of the homophones “tous” and “tout” tous ceci me fait un peut peur :::: −→ tout ceci me fait un peut peur

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

20 23 24

20 23 24

0 1 0

100% 96% 100%

(for 13 sentences of 13 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

5 (39%) 5 (39%) 1 (8%) 7 (54%)

1 (8%) 1 (8%) 2 (15%) -

2 (15%) -

5 (39%) 7 (54%) 10 (77%) 5 (39%)

1 (8%)

295

D. RULE BY RULE RESULTS

D.5 D.5.1

Punctuation addComma

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 1 Missing comma je ne comprend pas ce qui s’est pass´e: mais j’ai trois comptes Norton. −→ je ne comprend pas ce qui s’est pass´e, mais j’ai trois comptes Norton.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

469 464 676

50 50 50

0 2 0

100% 96% 100%

(for 50 sentences of 372 flagged in 10,000; Norton forum data) No impact

Raw better

About same

4 (8%) 4 (8%) -

6 (12%) 4 (8%) 4 (8%) 4 (8%)

16 (32%) 9 (18%) 11 (22%) 2 (4%)

296

Pre-ed better 18 30 32 40

(36%) (60%) (64%) (80%)

Unclear

6 (12%) 3 (6%) 3 (6%) 4 (8%)

D.5 Punctuation

D.5.2

addCommaPP

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 2 Missing comma after prepositional phrase Dans mon registre je n’ai aucune cl´e qui commencent par SMRxxxx −→ Dans mon registre, je n’ai aucune cl´e qui commencent par SMRxxxx

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

79 72 210

50 50 50

0 5 0

100% 90% 100%

(for 50 sentences of 79 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a 3 (6%) -

n/a 15 (30%) 1 (2%) 2 (4%)

n/a -

n/a 32 (64%) 49 (98%) 48 (96%)

n/a -

297

D. RULE BY RULE RESULTS

D.5.3

addElision

Category Set Description Example

Precision

Punctuation Set 1 Missing or incorrect elision Est-ce que il s’agit ici de Norton Online Backup ou du backup ::::: de Norton 360? −→ Est-ce qu’il s’agit ici de Norton Online Backup ou du backup de Norton 360?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

93 226 420

50 50 50

4 3 1

92% 94% 98%

Norton CNet Brico

Impact on MT

(for 50 sentences of 129 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

35 35 34 22

(70%) (70%) (68%) (44%)

Raw better

About same

Pre-ed better

Unclear

1 (2%) 4 (8%) 5 (10%) -

5 (10%) 2 (4%) 1 (2%) 1 (2%)

8 (16%) 7 (14%) 10 (20%) 27 (54%)

1 (2%) 2 (4%) -

298

D.5 Punctuation

D.5.4

addHyphen

Category Set Description Example

Punctuation Set 1 Missing hyphenation c’est a dire ? ` :::::::::: −→ c’est-` a-dire ?

Precision

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

244 385 633

50 50 50

7 3 1

86% 94% 98%

Norton CNet Brico

Impact on MT

(for 50 sentences of 186 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

13 13 4 13

(26%) (26%) (8%) (26%)

Raw better 5 (10%) 11 (22%) 6 (12%) 4 (8%)

299

About same 4 2 3 2

(8%) (4%) (6%) (4%)

Pre-ed better 25 24 37 30

(50%) (48%) (74%) (60%)

Unclear

3 (6%) 1 (2%)

D. RULE BY RULE RESULTS

D.5.5

addSpaceWords

Category Set Description Example

Precision

Punctuation Set 1 Missing space J’ai essay´e d’uploader un fichier de ::: 4ko seulement, et le probl`eme est le mˆeme. −→ J’ai essay´e d’uploader un fichier de 4 ko seulement, et le probl`eme est le mˆeme.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

185 241 758

50 50 50

0 0 0

100% 100% 100%

Norton CNet Brico

Impact on MT

(for 50 sentences of 109 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

36 36 23 37

(72%) (72%) (46%) (74%)

Raw better

About same

3 2 4 1

3 3 3 2

(6%) (4%) (8%) (2%)

300

(6%) (6%) (6%) (4%)

Pre-ed better 6 (12%) 8 (16%) 18 (36%) 9 (18%)

Unclear

2 1 2 1

(4%) (2%) (4%) (2%)

D.5 Punctuation

D.5.6

addSpacePunct

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 1 Incorrect spaces before or after punctuation Comme l’a stipul´e Misstigry:, il s’agit de services. −→ Comme l’a stipul´e Misstigry, il s’agit de services.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

2552 n/a n/a

50 n/a n/a

0 n/a n/a

100% n/a n/a

(for 50 sentences of 2430 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

50 (100%) 50 (100%) 48 (96%) 46 (92%)

-

1 (2%) 4 (8%)

1 (2%) -

-

301

D. RULE BY RULE RESULTS

D.5.7

addSpacePunct2

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 2 Missing space je pense que je n’ai pas dˆ u activer la cl´e du produit lors de mon reformatage.comment le faire? ::::::::::::::::::::: −→ je pense que je n’ai pas dˆ u activer la cl´e du produit lors de mon reformatage. Comment le faire?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

n/a n/a n/a

n/a n/a n/a

n/a n/a n/a

n/a n/a n/a

(for 2 sentences of 3 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

-

-

1 (50%) -

2 (100%) 2 (100%) 1 (50%) 2 (100%)

-

302

D.5 Punctuation

D.5.8

avoid(char)

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 2 Avoid using “(s)” for optional plurals El´ement(s) de donn´ees du Registre ::::::::: infect´e(s): −→ El´ements de donn´ees du Registre infect´ es: ::::::: (on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

5 18 10

5 18 10

0 0 0

100% 100% 100%

(for 9 sentences of 9 flagged in 10,000; Norton forum data) No impact

Raw better

About same

-

1 (11%)

2 (22%) 2 (22%) 1 (11%) -

303

Pre-ed better

Unclear

7 6 8 8

1 (11%) -

(78%) (67%) (89%) (89%)

D. RULE BY RULE RESULTS

D.5.9

avoidCapSent

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 2 Do not write an entire sentence in uppercase NORTON SECURITY POUR MAC EN FRANCAIS SVP :::::::::::::::::::::::::::::::::::::::::::::::::::::::: −→ Norton Security pour Mac en fran¸cais svp

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

303 102 228

50 50 50

3 6 0

94% 88% 100%

(for 50 sentences of 303 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a 8 (16%) 1 (2%) 2 (4%)

n/a 1 (2%) 1 (2%)

n/a 3 (6%) 1 (2%) 2 (4%)

n/a 38 (76%) 48 (96%) 45 (90%)

n/a -

304

D.5 Punctuation

D.5.10

avoidPunct

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 1 Unnecessary punctuation Reviens nous voir si tu as d’autres questions ,: et tiens nous au courant du r´esultat ! −→ Reviens nous voir si tu as d’autres questions et tiens nous au courant du r´esultat !

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

230 171 467

50 50 50

0 0 0

100% 100% 100%

(for 50 sentences of 258 flagged in 10,000; Norton forum data) No impact

Raw better

About same

2 (4%) 2 (4%) -

15 (30%) 8 (16%) 7 (14%) 19 (38%)

10 (20%) 9 (18%) 9 (18%) 4 (8%)

305

Pre-ed better 13 24 29 16

(26%) (48%) (58%) (32%)

Unclear

10 (20%) 7 (14%) 5 (10%) 11 (22%)

D. RULE BY RULE RESULTS

D.5.11

checkEOS/BOS

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

Punctuation Set 2 Sentence boundary without punctuation J’esp`ere avoir pu mieux vous ´eclairer :Il est vrai que l’´evolution vers le cloud a oblig´e `a quelques changements. −→ J’esp`ere avoir pu mieux vous ´eclairer. Il est vrai que l’´evolution vers le cloud a oblig´e `a quelques changements.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

200 405 190

50 50 50

3 5 2

94% 90% 96%

(for 50 sentences of 96 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

4 (8%) 4 (8%) 4 (8%) 5 (10%)

Raw better

About same

2 2 1 2

2 (4%) 3 (6%) 3 (6%) -

(4%) (4%) (2%) (4%)

306

Pre-ed better 38 38 42 41

(76%) (76%) (84%) (82%)

Unclear

4 (8%) 3 (6%) 2 (4%)

D.5 Punctuation

D.5.12

wrongCap

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Punctuation Set 2 Wrong capitalisation Norton Ghost est plus complet et ::: Te permet de faire un image sur un disque dur ou sur un DVD. −→ Norton Ghost est plus complet et te permet de faire un image sur un disque dur ou sur un DVD.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

n/a n/a n/a

n/a n/a n/a

n/a n/a n/a

n/a n/a n/a

(for 1 sentences of 1 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

1 (100%) 1 (100%) 1 (100%)

-

-

1 (100%) -

-

307

D. RULE BY RULE RESULTS

D.5.13

wrongElision

Category Set Description Example

Precision

Punctuation Set 2 Incorrect elision Le “DiskPart” d’XP ne permet pas :::: −→ Le “DiskPart” de XP ne permet pas

(on 10,000 sentences)

Norton CNet Brico

Impact on MT

Flags

Evaluated

Incorrect

Precision

0 0 0

-

-

-

(for 5 sentences of 5 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

3 3 1 3

(60%) (60%) (20%) (60%)

Raw better

About same

-

1 (20%) -

308

Pre-ed better 2 2 3 2

(40%) (40%) (60%) (40%)

Unclear

-

D.6 Informal

D.6 D.6.1

Informal avoidAnglicism

Category Set Description Example

Precision

Informal Set 2 Avoid anglicisms je vais rebooter et voir si c’est toujours le cas. :::::::: −→ je vais red´emarrer et voir si c’est toujours le cas.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

96 137 2

50 50 2

1 8 1

98% 84% 50%

Norton CNet Brico

Impact on MT

(for 50 sentences of 89 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

20 20 13 13

(40%) (40%) (26%) (26%)

Raw better

About same

6 (12%) 9 (18%) 17 (34%) 14 (28%)

3 (6%) 2 (4%) 1 (2%) -

309

Pre-ed better 14 16 17 21

(28%) (32%) (34%) (42%)

Unclear

7 (14%) 3 (6%) 2 (4%) 2 (4%)

D. RULE BY RULE RESULTS

D.6.2

avoidDirQuest

Category Set Description Example

Precision

Informal Set 2 Use inversion instead of direct questions Tu as lu le tuto sur le forum ? ::::: −→ As-tu lu le tuto sur le forum ?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

28 45 79

28 45 50

0 4 3

100% 91% 94%

Norton CNet Brico

Impact on MT

(for 50 sentences of 66 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

1 1 37 29

(2%) (2%) (74%) (58%)

Raw better 24 23 4 13

(48%) (46%) (8%) (26%)

310

About same

Pre-ed better

Unclear

3 (6%) 2 (4%)

21 (42%) 26 (52%) 9 (18%) 4 (8%)

1 (2%) 2 (4%)

D.6 Informal

D.6.3

avoidEstCeQue

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Informal Set 2 Use inversion instead of “est-ce que” for questions Est ce que je dois d´esinstaller et r´einstaller Norton apr`es coup ::::::::::::: ou pas? −→ Dois-je d´esinstaller et r´einstaller Norton apr`es coup ou pas?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

8 22 37

8 22 37

0 0 0

100% 100% 100%

(for 50 sentences of 55 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

5 (10%) 5 (10%) 36 (72%) 32 (64%)

9 (18%) 15 (30%) 7 (14%) 5 (10%)

6 (12%) 2 (4%) 1 (2%)

28 (56%) 26 (52%) 6 (12%) 10 (20%)

311

Unclear

2 2 1 2

(4%) (4%) (2%) (4%)

D. RULE BY RULE RESULTS

D.6.4

avoidAbbrev

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Informal Set 2 Avoid abbreviations Je suis all´e sur le site, j’ai t´el´echarg´e l’appli. ::::: −→ Je suis all´e sur le site, j’ai t´el´echarg´e l’application.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

468 503 448

50 50 50

0 0 14

100% 100% 72%

(for 50 sentences of 417 flagged in 10,000; Norton forum data) No impact

Raw better

About same

1 (2%) 1 (2%) 30 (60%)

1 (2%) 4 (8%) 3 (6%)

12 (24%) 3 (6%) 1 (2%)

312

Pre-ed better 35 45 46 15

(70%) (90%) (92%) (30%)

Unclear

1 (2%) 1 (2%) 1 (2%)

D.6 Informal

D.6.5

avoidInformal

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Informal Set 2 Avoid colloquial language Voil` a deux jours que je gal` ere `a essayer de cr´eer ce SRD person::::: nalis´e sans succ`es. −→ Voil`a deux jours que je me bats `a essayer de cr´eer ce SRD personnalis´e sans succ`es.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

63 146 272

50 50 50

0 0 8

100% 100% 84%

(for 41 sentences of 41 flagged in 10,000; Norton forum data) No impact

Raw better

About same

4 (10%) 4 (10%) 3 (7%) 2 (5%)

7 (17%) 9 (22%) 6 (15%) 9(22%)

7 (17%) 4 (10%) 1 (2%) 1 (2%)

313

Pre-ed better 22 23 29 25

(54%) (56%) (71%) (61%)

Unclear

1 (2%) 1 (2%) 2 (5%) 4 (10%)

D. RULE BY RULE RESULTS

D.6.6

missingNe

Category Set Description Example

Precision

Informal Set 2 Incomplete negation je vais relire ce que tu m’as dit sur les antivirus pour ˆetre sure de pas avoir de probl`emes! :::::: −→ je vais relire ce que tu m’as dit sur les antivirus pour ˆetre sure de ne pas avoir de probl`emes!

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

100 182 365

50 50 50

7 6 4

86% 88% 92%

Norton CNet Brico

Impact on MT

(for 50 sentences of 180 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

11 11 25 35

(22%) (22%) (50%) (70%)

Raw better

About same

7 (14%) 8 (16%) 4 (8%) 1 (2%)

7 (14%) 1 (2%) -

314

Pre-ed better 24 28 20 13

(48%) (56%) (40%) (26%)

Unclear

1 (2%) 3 (6%) 1 (2%)

D.7 Simplification

D.7 D.7.1

Simplification avoidAdverb

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

Simplification Set 2 Avoid unnecessary adverbs J’ai ::::::::: vraiment besoin d’aide ! −→ J’ai besoin d’aide !

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

154 121 267

110 94 150

0 0 3

100% 100% 98%

(for 15 sentences of 15 flagged in 10,000; Norton forum data) No impact

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

3 (3%) 1 (1%) -

Raw better 1 25 2 4

(7%) (25%) (2%) (4%)

315

About same 8 (8%) 2 (2%) 1 (1%)

Pre-ed better 14 65 91 95

(93%) (64%) (89%) (93%)

Unclear

1 (1%) 6 (6%) 2 (2%)

D. RULE BY RULE RESULTS

D.7.2

avoidCeQui

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Simplification Set 2 Split long sentences at “ce qui” Je passe la souris (sans cliquer) sur la banni`ere en question,::: ce qui me donne en bas de l’´ecran le nom de la soci´et´e de publicit´e. ::: −→ Je passe la souris (sans cliquer) sur la banni`ere en question. Ceci me donne en bas de l’´ecran le nom de la soci´et´e de publicit´e.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

21 7 44

21 7 44

0 0 0

100% 100% 100%

(for n/a sentences of n/a flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a -

n/a 4 (25%) 1 (6%)

n/a 1 (6%) -

n/a 8 (50%) 16 (100%) 15 (94%)

n/a 3 (19%) -

316

D.7 Simplification

D.7.3

avoidCleftSent

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Simplification Set 2 Avoid cleft sentences C’est Norton qui l’a supprim´e ? :::::::::::::::: −→ Norton l’a supprim´e ?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

16 13 38

16 13 38

0 0 0

100% 100% 100%

(for 32 sentences of 35 flagged in 10,000; Norton forum data) No impact

Raw better

About same

-

8 (25%) 5 (16%) 6 (19%) 3 (9%)

2 (6%) 2 (6%) 1 (3%) -

317

Pre-ed better 20 23 24 29

(63%) (72%) (75%) (91%)

Unclear

2 (6%) 2 (6%) 1 (3%) -

D. RULE BY RULE RESULTS

D.7.4

avoidConjBOS

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Simplification Set 2 Avoid using a conjunction at beginning of sentence Et Merci ` a qui prendra la peine de me conseiller. :: −→ Merci ` a qui prendra la peine de me conseiller.

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

202 137 209

50 50 50

0 0 0

100% 100% 100%

(for 50 sentences of 202 flagged in 10,000; Norton forum data) No impact

Raw better

About same

Pre-ed better

Unclear

n/a -

n/a 2 (4%) -

n/a 1 (2%) 1 (2%) -

n/a 46 (92%) 49 (98%) 50 (100%)

n/a 1 (2%) -

318

D.7 Simplification

D.7.5

avoidNon

Category Set Description Example

Precision

Norton CNet Brico

Impact on MT

ACCEPT (FTI) ACCEPT (AMT) Lucy Systran

Simplification Set 2 Remove “non” at end of interrogative sentence Je ne suis pas devenue plus organis´ee, mais plus rassur´ee, c’est d´ej` a un pas vers la s´er´enit´e, :::: non? −→ Je ne suis pas devenue plus organis´ee, mais plus rassur´ee, c’est d´ej`a un pas vers la s´er´enit´e?

(on 10,000 sentences) Flags

Evaluated

Incorrect

Precision

4 7 20

4 7 20

0 0 0

100% 100% 100%

(for 3 sentences of 3 flagged in 10,000; Norton forum data) No impact

Raw better

About same

-

-

-

319

Pre-ed better 3 3 3 3

(100%) (100%) (100%) (100%)

Unclear

-

D.7.6 avoidPresPart

Category:     Simplification
Set:          2
Description:  Avoid present participle
Example:      Effectuez un test avec une autre suite de sécurité ayant les mêmes
              fonctionalités que NIS
              −→ Effectuez un test avec une autre suite de sécurité qui a les mêmes
              fonctionalités que NIS ?

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           140      131    257
Evaluated       50       50     50
Incorrect       0        0      5
Precision       100%     100%   90%

Impact on MT (for 50 sentences of 94 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       2 (4%)         2 (4%)         -           -
Raw better      8 (16%)        9 (18%)        10 (20%)    10 (20%)
About same      10 (20%)       4 (8%)         3 (6%)      3 (6%)
Pre-ed better   28 (56%)       31 (62%)       34 (68%)    31 (62%)
Unclear         2 (4%)         4 (8%)         3 (6%)      6 (12%)

D.8 Reformulation

D.8.1 addDoisJe

Category:     Reformulation
Set:          3
Description:  Add "dois-je" in questions with infinitive
Example:      Comment faire pour en supprimer une? −→ Comment dois-je faire pour en supprimer une?

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           21       9      21
Evaluated       21       9      21
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 36 sentences of 36 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       2 (6%)         2 (6%)         -           -
Raw better      2 (6%)         -              10 (28%)    26 (72%)
About same      -              2 (6%)         -           -
Pre-ed better   31 (86%)       28 (78%)       19 (53%)    9 (25%)
Unclear         1 (3%)         4 (11%)        7 (19%)     1 (3%)

D.8.2 avoidExpression

Category:     Reformulation
Set:          3
Description:  Replace expression by more translatable alternative
Example:      Tu as quand même fait ta mise à jour en version 2012 comme je l'ai expliqué
              plus haut ?
              −→ Tu as pourtant fait ta mise à jour en version 2012 comme je l'ai expliqué
              plus haut ?

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           92       78     80
Evaluated       62       52     51
Incorrect       0        0      3
Precision       100%     100%   94%

Impact on MT (for 87 sentences of 180 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       7 (8%)         8 (9%)         1 (1%)      1 (1%)
Raw better      4 (5%)         27 (31%)       10 (11%)    15 (17%)
About same      3 (3%)         4 (5%)         1 (1%)      4 (5%)
Pre-ed better   71 (82%)       46 (53%)       70 (80%)    64 (74%)
Unclear         2 (2%)         2 (2%)         5 (6%)      3 (3%)

D.8.3 avoidGreeting

Category:     Reformulation
Set:          3
Description:  Replace greetings by standard expressions
Example:      Hello ElChikito, et bienvenue sur le Forum!! −→ Bonjour ElChikito, et bienvenue sur le Forum!!

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           19       100    177
Evaluated       19       50     50
Incorrect       0        23     2
Precision       100%     54%    96%

Impact on MT (for 19 sentences of 19 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       3 (16%)        3 (16%)        -           4 (21%)
Raw better      4 (21%)        3 (16%)        1 (5%)      2 (11%)
About same      4 (21%)        5 (26%)        2 (11%)     -
Pre-ed better   8 (42%)        6 (32%)        16 (84%)    13 (68%)
Unclear         -              2 (11%)        -           -
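As a side note, the precision figures reported in these tables are consistent with the ratio precision = (evaluated − incorrect) / evaluated; for instance, for avoidGreeting on CNet, (50 − 23) / 50 = 54%. The tiny check below is illustrative only:

    def precision(evaluated: int, incorrect: int) -> float:
        """Share of evaluated flags that were not judged incorrect."""
        return (evaluated - incorrect) / evaluated

    # avoidGreeting, CNet column: 50 evaluated, 23 incorrect -> 54%
    assert round(precision(50, 23) * 100) == 54
    # avoidGreeting, Brico column: 50 evaluated, 2 incorrect -> 96%
    assert round(precision(50, 2) * 100) == 96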

D.8.4 avoidIlFautQue

Category:     Reformulation
Set:          3
Description:  Do not use "il faut que"
Example:      Il faut que je vérifie à nouveau. −→ Je dois vérifier à nouveau.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           5        12     24
Evaluated       5        12     24
Incorrect       1        2      3
Precision       80%      83%    88%

Impact on MT (for 10 sentences of 10 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       -              -              6 (60%)     -
Raw better      2 (20%)        2 (20%)        1 (10%)     1 (10%)
About same      -              -              -           1 (10%)
Pre-ed better   7 (70%)        7 (70%)        3 (30%)     7 (70%)
Unclear         1 (10%)        1 (10%)        -           1 (10%)

D.8.5 avoidPlus

Category:     Reformulation
Set:          3
Description:  Remove "plus" before "rien" or "aucun"
Example:      Non, il n'y a plus aucun Norton installé. −→ Non, il n'y a aucun Norton installé.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           15       28     8
Evaluated       15       28     8
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 23 sentences of 23 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       3 (13%)        3 (13%)        1 (4%)      -
Raw better      7 (30%)        8 (35%)        10 (44%)    10 (44%)
About same      5 (22%)        1 (4%)         1 (4%)      1 (4%)
Pre-ed better   6 (26%)        11 (48%)       11 (48%)    12 (52%)
Unclear         2 (9%)         -              -           -

D.8.6 avoidPourQue

Category:     Reformulation
Set:          3
Description:  Use "afin que" instead of "pour que"
Example:      J'ai crée un sujet indépendant pour que ce soit plus facile de faire le suivi.
              −→ J'ai crée un sujet indépendant afin que ce soit plus facile de faire le suivi.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           40       19     64
Evaluated       40       19     50
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 41 sentences of 41 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       n/a            4 (10%)        39 (95%)    40 (98%)
Raw better      n/a            4 (10%)        1 (2%)      -
About same      n/a            1 (2%)         1 (2%)      1 (2%)
Pre-ed better   n/a            32 (78%)       -           -
Unclear         n/a            -              -           -
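Several of the Set 3 reformulation rules documented in this section are essentially local substitutions. The sketch below approximates three of them (avoidPourQue, avoidPlus and avoidNon) with regular expressions; it is a rough illustration based only on the descriptions and examples above, not the Acrolinx rules themselves, and the pattern list is an assumption made for this sketch.

    import re

    # Rough regex approximations of three automatic reformulation rules
    # (not the Acrolinx implementations):
    #   avoidPourQue : "pour que" -> "afin que"
    #   avoidPlus    : drop "plus" before "rien" / "aucun(e)"
    #   avoidNon     : drop a tag-question "non" before the final "?"
    REFORMULATION_PATTERNS = [
        (re.compile(r"\bpour que\b"), "afin que"),
        (re.compile(r"\bplus (rien|aucune?)\b"), r"\1"),
        (re.compile(r",?\s*non\s*\?\s*$"), "?"),
    ]

    def apply_reformulations(sentence: str) -> str:
        for pattern, replacement in REFORMULATION_PATTERNS:
            sentence = pattern.sub(replacement, sentence)
        return sentence

    print(apply_reformulations("Non, il n'y a plus aucun Norton installé."))
    # -> "Non, il n'y a aucun Norton installé."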

D.8.7 avoidQue

Category:     Reformulation
Set:          3
Description:  Use "seulement" instead of "ne que"
Example:      Je n'ai que la version d'évaluation. −→ J'ai seulement la version d'évaluation.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           6        15     41
Evaluated       6        15     41
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 30 sentences of 30 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       5 (17%)        5 (17%)        24 (80%)    24 (80%)
Raw better      2 (7%)         3 (10%)        2 (7%)      -
About same      2 (7%)         -              -           -
Pre-ed better   18 (60%)       21 (70%)       3 (10%)     6 (20%)
Unclear         3 (10%)        1 (3%)         1 (3%)      -

D.8.8 avoidSiEtQue

Category:     Reformulation
Set:          3
Description:  Repeat the conjunction ("si", "lorsque", etc.) instead of using "que"
Example:      Lorsque je fais clic droit dans le bandeau internet et que je selectionne
              Norton Toobar [...]
              −→ Lorsque je fais clic droit dans le bandeau internet et lorsque je selectionne
              Norton Toobar [...]

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           12       8      34
Evaluated       12       8      34
Incorrect       0        1      0
Precision       100%     88%    100%

Impact on MT (for 20 sentences of 20 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       n/a            -              -           -
Raw better      n/a            3 (15%)        1 (5%)      2 (10%)
About same      n/a            2 (10%)        1 (5%)      1 (5%)
Pre-ed better   n/a            15 (75%)       16 (80%)    15 (75%)
Unclear         n/a            -              2 (10%)     2 (10%)

D.8.9 avoidSoit

Category:     Reformulation
Set:          2
Description:  Use "[...] ou [...]" instead of "soit [...] soit [...]"
Example:      Installer soit avec CD, soit en téléchargeant depuis le Compte Norton
              −→ Installer avec CD ou en téléchargeant depuis le Compte Norton

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           15       5      31
Evaluated       15       5      31
Incorrect       0        0      1
Precision       100%     100%   97%

Impact on MT (for 7 sentences of 7 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       -              -              -           -
Raw better      2 (29%)        4 (57%)        -           1 (14%)
About same      1 (14%)        1 (14%)        -           -
Pre-ed better   4 (57%)        2 (29%)        6 (86%)     4 (57%)
Unclear         -              -              1 (14%)     2 (29%)
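A comparable surface approximation is possible for avoidSoit, as sketched below; the regex only covers the simple "soit X, soit Y" pattern shown in the example above and is not the rule used in the thesis. The function name is invented for this sketch.

    import re

    SOIT_PATTERN = re.compile(r"\bsoit\s+(.+?),?\s+soit\s+", re.IGNORECASE)

    def avoid_soit(sentence: str) -> str:
        """Rewrite "soit X, soit Y" as "X ou Y" (simple cases only)."""
        return SOIT_PATTERN.sub(r"\1 ou ", sentence)

    print(avoid_soit("Installer soit avec CD, soit en téléchargeant depuis le Compte Norton"))
    # -> "Installer avec CD ou en téléchargeant depuis le Compte Norton"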

D.8.10 removeNeExpl

Category:     Reformulation
Set:          3
Description:  Remove "ne explétif"
Example:      Aucun module NORTON ne peut s'activer ! −→ Aucun module NORTON peut s'activer !

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           11       18     20
Evaluated       11       18     20
Incorrect       0        0      1
Precision       100%     100%   95%

Impact on MT (for n/a sentences of n/a flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       n/a            n/a            n/a         n/a
Raw better      n/a            n/a            n/a         n/a
About same      n/a            n/a            n/a         n/a
Pre-ed better   n/a            n/a            n/a         n/a
Unclear         n/a            n/a            n/a         n/a

D.8.11 wrongAbbrev

Category:     Reformulation
Set:          1
Description:  Misspelt abbreviation
Example:      iPod Touch (3e génération ou ultérieure) −→ iPod Touch (3ème génération ou ultérieure)

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           39       33     37
Evaluated       39       33     37
Incorrect       0        2      0
Precision       100%     94%    100%

Impact on MT (for 16 sentences of 16 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       2 (13%)        2 (13%)        3 (19%)     9 (56%)
Raw better      -              -              2 (13%)     1 (6%)
About same      2 (13%)        3 (19%)        3 (19%)     1 (6%)
Pre-ed better   12 (75%)       11 (69%)       8 (50%)     5 (31%)
Unclear         -              -              -           -
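For wrongAbbrev, the example above rewrites the ordinal "3e" as "3ème". The sketch below generalises this to other ordinals, which is an assumption made for illustration; the actual Set 1 rule may cover a different range of misspelt abbreviations.

    import re

    # "1er"/"1ère" are left untouched; numbers from 2 upwards get "ème".
    ORDINAL = re.compile(r"\b([2-9]|\d{2,})e\b")

    def fix_ordinal_abbrev(text: str) -> str:
        return ORDINAL.sub(r"\1ème", text)

    print(fix_ordinal_abbrev("iPod Touch (3e génération ou ultérieure)"))
    # -> "iPod Touch (3ème génération ou ultérieure)"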

D.9 Informal 2nd person

D.9.1 tuVous

Category:     Informal 2nd person
Set:          3
Description:  Replace "tu" by "vous"
Example:      Qu'as-tu fait avec ton fichier trop volumineux ?
              −→ Qu'avez-vous fait avec votre fichier trop volumineux ?

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           688      2683   1813
Evaluated       50       50     50
Incorrect       4        1      1
Precision       92%      98%    98%

Impact on MT (for 50 sentences of 1968 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       20 (40%)       20 (40%)       43 (86%)    41 (82%)
Raw better      3 (6%)         3 (6%)         1 (2%)      1 (2%)
About same      4 (8%)         2 (4%)         -           1 (2%)
Pre-ed better   21 (42%)       24 (48%)       6 (12%)     7 (14%)
Unclear         2 (4%)         1 (2%)         -           -

D.10 Clitics

D.10.1 cliticsÇa

Description:  Replace clitics by "ça"
Example:      Je te le donne en pièce jointe. −→ Je te donne ça en pièce jointe.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           292      405    545
Evaluated       50       50     50
Incorrect       0        1      2
Precision       100%     98%    96%

Impact on MT (for 50 sentences of 557 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       11 (22%)       11 (22%)       -           -
Raw better      4 (8%)         3 (6%)         44 (88%)    42 (84%)
About same      5 (10%)        4 (8%)         1 (2%)      1 (2%)
Pre-ed better   28 (56%)       29 (58%)       4 (8%)      5 (10%)
Unclear         2 (4%)         3 (6%)         1 (2%)      2 (4%)

D.10.2 cliticsCeux-ci

Category:     n/a
Set:          n/a
Description:  Replace plural clitics by "ceux-ci"
Example:      Je suis un fidèle des produits de la Norton et les conseille autour de moi.
              −→ Je suis un fidèle des produits de la Norton et conseille ceux-ci autour de moi.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           72       87     88
Evaluated       50       50     50
Incorrect       3        2      6
Precision       94%      96%    88%

Impact on MT (for 50 sentences of 69 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       16 (32%)       16 (32%)       -           -
Raw better      7 (14%)        5 (10%)        48 (96%)    26 (52%)
About same      7 (14%)        3 (6%)         -           9 (18%)
Pre-ed better   18 (36%)       25 (50%)       -           10 (20%)
Unclear         2 (4%)         1 (2%)         2 (4%)      5 (10%)

D.10.3 cliticsPersPron

Category:     Clitics
Set:          3
Description:  Remove or replace clitics
Example:      Vous pouvez cliquer sur le lien qui vous renverra à votre sujet.
              −→ Vous pouvez cliquer sur le lien qui renverra à votre sujet.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           124      118    111
Evaluated       50       50     50
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 50 sentences of 172 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       5 (10%)        5 (10%)        5 (10%)     5 (10%)
Raw better      17 (34%)       16 (32%)       24 (48%)    37 (74%)
About same      10 (20%)       4 (8%)         4 (8%)      1 (2%)
Pre-ed better   17 (34%)       21 (42%)       11 (22%)    6 (12%)
Unclear         1 (2%)         4 (8%)         6 (12%)     1 (2%)

D.11 Reordering

D.11.1 reorderJamaisV

Category:     Reordering
Set:          3
Description:  Move "jamais" in front of verb to match English word order (also removes "ne")
Example:      Si Norton a été désinstallé et que tu n'avais jamais fait de sauvegarde des
              identifiants, c'est perdu.
              −→ Si Norton a été désinstallé et que tu jamais avais fait de sauvegarde des
              identifiants, c'est perdu.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           6        4      23
Evaluated       6        4      23
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 8 sentences of 8 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       2 (25%)        2 (25%)        1 (13%)     6 (75%)
Raw better      4 (50%)        4 (50%)        3 (38%)     1 (13%)
About same      1 (13%)        -              -           1 (13%)
Pre-ed better   1 (13%)        1 (13%)        3 (38%)     -
Unclear         -              1 (13%)        1 (13%)     -

D.11.2 reorderRienInf

Category:     Reordering
Set:          3
Description:  Move "rien" after infinitive to match English word order
Example:      Comment faire une image à froid avec le Norton Recovery Disk" sans rien modifier ?
              −→ Comment faire une image à froid avec le Norton Recovery Disk" sans modifier rien?

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           9        37     22
Evaluated       9        37     22
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 40 sentences of 40 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       3 (8%)         3 (8%)         13 (33%)    -
Raw better      9 (23%)        12 (30%)       11 (28%)    31 (78%)
About same      5 (13%)        3 (8%)         1 (3%)      -
Pre-ed better   21 (53%)       19 (48%)       12 (30%)    6 (15%)
Unclear         2 (5%)         3 (8%)         3 (8%)      3 (8%)

D.11.3 reorderTout

Category:     Reordering
Set:          3
Description:  Move "tout" after verb
Example:      Si tu as installé NAV et que tu as tout laissé par défaut, je ne comprends pas.
              −→ Si tu as installé NAV et que tu as laissé tout par défaut, je ne comprends pas.

Precision (on 10,000 sentences)

                Norton   CNet   Brico
Flags           18       27     53
Evaluated       18       27     50
Incorrect       0        0      0
Precision       100%     100%   100%

Impact on MT (for 23 sentences of 23 flagged in 10,000; Norton forum data)

                ACCEPT (FTI)   ACCEPT (AMT)   Lucy        Systran
No impact       7 (30%)        7 (30%)        10 (44%)    -
Raw better      3 (13%)        4 (17%)        3 (13%)     2 (9%)
About same      1 (4%)         -              -           -
Pre-ed better   10 (44%)       12 (52%)       10 (44%)    21 (91%)
Unclear         2 (9%)         -              -           -
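Finally, a crude approximation of reorderTout is sketched below. Detecting the past participle by its ending is an assumption made for this illustration; the thesis rules rely on proper morphological analysis, so this snippet should not be read as the actual implementation.

    import re

    # Past participles are guessed from their -é / -i / -u endings, which is
    # only a heuristic; the real rule uses morphological analysis.
    TOUT_BEFORE_PARTICIPLE = re.compile(r"\btout\s+(\w+[éiu]e?s?)\b")

    def reorder_tout(sentence: str) -> str:
        """Move "tout" after the following past participle ("as tout laissé"
        -> "as laissé tout")."""
        return TOUT_BEFORE_PARTICIPLE.sub(r"\1 tout", sentence)

    print(reorder_tout("Si tu as installé NAV et que tu as tout laissé par "
                       "défaut, je ne comprends pas."))
    # -> "... et que tu as laissé tout par défaut, je ne comprends pas."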

Appendix E

Pre-editing in the Norton Community forum: instructions for participants


Contexte

Le projet ACCEPT est un projet de recherche co-financé par la Commission Européenne. Le but principal de ce projet est de développer de nouvelles méthodes et techniques visant à améliorer la traduction automatique (TA) dans le cadre des communautés virtuelles partageant des informations spécialisées (la communauté Norton est un exemple de ce type de communauté). Afin d'atteindre ce but, certaines équipes du projet mettent en place des études visant à étudier l'impact de certaines variables (comme le nombre de fautes de grammaire dans un texte) sur certains phénomènes (comme la qualité fournie par un outil de traduction automatique). L'une de ces études est actuellement coordonnée par des chercheurs de l'Université de Genève et de Symantec et nous aurions besoin de votre aide afin d'obtenir des résultats. Votre contribution est donc primordiale !

Pourquoi prééditer ?

La préédition consiste à corriger les fautes d'orthographe, de grammaire et de style pour améliorer la qualité du texte original et faciliter sa traduction automatique. Elle devrait ainsi permettre aux utilisateurs de communauté virtuelle d'accéder à des informations rédigées dans d'autres langues.

La tâche

La tâche sera divisée en deux phases, et dans chaque phase il s'agira de corriger 25 posts, un par un, en suivant les directives ci-dessous :

- Corrigez uniquement les mots soulignés dans le texte.
- Utilisez si possible les suggestions proposées.
- Essayez d'appliquer le plus de modifications possibles (même s'il s'agit de corrections purement stylistiques). La qualité de la traduction dépendra en effet de cet effort de préédition.

Instructions

Les 25 posts à corriger lors de chaque phase seront réunis dans un board auquel vous pourrez accéder une fois connecté au forum grâce à un lien qui vous sera personnellement fourni.

Une fois dans le board, vous devrez cliquer sur chaque objet afin de pouvoir accéder à la modification du message :

Figure E.1: Pre-editing guidelines page 1 of 2


Une fois que vous aurez cliqué sur « Modifier le message », vous pourrez alors cliquer sur le bouton pour lancer la correction. Le système souligne en rouge ou en vert les mots à corriger. Passez ensuite la souris sur les mots soulignés.

1. Pour les mots soulignés en rouge, choisissez l'une des suggestions proposées ou éditez le texte manuellement. Pour ignorer des suggestions, vous pouvez aussi choisir l'option « Apprendre » ou « Ignorer ».

   Option « Apprendre » : Le mot ou l'expression sera ajouté dans le dictionnaire (liste de réserve) et ne sera plus jamais souligné.
   Option « Ignorer » : La correction sera ignorée dans ce cas particulier, mais sera à nouveau proposée dans la suite.

2. Pour les mots soulignés en vert, le système affiche une description du problème à corriger. Pour éviter que le système ne marque à nouveau ce type de problème, appuyez sur l'option « Ignorer cette règle ? ».

   Attention : il est préférable d'utiliser cette option seulement lorsque vous jugez que les règles ne sont pas nécessaires. Si vous pensez qu'une règle a généré une fausse alarme, il est préférable de passer un autre problème en revue.

Si vous voulez modifier les mots appris, ainsi que les mots et règles ignorées, ou les visualiser, cliquez sur le bouton ou le mot de la liste. Si vous changez d'opinion, cliquez sur « Supprimer » pour enlever la règle.

Quand vous avez terminé la correction, n'oubliez pas de cliquer sur le bouton pour que le texte original soit remplacé par le texte modifié dans le pop-up. Si vous cliquez directement sur le bouton, le pop-up se fermera et toute modification sera perdue. Vous pourrez ensuite cliquer et passer au texte suivant.

Si vous avez la moindre question, n'hésitez pas à nous contacter ! Merci d'avance !

Figure E.2: Pre-editing guidelines page 2 of 2


Appendix F

Post-editing guidelines


Post-editing Guidelines

Please read these guidelines carefully before you begin to post-edit.

Brief Introduction and General Instructions

This experiment is carried out as part of the EU-funded Accept Project. You will be asked to post-edit machine translation output. The post-editing tasks will be performed through the Accept Project portal, to which you will receive a link by email. You will first need to create an account to start working. Open the Accept portal and move on to the following section for more details.

How To Create An Account

To create an account, click on Register. In the new window introduce the email to which the invitation was sent and a password of your choice. You will receive an email for confirmation. Upon confirmation, log in to the portal using your email and password.

How To Use The Portal

- Access the post-editing section by clicking on the “Post Edit” bubble on the upper banner.
- Click on the Project named “Student test” or follow the link included in the email invitation you should have received on your email box.


Figure F.1: Post-editing guidelines page 1 of 4


You should now be able to visualize the list of active tasks to be completed.

- Click on “Edit” to start post-editing the first task.
  IMPORTANT: Note that each task should be performed without interruption. Stopping and restarting may distort results. However, you are allowed to interrupt your work between tasks and resume it later. Just move forward to the next task to resume your work. Never use the “Save for Later” button at the right-bottom corner when working on a task.



- After clicking on “Edit”, a new window will pop up, which should look like this:
  IMPORTANT: The first time you click on “Edit”, the pop-up window may come up empty or partially empty. If this occurs, please close the window and click again on “Edit”. Now it should be fine. Text should appear in every textbox (except for the “Comments?” box).

[Screenshot of the editing window, annotated: Source; Target; “Edit the text here”; “Move forward and backwards”; “Include any comments about the sentence (difficulties, terminology issues, etc.)”; “Submit results”.]

Figure F.2: Post-editing guidelines page 2 of 4




- Edit the text on the textbox under “Current sentence to edit”. The textbox will be automatically filled in with a candidate translation, which will also appear highlighted in yellow on the left-hand side for you to know how many sentences are left to post-edit.
  IMPORTANT: Sentences were extracted from original user posts on online forums. Please note that, although sentences were grouped in tasks, they do not belong to a same coherent post: all sentences were mixed up and there might be no relation at all between them.
  IMPORTANT: Do not use the “Guidelines” button. Only refer to the instructions included in the Guidelines we describe herein.
  IMPORTANT: Do not use the check spelling button.
  IMPORTANT: Never use the “Save for Later” button at the right-bottom corner when working on a task.
- Click on “Next” to move on to the following sentence. If necessary, you can click on “Prev” to review the previous sentence.
- Once you have post-edited ALL the sentences in the task, click on “Complete task” to submit the results for that particular task. The task will disappear from the list and you won’t be able to review your work.

Linguistic Instructions

Post-editing Type and Expected End Quality

We expect a full post-editing effort from your side, which means producing a final set of sentences of publishable quality. Nevertheless, please keep in mind the following instructions:

- Use as much of the raw MT output as possible.
- Do not restructure the sentences only to improve the natural flow of the text and particularly do not delete a whole sentence to retranslate from scratch.
- Aim for grammatically and semantically correct translations.
- Aim at conveying the meaning of the source in the target. Check for mistranslations and omissions.
- Do not introduce preferential changes. There is no need to implement corrections that are of a stylistic nature only.
- Always try to do as few edits as possible.

General Vocabulary

- For general vocabulary, you may use any reference material (offline or online) to find the appropriate translation.

Figure F.3: Post-editing guidelines page 3 of 4

Terminology

- Follow the terminology glossary we have sent to you for product and process names.

Style

- Style is not given priority in this study. Only abide by the main stylistic rules of your language. The resulting sentence is only expected to be grammatically correct and convey the meaning of the source.

Deliverables

Please, send an email to [email protected] once you have completed and submitted all tasks.

Who to contact if problems arise during your work

Should you encounter any technical problem while post-editing or should you have any doubt about the task, please contact us immediately: Johanna Gerlach [email protected]


Figure F.4: Post-editing guidelines page 4 of 4
