Instituto Politécnico Nacional
Centro de Investigación en Computación

Automatic Native Language Identification

Thesis submitted to obtain the degree of Doctor in Computer Science (Doctorado en Ciencias de la Computación)

Presented by: M. en C. Ilia Markov

Thesis supervisors: Dr. Grigori Sidorov and Dr. Obdulia Pichardo-Lagunas

May 2018, Mexico City, Mexico

Resumen

The task of native language identification (NLI) consists in identifying the native language (L1) of a person based on his or her texts written in a second language (L2). NLI is useful for a variety of purposes, including marketing, security, and educational applications. Native language identification relies on the assumption that the native language influences L2 writing due to the linguistic interference effect, one of the major topics in the field of second language acquisition (SLA). Various aspects of the native language that influence L2 production have been explored before for NLI: lexical choices, including the choice of cognates, general etymology, grammatical patterns, and spelling errors, among others. These aspects provide various insights into the nature of the influence of L1 on L2 writing.

In this dissertation, we focus on two other, underexplored areas of native language identification: punctuation and emotions. We show that the use of punctuation and the way authors express their emotions in L2 writing are influenced by their native language.

First, we show that the use of punctuation in L2 writing is strongly affected by L1. We describe our experiments on evaluating the impact of punctuation marks on the NLI task. We propose novel features, punctuation n-grams, aiming to capture patterns of punctuation mark usage. We use two feature sets: part-of-speech tags and function words. We show that adding the punctuation-based features to these feature sets improves the results under a variety of settings: traditional multi-class classification, 2-step classification, classification by L2 proficiency level, and cross-topic and cross-corpus classification.

Second, we explore the role of emotions in NLI. To model emotion information, we use emotion polarity and emotion load features. Our results with part-of-speech tags and function words show that adding emotion-based features is useful for the task in most of the settings mentioned above, even on the corpus composed of essays, where the use of emotions is limited by the genre. These findings, together with some other modifications described in detail in the text, allowed us to improve our NLI method, developed for the NLI Shared Task 2017, and to surpass state-of-the-art results on the main corpora in NLI research.

The main contributions of this dissertation are the following: (i) we evaluated the strength of punctuation marks as features for NLI, (ii) we evaluated the strength of punctuation n-grams as features for NLI, (iii) we showed that punctuation marks are solid indicators of the native language across different L2 proficiency levels and topic/corpus variations, (iv) we explored the role of emotions in the NLI task, (v) we evaluated the strength of emotion-based features in NLI, and (vi) we developed a robust method for the NLI task.

Abstract

Native language identification (NLI) is the task of identifying the native language (L1) of a person based on his or her texts written in a second language (L2). NLI is useful for a variety of purposes, including marketing, security, and educational applications. Identifying the native language relies on the assumption that the native language influences L2 writing due to the language transfer effect, one of the major topics in the field of second language acquisition (SLA). Various aspects of the native language that influence L2 production have been explored before for NLI: lexical choices, including the choice of cognates, general etymology, grammatical patterns, and spelling errors, among others. These aspects provide various insights into the nature of L1 influence on L2 writing.

In this dissertation, we focus on two other, underexplored areas of native language identification: punctuation and emotions. We show that the use of punctuation and the way authors express their emotions in L2 writing are influenced by their native language.

First, we show that the use of punctuation in L2 writing is strongly affected by L1. We describe our experiments on evaluating the impact of punctuation marks (PMs) on the NLI task. We propose novel features, PM n-grams, aiming to capture patterns of PM usage. We use two feature sets: part-of-speech tags and function words. We show that adding punctuation-based features to these feature sets improves the results under a variety of settings: traditional multi-class classification, 2-step classification, proficiency-level classification, and cross-topic and cross-corpus classification.

Second, we explore the role of emotions in NLI. To model emotion information, we use emotion polarity and emotion load features. Our results with part-of-speech tag and function word feature sets show that adding emotion-based features is useful for the task in the majority of the settings listed above, even in the essay domain, where the use of emotions is limited by genre.

These findings, along with some other modifications described in detail in the text, allowed us to improve our NLI method, developed for the NLI Shared Task 2017, and to surpass state-of-the-art results on the two main datasets in NLI research. The main contributions of this dissertation are as follows: (i) we evaluated the strength of punctuation marks as NLI features, (ii) we evaluated the strength of punctuation mark n-grams as NLI features, (iii) we showed that punctuation marks are robust indicators of the native language for different proficiency levels and topic/corpus variations, (iv) we explored the role of emotions in the NLI task, (v) we evaluated the strength of emotion-based features in NLI, and (vi) we developed a robust NLI method.

Acknowledgments

First and foremost, I would like to express my sincere gratitude and appreciation to my supervisors, Dr. Grigori Sidorov and Dr. Obdulia Pichardo-Lagunas, for their patience, guidance, and support. I am very grateful for their insightful comments and encouragement throughout the course of this work.

My endless gratitude goes to Dr. Alexander Gelbukh for inspiring me and for always being ready to help, both scientifically and personally. I am deeply grateful to Dr. Jorge Baptista for providing me with a great deal of help and valuable advice over the years.

I would also like to thank my remarkable internship advisors and collaborators: Dr. Carlo Strapparava, Dr. Vivi Nastase, and Dr. Efstathios Stamatatos. It was a great privilege working with them. I will always be grateful for their hospitality and willingness to share their knowledge.

A final word of gratitude is dedicated to my colleagues, Helena Gómez-Adorno and Miguel Sánchez-Perez, my family, and all my friends for their constant support along the way.

This work would not have been possible without the financial support of the Mexican Government, CONACYT project 240844, and the CONACYT scholarship.


Contents

Resumen
Abstract
Acknowledgments
List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Native Language Identification
  1.2 Motivation
  1.3 Objectives
    1.3.1 General objective
    1.3.2 Particular objectives
  1.4 Structure of the Document

2 Related Work
  2.1 Native Language Identification
    2.1.1 Aspects of the language as L1 indicators
    2.1.2 Punctuation
    2.1.3 Emotions
  2.2 Shared Tasks on Native Language Identification
    2.2.1 NLI Shared Task 2013
    2.2.2 NLI Shared Task 2017

3 Experiments and Results
  3.1 Datasets
    3.1.1 TOEFL11 dataset
    3.1.2 ICLE dataset
  3.2 Punctuation as NLI Features
    3.2.1 Features
    3.2.2 Experiment setup
    3.2.3 Punctuation results
  3.3 Emotions as NLI Features
    3.3.1 Emotion-based features
    3.3.2 Emotion results
  3.4 Combination of Punctuation and Emotion Features
  3.5 Method for NLI
    3.5.1 CIC-FBK: Method for NLI
    3.5.2 Improving the CIC-FBK method

4 Conclusions and Future Work
  4.1 Conclusions
  4.2 Scientific Contributions
  4.3 Future Work

Bibliography

A Results for Top 10 Punctuation Mark N-grams
B Results for Function Words with Punctuation-Based Features
C Emotion Polarity Results
D Fine-Tuning the Feature Set Size on the ICLE Dataset

List of Figures

3.1 Basic workflow for a text classification task.
3.2 Syntactic tree for the sample sentence (3).
3.3 Confusion matrices for the original cic-fbk method (top) and the improved cic-fbk method (bottom) on the toefl11-17 test set.
3.4 Confusion matrices for the original cic-fbk method (top) and the improved cic-fbk method (bottom) on the icle dataset.
3.5 Accuracy (%) variation with respect to the frequency threshold on the icle dataset.
3.6 Accuracy (%) variation with respect to the minimum document frequency (min_df) on the icle dataset.
3.7 Confusion matrix before (top) and after (bottom) selecting the optimal feature set size for the icle dataset.

List of Tables

2.1 Top five results in the NLI Shared Task 2013.
2.2 Official results for the participating teams in the NLI Shared Task 2017.
2.3 Oracle results (F1-score, %) on the NLI Shared Tasks' 2013 and 2017 systems. Source: [39].
3.1 Distribution of topics in the toefl11 dataset.
3.2 English proficiency levels in the toefl11 dataset.
3.3 Number (No.) of essays per class in the 7-way icle dataset (the icle-nli subset).
3.4 Punctuation mark (PM) n-grams for the sample sentence (1).
3.5 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) with PMs, without PMs, and with PM n-grams (n = 2, 5); 1- and 2-step approaches.
3.6 Contingency table of two models' predictions.
3.7 10-fold cross-validation accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams for each proficiency level (imbalanced setting).
3.8 10-fold cross-validation accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams for each proficiency level (balanced setting).
3.9 10-fold cross-validation and one fold/topic setting accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams.
3.10 Mixed- and cross-topic settings accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams.
3.11 Number (No.) of essays per class in the icle dataset used for the cross-corpus experiment.
3.12 Cross-corpus classification results for POS n-grams without PMs, with PMs, and with PM n-grams.
3.13 Performance of emotion words and random words.
3.14 Number of emotion words per L1 (No.) and number of emotion words divided by the total number of words (Ratio) in the toefl11-17 and icle datasets; sorted from the highest to the lowest.
3.15 NRC emotion lexicon entry for the target word bad.
3.16 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3, with PMs), when adding emotion polarity (emoP) and emotion load (emoL) features.
3.17 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) without PMs, with PMs, with PM n-grams (n = 2, 5), and with emotion polarity and emotion load features; 1- and 2-step approaches.
3.18 10-fold cross-validation accuracy (%) for FW unigrams without PMs, with PMs, with PM n-grams (n = 2, 5), and with emotion polarity and emotion load features; 1- and 2-step approaches.
3.19 Categories of character n-grams introduced by Sapkota et al. [63].
3.20 Character n-grams (n = 3) per category for the sample sentence (2) after applying the algorithm by Sapkota et al. [63].
3.21 10-fold cross-validation accuracy (%) for each feature type individually on the toefl11-17 dataset.
3.22 Comparison of typed and untyped character n-grams (n = 3/n = 4) on the toefl11-17 training and test sets when used in combination with other feature types employed in the cic-fbk method.
3.23 Results (accuracy, % and F1-score, %) for the original cic-fbk method, the modified cic-fbk method, and the modified cic-fbk method with PM n-grams (n = 2, 5) and emotion-based features on the toefl11-17 training and test sets.
3.24 5-fold cross-validation (5FCV) results (accuracy, % and F1-score, %) for the original cic-fbk method, the modified cic-fbk method, and the modified cic-fbk method with PM n-grams (n = 2, 5) and emotion-based features on the icle dataset.
3.25 Accuracy (%) and F1-score (%) variation with respect to the frequency threshold on the icle dataset for the modified cic-fbk method + PM n-grams + emoP + emoL.
3.26 Accuracy (%) and F1-score (%) variation with respect to the minimum document frequency (min_df) on the icle dataset for the modified cic-fbk method + PM n-grams + emoP + emoL.
A.1 Top 10 10-fold cross-validation results (accuracy, %) for PM n-grams with POS n-gram (n = 1–3) features with PMs on the toefl11-17 dataset; 1-step and 2-step approaches. Number of features (No.) is provided for each experiment.
A.2 Top 10 10-fold cross-validation results (accuracy, %) for PM n-grams with POS n-gram (n = 1–3) features with PMs on the icle dataset; 1-step and 2-step approaches. Number of features (No.) is provided for each experiment.
B.1 10-fold cross-validation accuracy (%) for FW unigrams with PMs, without PMs, and with PM n-grams (n = 2, 5); 1- and 2-step approaches.
B.2 10-fold cross-validation accuracy (%) for FW unigrams for each proficiency level (imbalanced setting); toefl11 dataset.
B.3 10-fold cross-validation accuracy (%) for FW unigrams for each proficiency level (balanced setting); toefl11 dataset.
B.4 10-fold cross-validation and one fold/topic setting accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.
B.5 Mixed- and cross-topic settings accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.
B.6 Cross-corpus classification accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.
C.1 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) with PMs and PM n-grams (n = 2, 5), and with emotion polarity features; 1- and 2-step approaches.
C.2 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams (n = 2, 5), and with emotion polarity features; 1- and 2-step approaches.
C.3 10-fold cross-validation accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (imbalanced setting).
C.4 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (imbalanced setting).
C.5 10-fold cross-validation accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (balanced setting).
C.6 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (balanced setting).
C.7 10-fold cross-validation and one fold/topic setting accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.
C.8 10-fold cross-validation and one fold/topic setting accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.
C.9 Mixed- and cross-topic settings accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.
C.10 Mixed- and cross-topic settings accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.
C.11 Cross-corpus classification accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.
C.12 Cross-corpus classification accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.
D.1 Accuracy (%) and F1-score (%) variation with respect to the frequency threshold on the icle dataset for the modified (modif.) cic-fbk with emoL, with PM n-grams and emoL, and with emoP and emoL features.
D.2 Accuracy (%) and F1-score (%) variation with respect to the minimum document frequency (min_df) on the icle dataset for the modified (modif.) cic-fbk with emoL, with PM n-grams and emoL, and with emoP and emoL features.

List of Abbreviations

NLI – native language identification
NLP – natural language processing
SLA – second language acquisition
L1 – first language
L2 – second language
BoW – bag of words
POS – part of speech
FW – function word
PM – punctuation mark
EmoP – emotion polarity
EmoL – emotion load
SVM – support vector machines
ML – machine learning


Chapter 1

Introduction

1.1 Native Language Identification

Native language identification (NLI) is a natural language processing (NLP) task that aims at automatically identifying the native language (L1) of a writer based solely on his or her texts written in a second language (L2). The effect of the native language seeping into texts produced in a different language is known as language transfer [53], also referred to as cross-linguistic influence. According to the language transfer theory, the native language influences the way a second language is learned. Therefore, if a system learns to identify what is being transferred from one language to another, it will be able to identify the native language of an author given a text written in L2.

Numerous aspects of the language have been explored for NLI, e.g., character-level models [22], lexical choice [6, 29], cognates [51], general etymology [50], grammar [81, 72], and spelling errors [13, 26], among others. We focus on two aspects of the language that have not been previously explored for NLI. We suggest the following hypotheses: (i) punctuation is a robust indicator of an author's native language, since punctuation emphasizes the information structure of a sentence, and different languages structure information differently; and (ii) the use of emotions is L1-specific, and is distinctive enough to contribute to the native language identification task.

Punctuation is among the indicators that overtly represent the manner in which each language organizes and conveys information. Our hypothesis is that the use of punctuation in an author's L1 is distinct from the norms of L2 and is, therefore, a strong indicator of the author's native language. While punctuation was included in some previous models (e.g., in character-level models), there was no quantification of its impact; that is, there was no analysis of how much the n-grams that include punctuation contribute to the final classification, so these previous studies give no insight into the degree of punctuation's impact or interference. The series of experiments described in this dissertation serves to make this quantitative. Each experiment explores some aspect of interference from the native language.

We propose to model punctuation information by introducing punctuation mark (PM) n-gram features. PM n-grams are composed of n consecutive PMs, omitting all the tokens in between. This punctuation representation is designed to capture patterns of PM use, which, according to our hypothesis, are distinct for different L1 groups.

We also investigate the impact of emotions on native language identification. Our hypothesis is that the way authors express their emotions when providing their opinion or describing personal experience in response to the prompts represented in the data depends on their native language. We show that emotionally charged words contribute to the native language identification task, and model this information via the so-called emotion polarity and emotion load features. We use emotion and sentiment flags from the NRC Word-Emotion Association Lexicon (NRC emotion lexicon) [48] (emotion polarity features). These features are designed to capture the emotion and positive/negative polarity information contained in essays written by different L1 groups. Emotion load features aim at capturing the frequency of emotionally charged words used by different L1 groups in their essays, that is, highly-emotional versus low-emotional essays. To the best of our knowledge, no work has been done on evaluating the impact of emotion information on the NLI task based on textual data, which is one of the central research questions addressed in this dissertation.
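To make these two feature families concrete, the following is a minimal sketch in Python (not the exact experimental code); the punctuation inventory and the emotion word list shown here are illustrative stand-ins for the resources described in Chapter 3:

    from collections import Counter

    # Assumed punctuation inventory and a tiny stand-in for the NRC
    # emotion lexicon; the actual resources are described in Chapter 3.
    PMS = set(".,;:!?\"'()-")
    EMOTION_WORDS = {"bad", "happy", "fear", "love", "terrible"}

    def pm_ngrams(text, n=2):
        # PM n-grams: n consecutive punctuation marks, in order of
        # appearance, with all intervening tokens omitted.
        marks = [c for c in text if c in PMS]
        return Counter(tuple(marks[i:i + n]) for i in range(len(marks) - n + 1))

    def emotion_load(tokens):
        # Emotion load: proportion of emotionally charged words, meant to
        # distinguish highly-emotional from low-emotional essays.
        return sum(t.lower() in EMOTION_WORDS for t in tokens) / len(tokens)

    print(pm_ngrams("Well, in my opinion, this is bad; really bad!"))
    # Counter({(',', ','): 1, (',', ';'): 1, (';', '!'): 1})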


To investigate our hypotheses, we perform a series of experiments that measure the impact of punctuation-based and emotion-based features on the NLI task, which we approach from a supervised machine-learning (ML) perspective, as a multi-class classification problem over documents written in English (as L2), using part-of-speech (POS) tag and function word (FW) feature sets, and analyze the performance of these feature sets with and without punctuation marks (PMs), with PM n-gram features, and finally, with emotion-based features. We opted for POS n-grams and function word-based features since they constitute a core set of standard features for this task [36, 35], while lexicalized representation through word or character n-grams is often avoided in NLI research due to the issue of topic bias [5, 24].

We perform 1-step classification – classifying the L1 of the writer directly – as well as 2-step classification, where in the first step we classify the language family/geographical group, and then the actual L1. We evaluate punctuation- and emotion-based features for different proficiency levels, as well as under cross-topic and cross-corpus NLI conditions. This set of experiments was designed to show that these features are robust indicators of L1, regardless of the topics present in the dataset, and that they are portable across different topics and corpora.

We develop a method based on the Support Vector Machines (SVM) algorithm trained on several types of features: word n-grams, lemma n-grams, part-of-speech n-grams, function words, character n-grams from misspelled words, typed character n-grams, and syntactic dependency-based n-grams of words and of syntactic relation tags. Our method shared the first rank in the NLI Shared Task 2017 [39] scoring, achieving the second-best result in the shared task. We present several modifications to our method, which consist in feature selection – replacing typed character n-grams with untyped n-grams, and incorporating punctuation mark n-gram and emotion-based information – and feature filtering: fine-tuning the feature set size. The proposed modifications allowed us to improve our NLI method and to push state-of-the-art results on the two main datasets in NLI research.
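As an illustration of the 1-step and 2-step setups, here is a minimal sketch assuming scikit-learn; feature extraction is reduced to raw token counts purely for illustration, the L1-to-group mapping is hypothetical, and training one per-group classifier is one plausible realization of the second step, not necessarily the exact implementation used (the actual grouping and feature sets are described in Chapter 3):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical L1 -> language family/geographical group mapping.
    FAMILY = {"FRA": "Romance", "SPA": "Romance", "JPN": "Asian", "KOR": "Asian"}

    def one_step_predict(train_texts, train_l1s, test_texts):
        # 1-step: classify the L1 of the writer directly.
        clf = make_pipeline(CountVectorizer(), LinearSVC())
        clf.fit(train_texts, train_l1s)
        return clf.predict(test_texts)

    def two_step_predict(train_texts, train_l1s, test_texts):
        # Step 1: classify the language family/geographical group.
        fam_clf = make_pipeline(CountVectorizer(), LinearSVC())
        fam_clf.fit(train_texts, [FAMILY[y] for y in train_l1s])
        # Step 2: one L1 classifier per group, trained on that group only.
        l1_clfs = {}
        for fam in set(FAMILY.values()):
            idx = [i for i, y in enumerate(train_l1s) if FAMILY[y] == fam]
            clf = make_pipeline(CountVectorizer(), LinearSVC())
            clf.fit([train_texts[i] for i in idx], [train_l1s[i] for i in idx])
            l1_clfs[fam] = clf
        fams = fam_clf.predict(test_texts)
        return [l1_clfs[f].predict([t])[0] for t, f in zip(test_texts, fams)]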

1.2 Motivation

There is a long history of computational aids for language pedagogy, both for first- and second-language acquisition [10, 15, 62]. Automated NLI makes it possible to develop pedagogical material that takes into consideration the learner's L1; that is, it helps L2 educators concentrate their efforts on the specific areas of a language that cause the most learning difficulty for different L1s. For instance, Russian native speakers tend to misuse articles in English writing due to the lack of this lexical category in Russian, while Spanish native speakers can be identified by incorrect grammatical structures resulting from the fact that Spanish allows more flexibility than English in word order, especially in adjective placement. Since speakers of different language groups make different types of errors when learning a second language [71], a system that can detect the native language of a learner will be able to provide more targeted feedback to language learners concerning their errors and contrast it with common properties of learners with a similar native language background. An example of second language acquisition research in the classroom, and of how it can influence the teaching process, is provided, e.g., in [31].

Another potential application of language-specific characteristics in language pedagogy is taking them into account when developing interactive learning systems or when designing proficiency tests. Proficiency tests usually differ with respect to the level of proficiency, but not with respect to the native language. Since speakers of different languages experience different difficulties when learning a second (or n-th) language, adjusting proficiency tests to different language backgrounds would make them more accurate and complete.

Recently, a growing number of marketing applications are taking advantage of NLI to improve their services. Identifying the native language of a customer, and in this way gaining deeper insights into his or her cultural identity, is used to improve market segmentation, to predict customer behaviour, or to develop new targeted products.

NLI is considered a subtask of author profiling – the task of identifying the author's demographics, such as age, gender, personality traits, occupation, place of residence, and native language, based on a sample of the author's writing. Being a subtask of author profiling, and part of the broader task of authorship analysis, one of the potential applications of NLI is forensic examinations, where part of the evidence refers to texts (e.g., notes, SMS messages, e-mail messages, written reports, etc.). NLI can help to limit the number of candidate authors of a text under investigation. From a forensic point of view, NLI has already been used as a tool for intelligence and law enforcement agencies [55].

1.3 Objectives

1.3.1 General objective

The goal of this dissertation is to evaluate the impact of punctuation and emotion information on native language identification and to develop a method that allows identifying the writer's native language with high accuracy.


1.3.2 Particular objectives

The particular objectives of this dissertation are manifold:

• To evaluate the impact of punctuation on NLI.
• To evaluate the strength of punctuation mark n-gram features.
• To show that punctuation marks are robust indicators of the native language for different proficiency levels, as well as under cross-topic and cross-corpus conditions.
• To evaluate the impact of emotions on NLI.
• To evaluate the strength of emotion-based features in NLI.
• To develop a robust NLI method, which includes feature selection and feature filtering.
• To evaluate the performance of the method on various datasets.

1.4 Structure of the Document

This dissertation is structured as follows: Chapter 2 describes related work on automatic native language identification and provides an overview of the NLI shared tasks. Chapter 3 describes the datasets used in this dissertation and the conducted experiments, and presents the obtained results together with their evaluation and discussion. Chapter 4 draws conclusions, presents the contributions of this work, and points to possible directions for future work.


Chapter 2

Related Work

2.1 Native Language Identification

While early work on native language identification originated at the beginning of this century [46], there has been increased interest in this task since 2012, with the availability of publicly accessible corpora. For many years the standard corpus used for NLI research was the icle dataset [20], which contains essays written by highly proficient non-native English learners covering 16 L1 groups. However, as a result of being originally collected for the purpose of corpus-linguistic investigations, the dataset was criticized for containing idiosyncrasies that make it problematic for NLI research [5]; therefore, a modified version of the corpus, normalized for topic and character encoding [74], is currently used in the field. This version covers 7 native languages: Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish. This 7-way icle subset is also used for our experiments and is described further in Section 3.1.

In 2012, the toefl11 corpus [4] of non-native English was released. The dataset was designed specifically to support research on NLI. It contains 11 first languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish, and was used in the NLI Shared Tasks 2013 and 2017 (see Section 2.2), although with a newly introduced test set in the 2017 edition. We further refer to the toefl11 dataset used in the NLI Shared Task 2013 as toefl11-13 and to the 2017 version as toefl11-17. The dataset is described in detail in Section 3.1.


Since English is a widespread common language for communication in a variety of fields – science, politics, news, entertainment, etc. – numerous people learn English as a second language. As NLI methods are particularly relevant for languages with a large number of learners, most NLI studies have explored English as the second language; however, there is a significant amount of work on other languages as well, e.g., Italian, Chinese, Finnish, Spanish, Arabic, and German [36].

The state-of-the-art results for this task are usually in the 80%–90% accuracy range, with results above 90% reported in some studies, e.g., [6, 74, 22]. However, the results vary greatly across the available datasets, depending on the number of languages being considered, the size and complexity of the data, etc.

NLI is an interesting example of a task that is difficult for humans to perform, especially when the number of target native languages is large. A study of human performance in NLI [40] showed that automated systems significantly outperform human annotators: human performance averaged 37.3% accuracy, significantly lower than the 73.3% accuracy achieved by an automated NLI system. Moreover, the authors were able to use only five native languages in their study (Arabic, Chinese, German, Hindi, and Spanish), since a larger number of L1s is difficult to cover with human annotators.

In this chapter, we first focus on the aspects of the first language that influence L2 writing and describe the features proposed to capture this influence. Then, we outline the approaches that take the perspective of optimizing the NLI models, that is, that explore the best way of modelling the problem in order to obtain the highest results for this task.

2.1.1 Aspects of the language as L1 indicators

In this subsection, we focus on the aspects of the native language that have been previously explored for NLI, which include character-level models [22, 27], lexical choice [6, 29], cognates [51], general etymology [50], grammar [81, 72], and spelling errors [13, 26], among others. These aspects provide revealing insights into the nature of L1 influence on L2 writing.

Lexical choices

Numerous studies [6, 29, 76] have shown that the lexical choices of non-native speakers are strong indicators of their native language. Brooke and Hirst [6] investigate lexical choices as L1 indicators by demonstrating the effectiveness of lexical features on the 7-way icle dataset (Japanese, Chinese, French, Spanish, Italian, Polish, Russian), as well as under cross-corpus NLI conditions. Word n-grams were the most powerful features in their experiments, achieving up to 77% accuracy on the 7-way icle dataset and around 50% in the cross-corpus setting.

Lexical choices have proved to be indicative of the first language; however, a lexicalized representation of documents (e.g., through word n-grams) inevitably reflects topic-specific information as well [5]. Moreover, lexical features can lead to an artificial boosting of the results through capturing other non-topical words, such as toponyms or culture-specific terms, e.g., proper nouns and currencies [33, pp. 70, 97]. Though the usefulness of lexical features goes beyond proper nouns and cannot be simply dismissed as a reflection of topic [6], caution should be taken when generalizing NLI findings based on lexical features or when developing robust NLI models, where lexical features may cause overfitting (perfect classification of noisy training examples, followed by disappointing performance on the test set) or capture the other information described above rather than the characteristics of the L1 as such.

Tsur and Rappoport [76] hypothesize that lexical choices in second language writing are affected by the phonology of the native language, that is, by L1 sounds and sound patterns. In order to capture phoneme transfer from the learner's L1, the authors use a character bigram representation of the documents and evaluate it on a 5-way icle subset: Bulgarian, Czech, French, Russian, and Spanish. The authors report 66% accuracy using just these subword features, concluding that they do reflect phonology transfer to some extent. A later study [52], however, found that removing a small set of highly discriminative words decreases the accuracy of a bigram-based classifier. Based on this, the authors conclude that character bigram features capture differences in word usage and lexical transfer rather than L1 phonology. More recently, phoneme n-grams were used as features for the task [9]. The authors used the Carnegie Mellon Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) to map orthography onto phonemes. Phoneme 5-grams were the best unique feature type in their approach, achieving 72.41% (SVM, tf weighting) on the toefl11-17 development set.

Several studies [51, 50] investigated the influence of the native language on lexical choices through cognates and etymology-based representations of documents. Cognates are words that have a common etymological origin, i.e., words that have similar form (sound) and meaning in two different languages; for example, English terminate and Spanish terminar derive from a common ancestor language, i.e., Latin. Nicolai et al. [51] hypothesize that cognate interference may cause the writer to misspell the intended English word under the influence of the L1 spelling, e.g., spell cuestion instead of question. The authors use the misspelled part (cu:qu) as a misspelling feature. These features were intended to relate a spelling mistake to cognate interference. The authors propose an algorithm to identify the cases where the cognate word is closer to the misspelling than to the intended word in order to extract the misspelling features. They report that these features, despite being used for only 4 out of 11 languages, contribute 0.7% to overall NLI system accuracy (81.73% on the toefl11-13 test set) when combined with other commonly used features, such as word and character n-grams. Malmasi and Cahill [34] and Nastase and Strapparava [50] investigate the influence of etymological ancestor languages on lexical choice. The former use words with either Old English or Latin origins as unigram features, while the latter replace each word with the language of its etymological ancestor and model this information to address NLI and to construct a language family tree of the Indo-European languages in the icle dataset, achieving promising results.

Lexical features and their n-gram representations (e.g., word n-gram features) are often considered the best unique feature type in NLI [42]. These features were included in the majority of the shared task approaches described further in Section 2.2. However, as mentioned earlier, caution should be taken when generalizing findings obtained on a lexical feature set, e.g., word n-gram models, given that they may lead to unintended extraction of topic or domain information, rather than capturing general characteristics of the native language of the author.
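As a concrete illustration of mapping orthography onto phonemes, the following is a minimal sketch in the spirit of the phoneme n-gram features of [9], assuming NLTK's interface to the CMU Pronouncing Dictionary (this is not the original authors' implementation):

    from collections import Counter
    import nltk

    # Requires: nltk.download('cmudict')
    CMU = nltk.corpus.cmudict.dict()

    def phoneme_ngrams(tokens, n=5):
        # Map each word onto its first listed pronunciation;
        # out-of-vocabulary words are simply skipped in this sketch.
        phones = []
        for tok in tokens:
            prons = CMU.get(tok.lower())
            if prons:
                phones.extend(prons[0])
        return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))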

Grammatical information

The idea that grammatical patterns from the native language surface in L2 texts was suggested in one of the early NLI studies, conducted by Koppel et al. [25], but was not explored in that work. This hypothesis was tested later in several independent NLI studies, such as [79, 80, 24, 72, 8], among others.

Wong and Dras [79] base their approach on the contrastive analysis hypothesis [28], according to which errors committed by non-native speakers bear certain characteristics of their native language and are strong indicators of the language background. The authors focus on the main types of syntactic errors: subject-verb disagreement, noun-number disagreement, and misuse of determiners. They obtained 24.57% accuracy on the 7-way icle dataset (Bulgarian, Chinese, Czech, French, Japanese, Russian, and Spanish), against a majority baseline of 14.29%, which is a statistically significant improvement at the 95% confidence level. However, when combined with other basic features, such as function words, character n-grams, and POS n-grams, the syntactic error features did not contribute to the overall accuracy of 73.71%. More recently, Wong and Dras [80] experimented with more complex syntactic features: cross-sections of parse trees and context-free grammar features. By combining them with lexical features, as used in [25], the authors obtained 81.7% accuracy on the same 7-way icle dataset, which was the state-of-the-art accuracy at that time.

Other grammatical features that have been explored for NLI include statistical grammar induction techniques, such as adaptor grammars [81] and tree substitution grammars [72]. Tetreault et al. [74] experiment with Stanford parser dependency features and conclude that they are strong indicators of structural differences in L2 writing. Bykh and Meurers [8] further extend syntactic models, exploring lexicalized and unlexicalized context-free grammar production rules – rules used to generate constituent parts of sentences, such as noun phrases – aiming to capture the overall structure of grammatical constructions and global syntactic patterns (see the sketch below). The authors evaluate the performance of these syntactic features under both single-topic and cross-corpus NLI conditions. They report a significant drop, of around half, in terms of accuracy in the cross-corpus setting, concluding that obtaining high cross-corpus results is challenging in NLI.

Overall, grammatical information (dependency parses, parse tree rules, preference for particular grammatical forms, e.g., active or passive voice, etc.) and lexical information (lexical choices in L2 production) are considered orthogonal and complementary in the NLI task [8, 34, 33, p. 76]. As we will show further, the vast majority of the top approaches in the NLI Shared Task 2017 used some combination of lexical and syntactic features (e.g., word n-grams with syntactic dependency-based n-grams, described in detail in Subsection 3.5.2), which boosts NLI classification results compared to settings in which one of these feature types is used in isolation.
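For illustration, context-free grammar production rules of the kind explored in [8] can be read directly off a constituency parse; a minimal sketch using NLTK, with a hand-written tree standing in for parser output:

    from nltk import Tree

    tree = Tree.fromstring(
        "(S (NP (DT the) (NN student)) (VP (VBD wrote) (NP (DT an) (NN essay))))")
    for prod in tree.productions():
        print(prod)
    # S -> NP VP
    # NP -> DT NN
    # DT -> 'the'
    # ... (lexicalized rules include the words; unlexicalized ones stop
    #      at the POS level)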

Spelling

The idea that spelling errors in L2 texts are affected by the spelling conventions of the author's native language has a long history in NLI. It was first formulated by Koppel et al. [26]. They consider syntax errors and eight types of spelling errors, such as a repeated letter, a double letter appearing only once, letter replacement, letter inversion, an inserted letter, and a missing letter. The relative frequency of each error type with respect to the length of the document was used as a feature. When combining these features with common NLI features, such as function words and character n-grams, they obtained 80.2% accuracy on a 5-way icle dataset (Russian, Czech, Bulgarian, French, and Spanish). One of the key findings reported by the authors is that spelling errors reflect not only orthographic but also pronunciation conventions in the writer's native language.

In a more recent study, Nicolai et al. [51] focus only on the misspelled part of a word rather than on the type of spelling error. The authors use character-level alignments, that is, pairs of correct and misspelled parts of a word, as features. These features contributed 0.4% to overall NLI system accuracy (81.73% on the toefl11-13 test set) when combined with other commonly used features, such as word and character n-grams. A similar strategy was described in [30], where the authors represent spelling errors by the inner-most misspelled substring compared to the correct word. The authors report 75.29% accuracy on the toefl11-13 test set when combining these features with commonly used features, such as POS and character n-grams. Chen et al. [13] also explore spelling errors but extract character n-gram features from misspelled words. The authors show that adding spelling error character n-grams to other common features (word, lemma, and character n-grams) improves NLI accuracy by 1.2% on the toefl11-13 test set.

As can be noted, character n-gram features are often used to capture spelling variations. In fact, character n-grams are hypothesized to capture different types of information in the NLI task, such as misspellings [24, 51, 13], and stylistic and morphological information [42], among others. Along with lexical features, character n-grams are considered among the best unique feature types in NLI [24, 22, 27], especially when the value of n is high (as high as n = 10), since, when extracted across word boundaries, these features capture not only subword information but also dependencies between words. The previous state-of-the-art result on the 7-way icle dataset (91.3% accuracy) was achieved using character-level models [22]. We describe this approach further in Subsection 2.2.1. However, it should be taken into account that character n-grams, similarly to word n-grams, are sensitive to topic bias, as was shown, for example, in [5] and [24].


Other aspects of the native language

Morpho-syntactic aspects of the native language in NLI are captured by part-of-speech (POS) tags and n-grams of such features, which are sequences of POS tags assigned to words to signify their syntactic role. Basic categories include verbs, nouns, and adjectives, but they can be expanded to include additional, more fine-grained morpho-syntactic information such as tense, case, gender, number, person, verb transitivity, etc. This representation can encode word order and grammatical properties of the native language by capturing the use or misuse of well-established grammatical structures, e.g., verb-subject-object, subject-verb-object, and subject-object-verb.

Function words can be seen as indicators of the grammatical relations between other words. Their role in assigning syntax to sentences is linguistically well-defined. They belong to a set of closed-class words and embody relations rather than propositional content. Examples of function words include articles, prepositions, determiners, conjunctions, and auxiliary verbs. As a result of language transfer, non-native speakers often misuse or ignore certain function words, e.g., native speakers of Russian tend to omit English articles [76]; as a consequence, function words are considered strong context- and topic-independent indicators of the native language of the writer.

POS n-grams and function words constitute a core set of standard features for this task and have been used in various previous NLI studies, e.g., [36]. These features will also be our base features for exploring the punctuation and emotion contributions described in Section 3. Their combination is considered an effective and complementary representation for English L2 data, as well as for cross-lingual NLI research [41, 36].
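Since these two feature sets serve as the base representation in our experiments, the following is a minimal extraction sketch assuming NLTK's tokenizer and POS tagger; the function word list shown is an illustrative subset, and the actual tool chain is specified in Chapter 3:

    from collections import Counter
    import nltk  # requires the 'punkt' and POS tagger models to be downloaded

    # Illustrative subset; real function word lists contain a few hundred entries.
    FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
                      "to", "that", "this", "is", "are", "was", "were"}

    def pos_ngrams(text, n=2):
        # Sequences of n consecutive POS tags.
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

    def fw_unigrams(text):
        # Frequencies of function words only; content words are discarded.
        return Counter(t for t in nltk.word_tokenize(text.lower())
                       if t in FUNCTION_WORDS)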

Summing up, various aspects of the language have been explored for NLI. At the lexical level, both word choice and spelling can be indicative of the native language. Grammatical patterns, POS features, and function words all capture different aspects of the L1 influence on L2 writing. In continuation, we focus on two underexplored aspects of the language: punctuation and emotions.

2.1.2 Punctuation

Punctuation is an important, and often revealing, aspect of written language. It is considered a strong indicator of authorship and has been used successfully in stylometric analysis for authorship attribution [21, 11, 45]. For example, in our previous study on authorship attribution [45], we introduced several pre-processing steps, which included splitting punctuation marks from adjacent words and from each other in order to capture their frequency separately, and not just in combination with adjacent words, when using a character n-gram representation of documents (see the sketch at the end of this subsection). We showed that splitting punctuation marks and capturing their frequency is an efficient strategy for authorship attribution, as evidenced by the performance improvement this pre-processing step brought to an authorship attribution model. Note, however, that NLI is a different problem from authorship attribution, since NLI aims at characterizing the writing style of a set of writers rather than the unique style of a single person.

From a linguistic point of view, punctuation has been disputed as following prosodic principles or as a clarifier of grammatical structure [2, 7]. Moore [49] suggests that prosody and punctuation realize the same function – revealing/emphasizing the information structure of an utterance – in the spoken and written modes of language, respectively. Since grammar and prosodic structure are language-specific, indicators that reveal them would be language-specific as well. We suggest that grammatical/prosodic influences from the native language surface in the new language as particular punctuation choices.

As mentioned earlier, punctuation has been included in some NLI models (e.g., in character-level models or in word n-gram models), but it has not been studied separately, and there has been no analysis or quantification of its impact. Analysis of the impact of punctuation in the NLI task is one of the research questions addressed in this dissertation.
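The punctuation-splitting pre-processing mentioned above can be sketched in a few lines; this is an illustration, not the exact pipeline of [45]:

    import re

    def split_punctuation(text):
        # Insert spaces so that each punctuation mark becomes its own token.
        return re.sub(r"([.,;:!?\"'()-])", r" \1 ", text).split()

    print(split_punctuation("Wait... really?!"))
    # ['Wait', '.', '.', '.', 'really', '?', '!']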

2.1.3 Emotions

While emotion-based features have been used in other NLP tasks, such as sentiment analysis [68] or classification of documents into the corresponding emotion category [78], they are an underexplored area of second language writing.

Rangel and Rosso [59] investigate the hypothesis that the use of emotions depends on the author's gender (male/female). The authors conducted experiments on Spanish data (Facebook comments), combining a wide range of features, such as the ratio between the number of unique words and the total number of words, the length of words, POS tags, and punctuation marks, among others, with emoticon-based features – the ratio between the number of emoticons and the total number of words, and the number of different types of emoticons representing emotions (joy, sadness, disgust, anger, surprise, derision, and dumb) – as well as with emotion-based features: values for each emotion from the Spanish Emotion Lexicon [68]. An approach incorporating all these features was evaluated on the following two tasks: emotion classification and gender identification. Since the combination of these features showed relatively high results for both tasks (40.30% F1-score for emotion classification and 59.09% accuracy for gender identification), the authors suggested that there is a certain correlation between the use of emotions and the gender of the author. Though the authors did not report the degree of contribution of the emoticon/emotion-based features in their model, it can be noted that these features improve the performance, as evidenced by their results. In a more recent study [60], the authors report the improvement achieved by adding emoticon and emotion-based features, which is 2.84% in terms of accuracy (from 56.25% to 59.09%); however, no statistical significance test was provided to evaluate this contribution, so it is hard to judge whether the improvement is statistically significant, since the dataset used in the study is relatively small (1,200 Facebook comments).

In the same study [60], the authors improved their approach and extended it to age and gender identification. The improvement was achieved by combining the feature set described above with a graph-based approach (EmoGraph). In the proposed EmoGraph, each node and edge is represented by the corresponding POS tag; the representation is then enriched with semantic information (WordNet domains and semantic classes of verbs) and emotion information, which includes the polarity of words (the polarity of common nouns, adjectives, adverbs, or verbs in a sentiment lexicon) and emotionally charged words (replacing common nouns, adjectives, adverbs, or verbs with the emotion information from the Spanish Emotion Lexicon [68]). The combination of all the features described above was then fed into an SVM classifier. Although the proposed approach showed high results for both age and gender identification on several Spanish corpora, there was no analysis of the individual contribution of the proposed emotion-based features within the EmoGraph, which makes it hard to evaluate their impact on age and gender identification.

While the studies by Rangel and Rosso [59, 60] suggest that there are certain commonalities in the use of emotions across genders (evidence that the emoticon/emotion-based features contribute to gender identification [60]), we model emotion information to examine whether there are commonalities in the use of emotions across different L1s, and evaluate the impact of emotions/emotion-based features on native language identification.


2.2 Shared Tasks on Native Language Identification

The NLI studies described above mainly focus on particular aspects of the language (or combinations thereof) that seep into L2 production, and on the features designed to capture different types of linguistic information. In this section, we focus on system papers, i.e., the papers that outline the submissions to the NLI shared tasks organized in 2013 and 2017 (we focus here on the NLI shared tasks on textual data and do not consider the competition on speech-based data [65] nor other non-textual tracks in the NLI Shared Task 2017). Most of these approaches target the task from a machine-learning perspective, where native languages (L1s) are considered class labels, and writing samples in L2 are used as training and test data (see the basic workflow of a classification task in Section 3). System papers describe research that immediately puts theory into practice, providing real-world results, which allows for a better interpretation of the robustness of NLI models.

2.2.1 NLI Shared Task 2013

The first NLI shared task, organized in 2013 [73], attracted 29 participating teams from different fields, such as computational linguistics and second language acquisition. The teams approached the task from a machine-learning perspective as a multi-class classification problem, using various machine-learning algorithms and a large variety of features, which we describe further in this section. The dataset used for the evaluation of the submitted systems was toefl11-13 (see Section 3.1). In continuation, we describe the top three approaches in the competition. We focus here on the closed subtask, where the teams were allowed to use solely the training data provided by the organizers, since this was the most popular subtask in the competition, with the largest number of participating teams and submissions.

The winning entry for the NLI Shared Task 2013 was that of Jarvis et al. [23]. Their team achieved 83.6% classification accuracy (the official metric) on the toefl11-13 test set. The features the authors explored fell into the following three categories: words, characters, and complex features. The word category included lexemes, lemmas, POS tags, and n-grams of these features. Lexemes were defined as the observed forms of words, numbers, punctuation marks, and symbols encountered in the toefl11-13 dataset; lemmas were defined as the dictionary forms of lexemes. The character category consisted of character n-gram features (n = 1–9), while the complexity-based features category included nominalization suffixes (e.g., –tion, –ism), the number of tokens per essay, the number of sentences, the number of characters, mean sentence length, mean lexeme length, and a measure of lexical variety (i.e., type/token ratio). The best performing model consisted of lexeme n-grams, lemma n-grams, and POS n-grams (n = 1–3) that occur in at least two documents in the corpus. The team used a liblinear SVM classifier and a log-entropy weighting scheme (described in Subsection 3.5.2). The total number of unique features in their model was approximately 400,000. The researchers report that higher levels of accuracy were achieved when excluding the complex features, and that the model trained on character n-gram features produced nearly as high levels of NLI accuracy as the model trained on the lexical and POS n-gram features included in their final submission.

The entry ranked second [32] was based on a linear SVM classifier trained on the combination of word unigrams and bigrams, character n-grams (n = 3–6), 4 character suffix bigrams, and proficiency and prompt values. The author used term frequency–inverse document frequency (tf–idf) weighting, considering only those features that occur at least five times in the corpus and that occur in more than 50% of the documents. The best submission consisted of 439,063 features, achieving 83.4% accuracy on the toefl11-13 test set.

Popescu and Ionescu [56] performed at 82.7% accuracy in identifying the L1s on the toefl11-13 test set, ranking third in the shared task. Their approach incorporated string kernels and a kernel based on the local rank distance, a distance measure designed to work on DNA sequences that the authors applied successfully to NLI. Kernel-based learning algorithms work by embedding the data into a feature space and searching for linear relations in that space. String kernels measure the similarity of two strings by counting the number of substrings of a certain length that the two strings have in common, while the local rank distance uses n-grams instead of single characters; that is, the local rank distance between two strings is computed by summing up all the offsets of similar n-grams between the two strings. As a machine-learning algorithm, the authors used Kernel Ridge Regression, which showed 5% higher accuracy than SVM. The authors' final submission combined string kernels with a local rank distance kernel based on 6- and 8-grams. It is noteworthy that the authors consistently report state-of-the-art results using this approach on various NLI datasets; for example, the previous state-of-the-art result on the 7-way icle dataset (91.3% accuracy) was achieved using a modified version of this method [22].
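To make the string kernel idea concrete, the following is a minimal sketch of a plain p-spectrum kernel, which counts the length-p substrings two strings share; the systems discussed above used normalized and blended kernel variants, so this is illustrative only:

    from collections import Counter

    def spectrum_kernel(s, t, p=6):
        # Similarity = number of shared length-p substrings,
        # counted with multiplicity.
        cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
        ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
        return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())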


The majority of the teams in the NLI Shared Task 2013 used the Support Vector Machines (SVM) algorithm (13 teams), other popular machine-learning (ML) algorithms in the competition were maximum entropy, logistic regression, k-nearest neighbours, and ensemble methods (described in the next subsection). The most commonly used features in the competition were word n-grams, character n-grams (with high values of n, i.e., n ≥ 4), POS n-grams, and syntactic features, which mostly involved using dependency parses. Others features the teams explored included function words, spelling errors, and text complexity measures. A more detailed description of the NLI Shared Task 2013 approaches is presented in [73]. The official results for the top five participating teams in the NLI Shared Task 2013 in terms of classification accuracy (the official metric) on the toefl11-13 test set are shown in Table 2.1. The majority baseline, 9.09% accuracy for 11 classes in the toefl11-13 dataset, is also provided. It is worth mentioning that the difference between the first five results is not statistically significant. Table 2.1: Top five results in the NLI Shared Task 2013. Rank 1 2 3 4 5 Majority baseline

Accuracy, % 83.6 83.4 82.7 82.6 82.2 9.09

The highest result achieved in the competition was 83.6% accuracy on the toefl11-13 test set. In a more recent study [37], the state-of-the-art result on the toefl11-13 test set was pushed to 87.1% accuracy using a method based on classifier stacking. One of the key contributions of the NLI Shared Task 2013 is that it partly solved the problems of the unavailability of balanced, wide-coverage benchmark corpora and of the lack of evaluation standards for this task, enabling the direct comparison of results achieved through different approaches and methodologies. In the next subsection, we focus on the second edition of the NLI shared task, organized in 2017.

2.2.2 NLI Shared Task 2017

The 2017 edition of the NLI shared task [39] attracted 17 participating teams. As with the NLI Shared Task 2013, we focus here on the first three entries according to the official shared task ranking. In this edition, the toefl11-17 test set was used for the evaluation of the submitted systems (see Section 3.1).

The winning entry [14] was based on the SVM algorithm trained on different types of features, which the authors divided into the following three categories: (i) lexical features: text and word length, character n-grams, function word n-grams, and word and lemma n-grams; (ii) morpho-syntactic features: POS n-grams; and (iii) syntactic features: linear and hierarchical dependency n-grams, and head dependents of the syntax tree. The described features were first applied to sentence classification (classifying the native language of each sentence); then the sentence classifier predictions, along with other features such as average sentence length and type/token ratio, were applied to document classification. The authors considered only those features that occur more than two times in the dataset. The key contribution of this approach consisted in combining sentence and document classifiers into a 2-stacked sentence-document architecture. The authors obtained 88.18% in terms of F1-score (the official metric) on the toefl11-17 test set, which was the highest result in the competition.

The second-best entry was that of our team (cic-fbk) [42]. We achieved 88.08% F1-score on the toefl11-17 test set, which differs from the best result by only one correctly predicted label. We describe our approach in detail in Section 3.5.2.

Kulmizev et al. [27] took third place in the competition. They experimented with various approaches (e.g., prompt omission and skipgrams: n-grams separated by a distance of k, omitting all the tokens in between, among others), achieving their best results with a simple approach based on a linear SVM trained on a combination of character n-gram features (with n varying from 1 to 9). The authors used a binary feature representation normalized using tf–idf weighting.

Other participating teams explored different classification algorithms, such as logistic regression, and experimented with various ways of building meta-classifiers (a classifier trained on the probability distributions output by another classifier, e.g., the sentence and document classifiers of [14]) and ensemble combination methods, which combine the predictions of several classifiers and output the most likely class label by voting or probability averaging, e.g., a voting ensemble of multiple SVM classifiers [19]. Alternative approaches included character-level string kernels combined with multiple kernel learning, neural network models, word and character embeddings, and n-gram language models. A detailed description of the NLI Shared Task 2017 submissions can be found in [39].


The official results for the participating teams in the NLI Shared Task 2017 in terms of macro-averaged F1-score (the official metric) and classification accuracy are shown in Table 2.2. The official shared task baseline (a linear SVM trained on word unigram features) and the majority baseline (9.09% for the 11 classes in the toefl11-17 dataset) are also provided. Several teams were assigned the same rank according to McNemar's statistical significance test with an α value of 0.05, used by the organizers (we describe McNemar's statistical significance test in Subsection 3.2.1). Our team [42] shared the first rank, achieving the second-best result in the competition.

Table 2.2: Official results for the participating teams in the NLI Shared Task 2017.

Rank | Team              | F1, % | Accuracy, %
1    | ItaliaNLP Lab     | 88.18 | 88.18
1    | CIC-FBK (our)     | 88.08 | 88.09
1    | Groningen         | 87.56 | 87.55
1    | NRC               | 87.40 | 87.36
1    | taraka_rama       | 87.16 | 87.18
1    | UnibucKernel      | 86.95 | 86.91
1    | WLZ               | 86.54 | 86.55
2    | Uvic-NLP          | 86.33 | 86.36
2    | ETRI-SLP          | 86.01 | 86.00
2    | CEMI              | 85.36 | 85.36
3    | RUG-SU            | 83.23 | 83.18
3    | NLI-ISU           | 82.64 | 82.64
3    | IUCL              | 82.62 | 82.64
3    | GadjahMada        | 81.07 | 81.10
4    | superliuxz        | 78.96 | 79.00
4    | ltl               | 76.76 | 76.73
5    | ut.dsp            | 76.09 | 76.36
–    | Official baseline | 71.04 | 71.09
–    | Majority baseline | 9.09  | 9.09

Some of the main observations derived from the NLI Shared Task 2017 are the following: (i) the average team performance in the NLI Shared Task 2017 was higher than in 2013, even though much of the training data was the same as in the NLI Shared Task 2013; (ii) a large number of participants employed multiple-classifier systems, such as meta-classifiers (classifier stacking), ensemble methods (both voting and probability averaging), and multiple kernel learning; (iii) in the 2017 edition, deep learning approaches were used to address the task; however, the achieved results were lower than with traditional supervised learning models (a comparison of the performance of SVM and neural network models is provided, e.g., in [12] and [9]).


The NLI shared task organizers estimated the upper bound on accuracy for the datasets used in the NLI Shared Tasks 2013 and 2017 by treating each team's best submission as an independent system and combining the results using two ensemble methods: plurality vote and oracle. Table 2.3 presents the results of this experiment.

Table 2.3: Oracle results (F1-score, %) on the NLI Shared Tasks' 2013 and 2017 systems. Source: [39].

                 | 2013  | 2017
No. of systems   | 29    | 17
Shared Task best | 83.59 | 88.18
Oracle           | 97.91 | 96.28
Accuracy@3       | 95.55 | 95.92
Accuracy@2       | 92.18 | 95.01
Plurality vote   | 84.25 | 87.93

An oracle is a type of fusion method that assigns the correct class label to an instance if any of the classifiers in the ensemble produces it correctly. Plurality vote is a combination strategy that selects the label with the highest number of votes. Accuracy@N is an extension of the plurality-vote combiner that takes into account the possibility that a classifier may randomly predict the correct label: instead of selecting the label with the most votes, the labels are ranked by their vote counts, and a sample is considered correctly classified if the true label is among the top N ranked candidates. (A minimal sketch of these three combiners in code is given at the end of this section.)

The organizers conclude that the higher results achieved in the NLI Shared Task 2017 are due to more accurate submissions rather than to the test set being easier to classify (in the latter case, the oracle results would be higher). Another reported conclusion is that it is challenging to develop an NLI method that achieves statistically significant gains on the toefl11-17 dataset.

To sum up, in this section we described the aspects of the language transfer effect explored in NLI, as well as presented the approaches to the NLI task as a multi-class classification problem. Next, we present the experiments designed to evaluate the robustness of punctuation- and emotion-based features in NLI and describe the developed NLI method, which pushes state-of-the-art results in automatic native language identification.
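As referenced above, the three combiners can be sketched as follows (an illustrative sketch assuming NumPy, with predictions given as a systems-by-samples array of predicted labels; this is not the organizers' exact implementation):

from collections import Counter
import numpy as np

def oracle_accuracy(predictions, gold):
    # a sample counts as correct if ANY system predicted its true label
    predictions = np.asarray(predictions)  # shape: (n_systems, n_samples)
    return float(np.mean((predictions == np.asarray(gold)).any(axis=0)))

def accuracy_at_n(predictions, gold, n):
    # rank labels by vote count; correct if the true label is in the top n
    correct = 0
    for i, true_label in enumerate(gold):
        votes = Counter(row[i] for row in predictions)
        top_n = [label for label, _ in votes.most_common(n)]
        correct += true_label in top_n
    return correct / len(gold)

def plurality_vote_accuracy(predictions, gold):
    # plurality vote is the special case n = 1
    return accuracy_at_n(predictions, gold, 1)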


Chapter 3

Experiments and Results

3.1 Datasets

In this section, we describe the characteristics of the toefl11 [4] and iclev2 [20] datasets, the two main datasets in NLI research. Both datasets cover English L2 data and represent learner corpora; that is, they consist of English essays written by non-native speakers of English.

3.1.1 TOEFL11 dataset

The ETS Corpus of Non-Native Written English (toefl11) [4] was designed specifically to support the NLI task and has become a standard frame of reference for NLI research. The dataset contains 1,100 essays in English (with an average of 348 tokens per essay) for each of the following 11 native languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. We refer to this version of the dataset as toefl11-13.

The essays were written in response to eight different writing prompts/topics (P0–P7), all of which appear in all 11 L1 groups. The topics include a comparison of old and young people, education, advertisement, car usage, travel, and risk management. Dataset statistics in terms of writing prompts are presented in Table 3.1. Within four of the L1 groups (Arabic, Chinese, Japanese, Korean), all prompts are almost equally represented, with a proportion of approximately 12.5% per prompt, whereas in the other groups there is more variability. The Italian group shows the largest variation, with one prompt representing only 1.1% of the essays, while each of the P0 and P5 prompts represents 17.0% of the essays. A noteworthy observation mentioned in [27] is that Hindi and Telugu (the most problematic languages, as we will show further in Subsection 3.5.2) have similar distributions across the prompts (few essays on the first two prompts), which differs from all the other represented L1s.

Table 3.1: Distribution of topics in the toefl11 dataset (number of essays and % per prompt).

L1       | P0            | P1          | P2            | P3            | P4            | P5            | P6            | P7
Arabic   | 139 (12.6%)   | 133 (12.1%) | 141 (12.8%)   | 138 (12.5%)   | 138 (12.5%)   | 136 (12.4%)   | 138 (12.5%)   | 137 (12.5%)
Chinese  | 140 (12.7%)   | 141 (12.8%) | 139 (12.6%)   | 139 (12.6%)   | 140 (12.7%)   | 134 (12.2%)   | 126 (11.5%)   | 141 (12.8%)
French   | 156 (14.2%)   | 68 (6.2%)   | 160 (14.5%)   | 151 (13.7%)   | 158 (14.4%)   | 160 (14.5%)   | 87 (7.9%)     | 160 (14.5%)
German   | 151 (13.7%)   | 28 (2.5%)   | 153 (13.9%)   | 152 (13.8%)   | 155 (14.1%)   | 150 (13.6%)   | 157 (14.3%)   | 154 (14.0%)
Hindi    | 86 (7.8%)     | 53 (4.8%)   | 161 (14.6%)   | 158 (14.4%)   | 161 (14.6%)   | 156 (14.2%)   | 163 (14.8%)   | 162 (14.7%)
Italian  | 187 (17.0%)   | 12 (1.1%)   | 141 (12.8%)   | 173 (15.7%)   | 173 (15.7%)   | 187 (17.0%)   | 138 (12.5%)   | 89 (8.1%)
Japanese | 138 (12.5%)   | 142 (12.9%) | 143 (13.0%)   | 141 (12.8%)   | 116 (10.5%)   | 138 (12.5%)   | 140 (12.7%)   | 142 (12.9%)
Korean   | 128 (11.6%)   | 142 (12.9%) | 143 (13.0%)   | 141 (12.8%)   | 140 (12.7%)   | 137 (12.5%)   | 136 (12.4%)   | 133 (12.1%)
Spanish  | 159 (14.5%)   | 157 (14.3%) | 162 (14.7%)   | 160 (14.5%)   | 141 (12.8%)   | 134 (12.2%)   | 54 (4.9%)     | 133 (12.1%)
Telugu   | 55 (5.0%)     | 41 (3.7%)   | 171 (15.5%)   | 166 (15.1%)   | 165 (15.0%)   | 169 (15.4%)   | 167 (15.2%)   | 166 (15.1%)
Turkish  | 170 (15.5%)   | 43 (3.9%)   | 169 (15.4%)   | 167 (15.2%)   | 169 (15.4%)   | 147 (13.4%)   | 90 (8.2%)     | 145 (13.2%)
Total    | 1,509 (12.5%) | 960 (7.9%)  | 1,683 (13.9%) | 1,686 (13.9%) | 1,656 (13.7%) | 1,648 (13.6%) | 1,396 (11.5%) | 1,562 (12.9%)

The proficiency level of the author of each essay (low, medium, or high) is also provided as metadata. Dataset statistics in terms of proficiency levels are presented in Table 3.2. The distribution of learners' proficiency levels, determined by assessment specialists, is even more variable across the groups than the writing prompts. The distribution is especially skewed in the case of the German speakers, where only 1.4% of the participants fall into the low-proficiency category, whereas 61.2% fall into the high-proficiency category. The low-proficiency category represents only 11.0% of the essays and an estimated 7.2% of the total words (based on the mean essay lengths for each proficiency level given in [4]). The medium-proficiency category comprises 54.3% of the essays and an estimated 52.8% of the words in the corpus, and the high-proficiency category represents the remaining 34.7% of the texts and an estimated 40% of the words.

Table 3.2: English proficiency levels in the toefl11 dataset (number of essays and % per level).

L1       | Low           | Medium        | High
Arabic   | 296 (26.9%)   | 605 (55.0%)   | 199 (18.1%)
Chinese  | 98 (8.9%)     | 727 (66.1%)   | 275 (25.0%)
French   | 63 (5.7%)     | 577 (52.5%)   | 460 (41.8%)
German   | 15 (1.4%)     | 412 (37.5%)   | 673 (61.2%)
Hindi    | 29 (2.6%)     | 429 (39.0%)   | 642 (58.4%)
Italian  | 164 (14.9%)   | 623 (56.6%)   | 313 (28.5%)
Japanese | 233 (21.2%)   | 679 (61.7%)   | 188 (17.1%)
Korean   | 169 (15.4%)   | 678 (61.6%)   | 253 (23.0%)
Spanish  | 79 (7.2%)     | 563 (51.2%)   | 458 (41.6%)
Telugu   | 94 (8.5%)     | 659 (59.9%)   | 347 (31.5%)
Turkish  | 90 (8.2%)     | 616 (56.0%)   | 394 (35.8%)
Total    | 1,330 (11.0%) | 6,568 (54.3%) | 4,202 (34.7%)

The toefl11 dataset was used in the NLI Shared Task 2013: a set of 900 essays per language was used as training data, 100 essays per language were used as development data, and another 100 essays per language formed the test set. In the NLI Shared Task 2017, the training data consisted of the training plus development data used in the NLI Shared Task 2013, while the development set consisted of the test data from the 2013 task. The test set (100 essays per language) was new, previously unreleased data. (I would like to thank the shared task organizers for providing the toefl11-17 test set labels.) Seven out of the eight writing prompts were represented in the 2017 test set. The average length of the essays across all three partitions was 316.2 tokens. Further on, we refer to the NLI Shared Task 2017 toefl11 dataset as toefl11-17.

The toefl11 dataset was designed specifically for the task of NLI. While it largely accomplished the goal of being comparable across all 11 represented L1s, it is not optimally balanced across the represented prompts and proficiency categories.

3.1.2 ICLE dataset

The iclev2 dataset [20] consists of essays written by highly proficient, non-native, college-level students of English. The dataset contains 6,085 essays of between 500 and 1,000 words, written in response to 1,302 prompts, which were manually grouped into 736 different topics. Two types of essays are represented: argumentative essays and literature examination papers. The dataset covers 16 first languages: Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, and Tswana.

In spite of the relatively large number of first languages represented, the dataset was originally collected for the purpose of corpus-linguistic investigations and thus has certain limitations when used for NLI research [5, 74]. First, as noted in [5], there are several topics for which the essays were written by learners of a single L1. As was mentioned above, topic bias is a confounding factor for machine-learning tasks, since a classifier can learn to distinguish the represented classes by distinguishing topics. Second, there are several encoding errors and annotation issues that occur only in essays written by learners of a certain L1 [74]. Furthermore, the size of the icle dataset is relatively small (half the size of toefl11, described earlier), which is another obstacle for generalizing the performance of NLI models when using solely this dataset.

Taking into account the criticism of the complete icle dataset with respect to topics and encoding [5, 74], and following several previous studies [74, 22], we use a 7-language subset of the corpus normalized for topic bias and character encoding (the so-called icle-nli subset [74]). (I would like to express my gratitude to A. Cahill for providing the list of documents used in their paper.) This subset contains 110 essays for each of the 7 first languages shown in Table 3.3. The average number of tokens per essay is 747 (after tokenization and removal of metadata). Further in this dissertation, we refer to this 7-way subset as icle, if not stated otherwise.

Table 3.3: Number (No.) of essays per class in the 7-way icle dataset (the icle-nli subset).

Language  | No. of essays
Bulgarian | 110
Chinese   | 110
Czech     | 110
French    | 110
Japanese  | 110
Russian   | 110
Spanish   | 110
Total     | 770

3.2 Punctuation as NLI Features

In this section, we explore and evaluate the impact of punctuation marks (PMs) on the task of native language identification. We present a set of experiments designed to show that punctuation marks are robust features for NLI. We model punctuation information via PM n-gram features, aiming to capture patterns of the use of PMs that are specific to the native language of the writer. We present our results on the evaluation of PM and PM n-gram features under a variety of settings: standard multi-class classification, 2-step classification, and proficiency-level classification; we also evaluate the strength of these features under cross-topic and cross-corpus NLI conditions.

3.2.1 Features

To evaluate the strength of punctuation-based features in the NLI task, we use POS n-grams and function words as base features, to which we add punctuation-based features. Like several previous works on NLI [36, 35] and author profiling [60], we opted for POS n-grams and function words for this experiment, since lexicalized representations of documents, including character and word n-grams, have been criticized as capturing topic-dependent information instead of L1 patterns [5]. Since we also conduct experiments under imbalanced settings, where topic bias may occur, we follow previous NLI studies [36, 35, 24] and use the more abstract POS n-gram and function word-based features to represent the documents (essays) in our data. These feature representations are commonly used in NLI to avoid possible topic bias [36, 35, 24]. Next, we describe the representations used for the punctuation-based feature evaluation.

We do not investigate the performance of punctuation-based features in isolation, since our hypothesis is that they serve to organize the conveyed information and are tied to the grammatical structure, and therefore they should be studied in their context. However, we report that, when used in isolation, punctuation marks achieve 18.74% accuracy on the toefl11 dataset under 10-fold cross-validation and 33.25% on icle, which is twice as high as the baselines (included in Table 3.5).

Part-of-speech (POS) tags. As mentioned earlier, POS features capture the morpho-syntactic patterns in a text. They are considered indicative features for NLI, especially when used in combination with other features, such as word and character n-grams [14, 42]. POS tags were obtained with the TreeTagger software package [64], which uses the Penn Treebank tagset (36 POS tags). We used POS n-grams with n varying from 1 to 3.

Function words. Function words are words that have little lexical meaning and instead serve to express grammatical relations within a sentence. Since they are also tied to the structuring of information, function words and punctuation marks fill roles within the same domain; that is, punctuation and function words share the role of organizing/revealing the information structure of a sentence, and could provide complementary features for the NLI task. The FW feature set consists of 127 English function words from the Natural Language Toolkit (NLTK, http://www.nltk.org).

Punctuation-based features. We use punctuation marks as separate features, as part of POS n-grams, and as n consecutive PMs omitting all the tokens in between (PM n-grams). As an example, consider the following sample sentence:

(1) Peter said to Tom, “Why are you so late?”

PM n-grams for the sample sentence (1) are shown in Table 3.4. We use a set of 36 punctuation marks (only 31 of which appear in the icle dataset). Punctuation marks were extracted using the string module of the Python standard library (https://docs.python.org/2/library/string.html).


Table 3.4: Punctuation mark (PM) n-grams for the sample sentence (1).

Type       | PM n-grams
PM 2-grams | ,“   “?   ?”
PM 3-grams | ,“?   “?”
PM 4-gram  | ,“?”
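In code, PM n-grams can be extracted from a tokenized sentence roughly as follows (a minimal sketch using Python's string.punctuation, which covers the ASCII subset of the 36 marks mentioned above):

import string

def pm_ngrams(tokens, n):
    # keep only the punctuation tokens (dropping all tokens in between)
    # and return the contiguous n-grams over that punctuation sequence
    pms = [t for t in tokens if t and all(ch in string.punctuation for ch in t)]
    return [''.join(pms[i:i + n]) for i in range(len(pms) - n + 1)]

tokens = ['Peter', 'said', 'to', 'Tom', ',', '"', 'Why', 'are',
          'you', 'so', 'late', '?', '"']
print(pm_ngrams(tokens, 2))  # [',"', '"?', '?"']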

3.2.2 Experiment setup

We evaluate the strength of punctuation-based features on the toefl11-17 and icle datasets. We use the tokenized version of the toefl11-17 dataset, that is, the version of the dataset after the tokenization process has been applied. Tokenization consists in breaking up a given text into units called tokens (words, punctuation marks, or numbers) by locating word boundaries, i.e., the ending point of one word and the beginning of the next. We perform tokenization of the icle dataset using the Natural Language Toolkit tokenizer. We lowercase the icle dataset and remove the markup-delimited text in it, since this corresponds to dataset metadata; that is, we remove this information in order to process only the original text, as produced by the author.

Weighting scheme. We use the term frequency (tf) weighting scheme, under which the weight of a term in a document is the number of times the term occurs in that document. Previous NLI studies have found that frequency-based feature values are more informative for this task than, e.g., a binary feature representation [36].

Classifier. We perform classification using the liblinear scikit-learn [54] implementation of Support Vector Machines (SVM) [77] with the OvR (one vs. the rest) multi-class strategy. Linear classifiers separate the classes using a linear function (hyperplane). The basic idea of SVM is to find an optimal hyperplane for linearly separable classes, which is achieved by maximizing the margin around the separating hyperplane. The effectiveness of SVM has been proven in numerous experiments on text classification tasks; moreover, it was the classifier of choice for the majority of the teams in both editions of the NLI shared task [73, 39].

Evaluation. The results are measured in terms of classification accuracy (the ratio of correctly classified documents to the total number of documents in the dataset), which is a frequently used evaluation metric in NLI research. Where not otherwise specified, the experiments were carried out using k-fold cross-validation with k equal to 10. The method consists in dividing the corpus into k subsets; one of the k subsets is used for testing, and the other k–1 subsets are used for training. The procedure is repeated k times, each time with a different subset used for testing, and the result is computed as the average across all k trials.
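This setup can be sketched as follows (an illustrative sketch assuming scikit-learn; the plain word-level CountVectorizer shown here stands in for the POS/function-word/PM representations described above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_10fcv(documents, l1_labels):
    # tf weighting = raw term counts; LinearSVC is the liblinear SVM,
    # trained one-vs-rest for the multi-class NLI problem
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 3)),
        LinearSVC(),
    )
    scores = cross_val_score(pipeline, documents, l1_labels,
                             cv=10, scoring='accuracy')
    return scores.mean()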

3.2.3 Punctuation results

We provide the results for the set of experiments designed to clarify the role and impact of the use of punctuation as an indicator of the writer's native language. We start with the usual multi-class classification setting, then investigate the use of punctuation within the represented language families/geographical groups and for different proficiency levels, and finally under cross-topic and cross-corpus NLI conditions. The results are provided for POS n-grams (n = 1–3) without PMs and for the setting in which punctuation is used as part of the POS n-gram features. We also present the results obtained when adding PM n-grams to POS n-gram features that already include PMs, which is done to show that PM n-grams capture different information – patterns of the use of PMs. We use PM n-grams with n = 2 and n = 5. These sizes of PM n-grams were found to provide the highest average improvement (about 1%) for each dataset and approach (1- and 2-step) when evaluated on POS n-grams. Accuracy results for the top 10 PM n-gram sizes on the toefl11-17 and icle datasets (1- and 2-step approaches) are shown in Appendix A, where it can be noticed that the optimal size of PM n-grams is different for each dataset and approach. Further investigation is required to answer the question of why the optimal PM n-gram size differs between the toefl11-17 and icle datasets and between the 1- and 2-step approaches; this research question will be addressed in future work.

Multi-class and 2-step classifications

The multi-class setting is the usual NLI task, where the L1 of the author of a document is predicted based on a specific representation of the document. The 2-step classification experiments were designed to investigate whether there are influences within the language families, as well as at the individual language level. Based on the languages represented in the toefl11-17 and icle datasets, we grouped the languages either by language family or by geographical location. While a grouping based on language family is more theoretically justifiable, similar results in terms of accuracy for the 2-step classification seem to support the geographical grouping of languages as well. The languages were grouped in the following way:

toefl11-17: Arabic; Asian (Chinese, Korean, Japanese); Romance (French, Italian, Spanish); German; Indian (Hindi, Telugu); Turkish.

icle: Slavic (Bulgarian, Czech, Russian); Asian (Chinese, Japanese); Romance (French, Spanish).

Table 3.5 shows the multi-class classification results (column 1-step) and the 2-step classification results (column 2-step) in terms of 10-fold cross-validation accuracy on the toefl11-17 and 7-way icle datasets. The number of features (No.) is provided for each experiment. As a reference point, we provide the majority baseline, which was also used in the NLI Shared Tasks 2013 and 2017. As the class sizes are balanced, the majority baseline is 9.09% for the 11 classes in the toefl11-17 dataset and 14.29% for the 7 classes in the icle dataset. We do not include a comparison against state-of-the-art approaches here, since the purpose of the experiments described in this section is to investigate the impact of punctuation marks relative to the other feature sets used, and under different settings (e.g., different proficiency levels, different language families, etc.).

Table 3.5: 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) with PMs, without PMs, and with PM n-grams (n = 2, 5); 1- and 2-step approaches.

                                  TOEFL11-17                     ICLE
Features                        | 1-step | 2-step | No.    | 1-step | 2-step | No.
Majority baseline               | 9.09   | 9.09   | –      | 14.29  | 14.29  | –
POS n-grams w/o PMs             | 38.88  | 39.08  | 14,993 | 59.87  | 51.04  | 10,316
POS n-grams w/ PMs              | 48.83  | 48.43  | 29,763 | 69.48  | 65.19  | 18,594
Improvement:                    | 9.95   | 9.35   |        | 9.61   | 14.15  |
POS n-grams w/ PMs              | 48.83  | 48.43  | 29,763 | 69.48  | 65.19  | 18,594
POS n-grams w/ PMs + PM n-grams | 49.62  | 49.42  | 41,211 | 70.39  | 66.49  | 25,665
Improvement:                    | 0.79   | 0.99   |        | 0.91   | 1.30   |
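One natural way to implement the 2-step setup is to first predict the language group and then disambiguate the individual L1 within the predicted group; the following is a minimal sketch under that assumption (using scikit-learn; singleton groups need no second classifier), not necessarily the exact architecture used in our experiments:

from sklearn.svm import LinearSVC

GROUPS = {'Arabic': ['Arabic'], 'Asian': ['Chinese', 'Korean', 'Japanese'],
          'Romance': ['French', 'Italian', 'Spanish'], 'German': ['German'],
          'Indian': ['Hindi', 'Telugu'], 'Turkish': ['Turkish']}

def train_two_step(X, l1_labels):
    l1_to_group = {l1: g for g, l1s in GROUPS.items() for l1 in l1s}
    group_clf = LinearSVC().fit(X, [l1_to_group[l1] for l1 in l1_labels])
    within = {}
    for group, l1s in GROUPS.items():
        if len(l1s) > 1:  # step 2: pick the L1 inside multi-language groups
            idx = [i for i, l1 in enumerate(l1_labels) if l1 in l1s]
            within[group] = LinearSVC().fit(X[idx], [l1_labels[i] for i in idx])
    return group_clf, within

def predict_two_step(x, group_clf, within):
    # x is a single-row feature matrix; step 1 predicts the group
    group = group_clf.predict(x)[0]
    return within[group].predict(x)[0] if group in within else GROUPS[group][0]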

In order to make sure that the improvements presented in Table 3.5, and further in this section, are statistically significant, we conducted McNemar's statistical significance test ("within-subjects chi-squared test") [47] with an α value of 0.05, as was done in the NLI Shared Task 2017. McNemar's test is a statistical test for paired nominal data, which is based on a 2×2 contingency table of the two models' predictions, e.g., Table 3.6. The null hypothesis is that the probabilities p(b) and p(c) of the two off-diagonal cells (B and C in Table 3.6) are the same, while the alternative hypothesis is that the performances of the two models are not equally accurate. We used the continuity-corrected version [1], where the test statistic (chi-squared, χ²) is computed as shown in Equation 3.1.

Table 3.6: Contingency table of two models' predictions.

                | model 2 correct | model 2 wrong
model 1 correct | A               | B
model 1 wrong   | C               | D

χ² = (|b − c| − 1)² / (b + c)          (3.1)
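In code, the corrected statistic and its p-value can be computed directly from the two models' predictions (an illustrative sketch assuming NumPy and SciPy; statsmodels provides an equivalent mcnemar function):

import numpy as np
from scipy.stats import chi2

def mcnemar(pred1, pred2, gold):
    correct1 = np.asarray(pred1) == np.asarray(gold)
    correct2 = np.asarray(pred2) == np.asarray(gold)
    b = int(np.sum(correct1 & ~correct2))   # cell B: model 1 right, model 2 wrong
    c = int(np.sum(~correct1 & correct2))   # cell C: model 1 wrong, model 2 right
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # Equation 3.1 (requires b + c > 0)
    return stat, chi2.sf(stat, df=1)        # p-value under chi-squared, 1 d.o.f.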

If the sum of the b and c cells is sufficiently large, the χ² value follows a chi-squared distribution with one degree of freedom. After setting a significance threshold α, in our case α = 0.05, we can compute the p-value: assuming that the null hypothesis is true, the p-value is the probability of observing a larger chi-squared value. If the p-value is lower than our chosen significance level, we can reject the null hypothesis that the performances of the two models are equal.

As one can see from Table 3.5, the inclusion of PMs improves the results in all the considered settings. PM n-grams further improve the results by around 1%, which is statistically significant on the toefl11-17 dataset, but not on icle. The improvement for the 2-step approach demonstrates that there are shared patterns of the use of punctuation across the grouped languages and across the individual languages, which is reflective of grammatical/prosodic information structuring in different language families or groups. The results for the 2-step approach also suggest that the proposed geographical grouping of the languages is justifiable.

Proficiency-level classification

We perform classification based on the proficiency levels to determine the impact punctuation has within each level of proficiency. It is expected that the use of punctuation will become more native-like as students improve their L2 knowledge, and that the influence of their native language will weaken. To test this hypothesis, we built a balanced dataset (from the point of view of proficiency levels) as a subset of the toefl11 dataset. The distribution of English proficiency levels in the toefl11 dataset is quite imbalanced, as shown in Table 3.2. To produce a balanced subset, we extract the same number of essays for each proficiency level (equal to the minimum number of essays for that level across the L1s in Table 3.2). We use both the imbalanced and balanced subsets of the toefl11 dataset to perform proficiency-level classification using POS n-grams without PMs, with PMs, and with PM n-grams. The 10-fold cross-validation results on the imbalanced and balanced subsets are shown in Tables 3.7 and 3.8, respectively.

Table 3.7: 10-fold cross-validation accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams for each proficiency level (imbalanced setting).

Low proficiency
Features                        | 1-step | 2-step | No.
POS n-grams w/o PMs             | 41.47  | 39.32  | 8,681
POS n-grams w/ PMs              | 46.71  | 45.49  | 13,311
Improvement:                    | 5.24   | 6.17   |
POS n-grams w/ PMs              | 46.71  | 45.49  | 13,311
POS n-grams w/ PMs + PM n-grams | 47.07  | 45.86  | 15,657
Improvement:                    | 0.36   | 0.37   |

Medium proficiency
POS n-grams w/o PMs             | 42.60  | 42.48  | 13,259
POS n-grams w/ PMs              | 51.34  | 51.51  | 24,800
Improvement:                    | 8.74   | 9.03   |
POS n-grams w/ PMs              | 51.34  | 51.51  | 24,800
POS n-grams w/ PMs + PM n-grams | 51.69  | 51.14  | 32,740
Improvement:                    | 0.35   | –0.37  |

High proficiency
POS n-grams w/o PMs             | 32.39  | 33.58  | 12,480
POS n-grams w/ PMs              | 42.05  | 43.93  | 23,006
Improvement:                    | 9.66   | 10.35  |
POS n-grams w/ PMs              | 42.05  | 43.93  | 23,006
POS n-grams w/ PMs + PM n-grams | 42.64  | 43.12  | 30,292
Improvement:                    | 0.59   | –0.81  |

Table 3.8: 10-fold cross-validation accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams for each proficiency level (balanced setting).

Low proficiency
Features                        | 1-step | 2-step | No.
POS n-grams w/o PMs             | 37.87  | 38.22  | 8,471
POS n-grams w/ PMs              | 44.08  | 42.76  | 12,900
Improvement:                    | 6.21   | 4.54   |
POS n-grams w/ PMs              | 44.08  | 42.76  | 12,900
POS n-grams w/ PMs + PM n-grams | 45.07  | 44.70  | 15,141
Improvement:                    | 0.99   | 1.94   |

Medium proficiency
POS n-grams w/o PMs             | 36.00  | 35.52  | 8,953
POS n-grams w/ PMs              | 43.36  | 44.11  | 13,996
Improvement:                    | 7.36   | 8.59   |
POS n-grams w/ PMs              | 43.36  | 44.11  | 13,996
POS n-grams w/ PMs + PM n-grams | 44.12  | 42.26  | 16,574
Improvement:                    | 0.76   | –1.85  |

High proficiency
POS n-grams w/o PMs             | 31.93  | 31.31  | 9,367
POS n-grams w/ PMs              | 37.25  | 39.39  | 14,992
Improvement:                    | 5.32   | 8.08   |
POS n-grams w/ PMs              | 37.25  | 39.39  | 14,992
POS n-grams w/ PMs + PM n-grams | 36.77  | 39.56  | 18,331
Improvement:                    | –0.48  | 0.17   |

As mentioned earlier, higher proficiency levels are expected to lead to a use of punctuation closer to that of a native speaker of the L2. In that case, we would expect a lower improvement in performance when adding punctuation-based features. However, the results presented in Tables 3.7 and 3.8 indicate that while the L1 classification results based on POS n-grams are lower for the high proficiency level, the improvement obtained by adding the punctuation features remains substantial at the higher proficiency levels. Adding PM n-grams provides additional improvement in the majority of cases. A key finding from this experiment is that English learners keep their L1 punctuation style even when achieving high L2 proficiency, which is evidenced by the high improvement obtained using punctuation marks for high-proficiency learners in both the imbalanced and balanced settings.

Cross-topic experiments

The toefl11 dataset is composed of essays written in response to different topics/prompts. There are eight topics in the dataset, all of which are represented in all 11 L1 groups. When evaluating the performance of an NLI system, there is a risk that the system captures topic information (especially when using lexical features) and not the characteristics of a native language as such. Therefore, we investigate the impact of punctuation features under cross-topic conditions to show that these features are robust indicators of L1, regardless of the topics represented in the dataset.

We conduct experiments on the toefl11-17 dataset, splitting it in two ways. First, we split the dataset into folds based on the topics, so that a topic is present in only one fold. Since the dataset covers eight different topics, this setting corresponds to 8-fold cross-validation. The results for this experiment are shown in Table 3.9.

Table 3.9: 10-fold cross-validation and one fold/topic setting accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams.

                                  TOEFL11-17 (10FCV)   TOEFL11-17 (topic = fold)
Features                        | Acc.   | No.       | Acc.   | No.
POS n-grams w/o PMs             | 38.88  | 14,993    | 33.88  | 14,636
POS n-grams w/ PMs              | 48.83  | 29,763    | 43.21  | 28,635
Improvement:                    | 9.95   |           | 9.33   |
POS n-grams w/ PMs              | 48.83  | 29,763    | 43.21  | 28,635
POS n-grams w/ PMs + PM n-grams | 49.62  | 41,211    | 43.48  | 39,231
Improvement:                    | 0.79   |           | 0.27   |

Next, we use the 5,838 essays written on the first four prompts for training and the 6,262 essays written on the other four prompts for testing. To compare the result of this experiment with a mixed-topic scenario with approximately the same number of essays for training and testing, we split the toefl11-17 dataset using half of the essays on each prompt for training (6,050 essays) and the other half for testing (6,050 essays). The results for this experiment are presented in Table 3.10.

Table 3.10: Mixed- and cross-topic settings accuracy (%) for POS n-grams without PMs, with PMs, and with PM n-grams.

                                  TOEFL11-17 (mixed-topic)   TOEFL11-17 (cross-topic)
Features                        | Acc.   | No.             | Acc.   | No.
POS n-grams w/o PMs             | 36.63  | 13,174          | 32.27  | 13,042
POS n-grams w/ PMs              | 45.95  | 24,215          | 40.74  | 23,950
Improvement:                    | 9.32   |                 | 8.47   |
POS n-grams w/ PMs              | 45.95  | 24,215          | 40.74  | 23,950
POS n-grams w/ PMs + PM n-grams | 46.68  | 31,659          | 41.36  | 31,604
Improvement:                    | 0.73   |                 | 0.62   |

The results for cross-topic classification presented in Tables 3.9 and 3.10 indicate that separating the training and test data based on topics leads to an accuracy drop of approximately 5% in both the cross-validation and train/test split conditions. Nevertheless, in both settings adding the punctuation-based features leads to very similar increases in performance whether the topics are separated or mixed. This indicates that the punctuation-based features are robust and portable across topics.

Cross-corpus experiments

The focus of our cross-corpus experiments is to show that the punctuation-based features are robust and portable not only across different topics but also across different datasets, and are able to partly compensate for the drop in accuracy under challenging cross-corpus conditions. For the cross-corpus experiments, we extract subsets of the two datasets that represent the same languages. The toefl11-17 and icle datasets have seven languages in common: Chinese, French, German, Italian, Japanese, Spanish, and Turkish. We extract the subsets corresponding to these languages from the two corpora and use each in turn for training and testing, respectively. For this experiment, we did not balance the icle dataset and used all the essays for each of the selected languages. The number of essays per class in the icle dataset is shown in Table 3.11.

Table 3.11: Number (No.) of essays per class in the icle dataset used for the cross-corpus experiment.

Language | No. of essays
Chinese  | 982
French   | 347
German   | 437
Italian  | 392
Japanese | 366
Spanish  | 251
Turkish  | 280
Total    | 3,055

The results for the cross-corpus experiment (training on toefl and testing on icle, and the other way around) are shown in Table 3.12. 10FCV stands for 10-fold cross-validation on the training data (accuracy, %, and macro-averaged F1-score, %). We included the macro-averaged F1-score (henceforth, F1-score/F1) for this experiment because this setting is imbalanced. The F1-score is calculated by first computing the F1-score for each class, i.e., the harmonic mean of precision and recall, and then taking the average across all classes [82]; it is often used as an evaluation metric in imbalanced settings, since it favours consistent performance across classes rather than measuring global performance across all samples.
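In symbols, following the definition above (with P_c and R_c the precision and recall for class c, and C the set of classes):

\[
F_1^{\text{macro}} \;=\; \frac{1}{|C|} \sum_{c \in C} \frac{2\,P_c R_c}{P_c + R_c}
\]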

Table 3.12: Cross-corpus classification results for POS n-grams without PMs, with PMs, and with PM n-grams.

Training on TOEFL, testing on ICLE
                                  10FCV            Test set
Features                        | Acc.  | F1    | Acc.  | F1    | No.
POS n-grams w/o PMs             | 48.62 | 48.51 | 43.27 | 40.70 | 13,587
POS n-grams w/ PMs              | 60.29 | 60.22 | 54.50 | 53.05 | 25,870
Improvement:                    | 11.67 | 11.71 | 11.23 | 12.35 |
POS n-grams w/ PMs              | 60.29 | 60.22 | 54.50 | 53.05 | 25,870
POS n-grams w/ PMs + PM n-grams | 60.57 | 60.50 | 54.70 | 53.44 | 35,235
Improvement:                    | 0.28  | 0.28  | 0.20  | 0.39  |

Training on ICLE, testing on TOEFL
                                  10FCV            Test set
Features                        | Acc.  | F1    | Acc.  | F1    | No.
POS n-grams w/o PMs             | 79.47 | 75.21 | 34.22 | 32.26 | 13,730
POS n-grams w/ PMs              | 86.67 | 83.82 | 41.64 | 39.72 | 26,890
Improvement:                    | 7.20  | 8.61  | 7.42  | 7.46  |
POS n-grams w/ PMs              | 86.67 | 83.82 | 41.64 | 39.72 | 26,890
POS n-grams w/ PMs + PM n-grams | 87.07 | 84.49 | 41.73 | 39.81 | 40,385
Improvement:                    | 0.40  | 0.67  | 0.09  | 0.09  |

We note that while the loss in performance when training on toefl and testing on icle is relatively small (5%–7% in terms of both accuracy and F1-score), training on icle and testing on toefl leads to much higher drops (43%–45%). A drop under cross-corpus conditions is expected and is in line with previous NLI research [8, 35]. However, despite the loss in performance suffered by the model based on POS n-gram features, the PM features lead to the same increase in performance at test time as they did in training; that is, their cross-corpus performance is similar to their within-corpus performance, which provides additional evidence of their robustness in NLI.

The asymmetric results of the cross-corpus experiment, i.e., the different results across datasets under cross-corpus conditions, can be explained by multiple differences between the two datasets: (i) size – toefl is almost twice the size of icle in terms of the number of documents; (ii) essay length – icle is composed of longer essays than toefl; (iii) proficiency – icle contains essays written only by high-proficiency learners, whereas toefl contains three different proficiency levels, meaning there is more variation in language (including more typos, grammatical errors, etc., indicated by the fact that toefl has about 200,000 more features than icle). Any of these characteristics (or a combination thereof) can contribute to the differences in the results of the cross-corpus experiments. However, by maintaining the same conditions when using/not using punctuation marks, we can still investigate their impact under cross-corpus conditions.

In order to reveal the most useful punctuation marks for the task, we analyzed the punctuation patterns with the highest weights and conducted an ablation study (removing punctuation marks one by one from the setting in which all PMs are considered), which showed that the performance does not come from any single pattern, but from their combination.

In this section, we showed that punctuation marks are robust indicators of the author's native language. Their impact is consistent for the 1- and 2-step classification approaches, for proficiency-level classification, and under cross-topic and cross-corpus NLI conditions. We also presented the results for PM n-grams, which were designed to model punctuation information in order to capture patterns of the use of PMs. PM n-grams further improve NLI results in the majority of cases, even when added to a representation where PMs are already considered.

We conducted the same set of experiments for FW unigram features: without PMs, with PMs, and with PM n-grams. The results for these experiments are presented in Appendix B. They provide additional evidence for the robustness of punctuation features as L1 indicators in the NLI task.

3.3 Emotions as NLI Features

Words usually serve as the basis for morphological, lexical, grammatical, and etymological features. However, apart from this information, words also have an emotional dimension. Our hypothesis is that the way authors express their emotions in L2 production is influenced by their L1 idiosyncrasies. In order to model emotion information, we propose emotion polarity and emotion load features, which we describe further in this section.

3.3.1 Emotion-based features

First, we examine whether emotionally charged words (henceforth, emotion words) provide useful information for the NLI task. We compare the performance of emotion words with random words of the same feature set size on the toefl11-17 and icle datasets. As emotion words, we consider all words listed in the NRC Word-Emotion Association Lexicon (NRC emotion lexicon) [48], even those not included in any affect category; see Table 3.15 (removing the latter words deteriorated our results; studying this phenomenon will be a topic of our future work). The NRC emotion lexicon contains a list of 14,182 emotion words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were done manually through Amazon Mechanical Turk (https://www.mturk.com).

For the evaluation of emotion-based features, we use the same datasets (the toefl11-17 and 7-way icle datasets) and the same experimental setup as for the evaluation of punctuation-based features: the liblinear scikit-learn implementation of Support Vector Machines (SVM) with the OvR (one vs. the rest) multi-class strategy and the term frequency (tf) weighting scheme. Just as with the punctuation experiments, we measure the results in terms of classification accuracy under 10-fold cross-validation.

Table 3.13 presents the 10-fold cross-validation results (accuracy, %) on the toefl11-17 and icle datasets when using emotion words and random words of the same feature set size as features, as well as the results when excluding emotion words and random words from the bag-of-words (BoW) approach. Random-word accuracy was calculated as the average over five experiments with five different sets of random words.

Table 3.13: Performance of emotion words and random words.

                      TOEFL11-17           ICLE
Features            | Acc., % | No.    | Acc., % | No.
BoW                 | 68.65   | 61,339 | 80.65   | 20,032
Random words        | 36.15   | 8,187  | 70.21   | 6,465
Emotion words       | 46.75   | 8,187  | 72.86   | 6,465
BoW – random words  | 66.68   | 53,152 | 76.83   | 13,567
BoW – emotion words | 63.11   | 53,152 | 75.19   | 13,567

As one can see from Table 3.13, emotion words show higher accuracy than random words when evaluated in isolation. Moreover, the accuracy drop is higher when excluding emotion words from the BoW approach than when excluding random words. These results are consistent across the two datasets. The obtained results suggest that there are commonalities in the use of emotions across different L1 groups, and that emotion words provide useful information for the NLI task.

Table 3.14 presents the number of emotion words per L1 and the ratio of the number of emotion words to the essay length in the toefl11-17 and icle datasets. It is noteworthy that in the toefl11-17 dataset, Hindi and Telugu native speakers use the largest numbers of emotion words when producing texts in English as L2. As we will show further in Subsection 3.5.2, Hindi and Telugu are the most problematic L1s to identify, with the highest degree of confusion between them. When normalizing the number of emotion words by the essay length, Korean native speakers have the highest ratio. Korean and Japanese are another problematic language pair with high confusion between the two languages, as we will also show in Subsection 3.5.2.

Table 3.14: Number of emotion words per L1 (No.) and number of emotion words divided by the total number of words (Ratio) in the toefl11-17 and icle datasets; sorted from the highest to the lowest.

TOEFL11                            ICLE
L1  | No.      L1  | Ratio, %      L1  | No.      L1  | Ratio, %
HIN | 96,184   KOR | 24.93         CZE | 20,162   CHI | 26.81
TEL | 88,979   HIN | 24.62         RUS | 20,142   BUL | 25.06
GER | 88,268   CHI | 24.32         BUL | 18,939   JPN | 24.74
CHI | 87,486   TEL | 24.19         SPA | 17,187   RUS | 24.72
TUR | 83,945   JPN | 24.15         CHI | 16,794   FRE | 23.88
KOR | 82,878   TUR | 23.90         FRE | 16,750   CZE | 23.81
FRE | 82,454   FRE | 23.30         JPN | 16,234   SPA | 23.33
SPA | 81,497   GER | 23.21
ITA | 75,339   ITA | 23.16
JPN | 73,740   SPA | 22.40
ARA | 69,156   ARA | 21.91

The results presented in Table 3.13 suggest that emotion words contain useful information for NLI (evidenced by the fact that they outperform random words of the same feature set size). Next, we describe the proposed way to model the emotion information.

Emotion polarity features

We showed above that speakers of some L1s use a greater number of emotionally charged words than speakers of other L1s in their L2 production. We suggest that, due to cultural identity and/or idiosyncrasies related to the author's native language, essays written by speakers of some L1s are more positively/negatively oriented than essays written by speakers of other L1s. To investigate this hypothesis, we used what we call emotion polarity (emoP) features, which we extract using the NRC emotion lexicon. In the NRC emotion lexicon, emotion associations are provided for each emotion word (target word). An example of the NRC emotion lexicon entry for the target word bad is shown in Table 3.15. The affect category represents eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, or disgust) and two sentiments (negative or positive). The association flag has one of two possible values: 0 indicates that the target word has no association with the affect category, whereas 1 indicates that there is an association.

Table 3.15: NRC emotion lexicon entry for the target word bad.

Target word (emotion word) | Affect category | Association flag
bad                        | anger           | 1
                           | anticipation    | 0
                           | disgust         | 1
                           | fear            | 1
                           | joy             | 0
                           | negative        | 1
                           | positive        | 0
                           | sadness         | 1
                           | surprise        | 0
                           | trust           | 0

Experiments and Results

our classification and found that a slightly higher threshold of 0.27 and 0.28, respectively, provided the best results. If the ratio for an essay was higher than the threshold, we considered the essay as highly emotional, otherwise as low emotional. In order to increase the effect of this property, we used for it 1,000 coordinates in the feature vector. Namely, for highly-emotional essays we assigned the values of 1 to 500 distinct features “highlyemotional-1”, “highly-emotional-2”, etc., and values of 0 to 500 distinct features “lowemotional-1”, “low-emotional-2”, etc.; for low emotional essays, the values assigned were the other way around, i.e., 500 values of 0 and 500 values of 1, correspondingly.

3.3.2

Emotion results

Following previous studies on NLI [36] and author profiling [60], as well as our experiments with punctuation, we provide the results when adding emotion-based features to POS n-gram features (n = 1–3, with PMs). The 10-fold cross-validation results in terms of accuracy (%) on the toefl11-17 and icle datasets are shown in Table 3.16. EmoP stands for emotion polarity; emoL stands for emotion load. Number of features (No.) is provided for each experiment. Table 3.16: 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3, with PMs), when adding emotion polarity (emoP) and emotion load (emoL) features. Features POS n-grams w/ POS n-grams w/ POS n-grams w/ POS n-grams w/ Improvement:

PM PM + emoP PM + emoL PM + emoP + emoL

TOEFL11-17 Acc., % No. 48.83 29,763 49.98 29,967 48.82 30,763 50.14 30,967 1.31

ICLE Acc., % No. 69.48 18,594 72.73 18,786 70.65 19,594 73.51 19,786 4.03

The results presented in Table 3.16 indicate that the combination of emotion-based features provides the highest improvement when added to the POS n-gram representation. When used in isolation, emotion polarity (emoP) features are more indicative than emotion load (emoL) features on the both datasets. The proposed emotion-based features improve the result, when added to POS n-gram features, by 1.31% and 4.03% in terms of classification accuracy on the toefl11-17 and icle datasets, respectively. Although the gains are statistically significant on the both datasets (McNemar’s test [47] with an α value of 0.05), adding emotion-based features resulted in a higher improvement on icle. This can due to the fact that icle contains essays written by only high proficiency level 58

Experiments and Results

learners, whereas toefl contains three proficiency levels; it also can be due to the different L1 groups represented in the datasets and/or other peculiarities of the data described previously in Section 3.1. We aim to investigate the reason for this asymmetric improvement in future work. We also conducted experiments with emoP and emoL features using their n-gram representation; however, this representation led to only marginal accuracy improvement in some cases, while resulted in a significant increase of the size of the feature set, and therefore, was discarded.

We also evaluated the performance of emotion polarity (emoP) features when added to POS n-grams (with PM n-grams) and to function word unigrams (with PM n-grams) using the same set of experiments as used for the evaluation of punctuation-based features, e.g., multi-class classification, 2-step classification, classification for each proficiency-level represented in the toefl11-17 dataset, and cross-topic and cross-corpus settings. As mentioned earlier, POS n-grams and function words constitute a core set of standard features for the NLI task not only for L2 English data but for other languages as well [36]. We use POS n-grams and function word representations of the documents to which we add emotion polarity features.

As with punctuation-based features, 2-step classification was used to determine whether there are commonalities in the use of emotions across languages within the same family/geographical group, proficiency-level classification was done to evaluate the impact the emotion polarity features have within each level of proficiency, when cross-topic and crosscorpus experiments were carried our to evaluate the strength of emoP features with respect to the topic/corpus variations. The results for this experiments on the toefl11-17 and the icle datasets are shown in Appendix C. The obtained results indicate that emotion polarity features improve the results in the vast majority of settings and are indicative of the author’s native language even in essay domain.

While Rangel and Rosso [60] showed that the use of emotions depends on the author's age and gender, our results suggest that the way people express their emotions also depends on their L1, which is evidenced by the contribution of the emotion-based features to the NLI task, and that the proposed way of modelling emotion information is indicative for identifying the first language of the author.

3.4 Combination of Punctuation and Emotion Features

We conducted a set of experiments combining punctuation-based and emotion-based features to examine to what degree the combination of these features contributes to the NLI task. As baselines, we used POS n-grams without PMs and function word (FW) unigrams. Table 3.17 presents the results obtained when adding punctuation- and emotion-based features to POS n-grams; the results for FW unigrams with and without punctuation- and emotion-based features are presented in Table 3.18. The results are provided for each experiment in terms of 10-fold cross-validation accuracy (%) for the 1-step and 2-step classification approaches on the toefl11-17 and icle datasets.

Table 3.17: 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) without PMs, with PMs, with PM n-grams (n = 2, 5), and with emotion polarity and emotion load features; 1- and 2-step approaches.

                                                TOEFL11-17                     ICLE
Features                                      | 1-step | 2-step | No.    | 1-step | 2-step | No.
Majority baseline                             | 9.09   | 9.09   | –      | 14.29  | 14.29  | –
Baseline (POS n-grams w/o PMs)                | 38.88  | 39.08  | 14,993 | 59.87  | 51.04  | 10,316
POS n-grams w/ PMs                            | 48.83  | 48.43  | 29,763 | 69.48  | 65.19  | 18,594
Improvement:                                  | 9.95   | 9.35   |        | 9.61   | 14.15  |
POS n-grams w/ PMs                            | 48.83  | 48.43  | 29,763 | 69.48  | 65.19  | 18,594
POS n-grams w/ PMs + PM n-grams               | 49.62  | 49.42  | 41,211 | 70.39  | 66.49  | 25,665
Improvement:                                  | 0.79   | 0.99   |        | 0.91   | 1.30   |
Punctuation improvement:                      | 10.74  | 10.34  |        | 10.52  | 15.45  |
POS n-grams w/ PMs + PM n-grams               | 49.62  | 49.42  | 41,211 | 70.39  | 66.49  | 25,665
POS n-grams w/ PMs + PM n-grams + emoP        | 50.51  | 50.17  | 41,415 | 73.38  | 69.74  | 25,857
POS n-grams w/ PMs + PM n-grams + emoL        | 49.49  | 49.50  | 42,211 | 70.65  | 68.44  | 26,665
POS n-grams w/ PMs + PM n-grams + emoP + emoL | 50.66  | 50.30  | 42,415 | 74.68  | 71.69  | 26,857
Emotion improvement:                          | 1.04   | 0.88   |        | 4.29   | 5.20   |
Total improvement:                            | 11.78  | 11.22  |        | 14.81  | 20.65  |

The combination of punctuation-based and emotion-based features improves classification accuracy by 11.78% (1-step) and 11.22% (2-step) when added to POS n-gram features on the toefl11-17 dataset. On the icle dataset the obtained improvements are even higher: 14.81% and 20.65% for the 1-step and 2-step approaches, respectively. When added to FW unigrams, the improvements are also high: 8.11% (1-step) and 7.39% (2-step) on the toefl11-17 dataset, and 15.97% (1-step) and 16.50% (2-step) on the icle dataset. Although the improvements were obtained mainly through the contribution of punctuation-based features, emotion-based features additionally enhanced the performance, proving themselves useful for the NLI task. It is noteworthy that, as in the previous experiment with POS n-grams (shown in Table 3.16), the combination of emotion-based features generally provides the best results, even though emotion load features, when used in isolation, improve the results only in some cases.

Table 3.18: 10-fold cross-validation accuracy (%) for FW unigrams without PMs, with PMs, with PM n-grams (n = 2,5) and with emotion polarity and emotion load features; 1- and 2-step approaches.

                                          TOEFL11-17                 ICLE
Features                           1-step  2-step  No.       1-step  2-step  No.
Majority baseline                   9.09    9.09   –         14.29   14.29   –
Baseline (FWs w/o PMs)             27.89   31.94   127       50.13   46.36   127
FWs w/ PMs                         31.90   37.00   163       58.31   55.97   158
Improvement:                        4.01    5.06             8.18    9.61
FWs w/ PMs                         31.90   37.00   163       58.31   55.97   158
FWs w/ PMs + PM n-grams            34.08   37.05   11,611    59.22   58.70   7,229
Improvement:                        2.18    0.05             0.91    2.73
Punctuation improvement:            6.19    5.11             9.09    12.34
FWs w/ PMs + PM n-grams            34.08   37.05   11,611    59.22   58.70   7,229
FWs w/ PMs + PM n-grams + emoP     33.44   39.33   11,815    64.94   60.78   7,421
FWs w/ PMs + PM n-grams + emoL     33.93   36.74   12,611    58.18   58.57   8,229
FWs w/ PMs + PM n-grams
  + emoP + emoL                    36.00   38.89   12,815    66.10   62.86   8,421
Emotion improvement:                1.92    2.28             6.88    4.16
Total improvement:                  8.11    7.39            15.97   16.50
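The emotion polarity and emotion load features themselves were defined in Section 3.3 and are not repeated here. As a rough illustration of how lexicon-based features of this kind can be computed from a word-emotion lexicon such as the NRC lexicon [48], the following is a minimal sketch; the toy lexicon and the normalization by document length are illustrative assumptions, not the thesis implementation.

```python
from collections import Counter

def emotion_features(tokens, lexicon):
    """Sketch of lexicon-based emotion features. `lexicon` maps a word
    to its emotion/polarity labels, e.g., {"happy": {"joy", "positive"}}."""
    label_counts = Counter()
    emotion_words = 0
    for tok in tokens:
        labels = lexicon.get(tok.lower())
        if labels:
            emotion_words += 1
            label_counts.update(labels)
    n_tokens = max(len(tokens), 1)
    load = emotion_words / n_tokens  # emotion-load-style ratio
    polarity = {lab: c / n_tokens for lab, c in label_counts.items()}
    return load, polarity

toy_lexicon = {"great": {"joy", "positive"}, "sad": {"sadness", "negative"}}
print(emotion_features("this is a great day".split(), toy_lexicon))
```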

3.5 Method for NLI

One of the goals of this dissertation is to develop a robust NLI method that identifies the author’s first language with high accuracy. In this section, we describe our NLI method (the cic-fbk method) developed for the NLI Shared Task 2017 and present the obtained results. Then, we focus on feature selection and feature filtering in order to improve the developed method, which includes adding punctuation mark (PM) n-grams and emotion-based features, as well as replacing typed character n-grams by untyped character n-grams and fine-tuning the size of the feature set.

3.5.1 CIC-FBK: Method for NLI

Our method takes the machine-learning perspective, approaching the task as a multi-class classification problem. A basic workflow for a machine-learning text classification task is shown in Figure 3.1. Though all the steps presented in Figure 3.1 are essential for obtaining high classification results, we focus on feature selection and feature filtering, since they directly affect the performance of a classification algorithm. We use a wide range of features and the Support Vector Machines (SVM) algorithm. In this subsection, we describe the features employed, the experimental setup, and the obtained results.

Figure 3.1: Basic workflow for a text classification task.
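To make the workflow in Figure 3.1 concrete, the following is a minimal sketch of such a pipeline in scikit-learn [54]: feature extraction followed by a linear SVM. The toy corpus is a placeholder, and the actual method described below inserts a log-entropy weighting step between the two stages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; in the experiments the documents are TOEFL11/ICLE
# essays and the labels are the represented L1 classes.
texts = ["I am agree with this affirmation .", "This are a very actual topic ."]
labels = ["Spanish", "German"]

# Feature extraction (here: word n-grams, n = 1-3) -> linear SVM.
model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["I am agree with you ."]))
```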

CIC-FBK features

The features used in our method are word n-grams, lemma n-grams, part-of-speech n-grams, function words, character n-grams from misspelled words, typed character n-grams, and syntactic dependency-based n-grams of words and of syntactic relation tags. In the following, we describe each feature type in detail.

Word, lemma, and POS n-grams

Word and lemma (dictionary representation of words, i.e., words stripped of morphological marking) features represent the lexical choices of a writer, while part-of-speech (POS) features capture the morpho-syntactic patterns in a text. Combining word, lemma, and POS n-gram features (n-grams are contiguous sequences of n such features) has proved to be highly indicative for the NLI task and was included in the majority of the previous state-of-the-art NLI approaches, including the winning approach in the NLI Shared Task 2013 [23]. Following previous work on the NLI task [23, 37], we use word, lemma, and POS n-grams with n ranging from 1 to 3.
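As a hedged illustration of these features (the thesis pipeline relies on TreeTagger [64] for lemmas and POS tags; NLTK is used here only as a readily available stand-in, and lemma n-grams are omitted for brevity), the following sketch builds word and POS n-grams with n = 1–3 for an invented sample sentence.

```python
import nltk
from nltk.util import ngrams

# Requires one-time downloads, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
text = "The questions are quite difficults for me."
tokens = nltk.word_tokenize(text.lower())
pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]

# Contiguous n-grams with n = 1-3 over word tokens and over POS tags.
word_ngrams = [g for n in range(1, 4) for g in ngrams(tokens, n)]
pos_ngrams = [g for n in range(1, 4) for g in ngrams(pos_tags, n)]
```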

In cases where topic bias is pervasive, that is, topics are not evenly distributed across the classes, word n-grams inevitably capture topic-specific information as well [5], but they have been used successfully in topic-balanced settings [23]. Since the topics in the toefl11 dataset are considered near-equally represented, and the data is considered not to contain topic bias [74], we incorporate word and lemma n-grams in our method. We include punctuation marks and split n-grams at full stops. We lowercase word and lemma n-grams and replace each digit by the same symbol (e.g., 12,345 → 00,000), as proposed in [45], in order to capture the number format (e.g., 00.000 vs. 00,000), which reflects a stylistic choice of the writer rather than the value of the number, which carries no stylistic information. Lemmas and POS tags were obtained using the TreeTagger software package [64].

Function words

Function word (FW) features were described earlier in Subsection 3.2.1, when presenting the punctuation mark and emotion experiments. We use a set of 318 English function words from the scikit-learn package [54]. Note that this function word list is different from the one used for the punctuation and emotion evaluation, where we were concerned with the contribution of punctuation- and emotion-based features, and not with the overall performance.

Spelling error character n-grams

Spelling errors have been used as features for NLI since the work by Koppel et al. [26]. Spelling errors are considered a strong indicator of an author’s L1, since they reflect L1 influences, such as sound-to-character mappings in L1. Recently, Chen et al. [13] introduced the use of character n-grams from misspelled words. The authors showed that adding spelling error character n-grams to other core features (word, lemma, and character n-grams) improves NLI classification accuracy. We extract 39,512 unique misspelled words from the training and development sets using the spell shell command. Then we build character n-grams (n = 4) from the extracted misspelled words. This size of spelling error character n-grams provided the highest result when compared to the other sizes we examined (n = 1, 2, 3, 5, and their combinations).

Typed character n-grams

Character-level features are sensitive to both the content and the form of a text and are able to capture lexical and syntactic information, as well as punctuation and capitalization information [70]. Typed character n-grams, i.e., character n-grams grouped into different categories, were introduced by Sapkota et al. [63]. The categories can be organized into three main super categories (affix-, word-, and punctuation-related n-grams). They are defined in Table 3.19 (we use the definitions as refined in [45]).

Table 3.19: Categories of character n-grams introduced by Sapkota et al. [63].

Affix character n-grams
  prefix        An n-gram that covers the first n characters of a word that is at least n + 1 characters long.
  suffix        An n-gram that covers the last n characters of a word that is at least n + 1 characters long.
  space-prefix  An n-gram that begins with a space and that does not contain any punctuation mark.
  space-suffix  An n-gram that ends with a space, that does not contain any punctuation mark, and whose first character is not a space.
Word character n-grams
  whole-word    An n-gram that encompasses all the characters of a word, and that is exactly n characters long.
  mid-word      An n-gram that contains n characters of a word that is at least n + 2 characters long, and that includes neither the first nor the last character of the word.
  multi-word    An n-gram that spans multiple words, identified by the presence of a space in the middle of the n-gram.
Punctuation character n-grams (abbreviated as punct)
  beg-punct     An n-gram whose first character is a punctuation mark, but whose middle characters are not.
  mid-punct     An n-gram whose middle character is a punctuation mark (for n = 3).
  end-punct     An n-gram whose last character is a punctuation mark, but whose first and middle characters are not.

As an example, consider the following sample sentence:

(2) John said, “Tom can repair it for 12 euros.”

The character n-grams (n = 3) for the sample sentence (2) for each of the categories are shown in Table 3.20; for clarity, spaces are represented by underscores. We use the same language-independent character n-gram categories introduced by Sapkota et al. [63]. Typed character n-grams have been shown to be predictive features for other related classification tasks, such as author profiling [43, 44] and discriminating between similar languages [17]. We use typed character n-grams with n = 4; this size was selected based on grid search.

Table 3.20: Character n-grams (n = 3) per category for the sample sentence (2) after applying the algorithm by Sapkota et al. [63].

SC     Category      N-grams
affix  prefix        Joh sai rep eur
affix  suffix        ohn aid air ros
affix  space-prefix  _sa _ca _re _it _fo _12 _eu
affix  space-suffix  hn_ om_ an_ ir_ it_ or_ 12_
word   whole-word    Tom can for
word   mid-word      epa pai uro
word   multi-word    n_s m_c n_r r_i t_f r_1 2_e
punct  beg-punct     ,_“ “To
punct  mid-punct     d,_ _“T
punct  end-punct     id, os.
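A minimal sketch of how the punctuation- and space-related categories from Table 3.19 can be assigned to a character 3-gram is given below. It is a simplification, not the full algorithm of [63, 45]: the word-related categories (prefix, suffix, whole-word, mid-word) require knowing the source word’s length and are collapsed into a single label here.

```python
from string import punctuation

# ASCII punctuation plus the curly quotes used in sample sentence (2).
PUNCT = set(punctuation) | set("“”‘’")

def category(gram):
    """Assign a Sapkota-style category to a character 3-gram (sketch)."""
    if gram[0] in PUNCT and gram[1] not in PUNCT:
        return "beg-punct"
    if gram[1] in PUNCT:
        return "mid-punct"
    if gram[2] in PUNCT and gram[0] not in PUNCT:
        return "end-punct"
    if gram[0] == " " and not PUNCT & set(gram):
        return "space-prefix"
    if gram[2] == " " and gram[0] != " " and not PUNCT & set(gram):
        return "space-suffix"
    if gram[1] == " ":
        return "multi-word"
    return "word-related"  # prefix/suffix/whole-word/mid-word need word context

text = "John said, “Tom can repair it for 12 euros.”"
grams = [text[i:i + 3] for i in range(len(text) - 2)]
for g in grams[:6]:
    print(repr(g), category(g))
```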

One of the shortcomings of typed character n-grams is that they cannot be used for high values of n (limited to n = 5 in our implementation).

Syntactic dependency-based n-grams

Various syntactic features, including production rules [80] and tree substitution grammars [72], have been previously explored for NLI. Tetreault et al. [74] experimented with Stanford parser [16] dependency features and concluded that they are strong indicators of structural differences in L2 writing. We exploit the Stanford dependencies to build syntactic n-gram features using the algorithm designed and made available in [57, 58]. Syntactic n-grams differ from traditional n-grams in the way the neighbours are taken: the neighbours follow syntactic relations in syntactic trees instead of the word order in the text. Syntactic n-grams allow introducing syntactic knowledge into machine-learning methods; however, syntactic parsing is required before their construction. The type of syntactic n-grams depends on the elements they are formed of: the atomic element of a syntactic n-gram can be a word, a part-of-speech (POS) tag, or a syntactic relation (SR) tag. There are two types of syntactic n-grams, continuous and non-continuous: continuous syntactic n-grams follow the syntactic path as one continuous line, without bifurcations, while non-continuous syntactic n-grams include bifurcations. Consider the following sample sentence:

(3) I remember this great experience.

The syntactic tree for the sample sentence (3) is shown in Figure 3.2.

Figure 3.2: Syntactic tree for the sample sentence (3).

The dependencies generated by the Stanford parser for the sample sentence (3) are the following: root(ROOT, remember), nsubj(remember, I), dobj(remember, experience), det(experience, this), amod(experience, great). These dependencies, including a backoff transformation based on POS, were used as features for NLI in [74]. According to the metalanguage proposed in [66], the syntactic 2-grams of words are the following: remember[I], remember[experience], experience[this], experience[great]; the syntactic 3-grams of words are: remember[I,experience], remember[experience[this]], remember[experience[great]], experience[this,great]; the syntactic 2-grams of syntactic relation tags are: root[nsubj], root[dobj], dobj[det], dobj[amod]; and the syntactic 3-grams of syntactic relation tags are: root[nsubj,dobj], root[dobj[det]], root[dobj[amod]], dobj[det,amod].

Here, the head element is to the left of the square brackets and the dependent elements are inside them; elements separated by a comma belong to non-continuous syntactic n-grams, that is, these elements are at the same level in the syntactic tree. Syntactic n-grams outperformed traditional n-grams in the task of authorship attribution [69] and have been applied in tasks related to L2, for example, automatic grammar correction for English as L2 [67]. In order to capture particular syntactic patterns that can serve as evidence of native language influence in L2 writing, we use continuous syntactic n-grams of words and of syntactic relation tags (n = 2–3); other values of n that we examined deteriorated the results.
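The thesis extracts these features with the algorithm of [57, 58]; the following is only a minimal sketch of how continuous syntactic 2-grams can be read off a list of dependency triples, with the dependencies of sample sentence (3) hard-coded as input.

```python
from collections import defaultdict

# Dependencies for sample sentence (3): (relation, head, dependent).
deps = [("root", "ROOT", "remember"), ("nsubj", "remember", "I"),
        ("dobj", "remember", "experience"), ("det", "experience", "this"),
        ("amod", "experience", "great")]

# Index dependents by head so we can walk head -> dependent arcs.
children = defaultdict(list)
rel_of = {}
for rel, head, dep in deps:
    children[head].append(dep)
    rel_of[dep] = rel  # relation by which each node attaches to its head

# Continuous syntactic 2-grams of words: head[dependent] for each arc.
word_sn2 = [f"{h}[{d}]" for h in children if h != "ROOT" for d in children[h]]
# Continuous syntactic 2-grams of SR tags: rel(head)[rel(dependent)].
sr_sn2 = [f"{rel_of[h]}[{rel_of[d]}]" for h in children if h in rel_of
          for d in children[h]]
print(word_sn2)  # remember[I], remember[experience], experience[this], ...
print(sr_sn2)    # root[nsubj], root[dobj], dobj[det], dobj[amod]
```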

Experiment Setup

For the evaluation of our method, we merged the toefl11-17 training and development sets (tokenized versions) and conducted experiments under 10-fold cross-validation. We measured the performance in terms of both classification accuracy and macro-averaged F1-score. The former was used as the evaluation metric in the majority of previous works on NLI, while the latter was the official evaluation metric in the NLI Shared Task 2017. However, since the toefl11-17 dataset is balanced in terms of the represented L1s, the results for accuracy and F1-score are very similar.

Frequency threshold

The fine-tuning of the feature set size, i.e., feature filtering, has proved to be a useful strategy for NLI [23] and other NLP tasks [70, 42]. Discarding low-frequency features, which are usually associated with topic-specific information, allows us to enhance the performance and to reduce the size of the feature set [42]. We selected the frequency threshold value that provided the highest 10-fold cross-validation result based on grid search: we consider only those features that occur in at least two documents in the training corpus (minimum document frequency (min_df) is equal to 2) and that occur at least four times in the entire training corpus (threshold is equal to 4).
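In scikit-learn terms, the min_df constraint is supported directly by the vectorizer, while the corpus-wide frequency threshold needs a small extra filtering step. A minimal sketch under those assumptions (the toy documents are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["toy document one", "toy document two", "another toy text"]

# min_df = 2: keep only features occurring in at least two documents.
vectorizer = CountVectorizer(min_df=2)
counts = vectorizer.fit_transform(docs)

# threshold = 4: additionally keep only features occurring at least
# four times in the entire training corpus.
totals = np.asarray(counts.sum(axis=0)).ravel()
keep = np.flatnonzero(totals >= 4)
filtered = counts[:, keep]
kept_features = vectorizer.get_feature_names_out()[keep]
```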

Weighting scheme

We use the log-entropy weighting scheme, which showed good results in previous studies on NLI [23, 13]. The log-entropy weighting scheme consists of a local weighting (denoted as $L_{\log}(i, j)$) and a global weighting (denoted as $G_{\mathrm{ent}}(i)$). The local weighting is calculated by taking the logarithm of the add-one smoothed term frequency:

\[ L_{\log}(i, j) = \log(\mathit{frequency}(i, j) + 1), \tag{3.2} \]

where $\mathit{frequency}(i, j)$ is the frequency of term $i$ with regard to document $j$. The global entropy weighting is calculated by the following formula:

\[ G_{\mathrm{ent}}(i) = 1 + \frac{\sum_{j=1}^{J} p_{ij} \log p_{ij}}{\log(J + 1)}, \tag{3.3} \]

where $J$ is the total number of documents in the corpus, $\sum_{j=1}^{J} p_{ij} \log p_{ij}$ is the additive inverse of the entropy of the conditional distribution given $i$, and

\[ p_{ij} = \frac{\mathit{frequency}(i, j)}{\sum_{j} \mathit{frequency}(i, j)}. \tag{3.4} \]

The final weight $W$ is calculated as follows:

\[ W = L_{\log}(i, j) \times G_{\mathrm{ent}}(i). \tag{3.5} \]
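A compact dense-matrix sketch of Equations 3.2–3.5 is given below; a real implementation would operate on the sparse document-term matrix produced by the vectorizer, and the small input matrix here is a placeholder.

```python
import numpy as np

def log_entropy(counts):
    """Log-entropy weighting (Eqs. 3.2-3.5) for a dense document-term
    count matrix of shape (J documents, terms)."""
    J = counts.shape[0]
    local = np.log(counts + 1.0)                        # Eq. 3.2
    col_sums = counts.sum(axis=0, keepdims=True)
    p = np.divide(counts, col_sums,
                  out=np.zeros_like(counts, dtype=float),
                  where=col_sums > 0)                   # Eq. 3.4
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log(J + 1.0)       # Eq. 3.3
    return local * g                                    # Eq. 3.5

weights = log_entropy(np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 1.0]]))
print(weights)
```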

Classifier

As mentioned earlier, Support Vector Machines (SVM) is considered to be among the best-performing classification algorithms for text categorization tasks; moreover, it was the classifier of choice for the majority of the teams in the NLI Shared Task 2013. Given the effectiveness of SVM demonstrated by numerous studies on text classification tasks in general and on NLI in particular, we adopt a linear SVM to perform the multi-class classification. We use the liblinear scikit-learn [54] implementation of SVM with the OvR (one vs. the rest) multi-class strategy. We performed a grid search over the liblinear SVM classifier hyperparameters, namely the penalty parameter (C), the loss function (loss), and the tolerance for the stopping criterion (tol), setting the penalty parameter C to 100 based on the classifier performance on the merged toefl11-17 training and development sets.
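A sketch of such a grid search with scikit-learn is shown below; the candidate value ranges are illustrative assumptions, except C = 100, which is the value selected in the text, and the fit call is commented out because it requires the weighted feature matrix described above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Grid over the hyperparameters named above (C, loss, tol); LinearSVC
# uses the liblinear solver with the OvR multi-class strategy.
param_grid = {
    "C": [1, 10, 100],
    "loss": ["hinge", "squared_hinge"],
    "tol": [1e-4, 1e-3],
}
search = GridSearchCV(LinearSVC(), param_grid, cv=10, scoring="accuracy")
# search.fit(X_train, y_train)  # X_train: log-entropy-weighted features
# print(search.best_params_)
```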

CIC-FBK results

Table 3.21 presents the performance of each feature type in isolation, and the result achieved when all the features are combined. The results are provided for the experimental setup described above in terms of 10-fold cross-validation accuracy (%) on the merged toefl11-17 training and development sets. The number of features (No.) after feature filtering is provided for each experiment.

Table 3.21: 10-fold cross-validation accuracy (%) for each feature type individually on the toefl11-17 dataset.

Features                                  Accuracy, %   No.
Word n-grams (n = 1–3)                    84.63         230,714
Lemma n-grams (n = 1–3)                   84.54         228,229
POS n-grams (n = 1–3)                     49.30         14,510
Function words                            50.04         302
Spelling error character 4-grams          37.79         12,322
Typed character 4-grams                   77.79         35,480
Syntactic n-grams of words (n = 2–3)      70.64         148,728
Syntactic n-grams of SR tags (n = 2–3)    23.61         5,344
Combination of the above                  86.40         726,494

In line with previous studies on the NLI task [6, 73, 23, 13], word and lemma n-grams (n = 1–3) are the most indicative features in our approach. When evaluated in isolation, word and lemma n-grams showed 84.63% and 84.54% 10-fold cross-validation accuracy, respectively. Word n-grams alone, however, account for almost all of the overall accuracy achieved when all the features are combined: combining word n-grams with all the other features contributed 1.77% to the overall 10-fold cross-validation accuracy. Typed character n-grams performed well in isolation with a much smaller feature set (35,480 features), achieving 77.79% accuracy. Syntactic n-grams of syntactic relation tags showed the lowest accuracy, 23.61%, when evaluated in isolation; nonetheless, this is more than twice the majority baseline of 9.09%. These features contributed 0.2% to the overall accuracy when combined with the other feature types presented in Table 3.21. The combination of all the features showed 86.40% 10-fold cross-validation accuracy on the merged toefl11-17 training and development sets (henceforth, toefl11-17 training) and 88.09% accuracy on the toefl11-17 test set, which allowed our team to share the first rank in the NLI Shared Task 2017 scoring and to achieve the second-best result in the competition. As mentioned in Subsection 2.2.2, this result differs by 0.09% in terms of classification accuracy from the highest result achieved by the ItaliaNLP Lab team [14] (88.18% accuracy), which corresponds to one correctly predicted label.

3.5.2 Improving the CIC-FBK method

The cic-fbk method, described above, showed 88.09% accuracy and 88.08% F1-score on the toefl11-17 test set and shared the first rank in the NLI Shared Task 2017 scoring. In this subsection, we focus on feature selection aimed at improving the developed cic-fbk method, which includes adding PM n-grams and emotion-based features, introduced in Sections 3.2 and 3.3, respectively.

Experiments on the TOEFL11-17 dataset

In our experiments on the toefl11-17 dataset under 10-fold cross-validation, as well as on the complete icle dataset, typed character n-grams outperformed untyped character n-grams of the same size in the majority of the examined configurations; however, on the toefl11-17 test set, untyped character 3-grams showed higher classification accuracy, as shown in Table 3.22. The results provided in Table 3.22 show the performance of typed and untyped character n-grams (n = 3 and n = 4) when these features are used in combination with the other feature types presented in Table 3.21. The results are provided in terms of accuracy (%) and F1-score (%) on the toefl11-17 training dataset under 10-fold cross-validation and on the toefl11-17 test set.

Table 3.22: Comparison of typed and untyped character n-grams (n = 3/n = 4) on the toefl11-17 training and test sets when used in combination with the other feature types employed in the cic-fbk method.

                       TOEFL11-17
                  10FCV            Test Set
Features          Acc.     F1      Acc.     F1      No.
Typed 4-grams     86.40    86.39   88.09    88.08   726,494
Untyped 4-grams   86.41    86.40   87.91    87.90   719,447
Typed 3-grams     86.37    86.36   88.00    87.97   703,529
Untyped 3-grams   86.36    86.35   88.27    88.27   699,535

As one can see from Table 3.22, untyped character 3-grams outperform the other examined character n-gram features on the toefl11-17 test set. Since the toefl11-17 test set was used for the evaluation of the NLI Shared Task 2017 submissions, in our further experiments we replace typed character 4-grams by untyped character 3-grams and refer to this as the modified cic-fbk method. The result achieved by combining untyped character 3-grams with the other features (see Table 3.21) on the toefl11-17 test set (88.27% accuracy and F1) is already higher than the best result in the NLI Shared Task 2017, albeit by only one correctly predicted label. Next, we add PM n-grams (see Section 3.2) and emotion-based features, namely emotion polarity and emotion load (see Section 3.3), to the modified cic-fbk method and evaluate its performance on the toefl11-17 training and test sets. The results for the different variants of our method, i.e., the original cic-fbk method, the modified cic-fbk method, and the modified cic-fbk method with PM n-grams (n = 2,5) and emotion-based features, on the toefl11-17 training set under 10-fold cross-validation and on the toefl11-17 test set are presented in Table 3.23. The highest result in the competition [14] and the number of features (No.) for each experiment are also provided. As the baseline, we use the original cic-fbk method.

Table 3.23: Results (accuracy, % and F1-score, %) for the original cic-fbk method, the modified cic-fbk method, and the modified cic-fbk method with PM n-grams (n = 2,5) and emotion-based features on the toefl11-17 training and test sets.

                                                           TOEFL11-17
                                                      10FCV            Test Set
Features                                              Acc.     F1      Acc.     F1      No.
Shared task best [14]                                 –        –       88.18    88.18   –
Original CIC-FBK method (baseline)                    86.40    86.39   88.09    88.08   726,494
Modified CIC-FBK method                               86.36    86.35   88.27    88.27   699,535
Modified CIC-FBK method + PM n-grams                  86.45    86.43   88.45    88.45   702,138
Modified CIC-FBK method + emoP                        86.42    86.41   88.09    88.09   699,702
Modified CIC-FBK method + emoL                        86.26    86.25   87.91    87.90   700,535
Modified CIC-FBK method + PM n-grams + emoP           86.52    86.51   88.55    88.54   702,305
Modified CIC-FBK method + PM n-grams + emoL           86.28    86.27   88.18    88.16   703,138
Modified CIC-FBK method + emoP + emoL                 86.24    86.23   87.91    87.89   700,702
Modified CIC-FBK method + PM n-grams + emoP + emoL    86.29    86.28   88.18    88.16   703,305
Improvement over the original CIC-FBK method:         0.12     0.12    0.46     0.46

PM n-grams and emotion polarity features slightly contribute to the performance (note that PMs were already included in our method as part of the character n-gram and word n-gram features), pushing 10-fold cross-validation accuracy, when used in combination, from 86.36% to 86.52% and accuracy on the test set from 88.27% to 88.55% (we refer to this as the improved cic-fbk method), while adding emotion load features did not improve the results. The improvement achieved by replacing typed character 4-grams by untyped character 3-grams, combined with adding PM n-grams (n = 2,5) and emotion polarity features, is not statistically significant when compared to the original cic-fbk method (according to McNemar’s statistical significance test with α = 0.05); however, this improvement would have allowed us to overcome the highest result in the shared task and would have placed our team first in the NLI Shared Task 2017 ranking.

Figure 3.3 presents the confusion matrices for the original cic-fbk method on the toefl11-17 test set (top) and for the improved cic-fbk method (bottom), that is, the modified cic-fbk method with PM n-grams and emotion polarity features. When comparing the two matrices, one can see that the highest level of confusion is still between the Hindi and Telugu classes (even though the result for Hindi was slightly improved, the result for Telugu deteriorated). This language pair was the most problematic one in the NLI Shared Tasks 2013 and 2017 (e.g., in the NLI Shared Task 2013, none of the approaches was able to reach 80.00% accuracy for Hindi). The low results for Hindi may also be due to the impact of proficiency: 58.4% of the essays by Hindi native speakers fall into the high-proficiency category, while the low-proficiency category represents only 2.6% of the essays. The highest improvement was achieved for Korean; however, Korean and Japanese remain another problematic language pair, in which Korean native speakers are often classified as Japanese. In both of the most confused language pairs, the two languages belong to different language families but are spoken in geographically close territories. There have been some attempts to explain the high degree of confusion between the L1s in these two pairs; e.g., Malmasi et al. [41] suggest that the reason for the high confusion between Hindi and Telugu could be related to the similar way English is taught in those countries. Though a more fine-grained error analysis is required to understand the high confusion between these language pairs, our observation that the most confused L1s (Hindi and Telugu) use the highest number of emotion words in their L2 writing, while Korean native speakers show the highest ratio of emotion words to the total number of words in their essays, as shown in Table 3.14, could reveal some insights for developing an approach that tackles these two language pairs in isolation in order to improve the overall performance on the toefl11 dataset.

Figure 3.3: Confusion matrices for the original cic-fbk method (top) and the improved cic-fbk method (bottom) on the toefl11-17 test set.

The highest result was achieved for German native speakers, which is also in line with previous results on this dataset. Malmasi [33, p. 96] hypothesizes that this high result for German may be due to the fact that German is typologically the closest language to English among the L1s represented in the dataset, which may result in linguistic patterns that are possibly closer to native English writing.

Experiments on the ICLE dataset

We evaluate the above-described modifications of our method, i.e., replacing typed character 4-grams by untyped character 3-grams and adding punctuation-based and emotion-based features, on the 7-way icle dataset. To be able to compare the performance of our approach with other methods [80, 74], as well as with the previous state-of-the-art result on this 7-way icle dataset [22], we conducted experiments under 5-fold cross-validation. The results for this experiment, as well as the previous state-of-the-art accuracy, are shown in Table 3.24. As the baseline for this experiment, we use the original cic-fbk method (without modifications).

Table 3.24: 5-fold cross-validation (5FCV) results (accuracy, % and F1-score, %) for the original cic-fbk method, the modified cic-fbk method, and the modified cic-fbk method with PM n-grams (n = 2,5) and emotion-based features on the icle dataset.

                                                        ICLE 5FCV
Features                                              Acc.     F1      No.
Previous best [22]                                    91.30    –       –
Original CIC-FBK method (baseline)                    90.52    90.50   140,681
Modified CIC-FBK method                               91.43    91.44   127,249
Modified CIC-FBK method + PM n-grams                  91.43    91.46   128,670
Modified CIC-FBK method + emoP                        91.56    91.57   127,393
Modified CIC-FBK method + emoL                        91.95    91.93   128,249
Modified CIC-FBK method + PM n-grams + emoP           91.43    91.46   128,814
Modified CIC-FBK method + PM n-grams + emoL           92.08    92.05   129,670
Modified CIC-FBK method + emoP + emoL                 91.95    91.93   128,393
Modified CIC-FBK method + PM n-grams + emoP + emoL    91.95    91.93   129,814
Improvement over the original CIC-FBK method:         1.56     1.55

On the 7-way icle subset, untyped character n-grams outperform typed character n-grams when used in combination with the other feature types presented in Table 3.21. The result achieved by replacing typed character 4-grams with untyped character 3-grams (91.43% accuracy) is already higher than the previous state-of-the-art result of 91.30% accuracy. Unlike in our experiments with POS tags and FWs, and unlike the results achieved on the toefl11 dataset, here the best result is obtained when excluding the emotion polarity (emoP) features and using just the combination of PM n-grams and emotion load features (92.08% accuracy). The obtained improvement over the original cic-fbk method used as the baseline (1.56%) is statistically significant according to McNemar’s test with α = 0.05. This improvement allowed us to push the state-of-the-art result on the 7-way icle dataset from 91.30% to 92.08% in terms of classification accuracy (a 0.78% accuracy improvement). Since the predicted labels of the previous state-of-the-art approach [22] are not available, we cannot judge whether the improvement achieved at this stage is statistically significant. However, below we show that we achieve a statistically significant improvement over our own previous method, which already shows higher results than the state of the art. This suggests that the improvement over the state-of-the-art approach achieved by the version of our method presented further in this dissertation is likely to be statistically significant. The confusion matrices for the original cic-fbk method used as the baseline and for the improved cic-fbk method (the cic-fbk method after adding PM n-grams and emotion load features) on the 7-way icle dataset under 5-fold cross-validation are shown in Figure 3.4. It can be noted that an improvement was achieved for five out of seven L1s represented in the dataset.
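Statistical significance throughout this section is assessed with McNemar’s test [47] with the continuity correction [1]. A minimal sketch of such a comparison over the per-document predictions of two systems is given below; the prediction arrays are placeholders.

```python
import numpy as np
from scipy.stats import chi2

# Placeholder per-document outputs of two systems and the gold labels.
gold = np.array(["DE", "ES", "FR", "IT", "DE", "ES"])
sys_a = np.array(["DE", "ES", "IT", "IT", "FR", "ES"])
sys_b = np.array(["DE", "FR", "FR", "IT", "DE", "ES"])

# Discordant pairs: documents where exactly one system is correct.
b = np.sum((sys_a == gold) & (sys_b != gold))
c = np.sum((sys_a != gold) & (sys_b == gold))

# McNemar's chi-square statistic with continuity correction [1, 47].
stat = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(p_value, p_value < 0.05)  # significant at alpha = 0.05?
```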

Additional experiments: Feature filtering

Feature filtering, i.e., fine-tuning the size of the feature set, has proved to be of great importance in natural language processing tasks [17, 44, 45, 70] in general and in NLI in particular [23, 42]. When developing the cic-fbk method, we selected the optimal frequency threshold (minimum feature frequency) and the optimal minimum document frequency (min_df) values for the toefl11 dataset based on grid search: we consider only those features that occur in at least two documents in the training corpus (min_df = 2) and that occur at least four times in the entire training corpus (threshold = 4).

Figure 3.4: Confusion matrices for the original cic-fbk method (top) and the improved cic-fbk method (bottom) on the icle dataset.

Setting these frequency threshold and min_df values improves 10-fold cross-validation accuracy on the toefl11 dataset by about 1% compared to the configuration in which all the features are considered, and reduces the size of the feature set by approximately 90%. Next in this section, we further improve the performance of our NLI method on the icle dataset by fine-tuning the frequency threshold and min_df values, that is, by selecting the optimal size of the feature set for this corpus. We use the four best variants of our method presented in Table 3.24: (i) the modified cic-fbk method + PM n-grams + emoL; (ii) the modified cic-fbk method + emoL; (iii) the modified cic-fbk method + emoP + emoL; and (iv) the modified cic-fbk method + PM n-grams + emoP + emoL, and fine-tune the feature set size by varying the threshold and min_df values. The results for the variant of the method that provides the highest result after fine-tuning the feature set size (the modified cic-fbk method + PM n-grams + emoP + emoL) are presented further in this section, while the results for the other three variants are presented in Appendix D. First, we provide the 5-fold cross-validation results in terms of accuracy (%) and F1-score (%) for different frequency threshold values on the icle dataset. The results for this experiment are shown in Table 3.25. Figure 3.5 shows an alternative view of the results: the accuracy variation with respect to the minimum feature frequency.

Table 3.25: Accuracy (%) and F1-score (%) variation with respect to the frequency threshold on the icle dataset for the modified cic-fbk method + PM n-grams + emoP + emoL.

                            5FCV
Frequency threshold    Acc.     F1      No.
1 (all)                91.43    91.41   190,449
4 (baseline)           91.95    91.93   129,814
10                     92.99    92.98   48,115
11                     93.12    93.10   42,259
12                     93.25    93.23   40,715
13                     92.73    92.72   36,794
15                     92.73    92.72   32,599
20                     92.73    92.73   26,285
25                     91.82    91.82   21,586
30                     91.82    91.79   18,935
35                     91.56    91.54   16,659
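The sweep over threshold values is itself straightforward; a sketch is shown below, reusing the filtering step introduced earlier (the sweep call is commented out because it requires the actual corpus).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def accuracy_at_threshold(docs, labels, threshold, folds=5):
    """5-fold CV accuracy after discarding features that occur fewer
    than `threshold` times in the whole corpus (min_df fixed at 2)."""
    counts = CountVectorizer(min_df=2).fit_transform(docs)
    totals = np.asarray(counts.sum(axis=0)).ravel()
    X = counts[:, np.flatnonzero(totals >= threshold)]
    return cross_val_score(LinearSVC(), X, labels, cv=folds).mean()

# Sweep the values examined in Table 3.25 and keep the best:
# best = max((accuracy_at_threshold(docs, labels, t), t)
#            for t in (1, 4, 10, 11, 12, 13, 15, 20, 25, 30, 35))
```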

Figure 3.5: Accuracy (%) variation with respect to the frequency threshold on the icle dataset.

As one can see from Table 3.25 and Figure 3.5, the optimal frequency threshold value for the icle dataset (threshold = 12) improves the result by 1.3% in terms of 5-fold cross-validation accuracy and additionally reduces the size of the feature set by approximately 70% compared to the threshold of four used in the previous experiments. Next, we use this frequency threshold as the baseline and perform a grid search over the minimum document frequency (min_df) values in order to further enhance the performance. The results for this experiment are presented in Table 3.26; an alternative view of the accuracy variation with respect to min_df is shown in Figure 3.6.

Table 3.26: Accuracy (%) and F1-score (%) variation with respect to the minimum document frequency (min_df) on the icle dataset for the modified cic-fbk method + PM n-grams + emoP + emoL.

                     5FCV
min_df          Acc.     F1      No.
1               93.25    93.23   40,820
2 (baseline)    93.25    93.23   40,715
3               93.25    93.23   40,630
4               93.38    93.36   40,454
5               93.25    93.23   40,092
6               92.86    92.85   39,022
7               92.73    92.73   36,205

Figure 3.6: Accuracy (%) variation with respect to the minimum document frequency (min_df) on the icle dataset.

As can be seen by comparing the results presented in Tables 3.25–3.26 with the results presented in Appendix D, the threshold value of 12 is optimal for all the examined variants of our method; however, varying the min_df values allowed us to slightly improve the result only for the modified cic-fbk method + PM n-grams + emoP + emoL. The improvement obtained by fine-tuning the feature set size allowed us to achieve 93.38% 5-fold cross-validation accuracy on the 7-way icle dataset, which is 1.43% higher than the baseline of 91.95% (the performance of the method without fine-tuning the feature set size). This is a statistically significant improvement according to McNemar’s test with α = 0.05. The achieved improvement is even larger when compared to the previous state-of-the-art result of 91.30% [22] (a 2.08% improvement in terms of classification accuracy). Figure 3.7 presents the confusion matrices before and after fine-tuning the feature set size on the icle dataset for the modified cic-fbk method + PM n-grams + emoP + emoL. It can be noticed that the highest improvement was achieved for the two most problematic languages in this dataset: Spanish and French.

One of the reasons why relatively high frequency threshold values are more appropriate for the icle dataset, as evidenced by our results, may be that icle contains more topics than toefl11-17, while it has been shown that low-frequency features should be avoided in classification tasks under cross-topic conditions [45].

Our results showed that it is possible to significantly improve the accuracy by fine-tuning the feature set size, that is, by removing redundant information. On the icle dataset, relatively high frequency threshold values were found to be the most effective. We hypothesize that one of the reasons for this result may be the large variety of topics represented in the icle dataset; additional experiments on other datasets, however, are required to verify this conclusion.


Figure 3.7: Confusion matrix before (top) and after (bottom) selecting the optimal feature set size for the icle dataset.

Chapter 4

Conclusions and Future Work

4.1 Conclusions

In this dissertation, we addressed the task of automatic native language identification (NLI). NLI is the task of classifying the native language (L1) of an author based solely on his or her texts written in a second language (L2). The task is based on the assumption that an author’s texts produced in a different language are influenced by his or her first language. In second-language writing, influences of the native language on, for example, vocabulary and grammar are well studied; this phenomenon is called the language transfer effect. We investigated two aspects of language transfer that had not been previously explored in NLI research: punctuation and emotions.

We conducted a series of experiments to evaluate the impact of punctuation information on native language identification. We introduced novel features for this task: punctuation mark (PM) n-grams, designed to capture patterns of PM usage. We evaluated the impact of punctuation-based features when adding them to part-of-speech (POS) tag n-grams and function words. We started with the usual multi-class classification and a 2-step classification, grouping the languages by language family/geographical location. This experiment showed that there are commonalities across language families, as well as particularities that distinguish the use of punctuation for specific languages within a language family or geographical group. We also evaluated the use of punctuation for different proficiency levels. An interesting observation from this experiment is that the use of punctuation remains influenced by L1 even for learners with high proficiency in L2.

To verify the robustness of punctuation-based features as native language indicators, we performed cross-topic classification (training and testing on different sets of topics) and cross-dataset classification (training and testing on different datasets). The cross-topic and cross-corpus experiments showed that the influence of punctuation transcends topics and corpora. In these experiments, punctuation-based features were able to partly compensate for the loss in performance under cross-topic or cross-corpus conditions, evidence that they are robust and persistent indicators of the native language of the author.

We explored the impact of emotions on native language identification. We proposed emotion polarity and emotion load features to model emotion information. Our results with POS tag and function word feature sets containing and not containing emotion-based features showed that emotion-based features contribute to the NLI task in the majority of cases; however, their impact varies across the examined settings and datasets.

We described the cic-fbk method, developed for the NLI Shared Task 2017, which approaches the task from a machine-learning perspective, using the Support Vector Machines (SVM) classifier trained on a large variety of features: word n-grams, lemma n-grams, POS n-grams, function word unigrams, character n-grams from misspelled words, typed character n-grams, and syntactic dependency-based n-grams of words and of syntactic relation tags. Our approach shared the first rank in the NLI Shared Task 2017 scoring. We further improved our NLI method on the toefl11-17 and the 7-way icle datasets by focusing on feature selection, which consisted of replacing typed character n-grams by untyped character n-grams and incorporating novel features, i.e., punctuation mark n-grams and emotion-based features. The applied modifications improved the result on the toefl11-17 test set from 88.09% to 88.55% accuracy, which is higher than the previous state-of-the-art result of 88.27% accuracy. This improvement, however, is not statistically significant. On the 7-way icle dataset, the proposed modifications allowed us to achieve 92.08% accuracy, which is higher than the result of 90.52% accuracy obtained by our method without the proposed modifications, the difference being statistically significant. We further enhanced the performance on the 7-way icle dataset up to 93.38% accuracy by fine-tuning the size of the feature set. We found that higher threshold values are more appropriate for the icle dataset. This result is higher than the previous state of the art of 91.30% accuracy, the difference likely being statistically significant (see the explanation in Subsection 3.5.2). However, the optimal set of features for the icle dataset is different from the optimal set of features for the toefl11 dataset.

4.2 Scientific Contributions

We investigated the impact of punctuation on native language identification. We proposed novel features, punctuation mark (PM) n-grams, designed to capture patterns of PM usage. We evaluated the role and impact of punctuation under a variety of settings: the usual multi-class classification setting, within the represented language families/geographical groups, for different proficiency levels, and finally under cross-topic and cross-corpus NLI conditions. Our results showed that punctuation is a strong and robust indicator of the author’s native language.

We evaluated the impact of emotions and emotion-based features on the NLI task. We showed that emotion-based features contribute to the NLI task, even in the essay domain, where the use of emotions is limited by the genre.

We developed an NLI method that approaches the task from a machine-learning perspective using a Support Vector Machines (SVM) algorithm trained on a large variety of features: word n-grams, lemma n-grams, part-of-speech n-grams, function words, character n-grams from misspelled words, character n-grams, syntactic dependency-based n-grams of words and of syntactic relation tags, punctuation mark n-grams, and emotion-based features. The method is able to identify the author’s first language with high accuracy.

4.3 Future Work

One of the directions for future work is to further explore punctuation and emotion information in the NLI task. For instance, we will conduct experiments weighting these features differently and examine other emotion lexicons as sources for the extraction of emotion-based features. Moreover, additional experiments are required to examine which kind of information is captured by emotion-based features, that is, whether it is only L1-dependent or also culture-dependent information. We will also evaluate punctuation- and emotion-based features on other related tasks, such as language family tree construction [50], non-native text identification [3], and native language-based text segmentation [38].

Other aspects of first language influence will be explored in future work, such as errors based on the use of false cognates (i.e., faux amis, or false friends; e.g., Spanish embarazada (pregnant) and English embarrassed). The information on the use of a false cognate instead of the intended word may be useful for identifying the native language of the writer. To the best of our knowledge, no work has been done on exploring this type of language interference.

We intend to further improve our NLI method. For this, we will examine different pre-processing steps. Pre-processing has proved to be a useful strategy for other classification tasks, such as authorship attribution [45] and author profiling [18], but has rarely been used in NLI. One possible pre-processing step would be to replace named entities by the same symbol when using a character n-gram representation of the data. This would allow us to keep the information about their occurrence while removing the information about the exact named entity. Another pre-processing step will consist of replacing function words by a distinct symbol in the character n-gram representation in order to capture patterns of function word usage, as proposed in [45].

Cimino and Dell’Orletta [14] showed that a 2-stacked sentence-document architecture is an efficient strategy for the NLI task. In future work, we will examine the robustness of the 2-stacked sentence-document architecture, as well as evaluate different ensemble methods and meta-classifiers on the set of features used in our NLI method. We will also examine the robustness of our method under challenging cross-corpus NLI conditions.

Most of the datasets used in NLI research are limited to only one genre (learner corpora), which is a limiting factor that to a large extent hinders research in the field. Moreover, none of the existing datasets covers the situation in which a learner is acquiring a third (L3) or nth language, while in second language acquisition it has been noted that there are both L1- and L2-based transfer effects on L3 production [61, 75]. One of the essential steps to promote research in the field is to develop topic-balanced datasets designed specifically for the task of NLI that cover more proficiency levels, styles, domains, and genres, e.g., research papers, reviews, e-mails, etc. Moreover, as mentioned in [39], it would be interesting to evaluate NLI methods on a larger number of linguistically diverse L1 classes. The development of a dataset that covers other genres and contains a larger number of L1 classes is one of our avenues for future work.

86

Bibliography [1] Edwards Allen. Note on the “correction for continuity” in testing the significance of the difference between correlated proportions. Psychometrika, 13(3):185–187, 1948. [2] Naomi Baron. Commas and canaries: The role of punctuation in speech and writing. Language Sciences, 23(1):15–67, 2001. [3] Shane Bergsma, Matt Post, and David Yarowsky. Stylometric analysis of scientific articles. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 327–337, Montreal, Canada, 2012. ACL. [4] Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15, 2013. [5] Julian Brooke and Graeme Hirst. Native language detection with ‘cheap’ learner corpora. In Proceedings of the Conference of Learner Corpus Research, pages 37–47, Louvain-la-Neuve, Belgium, 2011. Presses universitaires de Louvain. [6] Julian Brooke and Graeme Hirst. Robust, lexicalized native language identification. In Proceedings of the 24th International Conference on Computational Linguistics, pages 391–408, Mumbai, India, 2012. The COLING 2012 Organizing Committee. [7] Paul Bruthiaux. Knowing when to stop: Investigating the nature of punctuation. Language and Communication, 13(1):27–43, 1993. [8] Serhiy Bykh and Detmar Meurers. Exploring syntactic features for native language identification: A variationist perspective on feature encoding and ensemble optimiza87

BIBLIOGRAPHY

tion. In Proceedings of the 25th International Conference on Computational Linguistics, pages 1962–1973, Dublin, Ireland, 2014. ACL. [9] Sophia Chan, Maryam Honari Jahromi, Benjamin Benetti, Aazim Lakhani, and Alona Fyshe. Ensemble methods for native language identification. In Proceedings of the 12th Workshop on Building Educational Applications Using NLP, pages 217–223, Copenhagen, Denmark, 2017. ACL. [10] Yu-Chia Chang, Jason S. Chang, Hao-Jan Chen, and Hsien-Chin Liou. An automatic collocation writing assistant for Taiwanese EFL learners: A case of corpus-based NLP technology. Computer Assisted Language Learning, 21(3):283–299, 2008. [11] Carole Chaski. Empirical evaluations of language-based author identification techniques. Forensic Linguistics, 8(1):1–65, 2001. [12] Lingzhen Chen. Native language identification on learner corpora. Master’s thesis, University of Trento, Department of Information Engineering and Science, Trento, Italy, 2016. [13] Lingzhen Chen, Carlo Strapparava, and Vivi Nastase. Improving native language identification by using spelling errors. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 542–546, Vancouver, Canada, 2017. ACL. [14] Andrea Cimino and Felice Dell’Orletta. Stacked sentence-document classifier approach for improving native language identification. In Proceedings of the 12th Workshop on Building Educational Applications Using NLP, pages 430–437, Copenhagen, Denmark, 2017. ACL. [15] Daniel Dahlmeier and Hwee Tou Ng. Correcting semantic collocation errors with L1-induced paraphrase. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Edinburgh, Scotland, 2007. ACL. [16] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, pages 449– 454, Genoa, Italy, 2006. ELRA. 88

BIBLIOGRAPHY

[17] Helena Gómez-Adorno, Ilia Markov, Jorge Baptista, Grigori Sidorov, and David Pinto. Discriminating between similar languages using a combination of typed and untyped character n-grams and words. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 137–145, Valencia, Spain, 2017. ACL. [18] Helena Gómez-Adorno, Ilia Markov, Grigori Sidorov, Juan-Pablo Posadas-Durán, Miguel A. Sanchez-Perez, and Liliana Chanona-Hernandez. Improving feature representation based on a neural network for author profiling in social media texts. Computational Intelligence and Neuroscience, 2016:13 pages, 2016. [19] Cyril Goutte and Serge Léger. Exploring optimal voting in native language identification. In Proceedings of the 12th Workshop on Building Educational Applications Using NLP, pages 367–373, Copenhagen, Denmark, 2017. ACL. [20] Sylviane Granger, Estelle Dagneaux, Fanny Meunier, and Magali Paquot. International Corpus of Learner English v2 (ICLE). Presses Universitaires de Louvain, Louvain-la-Neuve, Belgium, 2009. [21] Jack Grieve. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3):251–270, 2007. [22] Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. Can characters reveal your native language? A language-independent approach to native language identification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1363–1373, Doha, Qatar, 2014. ACL. [23] Scott Jarvis, Yves Bestgen, and Steve Pepper. Maximizing classification accuracy in native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 111–118, Atlanta, GA, USA, 2013. ACL. [24] Ekaterina Kochmar. Identification of a writer’s native language by error analysis. Master’s thesis, University of Cambridge, Cambridge, UK, 2011. [25] Moshe Koppel, Jonathan Schler, and Kfir Zigdon. Automatically determining an anonymous author’s native language. In Proceedings of the 2005 IEEE International 89

BIBLIOGRAPHY

Conference on Intelligence and Security Informatics, pages 209–217, Atlanta, GA, USA, 2005. Springer. [26] Moshe Koppel, Jonathan Schler, and Kfir Zigdon. Determining an author’s native language by mining a text for errors. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 624–628, New York, NY, USA, 2005. ACM. [27] Artur Kulmizev, Bo Blankers, Johannes Bjerva, Malvina Nissim, Gertjan van Noord, Barbara Plank, and Martijn Wieling. The power of character n-grams in native language identification. In Proceedings of the 12th Workshop on Building Educational Applications Using NLP, pages 382–389, Copenhagen, Denmark, 2017. ACL. [28] Robert Lado. Linguistics across cultures: applied linguistics for language teachers. Ann Arbor: University of Michigan Press, 1957. [29] Shibamouli Lahiri and Rada Mihalcea. Using n-gram and word network features for native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 251–259, Atlanta, GA, USA, 2013. ACL. [30] Thomas Lavergne, Gabriel Illouz, Aurélien Max, and Ryo Nagata. LIMSI’s participation to the 2013 shared task on native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 260–265, Atlanta, GA, USA, 2013. ACL. [31] Patsy M. Lightbown. Anniversary article. classroom SLA research and second language teaching. Applied Linguistics, 21(4):431–462, 2000. [32] André Lynum. Native language identification using large scale lexical features. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 266–269, Atlanta, GA, USA, 2013. ACL. [33] Shervin Malmasi. Native Language Identification: Explorations and Applications. PhD thesis, Macquarie University, Sydney, Australia, 2016. [34] Shervin Malmasi and Aoife Cahill. Measuring feature diversity in native language identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–55, Denver, CO, USA, 2015. ACL. 90

BIBLIOGRAPHY

[35] Shervin Malmasi and Mark Dras. Large-scale native language identification with cross-corpus evaluation. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies, pages 1403–1409, Denver, CO, USA, 2015. ACL. [36] Shervin Malmasi and Mark Dras. Multilingual native language identification. Natural Language Engineering, 23(2):163–215, 2015. [37] Shervin Malmasi and Mark Dras. Native language identification using stacked generalization. arXiv preprint arXiv:1703.06541, 2017. [38] Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, and Magdalena Wolska. Unsupervised text segmentation based on native language characteristics. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1457–1469, Vancouver, Canada, 2017. ACL. [39] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. A report on the 2017 native language identification shared task. In Proceedings of the Twelfth Workshop on Innovative Use of NLP for Building Educational Applications, pages 62–75, Copenhagen, Denmark, 2017. ACL. [40] Shervin Malmasi, Joel Tetreault, and Mark Dras. Oracle and human baselines for native language identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 172–178, Denver, CO, USA, 2015. ACL. [41] Shervin Malmasi, Sze-Meng Jojo Wong, and Mark Dras. Nli shared task 2013: MQ submission. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 124–133, Atlanta, GA, USA, 2013. ACL. [42] Ilia Markov, Lingzhen Chen, Carlo Strapparava, and Grigori Sidorov. CIC-FBK approach to native language identification. In Proceedings of the 12th Workshop on Building Educational Applications Using NLP, pages 374–381, Copenhagen, Denmark, 2017. ACL. [43] Ilia Markov, Helena Gómez-Adorno, and Grigori Sidorov. Language- and subtaskdependent feature selection and classifier parameter tuning for author profiling. In 91

BIBLIOGRAPHY

Working Notes Papers of the CLEF 2017 Evaluation Labs, volume 1866, Dublin, Ireland, 2017. CLEF and CEUR-WS.org. [44] Ilia Markov, Helena Gómez-Adorno, Grigori Sidorov, and Alexander Gelbukh. Adapting cross-genre author profiling to language and corpus. In Working Notes Papers of the CLEF 2016 Evaluation Labs, volume 1609, pages 947–955, Évora, Portugal, 2016. CLEF and CEUR-WS.org. [45] Ilia Markov, Efstathios Stamatatos, and Grigori Sidorov. Improving cross-topic authorship attribution: The role of preŋ-processing. In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, Budapest, Hungary, 2017. Springer. [46] Laura Mayfield Tomokiyo and Rosie Jones. You’re not from ’round here, are you?: Naïve Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Pittsburgh, PA, USA, 2001. ACL. [47] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947. [48] Saif Mohammad and Peter Turney. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29:436–465, 2013. [49] Nick Moore. What’s the point? The role of punctuation in realising information structure in written English. Functional Linguistics, 3(1):6, 2016. [50] Vivi Nastase and Carlo Strapparava. Word etymology as native language interference. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2692–2697, Copenhagen, Denmark, 2017. ACL. [51] Garrett Nicolai, Bradley Hauer, Mohammad Salameh, Lei Yao, and Grzegorz Kondrak. Cognate and misspelling features for natural language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 140–145, Atlanta, GA, USA, 2013. ACL. [52] Garrett Nicolai and Grzegorz Kondrak. Does the phonology of L1 show up in L2 texts? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 854–859, Baltimore, MD, USA, 2014. ACL. 92

BIBLIOGRAPHY

[53] Terence Odlin. Language Transfer: cross-linguistic influence in language learning. Cambridge University Press, Cambridge, UK, 1989. [54] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [55] Ria Perkins. Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis. PhD thesis, Aston University, Birmingham, UK, 2014. [56] Marius Popescu and Radu Tudor Ionescu. The story of the characters, the DNA and the native language. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 270–278, Atlanta, GA, USA, 2013. ACL. [57] Juan-Pablo Posadas-Durán, Grigori Sidorov, and Ildar Batyrshin. Complete syntactic n-grams as style markers for authorship attribution. In Proceedings of the 13th Mexican International Conference on Artificial Intelligence, pages 9–17, Tuxtla Gutiérrez, Mexico, 2014. Springer. [58] Juan-Pablo Posadas-Durán, Grigori Sidorov, Helena Gómez-Adorno, Ildar Batyrshin, Elibeth Mirasol-Mélendez, Gabriela Posadas-Durán, and Liliana Chanona-Hernández. Algorithm for extraction of subtrees of a sentence dependency parse tree. Acta Polytechnica Hungarica, 14(3):79–98, 2017. [59] Francisco Rangel and Paolo Rosso. On the identification of emotions and authors’ gender in Facebook comments on the basis of their writing style. In Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and perspectives from AI, volume 1096, pages 34–46, Torino, Italy, 2013. CEUR-WS.org. [60] Francisco Rangel and Paolo Rosso. On the impact of emotions on author profiling. Information Processing & Management, 52(1):74–92, 2016. [61] H. Ringbom. Lexical transfer in L3 production. Cross-linguistic influence in third language acquisition: Psycholinguistic perspectives, pages 59–68, 2001. 93


[62] Alla Rozovskaya and Dan Roth. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 924–933, Portland, OR, USA, 2011. ACL.
[63] Upendra Sapkota, Steven Bethard, Manuel Montes-y-Gómez, and Thamar Solorio. Not all character n-grams are created equal: A study in authorship attribution. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies, pages 93–102, Denver, CO, USA, 2015. ACL.
[64] Helmut Schmid. Improvements in part-of-speech tagging with an application to German. In Natural Language Processing Using Very Large Corpora, pages 13–25. Springer, 1999.
[65] Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K. Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. The INTERSPEECH 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Proceedings of the Annual Conference of the International Speech Communication Association, pages 2001–2005, San Francisco, CA, USA, 2016. ISCA.
[66] Grigori Sidorov. Non-linear Construction of N-grams in Computational Linguistics. SMIA, 2013.
[67] Grigori Sidorov. Syntactic dependency based n-grams in rule based automatic English as second language grammar correction. International Journal of Computational Linguistics and Applications, 4(2):169–188, 2013.
[68] Grigori Sidorov, Sabino Miranda-Jiménez, Francisco Viveros-Jiménez, Alexander Gelbukh, Noé Castro-Sánchez, Francisco Velásquez, Ismael Díaz-Rangel, Sergio Suárez-Guerra, Alejandro Treviño, and Juan Gordon. Empirical study of machine learning based approach for opinion mining in tweets. In Proceedings of the Mexican International Conference on Artificial Intelligence, volume 7629, pages 1–14, San Luis Potosí, Mexico, 2013. Springer.
[69] Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, and Liliana Chanona-Hernández. Syntactic n-grams as machine learning features for natural language processing. Expert Systems with Applications, 41(3):853–860, 2014.


[70] Efstathios Stamatatos. On the robustness of authorship attribution based on character n-gram features. Journal of Law & Policy, 21(2):427–439, 2013.
[71] Michael Swan and Bernard Smith. Learner English: A Teacher’s Guide to Interference and Other Problems, volume 1, 2nd edition. Cambridge University Press, Cambridge, UK, 2001.
[72] Ben Swanson and Eugene Charniak. Native language detection with tree substitution grammars. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 193–197, Jeju Island, Korea, 2012. ACL.
[73] Joel Tetreault, Daniel Blanchard, and Aoife Cahill. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 48–57, Atlanta, GA, USA, 2013. ACL.
[74] Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. Native tongues, lost and found: Resources and empirical evaluations in native language identification. In Proceedings of the 24th International Conference on Computational Linguistics, pages 2585–2602, Mumbai, India, 2012. The COLING 2012 Organizing Committee.
[75] Marie-Claude Tremblay. Cross-linguistic influence in third language acquisition: The role of L2 proficiency and L2 exposure. Cahiers Linguistiques d’Ottawa, 34:109–119, 2006.
[76] Oren Tsur and Ari Rappoport. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 9–16, Stroudsburg, PA, USA, 2007. ACL.
[77] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[78] Shiyang Wen and Xiaojun Wan. Emotion classification in microblog texts using class sequential rules. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 187–193, Quebec, Canada, 2014. AAAI Press.


[79] Sze-Meng Jojo Wong and Mark Dras. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association Workshop 2009, pages 53–61, Sydney, Australia, 2009. ACL.
[80] Sze-Meng Jojo Wong and Mark Dras. Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1600–1610, Edinburgh, Scotland, 2011. ACL.
[81] Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 699–709, Jeju Island, Korea, 2012. ACL.
[82] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49, New York, NY, USA, 1999. ACM.


Appendix A

Results for Top 10 Punctuation Mark N-grams

Table A.1: Top 10 10-fold cross-validation results (accuracy, %) for PM n-grams combined with POS n-gram (n = 1–3) features with PMs on the TOEFL11-17 dataset; 1-step and 2-step approaches. The number of features (No.) is given for each experiment.

TOEFL11-17
1-step                                      2-step
Features             Acc., %     No.       Features             Acc., %     No.
POS n-grams w/ PMs    48.83    29,763      POS n-grams w/ PMs    48.43    29,763
PM 2,4-grams          49.70    35,586      PM 3,5-grams          49.46    42,753
PM 2,5-grams          49.62    41,211      PM 2,5-grams          49.42    41,211
PM 2,3,4-grams        49.58    37,582      PM 2,3,4-grams        49.42    37,582
PM 3,7-grams          49.58    60,877      PM 5-grams            49.31    40,757
PM 3,5-grams          49.52    42,753      PM 2,4,5-grams        49.20    46,580
PM 2,4,5-grams        49.52    46,580      PM 7-grams            49.19    58,881
PM 5-grams            49.40    40,757      PM 2,4-grams          49.19    35,586
PM 2,7-grams          49.39    59,335      PM 2,3-grams          49.17    32,213
PM 4-grams            49.37    35,132      PM 3-grams            49.14    31,759
PM 2,3-grams          49.27    32,213      PM 6-grams            48.94    48,803
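To make the feature type in these tables concrete, the following minimal Python sketch shows one way to extract PM n-grams. It assumes that a PM n-gram is an n-gram over the sequence of punctuation marks that remains once all other characters are discarded; the punctuation inventory and the pm_ngrams helper are illustrative, not the exact implementation behind the reported numbers.

```python
from collections import Counter

# Hypothetical PM inventory; the exact set used in the experiments may differ.
PUNCTUATION = set(".,;:!?()[]'\"-")

def pm_ngrams(text, n):
    """Build n-grams over the sequence of punctuation marks in a text,
    discarding everything else, so each n-gram records which marks
    co-occur and in what order."""
    marks = [ch for ch in text if ch in PUNCTUATION]
    return Counter("".join(marks[i:i + n]) for i in range(len(marks) - n + 1))

# PM 2-grams of a short example sentence:
print(pm_ngrams("Well, yes: it works (mostly), doesn't it?", 2))
```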


Table A.2: Top 10 10-fold cross-validation results (accuracy, %) for PM n-grams combined with POS n-gram (n = 1–3) features with PMs on the ICLE dataset; 1-step and 2-step approaches. The number of features (No.) is given for each experiment.

ICLE
1-step                                      2-step
Features             Acc., %     No.       Features             Acc., %     No.
POS n-grams w/ PMs    69.48    18,594      POS n-grams w/ PMs    65.19    18,594
PM 2,7-grams          70.52    35,363      PM 2,4-grams          67.66    22,316
PM 4,7-grams          70.52    38,531      PM 2-grams            67.14    18,871
PM 2,5-grams          70.39    25,665      PM 3-grams            67.14    19,859
PM 7-grams            70.39    35,086      PM 2,3,4-grams        67.01    23,581
PM 2,3,7-grams        70.39    36,628      PM 4-grams            66.88    22,039
PM 5-grams            70.26    25,388      PM 2,3-grams          66.88    20,136
PM 2,6-grams          70.26    30,143      PM 2,5-grams          66.49    25,665
PM 6-grams            70.13    29,866      PM 2,7-grams          66.49    35,363
PM 3,7-grams          70.13    36,351      PM 7-grams            65.97    35,086
PM 2,8-grams          70.13    41,087      PM 4,7-grams          65.84    38,531
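Feature sets such as "POS n-grams w/ PMs + PM n-grams" are concatenations of independently extracted sparse count matrices. The sketch below illustrates one way to perform that combination with scikit-learn [54], assuming POS-tagged versions of the essays are available (e.g., from TreeTagger [64]); the example documents, tag strings, PM inventory, and the pm_ngram_analyzer helper are hypothetical.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

def pm_ngram_analyzer(text, n_values=(2, 3, 4, 5), marks=".,;:!?"):
    # Custom analyzer: emit PM n-grams for every requested n.
    seq = [c for c in text if c in marks]
    return ["".join(seq[i:i + n])
            for n in n_values
            for i in range(len(seq) - n + 1)]

# Illustrative inputs: pos_tagged holds space-separated POS tags per essay;
# raw holds the original essay texts.
pos_tagged = ["DT NN VBZ RB JJ SENT", "PRP VBD DT NN , CC PRP VBD SENT"]
raw = ["The weather is quite nice...", "She read the book, and he slept!?"]

pos_vec = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
pm_vec = CountVectorizer(analyzer=pm_ngram_analyzer)

# Concatenate the two sparse document-term matrices column-wise.
X = sp.hstack([pos_vec.fit_transform(pos_tagged), pm_vec.fit_transform(raw)])
print(X.shape)
```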

Appendix B

Results for Function Words with Punctuation-Based Features

Table B.1: 10-fold cross-validation accuracy (%) for FW unigrams with PMs, without PMs, and with PM n-grams (n = 2, 5); 1- and 2-step approaches.

                                  TOEFL11-17                   ICLE
Features                   1-step  2-step     No.      1-step  2-step     No.
Majority baseline            9.09    9.09               14.29   14.29
FWs w/o PMs                 27.89   31.94     127       50.13   46.36     127
FWs w/ PMs                  31.90   37.00     163       58.31   55.97     158
Improvement:                 4.01    5.06                8.18    9.61
FWs w/ PMs                  31.90   37.00     163       58.31   55.97     158
FWs w/ PMs + PM n-grams     34.08   37.05  11,611       59.22   58.70   7,229
Improvement:                 2.18    0.05                0.91    2.73
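The FW configurations compared above differ only in whether punctuation marks are preserved as tokens. A minimal sketch of such a vectorizer using scikit-learn [54]; the FUNCTION_WORDS and PUNCTUATION_MARKS lists are short illustrative stand-ins for the full inventories (e.g., the 127 function words in the tables), and fw_vectorizer is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative stand-ins for the full inventories behind the tables;
# not the thesis's exact lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]
PUNCTUATION_MARKS = [".", ",", ";", ":", "!", "?"]

def fw_vectorizer(with_pms):
    # A token is either a word or a single punctuation character, so PMs
    # survive tokenization; the fixed vocabulary then restricts counting
    # to function words (and, optionally, PMs).
    vocab = FUNCTION_WORDS + (PUNCTUATION_MARKS if with_pms else [])
    return CountVectorizer(vocabulary=vocab, token_pattern=r"\w+|[^\w\s]")

docs = ["Yes, it is: the point, of course, is the punctuation!"]
print(fw_vectorizer(with_pms=True).fit_transform(docs).toarray())
print(fw_vectorizer(with_pms=False).fit_transform(docs).toarray())
```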



Table B.2: 10-fold cross-validation accuracy (%) for FW unigrams for each proficiency level (imbalanced setting); TOEFL11 dataset.

Features                   1-step  2-step     No.
Low proficiency
FWs w/o PMs                 40.38   39.32     127
FWs w/ PMs                  44.97   45.41     157
Improvement:                 4.59    6.09
FWs w/ PMs                  44.97   45.41     157
FWs w/ PMs + PM n-grams     42.06   43.53   2,503
Improvement:                –2.91   –1.88
Medium proficiency
FWs w/o PMs                 28.73   32.00     127
FWs w/ PMs                  35.25   37.99     162
Improvement:                 6.52    5.99
FWs w/ PMs                  35.25   37.99     162
FWs w/ PMs + PM n-grams     34.64   38.47   8,102
Improvement:                –0.61    0.48
High proficiency
FWs w/o PMs                 22.77   26.46     127
FWs w/ PMs                  25.93   29.99     163
Improvement:                 3.16    3.53
FWs w/ PMs                  25.93   29.99     163
FWs w/ PMs + PM n-grams     29.34   32.27   7,449
Improvement:                 3.41    2.28


Table B.3: 10-fold cross-validation accuracy (%) for FW unigrams for each proficiency level (balanced setting); TOEFL11 dataset.

Features                   1-step  2-step     No.
Low proficiency
FWs w/o PMs                 38.57   37.71     127
FWs w/ PMs                  41.09   42.26     157
Improvement:                 2.52    4.55
FWs w/ PMs                  41.09   42.26     157
FWs w/ PMs + PM n-grams     39.19   41.16   2,398
Improvement:                –1.90   –1.10
Medium proficiency
FWs w/o PMs                 32.17   29.29     126
FWs w/ PMs                  34.27   37.37     152
Improvement:                 2.10    8.08
FWs w/ PMs                  34.27   37.37     152
FWs w/ PMs + PM n-grams     36.84   35.19   2,730
Improvement:                 2.57   –2.18
High proficiency
FWs w/o PMs                 23.79   21.46     127
FWs w/ PMs                  28.79   30.22     160
Improvement:                 5.00    8.76
FWs w/ PMs                  28.79   30.22     160
FWs w/ PMs + PM n-grams     29.39   28.37   3,499
Improvement:                 0.60   –1.85

Table B.4: 10-fold cross-validation and one fold/topic setting accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.

                           TOEFL11-17 (10FCV)    TOEFL11-17 (topic = fold)
Features                     Acc.      No.           Acc.      No.
FWs w/o PMs                 27.89      127          26.13      127
FWs w/ PMs                  31.90      163          28.22      163
Improvement:                 4.01                    2.09
FWs w/ PMs                  31.90      163          28.22      163
FWs w/ PMs + PM n-grams     34.08   11,611          32.59   10,759
Improvement:                 2.18                    4.36


Table B.5: Mixed- and cross-topic settings accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.

                           TOEFL11-17 (mixed-topic)    TOEFL11-17 (cross-topic)
Features                     Acc.      No.                Acc.      No.
FWs w/o PMs                 26.58      127               23.60      127
FWs w/ PMs                  31.14      162               32.80      163
Improvement:                 4.56                         9.20
FWs w/ PMs                  31.14      162               32.80      163
FWs w/ PMs + PM n-grams     38.41    7,606               26.13    7,817
Improvement:                 7.27                        –6.67

Table B.6: Cross-corpus classification accuracy (%) for FW unigrams without PMs, with PMs, and with PM n-grams.

Training on TOEFL, testing on ICLE
                               10FCV            Test Set
Features                    Acc.     F1       Acc.     F1        No.
FWs w/o PMs                38.09   35.83     25.56   19.26       127
FWs w/ PMs                 42.68   41.46     31.23   26.31       162
Improvement:                4.59    5.63      5.67    7.05
FWs w/ PMs                 42.68   41.46     31.23   26.31       162
FWs w/ PMs + PM n-grams    45.90   44.99     39.71   37.65     9,527
Improvement:                3.22    3.53      8.48   11.34

Training on ICLE, testing on TOEFL
                               10FCV            Test Set
Features                    Acc.     F1       Acc.     F1        No.
FWs w/o PMs                65.06   57.75     20.92   13.39       127
FWs w/ PMs                 74.31   68.67     27.66   22.99       158
Improvement:                9.25   10.92      6.74    9.60
FWs w/ PMs                 74.31   68.67     27.66   22.99       158
FWs w/ PMs + PM n-grams    77.09   72.74     29.87   25.80    13,653
Improvement:                2.78    4.07      2.21    2.81
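The cross-corpus protocol behind Table B.6 (and Tables C.11–C.12 below) trains on one corpus and evaluates on the other. The sketch below shows the general setup, with toy data standing in for the TOEFL11-17 and ICLE essays; the key point is that the feature space is fitted on the training corpus only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC

# Toy stand-ins for the two corpora and two hypothetical L1 labels.
train_docs = ["an essay , full of commas ,", "an essay ; with semicolons ;",
              "more commas , and commas ,", "yet more ; semicolons ;"]
train_labels = ["L1_a", "L1_b", "L1_a", "L1_b"]
test_docs = ["a text from the other corpus , with commas ,"]
test_labels = ["L1_a"]

# The vocabulary is fitted on the training corpus only; the test corpus is
# projected into that feature space, which is what makes the setting
# cross-corpus rather than mixed.
vec = CountVectorizer(token_pattern=r"\w+|[^\w\s]")
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

pred = LinearSVC().fit(X_train, train_labels).predict(X_test)
print(accuracy_score(test_labels, pred),
      f1_score(test_labels, pred, average="macro"))
```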

Appendix C

Emotion Polarity Results

Table C.1: 10-fold cross-validation accuracy (%) for POS n-grams (n = 1–3) with PMs and PM n-grams (n = 2, 5), and with emotion polarity features; 1- and 2-step approaches.

                                                  TOEFL11-17                   ICLE
Features                                   1-step  2-step     No.      1-step  2-step     No.
Majority baseline                            9.09    9.09               14.29   14.29
POS n-grams w/ PMs + PM n-grams             49.62   49.42  41,211       70.39   66.49  25,665
POS n-grams w/ PMs + PM n-grams + emoP      50.51   50.17  41,415       73.38   69.74  25,857
Improvement:                                 0.89    0.75                2.99    3.25

Table C.2: 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams (n = 2, 5), and with emotion polarity features; 1- and 2-step approaches.

                                                  TOEFL11-17                   ICLE
Features                                   1-step  2-step     No.      1-step  2-step     No.
Majority baseline                            9.09    9.09               14.29   14.29
FWs w/ PMs + PM n-grams                     34.08   37.05  11,611       59.22   58.70   7,229
FWs w/ PMs + PM n-grams + emoP              33.44   39.33  11,815       64.94   60.78   7,421
Improvement:                                –0.64    2.28                5.72    2.08
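The emoP features are derived from an emotion lexicon such as the NRC word-emotion association lexicon [48]. The sketch below shows one plausible encoding, counting positive and negative lexicon hits per document; the tiny LEXICON and the length normalization are illustrative assumptions, not the thesis's exact feature definition.

```python
from collections import Counter

# Tiny stand-in for the NRC word-emotion association lexicon [48], which
# maps words to eight emotions and two polarities (positive, negative).
LEXICON = {
    "happy":   {"joy", "positive"},
    "afraid":  {"fear", "negative"},
    "problem": {"negative"},
}

def emotion_polarity_features(tokens):
    # One plausible emoP encoding: per-document counts of positive and
    # negative lexicon hits, normalized by document length.
    counts = Counter()
    for tok in tokens:
        for label in LEXICON.get(tok.lower(), ()):
            if label in ("positive", "negative"):
                counts[label] += 1
    n = max(len(tokens), 1)
    return {lab: counts[lab] / n for lab in ("positive", "negative")}

print(emotion_polarity_features("I was happy but afraid of the problem".split()))
```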



Table C.3: 10-fold cross-validation accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (imbalanced setting).

Features                                   1-step  2-step     No.
Low proficiency
POS n-grams w/ PMs + PM n-grams             47.07   45.86   15,657
POS n-grams w/ PMs + PM n-grams + emoP      49.09   48.57   15,787
Improvement:                                 2.02    2.71
Medium proficiency
POS n-grams w/ PMs + PM n-grams             51.69   51.14   32,740
POS n-grams w/ PMs + PM n-grams + emoP      53.19   52.39   32,923
Improvement:                                 1.50    1.25
High proficiency
POS n-grams w/ PMs + PM n-grams             42.64   43.12   30,292
POS n-grams w/ PMs + PM n-grams + emoP      43.15   43.60   30,487
Improvement:                                 0.51    0.48

Table C.4: 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (imbalanced setting).

Features                           1-step  2-step     No.
Low proficiency
FWs w/ PMs + PM n-grams             42.06   43.53    2,503
FWs w/ PMs + PM n-grams + emoP      44.75   44.21    2,633
Improvement:                         2.69    0.68
Medium proficiency
FWs w/ PMs + PM n-grams             34.64   38.47    8,102
FWs w/ PMs + PM n-grams + emoP      40.27   41.21    8,285
Improvement:                         5.63    2.74
High proficiency
FWs w/ PMs + PM n-grams             29.34   32.27    7,449
FWs w/ PMs + PM n-grams + emoP      30.64   34.03    7,644
Improvement:                         1.30    1.76


Table C.5: 10-fold cross-validation accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (balanced setting).

Features                                   1-step  2-step     No.
Low proficiency
POS n-grams w/ PMs + PM n-grams             45.07   44.70   15,141
POS n-grams w/ PMs + PM n-grams + emoP      45.44   45.71   15,267
Improvement:                                 0.37    1.01
Medium proficiency
POS n-grams w/ PMs + PM n-grams             44.12   42.26   16,574
POS n-grams w/ PMs + PM n-grams + emoP      45.62   43.69   16,719
Improvement:                                 1.50    1.43
High proficiency
POS n-grams w/ PMs + PM n-grams             36.77   39.56   18,331
POS n-grams w/ PMs + PM n-grams + emoP      37.48   40.32   18,499
Improvement:                                 0.71    0.76

Table C.6: 10-fold cross-validation accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features for each proficiency level (balanced setting).

Features                           1-step  2-step     No.
Low proficiency
FWs w/ PMs + PM n-grams             39.19   41.16    2,398
FWs w/ PMs + PM n-grams + emoP      41.21   41.33    2,524
Improvement:                         2.02    0.17
Medium proficiency
FWs w/ PMs + PM n-grams             36.84   35.19    2,730
FWs w/ PMs + PM n-grams + emoP      38.79   38.38    2,875
Improvement:                         1.95    3.19
High proficiency
FWs w/ PMs + PM n-grams             29.39   28.37    3,499
FWs w/ PMs + PM n-grams + emoP      30.45   31.06    3,667
Improvement:                         1.06    2.69

Table C.7: 10-fold cross-validation and one fold/topic setting accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.

                                           TOEFL11-17 (10FCV)    TOEFL11-17 (topic = fold)
Features                                     Acc.      No.           Acc.      No.
POS n-grams w/ PMs + PM n-grams             49.62    41,211          43.48    39,231
POS n-grams w/ PMs + PM n-grams + emoP      50.51    41,415          44.07    39,433
Improvement:                                 0.89                     0.59


Table C.8: 10-fold cross-validation and one fold/topic setting accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.

                                   TOEFL11-17 (10FCV)    TOEFL11-17 (topic = fold)
Features                             Acc.      No.           Acc.      No.
FWs w/ PMs + PM n-grams             34.08    11,611          32.59    10,759
FWs w/ PMs + PM n-grams + emoP      33.44    11,815          32.09    10,961
Improvement:                        –0.64                    –0.50

Table C.9: Mixed- and cross-topic settings accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.

                                           TOEFL11-17 (mixed-topic)    TOEFL11-17 (cross-topic)
Features                                     Acc.      No.                Acc.      No.
POS n-grams w/ PMs + PM n-grams             46.68    31,659              41.36    31,604
POS n-grams w/ PMs + PM n-grams + emoP      46.94    31,850              40.32    31,792
Improvement:                                 0.26                        –1.04

Table C.10: Mixed- and cross-topic settings accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.

                                   TOEFL11-17 (mixed-topic)    TOEFL11-17 (cross-topic)
Features                             Acc.      No.                Acc.      No.
FWs w/ PMs + PM n-grams             38.41     7,606              26.13     7,817
FWs w/ PMs + PM n-grams + emoP      39.17     7,797              31.56     8,005
Improvement:                         0.76                         5.43

Table C.11: Cross-corpus classification accuracy (%) for POS n-grams with PMs and PM n-grams, and with emotion polarity features.

Training on TOEFL, testing on ICLE
                                               10FCV            Test Set
Features                                    Acc.     F1       Acc.     F1        No.
POS n-grams w/ PMs + PM n-grams            60.57   60.50     54.70   53.44     35,235
POS n-grams w/ PMs + PM n-grams + emoP     61.49   61.40     55.12   53.55     35,427
Improvement:                                0.92    0.90      0.42    0.11

Training on ICLE, testing on TOEFL
                                               10FCV            Test Set
Features                                    Acc.     F1       Acc.     F1        No.
POS n-grams w/ PMs + PM n-grams            87.07   84.49     41.73   39.81     40,385
POS n-grams w/ PMs + PM n-grams + emoP     87.27   84.60     41.77   39.74     40,605
Improvement:                                0.20    0.11      0.04   –0.07


Table C.12: Cross-corpus classification accuracy (%) for FW unigrams with PMs and PM n-grams, and with emotion polarity features.

Training on TOEFL, testing on ICLE
                                       10FCV            Test Set
Features                            Acc.     F1       Acc.     F1        No.
FWs w/ PMs + PM n-grams            45.90   44.99     39.71   37.65      9,527
FWs w/ PMs + PM n-grams + emoP     45.29   44.41     35.71   34.81      9,719
Improvement:                       –0.61   –0.58     –4.00   –2.84

Training on ICLE, testing on TOEFL
                                       10FCV            Test Set
Features                            Acc.     F1       Acc.     F1        No.
FWs w/ PMs + PM n-grams            77.09   72.74     29.87   25.80     13,653
FWs w/ PMs + PM n-grams + emoP     81.47   77.44     31.10   28.26     13,873
Improvement:                        4.38    4.70      1.23    2.46

Appendix D

Fine-Tuning the Feature Set Size on the ICLE Dataset

Table D.1: Accuracy (%) and F1-score (%) variation with respect to the frequency threshold on the ICLE dataset for the modified (modif.) CIC-FBK with emoL, with PM n-grams and emoL, and with emoP and emoL features; 5-fold cross-validation (5FCV).

                 Modif. CIC-FBK           Modif. CIC-FBK            Modif. CIC-FBK
Frequency        + emoL                   + PM n-grams + emoL       + emoP + emoL
threshold      Acc.     F1       No.     Acc.     F1       No.     Acc.     F1       No.
1 (all)       91.30   91.26   187,560   91.43   91.41   190,281   91.30   91.26   187,728
4 (baseline)  91.95   91.93   128,249   92.08   92.05   129,670   91.95   91.93   128,393
10            92.34   92.34    47,244   92.73   92.72    48,004   92.21   92.21    47,355
11            92.73   92.73    41,434   93.12   93.10    42,150   92.60   92.60    41,543
12            92.86   92.85    39,927   93.25   93.23    40,610   92.99   92.98    40,032
13            92.60   92.58    36,045   92.86   92.84    36,693   92.73   92.71    36,146
15            92.60   92.59    31,927   92.73   92.72    32,501   92.60   92.59    32,025
20            92.21   92.22    25,744   92.74   92.74    26,194   92.21   92.22    25,835
25            91.69   91.65    21,160   91.43   91.40    21,500   91.69   91.65    21,246
30            91.69   91.67    18,587   91.69   91.66    18,852   91.82   91.81    18,670
35            91.30   91.28    16,372   91.56   91.55    16,579   91.82   91.81    16,452


Table D.2: Accuracy (%) and F1-score (%) variation with respect to the minimum document frequency (min_df) on the ICLE dataset for the modified (modif.) CIC-FBK with emoL, with PM n-grams and emoL, and with emoP and emoL features; 5-fold cross-validation (5FCV).

                 Modif. CIC-FBK           Modif. CIC-FBK            Modif. CIC-FBK
                 + emoL                   + PM n-grams + emoL       + emoP + emoL
min_df         Acc.     F1       No.     Acc.     F1       No.     Acc.     F1       No.
1 (all)       92.86   92.85    40,032   93.25   93.23    40,715   92.99   92.98    40,137
2 (baseline)  92.86   92.85    39,927   93.25   93.23    40,610   92.99   92.98    40,032
3             92.86   92.85    39,842   93.25   93.23    40,525   92.99   92.98    39,947
4             92.86   92.85    39,666   93.25   93.23    40,349   92.99   92.98    39,771
5             92.86   92.85    39,304   93.25   93.23    39,987   92.99   92.98    39,409
6             92.47   92.46    38,234   92.99   92.98    38,917   92.73   92.73    38,339
7             92.47   92.47    35,417   92.60   92.59    36,100   92.60   92.57    35,522
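The min_df parameter in Table D.2 matches the minimum-document-frequency cutoff exposed by scikit-learn's [54] CountVectorizer. A minimal sketch of such a sweep on a toy corpus; the documents, labels, n-gram range, and classifier here are illustrative, whereas the real experiments tune over the full modified CIC-FBK feature set with 5-fold cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus standing in for the ICLE essays and their L1 labels.
docs = ["she writes very well", "he writes rather well",
        "she argues very clearly", "he argues rather clearly",
        "she writes very clearly", "he writes rather badly"]
labels = [0, 1, 0, 1, 0, 1]

for min_df in (1, 2, 3):
    # Any feature occurring in fewer than min_df documents is discarded
    # before training, shrinking the feature set as in the table.
    vec = CountVectorizer(ngram_range=(1, 3), min_df=min_df)
    X = vec.fit_transform(docs)
    acc = cross_val_score(LinearSVC(), X, labels, cv=3).mean()
    print(f"min_df={min_df}: {X.shape[1]} features, mean accuracy={acc:.2f}")
```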