Generating New LIWC Dictionaries by Triangulation

Guillem Massó, Patrik Lambert, Carlos Rodríguez Penagos, and Roser Saurí
Barcelona Media Innovation Center
Av. Diagonal 77, 08025 Barcelona, Spain
{guillem.masso,roser.sauri}@gmail.com, [email protected], [email protected]

Abstract. This work explores a triangulation-based methodology for generating a sentiment dictionary in one language from equivalent dictionaries in other languages. Direct machine translation of dictionaries generally leads to incomplete or wrong results, but multilingual translation can help disambiguate and improve these data. More precisely, we want to translate the LIWC dictionary (Linguistic Inquiry and Word Count) into Catalan from the original English dictionary, complemented with versions in Romance languages close to Catalan, that is, Spanish, French and Italian. Comparing translations from these dictionaries allows us to identify the most reliable solutions, namely, those common to the different languages. Since LIWC classifies words by categories, assigning the correct ones to the chosen translations is also an important issue, especially when the source categories differ. We present the results of a semi-automatic approach and the challenges that had to be addressed in the translation process.

Keywords: sentiment analysis, translation, triangulation, LIWC, sentiment dictionary.

1 Introduction

With the development of Web 2.0, a massive amount of text expressing personal sentiments is published daily on the internet. The automatic analysis of these texts is extremely valuable to many institutions and companies, since it allows them to know the general sentiment about them or about their products or services. Sentiment analysis has become a very active research field, boosted both by its scientific interest and by this commercial value. The main approaches to sentiment analysis use lists of sentiment words with their opinion polarity or their sentiment category, either to craft classification rules or as features of machine learning algorithms [1]. While the subjective text available on the web is increasingly multilingual¹, because most people prefer to write and read information in their own language, sentiment and subjectivity dictionaries are still not available for most languages, especially for minority languages.

¹ Statistics of Internet world users by language on www.internetworldstats.com.

R.E. Banchs et al. (Eds.): AIRS 2013, LNCS 8281, pp. 263–271, 2013. © Springer-Verlag Berlin Heidelberg 2013


In this paper we investigate the feasibility of the translation of the Linguistic Inquiry and Word Count (LIWC) [2] sentiment dictionary. To keep manual work minimal, we propose a semi-automatic approach based on triangulation from LIWC dictionaries existing in four languages. We test our investigations on the translation of LIWC into the Catalan minority language.

Table 1. LIWC 2001 categories, with id code and name (emotion-related ones, codes 12-19, marked with *)

Cat.  Dimension

I. STANDARD LINGUISTIC DIMENSIONS
 1    Total pronouns
 2    1st person singular
 3    1st person plural
 4    Total first person
 5    Total second person
 6    Total third person
 7    Negations
 8    Assents
 9    Articles
10    Prepositions
11    Numbers

II. PSYCHOLOGICAL PROCESSES
12 *  Affective or Emotional Processes
13 *  Positive Emotions
14 *  Positive feelings
15 *  Optimism and energy
16 *  Negative Emotions
17 *  Anxiety or fear
18 *  Anger
19 *  Sadness or depression
20    Cognitive Processes
21    Causation
22    Insight
23    Discrepancy
24    Inhibition
25    Tentative
26    Certainty
27    Sensory and Perceptual Processes
28    Seeing
29    Hearing
30    Feeling
31    Social Processes
32    Communication
33    Other references to people
34    Friends
35    Family
36    Humans

III. RELATIVITY
37    Time
38    Past tense verb
39    Present tense verb
40    Future tense verb
41    Space
42    Up
43    Down
44    Inclusive
45    Exclusive
46    Motion

IV. PERSONAL CONCERNS
47    Occupation
48    School
49    Job or work
50    Achievement
51    Leisure activity
52    Home
53    Sports
54    Television and movies
55    Music
56    Money and financial issues
57    Metaphysical issues
58    Religion
59    Death and dying
60    Physical states and functions
61    Body states, symptoms
62    Sex and sexuality
63    Eating, drinking, dieting
64    Sleeping, dreaming
65    Grooming

APPENDIX: EXPERIMENTAL DIMENSIONS
66    Swear words
67    Nonfluencies
68    Fillers

LIWC is text analysis software that calculates the degree to which a text uses positive or negative emotions, self-references or causal words, among other language dimensions. Its aim is to provide an efficient and effective method for studying the various emotional, cognitive, structural, and process components present in individuals' verbal and written speech samples. LIWC contains a list of word categories (see Table 1) and a dictionary of words, each with a set of related categories. For example, the categories associated with the word "love" are Affect, Positive emotion, Positive feeling, Present, Physical and Sexual.
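To make the structure of such a dictionary concrete, the following minimal sketch represents entries as a mapping from a word, or a root ending in *, to a set of Table 1 category ids. The toy entries, the function names `lookup` and `category_shares`, and the prefix-matching strategy are illustrative assumptions, not the actual LIWC implementation.

```python
from collections import Counter

# Toy dictionary (assumption): ids follow Table 1; "love" carries Affect (12),
# Positive emotion (13), Positive feeling (14), Present (39), Physical (60)
# and Sexual (62). Entries ending in * are roots matched by prefix.
LIWC_TOY = {
    "love": {12, 13, 14, 39, 60, 62},
    "humor*": {12, 13},
    "troubl*": {12, 16},
}


def lookup(word, dictionary):
    """Return every category id attached to `word`, including root matches."""
    word = word.lower()
    categories = set(dictionary.get(word, ()))
    for entry, entry_categories in dictionary.items():
        if entry.endswith("*") and word.startswith(entry[:-1]):
            categories |= entry_categories
    return categories


def category_shares(tokens, dictionary):
    """Share of tokens that fall into each category (hypothetical helper)."""
    counts = Counter()
    for token in tokens:
        counts.update(lookup(token, dictionary))
    return {category: n / len(tokens) for category, n in counts.items()}


shares = category_shares(["love", "troubles", "table", "humorous"], LIWC_TOY)
print(shares[12])
```

In this sketch three of the four tokens carry category 12, so its share reflects the proportion of affect-related words in the sample.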

2 Related Work

We are not aware of any semi-automatic method reported to translate the LIWC dictionary. However, some research has already been conducted on the translation of subjectivity or sentiment polarity dictionaries. Mihalcea et al. [3] translated a subjectivity dictionary directly with a bilingual dictionary. They reported that only a small fraction of the lexicon entries preserve their subjectivity in translation, mainly because of ambiguous entries in the source and target languages. To alleviate this problem, Steinberger et al. [4] proposed a triangulation method. They first produced high-level gold-standard sentiment dictionaries for two languages and then translated them automatically into third languages. The idea is that the words in the overlap of the target-language word lists are likely to have senses similar to those of the two source languages. For example, "esperar" in Spanish has two translations into French: "attendre" (to wait), which has negative polarity, and "espérer" (to hope), which has positive polarity. In English, there is only the positive sense "to hope". Thus triangulation into English allows the polarity of the French translations to be disambiguated.

Other approaches mentioned in the literature to build or translate sentiment lexicons include bootstrapping based on lexicon relations [5] or graph relations [6]. Lexicon bootstrapping consists of expanding a set of manually chosen seeds and their corresponding polarity with related words, and then filtering the candidate words. Banea et al. [5] expand the seed words with words from the text of their definitions, as well as with synonyms and antonyms found in a dictionary. To translate an English sentiment lexicon, Scheible et al. [6] first build monolingual graphs based on two relation types: coordination between adjectives (e.g. healthy and tasty) and adjective-noun modification (e.g. healthy food). They then use a sentiment lexicon to determine the sentiment polarity of the English graph nodes and a bilingual lexicon to draw seed relations between the English and foreign graphs. Finally, they expand the English-foreign relations with the SimRank algorithm, which determines the similarity between two nodes in different graphs.

These two bootstrapping approaches expand a set of seeds based on similarity or dissimilarity between words in a dictionary (such as synonyms or antonyms) or nodes in a graph. They thus only work because each word has only one binary polarity attribute (positive or negative). These methods cannot be applied to translate the LIWC dictionary, in which there are more than 60 possible categories and each word may be associated with several categories.

Another approach mentioned in the literature, and one which could be applied to the proposed task, is Wordnet-based lexicon generation. Banea et al. [7] showed that for about 90% of Wordnet senses the subjective meaning does hold across languages (in this case Romanian and English). This property is exploited by Perez-Rosas et al. [8] and Hassan et al. [9] to build sentiment lexicons in a foreign language using multilingual Wordnet resources. However, in the present work, we wanted to develop a method applicable to languages for which Wordnet resources are not available.
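The overlap idea behind triangulation can be illustrated with a short sketch built on the "esperar" example. It is only a simplified reading of the method of Steinberger et al. [4]: the toy lexicons, the toy bilingual dictionaries and the function names `project` and `triangulate` are assumptions made for illustration.

```python
def project(lexicon, bilingual):
    """Carry each source word's polarity over to all of its translations."""
    projected = {}
    for word, polarity in lexicon.items():
        for target in bilingual.get(word, []):
            projected.setdefault(target, set()).add(polarity)
    return projected


def triangulate(projection_a, projection_b):
    """Keep target words reached from both sources with one agreed polarity."""
    result = {}
    for word in projection_a.keys() & projection_b.keys():
        shared = projection_a[word] & projection_b[word]
        if len(shared) == 1:
            result[word] = next(iter(shared))
    return result


# Toy data (assumption): two source lexicons and two bilingual dictionaries.
spanish_lexicon = {"esperar": "positive"}
english_lexicon = {"hope": "positive"}
spanish_french = {"esperar": ["attendre", "espérer"]}
english_french = {"hope": ["espérer"]}

french = triangulate(
    project(spanish_lexicon, spanish_french),
    project(english_lexicon, english_french),
)
print(french)
```

Here "attendre" is reached only from Spanish and is therefore discarded, while "espérer" is confirmed by both source languages.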

3 Methodology

This section details how to generate a Catalan translation of the LIWC dictionary from LIWC dictionaries in other languages. Comparing translations from several dictionaries allows us to distinguish the more reliable ones, namely, those common to the different languages. Since LIWC classifies words by categories, assigning the correct ones to translations is also an important issue, mainly when the source categories differ. Furthermore, there are two LIWC versions (2001 and 2007) and not all the dictionaries are available in both versions. We describe below the steps of the whole process, from mapping categories of LIWC versions to assigning the most appropriate categories to the obtained translations.

3.1 Multilingual Translation and Alignment

As source language dictionaries, we chose Romance languages (namely Spanish, French and Italian) because of their closeness to Catalan. We also used the English dictionary, as it is the original one created in the LIWC project. We used LIWC 2001 categories because most of the selected dictionaries were available in this format. For the French dictionary, only available with 2007 categories, we manually mapped 2007 categories to 2001 ones. Note that the 2007 categories cannot all be mapped unambiguously to 2001 categories. The dictionaries were translated using the bilingual dictionaries of Apertium [10]. Apertium is a free and open-source rule-based machine translation platform, focused on Romance languages and other language pairs such as English-Catalan.

The LIWC dictionaries differ considerably in size across languages. The English one has 2318 words, the Spanish one 7475, the French one 39164, and the Italian one 5153. The large number of words in the French dictionary is due to repeated words with different clitics (e.g. abandonne, j'abandonne, l'abandonne, m'abandonne, n'abandonne, s'abandonne and t'abandonne).

After the translation process, we obtained 8526 Catalan words: 3359 have source words in the English dictionary, 5517 in the Spanish one, 3299 in the French one and 1807 in the Italian one. As the translation was direct and relied on bilingual dictionaries, not all the words were translated, and the translated Catalan words are lemmas. As for roots, represented as root*, we obtained the translations of all the words beginning with the root. For this reason, the number of words translated from the English dictionary is larger than the number of words in that dictionary.

The next step was to automatically align the different source words of each Catalan word. From this alignment, we observed the following figures. Only 76 words have source words in every language with the same categories. Another 499 words have source words in every language but with different categories. There are also 979 translations with source words in three languages, 205 of which have the same categories. There are 1773 Catalan words with source words in two languages, 631 of them with the same categories. Finally, 5199 translations have source words in only one language.

We thus observed few reliable translations and a large number of barely reliable ones. In between, there is a range of increasingly unreliable translations. The degree of reliability of a Catalan word depends on the number of source languages it was translated from (the more, the better) as well as on the number of shared and different categories of the corresponding source words (the more similar, the better). A further analysis is clearly needed, and we deal with it in the next section.
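The translation and alignment steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the toy LIWC entries and bilingual dictionaries are invented stand-ins (they are not Apertium data), and `translate`, `align` and `summarize` are hypothetical helper names. Roots are expanded by prefix matching, which also reproduces the over-generation that roots can cause.

```python
from collections import defaultdict


def translate(entry, bilingual):
    """Translate a LIWC entry; a trailing * marks a root matched by prefix."""
    if entry.endswith("*"):
        prefix = entry[:-1]
        hits = {target for source, targets in bilingual.items()
                if source.startswith(prefix) for target in targets}
    else:
        hits = set(bilingual.get(entry, ()))
    return sorted(hits)


def align(liwc_by_lang, bilingual_by_lang):
    """Group source entries and their codes under each Catalan translation."""
    aligned = defaultdict(lambda: defaultdict(list))
    for lang, liwc in liwc_by_lang.items():
        for entry, code in liwc.items():
            for catalan in translate(entry, bilingual_by_lang[lang]):
                aligned[catalan][lang].append((entry, frozenset(code)))
    return aligned


def summarize(aligned):
    """Count Catalan words by (number of source languages, identical codes?)."""
    counts = defaultdict(int)
    for by_lang in aligned.values():
        codes = {code for entries in by_lang.values() for _, code in entries}
        counts[(len(by_lang), len(codes) == 1)] += 1
    return dict(counts)


# Toy data (assumption): three tiny LIWC dictionaries and bilingual dictionaries.
LIWC = {
    "EN": {"fortune*": [56]},
    "SP": {"fortuna": [56], "sex*": [60, 62]},
    "IT": {"fortun*": [25]},
}
BILINGUAL = {
    "EN": {"fortune": ["fortuna"]},
    "SP": {"fortuna": ["fortuna"], "sexo": ["sexe"], "sexto": ["sisè"]},
    "IT": {"fortuna": ["fortuna"]},
}

aligned = align(LIWC, BILINGUAL)
print(summarize(aligned))
```

In the toy data, "fortuna" is reached from three languages with differing codes, while the Spanish root sex* yields both "sexe" and the spurious "sisè", each supported by a single language.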

4 Analysis of Results

4.1 Analysis of LIWC Categories

Each word of the LIWC dictionaries is annotated with a code composed of one or more category identifiers. These categories are hierarchical and grouped by linguistic or semantic criteria: Affective Processes are divided into Positive Emotions and Negative Emotions, and these categories are divided into further subcategories. A code can contain categories of the same or of different groups. We have used the 2001 version categories, listed in Table 1.

For each group, there is a main category (e.g. 1-Pronouns, 12-Affective Processes, 20-Cognitive Processes, etc.). This category can be the only one of its group in a code, or it can be accompanied by other categories of the same group. The secondary categories, on the other hand, rarely appear without their main category. However, we observed that some secondary categories are rather independent, such as 25-Tentative, 26-Certainty, 38-Past, 39-Present, 40-Future, 44-Inclusive, 45-Exclusive, 50-Achievement, 52-Home, 58-Religion, 59-Death and 65-Grooming. These categories often appear in codes without their main category.

The category groups are classified as Standard Linguistic Dimensions, Psychological Processes, Relativity, Personal Concerns and Experimental Dimensions. The categories in the Standard Linguistic Dimensions are grammatical, while the other categories are basically semantic. An exception could be the categories 38-Past, 39-Present and 40-Future: they have a morphological role when they are related to a tensed verb, but a semantic role when they are related to adverbs or other temporal expressions. On the other hand, 7-Negations and 8-Assents are linguistic categories but with a strong polarity, so they will play an important role if the dictionary is focused on sentiment analysis. The Italian dictionary has additional categories which are included in the Standard Linguistic Dimensions. As they are very specific, we have not taken them into account.

This is not the only difference among languages. We can notice some different criteria, such as those for the Catalan translation cèlebre, whose source words have different categories: in Spanish, célebre has the categories 51 and 54, focused on movie or TV celebrities; in French, célèbre has the categories 31 and 50, as a social achievement; and in Italian, celebre is tagged as a positive emotion. All of them are acceptable and are different approaches to the semantics of the word. This happens with several words.

The roots are another source of problems. An extreme case is the Italian article root l*, which is translated into all Catalan words beginning with l. There are also less dramatic mistakes: the Spanish root sex* (categories 60 and 62) is shared by sexo ('sex') and sexto ('sixth'). If there are other source words, these mistakes can be solved by multilingual translation.

4.2 Analysis of Translations

We have aligned all the source category codes for each Catalan translation and compared them automatically. If we analyse the overall results, we can first see that the most frequent divergent categories are linguistic (1-6, 9 and 10) and verb tense (38-40) categories. This means that these categories are less consistent among languages or that they are more language-dependent. The second important issue is that many aligned codes share main categories but have different secondary categories. We can then consider that the source words belong to the same semantic group, although they can receive different nuances in different languages. It is also possible to find source words sharing secondary categories but not main categories. In this case, we can also think that they belong to the same semantic group but that there are some annotation differences among languages.

When we analyse each Catalan translation with more than one source category code, we can rank its reliability by number of source languages and by qualitative differences among codes. The translations from four languages with very similar codes will be much more reliable than the translations from two languages with completely different codes. We classify the similarity among codes as:

– Case 5: all the categories within the different codes belong to the same group (e.g. humor: EN mood [12], humour* and humor* [12 13], SP humor* [12 13 14], FR humeur* [12], IT umorismo [12 13]).
– Case 4: all the codes share categories from the same group, but they have other categories as well (e.g. problema: EN troubl* [12 16], SP problema [12 16 17 18], FR problème* [12 16 20 23], IT problem* [12 16 24]).
– Case 3: similar to case 5, but some language has a second (or third, fourth, etc.) word with a completely different code (e.g. nou: EN new [37], SP nuevo* [37] and nueve [11], FR nouveau [37] and neuf* [11], IT nuov* [37]).
– Case 2: similar to case 4, but some language has a second (or third, fourth, etc.) word with a completely different code (e.g. bonic: EN beaut* [12 13] and pretty [20 25], SP bonito [12 13] and guapo* [12 13 14], FR bel and beau [12 13 27 28], IT carin* [12 13]).
– Case 1: there is no common group shared by every language, but most of the languages share some category group (e.g. fortuna: EN fortune* [56], SP fortuna [56], FR fortune* [56], IT fortun* [25]).
– Case 0: every code is completely different (e.g. asseure: EN settl* [20], SP sentar [52], FR asseoir [1 2 4], IT sedere [60 61]).

We consider that, when there are source words in 3 or 4 languages, cases 2-5 are reliable and case 1 is slightly reliable; when there are 2 source languages, cases 2-5 are slightly reliable; and case 0 is always unreliable. Table 2 shows the number of translations per case.


Table 2. Translations with several source languages and category codes

              Case 0  Case 1  Case 2  Case 3  Case 4  Case 5  Total
2 languages      418       -      22      33     475     194   1142
3 languages       40     201      33      32     340     128    774
4 languages        4     113      54      22     248      58    499
Total            462     314     109      87    1063     380   2415
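One possible reading of the six cases defined in this section can be expressed as a small classifier. This is a sketch under assumptions, not the procedure actually used: the grouping of Table 1 ids into groups headed by a main category (`GROUP_RANGES`) is assumed, "most of the languages" is interpreted as more than half, and `classify` is a hypothetical name.

```python
from collections import Counter

# Assumed grouping of Table 1 ids; each range is headed by its main category.
GROUP_RANGES = [
    (1, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11),
    (12, 19), (20, 26), (27, 30), (31, 36), (37, 40), (41, 45), (46, 46),
    (47, 50), (51, 55), (56, 56), (57, 59), (60, 65),
    (66, 66), (67, 67), (68, 68),
]


def group_of(category):
    """Return the main-category id heading the group of `category`."""
    for first, last in GROUP_RANGES:
        if first <= category <= last:
            return first
    raise ValueError(f"unknown category id: {category}")


def groups(code):
    """Set of groups covered by one category code."""
    return {group_of(category) for category in code}


def classify(codes_by_lang):
    """Map {language: [code, ...]} to a case number between 0 and 5."""
    lang_groups = {
        lang: set().union(*(groups(code) for code in codes))
        for lang, codes in codes_by_lang.items()
    }
    common = set.intersection(*lang_groups.values())
    if not common:
        # No group shared by every language: case 1 if a majority shares one.
        tally = Counter(g for gs in lang_groups.values() for g in gs)
        return 1 if any(n > len(lang_groups) / 2 for n in tally.values()) else 0
    all_codes = [code for codes in codes_by_lang.values() for code in codes]
    outliers = [code for code in all_codes if not groups(code) & common]
    regular = [code for code in all_codes if groups(code) & common]
    only_common = all(groups(code) <= common for code in regular)
    if outliers:  # some word has a completely different code
        return 3 if only_common else 2
    return 5 if only_common else 4


bonic = {"EN": [[12, 13], [20, 25]], "SP": [[12, 13], [12, 13, 14]],
         "FR": [[12, 13, 27, 28]], "IT": [[12, 13]]}
print(classify(bonic))
```

Under this reading, case 1 cannot occur with only two source languages, because a group shared by more than half of two languages is shared by both.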

Even if all the categories within the different codes belong to the same group (case 5), we can choose to assign only the common categories to the translation, or to assign it all the categories. The other cases are more challenging and there are more options for assigning a category code to the Catalan word. If we choose the most conservative option, the new codes will be very simple and we can lose some semantic nuances. On the other hand, if we accept all the categories, the codes could be too complex. In fact, the linguistic categories 1-Pronouns, 9-Articles, 10-Prepositions and 11-Numbers, and the categories 38-Past, 39-Present and 40-Future, are problematic when they are not shared by (almost) all the codes. After analysing several options, we decided to accept the categories of the common groups and also the categories that complement the common ones in some code, except for the problematic ones. For example, if the source codes of garantia ('guarantee') are '26', '12 13 15 26', '12 13 15 26 39' and '20 26 38 47', the suggested code is '12 13 15 20 26 47'. As for case 1, the criteria are similar but we ignore the divergent code. In case 0, it is not possible to suggest any code. With these results, we can create a restrictive dictionary formed by 1203 Catalan translations with source words from 3 or 4 languages: 281 translations with only one source code and 922 from cases 2-5. We can also create a less restrictive dictionary with 1669 additional translations: 631 with only one code from 2 languages, 724 from cases 2-5 with 2 source languages, and 314 from case 1. Nevertheless, there are still 5661 unreliable translations which should be manually revised.
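The code-assignment rule can be illustrated with the garantia example. The sketch below is a simplification under assumptions: it expects source codes that already share a common group, it treats "(almost) all the codes" as strictly all, and `suggest_code` is a hypothetical name.

```python
# Categories reported as problematic unless shared by (almost) all codes.
PROBLEMATIC = {1, 9, 10, 11, 38, 39, 40}


def suggest_code(codes, problematic=PROBLEMATIC):
    """Merge source codes, dropping problematic categories not shared by all."""
    code_sets = [set(code) for code in codes]
    in_every_code = set.intersection(*code_sets)
    merged = set().union(*code_sets)
    kept = {category for category in merged
            if category not in problematic or category in in_every_code}
    return sorted(kept)


garantia = [[26], [12, 13, 15, 26], [12, 13, 15, 26, 39], [20, 26, 38, 47]]
print(suggest_code(garantia))
```

For case 1 translations, the divergent code would have to be removed from `codes` before calling the function, in line with the criterion of ignoring it.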
Since our ultimate objective was to create a sentiment dictionary for languages not covered by LIWC, we evaluated those entries that (a) belonged to the Positive and Negative Emotions (codes 12 to 19, inclusive) and (b) had a translation found in 3 or 4 of the triangulation languages with those codes (regardless of whether other codes were included). From the resulting 509 supposedly high-precision entries, we selected 171 for manual evaluation by 3 linguists, who assigned (without knowing the original, pre-translation entries) an 'S' where the entry in Catalan can be assigned the emotional concepts in the codes attributed to it (e.g., "positive emotion", "anger", etc.), or an 'N' if in Catalan that was not the case. 'I' was reserved for 'undecidable', as in cases like "sensibly", where no a priori polarity could be assigned. We tested kappa inter-annotator agreement on 11 of those entries, and obtained perfect agreement on them. Overall, 146 (85%) were acceptable transpositions of emotional meaning, and 11 (6%) were not; 14 (8%) were undecidable, according to our evaluators.
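As an illustration of the agreement check, the sketch below computes Fleiss' kappa, which is one common choice for three annotators; the text does not state which kappa variant was used, so this choice is an assumption. The per-entry labels are hypothetical, since only the aggregate outcome is reported; the final tally uses the reported counts.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=("S", "N", "I")):
    """Fleiss' kappa for a list of per-entry label lists (same rater count)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement = 0.0
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this entry.
        agreement += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_observed = agreement / n_items
    p_expected = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for 11 entries on which all three annotators agree.
sample = [["S"] * 3] * 8 + [["N"] * 3] * 2 + [["I"] * 3]
print(fleiss_kappa(sample))

# Reported outcome of the manual evaluation, as rounded percentages.
outcome = {"S": 146, "N": 11, "I": 14}
total = sum(outcome.values())
percentages = {label: round(100 * n / total) for label, n in outcome.items()}
print(percentages)
```

Note that kappa is undefined when every annotator uses the same single label for all entries, which the sketch reports as an error rather than returning a value.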

5 Conclusions and Further Work

The proposed semi-automatic approach based on triangulation from four LIWC dictionaries in different languages is not trivial. Fewer than half of the translations are slightly to highly reliable. However, these translations are lemmas, which can be expanded into more word forms, so that a reasonable dictionary size can be obtained. We have to deal with several problems, such as the translation of roots into inaccurate words and the differences of criteria among languages. The triangulation process partially resolves these problems, but we still need a human check if we want an optimal result. As for the translation of roots, a dictionary with derivatives would be useful, but we would need one for each source language. Furthermore, it would be useful to increase the coverage of the bilingual dictionaries in order to increase the number of source languages per target entry and thus improve the reliability of translations. On the other hand, we found that such triangulation can effectively provide us with a core, high-quality dictionary of emotional words, since these codes are transported effectively when the entries are found in more than 50% of the translation dictionaries. For these purposes, the approach seems to be feasible, although evaluating it as a method for creating full LIWC dictionaries for novel languages (to aid in Information Retrieval ranking and aggregation for results from less-developed languages) needs further study.

Acknowledgments. The authors would like to thank Barcelona Media Innovation Center for their support and permission to publish this work. The research leading to these results has received funding from the Seventh Framework Program of the European Commission through the Intra-European Fellowship (CrossLingMind-2011-300828) Marie Curie Actions.

References

1. Liu, B.: Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers (2012)
2. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001. Lawrence Erlbaum Associates, Mahwah (2001)
3. Mihalcea, R., Banea, C., Wiebe, J.: Learning multilingual subjective language via cross-lingual projections. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 976–983. Association for Computational Linguistics, Prague (2007)
4. Steinberger, J., Ebrahim, M., Ehrmann, M., Hurriyetoglu, A., Kabadjov, M., Lenkova, P., Steinberger, R., Tanev, H., Vázquez, S., Zavarella, V.: Creating sentiment dictionaries via triangulation. Decision Support Systems 53(4), 689–694 (2012)
5. Banea, C., Mihalcea, R., Wiebe, J.: A bootstrapping method for building subjectivity lexicons for languages with scarce resources. In: Proceedings of the International Conference on Language Resources and Evaluations (LREC), Marrakech, Morocco, pp. 2764–2767 (May 2008)
6. Scheible, C., Laws, F., Michelbacher, L., Schütze, H.: Sentiment translation through multi-edge graphs. In: Coling 2010: Posters, Beijing, China, pp. 1104–1112 (August 2010)
7. Banea, C., Mihalcea, R., Wiebe, J.: Sense-level subjectivity in a multilingual setting. Computer Speech & Language (2013)
8. Perez-Rosas, V., Banea, C., Mihalcea, R.: Learning sentiment lexicons in Spanish. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, pp. 3077–3081 (May 2012)
9. Hassan, A., AbuJbara, A., Jha, R., Radev, D.: Identifying the semantic orientation of foreign words. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 592–597 (June 2011)
10. Forcada, M.L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Sánchez-Martínez, F., Ramírez-Sánchez, G., Tyers, F.M.: Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25(2), 127–144 (2011)