
LSA learner sentence comprehension in agglutinative and non-agglutinative languages

Iraide Zipitria
Department of Research Methods in Education (MIDE), University of the Basque Country, Tolosa etorbidea, 70, 20018 Donostia, Basque Country, Spain
[email protected]

Jon A. Elorriaga and Ana Arruarte
Language and Information Systems Department, Computer Science Faculty, University of the Basque Country, Manuel Lardizabal pasealekua, 1, 20018 Donostia, Basque Country, Spain
[email protected], [email protected]

Abstract. This work was carried out in the context of the automatic evaluation of learner summaries, where text comprehension is achieved using Latent Semantic Analysis (LSA) and Natural Language Processing (NLP) techniques. We had intuitively observed that lemmatized versions of LSA matrices resembled human similarity judgements in Basque more closely than non-lemmatized ones. This research tests that observation by comparing the impact of lemmatization in an agglutinative and a non-agglutinative language (Basque and Spanish, respectively) when modelling semantic similarity. Parallel Basque-Spanish corpora replicate the same semantic knowledge in both languages. These parallel corpora were compared to observe how closely related the results are in two morphologically diverse languages. Lemmatized and non-lemmatized LSA measures have been compared to human judgements.

Keywords: Text evaluation, Latent Semantic Analysis, lemmatization, agglutinative languages, non-agglutinative languages.

INTRODUCTION

A relevant task for Intelligent Tutoring Systems (ITSs) is to produce an adequate diagnosis of learner comprehension. Among the different strategies for diagnosing student learning, one is to infer learner comprehension by analysing learner responses produced in natural language. The main advantage of this approach is that it allows learners greater freedom of response. As a result, it is possible to gain richer information on learner comprehension. As they have no hints or boundaries in response production, learners bear the whole responsibility for the answer produced. In this context, we are working towards free-text comprehension for the evaluation of summaries.

Summaries are widely used as an educational diagnostic strategy to observe comprehension, or how much information from a text is retained in memory (Bartlett, 1932; Garner, 1982; Kintsch et al., 1999). As with other educational diagnostic methods, a summary does not necessarily produce a perfect match of learner knowledge, but it produces a good approximation of the information retained in memory.

In Computer Supported Learning Systems, however, a necessary condition for evaluation is that the summary be comprehended automatically. Free text therefore makes automatic evaluation more complex and ties it to Natural Language Processing (NLP). Many NLP solutions are strongly bound to the specific language for which they were created. Thus, when using automatic NLP-based open-ended diagnostic systems, a change of language requires adapting the NLP-related modules to the new language. Similarly, although it is a more general approach, a language comprehension model such as Latent Semantic Analysis, LSA (Landauer & Dumais, 1997), might require a certain level of adjustment for morphologically different languages.

We are working in the context of the automatic evaluation of learner summaries (Zipitria et al., 2004), where Latent Semantic Analysis is applied to comprehend learners' output text (Zipitria et al., 2006).
LSA has been tested with an agglutinative and a non-agglutinative language to observe whether LSA requirements or performance differ when comparing sentences produced in morphologically different languages. Previous work shows that a key to successful LSA modelling is the use of adequate corpora (Olde et al., 2002). Therefore, in order to model learner comprehension adequately in our learning system, we have been testing LSA to find the closest fit to human similarity judgments for the languages used in our educational context. The aim is to find the corpora that best model the agglutinative or non-agglutinative language. We had intuitively observed that lemmatized versions of LSA might work better in Basque (agglutinative), because in Basque a single concept is sparsely distributed over a wide variety of word forms. In contrast, the effect in Spanish (non-agglutinative) was not expected to be relevant, considering its significantly lower variability of forms (see Table 1). Thus, lemmatized and non-lemmatized corpora are tested in each of two morphologically diverse languages to observe LSA behaviour.

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a statistical, corpus-based natural language understanding technique. LSA was first developed by Deerwester et al. (1990) and later found to be comparable to human performance (Landauer & Dumais, 1997; Landauer et al., 1998). It has been widely used to model human semantic similarity (Foltz et al., 1998; Graesser et al., 2001; Kintsch et al., 1999; Wiemer-Hastings & Graesser, 2000; Wolfe et al., 1998). In addition, recent studies refer to the effect of syntax on LSA similarity judgments (Kanejiya et al., 2003; Serafin & Eugenio, 2004; Wiemer-Hastings & Zipitria, 2001). SELSA adds syntactic neighbourhood information to words to improve LSA results (Kanejiya et al., 2003). FLSA extends LSA with dialogue act classification information (Serafin & Eugenio, 2004). Finally, SLSA shows that by adding structural information it is possible to obtain better measures and deeper knowledge of the similarity of the different parts of the sentence (Wiemer-Hastings & Zipitria, 2001). Thus, different ways of adding syntactic information to the semantic calculations seem to have a positive effect on LSA results. In addition, previous studies have found the use of adequate corpora to be very important for obtaining optimum LSA results (Olde et al., 2002). Moreover, LSA is known to be a general approach adaptable to a variety of languages (Landauer & Littman, 1990). Nevertheless, thus far LSA has mainly been tested with English corpora and non-agglutinative languages, although there are a few studies on agglutinative languages such as Finnish (Kakkonen et al., 2005).
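As an illustration of the technique described in this section, the following minimal sketch builds an LSA space from a toy corpus and compares sentences by cosine similarity. It is not the configuration used in this study: the English toy corpus, the raw term counts (LSA implementations commonly apply a weighting scheme first), the choice of k = 2 dimensions, the representation of a sentence as the sum of its term vectors, and the function names are all assumptions made for the example.

```python
import numpy as np

def build_lsa_space(documents, k=2):
    """Return (term index, k-dimensional term vectors) for a toy corpus."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-document matrix of raw counts (rows: terms, columns: documents).
    counts = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for w in doc.lower().split():
            counts[index[w], j] += 1
    # Truncated SVD: keep only the k largest singular values.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    term_vectors = u[:, :k] * s[:k]
    return index, term_vectors

def sentence_vector(sentence, index, term_vectors):
    """Represent a sentence as the sum of the vectors of its known terms."""
    rows = [index[w] for w in sentence.lower().split() if w in index]
    if not rows:
        return np.zeros(term_vectors.shape[1])
    return term_vectors[rows].sum(axis=0)

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy corpus with two topics (invented for illustration).
corpus = [
    "cyclist doping scandal race",
    "doping test cyclist banned",
    "race cyclist mountain stage",
    "recipe soup onion garlic",
    "soup garlic bread recipe",
]
index, term_vectors = build_lsa_space(corpus, k=2)
summary = sentence_vector("cyclist doping", index, term_vectors)
source = sentence_vector("race stage", index, term_vectors)
print(cosine(summary, source))
```

In a lemmatized variant of such a space, each surface form would be replaced by its lemma before counting, so that inflected forms of one word share a single row of the matrix.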
But how do LSA results compare when dealing with two morphologically different languages? Does word-level syntactically relevant information¹ affect semantics in LSA? We have tested the effect of lemmatization in an agglutinative (Basque) and a non-agglutinative (Spanish) language.

Studied languages

Two different languages have been tested: Basque and Spanish. Basque is a non-Indo-European agglutinative language. In agglutinative languages, words are formed by joining morphemes together, which produces a great amount of word variability. Sentences tend to follow a Subject-Object-Verb (SOV) structure, although the order of words can vary; Basque is therefore considered a free word order language. In addition, to date, Basque has no known linguistic relatives. Spanish, on the other hand, is an Indo-European, non-agglutinative Romance language, and Spanish sentences tend to follow the Subject-Verb-Object (SVO) syntactic structure. We therefore have two languages that are distinct in origin and morphology, with more differences than commonalities, which provides a linguistically comparative context for this study.

Syntax and semantics

There is psychological evidence that syntax and semantics are related, and the relation between them has been the object of several psycholinguistic studies. From early language acquisition, meaning is acquired together with syntax (Tomasello, 2003). Some views on the human language system assume separate processing levels for conceptual/semantic information, orthographic/phonological information, and syntactic information (Friederici, 2002; Jackendoff, 1999; Levelt, 1999), but exactly how these are bound together in processing is still under debate in psycholinguistics. Other views show evidence for interaction between syntax and semantics in sentence comprehension (Hagoort, 2003; Palolahti et al., 2005).
However, research on LSA shows that, despite the lack of syntactic information or word order, LSA produces human-like similarity measures (Landauer et al., 1997). This paper examines the consequences of omitting relevant grammatical information in Basque and Spanish. It is organised as follows: first, the gathering of human sentence similarity judgments; next, the comparison of human results to lemmatized and non-lemmatized LSA data; and finally, the conclusions.

¹ The concept "syntactically relevant information" is used as opposed to pure semantic information.

HUMAN SIMILARITY MEASURES

Procedure

A web-based experimental environment was developed to gather similarity judgments from humans. The main purpose of this part of the experimental process was to obtain human similarity measures in Basque and Spanish to be compared to LSA similarity measures. Participants were asked to rate the semantic similarity between two sentences on a 1 to 6 scale. Each sentence pair consisted of a student summary sentence and a reading text sentence, obtained from previously collected written data on summary evaluation (Zipitria et al., 2004), in which several summaries were written based on a text about cycling and doping issues. Spanish monolinguals judged only Spanish sentence pairs, while early Basque bilinguals rated Basque sentence pairs. There were 75 sentence pairs in Basque and 47 sentence pairs in Spanish. The smaller number of sentence pairs in Spanish is due to the greater tendency to produce longer, compound sentences. Therefore, more semantic information is collected in each individual Spanish sentence.
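The comparison between human ratings and LSA measures described above can be sketched as follows. This is only an illustration under stated assumptions: it averages the 1 to 6 ratings for each sentence pair across participants and uses Pearson correlation to relate those means to LSA cosine values. The ratings, the cosine values, and the function names are invented for the example and are not data from this study.

```python
from math import sqrt
from statistics import mean

def mean_ratings(ratings_by_participant):
    """Average each sentence pair's ratings across participants.

    ratings_by_participant: one list per participant, each holding that
    participant's 1-6 rating for every sentence pair, in the same order.
    """
    return [mean(pair_ratings) for pair_ratings in zip(*ratings_by_participant)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: three participants rating four sentence pairs (1-6 scale).
human = [
    [1, 3, 5, 6],
    [2, 3, 4, 6],
    [1, 2, 5, 5],
]
# Hypothetical LSA cosine values for the same four sentence pairs.
lsa_cosines = [0.10, 0.35, 0.60, 0.90]

print(pearson(mean_ratings(human), lsa_cosines))
```

The same `pearson` function could also be applied to the ratings of two participants to gauge agreement between raters.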

Participants

A total of 68 participants were recruited. Participants were monolingual and bilingual first- and second-year university students from the Education and Computer Science degrees of the University of the Basque Country. The 33 proficient early bilingual participants (Basque acquisition at an age of approximately 0-4 years) were exposed to Basque sentences, and the other 35 (monolinguals or non-proficient Basque bilinguals) only to Spanish sentences.

Results

Pearson correlation was calculated in order to observe the level of agreement shown by participants. Similarity measures obtained from participants judging Basque sentences were significant with P