
Latent Semantic Analysis Model as a Representation of Free-Association Word Norms¹

David Ortega-Pacheco

Natalia Arias-Trejo, Julia B. Barrón Martínez

Escuela Superior de Cómputo Instituto Politécnico Nacional México D.F., México [email protected]

Facultad de Psicología Universidad Nacional Autónoma de México México D.F., México [email protected], [email protected]

Abstract- The current work aims to validate, by means of a computational model, an empirical database of free word-association norms of Mexican Spanish. Specifically, this work has two main goals: (1) to detect the associated weight of word-word pairs, and (2) to provide an understanding of a lexical network formed beyond an input-output word pair, similar to the mediated priming effect reported experimentally. We used the Term Frequency-Inverse Document Frequency weighting (TF•IDF) to obtain the associated weight between an input-output word pair and to calculate the TF•IDF-Matrix, which is used as an input in the Latent Semantic Analysis (LSA) Model. The LSA model is a semantic representation at the lexical level that allows us to understand semantic relationships beyond input-output word pairs. Our computational model replicates and further explains previous experimental work on lexical networks.

Keywords- Latent Semantic Analysis; TF•IDF Weighting; Lexical Semantic Priming and Free Word-Association Norms

I. INTRODUCTION

Models of lexical representation assume the existence of an interconnected network. Collins and Loftus [1] proposed a model in which words are organized in a semantic network of interconnected nodes of similar meaning. For instance, the word ‘dog’ primes the word ‘cat’ as a consequence of an activation process that spreads across links. In contrast, distributed models of semantic memory [2], [3] assume that lexical concepts are interconnected due to their overlap in features (e.g., fur, claws, curvilinear body for cat and dog). Although these models propose different ways by which one concept affects recognition of another, both consider that properties of concepts are the core of the semantic network.

Studies of word relationships have demonstrated that prior exposure to a related word facilitates subsequent word processing in adults [4], school-age children [5] and infants [6]: all are faster and more accurate when a preceding word is related to a subsequent word. These effects are typically explored via the lexical decision task [7], [8], the naming task [9] and, more recently, by means of an adaptation of the Intermodal Preferential Looking task [6].

However, little work has been done in the domain of computational models to better understand the kind of relationships that make up a network. Moreover, within those relationships, the connection strength of different word pairs is not fully understood. For example, priming effects are likely to be encountered when adults are exposed to words that are both semantically and associatively related. In contrast, although adults have been seen to respond to word pairs that are only semantically related [10], [5] or only associatively related [11], [12], these effects are not always obtained. Whilst associative relations reflect word use, semantic relations reflect word meaning. Computational models are likely to help us understand the kind of word-word relationships established by adult speakers of the same language. Furthermore, they can provide us with specific indexes of connection strength.

Word pairs to be used experimentally or in computational models are generally extracted from free word-association norms, in which adults are asked to produce the first word that comes to mind after reading or hearing an input word. Previous adult norms have been published for English speakers [13], [14], Spanish speakers [15] and French speakers [16], among others. However, word-association norms collected for Spanish speakers from Mexico have not been published. Research has found that, owing to the context of use and the dialect of a given language, word-association norms collected in one region may not be valid in another [17]. With the aim of presenting a model that explains the structure of an interconnected lexicon for speakers of the Mexican dialect of Spanish, we previously collected word-association norms for 117 words with the participation of 150 adults. Our research has three main goals. First, to create word-association norms to better understand how semantic memory is organized.
We would like to capture the kind of relationships that young adult speakers of Mexican Spanish produce most frequently. Second, to validate word pairs for further use in experimental settings such as priming investigations. Third, to evaluate the strength of the word associations obtained empirically and to put forward a model of how words are related to each other directly or through neighbor words. The current research is part of a more ambitious project in which we also compare adult norms with young-children norms with the aim of describing the emergence and development of a lexical network. Also, there is a consensus that word-word relationships are not representative of speakers of the same language in different socio-cultural contexts. Thus, we aim to provide a specific set of associations for future use in the Mexican-Spanish linguistic context.

1 This work was supported by a research grant awarded to Dra. Natalia Arias-Trejo by CONACYT-167900 “Mecanismos en la formación y modulación de redes semánticas durante la infancia y la etapa adulta”.

The Latent Semantic Analysis (LSA) Model is a theory and a model to extract and represent the meaning of word-use context from statistical computations applied to a large corpus of texts. Concepts in LSA are represented by vectors in a space of approximately 300 dimensions. Similarities among meanings are represented by the cosines of the angles between vectors. The input to LSA is a matrix in which the rows represent types of words and the columns represent the contexts in which these words occur. Each cell in the matrix contains the number of times that a particular type of word appears in a particular context [18].

Our objectives are (1) to detect the associated weight of word-word pairs using the TF•IDF weighting approach, taking the resulting matrix (TF•IDF-Matrix) as an initial representation of word-word relationships; and (2) to provide an understanding of a lexical network formed beyond an input-output word pair, similar to the mediated priming effect reported experimentally, by means of the Latent Semantic Analysis (LSA) model. The TF•IDF-Matrix is used as input for the LSA model, and the result is the semantic representation of the free-association norms.
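The cosine measure used in LSA to compare meanings can be illustrated with a minimal sketch. The words, contexts and counts below are hypothetical and serve only to show how the similarity of two row vectors of a word-by-context matrix would be computed; they are not taken from the norms or from any LSA corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Hypothetical word-by-context count matrix (rows: words, columns: contexts).
counts = {
    "dog":  [2, 1, 0],
    "cat":  [1, 2, 0],
    "ship": [0, 0, 3],
}

sim_dog_cat = cosine_similarity(counts["dog"], counts["cat"])    # shared contexts
sim_dog_ship = cosine_similarity(counts["dog"], counts["ship"])  # no shared context
```

In this toy case the two words that share contexts obtain a high cosine, while the words with no context in common obtain a cosine of zero.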

II. EXPERIMENTAL METHOD

In this section we describe the experimental method that we used to obtain the free word-association norms database of Mexican Spanish.

A. Participants
The participants were 150 young adults between 18 and 28 years of age. Participants were Mexican Spanish monolinguals and had completed at least 12 years of formal education. With the aim of balancing the sample in terms of participants’ background, they came from four different major areas: sciences and engineering, medical and health studies, administration-finances-economy, and social sciences. Participants came from Mexico City and its surroundings.

B. Design
We selected the input words (117 frequent words in Spanish) from the MacArthur Communicative Development Inventory [19]. According to this Inventory, the selected words are acquired during early stages of lexical development. This selection should allow us, in the near future, to model lexical acquisition and the emergence of a lexical network, as recent experimental work has proposed. However, given the difficulty of eliciting associates from children and the lack of word-association norms suitable for the Mexican population, our research is a first step toward meeting these needs. The selection criteria were that stimuli were highly imageable and only included nouns.

C. Procedure for free-association norms
For the purpose of data collection, we created a software application that presented the instructions of the free-association task through a graphical user interface. The interface presented the input words one by one. Under each word there was a blank space in which participants had to write the first word that came to mind, within a maximum of 10 seconds after reading each input word, with the aim of capturing automatic rather than strategic responses. The task had a mean duration of 12 minutes.

Figure 1. Obtained links among some word associates.

III. COMPUTATIONAL MODEL

In order to obtain the computational model for the corpus of the free-association norms, we carried out the process described in Figure 2. In the next subsections we explain each block in detail.

Figure 2. Computational Model for the corpus of Free-Association Norms previously obtained. Blocks: Free-Association Norms Database; TF (Input-Output word pair); IDF (Input-Output word pair); TF•IDF Weighting (Input-Output word pair); Latent Semantic Analysis Model.

A. Free-Association Norms Database
Each database query response is treated as an input-output word pair in the format {Bw, Aw}, where Bw is the Base-Word and Aw is its Associated-Word. For example:

{bocina, música} → {Bw, Aw} ('speaker, music')

The database consists of 5225 input-output word pairs; the total number of Base-Words is 117 and the total number of Associated-Words is 667.

B. TF (Input-Output word pair)
In this block, we obtained the frequency of each input-output word pair {Bw, Aw} in the database, the Term Frequency $\mathrm{TF}_{Bw,Aw}$ [20], to determine the local relevance between each Base-Word and Associated-Word. The output format of this block is {Bw, Aw, $\mathrm{TF}_{Bw,Aw}$}. For example:

{bocina, música, 5} → {Bw, Aw, $\mathrm{TF}_{Bw,Aw}$}

C. IDF (Input-Output word pair)
For this step, we calculated the Inverse Document Frequency (IDF) for each word pair to obtain the global relevance of an Associated-Word in the set of Base-Words [20]. To determine the IDF measure, we used equation (1):

$$\mathrm{IDF}_{Aw} = \log\left(\frac{N}{\mathrm{DF}_{Aw}}\right) \qquad (1)$$

where N is the number of Base-Words in the database and $\mathrm{DF}_{Aw}$ is the number of different word pairs {Bw, Aw} in which Aw appears.

D. TF•IDF Weighting (Input-Output word pair)
To obtain the final relevance of each word-pair relationship {Bw, Aw}, we computed the TF•IDF weighting [20] as follows (equation 2):

$$\mathrm{TF{\cdot}IDF}_{\{Bw,Aw\}} = \mathrm{TF}_{Bw,Aw} \times \mathrm{IDF}_{Aw} \qquad (2)$$
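The TF, IDF and TF•IDF blocks described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `tf_idf_weights` and the response list are hypothetical, and the natural logarithm is assumed because the base of the logarithm in equation (1) is not specified.

```python
import math
from collections import Counter

def tf_idf_weights(pairs):
    """Return {(Bw, Aw): TF * IDF} for a list of (Bw, Aw) responses."""
    tf = Counter(pairs)                      # TF: how often each {Bw, Aw} pair occurs
    n_base = len({bw for bw, _ in pairs})    # N: number of distinct Base-Words
    df = Counter(aw for _, aw in tf)         # DF: distinct pairs in which Aw appears
    return {
        (bw, aw): count * math.log(n_base / df[aw])  # natural log is an assumption
        for (bw, aw), count in tf.items()
    }

# Hypothetical responses; not taken from the norms database.
responses = [
    ("bocina", "música"), ("bocina", "música"), ("bocina", "ruido"),
    ("radio", "música"),
    ("barco", "agua"), ("barco", "agua"), ("barco", "agua"),
]
weights = tf_idf_weights(responses)
```

In this sketch an Associated-Word produced for several Base-Words (música) receives a lower IDF than one produced for a single Base-Word (agua), so frequent but unspecific associates are down-weighted.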

The output of this block is a TF•IDF-Matrix that contains the word-word pair relationships weighted by their TF•IDF relevance.

E. Latent Semantic Analysis Model
Finally, the input for the Latent Semantic Analysis (LSA) is the TF•IDF-Matrix obtained in the previous step. LSA uses Singular Value Decomposition to obtain a new version of the TF•IDF-Matrix reduced to k dimensions [21]. The output is an LSA-Matrix that contains the new “hidden” word-word pair semantic relations.

IV. RESULTS

The TF•IDF-Matrix and the LSA-Matrix each have dimensions 667x117. We extracted a representative section to show how the LSA Model (LSA-Matrix) uncovers new relationships in the word-word context. TABLE I presents a section of the TFIDF-Matrix and TABLE II a representative section of the LSA-Matrix.

TABLE I. TFIDF-MATRIX REPRESENTATIVE SECTION. ROWS: ASSOCIATED-WORDS AND COLUMNS: BASE-WORDS.

TFIDF-MATRIX    abeja       agua        aspiradora  autobús     avión       babero      bandera     barco
ABARROTES       0           0           0           0           0           0           0           0
ABEJA           0           0           0           0           0           0           0           0
ACONDICIONADOR  0           0           0           0           0           0           0           0
ADRENALINA      0           0           0           0           0           0           0           0
AEROSTATICO     0           0           0           0           0           0           0           0
AGUA            0           0           0           0           0           0           0           12.32294
VIDA            0           3.53431     0           0           0           0           0           0
VOLAR           0           0           0           0           26.3903     0           0           0

TABLE II. LSA-MATRIX REPRESENTATIVE SECTION. ROWS: ASSOCIATED-WORDS AND COLUMNS: BASE-WORDS.

LSA-MATRIX      abeja       agua        aspiradora  autobús     avión       babero      bandera     barco
ABARROTES       -0.002      -0.027      0.049       -0.018      -0.001      0.000       0.000       0.024
ABEJA           -0.024      0.016       -0.007      -0.041      -0.017      -0.049      -0.001      0.020
ACONDICIONADOR  0.000       0.017       0.209       0.000       -0.003      0.001       0.000       0.001
ADRENALINA      0.000       0.000       -0.001      0.012       0.018       -0.001      0.000       0.000
AEROSTATICO     0.000       0.011       0.177       -0.008      0.717       -0.027      -0.016      0.010
AGUA            -0.003      -0.346      0.009       0.017       -0.010      -0.007      -0.001      14.563
VIDA            -0.002      0.081       0.021       -0.001      0.035       0.000       -0.003      0.077
VOLAR           -0.002      0.023       0.095       0.379       23.454      0.013       -0.057      0.033

TABLE I represents a TF•IDF approach to obtain a primary model of the word-word relations in the free-association norms. We consider that a relation between two words exists if the relation value in the table is greater than zero, and this value represents the weight of the word-word association. Thus, if the relation value is zero, the relation does not exist. At this stage it is evident that the only relations in the table are Vida-Agua (life-water), Volar-Avión (fly-plane) and Agua-Barco (water-ship), and the weight of each association can be read from TABLE I. This gives us a computational model in which the main data are the weights of the associations. We show a network view of TABLE I in Figure 3.
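The rule used to read TABLE I as a network (an association exists only where the matrix entry is greater than zero, and the entry is its weight) can be sketched as follows. The function name `edges_from_matrix` is hypothetical, and the matrix is an excerpt of TABLE I restricted to the rows and columns that contain non-zero entries.

```python
def edges_from_matrix(rows, cols, matrix):
    """List (row_word, col_word, weight) for every entry greater than zero."""
    return [
        (rows[i], cols[j], value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value > 0
    ]

# Excerpt of TABLE I (rows: Associated-Words, columns: Base-Words).
associated = ["AGUA", "VIDA", "VOLAR"]
base = ["agua", "avión", "barco"]
tfidf = [
    [0.0, 0.0, 12.32294],   # AGUA
    [3.53431, 0.0, 0.0],    # VIDA
    [0.0, 26.3903, 0.0],    # VOLAR
]

edges = edges_from_matrix(associated, base, tfidf)
for source, target, weight in edges:
    print(f"{source} - {target}: {weight}")
```

Each returned triple corresponds to one weighted link of the network view; entries equal to zero produce no link.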

Figure 3. Network view for the TFIDF-Matrix. Continuous black lines represent word associations (the weight of each association is given in TABLE I).

In a second stage, we use the LSA Model to extend the TF•IDF approach and use the information that it provides to obtain a new relation matrix for the free-association norms. The input for the Latent Semantic Analysis Model is the TF•IDF-Matrix, from which we obtain, by Singular Value Decomposition, a new version of the TF•IDF-Matrix reduced to k dimensions. In this case, we use a reduction order of k = 117, which is the maximum value allowed by the matrix dimensions. The relevance of the LSA-Matrix is that it shows “hidden” word-word pair relations similar to those produced in experimental contexts. These “hidden” relations form new paths between words, like the mediated priming effect observed experimentally. As before, we consider that a relation between two words exists if the relation value in the table is greater than zero, and this value represents the weight of the word-word association. If the relation value is zero or less, the relation does not exist. For example, in the TF•IDF approach the relation Vida-Barco (life-ship) does not exist, but in the LSA-Matrix this relation appears, and its association weight is obtained automatically in the same process. We show a network view of TABLE II in Figure 4.
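How a reduced Singular Value Decomposition can produce a weight for a pair that had none in the input can be illustrated with a small sketch. It assumes NumPy is available; the function name `lsa_matrix`, the Associated-Word MAR and all the weights are hypothetical and are not taken from the norms. In this toy matrix, k = 3 would reproduce the input exactly, so the sketch uses k = 2 to make the effect of the reduction visible.

```python
import numpy as np

def lsa_matrix(matrix, k):
    """Rank-k reconstruction of `matrix` from its truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Hypothetical TF-IDF weights (rows: Associated-Words, columns: Base-Words).
associated = ["MAR", "VIDA", "VOLAR"]
base = ["agua", "barco", "avión"]
tfidf = np.array([
    [4.0, 5.0, 0.0],   # MAR was produced for both agua and barco
    [3.0, 0.0, 0.0],   # VIDA was produced only for agua
    [0.0, 0.0, 6.0],   # VOLAR was produced only for avión
])

reduced = lsa_matrix(tfidf, k=2)   # keep the two largest singular values
full = lsa_matrix(tfidf, k=3)      # full rank of this toy matrix

row, col = associated.index("VIDA"), base.index("barco")
print("TF-IDF weight VIDA-barco:", tfidf[row, col])
print("LSA weight VIDA-barco:", round(float(reduced[row, col]), 3))
```

In the reduced matrix, VIDA obtains a positive weight for barco although the two never co-occurred in the input, because both are linked to agua through MAR. This is the kind of mediated relation discussed in the text.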

Figure 4. Network view for the LSA-Matrix. Continuous red lines represent the new word associations (the weight of each association is given in TABLE II).

V. DISCUSSION

Although word-association norms have been collected from adult speakers of various languages such as English, French and German, the present study is the first work collecting word-association norms for Mexican Spanish. As a consequence, this is the first computational model describing how word-word pairs may be formed in the context of a particular dialect of a language. Moreover, Latent Semantic Analysis has previously been used to explore surrounding words within the context of internet free text, but not to explore word-association norms collected experimentally [22].

The aim of this research was to collect word-association norms for Mexican Spanish and to create a computational model that could help us to understand the kind of word-word relationships that participants have formed. We also aimed to verify the possibility that two words could be related even when they did not show a direct input-output relationship. Experimental research has reported phono-semantic relationships [23], indicative of an interconnected lexicon at different levels. For example, a word like cat can facilitate processing of the word door. How is this possible? Facilitation is the product of activation of phonologically similar words: the word door is activated as the result of having activated dog via a semantic relationship with cat. Our results show that two words that fail to establish a direct link, such as ship-life, can actually be related via water. We have been able to identify that different types of relationships permeate the organization of a lexical network. These relationships are likely to influence experimental results. Therefore, our results provide a valuable tool for experimental designs to adequately select the word pairs to be employed in research investigating the organization of the lexicon. Moreover, future research should try to describe the characteristics of the most frequent mediators: semantic, phonological and perceptual. It would be desirable to compare our results with those of research performed in a different age group or within a different language, with the aim of understanding some of the factors that influence the organization of a lexical network.

REFERENCES

[1] Collins, A. M., & Loftus, E. F. (1975). A spreading activation theory of semantic processing. Psychological Review, 82, 407-428.
[2] Cree, G. S., & McRae, K. (2003). Analyzing the factors underlying the structure and computation of the meaning of chipmunk, cherry, chisel, cheese and cello (and many other such concrete nouns). Journal of Experimental Psychology: General, 132, 163-201.
[3] McRae, K., & Boisvert, S. (1998). Automatic semantic similarity priming. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 863-883.
[4] Nation, K., & Snowling, M. J. (1999). Developmental differences in sensitivity to semantic relations among good and poor comprehenders: Evidence from semantic priming. Cognition, 70, B1-B13.
[5] Neely, J. H. (1991). Semantic priming effects in visual word recognition: A selective review of current findings and theories. In D. Besner & G. W. Humphreys (Eds.), Basic processes in reading: Visual word recognition (pp. 264-336). Hillsdale, NJ: Erlbaum.
[6] Arias-Trejo, N., & Plunkett, K. (2009). Lexical-semantic priming effects during infancy. Philosophical Transactions of the Royal Society B, 364, 3633-3647.
[7] Fischler, I. (1977). Semantic facilitation without association in a lexical decision task. Memory and Cognition, 5, 335-339.
[8] Perea, M., & Rosa, E. (2002). The effects of associative and semantic priming in the lexical decision task. Psychological Research, 66, 180-194.
[9] Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D. E. (1998). Effects of semantic and associative relatedness on automatic priming. Journal of Memory and Language, 38, 440-458.
[10] Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90, 227-234.
[11] Alario, F. X., Segui, J., & Ferrand, L. (2000). Semantic and associative priming in picture naming. The Quarterly Journal of Experimental Psychology A: Human Experimental Psychology, 53A, 741-764.
[12] Ferrand, L., & New, B. (2003). Semantic and associative priming in the mental lexicon. In P. Bonin (Ed.), Mental Lexicon: Some words to talk about words (pp. 25-43). Hauppauge, NY: Nova Science Publisher.
[13] Kiss, G. R., Armstrong, C., Milroy, R., & Piper, J. (1973). An associative thesaurus of English and its computer analysis. In A. J. Aitken, R. W. Bailey & N. Hamilton-Smith (Eds.), The Computer and Literary Studies. Edinburgh: University Press.
[14] Moss, H. E., & Older, L. (1996). Birkbeck Word Association Norms. Hove, Sussex: Erlbaum.
[15] Macizo, P., Gómez-Ariza, C., & Bajo, M. T. (2000). Associative norms of 58 Spanish words for children from 8 to 13 years old. Psicológica, 21, 287-300.
[16] Ferrand, L., & Alario, F. X. (1998). French word association norms for 366 names of objects. L’Année Psychologique, 98, 659-709.
[17] Snodgrass, J. G., & Vanderwart, M. (1980). A standardized set of 260 pictures: Norms for name agreement, image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning and Memory, 6, 174-215.
[18] Vivas, J. (2010). Modelos de memoria semántica. Evaluación de Redes Semánticas. Instrumentos y Aplicaciones. Capítulo del libro.
[19] Jackson-Maldonado, D., Thal, D., Marchman, V., Newton, T., Fenson, L., & Conboy, B. (2003). MacArthur Inventarios del Desarrollo de Habilidades Comunicativas. User's Guide and Technical Manual. Baltimore: Brookes.
[20] Manning, Ch. et al. (2008). Scoring, term weighting and the vector space model. An Introduction to Information Retrieval. Cambridge University Press: England.
[21] Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
[22] Milagros Fernández, Eric de la Clergerie & Manuel Vilares. Mining conceptual graphs for knowledge acquisition. Proceedings of the 2nd ACM workshop on Improving non english web searching, ACM, New York, USA, 25-29.
[23] Marslen-Wilson, W. D., & Zwitserlood, P. (1989). Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance, 15, 576-585.