
Procedia Technology 24 (2016) 1507–1513

International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015)

Unsupervised Approach to Word Sense Disambiguation in Malayalam

Sruthi Sankar K P (a), P C Reghu Raj (a), Jayan V (b)

(a) Government Engineering College, Sreekrishnapuram, Palakkad, Kerala, India
(b) Centre for Development of Advanced Computing (CDAC), Trivandrum, India

Abstract

Word Sense Disambiguation (WSD) is the task of identifying the correct sense of a word in a specific context when the word has multiple meanings. WSD is an important intermediate step in many Natural Language Processing (NLP) tasks, especially Information Extraction (IE), Machine Translation (MT) and Question Answering systems. Word sense ambiguity arises when a particular word has more than one possible sense, and every language contains many such ambiguous words. Since the sense of a word depends on its context of use, the disambiguation process requires an understanding of word knowledge. Automatic WSD systems are available for languages such as English, Chinese, etc., but Indian languages are morphologically rich and thus more complex to process. The aim of this work is to develop a WSD system for Malayalam, a language spoken in India, predominantly in the state of Kerala. The proposed system uses a corpus collected from various Malayalam web documents. For each possible sense of an ambiguous word, a relatively small set of training examples (a seed set) that represents the sense is identified. Collocations and the most frequently co-occurring words are used as training examples. A seed set expansion module extends each seed set by adding the words most similar to its elements, and these extended sets act as sense clusters. The sense cluster most similar to the context of the input text is taken as the sense of the target word.

© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of ICETEST – 2015.

Keywords: Word sense disambiguation; Unsupervised methods; Information extraction; Collocations; Context similarity.

* Corresponding author. Tel.: 9744071958. E-mail address: [email protected]

doi:10.1016/j.protcy.2016.05.106


1. Introduction

The exponential growth of the Internet community, together with the fast development of several areas of information technology (IT), has led to the production of a vast amount of unstructured data, such as document warehouses, Web pages, collections of scientific articles, blog corpora, etc. As a result, there is an increasing need to treat this mass of information by automatic methods. Traditional techniques for text mining and information retrieval show their limits when applied to such huge collections of data. These approaches, mostly based on lexico-syntactic analysis of text, do not go beyond the surface appearance of words and consequently fail to identify relevant information formulated with different wordings, and to discard documents that are not pertinent to the user's needs [1]. Text disambiguation can potentially provide a major breakthrough in the treatment of large-scale amounts of data, thus constituting a fundamental contribution to the realization of the so-called Semantic Web, an extension of the current Web in which information is given well-defined meaning, enabling computers and people to work in cooperation [2]. The potential of WSD is also clear when we deal with the problem of machine translation. For instance, the Malayalam word 'manam' can be translated into English as sky or prestige depending upon the context. There are several cases where disambiguation can play a crucial role in the automated translation of text.

In computational linguistics, word sense disambiguation (WSD) is an open problem of natural language processing: the process of identifying the sense of an ambiguous word in a given context. WSD is an intermediate task in many natural language processing applications, so a solution to this problem impacts applications including machine translation, information retrieval, improving the relevance of search engines, anaphora resolution, etc. Human languages contain many ambiguous words which are interpreted in multiple ways depending on the context in which they occur. Consider the following example sentences in Malayalam:
1. ampalattil uttaram vakkal kaRmmam naTannu.
2. A cOdyattinRe uttaram enikkaRiyAm.

The computational identification of the meaning of words in context is called word sense disambiguation [1]. The occurrences of the word 'uttaram' in the two sentences clearly denote different meanings: support and answer, respectively. For humans, identifying the correct meaning of an ambiguous word in context is very simple; most of the time we do not even think about the ambiguities of language. Machines, however, need to process unstructured textual information and transform it into structured data which must be analyzed in order to determine the underlying meaning.

2. Related works

The automatic disambiguation of word senses has been an important concern since the 1950s. Word sense disambiguation is an intermediate task and is necessary at one level or another to accomplish most natural language processing tasks. It is obviously essential in many language understanding applications, particularly in man-machine communication. The problem of word sense disambiguation (WSD) has been described as AI-complete, that is, a problem which can be solved only by first resolving all the difficult problems in artificial intelligence (AI), such as the representation of common sense and encyclopedic knowledge [3]. The main approaches to WSD are supervised, unsupervised and knowledge-based. Knowledge-based approaches encompass systems that rely on information from an explicit lexicon such as machine-readable dictionaries, thesauri, computational lexicons such as WordNet, or hand-crafted knowledge bases. Knowledge-based approaches to WSD such as Lesk's algorithm, Walker's algorithm, conceptual density and random walk algorithms essentially perform machine-readable dictionary look-up. Supervised methods make use of annotated corpora to train from, or as seed data in a bootstrapping process, and are mostly word-specific classifiers; examples are SVM and decision-list based algorithms. Unsupervised algorithms work directly from unannotated raw corpora. They have the potential to overcome the knowledge acquisition bottleneck and have achieved good results [3], [1].


David Yarowsky proposed a method that disambiguates English word senses in unrestricted text using statistical models of the major Roget's Thesaurus categories [4]. Roget's Thesaurus is sometimes described as a dictionary in reverse. According to Roget, it is a collection of the words the English language contains and of the idiomatic combinations peculiar to it, arranged not in alphabetical order, as they are in a dictionary, but according to the ideas which they express. The Thesaurus is a catalogue of semantically similar words and phrases, divided into nouns, verbs, adjectives, adverbs and interjections. A phrase in Roget's is not a phrase in the grammatical sense, but rather a collocation or an idiom, for example fatal gift or poisoned apple [4]. The reader perceives implicit semantic relations between groups of similar words. In WordNet, English nouns, verbs, adjectives and adverbs are organized into sets of near synonyms, called synsets, each representing a lexicalized concept.

Collocations are very important in resolving ambiguities. The first approaches to sense disambiguation were based on simple hand-built decision tables consisting almost exclusively of questions about observed word associations in specific positions. Yarowsky tested the hypothesis that, for certain definitions of collocation, a polysemous word exhibits essentially only one sense per collocation, and confirmed it with 90-99% accuracy [5]. A statistical decision procedure based on decision lists for lexical ambiguity resolution is discussed in [6].

Supervised systems have disadvantages such as data sparseness and the limited availability of standard annotated corpora, so languages with few resources face problems in NLP tasks. In view of these problems, knowledge-based and unsupervised WSD are emerging as powerful alternatives. Yarowsky exploits an unsupervised algorithm for WSD [7]. The algorithm is based on two constraints: words tend to have one sense per discourse and one sense per collocation. The algorithm starts by finding all examples of the given polysemous word in a large untagged corpus, with their contexts as lines. Then, for each sense, a number of examples representative of that sense (a seed set) are identified. Using the decision list algorithm, a classifier is created and applied to the entire sample set, and this process is repeated iteratively until it converges. The final classifier can be used to annotate the original corpus.
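The following is a much-simplified, hedged sketch of this bootstrapping idea; a real implementation learns ranked decision-list rules, whereas here a simple overlap count against growing per-sense word sets stands in for the classifier, and all names are illustrative:

```python
# Simplified sketch of Yarowsky-style bootstrapping [7]. A real system uses a
# ranked decision list; here growing per-sense word sets play that role.

def bootstrap(contexts, seed_sets, max_iterations=10):
    """contexts: list of token lists, each containing the polysemous word.
    seed_sets: {sense: set of seed collocation words}."""
    rules = {sense: set(words) for sense, words in seed_sets.items()}
    labels = {}
    for _ in range(max_iterations):
        changed = False
        for i, context in enumerate(contexts):
            scores = {s: len(rules[s] & set(context)) for s in rules}
            best = max(scores, key=scores.get)
            if scores[best] > 0 and labels.get(i) != best:
                labels[i] = best
                changed = True
        # grow each sense's rule set from the contexts labelled with it
        # (crude stand-in for adding new decision-list rules)
        for i, sense in labels.items():
            rules[sense].update(contexts[i])
        if not changed:
            break
    return labels  # sense label for each context that matched some rule
```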
Indian languages are morphologically rich, which makes NLP work on them considerably more complex than on languages such as English and Chinese, for which systems already give good WSD performance [8]. In recent years, however, many NLP efforts on Indian languages have been undertaken with the support of government and private organizations. An automatic system for WSD in Assamese using a Naive Bayes classifier is discussed in [9]. Assamese, the main language of most of the people in the north-eastern part of India, is a morphologically very rich language in which a word can behave differently when combined with a suffix or a sequence of suffixes and take on an entirely different sense. Unigram co-occurrences, the POS of the target word, the POS of the next word and local collocations are the features used in this system.

The complexity of the language and the unavailability of resources are the main issues for Malayalam processing. [10] discusses the difficulties in processing Malayalam and the issues faced during statistical machine translation; the agglutinative nature of Malayalam is the main issue in Malayalam text processing. The study shows that a single verb can have 890 inflections based on tense, aspect, modality, interrogation, conjunction, conditionals, person, number and gender, and this number increases further because Malayalam forms compound verbs by combining two verb forms. [11] and [12] give an idea of the importance of morphology, case markers and collocations in the WSD task. Malayalam belongs to the Dravidian language family and is spoken by more than 38 million people. It is highly agglutinative in nature. It is roughly estimated that about eighty percent of the scholarly vocabulary of languages like Malayalam is constituted by Sanskrit; Tamil and English have also influenced Malayalam in one way or another. Inflection, derivation, compounding and concatenation are the major morphological behaviours in Malayalam [10].

Very little work on WSD is available for Malayalam. A paper published in IEEE by Haroon proposes a conceptual density based algorithm to disambiguate ambiguous Malayalam words [13]. A knowledge-based approach is used, but no implementation or result details are given. Another approach to Malayalam WSD is proposed and implemented by Mujeeb Rehman O. et al. This WSD system is based on a dictionary and is named PADAM (Practical Aid for Disambiguation in Malayalam). The disambiguation system is implemented using the Lesk algorithm and gives an accuracy of 58% when tested with 150 contexts. A Malayalam WordNet using the relational database MySQL was also developed along with this WSD system [14].


3. Word sense disambiguation system in Malayalam

3.1. System design

From the literature survey, it is clear that word sense disambiguation is an important task in natural language processing with many applications such as machine translation and information retrieval/extraction. For Malayalam, there already exists a WSD system based on the Lesk algorithm [14]. This work proposes a WSD system that relies on an unsupervised method. The proposed system consists of three modules, namely the preprocessing module, the sense cluster module and the sense disambiguation module. Unimportant or meaningless data, such as punctuation and stop words, are removed from the text in the preprocessing module. The sense cluster module creates word clusters for each sense of an ambiguous word, using the seed sets. The sense disambiguation module disambiguates the target word in the input text by context similarity. The system architecture is shown in Fig. 1.

Fig. 1. System Architecture
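As a rough orientation, the flow through the three modules could be expressed as follows; the function names and signatures are illustrative assumptions, not the paper's code, and hedged sketches of preprocess, build_sense_clusters and disambiguate are given in Section 3.2:

```python
# Illustrative top-level flow of the three modules in Fig. 1.
# preprocess(), build_sense_clusters() and disambiguate() are sketched in
# Section 3.2; their names and signatures are assumptions.

def run_wsd(input_text, sentences, model, ambiguous_word, seed_sets):
    """sentences: preprocessed corpus sentences; model: trained Word2Vec model."""
    # sense cluster module: one word cluster per sense of the ambiguous word
    clusters = build_sense_clusters(sentences, model, ambiguous_word, seed_sets)
    # preprocessing module: clean and stem the input text
    context = [w for sent in preprocess(input_text) for w in sent]
    # sense disambiguation module: pick the cluster most similar to the context
    return disambiguate(context, clusters)
```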

3.2. Implementation

This work proposes and implements a word sense disambiguation system for Malayalam. The proposed algorithm resolves sense ambiguities (Nanarthas) in Malayalam. The WSD system is implemented in four phases:
1. Corpus collection
2. Generation of seed sets
3. Generation of sense clusters
4. Sense disambiguation

Corpus Collection: A standard corpus is not available for Malayalam, so for the implementation of this system we created a corpus from Malayalam web documents. The contents of the web documents were extracted and saved in a file. After collecting a considerable amount of data, preprocessing was done. The preprocessing step includes the following tasks:
• Elimination of punctuation.
• Elimination of stop words.
• Stemming.

Punctuation marks other than dots (.) are eliminated from the collected data; dots are retained as sentence boundaries. After the elimination of punctuation, the stop words are removed from the text, discarding unimportant words which do not carry a specific meaning. The whole corpus is then stemmed using the Silpa stemmer and arranged as one sentence per line, which is the input format expected by the tool Word2Vec. Word2Vec, published by Google in 2013, is a neural network implementation that learns distributed representations for words. Word2Vec does not need labels in order to create meaningful representations, which is useful since most real-world data is unlabeled [15], [16]. Given enough training data (tens of billions of words), the network produces word vectors with intriguing characteristics: words with similar meanings appear in clusters, and clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math.
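A minimal sketch of these preprocessing and embedding steps is given below. The stop-word list and the stem() call are placeholders (the paper uses the Silpa stemmer, whose API is not reproduced here), the corpus file name is illustrative, and gensim's Word2Vec implementation stands in for Google's word2vec tool:

```python
# Sketch of corpus preprocessing and Word2Vec training. Assumptions: an empty
# stop-word list and an identity stem() stand in for a Malayalam stop-word list
# and the Silpa stemmer; gensim stands in for Google's word2vec tool.
import re
from gensim.models import Word2Vec

STOP_WORDS = set()            # to be filled with Malayalam stop words

def stem(word):
    return word               # placeholder for the Silpa stemmer

def preprocess(raw_text):
    # drop all punctuation except dots, which mark sentence boundaries
    text = re.sub(r"[^\w\s.]", " ", raw_text)
    sentences = []
    for sent in text.split("."):
        tokens = [stem(w) for w in sent.split() if w not in STOP_WORDS]
        if tokens:
            sentences.append(tokens)   # one sentence per line
    return sentences

# illustrative corpus file name
sentences = preprocess(open("malayalam_corpus.txt", encoding="utf-8").read())
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
```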

Generation of Seed Sets: The next step after corpus collection is to generate the seed sets. The different senses of the ambiguous words are collected from the Datuk corpus, a free and open Malayalam-Malayalam dictionary dataset with over 106,000 definitions for more than 83,000 Malayalam words. The majority of words and definitions are grammar tagged, and a large number of records also have additional metadata attached to them. A seed set is a set of words which are representative of a particular sense of an ambiguous word. Seed set generation is done as follows: identify all examples of the given polysemous word in the corpus, analyze their contexts, and choose two or three words which uniquely represent a particular sense. These words are used as seed elements. For example, for the polysemous word 'rasaM', the words 'mUlakaM', 'shastraM' and 'upakaraNaM' are used as the seed set for the sense 'mercuRi'. In this way, a set of seed elements is manually generated for each polysemous word.

Generation of Sense Clusters: After the generation of the seed sets, the sense clusters have to be created from the context words. For each sense of the ambiguous word we now have a seed set, and the sense cluster for that sense is built from the corresponding seed set. Sense cluster generation consists of the following steps (a minimal sketch is given after the list):
1. Search for the ambiguous word in the corpus.
2. If found, select the words in that sentence and its nearby sentences.
3. Find out whether any seed set elements are among these selected words.
4. If so, add the selected words to the sense cluster corresponding to that seed set, which indicates a particular sense.
5. Repeat the steps for all occurrences of the ambiguous word.
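A minimal sketch of this clustering step follows. The 'mercury' seeds are the ones mentioned above, the second seed set is hypothetical, and `sentences`/`model` are assumed to come from the preprocessing sketch:

```python
# Sketch of sense cluster generation for one ambiguous word. The 'mercury'
# seeds come from the text above; the 'taste' seeds are hypothetical.
from collections import defaultdict

SEED_SETS = {
    "mercury": {"mUlakaM", "shastraM", "upakaraNaM"},
    "taste": {"ruci", "nAvu"},            # hypothetical seed words
}

def build_sense_clusters(sentences, model, ambiguous_word, seed_sets,
                         window=1, topn=10):
    clusters = defaultdict(set)
    for i, sent in enumerate(sentences):
        if ambiguous_word not in sent:
            continue
        # words from this sentence and its neighbouring sentences
        context = set()
        for nearby in sentences[max(0, i - window): i + window + 1]:
            context.update(nearby)
        for sense, seeds in seed_sets.items():
            if context & seeds:           # a seed element occurs nearby
                clusters[sense].update(context)
    # expand each cluster with the words most similar to its seed elements
    for sense, seeds in seed_sets.items():
        for seed in seeds:
            if seed in model.wv:
                clusters[sense].update(
                    w for w, _ in model.wv.most_similar(seed, topn=topn))
    return clusters

# usage: clusters = build_sense_clusters(sentences, model, "rasaM", SEED_SETS)
```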

Finally we obtain a number of sense clusters, each of which helps to identify the correct sense of an ambiguous word. The words most similar to the seed set elements, obtained from the Word2Vec model, are also included in the sense clusters.


Sense Disambiguation: To identify the sense of an ambiguous word used in a given text, the text is first preprocessed: punctuation and stop words are eliminated and the text is stemmed using the Silpa stemmer. Then the ambiguous word is located in the given text and its context words are extracted as a set. The sense cluster which has the maximum number of words in common with this set of context words is selected, and the sense corresponding to this cluster is used to tag the ambiguous word. The input to the WSD system is a Malayalam text document, and the output is a document in which the ambiguous word is tagged with its correct sense.
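A minimal sketch of this selection step, continuing from the earlier sketches (the overlap count implements the "maximum number of words in common" criterion; everything else is an assumption):

```python
# Sketch of the disambiguation step: choose the sense whose cluster shares
# the most words with the preprocessed context of the target word.

def disambiguate(context_words, sense_clusters):
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, cluster in sense_clusters.items():
        overlap = len(cluster & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# usage, with preprocess() and clusters from the earlier sketches:
# context = [w for sent in preprocess(input_text) for w in sent]
# print(disambiguate(context, clusters))
```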

4. Evaluation

The evaluation of word sense disambiguation is based on the correctness of the sense selected for an ambiguous word in a particular context, according to human judgment. Since a standard set of contexts and senses is not available for evaluation purposes, 100 contexts were selected manually. On evaluation, the system shows 72% accuracy for the test set of 100 contexts. The details of the evaluation are given in Table I. For comparison, the Malayalam WSD system based on the Lesk algorithm gives an accuracy of 58% [14].

Table I. Evaluation results

No. of contexts given    Correct senses retrieved    Accuracy
25                       15                          60%
50                       28                          56%
75                       52                          69.3%
100                      72                          72%

5. Conclusion

Word sense disambiguation is a very important task in natural language processing, and the WSD process is a tedious job. This work proposes and implements word sense disambiguation based on context similarity, which is an unsupervised method. The proposed system builds sense clusters using the seed sets; then, based on the similarity between the given input text and the sense clusters, the most similar sense is selected as the sense of the ambiguous word. The system gives an accuracy of 72%. As future work, the seed set generation can be automated using machine learning algorithms, making the system completely independent of human intervention.

Acknowledgements

We sincerely thank Mr. Bhadran V K, Associate Director, Centre for Development of Advanced Computing, for his sincere directions imparted to the project. We gratefully acknowledge all staff members of the Department of Computer Science and Engineering, Government Engineering College, Palakkad, for their immense support. Our thanks are also due to those who supported us morally.

References

[1] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):69, February 2009. http://doi.acm.org/10.1145/1459352.1459355.
[2] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, pages 29–37, May 2001.


[3] Nancy Ide and Jean Veronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):45–50, 1998.
[4] David Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 454–460. Association for Computational Linguistics, 1992.
[5] David Yarowsky. One sense per collocation. In Proceedings of the Workshop on Human Language Technology, pages 266–271. Association for Computational Linguistics, 1993.
[6] David Yarowsky. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 88–95. Association for Computational Linguistics, 1994.
[7] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
[8] Chun-Xiang Zhang, Long Deng, Xue-Yao Gao, and Zhi-Mao Lu. Disambiguate Chinese word sense based on linguistics knowledge. International Journal of Database Theory and Application, 7(6):203–210, 2014. http://dx.doi.org/10.14257/ijdta.2014.7.6.18.
[9] Pranjal Protim Borah, Gitimoni Talukdar, and Arup Baruah. Assamese word sense disambiguation using supervised learning. In International Conference on Contemporary Computing and Informatics (IC3I), pages 946–950. IEEE, November 2014.
[10] Jayan V and Bhadran V K. Difficulties in processing Malayalam verbs for statistical machine translation. International Journal of Artificial Intelligence and Applications (IJAIA), 6(3), May 2015.
[11] B. Sankaran and K. Vijay-Shanker. Influence of morphology in word sense disambiguation for Tamil. In Recent Advances in Natural Language Processing: Proceedings of the International Conference on Natural Language Processing, ICON-2003, number 515, page 93. Central Institute of Indian Languages, 2003.
[12] Baskaran Sankaran and V. Vaidehi. Role of collocations and case markers in word sense disambiguation: A clustering-based approach. In IEEE International Conference on Systems, Man and Cybernetics, volume 1, pages 625–630, 2002.
[13] Haroon R. P. Malayalam word sense disambiguation. In IEEE International Conference on Computational Intelligence and Computing Research (ICCIC). IEEE, December 2010.
[14] Mujeeb Rehman O and P. C. Reghu Raj. Malayalam WordNet: A relational database approach. International Journal of Latest Trends in Engineering and Technology (IJLTET), 3(2), November 2013.
[15] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[16] Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
