International Conference on Convergence and Hybrid Information Technology 2009

Hybrid Word Sense Disambiguation Using Language Resources for Transliteration of Arabic Numerals in Korean

Minho Kim, Dept. of Computer Science, Pusan National University, Busan, Korea, +82-51-510-2875, [email protected]

Youngim Jung, Dept. of Knowledge Resources, KISTI, Daejeon, Korea, +82-42-282-5208, [email protected]

Hyuk-Chul Kwon, Dept. of Computer Science, Pusan National University, Busan, Korea, +82-51-510-2875, [email protected]

to the same phonetic symbols [4]. According to the accuracy test results of 19 TTS products by Voice Information Associates, the weakest of the ambiguity-generating areas in TTS products is number processing, whose average accuracy is 55.6% [23]. In the modern Korean language, numerals have three different origins (Korean, Chinese, and English), and they show a variety of variants. Their distribution is largely dependent on context¹. For example, the single numeral '3' can be read in five different ways depending on its following classifier², as in (E 1-a, b, d, e, f), or on its preceding morpheme, as in (E 1-c).

ABSTRACT

The high frequency of Arabic numerals in informative texts, together with their multiple senses and readings, deteriorates the accuracy of TTS systems. This paper presents a hybrid word sense disambiguation method exploiting a tagged corpus and a Korean wordnet, KorLex 1.0, for the correct and efficient conversion of Arabic numerals into Korean phonemes according to their senses. Individual contextual features are extracted from the tagged corpus and grouped in order to determine the sense of Arabic numerals. Least-upper-bound synsets among the common hypernyms of contextual features were obtained from the KorLex hierarchy and used as semantic categories of the contextual features of Arabic numerals. The semantic classes were used to train a decision tree that classifies the meaning and the reading of Arabic numerals, and to compose grapheme-to-phoneme rules for an automatic transliteration system for Arabic numerals. The proposed system outperforms two commercial TTS systems by 3.9%–20.3%.

(E 1) a. 3geuru³ [se/ *seog/ *seo/ *sam/ *seuli] “three stumps”
      b. 3nyeon [*se/ *seog/ *seo/ sam/ *seuli] “three years”
      c. big 3 [*se/ *seog/ *seo/ *sam/ seuli] “Big 3”
      d. 3doe [*se/ seog/ *seo/ *sam/ *seuli] “5.4 ℓ”
      e. 3mal [*se/ *seog/ seo/ *sam/ *seuli] “54 ℓ”
      f. 3dae [se/ *seog/ *seo/ sam/ *seuli] “three machines/the third”

Keywords: Word sense disambiguation, Arabic numeral, TTS.

1. INTRODUCTION

Mapping from texts to phones for text-to-speech transliteration relies heavily on pronunciation dictionaries or letter-to-sound rules. However, this mapping is very difficult, because not all phonetic realizations of graphemes are found in dictionaries, and the same graphemes, according to the context, might not correspond

In (E 1-a, b, d, e), classifiers following Arabic numerals play an important role in determining the reading of those numerals. However, when a homographic classifier follows an Arabic numeral, multiple meanings and readings of an Arabic numeral are acceptable, as shown in (E 1-f). Thus, contextual features are required to be collected from a corpus, clustered for use as learning features, and learned for resolving the ambiguities of reading

¹ Various elements composing “Arabic numeral expressions” affect the reading of Arabic numerals. Their linguistic constraints in reading Arabic numerals, and the transliteration rules of Arabic numeral expressions established by combining those constraints, are discussed in detail in [10].

² In Korean, classifiers are counter-particles following numerals; they characterize and classify the types of object that are counted. In the machine-learning field, a classifier is a learning model that classifies target answers. To avoid confusion, the term ‘classification model’ is used in this paper to indicate the latter.

³ Geuru is “a unit of trees” and nyeon means “year.” Doe and mal are Korean units of volume for measuring liquids or grains; one doe is about 1.8 ℓ, and one mal is about 18 ℓ.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICHIT’09, August 27–29, 2009, Daejeon, Korea. Copyright ⓒ 2009 ACM 978-1-60558-662-5/09/08…$10.00.


widely distributed as those in Chinese or Japanese, the WSD of homographs needs to be studied.

Arabic numerals in new data. In Section 2, related work on text normalization and WSD is presented. In Section 3, the linguistic ambiguities of reading Arabic numerals in Korean are examined, the language resources used for word sense disambiguation of homographs are described, and our system is presented in overview. In Section 4, a hybrid approach to WSD is suggested, which learns contextual features extracted from corpora and re-categorizes the semantic categories of contextual features based on the lexical relations in KorLex 1.0. In Section 5, the automatic transliteration system for Arabic numerals adopting the proposed model is evaluated against two commercial Korean TTS systems. Conclusions and future work follow in Section 6.

2.2 Word Sense Disambiguation Based on Language Resources

A number of studies have been carried out to disambiguate word senses based on various language resources. Depending on what kind of language resource is available for training a WSD system, three methodologies have been suggested: (a) supervised learning based on sense-tagged corpora, (b) dictionary-based disambiguation based on dictionaries, thesauri, or wordnets, and (c) unsupervised disambiguation, in which only untagged text corpora are available [16]. Because (a) and (b) are direct WSD methodologies, these two are reviewed in this section.

(a) WSD based on sense-tagged corpora: Since sense tags vary depending on the application, the available corpora differ. For speech synthesis, the correct sense and pronunciation of a target ambiguous word in its context need to be tagged. If no established corpus fits one's research purpose, tagged corpora are constructed by each researcher according to the application, as in [26, 28]. From the pronunciation- and sense-tagged corpora, features are extracted and used to correctly classify instances of ambiguous words in new data. With learning algorithms such as the Bayesian classification model proposed in [7], or the decision tree adopted in [10, 25, 26], many studies have reported high accuracies in WSD. However, features extracted from corpora are so highly lexicalized that similar individual words must be grouped into semantic categories in order to reduce the number of learning parameters. Moreover, WSD based on a tagged corpus is expensive, because established sense-tagged corpora are rare and constructing them is time-consuming and labor-intensive.

(b) WSD based on dictionaries and wordnets: Semantic categories can be obtained from definitions in machine-readable dictionaries or from thesauri or wordnets.
Since multiple semantic categories are assigned to an ambiguous word, algorithms for scoring candidates and selecting one semantic category in a given context have been studied. Yarowsky [25] suggested WSD adopting approximate conceptual classes from Roget's thesaurus for resolving the ambiguities of twelve polysemous words. The method achieved 92% accuracy; however, it had a limitation in distinguishing minor senses within a category. Since WordNet was developed and constructed by Miller and his colleagues, many studies on WSD based on WordNet have been proposed [5]. Based on a hierarchical structure containing nouns at 12 levels of depth and verbs at 4 levels of depth, conceptual distance and conceptual density for application to WSD have been studied. Twenty-five semantic categories in

2. RELATED STUDIES

For the purpose of the practical application of natural language processing (NLP) techniques to TTS, numerous studies have been conducted. In particular, NLP techniques have been applied mainly to text normalization and word sense disambiguation of input texts, and those techniques have been reported to be effective in improving the accuracy of TTS systems.

2.1 Normalization of Non-standard Words

Researchers performing practical studies on the implementation of TTS systems have focused on the importance of text normalization. In [22], the following non-standard word classes related to numerals were defined: cardinal, ordinal, a digit number, year, number as street address, zip code, telephone number, money, big money such as million/billion/trillion, date, time, and percentage. The project sought to handle non-standard words in English texts and achieved high accuracy, although it had a limitation in that insufficient pattern features and no arithmetic features were adopted, despite the fact that both of these features are language-independent factors and contribute much to the identification of Arabic numeral expressions. This basic non-standard word model, after slight modification, was applied to non-standard words in Japanese and Chinese texts [19]. The generalized model applied to those Asian languages showed an accuracy of only 72.9% on numeral expressions. This result shows that the transliteration of Arabic numerals remains very challenging. The identification of numerals becomes more complicated due to the high error rate in the use of white space in Korean texts [12]. However, few studies have focused on Korean text normalization incorporating sufficient analysis of the characteristics of Arabic numerals and text symbols. Current Korean TTS systems adopt simple transliteration rules and generate only one to three pronunciation forms for Arabic numerals. As a result, they generate many types of incorrect readings for numerals, showing an accuracy of only 55%–87.7%⁴ [10]. Specifically, they seldom produce the correct pronunciation when Arabic numerals are combined with homographs. Since homographs in Korean are as

⁴ For a more reliable performance comparison, the performances of two commercial Korean TTS systems and the system proposed in this study were evaluated on the same test data sets, as discussed in Section 5.


Table 1. Reading Formulae of Arabic Numerals

Origin          POS        Sense     Distribution  Label    Example
Korean          Noun       Cardinal  base          Kca_n    3, 4: ses, nes
Korean          Noun       Ordinal   base          Kor_b    3, 4: sesjjae, nesjjae
Korean          Adjective  Cardinal  base          Kca_b    3, 4: se, ne
Korean          Adjective  Cardinal  variant       Kca_v    3, 4: seo/seog, neo/neog
Korean          Adjective  Ordinal   base          Kor_b    3, 4: sesjjae, nesjjae
English         -          Cardinal  base          Brn      3, 4: seuli, po
Chinese (+DSM)  -          Cardinal  base          C_b[+D]  6, 10, 215: yug, sib, ibaeg sib o
Chinese (+DSM)  -          Cardinal  variant       C_v[+D]  6, 10: yu, si
Chinese (-DSM)  -          Cardinal  base          C_b[-D]  215: i-il-o

of numerals, the meaning of numerals, the distribution of allomorphemes, and the existence of decimal scale markers (DSM)⁵. The subcategories of RFA are shown in Table 1 with examples⁶. As we can see in (E 2), however, Arabic numerals combined with (a) homographic classifiers (E 2-a), (b) polysemic classifiers (E 2-b), and (c) text symbols in multiple uses (E 2-c) do not select a unique RFA.

WordNet have been applied in [1]. In that work, the accuracy of WSD using twenty-four coarse-grained semantic categories was higher than that of WSD using individual synsets. With respect to sense granularity, the semantic categories in WordNet are finer than those of Roget's thesaurus used in [25]. WSD based on dictionaries or wordnets has limitations in handling the real data produced today. Thus, a hybrid method exploiting a medium-sized tagged corpus and a wordnet together for WSD is suggested in this paper.

(E 2) a. 3 dae [se] “three machines”
         3 dae [sam] “the third generation” or “the three largest”
      b. 215 gi [ibaeg yeol daseos] “two hundred and fifteen planes”
         215 gi [i-il-o] “Plane number two-one-five”
      c. 97-06-04 [gusibchil nyeon yu wol sa il] “June 4, 1997”
         97-06-04 [gu-chil-e gong-yug-e gong-sa] “97-06-04 (ID)”

3. AUTOMATIC TRANSLITERATION OF ARABIC NUMERALS

Computer-readable texts contain not merely alphabetic letters but also non-alphabetic symbols. Specifically, the greater the amount of information and scientific content within a text, the greater the occurrence of Arabic numerals and text symbols, because they have graphic simplicity and deliver more precise information. The occurrence frequency of Arabic numerals and text symbols in Korean newspaper articles is 9.7%. In Section 3, linguistic issues in recognizing and reading Arabic numerals are discussed, and the language resources used to implement the automatic transliteration system of Arabic numerals are described.

The distribution of Arabic numerals having multiple RFA is wide: their ratio is 45.5%, and the ratio of multiple RFA derived from homographic classifiers is 14.2% [10]. Since many Chinese homographic classifiers combine with Arabic numerals and select different RFA depending on the senses of the classifiers [3], prior word sense disambiguation of the homographic classifiers is required for selecting the correct RFA. Table 2 shows each sense of the homographic classifiers and the corresponding RFA.

3.1 Ambiguities in Reading Arabic Numerals

Yoon et al. [27] showed that the Korean numeric system includes different reading formulae of Arabic numerals (RFA). The criteria for subcategorizing the reading formulae of the Korean numeric system are the origin of numerals, the part of speech (POS)

Table 2. Homographic classifiers and RFA (excerpted)

Classifier  Sense                             RFA
dae         1. unit of machinery              Kca_b
dae         2. the time of life (of persons)  C_b[+D]
dae         3. the largest (item)             C_b[+D]
pyeon       1. flight number                  C_b[-D]
pyeon       2. unit of volumes                Kca_b

⁵ Korean DSMs (Decimal Scale Markers) are sib (“10”), baeg (“100”), cheon (“1,000”), man (“10,000”), eog (“100,000,000”), and others [9].

⁶ In this paper, letters in italics stand for grapheme-to-phoneme conversions of Korean, and phrases in quotation marks or parentheses are interpretations of the sample phrases.

Twenty-five homographic classifiers were analyzed.



endings. Content words are separated from function morphemes and are lemmatized using morphological analysis and a dictionary search. The neighboring words of Arabic numerals are also segmented, and their POS determined. The lemmatized neighboring

3.2 Tagged Corpora and Korean WordNet

(a) RFA-tagged corpus: For the purpose of analyzing the ambiguities caused by homographic classifiers and resolving them by learning contextual features, the corpus was composed of news articles issued from January 1st, 2000 to December 31st, 2001 by ten major newspapers in Korea. The corpus covered politics, economics, society, international news, sports, entertainment, and opinion. Sentences containing Arabic numerals and their neighboring words were randomly sampled, and the correct RFA tags were labeled⁷. The labeling of RFA tags is semi-automated, using the rule-based transliteration system of Arabic numeral expressions developed by [27]. Linguists validated and revised the automatic labeling of RFA. The training corpus is composed of 407,489 occurrences of Arabic numerals.

(b) KorLex 1.0 (Korean Lexico-Semantic Network): KorLex 1.0 is constructed based on WordNet 2.0. The construction is performed semi-automatically by mapping English words in WordNet to English words in a machine-readable English-Korean bilingual dictionary and in dictionaries of special terminology. The corresponding Korean words are obtained automatically for the translation-candidate English words in WordNet. The correct Korean words are then selected by linguists, considering their definitions and examples as listed in a Korean dictionary [9]. At this stage, the main problem results from the several senses suggested by the first automatic mapping, which cannot be disambiguated completely. Despite the difficulties entailed in mapping the correct word sense, and the linguistic and conceptual disparities between the two languages, the semi-automatic translation and manual post-processing method is cost-effective and assures coherent and objective mapping. The KorLex 1.0 statistics are shown in Table 3.

Figure 1. System Architecture of Auto-TAN

words are treated as candidates of contextual features, as described in Section 4.1. Secondly, proper nouns and terminologies containing Arabic numerals are individually transliterated by searching a transliteration dictionary of proper nouns and terminologies and by longest matching. In the third step, the target Arabic numerals are recognized, and their pattern features and arithmetic features are extracted. Pattern features are the “number of numerals in one word”, the “number of text symbols combined with Arabic numerals in one word”, and others. Arithmetic features are the “size of an Arabic numeral”, the “first place of an Arabic numeral”, and others. They characterize the types of Arabic numeral clusters in texts [10]. For example, if “97-06-04” and “02-513-4463” are input, they are converted into the pattern “N-N-N”. The pattern features of the two input strings are identical. Once the pattern features are obtained, the arithmetic features are extracted from the input strings. The first features of “97-06-04” and “02-513-4463” are “9” and “0”, respectively. These features distinguish a date from a telephone number that has the same “N-N-N” pattern structure. As mentioned in Section 1, classifiers or neighboring words of an Arabic numeral provide key information for determining the RFA. If “97-06-04” occurs with “il-si (the date)” or “nal-jja (the date)”, the Arabic numerals are considered to represent a date and are transliterated into “gusibchil nyeon yu wol sa il (C_v[+D])”, together with the text symbol “-”. If “beon-ho (ID number)” comes with “97-06-04”, the numerals are transliterated as “guchil gongyug gongsa (C_b[-D])”. In the fourth step, word sense disambiguation of the homographic classifiers is performed based on the hybrid WSD method illustrated in Section 4. In the final step, a default RFA is applied to those Arabic numerals for which disambiguation through contextual, pattern, and arithmetic features has failed. After this procedure, the transliterated strings of the input Arabic numerals are output.
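The pattern and arithmetic features of the third step can be sketched as follows (function and feature names are illustrative, not Auto-TAN's actual implementation):

```python
import re

def pattern_features(token: str) -> dict:
    """Pattern features: digit-run structure and symbol counts."""
    return {
        "pattern": re.sub(r"\d+", "N", token),     # "97-06-04" -> "N-N-N"
        "num_runs": len(re.findall(r"\d+", token)),
        "num_symbols": len(re.findall(r"[^\d]", token)),
    }

def arithmetic_features(token: str) -> dict:
    """Arithmetic features of the first digit run."""
    runs = re.findall(r"\d+", token)
    first = runs[0] if runs else ""
    return {
        "first_digit": first[:1],                  # leading digit
        "size": len(first),                        # number of places
    }

# "97-06-04" and "02-513-4463" share the pattern "N-N-N"; the
# arithmetic features then separate them: a year starts with "9",
# the telephone area code with "0".
```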

Table 3. KorLex 1.0 Statistics

        Words    Synsets
Noun    41,368   58,656
Verb    15,613   12,440

KorLex 1.0 is the first version of the Korean Lexico-Semantic Network and was released at the 2nd Workshop on Knowledge Information Processing and Ontology [14].

3.3 System Architecture of Auto-TAN

Since Arabic numerals are widely distributed in Korean texts and are found in various contexts, an Arabic numeral transliteration system requires sub-modules for morphological analysis as well as word sense disambiguation, along with comprehensive dictionaries. The overall procedure of automatic transliteration of Arabic numerals, and the system architecture, are illustrated in Figure 1. In the first step, the input sentences are segmented into words by white space. In Korean, one word is composed of content words and function morphemes such as case markers, postpositions, or

4. HYBRID METHOD FOR WORD SENSE DISAMBIGUATION

⁷ In an exact sense, RFA is not a sense tag of homographic classifiers. However, as shown in Table 2, RFA not only differs with the context but also serves as a good indicator of word sense. The ultimate goal of our system is to predict the correct reading formulae of Arabic numerals.

As shown in Section 3.1, since RFA is determined depending on the sense of homographic classifiers and the ambiguities of homographs are resolved by semantic correlation with neighboring words, words around the Arabic numerals and homographic classifiers can be used as distinctive features to predict the correct


Words that show more than 5 by MI and more than 100 by χ² are selected as contextual features for the disambiguation of ‘dae’.

RFA. In Section 4.1, extraction and training of contextual features for the WSD of homographic classifiers are shown with sample sentences in the RFA-tagged corpus, and in Section 4.2, generalized re-categorization of contextual features is performed based on the lexical hierarchy in KorLex Noun 1.0.

(E 3′′) a. taegsi, beoseu, seunghabcha
        b. mihonnam, namja, yeoja
        c. jeonje, jogeon

4.1 Extracting and Training Contextual Features from the Tagged Corpus

Table 4. Relevancies of Neighboring Words Measured by MI and χ² (excerpted)

According to Gale et al. [7], ±50 words around the target ambiguous word contribute to distinguishing its senses, and ±20 words are considered a practical context width for the disambiguation of English homographs. However, experiments with the Sejong corpus of 150 million words have shown that 96.4% of the words contributing contextual features lie within ±3 words of the target ambiguous word in Korean texts [11]. Based on this result from the Korean corpus, ±3 words around the target Arabic numerals combined with homographic classifiers were extracted from our corpus. The instances of Arabic numerals appearing with homographic classifiers numbered 13,196 in total. The steps for the extraction of contextual features are outlined below, using the homographic classifier ‘dae’ as an example.

Senses of ‘dae’, sample sentences, and RFA:

(E 3) a. taegsi–neun eobs–go beoseu 30dae–wa sugbag-eobso–ui seunghabcha–man unhaengdoe–n–da “There are no taxis but thirty buses, and passenger vans run by hotels” (sense 1, RFA [seoreun])
      b. mihonnam–ui boheomlyo–neun 30dae gyeolhonhan namja–wa yeoja–boda nopda “the premium of bachelors is higher than that of married men and women in their thirties” (sense 2, RFA [samsib])
      c. yadang–eun yeongsuhoedam–ui 30dae jeonje jogeon–eul jujanghago “The opposition required thirty main pre-conditions before the key leaders conference” (sense 3, RFA [samsib])

Sense  Neighboring word                    MI     χ²
1      charyang “vehicles”                 10.58  1524.67
1      taegsi “taxi”                        8.99   507.11
1      misa-il “missile”                    6.12    67.72
1      beoseu “bus”                        10.56  1524.67
1      yeogaeggi “plane”                    8.77   869.52
1      abantteXD “Avante XD (car model)”    9.58   761.50
1      seunghabcha “passenger van”          8.99   507.11
2      ju-ingong “hero”                    10.16  1143.25
2      namja “man”                         10.16  1143.25
2      bubu “couple”                        6.38    82.22
2      yeoja “woman”                        9.56   761.83
2      yeoseong “female”                    5.53    44.59
2      mihonnam “bachelor”                  8.25   303.60
3      jejeong “(a rule) institute”         9.35  1305.99
3      pyogyeol “ballot taking”             9.35  1959.43
3      jogeon “condition”                   6.35    79.99
3      a “a proposal”                       9.35  1959.42
3      jeonje “premise”                     7.77   216.38
3      deunglog “registration”              9.35   652.86

Step 2: Semantic Clustering of Contextual Features

Individual words are too numerous to be used as learning parameters. Since too many parameters lower learning efficiency, words preceding or following the combination of an Arabic numeral and a homographic classifier should be clustered into semantic categories. For lack of information on the semantic category of each word, the semantic categories of contextual features were initially set up by the authors, as shown in Table 5.


Table 5. Semantic Categories of Contextual Features

Semantic category  Sample words
Time               ojeon “morning”, bam “night”, si “hour”
Date               il-si “date”, wol “month”, il “day”
Order              je (a prefix meaning order), wi “rank”
Number             mun-ui “inquiry”, beon-ho “ID number”
Quantity           chong “total”, sagwa “apple”
Formula            sig “formula”, bangjeongsig “equation”
Index              ji-su “index”, gagyeog “price”
Location           jangso “place”, jiyeog “region”, dosi “city”
Sport              gyeong-gi “game”, chuggu “soccer”
Proper noun        names of entities
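The word-to-category replacement of Step 2 can be sketched with a few of Table 5's sample words (the dictionary here is a toy subset; the actual inventory was hand-built by the authors over far more words):

```python
# Toy subset of the hand-built semantic categories in Table 5
# (romanized Korean word -> coarse category).
CATEGORIES = {
    "ojeon": "Time", "bam": "Time", "si": "Time",
    "il-si": "Date", "wol": "Date", "il": "Date",
    "je": "Order", "wi": "Order",
    "mun-ui": "Number", "beon-ho": "Number",
    "chong": "Quantity", "sagwa": "Quantity",
}

def to_category(word: str) -> str:
    """Replace a lexical contextual feature with its coarse semantic
    category, shrinking the learning-parameter space."""
    return CATEGORIES.get(word, "Unknown")

features = ["beon-ho", "wol", "chong"]
categories = [to_category(w) for w in features]   # ["Number", "Date", "Quantity"]
```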

Content words are separated from function morphemes and lemmatized through morphological analysis. For example, the lemmatized content words in (E 3) are collected as follows.

(E 3′) a. taegsi, eobs-, beoseu, sugbag-eobso, seunghabcha, unhaengdoe-
       b. mihonnam, boheomlyo, gyeolhonha-, namja, yeoja, nop-
       c. yadang, yeongsuhoedam, jeonje, jogeon, jujangha-

Step 1: Measuring Relevancies of Neighboring Words

Among the lemmatized content words, three nouns to the left and three to the right of the target Arabic numeral and homographic classifier, considered related to each sense of the homographic classifier, are extracted. The relevance between a neighboring word and each sense of ‘dae’, based on co-occurrence frequency in the entire corpus, is measured by Mutual Information (MI) and χ², as shown in Table 4.
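Both relevance measures can be sketched from simple co-occurrence counts (a minimal illustration; the exact estimation used in the paper may differ):

```python
import math

def mutual_information(n_ws, n_w, n_s, n):
    """Pointwise MI between a neighboring word w and a sense s:
    n_ws = #(w co-occurring with s), n_w = #w, n_s = #s, n = total."""
    return math.log2((n_ws * n) / (n_w * n_s))

def chi_square(n_ws, n_w, n_s, n):
    """Chi-square statistic over the 2x2 contingency table of w and s."""
    # observed cells: (w,s), (w,~s), (~w,s), (~w,~s)
    observed = [n_ws, n_w - n_ws, n_s - n_ws, n - n_w - n_s + n_ws]
    # expected cells under independence of w and s
    expected = [n_w * n_s / n, n_w * (n - n_s) / n,
                (n - n_w) * n_s / n, (n - n_w) * (n - n_s) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A neighboring word is kept as a contextual feature for a sense
# when MI exceeds 5 and chi-square exceeds 100, the thresholds
# used in this section.
```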

Step 3: Training learning features and testing performance

The distinctive power of each feature is explicitly represented through the constructed tree; consequently, the decision tree has been adopted in numerous studies (Mitchell, 1997). In addition,


lexical relations between the word sense of homographic words and their context can reduce or remove the ambiguities. In an inheritance system, a hyponym inherits all of the features of the more generic concept and adds at least one feature that distinguishes it from its superordinate and from any other hyponyms of that superordinate [24]. KorLex Noun, as an inheritance system, is applied to the re-categorization of contextual features as follows.

because contextual features in natural language texts affect each other, and decision rules can be easily derived from the constructed tree, a decision tree was adopted as our classification model. To construct the decision tree, the information gain of each feature is calculated, and the best feature is selected step by step. According to Quinlan [20], the gain criterion tends to prefer attributes with large numbers of possible values. To compensate for this strong bias, a modification of the measure called the gain ratio is widely used:

gain ratio(X) = [ −∑_{j=1..k} (freq(C_j, S)/|S|) · log₂(freq(C_j, S)/|S|) − ∑_{i=1..n} (|T_i|/|T|) · info(T_i) ] / [ −∑_{i=1..n} (|T_i|/|T|) · log₂(|T_i|/|T|) ]    (1)

where S is the sample set of Arabic numerals, X the attribute under evaluation, T the training set partitioned into subsets T_i by the values of X, and C_j a class to which an instance belongs (e.g., C_b[+D], Kca_b).

Step 1: Clustering lemmatized words used as contextual features extracted from the tagged corpus: {taegsi (“taxi”), beoseu (“bus”), seunghabcha (“passenger van”)} in (E 3-a), {mihonnam (“bachelor”), namja (“man”), yeoja (“woman”)} in (E 3-b), and {jeonje (“premise”), jogeon (“condition”)} in (E 3-c).
Step 2: Mapping the words of each cluster to the KorLex hierarchy.
Step 3: Listing all common hypernyms of the synset nodes mapped from the contextual features.
Step 4: Finding the least upper bound of the synset nodes in a cluster. Here, susonggigwan (“transport”), saram (“person”), and jeonje (“premise”) are selected as least upper bounds, as shown in Figure 2.
Step 5: Selecting the least upper bound as the semantic category for the cluster of contextual features. It can differ from case to case.
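Steps 2 through 4 can be sketched over a toy hypernym tree (node names follow Figure 2, but the hierarchy is heavily simplified and assumes a single parent per synset):

```python
def ancestors(synset, hypernym_of):
    """Hypernym chain of a synset, nearest hypernym first."""
    chain, node = [], synset
    while node in hypernym_of:
        node = hypernym_of[node]
        chain.append(node)
    return chain

def least_upper_bound(cluster, hypernym_of):
    """Intersect the hypernym chains of a cluster (Step 3) and pick
    the common hypernym closest to the leaves (Step 4)."""
    chains = [ancestors(s, hypernym_of) for s in cluster]
    common = set(chains[0]).intersection(*map(set, chains[1:]))
    return min(common, key=chains[0].index)

# Toy single-parent hierarchy for the (E 3-a) cluster.
hyper = {
    "taegsi": "jadongcha", "jadongcha": "susonggigwan",
    "beoseu": "daejunggyotong", "daejunggyotong": "susonggigwan",
    "seunghabcha": "teuleog", "teuleog": "susonggigwan",
    "susonggigwan": "gaeche",
}
lub = least_upper_bound(["taegsi", "beoseu", "seunghabcha"], hyper)
# -> "susonggigwan" ("transport"), the least upper bound of the cluster
```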

After the decision tree is constructed, the performance of the model is tested using 10-fold cross-validation on the learning data containing 13,196 occurrences of Arabic numerals appearing with homographic classifiers. The baseline accuracy is measured by adopting one rule: if the number of digit groups in the target Arabic numeral is ‘1’, the RFA is ‘C_b[+D]’, the most frequent class. The proposed WSD model based on local contextual features showed 85.2% accuracy, whereas the baseline showed 73.2%. Though the WSD model improved the accuracy, two problems remain to be resolved: (1) the semantic categories of contextual features were made arbitrarily; and (2) unseen words or unfamiliar senses might require modification of the established semantic categories. Therefore, in Section 4.2, generalized and dynamic categorization of contextual features based on KorLex Noun 1.0, in which lexical information is structured hierarchically, is described.
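The gain-ratio criterion above can be sketched as follows (the samples are hypothetical, not drawn from the paper's corpus):

```python
import math
from collections import Counter

def entropy(labels):
    """info(S): class entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(samples, feature, target):
    """Quinlan's gain ratio of `feature` for predicting `target`,
    over samples given as dicts."""
    labels = [s[target] for s in samples]
    n = len(samples)
    # Partition the training set T into subsets Ti by the feature's values.
    parts = {}
    for s in samples:
        parts.setdefault(s[feature], []).append(s[target])
    remainder = sum(len(t) / n * entropy(t) for t in parts.values())
    gain = entropy(labels) - remainder             # information gain
    split_info = -sum(len(t) / n * math.log2(len(t) / n)
                      for t in parts.values())     # penalizes many-valued features
    return gain / split_info if split_info else 0.0

# Hypothetical samples: semantic category of a neighboring word
# vs. the RFA class of the Arabic numeral.
data = [
    {"category": "transport", "rfa": "Kca_b"},
    {"category": "transport", "rfa": "Kca_b"},
    {"category": "person",    "rfa": "C_b[+D]"},
    {"category": "premise",   "rfa": "C_b[+D]"},
]
```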


4.2 Generalization of Semantic Categories of Contextual Features

[Figure 2 here: KorLex noun hierarchies for the contextual-feature clusters in (E 3-a, b, c), descending from gaeche (“entity”) and chusanghwa (“abstraction”) to the nouns appearing in the corpus; the marked least upper bounds (LUB) are susonggigwan (“transport”), saram (“person”), and jeonje (“premise”).]

Figure 2. Automatic and dynamic selection of the Least Upper Bound

By applying the semantic relations in the KorLex hierarchy to the training corpus, forty-six semantic categories were obtained. Learning with the forty-six generalized semantic categories proceeded in the same manner as described in Section 4.1. WSD of homographic classifiers with the application of KorLex 1.0 obtained a further improvement, showing 87.3% accuracy in resolving the ambiguities of homographic classifiers and predicting the correct RFA. For a closer analysis of the performance improvement produced by the WSD method with KorLex, the accuracy of WSD with the Korean Synonym Dictionary was also measured. Neighboring words having high relevance for each sense of the homographic classifiers, used as contextual features in Section 4.1, were mapped to the entries of the dictionary, and the contextual features were expanded based on the synonym and near-synonym relations in the dictionary. Disappointingly, however, the accuracy dropped to 81.8%, because the dictionary carries no distinctive information for classifying each sense of homographs or polysemic words. This result suggests that wordnets are not simple dictionaries of synonyms or expanded semantic relations but organic lexical networks.

In this section, resolutions of the two problems underlying WSD using a tagged corpus are suggested, based on the lexical hierarchy and semantic relations contained in KorLex Noun 1.0. Two assumptions are presupposed for the application of the lexical hierarchy in KorLex Noun: (H-1) semantic ambiguities caused by homographs or polysemic words can be reduced or removed by mapping the words properly to the KorLex hierarchy; (H-2) hyponyms inherit semantic characteristics from their hypernyms in KorLex. According to Lyons [15], hyponymy is transitive and asymmetrical and, since there is normally a single superordinate, generates a hierarchical semantic structure in which a hyponym is said to be below its superordinate. This convention provides the central principle for organizing the noun hierarchy in WordNet, in which each lexicalized concept is defined by its lexical relations rather than by a definition [17]. (H-1) is based on the idea that


systems. The accuracy of these systems was measured manually with respect to generating the correct sounds of the Arabic numerals. Phonetic distinctions such as /sam/ and /s?am/ for ‘3’ may be generated by the TTS systems; however, they are ignored, since they do not change the sense of ‘3’. Given that a correct transliteration of an Arabic numeral is readily synthesized to speech by searching the phoneme database, the accuracies of Auto-TAN and the two TTS systems can be compared on an identical level. Table 8 shows the accuracy of Auto-TAN and the two TTS systems. The results show that Auto-TAN outperformed "VoiceWare" and "CoreVoice" by 3.9%–20.3%. In addition, Auto-TAN performed consistently regardless of the source or the manner in which a corpus was sampled. Table 9 shows typical errors generated by the two TTS systems and Auto-TAN.

Table 6 presents and compares the accuracies of the proposed hybrid WSD method and others. Table 6. Comparison of WSD accuracies WSD method Accuracy (%) Baseline 73.2 WSD based on local context 85.2 Hybrid WSD based on local context and 87.3 semantic relations in KorLex WSD with expansion on synonym dictionary 81.8 Though the result is encouraging, several problems remain. First, disambiguation of homographic or polysemic contextual features based on the least upper bound algorithm does not perform sufficiently well. For example, in some cases, the top-most synsets are selected as the least upper bound of mapped synsets, whereas the top-most synsets are not useful to characterize the semantic properties of contextual features. Constraints are necessary to prevent one-way ascending. Second, KorLex 1.0 construction, referenced to Princeton WordNet, has potential problems in application to Korean language processing. The overall mapping rate of contextual features is not high not only because numerous Korean words or concepts that do not exist in WordNet are missing in KorLex1.0, but also because too-fine-grained synsets in Princeton WordNet were translated into phrases. In order to mitigate the problem, frequent Korean words and concepts should be added, and the semantic relations need to be refined.

Table 8. Comparison of Accuracy of Auto-TAN, VoiceWare, and CoreVoice Systems Systems Set 1 Set 2 Set 3 Set 4 Set 5 Auto-TAN 99.1 97.7 95.9 97.3 95.6 VoiceWare 87.8 86.1 79.4 91.7 83.8 CoreVoice 87.1 88.7 78.8 87.8 82.9 Table 9. Typical Errors generated from TTS systems and Auto-TAN Incorrect Correct Input Arabic Numerals System result reading 6wol “June” [*yug] [yu] C, V 3mal “54 liter” [*sam] [seo] C, V 9.11 teleo “nine-eleven [*gu-jeom il-il] [guØil-il] C, V terror” 3∼4 gae “three or four [*sam-e-seo sa] [seo-neo] C, V items” 50dae jungban “mid-fif[*swin] [osib] C ties” MP3 “MP3” [*sam] [seuli] C

5. EXPERIMENTS AND RESULTS In this Section, the comparison of the accuracy of the AutoTAN adopting the proposed WSD method and that of two commercial TTS systems is presented experimentally. For a more reliable evaluation of our system, we constructed five test sets from different resources and sampled by different methods, as shown in Table 7. Table 7. Structure of Test Data Sets Data set Set 1 Set 2 Set 3 Set 4 Set 5

Size 10,000 words 1,000 words 1,000 words 1,000 words 1,000 words

Source data Ten newspapers Newspaper A Newspaper A Newspaper B Newspaper B

Sampling method Random Balanced Random Balanced Random

3 dae-leul “three generations+objective” or “three machines+objective”

[*sam]

[se]

C, V, AutoTAN

“CoreVoice” and “VoiceWare” systems adopt only three or four reading formulae of an Arabic numeral such as “Kca_b”, “C_b[+D]”, “C_b[-D]”, and “Brn”, thus they cannot correctly generate the correct reading of ‘6’ and “3” in “6wol” and “3mal”, respectively. They generate not only incorrect reading of Arabic numerals but incorrect text symbols because they do not recognize the sense of Arabic numerals combined with text symbols. Due to lack of refined WSD module, the two systems show serious defect in reading numerals occurred with homographic classifiers. They adopt only “C_b[+D]” to what they don’t distinguish senses of Arabic numerals and homographic classifiers. Though Auto-TAN shows superior performance to those two customized systems, it needs to be improved. Auto-TAN applied a refined-however, incorrect- rule, “If an Arabic numeral combined with Roman alphabet, then RFA is Brn” in reading “21” of “BK21”. Reading Arabic numeral occurred with homographic classifiers is not easy when context is not sufficient in the given width. Thus, Auto-TAN fails to transliterate “3” in “3 dae-leul modu bonaessda”, either.

The source data and sampling method of Test set 1 are identical to the training data. Article texts of Newspaper B have fewer RFA variations and fewer grammatical errors than those of Newspaper A. Since Newspaper B offers its daily news reading (TTS) service on its website, the articles are text normalized in order to be used as source texts for the TTS system. To build a balanced corpus, the data should represent the overall characteristics of a language by sampling evenly according to genre and time [8]. In this work, the balanced corpus was composed of instances showing exhaustive RFA according to the distribution ratio of RFA in the training corpus. The accuracy of our Auto-TAN and the accuracies of two commercial TTS systems, "VoiceWare" and "CoreVoice", were measured and compared in the five test sets. “VoiceWare” and “CoreVoice” have been evaluated as the best Korean commercial TTS
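The kind of context-driven reading selection at issue in the error analysis can be sketched as follows. The readings of '3' follow the examples in Table 9, but the classifier list, the function signature, and the fallback behaviour are illustrative assumptions, not the rule set of Auto-TAN or the commercial systems:

```python
# Illustrative sketch of choosing a reading for '3' from its context.
# Classifier list and rule logic are assumptions for illustration only;
# the readings themselves follow the paper's Table 9 examples.

NATIVE_READING = {"3": "se"}      # native Korean counting reading
SINO_READING = {"3": "sam"}       # Sino-Korean reading
ENGLISH_READING = {"3": "seuli"}  # English digit name ("three")

# assumed sample of classifiers that take the native Korean reading
NATIVE_CLASSIFIERS = {"gae", "dae", "myeong"}

def read_numeral(num, prev="", nxt=""):
    """Pick a reading for `num` from the surrounding tokens."""
    if prev and prev.isascii() and prev.isalpha():
        # Roman-letter context, e.g. "MP3" -> English digit name
        return ENGLISH_READING[num]
    if nxt in NATIVE_CLASSIFIERS:
        # native-numeral classifier, e.g. "3 gae" -> [se]
        return NATIVE_READING[num]
    # default: Sino-Korean reading, e.g. "3wol" (March) -> [sam]
    return SINO_READING[num]

print(read_numeral("3", prev="MP"))   # -> seuli
print(read_numeral("3", nxt="gae"))   # -> se
print(read_numeral("3", nxt="wol"))   # -> sam
```

A fixed table like this fails exactly where the paper's WSD module is needed: a homographic classifier such as "dae" admits both readings, so the surface context alone cannot decide between them.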


6. CONCLUSIONS AND FUTURE WORK
In this paper, the ambiguities in reading Arabic numerals were analyzed, and a hybrid word sense disambiguation model exploiting a tagged corpus and a Korean wordnet was proposed. Relevant words among the neighbors of the target Arabic numeral and homographic classifier were selected as contextual features. The individual contextual features were re-categorized into forty-six semantic classes based on the lexical hierarchy in KorLex. Nouns labeled with semantic class(es) were trained to determine the meaning and the reading of Arabic numerals using the C4.5 algorithm. The hybrid WSD model showed 87.3% accuracy in the sense disambiguation of homographic classifiers and thus in the prediction of the correct RFA. Auto-TAN, adopting the proposed WSD model, outperforms current TTS systems by 3.9% to 20.3% and distinguishes the senses of Arabic numerals well even when they occur with homographic words. For future work, WSD for ambiguous contextual features, by adopting a scoring algorithm, should be pursued. Since KorLex 1.0 has not yet been completely refined, continued studies on WSD for other applications with the refined KorLex are promising.

7. ACKNOWLEDGEMENTS
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (No. R01-2007-000-20517-0).

8. REFERENCES
[1] Agirre, E. and Rigau, G. 1996. Word sense disambiguation using conceptual density. (Paper presented at COLING 1996)
[2] Allan, K. 1977. Classifiers. Language, 53(2), 285–311
[3] Chae, W. 1983. A study on numerals and numeral classifier constructions in Korean. Linguistics Study, 19(1), 19–34
[4] Daelemans, W. and Bosch, A. 1994. A language-independent, data-oriented architecture for grapheme-to-phoneme conversion. (Paper presented at the ESCA-IEEE Speech Synthesis Conference)
[5] Fellbaum, C. (Ed.) 1998. WordNet – An Electronic Lexical Database. (Cambridge, MA: MIT Press)
[6] Francis, W. and Kučera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. (Boston: Houghton Mifflin)
[7] Gale, W. A., Church, K. W., and Yarowsky, D. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415–439
[8] Hausser, R. 1999. Foundations of Computational Linguistics: Man–Machine Communication in Natural Language. (Berlin, Heidelberg: Springer-Verlag)
[9] Hwang, S. and Yoon, A. 2005. Semantic feature inheritance revisited for building Korean lexical semantic network (1): A case study using sex feature. Korean Linguistics, 29, 309–338
[10] Jung, Y., Yoon, A., and Kwon, H. 2006. Disambiguation based on wordnet for transliteration of Arabic numerals for Korean TTS. LNCS, 3878, 366–377
[11] Kim, J., Choi, H., and Oak, C. 2003. Disambiguation model of homographs based on statistics using weight. Korean Information Science: Softwares and Applications, 30(11), 1112–1123
[12] Kwon, H., Kang, M., and Choi, S. 2004. Stochastic Korean word-spacing with smoothing using Korean spelling checker. Computer Processing of Oriental Languages, 17, 239–252
[13] Leacock, C. and Chodorow, M. 1998. Combining local context and WordNet similarity for word sense identification. (In Fellbaum, C. (Ed.), WordNet – An Electronic Lexical Database (pp. 265–283). Cambridge, MA: MIT Press)
[14] Lee, E., Lim, S., and Kwon, H. 2004. Output of Korean translation of WordNet 2.0. (Paper presented at the 2nd Workshop on Knowledge Information Processing and Ontology, Daejeon, S. Korea)
[15] Lyons, J. 1977. Semantics. 2 vols. (New York: Cambridge University Press)
[16] Manning, C. D. and Schütze, H. 2001. Foundations of Statistical Natural Language Processing. (Cambridge, Massachusetts: MIT Press)
[17] Miller, G. A., Beckwith, R., Fellbaum, C., and Gross, D. 1993. Introduction to WordNet: An on-line lexical database.
[18] Mitchell, T. M. 1997. Machine Learning. (New York: McGraw-Hill)
[19] Olinsky, C. and Black, A. W. 2000. Non-standard word and homograph resolution for Asian language text analysis. (Paper presented at ICSLP 2000, Beijing, China)
[20] Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. (San Mateo: Morgan Kaufmann Publishers)
[21] Resnik, P. 1995. Using information content to evaluate semantic similarity. (Paper presented at the 14th International Joint Conference on Artificial Intelligence)
[22] Sproat, R., Black, A. W., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of non-standard words. Computer Speech and Language, 15(3), 287–333
[23] Tetschner, W. 2004. Text-to-speech – Naturalness and accuracy. ASR News. Retrieved September 28, 2004, from http://www.asrnews.com/ttsap/ttspap11.htm
[24] Touretzky, D. S. 1986. The Mathematics of Inheritance Systems. (Los Altos, Calif.: Morgan Kaufmann)
[25] Yarowsky, D. 1992. Word sense disambiguation using statistical models of Roget's categories trained on large corpora. (Paper presented at COLING 1992)
[26] Yarowsky, D. 1996. Homograph disambiguation in text-to-speech synthesis. Progress in Speech Synthesis, 157–172
[27] Yoon, A., Kwon, H., and Lee, M. 2003. An automatic transcription system for Arabic numerals in Korean. (Paper presented at the 2003 International Conference on Natural Language Processing and Knowledge Engineering)
[28] Yu, M. S. et al. 2003. Disambiguating the senses of non-text symbols for Mandarin TTS systems with a three-layer classifier. Speech Communication, 39(3/4), 191–229
