Stemming in Agglutinative Languages: A Probabilistic Stemmer for Turkish

B. Taner Dinçer¹, Bahar Karaoğlan¹

¹ Ege Üniversitesi, Uluslararası Bilgisayar Enstitüsü, 35100 Bornova, İzmir, Türkiye
{dtaner, bahar}@ube.ege.edu.tr

Abstract. In this paper, we introduce a new lexicon-free, probabilistic stemmer used in a Turkish Information Retrieval system under development. It has linear computational complexity and its test success ratio is 95.8%. The main contribution of this paper is a thorough description of a probabilistic perspective on stemming, which can also be generalized to other agglutinative languages such as Finnish, Hungarian, Estonian and Czech.

1 Introduction

In analytic languages like English, stemming is relatively straightforward, because the morphological variations of word forms are limited. In agglutinative languages like Turkish, on the other hand, stemming is still a hard problem, since these languages can in theory generate an infinite number of possible word forms [4, 5]. Morphological analyzers are the only precise tools available as stemmers for Information Retrieval (IR) systems in agglutinative languages [4, 8, 9]. This leads to a paradox defined by a conflicting pair of requirements: stemming is demanded from IR systems for agglutinative languages in order to overcome storage complexity and to increase performance¹ [11, 18]; on the other hand, the overall performance of an IR system decreases because of the high computational complexity incurred by using a morphological analyzer as a stemmer [13]. The probabilistic stemming that we propose can be used to overcome this challenge effectively.

In this paper, we present a probabilistic stemming model for Turkish IR systems, built on the statistical framework presented below, and we validate the proposed model with the analyses and results of an experiment. The new stemmer overcomes the obstacles mentioned above: it has a computational complexity of O(n) (i.e., linear) and a compression level of 69% (i.e., storage complexity).

The paper is organized as follows. In section 2, we give brief definitions and clarify the need for stemming in IR systems for agglutinative languages. In section 3, we review related previous work, introduce the basic notation, and explain the statistical framework behind the probabilistic stemming model. In section 4, we state the formal definition of the proposed probabilistic stemming. In section 5, we summarize the analyses and results of our experiment. The conclusions are given in section 6.

¹ In Information Retrieval, performance is measured by precision and recall [2]. The response time of the overall system is also a performance measure, but in this study it is referred to as computational complexity, unless otherwise mentioned.

2 Stemming and Information Retrieval

IR systems are used to handle information gathered from a large number of electronic documents. The information in a document is basically carried by the semantics of its words; hence, IR systems actually manage and store those words, which are the representatives of the semantics that are the true building blocks of the intended information. A word in a document takes different morphological forms according to its grammatical usage in the text, but the semantic content represented by its stem remains unchanged. Therefore, IR systems generally use stems instead of word forms, both to overcome storage problems and to increase performance [2]. Stemming is defined by Lovins [1] as "a procedure to reduce all words with the same stem to a common form, usually by stripping each word of its derivational and inflectional suffixes".

Stemming for IR is considerably easier in analytic languages like English than in agglutinative languages. As an example, Porter's algorithm [3] for English is based on a limited series of cascaded rewrite rules and can accept nearly all word forms by exhausting only about 1,200 morphological variations [4]. In Turkish, an agglutinative language, even though there are only approximately 23,000 stems, theoretically an infinite number of different word forms can be generated [4, 5]. At this level of morphological complexity, morphological analyzers, which are transducer models built from a lexicon plus embedded rules, are the only accepted standard for morphological parsing [4]. These analyzers are based on the two-level language model [8, 12], and the two-level model is NP-hard [13]. As a consequence, IR systems face two challenges: computational complexity arises from using a morphological analyzer as a stemmer, while, without a stemmer, storage complexity arises from indexing each different form of a word that carries the same semantics but a different morphology. Stemming is therefore demanded from IR systems for agglutinative languages to reach a manageable storage size and a satisfactory level of performance [11, 18]; in turn, linear computational complexity is important for a stemming algorithm to be used efficiently for IR purposes [2].

To highlight the necessity of stemming in agglutinative languages, Table 1 tabulates the degree of compression achieved by Porter's stemmer [3] applied to a newswire database containing 567,564 English words, and by Oflazer's morphological analyzer [9] used as a stemmer applied to a political news database containing 376,187 Turkish words.

Table 1. Level of compression for Turkish and English text corpora with stemming [11].

Corpus     Word Tokens   Distinct terms   Distinct stems   Compression (%)
Turkish    376,187       41,370           6,363            84.6
English    567,574       18,384           11,671           36.4


As Table 1 shows, although the English database contains many more word tokens, the number of distinct terms in the Turkish database is far greater than in the English one. This indicates the diversity of Turkish word variations, which is also verified by the compression results: the 36.4% compression level for English is far smaller than the 84.6% observed for Turkish, clearly indicating the lower degree of morphological variation in English. Conversely, the 84.6% compression ratio observed for the Turkish database emphasizes the need for stemming.
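The compression percentages in Table 1 are consistent, up to rounding, with compression being measured as the relative reduction from distinct terms to distinct stems. This reading is an assumption on our part, since the formula is not spelled out in the source, but the Turkish row reproduces it exactly and the English row is within rounding:

$\text{compression} = 1 - \dfrac{\text{distinct stems}}{\text{distinct terms}}, \qquad 1 - \dfrac{6{,}363}{41{,}370} \approx 0.846 \;(84.6\%), \qquad 1 - \dfrac{11{,}671}{18{,}384} \approx 0.365 \;(\text{reported as } 36.4\%).$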

3 Stemming and Related Work

The stemming algorithms used for IR purposes can be grouped into four families: 1) table lookup, 2) successor variety, 3) n-gram, and 4) affix removal. In table lookup algorithms, all possible terms² that can be used as index terms are stored in a table together with their stems, so terms from queries and documents can be stemmed very quickly; in practice, however, no such exhaustive term table exists for English or other natural languages, since some index terms are domain or context dependent, and the storage overhead of such a table is itself a problem. Successor variety algorithms determine word or morpheme boundaries from the distributions of phonemes observed in a large body of utterances. An n-gram based stemmer uses a similarity measure between query and document space terms: a term of length m has m-n+1 n-grams; for each pair of terms, a Dice coefficient is calculated over the unique n-grams of both terms, a similarity matrix is constructed, and the terms are then clustered using single linkage clustering. Affix removal stemmers remove suffixes and/or prefixes from terms to obtain the stems; as mentioned above, Porter's stemmer is a good example of an affix removal algorithm.

All four groups of stemmers were developed either for analytic languages like English or for some Indo-European languages such as French and German, and they cannot cope with the high level of morphological richness of languages with an inflection paradigm like the south-western (Oghuz) group of Turkic agglutinative languages, which includes Turkish, Turkmen, Azerbaijani (Azeri), Ghasghai and Gagauz [6, 7].

There are also a number of stemming algorithms developed for Turkish that try to fit the morphological analyzer into an IR system by relaxing the strictness of some of its methods, such as the tagging and identification of morphemes, down to an acceptable level of accuracy. Examples of these stemmers are those of Duran [14] and Alpkoçak et al. [15]. They still search a pre-constructed electronic lexicon to find a probable stem and then verify that it is correct by matching the rest of the word form against a recognizable suffix sequence, looking up possible combinations of suffixes according to the morphotactics of Turkish, because their reference concept is morphological analysis. The strictness of this verification phase governs both the accuracy and, especially, the computational complexity: at a high level of strictness, accuracy will be high, but computational complexity will be close to NP-hard, and vice versa. Therefore, for these stemmers, the computational complexity grows back from a point close to linear toward a point close to NP-hard as the amount of morphological knowledge they use increases.

² A term is a word that is expected to be an index term in the given IR system, i.e., each form of a word in a document is a term, and index terms are the word forms selected for indexing purposes.
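As an aside to the n-gram approach described above, the following minimal Python sketch computes a Dice coefficient over the unique bigrams of two terms. It illustrates only the pairwise similarity step of the general technique, not any particular published stemmer; in the full method a similarity matrix over all term pairs would be built and single-linkage clustering applied. The example words are arbitrary.

def ngrams(term, n=2):
    # Unique character n-grams of a term; a term of length m has m-n+1 n-grams.
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def dice(term_a, term_b, n=2):
    # Dice's coefficient over the unique n-grams of the two terms.
    a, b = ngrams(term_a, n), ngrams(term_b, n)
    return 2.0 * len(a & b) / (len(a) + len(b))

# Two inflected forms of the same Turkish stem score higher than two
# unrelated terms (illustrative words only).
print(dice("kitaplar", "kitaptan"))    # forms of "kitap": 8/13, about 0.62
print(dice("kitaplar", "bilgisayar"))  # unrelated terms: 2/16, about 0.13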


An early attack on the stemming problem in Turkish was made by Köksal [16]. It differs completely from the other efforts in its non-linguistic perspective: he developed a stemmer that simply takes a fixed-length prefix of a word form as the stem of that word form. Köksal originally claimed that a stem length of 5 gives the best results in tests, but in practice there is no common fixed length for all document sets; a value that works well with a particular document set on a particular topic may decrease the performance of an IR system dealing with a document set on a different topic. Despite its shortcomings, this approach is our reference concept, since it proves that even in the worst case (i.e., without any morphological knowledge), IR performance for Turkish can be increased simply by taking fixed-length substrings from the start of word forms as their stems. It is therefore reasonable to expect that performance can be increased further by enhancing this simple concept with the pragmatics of morphological knowledge. In our proposed model, we use the statistics of generalized morphological knowledge gathered from a previously analyzed text database as those pragmatics, to enhance this reference model. By using pre-computed statistics, we also guarantee that the computational complexity stays linear.
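A minimal Python sketch of this fixed-length prefix baseline is given below; the cutoff of 5 follows Köksal's reported best value, and the example word is illustrative only.

def prefix_stem(word_form, length=5):
    # Köksal-style baseline: the stem is simply the first `length` letters
    # of the word form (or the whole form if it is shorter).
    return word_form[:length]

print(prefix_stem("evlerimizden"))  # -> "evler", not the actual stem "ev"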

3.1 Notation

We denote a word form in any agglutinative language by a string $s_n = h_1 h_2 \cdots h_n$, where each $h_i$ ($i = 1, 2, \ldots, n$) is a member of the corresponding alphabet $A$ and $n$ is the number of letters (i.e., the length of the string). In this study, we use the Turkish alphabet, which has 29 letters, plus the underscore '_' for the blank character:

$A = \{a, b, c, ç, d, e, f, g, ğ, h, ı, i, j, k, l, m, n, o, ö, p, r, s, ş, t, u, ü, v, y, z, \_\}$

We adopt the following notation to denote substrings of any string $s_n$, for $1 \le i \le j \le n$:

$s_n[i:j] = h_i h_{i+1} \cdots h_j$, $\quad s_n[:j] = h_1 h_2 \cdots h_j$, $\quad s_n[i:] = h_{i+1} \cdots h_n$

Based on this notation, the special substring $s_n[i:i+1] = h_i h_{i+1}$ is denoted by the ordered pair of letters $(h_1, h_2)_i$, where the sub-index $i$ ($i = 1, 2, \ldots, n-1$) indicates the starting position of the ordered pair in the string and $h_1 = h_i$, $h_2 = h_{i+1} \in A$. For $i = n$, the ordered pair is formed with an appended blank, $(h_n, \_)_{i=n}$. Thus any string $s_n = h_1 h_2 \cdots h_n$ has $n$ ordered pairs in our study.

For a given ordered pair of letters $(h_1, h_2)_j$ that can appear at a position $1 \le j \le n_{max}$ in any Turkish word form (where $n_{max}$ is the maximum number of letters a Turkish word form can have), and a given word form $s_n = h_1 h_2 \cdots h_n$ with $n \ge j$, the notation $(h_1, h_2)_j \in s_n$ means that there exists an ordered pair $(h_1, h_2)_i$ at position $i$ ($1 \le i \le n$) in $s_n$ such that $(h_1, h_2)_i = (h_1, h_2)_j$ for $i = j$. Finally, we define two more symbols, $g_m = s_n[:m]$ and $e_m = s_n[m:]$, in order to represent any word form as an ordered pair of two substrings, $s_n^m = (g_m, e_m)$, for all $1 \le m \le n$.
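A small Python sketch of this notation may help fix the definitions: it decomposes a word form into its positional ordered pairs and into the $(g_m, e_m)$ split. The word "evler" ("ev" plus the plural suffix "-ler") is used purely as an illustrative example.

def ordered_pairs(word_form):
    # Positional ordered pairs (h_i, h_{i+1})_i; the last pair is (h_n, '_')_n.
    s = word_form + "_"
    return [((s[i], s[i + 1]), i + 1) for i in range(len(word_form))]

def split(word_form, m):
    # (g_m, e_m): the first m letters and the remaining letters.
    return word_form[:m], word_form[m:]

print(ordered_pairs("evler"))
# [(('e','v'),1), (('v','l'),2), (('l','e'),3), (('e','r'),4), (('r','_'),5)]
print(split("evler", 2))  # ('ev', 'ler') -- the correct stem/suffix split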

3.2 The Sample Space and the Ordered Pair Probabilities

Let the set $L$ be the collection of all possible ordered pairs of letters $(h_1, h_2)_i$ that can appear in any Turkish word form at positions $i = 1, 2, \ldots, n_{max}$. Then $L$ is the sample space, defined as

$L = \{ (h_1, h_2)_i \mid h_1, h_2 \in A,\ 1 \le i \le n_{max} \}$

Further, let the sets $G_k$, $E_k$ and $T_k$, where $G_k, E_k, T_k \subset L$ and $1 \le k \le n_{max}$, represent the following events:

$G_k = \{ (h_1, h_2)_i \mid i = k,\ (h_1, h_2)_i \in g_m,\ 1 \le m \le n_{max} \}$

$E_k = \{ (h_1, h_2)_i \mid i = k,\ (h_1, h_2)_i \in e_m,\ 1 \le m \le n_{max} \}$

$T_k = \{ (h_1, h_2)_i \mid i = k,\ h_1 = s_n[k:k],\ h_2 = s_n[k+1:k+1],\ 1 \le i \le n_{max} \}$

Thus, for each ordered pair $(h_1, h_2)_i$ at positions $i = 1, 2, \ldots, n$ of any given word form $s_n = h_1 h_2 \cdots h_n$, one can define the probabilities of being in each of the three sets above:

$\Pr(s_n[i:i+1] \in G_i) = \Pr((h_1, h_2)_i \in G_i) = P_G((h_1, h_2)_i)$   (1)

$\Pr(s_n[i:i+1] \in E_i) = \Pr((h_1, h_2)_i \in E_i) = P_E((h_1, h_2)_i)$   (2)

$\Pr(s_n[i:i+1] \in T_i) = \Pr((h_1, h_2)_i \in T_i) = P_T((h_1, h_2)_i)$   (3)

Equation (1) gives the probability that the ordered pair $(h_1, h_2)_i$ lies in the stem part of the given word form; in the same way, equation (2) gives the probability that the ordered pair $(h_1, h_2)_i$ lies in the suffix part, and equation (3) gives the probability that the ordered pair $(h_1, h_2)_i$ lies on the transition between the stem part and the suffix part (i.e., $h_1$ is the last letter of the stem part and $h_2$ is the first letter of the suffix part).
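One possible concrete realization of these three probability tables is sketched below in Python. Given a corpus of word forms already split into stem and suffix (as the experimentation database of section 5 is), it counts, per position, how often each letter pair falls inside the stem, inside the suffix, or on the stem-suffix transition, and normalizes the counts. The sketch uses plain relative frequencies; the paper's actual estimates in section 5 additionally multiply each frequency by a correction factor.

from collections import Counter

def pair_events(stem, suffix):
    # Yield (event, pair, position) triples for one analyzed word form.
    word = stem + suffix + "_"
    n = len(stem) + len(suffix)
    for i in range(1, n + 1):           # positions are 1-based, as in the paper
        pair = (word[i - 1], word[i])
        if i < len(stem):
            yield "G", pair, i          # pair entirely inside the stem part
        elif i == len(stem):
            yield "T", pair, i          # pair spans the stem-suffix transition
        else:
            yield "E", pair, i          # pair inside the suffix part

def estimate_tables(analyzed_corpus):
    # analyzed_corpus: iterable of (stem, suffix) pairs.
    counts = {"G": Counter(), "E": Counter(), "T": Counter()}
    total = 0
    for stem, suffix in analyzed_corpus:
        for event, pair, pos in pair_events(stem, suffix):
            counts[event][(pair, pos)] += 1
            total += 1
    # Relative frequencies; the paper's P_G, P_E, P_T additionally use
    # position-dependent correction factors w (see section 5).
    tables = {e: {k: c / total for k, c in cnt.items()} for e, cnt in counts.items()}
    return tables, total

tables, N = estimate_tables([("ev", "ler"), ("ev", "de"), ("kitap", "lar")])
print(tables["T"][(("v", "l"), 2)])  # mass of the (v,l) pair at the ev|ler boundary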

4 Probabilistic Stemming Assertion

The probabilistic perspective and framework of this study are formalized in Proposition 1, our probabilistic stemming model is defined in Definition 1, and the validity of the model is examined by the experiment reported in section 5.

Proposition 1: If there exists an integer $1 \le m \le n$ that satisfies $P_E((h_1, h_2)_m) > P_G((h_1, h_2)_m)$ and $P_T((h_1, h_2)_{m-1}) \ge \alpha$, for a word form $s_n^m = (g_m, e_m)$ and a constant $0 \le \alpha \le 1$, then $g_{m-1}$ is said to be the probable stem of that word form.

Definition 1: The probabilistic stemming of any word form $s_n = h_1 h_2 \cdots h_n$ ($1 \le n \le n_{max}$) is to strip off the substring $e_{m-1}$ and to take the substring $g_{m-1}$ as the stem of that word form, at the position $m$ that satisfies Proposition 1 for a particular $0 \le \alpha \le 1$.
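Read operationally, Proposition 1 and Definition 1 suggest the stemming procedure sketched below in Python, under the assumption (not stated explicitly in the text) that the first position $m$ satisfying both conditions is the one used. Here p_g, p_e and p_t stand for the pre-computed tables $P_G$, $P_E$ and $P_T$, looked up by (letter pair, position), with unseen pairs treated as probability zero; the value of alpha is arbitrary.

def stem(word_form, p_g, p_e, p_t, alpha=0.05):
    # p_g, p_e, p_t: dicts mapping ((h1, h2), position) -> probability.
    # alpha: the threshold 0 <= alpha <= 1 of Proposition 1.
    s = word_form + "_"
    n = len(word_form)

    def pair(i):                       # ordered pair (h_i, h_{i+1})_i, 1-based
        return (s[i - 1], s[i]), i

    for m in range(2, n + 1):          # m - 1 >= 1 is needed for the boundary pair
        suffix_likelier = p_e.get(pair(m), 0.0) > p_g.get(pair(m), 0.0)
        boundary_likely = p_t.get(pair(m - 1), 0.0) >= alpha
        if suffix_likelier and boundary_likely:
            return word_form[:m - 1]   # g_{m-1}: the probable stem
    return word_form                   # no split found: keep the whole form

For instance, with the toy tables estimated in the sketch of section 3.2 and alpha = 0.05, stem("evlerde", tables["G"], tables["E"], tables["T"]) returns "ev"; the value of alpha controls how confident the boundary evidence must be before a split is made.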

5 Experiment and Results

Both the experimentation and test databases are selected from a Turkish text database, a collection of Turkish news texts containing 1M word forms that have been morphologically analyzed and disambiguated by Hakkani-Tür et al. [17]. The properties of the selected databases and the results are given in Table 2.

Table 2. Properties of the experimentation and test databases.

Property              Experimental   Test
Term Count            149,189        148,486
Distinct Terms        36,902         36,563
Distinct Stems        10,568         10,253
Unknown Stems         0              4,102
Produced (Correct)    0              9,828
Pair Count            5,648          -
Success level (%)     0              95.8

Terms in both databases have been preprocessed to identify their stem and suffix parts according to their tagged morphological analyses. The databases were not processed any further (e.g., no elimination of stop words or detection of misspelled words). It should be noted that if a morphological analyzer were used as the stemmer for the test database, the absolute upper bound of the compression ratio would be 71% (i.e., when all of the 36,563 distinct terms are reduced exactly to their 10,253 actual stems); the proposed stemmer reaches a compression ratio of 69% (i.e., 10,982 stems, of which 9,828 are correctly found, with 1,154 terms missed, out of 36,563 distinct terms). The probabilities given in equations (1), (2) and (3) are calculated by the following formulas:

$P_G((h_1, h_2)_i) = f_{g,i} \cdot w_{g,i} / N$; $\quad P_E((h_1, h_2)_i) = f_{e,i} \cdot w_{e,i} / N$; $\quad P_T((h_1, h_2)_i) = f_{t,i} \cdot w_{t,i} / N$

where $f_{g,i}$, $f_{e,i}$ and $f_{t,i}$ are the frequencies, observed from the experimentation database, of the events that an ordered pair of letters appears in the stem part, the suffix part, or the transition between the stem part and the suffix part of a word form, respectively. The correction factors $w_{g,i}$, $w_{e,i}$ and $w_{t,i}$ are previously calculated positive real values for the corresponding frequencies $f_{g,i}$, $f_{e,i}$ and $f_{t,i}$, and $N$ is the total number of ordered pairs in the experimentation database.

After processing the experimentation database, a total of 5,648 unique ordered pairs of letters is observed for the sample space $L$, of which 2,845 pairs belong only to the event set $G_k$, 1,084 pairs belong only to the event set $E_k$, and 322 pairs belong only to the event set $T_k$; the intersection of the event sets $G_k$ and $E_k$ contains 1,397 unique ordered pairs. These observations are very encouraging, since they indicate that 50% of all ordered pairs of letters appear only in the stem part, 19% appear only in the suffix part, and 24% are shared by stem and suffix parts. These statistics go a long way toward explaining why the proposed model succeeds.

Our test database has 10,253 distinct stems, of which 4,102 are not in the experimental database; this ratio is fairly large, amounting to about 40% unknown stems. The stemmer produced 72,967 possible stems for the 36,563 distinct test terms, i.e., approximately 1.5 stems per term. In fact, most of these 72,967 produced stems are recognizable Turkish word forms, but they are either understemmed to a root with a different semantic than the target stem, or overstemmed to a derivational or inflectional form different from the target stem. In the calculations, we accepted only an exact match with the target stem as correct. As a result, the proposed stemmer produced correct stems for 95.8% of the test terms (i.e., 9,828/10,253). The stemmer could not stem 1,154 distinct terms, the variations of 425 distinct stems, out of 36,563; most of these missed terms turn out to be foreign words, abbreviations, acronyms, or pronouns that would likely be eliminated by the text preprocessors used in an ordinary IR system.
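A minimal Python sketch of the evaluation just described, under our reading of the text: a produced stem counts as correct only if it exactly matches the gold stem from the morphological analysis, and compression is taken as the relative reduction from distinct terms to distinct index entries (the exact compression formula is not spelled out in the paper, so that part is an assumption).

def evaluate(gold, produced):
    # gold: dict mapping each distinct term to its analyzed (target) stem.
    # produced: dict mapping each distinct term to the stem the stemmer output,
    #           or to the term itself when no split was found.
    gold_stems = set(gold.values())
    correct_stems = {g for t, g in gold.items() if produced.get(t) == g}
    success = len(correct_stems) / len(gold_stems)        # exact-match success ratio
    index_entries = set(produced.values())
    compression = 1.0 - len(index_entries) / len(gold)    # assumed compression measure
    return success, compression

# Toy usage (illustrative data only):
gold = {"evler": "ev", "evde": "ev", "kitaplar": "kitap"}
produced = {"evler": "ev", "evde": "ev", "kitaplar": "kitapla"}
print(evaluate(gold, produced))  # (0.5, ...): "ev" is found, "kitap" is missed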

6 Conclusions

We have presented a new stemming model for agglutinative languages based on a probabilistic framework. The model has linear computational complexity and achieves a success ratio of 95.8%. This result suggests that the proposed stemming model can be generalized to other agglutinative languages with rich production and inflection paradigms, such as Hungarian, Finnish, Estonian and Czech.

7 Acknowledgements

We thank K. Oflazer for providing us with the morphologically analyzed Turkish database, A. Öztürk for very helpful insights and comments on the statistical aspects of this study, and Ö. Berk and K. Demir of Muğla University and A. C. Koroğlu for very helpful discussions on its linguistic background.



References

1. Lovins, J. B.: Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, Vol. 11 (1968) 22-31
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. 1st ed. Addison-Wesley, England (1999)
3. Porter, M. F.: An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3 (1980) 130-137
4. Jurafsky, D., Martin, J. H.: Speech and Language Processing. Prentice-Hall, New Jersey, USA (2000)
5. Hankamer, J.: Turkish Generative Morphology and Morphological Parsing. In: Second International Conference on Turkish Linguistics, Istanbul, Turkey (1984)
6. Crystal, D.: The Cambridge Encyclopedia of Language. Cambridge University Press, Cambridge, UK (1987)
7. Lewis, G. L.: Turkish Grammar. Oxford University Press, UK (1991)
8. Koskenniemi, K.: Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications of the Department of General Linguistics, Vol. 11. University of Helsinki, Helsinki (1983)
9. Oflazer, K.: Two-level Description of Turkish Morphology. In: Proceedings of EACL'93, Utrecht, the Netherlands (1993)
10. Solak, A., Oflazer, K.: Design and Implementation of a Spelling Checker for Turkish. Literary and Linguistic Computing, Vol. 8 (1993)
11. Ekmekçioglu, F. Ç., Lynch, M. F., Willett, P.: Stemming and N-gram Matching for Term Conflation in Turkish Texts. Information Research, 2(2) (1996). Available at: http://informationr.net/ir/2-2/paper13.html
12. Öztaner, S. M.: A Word Grammar of Turkish with Morphophonemic Rules. M.Sc. Thesis, Department of Computer Engineering, METU, Ankara, Turkey (1996)
13. Barton, G. E.: Computational Complexity in Two-Level Morphology. In: Proceedings of the 24th Annual Meeting of the ACL (1986)
14. Duran, G.: Turkish Stemming Algorithm. M.Sc. Thesis, Department of Computer Engineering, Hacettepe University, Ankara (1997)
15. Alpkoçak, A., Kut, A., Özkarahan, E.: Bilgi Bulma Sistemleri için Otomatik Türkçe Dizinleme Yöntemi. In: Bilişim Bildirileri, Dokuz Eylül Üniversitesi, İzmir, Türkiye (1995) 247-253
16. Köksal, A.: Bilgi Erişim Sorunu ve Bir Belge Dizinleme ve Erişim Dizgesi Tasarım ve Gerçekleştirimi. Doçentlik Tezi, Fen Bilimleri Enstitüsü, Bilgisayar Bilimleri Mühendisliği Anabilim Dalı, Hacettepe Üniversitesi, Ankara (1979)
17. Hakkani-Tür, D. Z., Oflazer, K., Tür, G.: Statistical Morphological Disambiguation for Agglutinative Languages. In: Proceedings of COLING (2000)
18. Solak, A., Can, F.: Effects of Stemming on Turkish Text Retrieval. Technical Report BU-CEIS-94-20, Department of Computer Engineering and Information Science, Bilkent University, Ankara (1994)