UNSUPERVISED LANGUAGE MODELING

NYELVBANYASZOK

Abstract. The purpose of this document is to survey existing methods for the unsupervised learning of language models, using a set of unified evaluation criteria.

1. Unsupervised learning of morphosyntactic rules

1.1. Morphosyntactic rules. What is a word? It is hard to tell. Orthography (specifically, the location of spaces) is not a reliable indicator of words, and it varies from language to language; cf. German Krankenversicherung ("health insurance"). Compound words are new words formed out of other words, e.g., black bird, girlfriend, babysit, supermarket parking lot attendant, emergency sail change.

Morphemes are the smallest meaning-bearing elements in a text. A morpheme is an arbitrary pairing of sound and meaning. Morphemes can be classified along several dimensions:

(1) Free vs. bound morphemes: Free morphemes can constitute words on their own (e.g., will, Sam). Bound morphemes must combine with one or more other morphemes to form a word (e.g., en-able, stipulation). Words often consist of a free morpheme with one or more bound morphemes attached to it, e.g., en-able-ment. In this sort of word, the free morpheme is called the stem, and the bound morphemes are affixes. An affix attached to the front of a word is called a prefix, while an affix attached to the back of a word is called a suffix. A prefix-suffix pair attached to a stem is sometimes called a circumfix (e.g., the German ge-/-t pair).

(2) Lexical vs. grammatical morphemes: Morphemes can be grouped into grammatical classes or categories such as nouns, verbs, and determiners. Some of these categories are open classes (also called lexical morphemes), i.e., new morphemes can easily be coined and added to the category (nouns, verbs and adjectives are open classes), while other categories are closed classes (also called grammatical morphemes, e.g., determiners and prepositions). Open-class/lexical morphemes tend to be free morphemes. Closed-class morphemes can be either free or bound. Example: the English past tense is the bound morpheme -ed, but the English future tense is the free morpheme will.

(3) Inflectional vs. derivational morphemes: Bound grammatical morphemes seem to come in (at least) two types. While the precise difference between inflectional and derivational morphemes is hard to define and not well understood, derivational morphemes build new words, often with their own idiosyncratic meaning, while inflectional morphemes permit a word to agree with other words in its context. In steril-iz-ation-s, steril- is an adjectival stem, -iz- is a verbal derivational suffix (changing the adjective to a verb), -ation is a nominal derivational suffix (changing the verb to a noun), and -s is a nominal inflectional suffix, the plural.

Key words and phrases. unsupervised, natural language. NKFP.


Derivation                                   Inflection
Produces a new form                          Specializes a form
May change category                          Never changes category
Semantics often unpredictable (produc-er)    Semantically predictable
Need not be fully productive                 Usually productive
Potentially recursive (anti-anti-missile)    Never recursive
Closer to the stem                           Further from the stem

Null inflectional and derivational morphemes: English derives verbs from nouns, e.g., I networked the office, We motored the yacht up the river. A morpheme-based theory posits a phonologically null or empty morpheme which converts nouns to verbs. A process-based theory describes inflection and derivation as processes which may or may not involve morphemes. Example: noun-to-verb derivation (English). Phonological change: none; this process does not alter the phonology. Syntactic change: the output of this process is a verb. Semantic change: if the original noun is N, then X Ns Y means roughly "X does something to Y that involves an N" (e.g., Fred networks the office means "Fred did something to the office that involves a network").

Order of morphological processes: Derived forms can appear inside compounds, but inflected forms usually cannot. E.g., correct: parking lot attendant; incorrect: *parking lots attendant; but: parks commissioner, systems analyst.

Lexicon → Derivational morphology + Compounding → Inflectional morphology → Surface forms

Irregular inflected forms can appear in compounds since they are present in the lexicon, e.g., mice-catcher vs. *rats-catcher.

1.2. Purpose of stemming and morphological analysis. Morphological analysis is beneficial for many natural language applications dealing with large vocabularies. For example, in text retrieval it is customary to preprocess texts by returning words to their base forms. Terms with a common stem will usually have similar meanings, for example: connect, connected, connecting, connection, connections. Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removing the various suffixes -ed, -ing, -ion, -ions to leave the single term connect. The problem is similar with prefixes and circumfixes. This process is called stemming or, more generally, morphological analysis.

Other fields of application are speech recognition and OCR (optical character recognition), where predictive models of language are used to select the most plausible words at a given place in the text. Consider, for example, the estimation of the standard n-gram model, which entails estimating the probabilities of all sequences of n words. When the vocabulary is very large, say 100 000 words, the statistical analysis may have the following problems: (1) If words are used as basic representational units in the language model, the number of basic units is very high and the estimated word n-grams are poor due to sparse data. (2) Due to the high number of possible word forms, many perfectly valid word forms will not be observed at all in the training data, even in large amounts of text. These problems are particularly severe for languages with rich morphology, such as Finnish, Turkish or Hungarian. Using morphemes instead of words as the basic representational units of a statistical language model therefore seems a promising course.

The construction of a comprehensive morphological analyzer based on linguistic theory (i.e., on what human linguists suggest) requires a considerable amount of work by experts. This is both slow and expensive and therefore not applicable to all languages. An early example of this for English is Porter's stemmer from 1980, a collection of a small number of hard-wired rules for stemming words. This algorithm still appears to be the one most used in current stemming applications on the market.
Since Porter, many other algorithms have been proposed, most of them using different learning techniques. In the following we summarize some of the most interesting approaches; a toy illustration of the data-sparsity problem that motivates them is given below.
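To make the sparse-data problem concrete, here is a minimal sketch of maximum-likelihood bigram estimation (the toy corpus and function names are ours, not from any of the surveyed papers): every valid word form that is absent from the training data receives probability zero, which is exactly problem (2) above.

```python
from collections import Counter

# Hypothetical toy corpus; a real model is estimated on millions of words.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); zero for unseen pairs."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("the", "cat"))   # 0.25: the pair was observed
print(p_bigram("the", "dogs"))  # 0.0: a perfectly valid form, never observed
```

Morpheme-based units shrink the inventory of basic units, so far fewer events are unseen for the same amount of training text.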


2. Papers

2.1. Porter's benchmark algorithm for suffix stripping. (M. F. Porter: An algorithm for suffix stripping, Program 14(3), pp. 130-137, July 1980.)

2.1.1. Objective of Learning. The objective of the algorithm is to provide a morphological analysis of words and find the stems.

2.1.2. Output. The stem of a composite word, e.g., adjustable → adjust, sensitivity → sensitive.

2.1.3. Principle of learning. There is no learning. Stemming rules are hard-wired.

2.1.4. Algorithm. Let c denote a consonant, v a vowel, C a list ccc..., and V a list vvv.... Any word can be represented in the form [C]VCVC...[V] ≡ [C](VC)^m[V], where m (called the measure) is the number of repetitions of (VC). Some examples:
m = 0: TR, EE, TREE, Y, BY.
m = 1: TROUBLE, OATS, TREES, IVY.
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY.
Other notation: *S - the stem ends with S (and similarly for the other letters); *v* - the stem contains a vowel; *d - the stem ends with a double consonant (e.g., -TT, -SS); *o - the stem ends cvc, where the second c is not W, X or Y (e.g., -WIL, -HOP).
The algorithm manipulates a word in a finite number of steps; there are no iterations. (A sketch of the measure and of Step 1a in code is given at the end of this subsection.)
Step 1a: SSES → SS (caresses → caress); IES → I (ponies → poni); SS → SS (caress → caress); S → ∅ (cats → cat)
Step 1b: (m > 0) EED → EE (feed → feed, agreed → agree); (*v*) ED → ∅ (plastered → plaster, bled → bled); (*v*) ING → ∅ (motoring → motor, sing → sing)
If 1b is successful: AT → ATE (conflat(ed) → conflate); BL → BLE (troubl(ed) → trouble); IZ → IZE (siz(ed) → size); (*d and not (*L or *S or *Z)) → single letter (hopp(ing) → hop, tann(ed) → tan, fall(ing) → fall, hiss(ing) → hiss, fizz(ed) → fizz); (m = 1 and *o) → E (fail(ing) → fail, fil(ing) → file)
Step 2: (m > 0) ATIONAL → ATE (relational → relate); (m > 0) TIONAL → TION (conditional → condition, rational → rational); (m > 0) ENCI → ENCE (valenci → valence); +17 similar rules
Step 3: (m > 0) ICATE → IC (triplicate → triplic); (m > 0) ATIVE → ∅ (formative → form); (m > 0) ALIZE → AL (formalize → formal); +4 similar rules
Step 4: (m > 1) AL → ∅ (revival → reviv); (m > 1) ANCE → ∅ (allowance → allow); (m > 1) ENCE → ∅ (inference → infer); +16 similar rules
Step 5a: (m > 1) E → ∅ (probate → probat, rate → rate); (m = 1 and not *o) E → ∅ (cease → ceas)
Step 5b: (m > 1 and *d and *L) → single letter (controll → control, roll → roll)
The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being measured by m. There is no linguistic basis for this approach; it was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix.

2.1.5. Results. This method is used as a gold standard. It is very simple to implement and usually considered "good enough" for all practical purposes. It is hard to do much better.

2.1.6. Problems, Future Work. Originally designed for English; generalization to languages with rich morphology (e.g., Hungarian) seems difficult. Prefixes are not removed.
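As promised above, here is a short sketch (our own illustration, not Porter's reference implementation) of the measure m and the four Step 1a rules; for simplicity it ignores the special treatment of Y after a consonant, which the full algorithm counts as a vowel.

```python
def measure(stem):
    """Porter's measure m: the number of VC sequences in [C](VC)^m[V].
    Treats a, e, i, o, u as vowels; the special role of y is ignored."""
    vowels = "aeiou"
    # Map the word to a 'v'/'c' class string, collapsing repeated classes.
    classes = ""
    for ch in stem.lower():
        cls = "v" if ch in vowels else "c"
        if not classes or classes[-1] != cls:
            classes += cls
    return classes.count("vc")

def step1a(word):
    """Porter Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (empty)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

print(measure("tree"))      # 0
print(measure("trouble"))   # 1
print(measure("troubles"))  # 2
print(step1a("caresses"))   # caress
print(step1a("ponies"))     # poni
print(step1a("cats"))       # cat
```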

2.2. Stemming using a web site ranking algorithm. (M. Agosti, M. Bacchin, N. Ferro, and M. Melucci: Improving the Automatic Retrieval of Text Documents, in C. A. Peters (Ed.), CLEF 2002, LNCS 2785, pp. 279-290, 2003.)

2.2.1. Objective of Learning. To provide a morphological analysis of words and find the stems.


2.2.2. Output. A probability distribution over potential splits of a word; the predicted stem is given by the most probable split. Byproduct: a weighted network of potential stems and suffixes.

2.2.3. Principle of learning. Based on a heuristic stem-suffix graph weighting algorithm in the spirit of Kleinberg's HITS (Hyperlink Induced Topic Search) algorithm, a paradigmatic algorithm for web page ranking.

2.2.4. Algorithm. Assume that each word is a concatenation of one stem and one suffix, word = stem + suffix. The question is then how to split the words. Consider all possible splits of all words in a lexicon. This defines two sets: the set of potential stems X and the set of potential suffixes Y. Example: Lexicon = {aba, abb, baa} defines X = {a, b, ab, ba}, Y = {a, b, aa, bb, ba}. Connecting an element of X with an element of Y whenever their concatenation is an existing word in the lexicon defines a graph. Next we assign weights to the nodes; this is done iteratively. Let Y_x denote the set of all potential suffixes linked with a potential stem x, and X_y the set of all potential stems linked with a potential suffix y. The HITS algorithm defines a simple iterative procedure whose fixed point is the self-consistent weighting scheme:

weight of x:  p_x = Σ_{y ∈ Y_x} p_y
weight of y:  p_y = Σ_{x ∈ X_y} p_x

This is similar in spirit to web site ranking: hubs (stems) get high scores if linked with high-score authorities (suffixes) and vice versa. In this case, "good stems point to good suffixes, and good suffixes are pointed to by good stems". Weights are normalized to yield probabilities, Σ_x p_x = 1, Σ_y p_y = 1. Given a word = x + y, the best split is defined as the one which maximizes p_x (the stem probability). (A toy implementation is sketched at the end of this subsection.)

2.2.5. Results. Implemented for Italian. The efficiency of information retrieval is measured and compared with a Porter-like algorithm; precision and recall are essentially the same. (Is stemming measured directly?)

2.2.6. Problems, Future Work. Other languages and fine-tuning of the heuristic algorithm are in progress.
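To make the weighting scheme of 2.2.4 concrete, the following toy implementation (our own; the paper's actual procedure and normalization details may differ) builds the stem-suffix graph for the example lexicon and iterates the weights to a fixed point.

```python
def stem_suffix_weights(lexicon, iterations=50):
    """Toy HITS-style weighting over the stem-suffix graph. Every split
    word = stem + suffix (both non-empty) adds an edge between a
    potential stem and a potential suffix."""
    edges = set()
    for word in lexicon:
        for i in range(1, len(word)):
            edges.add((word[:i], word[i:]))
    stems = {x for x, _ in edges}
    suffixes = {y for _, y in edges}

    # Start from uniform weights and iterate towards the fixed point,
    # renormalizing so that the weights remain probabilities.
    p_x = {x: 1.0 / len(stems) for x in stems}
    p_y = {y: 1.0 / len(suffixes) for y in suffixes}
    for _ in range(iterations):
        new_x = {x: sum(p_y[y] for (s, y) in edges if s == x) for x in stems}
        new_y = {y: sum(p_x[s] for (s, t) in edges if t == y) for y in suffixes}
        zx, zy = sum(new_x.values()), sum(new_y.values())
        p_x = {x: w / zx for x, w in new_x.items()}
        p_y = {y: w / zy for y, w in new_y.items()}
    return p_x, p_y

p_x, p_y = stem_suffix_weights({"aba", "abb", "baa"})
word = "aba"
# The predicted split is the one that maximizes the stem weight p_x.
print(max(((word[:i], word[i:]) for i in range(1, len(word))),
          key=lambda split: p_x[split[0]]))
```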

2.3. Morphological analysis using latent semantic analysis. (P. Schone and D. Jurafsky: Knowledge-Free Induction of Morphology Using Latent Semantic Analysis, Proceedings of CoNLL-2000 and LLL-2000, pp. 67-72, Lisbon, Portugal, 2000.)

2.3.1. Objective of Learning. To discover conflation sets such as ABUSE = {abuse, abused, abuses, abusive, abusively, ...}, improving stem-and-affix statistical methods by adding a semantic analysis.

2.3.2. Output. Conflation sets.

2.3.3. Principle of learning. Unsupervised learning based on Latent Semantic Analysis (LSA).

2.3.4. Algorithm. There are four major steps:
(1) Identify potential affixes: words are inserted into a trie, and potential suffixes are identified by the branching points of the trie. This is similar to Harris' method.
(2) Pairs of potential suffixes are called "rules" if they share a critical number of pseudo-stems. All potential rules are collected (e.g., -s → ∅, -ed → -ing, -ing → ∅, -ers → -ing, etc.).
(3) Pairs of potential morphological variants (PPMVs) are identified (e.g., used/user, aligned/aligning). These are pairs connected by a rule.


(4) Semantic vectors are computed to compare PPMVs. An N × 2N semantic matrix M is formed whose rows correspond to the N most frequent words in the corpus (stop words removed). Each element M_ij measures the frequency of finding word i within a distance of 50 words of word j in the corpus; the first N columns count the cases where i precedes j, the second N columns the cases where j precedes i. Thus the meaning of word i is associated with the profile of nearby words. An SVD is performed to find the most relevant lower-dimensional semantic representation (typically N = 10^3, and some hundred singular values are retained). Semantic vectors are computed and projected into this subspace for both components of a PPMV. The overlap of the two projected semantic vectors defines the measure of semantic similarity of the two words. If this measure is high enough, the PPMV is accepted (the words belong to the same conflation set), otherwise it is rejected. Example:

PPMV          overlap      PPMV           overlap
car/cars      5.6          ally/allies    6.5
car/caring    -0.71        ally/all       -1.3
car/cares     -0.14        dirty/dirt     2.4
car/cared     -0.96        rating/rate    0.97
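The following sketch (our own simplification: plain cosine similarity instead of the paper's normalized overlap score, and illustrative parameter values) shows how such a semantic matrix and the projected vectors can be computed with an off-the-shelf SVD.

```python
import numpy as np

def semantic_vectors(tokens, vocab, window=50, k=2):
    """Toy construction of the N x 2N semantic matrix of step (4).
    Row i profiles word i by counts of vocabulary words within
    `window` tokens: first N columns where i precedes j, second N
    columns where j precedes i. The paper uses N around 10^3 and
    retains a few hundred singular values; k=2 is for toy data."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    M = np.zeros((n, 2 * n))
    for pos, w in enumerate(tokens):
        if w not in index:
            continue
        i = index[w]
        for off in range(1, window + 1):
            if pos + off < len(tokens) and tokens[pos + off] in index:
                M[i, index[tokens[pos + off]]] += 1        # i precedes j
            if pos - off >= 0 and tokens[pos - off] in index:
                M[i, n + index[tokens[pos - off]]] += 1    # j precedes i
    # Project onto the k most relevant latent semantic dimensions.
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]

def similarity(V, i, j):
    """Cosine similarity of the projected vectors; this plays the role
    of the (differently normalized) overlap score used on PPMVs."""
    a, b = V[i], V[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```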

2.3.5. Results. Comparison with the human-labeled CELEX database and Goldsmith's Linguistica. Precision is much better than Linguistica's, recall is a bit worse. The overall F-score is better by 3% (Linguistica: 82%).

2.3.6. Problems, Future Work. Refined in the next paper: prefix analysis is added, and the semantic analysis is boosted by frequency analysis.

2.4. An improved morphological analysis using latent semantic analysis. (P. Schone and D. Jurafsky: Knowledge-Free Induction of Inflectional Morphologies, Proceedings of ... 2001.)

2.4.1. Objective of Learning. To discover conflation sets such as ABUSE = {abuse, abused, abuses, abusive, abusively, ...}, combining stem-and-affix statistical methods and semantic analysis.

2.4.2. Output. Conflation sets.

2.4.3. Principle of learning. Latent Semantic Analysis (LSA), Minimum Edit Distance (MED).

2.4.4. Algorithm. Prefix identification is added, circumfixes are also identified, and further languages are added. The former, purely LSA-based algorithm is amended with three additional elements:
(1) Semantic probabilities of PPMV acceptance are extended with an additional orthographic probability, based on the minimum edit distance of the two words (a standard edit-distance sketch is given at the end of this subsection). This yields about a 3% improvement.
(2) Local syntactic context information is used. There are words in the close vicinity of the PPMV elements which show statistically relevant frequency anomalies. An example for the rule -s → ∅ helps to understand this:

Context for L        Context for R
agendas are          a legend
seas were            this formula
two red pads         militia is
pleas have           an area
these ideas          railroad has
other areas          A guerrilla

Words like a, an, this, is, has are indicators of the singular; are, other, these, two, have of the plural. These can be used to strengthen the acceptance of a PPMV based on the rule -s → ∅. Local syntactic indicators can be found for other rules as well.
(3) Transitive closure: if (A,B) and (A,C) are independently found to be valid PPMVs, then this is an indication that (B,C) is more likely to be valid too.

2.4.5. Results. Comparison with the human-labeled CELEX database, Goldsmith's Linguistica and the authors' earlier program. The F-score again improves by at least 3% (Goldsmith: 82%, this: 88% for English). Especially good for German!
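Item (1) of the algorithm relies on minimum edit distance; for reference, here is the standard Levenshtein dynamic program (our own sketch; the paper derives an orthographic probability from a weighted variant of such a score).

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic
    programming over a (len(a)+1) x (len(b)+1) table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("aligned", "aligning"))  # 3
```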


2.4.6. Problems, Future Work. They plan to combine this with a method used by Yarowsky and Wicentowski (2000).

2.5. Unsupervised Learning of the Morphology of a Natural Language. (John Goldsmith, University of Chicago.)

2.5.1. Abstract. This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly develop a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar matches well the analysis that would be developed by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of evaluation metric in early generative grammar.

2.5.2. Objective. The central task of morphological analysis is the segmentation of words into the components that form the word by the operation of concatenation.

2.5.3. Input. The program in question takes a text file as its input (typically in the range of 5,000 to 1,000,000 words) and produces a partial morphological analysis of most of the words of the corpus; the goal is to produce an output that matches as closely as possible the analysis that would be given by a human morphologist. It performs unsupervised learning in the sense that the program's sole input is the corpus; we provide the program with the tools to analyze, but no dictionary and no morphological rules particular to any specific language.

2.5.4. Minimum Description Length Model 1. The underlying model invokes the principles of the minimum description length (MDL) framework [1]. MDL focuses on the analysis of a corpus of data that is optimal by virtue of providing both the most compact representation of the data and the most compact means of extracting that compression from the original data. It thus requires both a quantitative account whose parameters match the original corpus reasonably well (in order to provide the basis for a satisfactory compression) and a spare, elegant account of the overall structure.

2.5.5. Minimum Description Length Model 2. The central idea of minimum description length analysis [1] is composed of four parts: first, a model of a set of data assigns a probability distribution to the sample space from which the data is assumed to be drawn; second, the model can then be used to assign a compressed length to the data, using familiar information-theoretic notions; third, the model can itself be assigned a length; and fourth, the optimal analysis of the data is the one for which the sum of the length of the compressed data and the length of the model is the smallest. That is, we seek a minimally compact specification of both the model and the data, simultaneously. Accordingly, we use the conceptual vocabulary of information theory to compute the length, in bits, of the various aspects of the morphology and the data representation. (A toy two-part description-length computation is sketched below.)
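To illustrate the two-part score, here is a deliberately tiny sketch (our own; Goldsmith's actual description length has many more terms, e.g., for signatures and pointers). It charges bits for spelling out the distinct morphemes (the model) plus bits for encoding the corpus with them (the data), and prefers the segmented analysis on this rigged toy corpus.

```python
import math
from collections import Counter

def description_length(corpus, segment):
    """Toy two-part MDL score. `segment` maps a word to a (stem,
    suffix) pair. Model cost: log2(26) bits per letter of each
    distinct stem and suffix. Data cost: the usual -log2 probability
    of each morpheme occurrence (the NULL suffix still costs bits
    in the data, as in the paper)."""
    bits_per_letter = math.log2(26)
    pieces = [segment(w) for w in corpus]
    morph_types = {s for s, _ in pieces} | {f for _, f in pieces}
    model_bits = sum(len(m) for m in morph_types) * bits_per_letter

    morphs = Counter(m for pair in pieces for m in pair)
    total = sum(morphs.values())
    data_bits = -sum(c * math.log2(c / total) for c in morphs.values())
    return model_bits + data_bits

corpus = ["jumped", "jumping", "walked", "walking"] * 25
whole = lambda w: (w, "")         # no analysis: each word is its own stem
split = lambda w: (w[:4], w[4:])  # rigged for this corpus: jump+ed, walk+ing, ...
print(description_length(corpus, whole))  # larger
print(description_length(corpus, split))  # smaller: the segmented grammar wins
```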

2.5.6. Results.

Results (English):
Category            Count   Percent
Good                829     82.9%
Wrong analysis      52      5.2%
Failed to analyze   36      3.6%
Spurious analysis   83      8.3%

Results (French):
Category            Count   Percent
Good                833     83.3%
Wrong analysis      61      6.1%
Failed to analyze   42      4.2%
Spurious analysis   64      6.4%


For both English and French, correct performance is found in 83% of the words; details are presented in the tables above. For English, these figures correspond to a precision of 829/(829 + 52 + 83) = 85.9% and a recall of 829/(829 + 52 + 36) = 90.4%.

2.5.7. Remaining Issues.
• Identifying related stems (allomorphs), like win and winn-ing.
• Identifying paradigms from signatures (automatically identifying NULL.ed.ing as a subcase of the more general NULL.ed.ing.s).
• Determining the relationship between prefixation and suffixation.
• Identifying compounds.
• As noted at the outset, the present algorithm is limited in its ability to discover the morphology of a language in which the corpus does not contain a sufficient number of words with only one suffix.
• Extending the capability of the algorithm with the ability to posit suffixes that are in part subtractive morphemes. That is, in English, we would like to establish a single signature that combines NULL.ed.ing.s and e.ed.es.ing (for jump and love, respectively). We posit an operator ⟨x⟩ which deletes a preceding character x; with this mechanism, we can establish a single signature NULL.⟨e⟩ed.⟨e⟩ing.s, composed of the familiar suffixes NULL and s, plus two suffixes ⟨e⟩ed and ⟨e⟩ing, which delete a preceding (stem-final) e if one is present.

2.5.8. Top signatures. Top signatures from a 500,000-word English corpus:
(1) NULL.ed.ing.s: accent add administer afford alert amount appeal assault attempt
(2) 's.NULL.s: adolescent afternoon airline ...
(3) NULL.ed.er.ing.s: attack back bath ...
(4) NULL.s: abberation abolitionist abortion ...
(5) e.ed.es.ing: achiev assum brac ...
(6) e.ed.er.es.ing: advertis announc bak ...
(7) NULL.ed.ing: applaud arrest astound ...
(8) NULL.er.ing.s: blow bomb broadcast ...
(9) NULL.d.s: abbreviate accomodate aggravate ...
(10) NULL.ed.s: acclaim beckon benefit ...

References
[1] Rissanen, Jorma. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co, Singapore.

Nyelvbanyaszok
E-mail address: [email protected]