parsing german - Semantic Scholar

2 downloads 0 Views 307KB Size Report
(2) 'Der Bus geht nach Wien.' (The bus is bound for Vienna.) gehenl (move along). C((CASE NOM) AND (RESTRICTION ANIMATE)) ->. A(CRI LOCOMOTION).
COLING~ , ~ Horec~ /~.) North-Holland PubltJhing Company

Accdcm~ 1982 PARSING

GERMAN

Ingeborg Steinacker, H a r a l d Trost D e p a r t m e n t of Medical C y b e r n e t i c s U n i v e r s i t y of Vienna The first part of this paper is d e d i c a t e d to an o v e r v i e w of the parser of the system V I E - L A N G (Viennese Language U n d e r s t a n d i n g System). The parser is a production system which uses an interleaved method that combines syntax and semantics. It parses directly into the internal r e p r e s e n t a t i o n of the system, w i t h o u t producing an i n t e r m e d i a t e syntactic structure. The last part d i s c u s s e s the r e l a t i o n s h i p between some special features of the G e r m a n language, and properties of the parser that o r i g i n a t e in the language.

GENERAL

APPROACH

A sentence is parsed word per word, from left to right. The parser is l a r g e l y a d a t a - d r i v e n p r o d u c t i o n system. P r o d u c t i o n s involve the use of s y n t a c t i c and s e m a n t i c information at all major stages of the process. Noun phrases, for example, are r e c o g n i z e d by an ATN which v e r i f i e s the result of s y n t a c t i c analysis s e m a n t i c a l l y . It returns s e m a n t i c a l l y valid NPs only. The parser b e l o n g s to the class of semantic p a r s e r s as s u g g e s t e d by [I], [4], [7]. It has two main sources of information: one is a semantic net, which p r o p a g a t e s the information about selectional restrictions, the other is the parsing-lexicon, which for each word contains different senses a s s o c i a t e d with the information n e c e s s a r y to d i s t i n g u i s h one sense from the others. I n f o r m a t i o n includes syntactic features of the sentence (infinitive, s u r f a c e - c a s e s of d e p e n d e n t n o u n phrases .... ), semantic restrictions and words that occur together with the input-word. The p r o d u c t i o n s make use of a correspondence between syntactic i n f o r m a t i o n in the sentence and the roles of the net (see chapter internal r e p r e s e n t a t i o n for an e x p l a n a t i o n of roles). Productions are used not only for generating the internal r e p r e s e n t a t i o n of c o n s t i t u e n t s but also as e x p e c t a t i o n s that guide the a n a l y s i s of the rest of the sentence. The g e n e r a t i o n of the internal structure corresponding to the sentence is centered around the verb. Since the r e p r e s e n t a t i o n of other c o n s t i t u e n t s can be initiated i n d e p e n d e n t l y of the verb, the parser builds a s e m a n t i c s t r u c t u r e i m m e d i a t e l y after a c o n s t i t u e n t is recognized. These s t r u c t u r e s are stored in a list, until the main verb of the sentence has been found. Then the parser tries to fill the c a s e - s l o t s of the verb with the given structures. The s e m a n t i c c a t e g o r i e s of the s t r u c t u r e s have to be m a t c h e d against the value r e s t r i c t i o n s of the roles of the verb. INTERNAL REPRESENTATION

The source of semantic information Net [2]. This net formalism has e p i s t e m o l o g i c a l l y clear and explicit.

365

is a Structural the advantage S I - N e t s are based

Inheritance of being on a s t r i c t

I. STEINACKER and H. TROST

366

d i s c r i m i n a t i o n b e t w e e n few s t r u c t u r a l c o m p o n e n t s , and their content (what is represented). Real world k n o w l e d g e is r e p r e s e n t e d in the form of c o n c e p t s and roles. Roles explain relationships between concepts. A concept is d e f i n e d by its a t t r i b u t e s w h i c h c o n s i s t of two parts: the role and the value restriction. The value r e s t r i c t i o n is a c o n c e p t w h i c h d e f i n e s the range of p o s s i b l e f i l l e r s for the attribute, the role d e f i n e s the f u n c t i o n of a filler with regard to the c o n c e p t being defined. Role-filler concepts can be r e g a r d e d as s e m a n t i c c a t e g o r i e s . Generic concepts are organized in a hierarchy of superand subconcepts. A subconcept inherits the attributes of the superconcept. If a concept has more than one superconcept it inherits the combined set of a t t r i b u t e s . W h e n p r o c e s s i n g an input i n d i v i d u a l s of the addressed concepts are instantiated. These i n d i v i d u a l s c o n s t i t u t e the e p i s o d i c layer of the net. A word sense a d d r e s s e s either a concept or the attribute of a concept. If an input word relates to a concept, as most nouns and v e r b s do, that c o n c e p t is i n s t a n t i a t e d . If it c o r r e s p o n d s to a role both the c o n c e p t and the a t t r i b u t e are instantiated, i. e. the g e n e r i c concept, the role defining the attribute and the value restriction. Most a d j e c t i v e s and most p r e p o s i t i o n s are mapped into roles (size, colour, location, time, .... ) but also some nouns (e.g. father is the role of a p e r s o n in the c o n c e p t family). The net is s t r u c t u r e d in a way that f a c i l i t a t e s the i n c o r p o r a t i o n of r e s u l t s g a i n e d in l i n g u i s t i c s : a t t r i b u t e s of a c t i o n s are d e f i n e d in a way c o r r e s p o n d i n g to cases of a case g r a m m a r . This can best be i l l u s t r a t e d by an example: A c t i o n s are r e p r e s e n t e d as n e t - c o n c e p t s , e.g. DO. The c o n c e p t DO is d e f i n e d by a t t r i b u t e s with roles like AGENT, OBJECT, GOAL, RESULT, that are restricted by adequate role-filler concepts. By defining attributes in this way a c o r r e s p o n d e n c e b e t w e e n s u r f a c e cases in a s e n t e n c e and roles of the net can e a s i l y be e s t a b l i s h e d .

THE

PARSING-LEXICON

In the parsing-lexicon each word-sense is associated with productions. These productions r e f l e c t the c o r r e s p o n d e n c e b e t w e e n s u r f a c e cases of the s e n t e n c e and s e m a n t i c cases within the net. The number of tests in a p r o d u c t i o n c o r r e l a t e s to the number of s e n s e s of a word. By e x e c u t i n g these tests the parser gains the information necessary to choose the correct reading of a word. T e s t s check the s y n t a c t i c and the s e m a n t i c c o n t e x t in which an input w o r d is found. Sometimes morphological information and the o c c u r r e n c e of certain w o r d s have to be taken into c o n s i d e r a t i o n as well. The range of tests r e f l e c t s our g e n e r a l a p p r o a c h to parsing: c o m b i n i n g syntax and s e m a n t i c s at all s t a g e s of the p a r s i n g p r o c e s s [8]. D e p e n d i n g on the stage of the p r o c e s s the failure of a i n t e r p r e t e d in two ways. If the end of the s e n t e n c e e n c o u n t e r e d the result is taken as false, if p a r s i n g is in the test is r e p e a t e d at later stages of the process. Actions associated structure-building

with the procedures.

tests Some

mostly actions

test is has been progress

deal with semantic are u s e d to c o n t r o l the

367

PARSING GERMAN

p a r s i n g process. Usually the semantic s t r u c t u r e for a c o n s t i t u e n t of the sentence is built after the constituent is recognized but a c t i o n s can delay the c r e a t i o n of n e t - s t r u c t u r e s . The reasons for such a d e l a y are e x p l a i n e d in the following chapter. A verb-sense is recognized by taking into consideration the syntactic surroundings of the verb and the s e m a n t i c c a t e g o r i e s that match the s e l e c t i o n a l r e s t r i c t i o n s defined by the verb. After a v e r b - s e n s e has been chosen expectations are built up r e g a r d i n g missing constituents. The o c c u r r e n c e of c e r t a i n surface-structures also leads to the f o r m a t i o n of e x p e c t a t i o n s . T h e r e f o r e tests that are a s s o c i a t e d with verbs first check the surface structure of the sentence (cases, prepositions...). The c o n s t i t u e n t s that s a t i s f y these s y n t a c t i c tests have to fulfill semantic selectional restrictions. After having passed these tests, actions create the semantic r e p r e s e n t a t i o n for the verb and fill its roles with the selected c o n s t i t u e n t s . Unless object

an entry in the lexicon includes a test r e g a r d i n g subject and of a sentence the following default actions are executed automatically: the subject of a sentence is m a p p e d onto the A G E N T and the object (accusative) is m a p p e d onto the O B J E C T of the action. A Detailed

Example

The two senses of 'gehen' in the following example can be d i s a m b i g u a t e d by using the entries in the p a r s i n g - l e x i c o n listed below (parts of the entry which are irrelevant to the example are left out). These sample entries include important kinds of tests and actions. (i) (2)

'Ich gehe in den Park.' 'Der Bus geht nach Wien.'

gehenl (move along) C ( ( C A S E NOM) AND (RESTRICTION A(CRI LOCOMOTION) g e h e n 2 (bound for) C ( ( C A S E NOM) AND (RESTRICTION A(CRI (PUBL.-TRANSPORT)) C(PLOC) -> A (CRV(+,DESTINATION,*)).

(I walk into the garden.) (The bus is bound for Vienna.)

ANIMATE))

->

PUBL.-TRANSPORT.))

->

In the example the '+' p a r a m e t e r is an individual of the concept PUBL.-TRANSPORT, the '*' p a r a m e t e r is the l o c a t i o n e x p r e s s e d by the p r e p o s i t i o n a l phrase, namely Vienna. The nounphrase 'Ich' (I) fulfills the restriction ANIMATE, because speakers are always interpreted as humans. Surface-tests: C a s e - t e s t s search for an NP of the surface-case indicated by the second parameter. If an NP is found that s a t i s f i e s the condition, the tests that are c o n n e c t e d by AND or OR to the case-test are executed. The c o n s t i t u e n t of the sentence which s a t i s f i e s the tests is referred to with an asterix '*' in the a s s o c i a t e d action(s). The test PLOC refers location. It is

to a p r e p o s i t i o n a l p h r a s e that a test which uses syntactic

indicates some and semantic

368

I. STEINACKER and H. TROST

information. Restriction-tests: These s e m a n t i c tests are used to check selectional restrictions. They are often used in c o m b i n a t i o n with s y n t a c t i c tests. If both tests are met by a c o n s t i t u e n t this is a s i g n i f i c a n t indicator, that the c o r r e c t i n t e r p r e t a t i o n has been s e l e c t e d . Structure-building

Actions:

The a c t i o n C R I ( c o n c e p t ) c r e a t e s an i n d i v i d u a l of the concept. The action CRV(pl,p2,p3) individuates an a t t r i b u t e of the c o n c e p t pl. The c o n c e p t pl, the role p2 and the c o n c e p t P3 as value-restriction are i n s t a n t i a t e d . If pl or p3 are a d d r e s s e d by '+' the p a r a m e t e r refers to the first c o n c e p t that was individuated when processing this p a r t i c u l a r entry in the parsing-lexicon. A '*'-parameter refers to the semantic representation of the constituent which s a t i s f i e s the first test of the p r o d u c t i o n .

SPECIAL

FEATURES

Morphological

OF G E R M A N

Ambiguities

We b e l i e v e that m a k i n g use of the interaction between syntax and s e m a n t i c s has m a n y a d v a n t a g e s over a s t r i c t l y s e q u e n t i a l a p p r o a c h to parsing. Introducing semantic information helps to r e s o l v e some a m b i g u i t i e s at an e a r l y stage of the analysis and thus to avoid unnecessary backtracking. T y p i c a l l y , m o r p h o l o g i c a l a m b i g u i t i e s can be resolved by such an i n t e r a c t i o n . The G e r m a n language is rich in inflectional forms, therefore the morphological component often comes up with m o r e than one p o s s i b l e stem for an input word. These stems usually belong to different c a t e g o r i e s of words, e.g. 'meinen' can be i n t e r p r e t e d as a verb (to suppose) or it can be reduced to the p o s s e s s i v e p r o n o u n 'mein' (my). Syntax restricts the type of a c o n s t i t u e n t , w h i c h is e x p e c t e d at a g i v e n point in the analysis. Usually it is sufficient to use syntactic information to d i s a m b i g u a t e m o r p h o l o g i c a l a m b i g u i t i e s of this kind. If a word is reduced to two d i f f e r e n t stems of the same c a t e g o r y of words, s e l e c t i o n a l restrictions in the semantic net are used to c h o o s e one stem. The parsing-lexicon relates surface cases to semantic restrictions of the attributes of the action. In most cases this i n f o r m a t o n is s u f f i c i e n t for disambiguation. The inflected (to hear) and

form 'gehoert' is reduced to the 'gehoeren' (to b e l o n g to).

two

verbs

'hoeren'

(3) D i e s e s Buch g e h o e r t mir. (This is my book.) (4) Hast du d i e s e s G e r a e u s c h g e h o e r t ? (Did you hear that n o i s e ? ) In (3) the s u b j e c t of the s e n t e n c e has to be a ' P O S S E S S I B L E O B J E C T ' , in (4) the object of has to be a s u b c o n c e p t of 'SOUND'. A violation of s e l e c t i o n a l r e s t r i c t i o n s , is a clear i n d i c a t o r that the wrong i n t e r p r e t a t i o n of the v e r b has been chosen.

PARSING GERMAN

Disconnected

369

Constituents

A n o t h e r c h a r a c t e r i s t i c f e a t u r e of the G e r m a n language is the verb second p h e n o m e n o n . In G e r m a n a verb can o c c u p y three d i f f e r e n t positions within a sentence: the first in q u e s t i o n s and commands, the second in main clauses, and the last in s u b o r d i n a t e clauses. C o m p o u n d p r e d i c a t e s are d i v i d e d into two parts. The auxiliary or the modal verb hold %he place of the verb, and the rest of the p r e d i c a t e is put at the end of the sentence. One has to deal with a t w o - p i e c e p r e d i c a t e w h e n e v e r c o m p o u n d tenses are used, in s t r u c t u r e s i n v o l v i n g the i n f i n i t i v e etc. For a parser that uses a traditional approach of sequential s y n t a c t i c and semantic processing these f e a t u r e s cause e x t e n s i v e backtracking. The method of combinig syntactic and semantic analysis does not avoid backtracking completely but it makes r e - i n t e r p r e t a t i o n easier. This claim is s u p p o r t e d in the following p a r a g r a p h using the e x a m p l e of a c o m p o u n d p r e d i c a t e . (5) Mein Bruder hat das Buch, yon dem du mir e r z a e h l t hast, schon g e l e s e n . (My b r o t h e r a l r e a d y read the book, w h i c h you told me about.) In (5) the object and a r e l a t i v e c l a u s e s e p a r a t e the two parts of the p r e d i c a t e . One p o s s i b l e r e a d i n g of the verb 'haben' is to possess. The object 'das Buch' s a t i s f i e s the s e m a n t i c r e s t r i c t i o n ' P O S S E S S I B L E - O B J E C T ' , t h e r e f o r e 'hat' is taken as the p r e d i c a t e and a possess relation is e s t a b l i s h e d b e t w e e n the r e p r e s e n t a t i o n s for s u b j e c t and object. When the past participle 'gelesen' is e n c o u n t e r e d at the end of the sentence this d e c i s i o n has to be revised in favour of the c o m p o u n d p r e d i c a t e 'hat g e l e s e n ' . The p o s s e s s r e l a t i o n which was e s t a b l i s h e d has to be replaced by the c o n c e p t that is a d d r e s s e d by 'lesen, n a m e l y 'INFORMATION-TRANSFER'. The s e m a n t i c representations of the o b j e c t book and the r e l a t i v e clause are not a f f l i c t e d by this change. Book also fits into the hierarchy of 'INFORMATION-SOURCE' and therefore satisfies the s e l e c t i o n a l r e s t r i c t i o n s for the object of 'INFORMATION-TRANSFER' also. S e p a r a b l e p r e f i x e s also add to the problem of finding the right verb. Syntactically verbadjuncts are p a r t i c l e s , that are part of the verb. In some tenses a v e r b a d j u n c t b e c o m e s s e p a r a t e d from the v e r b and is put at the end of the clause. V e r b a d j u n c t s can s p e c i f y the verb, but s o m e t i m e s they change its sense c o m p l e t e l y (aufhoeren = to stop, h o e r e n = to hear). (6) Das Kind hoert (After an hour

nach einer Stunde e n d l i c h zu w e i n e n the child f i n a l l y stops crying.)

auf.

Such features either cause d e l a y in t h ~ c o n s t r u c t i o n of the internal r e p r e s e n t a t i o n for a sentence, or they result in backtracking b e c a u s e the c o r r e c t m e a n i n g of the verb b e c o m e s a p p a r a n t at the end of the sentence. CONCLUSION The s t r u c t u r e of the G e r m a n language adds some d i f f i c u l t i e s to the general problem of parsing n a t u r a l language. Flexible word-order

370

I. STEINACKER and H. TROST

and multiple sources for ambiguities led us to choose a data-driven approach. Syntactic and semantic information are used for disambiguation of existing structures and for expectations that control processing of new input. Since backtracking is inevitable in some cases we tried to make it as efficient as possible. The integration of syntax and semantics facilitates backtracking to a large degree because semantic representations for all constituents are built independently. If backtracking occurs e.g. after having selected a wrong verb-sense, the parser has to destroy the existing semantic representation and replace it with the one indicated by the new verb. The slots of the instantiation of the concept for the new verb have to be filled with the already existing structures instead of having to start the parse all over again. ACKNOWLEDGEMENTS This research is part of the project 'Development of a NLU System with regard to Medical Applications' (supervision R.Trappl), which is supported by the Austrian 'Fonds zur Foerderung der wissenschaftlichen Forschung', grant #4158. REFERENCES [I] Boguraev, B.K., Automatic Resolution of Linguistic Ambiguities, University of Cambridge, (1979). [2] Brachmann, R.J., A Structural Paradigm for Representing Knowledge (Bolt Seranek and Newman, 1978). [3] Buchberger, E., Steinacker, I., Trappl, R., Trost, H., Leinfellner, E., A NLU System for Medical Applications, SIGART 79 (1982), 146-147. [4] Gershman, A.V., Knowledge-Based Parsing, Yale Univ., RR-156,(1979). [5] .den, G., On the Use of Semantic Constraints in Guiding Syntactic Analysis, Univ. of Wisconsin, WR-3, (1978). [6] Palmer, M., A Case for Rul~-driven Semantic Processing, in: Proceedings of the 19th Ann. Meeting of the ACL, Stanford, (1981). [7] Riesbeck, C.K. and Schank, R.C., Comprehension by Computer: Expectation-based Analysis of Sentences in Context, Yale University, RR-78, (1976). [8] Steinacker, I., Parsing between Syntax and Semantics, Automatische Sprachverarbeitung 81, Potsdam, (1981). [9] Trost, H. and Steinacker, I., The Role of Roles, Some Aspects of World Knowledge Representation, in: Proc. of the 7th Int'l. Joint Conf. on Artificial Intelligence, Vancouver, (1981). [l@]Wilks, Y., Processing Case, Am. J. of Computational Linguistics 4,(1976).