An Integrated Approach to Multilingual Hypertext

An Integrated Approach to Multilingual Hypertext Generation

1

Joakim Nivre Göteborg University

Torbjörn Lager Uppsala University

[email protected]

[email protected]

Introduction

Most present-day NLG systems divide the generation task into three functional steps along the following lines (cf. Reiter 1994, Reiter and Dale 1997): • • •

Text planning Sentence planning Linguistic realization

Text planning combines the tasks of content determination and discourse planning in order to construct a text plan from a suitably represented generation goal. The sentence planning step involves such tasks as sentence aggregation, referring expression generation, and sometimes lexicalization, and aims at converting text plans into sentence plans. Linguistic realization, finally, converts the sentence plans into surface text through syntactic, morphological and orthographic1 processing. In this paper, we will describe an approach to natural language generation that collapses these three steps into a single process of text realization, converting a generation goal into a surface text without any intermediate representations. In order to achieve this, we adopt a generalized grammar formalism, where phrase structure and text structure can be described by the same kind of rules, and use pragmatic principles inspired by the Gricean maxims of quality and quantity to guide the application of these rules. We call the approach integrated, first because it integrates text planning, sentence planning and linguistic realization into a single process, but also because it permits the integration of different realization techniques, such as canned text, templates and grammar rules, in a flexible and efficient manner. It goes without saying that the integrated approach cannot hope to achieve the same level of sophistication as systems adopting specialized techniques for the different tasks involved in text planning, sentence planning, and linguistic realization. Nevertheless, we believe that it provides a useful alternative for applications which do not require the full power of state of the art methods. It is easy to implement, relatively efficient and robust, and provides more flexibility than simpler approaches relying only on canned text or template filling. The approach has been developed and tested in the context of the TREE project,2 which deals with multi-lingual generation of employment advertisements on the World Wide Web. Since the TREE application will be used throughout the paper to illustrate our ideas, we will set the stage in section 2 by giving a brief introduction to the architecture of the TREE system. In section 3 we will introduce the generalized grammar formalism and in section 4 we will explain the pragmatic principles guiding the text realization process. 1. We will concentrate in this paper on the generation of written texts, as opposed to spoken utterances, although the approach described below can easily be applied to the latter problem as well.

1

2

Background: The TREE System

The aim of the TREE project is to provide an Internet-based service for job seekers and employers within the European Union. The basic idea is to take job ads via an e-mail feed, or extract them from existing job databases, and to store information about the vacancies in a language-neutral database. Users can then search the database using their own language and get customized summaries of job ads generated on demand. The current TREE prototype supports four EU languages: English, Flemish, French and Swedish. (For an overview of the TREE project, see Somers et al 1997.) 2.1

General Architecture

The TREE system consists of four main modules: • • • •

Job Database Analyzer Search Interface Generator

The job database is a relational database containing information about job vacancies. Most of the information is stored as numerical codes rather than free text. Only items that are not translated between languages, such as street names and most geographical names, are stored directly as natural language strings. (More information about the job database will be given in section 2.2 below). The task of the analyzer is to extract information corresponding to ‘slots’ in the database from job adverts submitted to the system. Input can be free or semi-structured text as well as coded data from existing job databases. In the former case, the analyzer relies on information extraction techniques using weighted pattern matching (cf. Somers et al 1997). In the latter case, analysis becomes a problem of database conversion which is best solved by defining a customized conversion mechanism for each contributing database. The search interface allows users of the system to search for jobs in their own language, specifying different parameters such as type of job, location, education requirements, etc. The query system uses case-based reasoning techniques to identify the set of vacancies most closely matching the specifications of the user. The generator, finally, outputs descriptions in the user’s own language of the vacancies that are considered relevant in relation to the user’s query. The user can choose different description formats, ranging from a one-line description to a full advertisement containing all the information available about the vacancy. (The generator module will be described further in section 2.3 below and throughout the rest of the paper.) 2.2

The Job Database

The job database allows a fine-grained specification of vacancies through the use of close to one hundred different attributes concerning such things as requirements for education and 2. TREE (TRans European Employment) is Language Engineering project LE 1182 of the European Commission’s Fourth Framework Programme (Telematics). We would like to express our thanks to our colleagues on the project: Jens Allwood (Göteborg University), Jeremy Ellman, Alex Rogers, (MARI Computer Systems), Harold Somers, Bill Black, Teresa Paskiewicz, Alan Wallington (UMIST), Luca Gilardoni, Annarosa Multari (Quinary), Mick Riley, Siobhan Walsh (Newcastle upon Tyne City Council), Edy Geerts, Marianne Kamoen (VDAB, Flemish Employment and Vocational Training Service), Don Macdonald (YouthNet).

2

experience, working hours, type and duration of contract, age requirements, language skills, salary, application procedure, etc. In what follows, we will restrict ourselves to a small subset of these attributes in order to illustrate how the generator converts database entries into vacancy descriptions. Moreover, the database entries will be given in a logical notation, which corresponds to the logical database interface of the generator, rather than to the relational database itself. The point of having a logical database interface on top of the actual database is to make the generator independent of the specific implementation details of the application database and to allow the definition of more complex logical constructs in terms of the simple database concepts. For example, this allows us to introduce entities which are only implicit in the application database but which need to be made explicit in the semantic representations of the generator. This will be exemplified in section 3.1 below. To exemplify the logical database representation, here is a very simplified representation of a vacancy for an architect in the city of Göteborg (Sweden): (1) job_type(111,120). location(111,’Göteborg’). education(111,7). experience(111,2). salary_amount(111,20000). salary_currency(111,’SEK’). salary_rate(111,3).

The number 111, occurring as the first argument in every clause, is the unique identifier of the vacancy in the database. The first two clauses specify that the vacancy concerns job type 120 (‘architect’) and that the job is located in Göteborg. The next two clauses state that the education requirements are of type 7 (‘higher secondary education’) and that at least two years experience is required. The last three clauses specify the salary with regard to amount (20,000), currency (‘SEK’) and pay rate (‘per month’). 2.3

The Generator

The task of the generator in the TREE system is to generate a job ad, or vacancy description, from a given database entry in the language selected by the user. A goal to the generator therefore has three parameters: • • •

Ad type (length, layout, etc.) Vacancy ID Language

As we shall see in the next section, the ad type corresponds to a syntactic text category in the grammar, while the vacancy ID is the core element of the semantic representation associated with a text of that category. For example, to generate a ‘one-liner’ describing the architect job represented in section 2.2 above, the generator would be fed the following goal: (2) one_liner:111

where ‘one_liner’ is a syntactic text category and ‘111’ is the unique database identifier of the vacancy to be described, which is used as a skeleton semantic representation for the text to be generated. In addition to the text category (ad type) and the semantic representation (identifier), a call to the generator must also specify the language in which the ad should be generated. We will return to this problem in section 3.5 below.

3

3

Generalized Hypertext Grammar

A key idea behind the integrated approach to generation is the assumption that information about text structure and phrase structure can be represented in the same formal framework. We call this framework Generalized Hypertext Grammar. 3.1

Formalism

A grammar rule has the following format: (3)

C0:S0 → SS1, …, SSn # Conditions .

where Ci is a syntactic category, Si is a semantic representation, and each SSi has the format Ci, the format Ci:Si, or the format [W1, …, Wm], where Wj is a word-form (string). The hash symbol (#) is used to separate the rule body from a set of database conditions. If the set of conditions is empty, the hash symbol may simply be omitted. Our grammar formalism belongs to the tradition of logic grammar (cf. Abramson and Dahl 1989) and readers are sure to notice the similarity between our formalism and that of Definite Clause Grammar (Pereira and Shieber 1980). It is thus important to note that the categories and semantic representations used in grammar rules are really logical terms, often containing variables that will be instantiated through unification in the generation process. Syntactic Categories Besides traditional phrase structure categories, such as sentence, noun phrase, prepositional phrase, etc., our generalized hypertext grammar contains syntactic categories of text, such as ‘full ad’, ‘one-liner’, and the generic category ‘text’, which is our name for plain paragraph text with no special formatting requirements. These categories are similar to the traditional syntactic categories in that they are used to classify contentful entities, i.e. entities with a semantic representation. However, we also permit syntactic categories which lack semantic representations and only refer to the formatting of texts, such as ‘heading1’, ‘line break’ and ‘bold’. Some of these categories (e.g. ‘bold’) take the form of functors, which can be applied to other syntactic categories (with or without semantic representations). Semantic Representations The grammar formalism in itself does not impose any constraints on the format of semantic representations. Within the TREE project, however, semantic representations usually take the following form: (4) sem(Type,Val)

where Type is a semantic type, and Val refers to an object of that type. The set of semantic types used within the TREE project mostly consists of application-specific types taken from the job database such as ‘job type’, ‘location’, ‘education’, etc. Thus, a noun like ‘architect’, when used to refer to a job type, will have the following semantic representation:3 (5) sem(job_type,120)

The semantic type system also includes functor types that may be used to construct complex types from simpler ones. For example, given the two simple types ‘job type’ and ‘location’, we can form a complex type where, e.g., the job type is restricted to a certain location: 3. The number 120 is the internal code for the job type ‘architect’ (cf. example (1) above).

4

(6) rest(job_type,location)

This type occurs in the semantic representation of a noun phrase like ‘architect in Göteborg’, where the structure of the semantic value mirrors the structure of the complex type: (7) sem(rest(job_type,location),rest(120,’Göteborg’))

Database Conditions Database conditions attached to grammar rules provide the necessary link between semantic representations and the application database. More precisely, the logical database interface defines satisfaction conditions of the following kind: (8) ∀X,Y job_type(X,sem(job_type,Y)) ⇔ db(job_type(X,Y))

This biconditional states that an object Y of type ‘job type’ is the job type of vacancy X iff the database relation ‘job type’ holds between X and Y. In the logical database interface, we may also define satisfaction conditions for complex semantic types, such as the ‘located job’ type discussed earlier: (9) ∀X,Y,Z loc_job(X,sem(rest(job_type,loc),rest(Y,Z))) ⇔ db(job_type(X,Y) & db(location(X,Z))

This is a good illustration of the way in which the logical database interface can be used to introduce objects that are only implicit in the application database, such as a ‘located job’, and which are needed in the semantic representations of the generator, in this case to define the semantics of noun phrases like ‘architect in Göteborg’. 3.2

Phrase Structure Rules

In the following sections we will now exemplify the use of the grammar formalism to encode different aspects of the syntax and semantics of texts. We will begin at the level of traditional grammar, i.e. the level of phrase structure. Consider the rules (10–14),4 which allow the noun phrase in (15) to be generated from the generation goal in (16). (10) n(N):sem(rest(HType,RType),rest(HVal,RVal)) → n(N):sem(HType,HVal), pp:sem(RType,RVal). (11) pp:Sem → p:Sem, n(_):Sem. (12) n(sing):sem(job_type,120) → [architect]. (13) n(sing):sem(location,S) → [S]. (14) p:sem(location,_) → [in]. (15) architect in Göteborg (16) n(sing):sem(rest(job_type,loc),rest(120,’Göteborg’))

Rule (10) says that a nominal constituent5 (with certain agreement features) whose semantic representation is of the type ‘head + restriction’ may consist of a nominal expression (with the same agreement features) denoting the ‘head object’, followed by a prepositional phrase expressing the restriction, while rule (11) says that a prepositional phrase may consist of a preposition followed by a nominal expression, where both constituents inherit the semantic 4. The rules (12–14) really belong to the lexicon, which is handled in a different way in the actual TREE system. For ease of exposition, however, we will ignore the distinction between lexical rules and syntactic rules in what follows. 5. Note that we do not make any categorical distinction between noun phrases and nouns (or ‘n-bars’), but instead rely on semantic information to ensure that the rules do not overgenerate.

5

representation of the mother. Rules (12–14) supply the lexical material necessary to generate (15) from (16). 3.3

Text Structure Rules

As already mentioned, we assume that the same kind of rules that was used in the previous section to describe phrase structure can also be used to describe the structure of larger text elements, at least for the simple kind of texts that we are dealing with in the TREE project. As a first and almost trivial example, we give the rule for a ‘one-liner’ job ad used in the TREE system: (17) one_liner:Vac → n(_):LocJob # loc_job(Vac,LocJob).

This rule says that a one-liner describing the vacancy Vac (where the variable Vac will be instantiated to the unique identifier of a vacancy in the database) can consist of a nominal expression with the semantic representation LocJob, provided that it is true in the current state of the database that LocJob is the ‘located job’ of Vac.6 Thus, given the database facts in (1) and the generation goal in (2), we may derive the goal in (16), which in turn is realized as the ‘text’ in (15). As a second example, let us consider a greatly simplified version of the rule for a ‘full ad’ used in the current version of the TREE system: (18) full_ad:Vac → heading1(n(_):LocJob), heading2([’Specifications’]), text:sem(spec,Vac).

This rule says that a full ad, describing a particular vacancy Vac, consists of three elements, the first of which is the same kind of noun phrase that appears in a one-liner, but formatted as a first level heading.7 The second element is the word ‘Specifications’, formatted as a second level heading, while the third element is a text whose content is a specification (‘spec’) of the vacancy Vac. Let us now see how the specification text can be further expanded:8 (19) text:sem(spec,Vac) → text:sem(education,Vac), text:sem(experience,Vac), text:sem(salary,Vac).

The first element of the specification text concerns education requirements and is covered by the following rules: (20) text:sem(education,Vac) → bold([’Education:’]), comma_list(n(_)):Edus, line_break # setof(Edu,education(Vac,Edu),Edus). (21) text:sem(education,Vac) → [].

The database condition associated with the first rule checks whether there are any education requirements connected with the vacancy being specified. If this is so, the variable Edus gets instantiated to the list of requirements, expressed in terms of integer codes (cf. section 2.2), which becomes the semantic representation of a ‘comma list’, which is a text category that takes another syntactic category as argument — in this case the category ‘n(_)’. The intuitive 6. For a definition of the ‘located job’ relation, see (9) above. 7. We will return to formatting categories in section 3.4 below. 8. Again, we are using a very simplified version of the rule actually used in the TREE system, in order to make the presentation easier to follow. In particular, we have suppressed a large number of items that may occur in the specification of a vacancy, such as age requirements, language skills, driving license required, working hours, type of contract, etc.

6

idea is that a ‘comma list’ is a list of items of the argument category, separated by commas, corresponding one by one to the items on the semantic representation list. The rules for expanding comma lists look as follows: (22) comma_list(Cat):[Sem] → Cat:Sem. (23) comma_list(Cat):[Sem|[X|Y]] → cat:Sem, [’,’], comma_list(Cat):[X|Y].

The second rule for education requirements (rule [21]) is needed for the case where there are no education requirements associated with a vacancy, in which case no text at all should be generated. The second element of the specification is about experience requirements: (24) text:sem(experience,Vac) → bold([’Experience:’]), [’at least’], N, [’years experience’], line_break # experience(Vac,Exp). (25) text:sem(experience,Vac) → [].

Rule (24) is a good illustration of the ease with which co-called ‘canned text’ and ‘templates’ can be integrated into the generalized hypertext grammar. Since the pattern ‘At least N years of experience’ is invariable, except for the value of N, there is no need to use grammar rules to generate that phrase, although it would of course be possible in principle. Other examples of canned text occur in the headings for each item of the specification (‘Education’, ‘Experience’, etc.). The third item in the specification is the salary specification: (26) text:sem(salary,Vac) ---> bold([’Salary:’]), pp:Rate, [’,’], n(_):Sal, line_break # salary_rate(Vac,Rate), salary(Vac,Sal). (27) text:sem(salary,Vac) ---> bold([’Salary:’]), pp:Rate, [’,’], [’amount not specified’], line_break # salary_rate(Vac,Rate). (28) text:sem(salary,_) ---> [].

The first rule generates a prepositional phrase indicating the ‘salary rate’ (e.g., ‘per month’), a comma, and a noun phrase specifying the ‘salary’ (e.g. ‘2000 SEK’), where ‘salary’ is a logical construct composed of the database concepts ‘salary amount’ and ‘salary currency’ via conditions in the logical database interface. The second rule generates the same kind of prepositional phrase specifying the rate, but followed by a piece of canned text: ‘amount not specified’. The third rule again covers the case when no specification can be generated. Given the text structure rules outlined in this section, together with phrase structure rules and a suitable lexicon, and given the ‘database’ in (1), we can now generate the text in (29) from the generation goal in (30). (29)
Architect in Göteborg

Specifications
Education: higher secondary education
Experience: at least 2 years experience
Salary: per month, 20000 SEK
(30) full_ad:111

3.4

Formatting Rules

Since the TREE system is meant to provide a web-based service, the generated texts have to be encoded in HTML. And while other applications may have other requirements, it is still the case that most realistic applications require some sort of text encoding. In keeping with the general philosophy of the integrated approach, we handle this aspect of generation using

7

the same kind of grammar that is used to describe phrase structure and text structure. As an illustration, here is a subset of the formatting rules used in the TREE system: (31) line_break → [’
’]. (32) heading1(Item) → [’
’], Item, [’
’]. (33) bold(Item) → [’’], Item, [’’].

Note that in the case of functor categories like ‘heading1’ and ‘bold’, the hypertext grammar has to be integrated with the rest of the grammar in order to allow the expansion of whatever category occurs as the argument of the functor category. This means that, although we have presented phrase structure rules, text structure rules and formatting rules in separate sections, they are really different parts of the same integrated grammar. 3.5

Multilinguality

Multilingual generation obviously requires different grammars for different languages. In the integrated approach, however, substantial parts of the generalized hypertext grammar will be common to all or most languages. This applies first of all to the formatting rules, but it may also concern a subset of the text structure rules. In developing generation modules for the different languages in the TREE project (currently English, French, Flemish, Swedish), we found that it was possible to reuse large parts of the generalized hypertext grammar across languages. In fact, the current implementation of the system has a single hypertext grammar, with different submodules for language specific rules. Among other things, this means that the overhead for adding a new language to the system is reduced considerably.

4

Pragmatic Text Realization

Given a generalized hypertext grammar of the kind outlined in the previous section, we still face the problem of choosing which rules to apply in a given situation. Our claim in this section is that this problem can be solved by an appeal to pragmatic principles inspired by Grice’s famous conversational maxims (cf. Grice 1975, 1989), and in particular the maxims of quality and quantity.9 4.1

The Maxim of Quality

The maxim of quality says that you should only say what is true or, with a more cautious formulation, what you believe to be true and have adequate evidence for. In the context of a generation system of the kind discussed here, this means that any text generated should be a true description of the world as given by the current state of the database. In the integrated approach to generation, this is achieved by letting the application of a grammar rule be dependent on the satisfaction of all database conditions associated with the rule. For example, consider the three rules for generating a salary specification, repeated below for convenience: (34) text:sem(salary,Vac) ---> bold([’Salary:’]), pp:Rate, [’,’], n(_):Sal, line_break # salary_rate(Vac,Rate), salary(Vac,Sal). 9. Concerning the other maxims, we may note that the maxim of relation (‘Be relevant!’) is vacuously satisfied as soon as the generator is called upon to do its job, since this means that a match has been found for the user’s query and the generated text is therefore guaranteed to be relevant. The maxim of manner (‘Be perspicuous!’) has more to do with the way the grammar rules are constructed than with the way in which they should be applied, i.e., the grammar should be constructed in such a way that a resulting text is always as clear and as well structured as possible.

8

(35) text:sem(salary,Vac) ---> bold([’Salary:’]), pp:Rate, [’,’], [’amount not specified’], line_break # salary_rate(Vac,Rate). (36) text:sem(salary,_) ---> [].

The first rule can be used to generate a true text if and only if the vacancy Vac has the salary rate Rate and the salary Sal. For the second rule, only the first of these conditions has to be satisfied. And the third rule generates a true text, viz., the empty one, in any situation. This means that if there is no information about salary (i.e. amount and currency) in the database, the first rule will be blocked because the second condition is not satisfied. And if there is no information about salary rate either, only the third rule can be applied in accordance with the maxim of quality. 4.2

The Maxim of Quantity

The maxim of quantity says that you should make your contribution as informative as is required — but not more informative than is required. Determining how much information is adequate in a given context is a very tricky problem in the general case. However, in certain restricted contexts, it can safely be assumed that more information is always better than less information. This is at least the case for the TREE generator, partly because the regulation of information quantity is handled by other parts of the system,10 and partly because the amount of information available about any single vacancy is restricted enough not to cause any problems in this respect. Thus, in the context of TREE, we may rephrase the maxim of quantity to say simply that you should try to make your contribution as informative as possible. Moreover, we believe that this simplified version of the maxim is applicable also in many other practical applications. The way to enforce the maxim of quantity, within the integrated approach, is to require that more specific rules are chosen over less specific rules, where specificity can be defined in terms of the set of possible states of the application database that satisfy the database conditions associated with a rule. The smaller this set, the more specific the rule. Thus, if we return to rules (34–36), we see that rule (34) has the highest specificity and will therefore be chosen first (provided that the maxim of quality can be satisfied). Rule (35) is less specific but nevertheless more specific than the maximally unspecific rule in (36). As a result, the last rule, generating the empty text, will only be tried if none of the other rules can be applied (because they violate the maxim of quality).

5

Conclusion

In this paper we have presented an integrated approach to multilingual hypertext generation, where all the knowledge which is needed for converting a generation goal into a hypertext is expressed in a single generalized hypertext grammar, and where the whole generation task is performed by means of a single process of text realization, guided by pragmatic principles similar to the Gricean maxims of quality and quantity. Most of the techniques used in this approach are neither very novel nor very interesting in themselves. For example, the way we construct text structure rules to solve the problem of text planning is very similar to the so-called schema-based approaches found in the literature (see, e.g., McKeown 1985, Kittredge et al 1991). What is different, though, is that we have integrated these rules into the same framework that is used also for sentence planning and 10.On the one hand, this problem is taken care of by the user himself in selecting the type of ad he wants to have generated (i.e. full ad, one-liner, etc.). On the other hand, it is handled by the search interface, which decides how many ads to present to the user as the result of a given query.

9

linguistic realization. Similarly, while the use of canned text and templates for linguistic realization is standard in many practical generation systems, it is not customary to integrate these techniques with grammar rules into a single framework. It is important to keep in mind, however, that one of the reasons why the integrated approach works well in a context like that of the TREE system is that we are dealing with a fairly restricted generation task. For example, problems of content determination are mostly outside the scope of the generator and are handled by the search interface. Furthermore, the task of discourse planning is greatly simplified by the fact that the job ads generated by the system have a fairly rigid and stereotypical structure. On the other hand, we think that this is typical of many other practical applications, and an important task for the future is to investigate precisely what are the conditions that have to be satisfied in order for the integrated approach to be viable. In our view, the main advantage of this approach is that it offers a simple and flexible solution, which is nevertheless based on sound theoretical principles.

References Abramson, H. and Dahl, V. (1989) Logic Grammars. Springer Verlag. Grice, H. P. (1975) Logic and Conversation. In Cole, P. and Morgan, J. (eds) Syntax and Semantics. Vol. 3: Speech Acts. New York: Academic Press. Grice, H. P. (1989) Studies in the Way of Words. Cambridge, MA: Harvard University Press. Kittredge, R., Korelsky, T. and Rambow, O. (1991) On the need for domain communication knowledge. Computational Intelligence 7, 305–314. McKeown, K. (1985) Discourse strategies for generating natural-language text. Artificial Intelligence 27, 1–42. Pereira, F. C. N. and Warren, D. H. D. (1980) Definite clause grammars for language analysis — a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence 13, 231–278. Reiter, E. (1994) Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible? Proceedings of the 7th International Workshop on Natural Language Generation, pp. 163–170. Reiter, E.and Dale, R. (1997) Building applied natural language generation systems. Natural Language Engineering 3:1, 57–87. Somers, H., Black, B., Ellman, J., Gilardoni, L., Lager, T., Multari, A., Nivre, J. and Rogers, A. (1997) Multilingual Generation and Summarization of Job Adverts: The TREE Project. Proceedings of the Fifth Conference on Applied Natural Language Processing.

10