
Noun Sense Disambiguation with WordNet for Software Design Retrieval

Paulo Gomes, Francisco C. Pereira, Paulo Paiva, Nuno Seco, Paulo Carreiro, José Luís Ferreira, and Carlos Bento

CISUC - Centro de Informática e Sistemas da Universidade de Coimbra. Departamento de Engenharia Informática, Polo II, Universidade de Coimbra. 3030 Coimbra
[email protected], http://rebuilder.dei.uc.pt

Abstract. Natural language understanding can be used to improve the usability of intelligent Computer Aided Software Engineering (CASE) tools. It can help a software designer in two ways: a broad range of natural language terms can be used in the naming of software objects, attributes and methods; and the system can understand the meaning of these terms, so that it can use them in reasoning mechanisms such as retrieval. However, word sense disambiguation remains an obstacle to the development of computational systems that can fully understand natural language. To deal with this problem, this paper presents a word sense disambiguation method and shows how it is integrated with a CASE tool. It also shows how natural language can be integrated in the classification and retrieval of software objects. An example of usage of our system and some experimental work are also presented.

1 Motivation and Goals

Software design is one phase of software development [1], in which development teams use Computer Aided Software Engineering (CASE) tools to build design models of software systems. Most of these tools work as editors of design specification languages and offer little intelligent support for the designer’s work. There are several ways to improve these tools. One is to integrate reasoning mechanisms that can aid the software designer, such as retrieval of relevant information or generation of new software designs. To be integrated fruitfully in a CASE tool, however, these mechanisms must be intuitive and easy for software designers to use. One way to provide a good communication environment between designer and tool is to integrate natural language understanding. The use of natural language queries for retrieval mechanisms and the automatic classification of design elements using word sense disambiguation are just two possible ways of achieving a good system-user communication interface. Nevertheless, natural language has some characteristics that are hard to mimic from the computational point of view. One of them is the ambiguity of words: the same word can have different meanings, depending on the context in which it is used. This poses a big problem for a computational system that has to use natural language to interact with humans. In natural language research this problem is known as Word Sense Disambiguation (WSD), see [9]. A system that uses natural language must deal with it.

We are developing a CASE tool (REBUILDER) capable of helping a software designer in her/his work in a more intelligent way. This tool can retrieve designs from a knowledge base or generate new designs, thus providing the designer with alternative solutions. From the designer’s point of view, REBUILDER is a Unified Modelling Language (UML [21]) editor with some special functionalities. The basic modelling units in UML (and in REBUILDER) are software objects. These objects must be classified so that they can be retrieved by REBUILDER. For this classification we use WordNet [15] as an index structure and also as a general ontology. REBUILDER automates object classification, using the object’s name to determine which classification the object must have. To do this, we have to tackle the WSD problem with just the object’s name and the surrounding context, which in our case comprises the object’s attributes (in the case of a class) and the other objects in the same design model.

This paper presents a WSD method for the domain of software design using UML models. We also show the experimental results obtained by our approach to the WSD problem. One advantage of our method is that it uses only WordNet, while other methods use WordNet in combination with other sources of knowledge. In the remainder of this paper we start by describing the WordNet ontology. Section 3 presents related work on the WSD problem and on other software design systems. Section 4 presents our approach, starting with an overview of REBUILDER and then defining the WSD method used. We also describe how classification and retrieval are done in our system. Section 5 presents an example of object disambiguation using our approach. Section 6 presents two experimental studies: one on the influence of the context on the accuracy of the WSD method, and the other on the influence of the semantic distance metric on the accuracy of the WSD method. Finally, section 7 presents some of the advantages and limitations of our system.

2 WordNet

WordNet is a lexical resource that uses a differential theory, in which concept meanings are represented by symbols that enable a theorist to distinguish among them. Symbols are words, and concept meanings are called synsets. A synset is a concept represented by one or more words. If more than one word can be used to represent a synset, these words are called synonyms. Another word phenomenon is also important in WordNet: the same word can have more than one meaning (polysemy). For instance, the word mouse has two meanings: it can denote a small rodent or a computer mouse. WordNet is built around the concept of synset. Basically, it comprises a list of word synsets and a set of semantic relations between synsets. The first part is a list of words, each with the list of synsets that the word represents. The second part is a set of semantic relations between synsets, such as is-a relations (mouse is-a rodent), part-of relations (door part-of house), and other relations. Synsets are classified in four categories: nouns, verbs, adjectives, and adverbs. In REBUILDER we use the word synset list and four semantic relations: is-a, part-of, substance-of, and member-of.
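The two parts described above (a word-to-synsets list and a set of typed relations between synsets) can be pictured with a small data structure. The following is only a minimal sketch: the class `Synset`, the dictionaries `SYNSETS` and `WORD_INDEX`, the helper functions, and all identifiers and glosses are hypothetical illustrations, not WordNet's actual storage format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """One concept: an identifier, a gloss, and outgoing semantic relations."""
    synset_id: str
    gloss: str
    # relation name ("is-a", "part-of", ...) -> identifiers of the target synsets
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Hypothetical toy data; identifiers and glosses are invented for illustration.
SYNSETS: Dict[str, Synset] = {
    s.synset_id: s
    for s in [
        Synset("mouse#animal", "a small rodent", {"is-a": ["rodent#animal"]}),
        Synset("mouse#device", "a hand-operated pointing device",
               {"is-a": ["device#artifact"]}),
        Synset("rodent#animal", "a gnawing mammal"),
        Synset("device#artifact", "an instrument made for a purpose"),
    ]
}

# First part: each word lists the synsets it can represent (polysemy).
WORD_INDEX: Dict[str, List[str]] = {
    "mouse": ["mouse#animal", "mouse#device"],
    "rodent": ["rodent#animal"],
    "device": ["device#artifact"],
}


def synsets_of(word: str) -> List[Synset]:
    """Return every synset that the given word can represent."""
    return [SYNSETS[sid] for sid in WORD_INDEX.get(word.lower(), [])]


def related(synset_id: str, relation: str) -> List[str]:
    """Second part: follow one type of semantic relation from a synset."""
    return SYNSETS[synset_id].relations.get(relation, [])


if __name__ == "__main__":
    for synset in synsets_of("mouse"):
        print(synset.synset_id, "->", related(synset.synset_id, "is-a"))
```

Looking up mouse returns two synsets, which is exactly the ambiguity that the WSD method has to resolve.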

3 Related Work

This section presents research work related to our approach. Two main topics are addressed: WSD methods, and software reuse systems that focus on object classification and retrieval.

3.1 Word Sense Disambiguation

Li et al. [11] propose an algorithm for WSD based on WordNet, which uses relations between verbs and nouns. Their method is applied to nouns that are objects of verbs in sentences. Smeaton and Quigley [22] describe an information retrieval system for image retrieval based on image captions. They use two disambiguation methods, both based on a semantic distance between words. This semantic distance uses WordNet and the probability of the Most Specific Common Abstraction (MSCA) between synsets occurring in a large text corpus. Mihalcea and Moldovan [13] propose a WSD method based on the idea of semantic density between words. They use WordNet as a general ontology and the Internet as a source of raw corpora providing statistical information on word associations. Their approach targets verb-noun pairs, and they define the semantic density between words as the number of common words that are within a semantic distance of two or more words. The closer the semantic relationship between two words, the higher the semantic density between them. They also present an iterative WSD method [12], based on WordNet and SemCor [14]. This method disambiguates words iteratively, based on several procedures that use word relations in WordNet. Kwong [10] has recently presented a WSD study using WordNet and Roget’s Thesaurus. Her study tests four different WSD methods, which are then combined with the entry sense ordering method to provide the best results. This combination significantly enhances the accuracy of the other methods. Gale et al. [6] provide an estimate of upper and lower bounds for the performance of WSD methods, and stress that method comparison is very difficult, since experimental set-ups are very diverse. Yarowsky [25] presents a WSD method based on statistical models of Roget’s Thesaurus categories. These categories can be regarded as conceptual classes, which tend to correspond to senses. Another work involving disambiguation of word senses in Roget’s Thesaurus was done by Nastase and Szpakowicz [16]. They use the information in WordNet to disambiguate word senses, based on word lists of neighbor senses. These word lists are then compared, yielding a disambiguation score that can be used to determine which sense to use. Rigau et al. [20] present a WSD method based on the combination of several unsupervised algorithms. Wilks and Stevenson [24] propose a WSD method based on part-of-speech tagging. Their main claim is that part-of-speech tagging has some connection to semantic disambiguation. Resnik [18, 19] uses WordNet and data on word frequency as a basis for a WSD method that takes into account the semantic similarity of synsets.

3.2 Classification and Retrieval of Software

ROSA [7] is a software reuse system that retrieves software components using natural language descriptions. This system converts natural language queries, given by the user, into frames, which are the structures used internally by the system. Its classification mechanism uses a grammar to parse software descriptions, so that these can also be transformed into frames. Retrieval is based on the similarities of the semantic frames and on the matching of the noun phrases in the semantic structures. Burg and Riet [3] propose that intelligent CASE systems can profit from linguistics. They describe a CASE system that analyzes textual documents describing domain requirements, extracting the structure of the domain model. This analysis consists of parsing the texts and retrieving the word meanings corresponding to the concepts in the model. LaSSIE [5] is a knowledge-based system that helps programmers search for useful information about large software systems. It uses a knowledge base and a semantic analysis algorithm capable of retrieving software components. It provides an interactive interface that eases access to the software through several semantically-based views. Althoff and Tautz [23] take a different approach to software reuse and design. Instead of reusing code, they reuse system requirements and the associated software development knowledge. The RSL [4] is a software design system that allows the reuse of code and design knowledge. Component retrieval can be done using a natural-language query or an attribute search. Component ranking is an interactive and iterative process between RSL and the user. Prieto-Díaz’s [17] approach to code reuse is based on a faceted classification of software components. Conceptual graphs are used to organize facets, and a conceptual closeness measure is used to compute the similarity between facets. Borgo [2] uses WordNet as a linguistic ontology for the retrieval of object-oriented components. It uses a graph structure to represent both the query and the components in memory. The retrieval mechanism uses a graph matching algorithm that returns the identifiers of all components whose description is subsumed by the query.

4 Our Approach

This section describes our approach to WSD, and how it fits in the object classification, retrieval and similarity tasks within REBUILDER.

4.1 REBUILDER

The main goals of REBUILDER are: to create a corporation’s memory of design knowledge; to provide tools for reusing design knowledge; and to provide the software designer with a design environment capable of promoting software design reuse. It comprises four different modules: Knowledge Base (KB), UML Editor, KB Manager and Case-Based Reasoning (CBR) Engine. It runs in a client-server environment, where the KB is on the server side and the CBR Engine, UML Editor and KB Manager are on the client side. There are two types of clients: the design user client, which comprises the CBR Engine and the UML Editor; and the KB administrator client, which comprises the CBR Engine and the KB Manager. Only one KB administrator client can be running, but there can be several design user clients.

The UML Editor is the front-end of REBUILDER and the environment dedicated to the software designer. The KB Manager module is used by the administrator to manage the KB, keeping it consistent and updated. The KB comprises four different parts: the case library, which stores the cases of previous software designs; an index memory that is used for efficient case retrieval; the data type taxonomy, which is an ontology of the data types used by the system; and WordNet, which is a general purpose ontology.

The CBR Engine is the reasoning part of REBUILDER. As the name indicates, it uses the CBR paradigm to establish a reasoning framework. This module comprises six different parts: Retrieval, Design Composition, Design Patterns, Analogy, Verification, and Learning. The Retrieval sub-module retrieves cases from the case library based on their similarity with the target problem. The Design Composition sub-module modifies old cases to create new solutions. It can take pieces of one or more cases and build a new solution by composing these pieces. The Design Patterns sub-module uses software design patterns and CBR to generate new designs. Analogy establishes a mapping between the problem and the selected cases, which is then used to build a new design by knowledge transfer between the selected case and the target problem. Verification checks the coherence and consistency of the cases created or modified by the system. It revises a solution generated by REBUILDER before it is shown to the software designer. The last reasoning sub-module, Learning, implements the retain phase, in which the system learns new cases. The cases generated by REBUILDER are stored in the case library and indexed using a memory structure.

4.2 Object Classification

In REBUILDER cases are represented as UML class diagrams (see Figure 1 for an example), which represent the software design structure. Class diagrams can comprise three types of objects (packages, classes, and interfaces) and four kinds of relations between them (associations, generalizations, realizations and dependencies). Class diagrams are very intuitive and are a visual means of communication between the members of a software development team. Each object has a specific meaning corresponding to a specific synset, which we call the context synset. This synset is then used for object classification, indexing the object under the corresponding WordNet synset. This association between software object and synset enables the retrieval algorithm and the similarity metric to use the WordNet relational structure for retrieval efficiency and for similarity estimation, as shown in section 4.5.

4.3 Word Sense Disambiguation in REBUILDER

The object’s class diagram is the context in which the object is referenced, so we use it to disambiguate the meaning of the object. To obtain the correct synset for an object, REBUILDER uses the object’s name, the other objects in the same class diagram, and the object’s attributes in case it is a class. To illustrate this, suppose that an object named board is created. This object can mean either a piece of lumber or a group of people assembled for some purpose, so the name has two possible synsets, one for each meaning. Now suppose that there are other objects in the same diagram, such as ’board member’ and ’company’. These objects can be used to select the right synset for board, which is the one corresponding to the group of people.

Fig. 1. Example of a UML class diagram.

The disambiguation starts by extracting from WordNet the synsets corresponding to the object’s name. This requires the system to parse the object’s name, which most of the time is a composition of words. REBUILDER uses specific heuristics to choose the right word to use. For instance, only words corresponding to nouns are selected, because objects commonly correspond to entities or things. A morphological analysis must also be done, extracting the regular noun from the word. After this, a word or a composition of words has been identified and is searched for in WordNet. The result of this search is a set of synsets, from which the disambiguation algorithm selects one, the supposedly right one.

Suppose that the object to be disambiguated has the name ObjName (after the parsing phase), and that the lookup in WordNet has yielded n synsets: $s_1, \ldots, s_n$. This object has the context ObjContext, which comprises several names, $Name_1, \ldots, Name_m$, which can be object names and/or attribute names. Each of these context names has a list of corresponding synsets; for instance, $Name_j$ has p synsets: $ns_{j1}, \ldots, ns_{jp}$. The chosen synset for ObjName is given by:

$$ContextSynset(ObjName) = \operatorname*{arg\,min}_{s_i,\; 1 \le i \le n} SynsetScore(s_i, ObjContext) \tag{1}$$

where $s_i$ is the ith synset of ObjName (i goes from 1 to n). The chosen synset is the one with the lowest value of SynsetScore, which is given by:

$$SynsetScore(s, ObjContext) = \sum_{j=1}^{m} ShortestDist(s, Name_j) \tag{2}$$

where m is the number of names in ObjContext. The SynsetScore is the sum, over all context names, of the shortest distance between synset s and the synsets of $Name_j$, which is defined as:

$$ShortestDist(s, Name_j) = \min_{1 \le k \le p} SemanticDist(s, ns_{jk}) \tag{3}$$

where $ns_{jk}$ is the kth synset of $Name_j$ (k goes from 1 to p). The shortest distance is computed from the semantic distance between synset s and $ns_{jk}$. Three semantic distances have been developed; the next section describes them.

The ObjContext mentioned before comprises a set of names. These can be object names or attribute names, depending on the type of object that is being disambiguated. For instance, a class can have as context a combination of three aspects: its attributes, the objects in the class diagram that are adjacent to it, or all the objects in the diagram. Packages and interfaces do not have attributes, so only the last two aspects can be used. This yields the following combinations of disambiguation contexts:

– attributes (just for classes);
– neighbor objects;
– attributes and neighbor objects (just for classes);
– all the objects in the class diagram;
– attributes and all the objects in the class diagram (just for classes).
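Equations 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not REBUILDER's implementation: the function names, the string synset identifiers, the choice to skip context names that have no synsets, and the toy link counts are all hypothetical, and the semantic distance is passed in as a function so that any of the distances of the next section could be plugged in.

```python
from typing import Callable, Dict, List, Sequence

# Any function that maps two synset identifiers to a distance.
SemanticDist = Callable[[str, str], float]


def shortest_dist(s: str, name_synsets: Sequence[str], dist: SemanticDist) -> float:
    """Equation 3: smallest distance between s and the synsets of one context name."""
    return min(dist(s, ns) for ns in name_synsets)


def synset_score(s: str, context: Dict[str, List[str]], dist: SemanticDist) -> float:
    """Equation 2: sum of the shortest distances over all context names.

    Assumption: context names without any synset are skipped.
    """
    return sum(shortest_dist(s, synsets, dist)
               for synsets in context.values() if synsets)


def context_synset(candidates: Sequence[str], context: Dict[str, List[str]],
                   dist: SemanticDist) -> str:
    """Equation 1: the candidate synset with the lowest score."""
    return min(candidates, key=lambda s: synset_score(s, context, dist))


# Hypothetical link counts for the "board" example; the values are invented.
TOY_LINKS = {
    ("board#lumber", "company#organization"): 9,
    ("board#lumber", "member#person"): 8,
    ("board#committee", "company#organization"): 3,
    ("board#committee", "member#person"): 4,
}


def toy_dist(a: str, b: str) -> float:
    """Look up the invented distance between a candidate and a context synset."""
    return TOY_LINKS[(a, b)]


if __name__ == "__main__":
    context = {"company": ["company#organization"],
               "board member": ["member#person"]}
    print(context_synset(["board#lumber", "board#committee"], context, toy_dist))
    # prints: board#committee
```

With these toy values the lumber sense scores 17 and the committee sense scores 7, so the committee sense is selected, matching the board example given earlier in this section.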

The experiments section presents a study of the influence of each of these context combinations on the disambiguation accuracy.

4.4 Semantic Distance

As mentioned before, three semantic distances were developed. The first semantic distance is given by:

$$S_1(s_1, s_2) = 1 - \frac{1}{\ln\left(Min\{\forall Path(s_1, s_2)\} + 1\right) + 1} \tag{4}$$

where Min is the function returning the smallest element of a list, $Path(s_1, s_2)$ is a WordNet path between synsets $s_1$ and $s_2$, measured as the number of is-a relations between the synsets, and ln is the natural logarithm.

The second semantic distance, $S_2$, is similar to the one above, with the difference that the path can comprise other types of WordNet relations, and not just is-a relations. In REBUILDER we also use part-of, member-of, and substance-of relations.

The third semantic distance, $S_3$, is more complex and tries to use other aspects in addition to the distance between synsets. This metric is based on three factors. The first is the distance between $s_1$ and $s_2$ in the WordNet ontology ($D_1$), using all the types of relations. The second uses the Most Specific Common Abstraction (MSCA) of synsets $s_1$ and $s_2$. The MSCA is the most specific synset that is an abstraction of both synsets. Considering the distance between $s_1$ and the MSCA, $D(s_1, MSCA)$, and the distance between $s_2$ and the MSCA, $D(s_2, MSCA)$, the second factor is the relation between these two distances ($D_2$). This factor tries to account for the level of abstraction of the concepts. The last factor is the relative depth of the MSCA in the WordNet ontology ($D_3$), which tries to reflect the objects’ level of abstraction. Formally, we have:

$$S_3(s_1, s_2) = \begin{cases} +\infty & \text{if no } MSCA \text{ exists} \\ 1 - \omega_1 \cdot D_1 + \omega_2 \cdot D_2 + \omega_3 \cdot D_3 & \text{if an } MSCA \text{ exists} \end{cases} \tag{5}$$

where $\omega_1$, $\omega_2$ and $\omega_3$ are the weights associated with each factor. The weights were selected based on empirical work and are 0.55, 0.3, and 0.15, respectively.

$$D_1 = 1 - \frac{D(s_1, MSCA) + D(s_2, MSCA)}{2 \cdot DepthMax} \tag{6}$$

where DepthMax is the maximum depth of the is-a tree of WordNet. Its current value is 14 for WordNet version 1.6.

$$D_2 = \begin{cases} 1 & \text{if } s_1 = s_2 \\ 1 - \dfrac{|D(s_1, MSCA) - D(s_2, MSCA)|}{\sqrt{D(s_1, MSCA)^2 + D(s_2, MSCA)^2}} & \text{if } s_1 \neq s_2 \end{cases} \tag{7}$$

$$D_3 = \frac{Depth(MSCA)}{DepthMax} \tag{8}$$
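The formulas of this section translate directly into code. The sketch below covers equation 4 and the three factors of equations 6 to 8. The function and parameter names are hypothetical, and the path lengths, the distances to the MSCA and the MSCA depth are assumed to have been obtained from WordNet beforehand. The combination of the factors into $S_3$ (equation 5) is deliberately left out.

```python
import math
from typing import Iterable

DEPTH_MAX = 14  # maximum depth of the is-a tree, as stated for WordNet 1.6


def s1_distance(path_lengths: Iterable[int]) -> float:
    """Equation 4, applied to the lengths of the paths between two synsets."""
    shortest = min(path_lengths)
    return 1.0 - 1.0 / (math.log(shortest + 1) + 1.0)


def d1(dist_s1_msca: float, dist_s2_msca: float,
       depth_max: int = DEPTH_MAX) -> float:
    """Equation 6: factor based on the distances of both synsets to the MSCA."""
    return 1.0 - (dist_s1_msca + dist_s2_msca) / (2.0 * depth_max)


def d2(dist_s1_msca: float, dist_s2_msca: float,
       same_synset: bool = False) -> float:
    """Equation 7: relation between the two distances to the MSCA.

    For two different synsets at least one distance is assumed to be positive,
    so the square root below is non-zero.
    """
    if same_synset:
        return 1.0
    norm = math.sqrt(dist_s1_msca ** 2 + dist_s2_msca ** 2)
    return 1.0 - abs(dist_s1_msca - dist_s2_msca) / norm


def d3(msca_depth: float, depth_max: int = DEPTH_MAX) -> float:
    """Equation 8: relative depth of the MSCA in the is-a tree."""
    return msca_depth / depth_max


if __name__ == "__main__":
    print(round(s1_distance([3, 10]), 3))   # shortest path of 3 links
    print(round(s1_distance([16]), 3))      # path of 16 links
    print(d1(2, 2), d2(3, 4), d3(7))
```

A path of length 0 gives a distance of 0, and the value grows slowly with the number of links: 3 links give about 0.581 and 16 links about 0.739.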

In equation 8, Depth(MSCA) is the depth of the MSCA in the is-a tree of WordNet.

4.5 Object Retrieval and Similarity

A case has a main package named root package (since a package can contain sub-packages). REBUILDER can retrieve cases or pieces of cases, depending on the user query. In the first situation the retrieval module returns packages only, while in the second it can retrieve classes or interfaces. The retrieval module treats both situations in the same way, since it searches the case library for software objects that satisfy the query.

The retrieval algorithm has two distinct phases. First it uses the context synsets of the query objects to get N objects from the case library, where N is the number of objects to be retrieved. This search uses the WordNet semantic relations, which work like a conceptual graph, and the case indexes, which relate the case objects to WordNet synsets. The second phase ranks the set of retrieved objects using object similarity metrics.

In the first phase the algorithm uses the context synset of the query object as an entry point into the WordNet graph. It then gets the objects that are indexed by this synset using the case indexes. Only objects of the same type as the query are retrieved. If the number of objects found does not reach N, the search is expanded to the neighbor synsets using only the is-a relations, and the algorithm gets the new set of objects indexed by these synsets. If there are still not enough objects, the system keeps expanding until it reaches the desired number of objects or until there is nothing more to expand. The result of this phase is a set of N objects.

The second phase ranks these objects by similarity with the query (see [8] for more details about the similarity metrics). Ranking is based on object similarity, and we consider three types of similarity: package, class, and interface. Package similarity is based on four aspects: type similarity, sub-package similarity, diagram similarity, and dependency similarity. Type similarity evaluates the distance between the query synset and the retrieved object synset, measured in terms of the number of semantic relations between the synsets. The second aspect returns the similarity between the query sub-packages and the retrieved sub-packages, which is a recursive call to the package similarity metric. Diagram similarity is based on the similarity between the UML objects of both packages. As both diagrams are graphs, this metric computes the graph similarity between diagrams, taking into account the types of nodes (classes or interfaces). The last term of the package similarity is the external similarity computed using the package dependencies, which are UML relations between packages. Class similarity is based on three features: type similarity, intra-class similarity, and inter-class similarity. The type similarity is the same as in package similarity, evaluating the distance between the objects’ synsets. The intra-class similarity is based on the attributes and methods of the classes. The inter-class similarity is based on the classes’ relations. Interface similarity is equal to class similarity, except that the intra-interface similarity is based only on the interface methods, since interfaces do not have attributes.
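The first retrieval phase described above is essentially a breadth-first expansion over is-a links, starting at the query's context synset. The following is a minimal sketch under stated assumptions: the function name, the dictionary-based graph and index, and the toy data are hypothetical, and stopping as soon as N objects have been collected is one possible reading of the description. The ranking phase is not covered.

```python
from collections import deque
from typing import Dict, List, Tuple


def retrieve_candidates(query_synset: str, object_type: str, n: int,
                        isa_neighbors: Dict[str, List[str]],
                        case_index: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Collect up to n indexed objects of the query's type.

    isa_neighbors maps a synset to the synsets linked to it by is-a relations.
    case_index maps a synset to the (object name, object type) pairs it indexes.
    """
    found: List[str] = []
    visited = {query_synset}
    frontier = deque([query_synset])
    while frontier and len(found) < n:
        synset = frontier.popleft()
        # Keep only objects of the same type as the query.
        for name, kind in case_index.get(synset, []):
            if kind == object_type and len(found) < n:
                found.append(name)
        # Expand the search to is-a neighbors that were not visited yet.
        for neighbor in isa_neighbors.get(synset, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return found


if __name__ == "__main__":
    # Hypothetical toy graph and case index, invented for illustration.
    isa = {
        "bank": ["financial_institution"],
        "financial_institution": ["bank", "institution"],
        "institution": ["financial_institution"],
    }
    index = {
        "bank": [("Bank", "class")],
        "financial_institution": [("CreditUnion", "class"), ("Finance", "package")],
        "institution": [("University", "class")],
    }
    print(retrieve_candidates("bank", "class", 2, isa, index))
    # prints: ['Bank', 'CreditUnion']
```

When fewer than N matching objects exist, the loop ends once there is nothing more to expand and returns whatever was found.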

5 Example

This section presents an example of WSD using a simple class diagram (see Figure 2). Taking the class diagram in Figure 2 as a starting point, suppose that the Bank class has to be disambiguated. The WSD method takes every object name and gets the corresponding synsets from WordNet:

[Figure: UML class diagram with the classes Institution, Customer, Bank, Payment, BankAccount, and CreditCard; the associations carry 1..* multiplicities.]

Fig. 2. Class diagram for a Banking Company used in the WSD example.

BankAccount
– ba1 (111270738) a fund that a customer has entrusted to a bank and from which they can make withdrawals.

CreditCard
– cc1 (111285084) a card (usually plastic) that assures a seller that the person using it has a satisfactory credit rating.

Payment
– p1 (111200962) a sum of money paid.
– p2 (100852841) the act of paying money.

Institution
– i1 (106689622) an organization founded and united for a specific purpose.
– i2 (103114802) an establishment consisting of a building or complex of buildings where an organization for the promotion of some cause is situated.
– i3 (104866972) a custom that for a long time has been an important feature of some group or society.
– i4 (100185219) the act of starting something for the first time.
– i5 (103263734) a hospital for mentally incompetent or unbalanced persons.

Customer
– c1 (108196833) someone who pays for goods or services.

Bank
– s1 (106948080) a financial institution that accepts deposits and channels the money into lending activities.
– s2 (107572141) sloping land (especially the slope beside a body of water).
– s3 (111278196) a supply or stock held in reserve for future use (especially in emergencies).
– s4 (102424236) a building in which commercial banking is transacted.
– s5 (106973175) an arrangement of similar objects in a row or in tiers.
– s6 (103609645) a container (usually with a slot in the top) for keeping money at home.
– s7 (107572010) a long ridge or pile.
– s8 (111268031) the funds held by a gambling house or the dealer in some gambling games.
– s9 (107572386) a slope in the turn of a road or track.
– s10 (100129500) a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning).

The WSD method computes all the semantic distances between the bank synsets (s1 to s10) and all the other synsets that form the context. These distances, in number of links between synsets, are presented in Table 1.

Table 1. The semantic distances (in number of links between synsets in WordNet) that are computed by the WSD method.

       ba1  cc1   p1   p2   i1   i2   i3   i4   i5   c1
s1      16   15   13   13    3   11   16   15   12   11
s2      15   16   12   10   10    9   15   13   11   11
s3      18   11   13   14   12    7   17   15   11   13
s4      16   14    9   10   10    7   15   10    8   11
s5      16   13   12   12    7    9   14   13   11   10
s6      16   14   10   10    9    7   14   12    9    8
s7      16   16   12   11   10   10   16   14   12   11
s8      17    8   10   11   10   10   17   12   13   11
s9      15   16   12   11   10    9   15   13   11   11
s10     17   20   16   14   17   15   19   13   16   11

Based on the distance values presented in Table 1, the WSD method computes the semantic distances using equation 4. Notice that when a context name has several synsets, as is the case for payment (p1 and p2) and institution (i1 to i5), the method chooses the minimum distance. The results are presented in Table 2. The best synset is s1, the one with the lowest sum, which corresponds to the correct synset for the object Bank.

Table 2. The semantic distances computed by the WSD method.

        ba1    cc1      p      i     c1    Sum
s1    0.739  0.735  0.725  0.581  0.713   3.49
s2    0.735  0.739  0.706  0.697  0.713   3.59
s3    0.746  0.713  0.725  0.675  0.725   3.59
s4    0.739  0.730  0.697  0.675  0.713   3.55
s5    0.739  0.725  0.719  0.675  0.706   3.56
s6    0.739  0.730  0.706  0.675  0.687   3.54
s7    0.739  0.739  0.713  0.706  0.713   3.61
s8    0.743  0.687  0.706  0.706  0.713   3.55
s9    0.735  0.739  0.713  0.697  0.713   3.60
s10   0.743  0.753  0.730  0.725  0.713   3.66
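The figures of this example can be recomputed from the link counts. The sketch below hard-codes the link counts listed in the source for each context synset (ten values each, for s1 to s10), applies equation 4 to the minimum count per context name, and sums the results. The variable and function names are illustrative choices, not part of REBUILDER.

```python
import math

# Link counts from the example: for each context synset, the number of links
# to each of the ten bank synsets s1..s10, in that order.
LINKS = {
    "ba1": [16, 15, 18, 16, 16, 16, 16, 17, 15, 17],
    "cc1": [15, 16, 11, 14, 13, 14, 16, 8, 16, 20],
    "p1": [13, 12, 13, 9, 12, 10, 12, 10, 12, 16],
    "p2": [13, 10, 14, 10, 12, 10, 11, 11, 11, 14],
    "i1": [3, 10, 12, 10, 7, 9, 10, 10, 10, 17],
    "i2": [11, 9, 7, 7, 9, 7, 10, 10, 9, 15],
    "i3": [16, 15, 17, 15, 14, 14, 16, 17, 15, 19],
    "i4": [15, 13, 15, 10, 13, 12, 14, 12, 13, 13],
    "i5": [12, 11, 11, 8, 11, 9, 12, 13, 11, 16],
    "c1": [11, 11, 13, 11, 10, 8, 11, 11, 11, 11],
}

# Context names of the Bank class and the synsets of each name.
CONTEXT = {
    "BankAccount": ["ba1"],
    "CreditCard": ["cc1"],
    "Payment": ["p1", "p2"],
    "Institution": ["i1", "i2", "i3", "i4", "i5"],
    "Customer": ["c1"],
}


def s1_distance(links: int) -> float:
    """Equation 4 for a shortest path of the given number of links."""
    return 1.0 - 1.0 / (math.log(links + 1) + 1.0)


def bank_scores() -> dict:
    """Score of each bank synset: sum over context names of the minimum distance."""
    scores = {}
    for i in range(10):
        total = 0.0
        for synsets in CONTEXT.values():
            shortest = min(LINKS[s][i] for s in synsets)
            total += s1_distance(shortest)
        scores[f"s{i + 1}"] = total
    return scores


if __name__ == "__main__":
    scores = bank_scores()
    best = min(scores, key=scores.get)
    print(best, round(scores[best], 2))  # prints: s1 3.49
```

Because equation 4 grows with the number of links, taking the minimum link count first and then applying the formula gives the same value as applying the formula to every synset and taking the minimum afterwards.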

6 Experimental Results

This section presents two studies of the proposed WSD method: the first identifies the best context configuration for object disambiguation, and the second identifies the best semantic distance. The KB we use for the tests comprises a case library with 60 cases. Each case comprises a package with 5 to 20 objects (the total number of objects in the knowledge base is 586). Each object has up to 20 attributes and up to 20 methods. The goal is to disambiguate each case object in the KB. After running the WSD method for each object, we collected the selected synsets and compared them with the synsets assigned by a human designer. The percentage of matching synsets determines the accuracy of the method. To study the influence of the context definition on the disambiguation accuracy, we considered five different configurations (see Table 3).

Table 3. The definitions for the context configurations.

Configuration   Object Attributes   Neighbor Objects   All Objects
C1              Yes                 No                 No
C2              No                  Yes                No
C3              Yes                 Yes                No
C4              No                  No                 Yes
C5              Yes                 No                 Yes
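The accuracy measure used in these experiments (the percentage of objects whose selected synset matches the synset assigned by the human designer) can be written as a short function. This is a minimal sketch; the function name, the dictionary representation and the sample data are assumptions made for illustration.

```python
from typing import Dict


def accuracy(selected: Dict[str, str], reference: Dict[str, str]) -> float:
    """Percentage of objects whose selected synset equals the reference synset.

    selected maps an object name to the synset chosen by the WSD method;
    reference maps the same names to the synsets assigned by a human designer.
    """
    if not reference:
        raise ValueError("the reference annotation is empty")
    matches = sum(1 for name, synset in reference.items()
                  if selected.get(name) == synset)
    return 100.0 * matches / len(reference)


if __name__ == "__main__":
    # Invented sample: two of the three objects match the reference.
    chosen = {"Bank": "s1", "Customer": "c1", "Payment": "p2"}
    gold = {"Bank": "s1", "Customer": "c1", "Payment": "p1"}
    print(f"{accuracy(chosen, gold):.2f}%")  # prints: 66.67%
```

Running the same function once per context configuration, with the context built as in Table 3, would give one accuracy value per configuration.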

The accuracy results we obtained are presented in Table 4. These results show that the best result is reached by configuration C4, and that configuration C5 presents slightly worse results than C4. This is due to the abstract nature of some of the attributes, which introduces ambiguity into the disambiguation method. For instance, an attribute called name will not help in the disambiguation task, since this attribute is used in many objects and is very ambiguous.

Table 4. The accuracy values for the context configurations.

Configuration   C1       C2       C3       C4       C5
Accuracy        60.19%   68.47%   68.79%   71.18%   71.02%

REBUILDER uses three different semantic distances, as described in section 4.4. These distances are: S1, which uses only the is-a links of WordNet; S2, which uses the is-a, part-of, member-of and substance-of links; and S3, the more complex metric described in section 4.4. The previous results (shown in Table 4) were obtained with S2. Combinations of these three distances with the best context configurations (C4 and C5) were used to study the influence of the semantic distance on the accuracy of the WSD method. Results are shown in Table 5.

Table 5. The accuracy values for the different semantic distances.

Accuracy   C4       C5
S1         69.27%   69.11%
S2         71.18%   71.02%
S3         64.97%   65.13%

Experimental results show that semantic distance S2 obtains the best accuracy values, followed by S1, and finally S3. Context configuration C4 continues to get better results than C5, except for semantic distance S3.

7 Conclusions

This paper presents an approach to the WSD problem applied to the classification and retrieval of software designs. We describe a WSD method developed for use in the context of software design, along with the configurations that can be used with it: context and semantic distance definitions. We also present the experimental results obtained for the different context and semantic distance configurations, showing how they influence the accuracy of our method. We use only WordNet, while most other systems use a combination of WordNet and a text corpus. Our results show that our method achieved its best results using a simple semantic distance in combination with a context that comprises all the objects in the design model. Some of the potential benefits of WSD in CASE tools are: providing software object classification, which enables semantic retrieval and similarity judgment; and improving system usability. Another advantage is that it widens the range of terms that software designers can use in object names, attributes and methods, in contrast to a CASE tool that constrains the terms to be used. One of the limitations of our method is the lack of more specific semantic relations in WordNet. We think that more semantic relations between synsets would improve the accuracy of our WSD method.

Acknowledgments This work was supported by POSI - Programa Operacional Sociedade de Informação of Fundação Portuguesa para a Ciência e Tecnologia and European Union FEDER, under contract POSI/33399/SRI/2000, and by program PRAXIS XXI.


References

1. Barry Boehm, A spiral model of software development and enhancement, IEEE Press, 1988.
2. Stefano Borgo, Nicola Guarino, Claudio Masolo, and Guido Vetere, Using a large linguistic ontology for internet-based retrieval of object-oriented components, 9th International Conference on Software Engineering and Knowledge Engineering, SEKE’97 (Madrid, Spain), Knowledge Systems Institute, Illinois, 1997, pp. 528–534.
3. J. F. M. Burg and R. P. van de Riet, Trully intelligent case environments profit from linguistics, 9th International Conference on Software Engineering and Knowledge Engineering, SEKE’97 (Madrid, Spain), Knowledge Systems Institute, Illinois, 1997, pp. 407–414.
4. Bruce A. Burton, Rhonda Wienk Aragon, Stephen A. Bailey, Kenneth D. Koehler, and Lauren A. Mayes, The reusable software library, IEEE Software 4 (1987), no. July 1987, 25–32.
5. Premkumar Devanbu, Ronald J. Brachman, Peter G. Selfridge, and Bruce W. Ballard, Lassie: A knowledge-based software information system, Communications of ACM 34 (1991), no. 5, 34–49.
6. William Gale, Kenneth Church, and David Yarowsky, Estimating upper and lower bounds on the performance of word-sense disambiguation programs, 30th Annual Meeting of the ACL, ACL, 1992, pp. 249–256.
7. M. R. Girardi and B. Ibrahim, A similarity measure for retrieving software artifacts, 6th International Conference on Software Engineering and Knowledge Engineering (Jurmala, Latvia), 1994, pp. 478–485.
8. Paulo Gomes, Francisco C. Pereira, Paulo Paiva, Nuno Seco, Paulo Carreiro, José L. Ferreira, and Carlos Bento, Case retrieval of software designs using wordnet, European Conference on Artificial Intelligence (ECAI’02) (Lyon, France) (F. van Harmelen, ed.), IOS Press, Amsterdam, 2002.
9. Nancy Ide and Jean Veronis, Introduction to the special issue on word sense disambiguation: The state of the art, Computational Linguistics 24 (1998), no. 1, 1–40.
10. Oi Yee Kwong, Word sense disambiguation with an integrated lexical resource, NAACL Workshop on WordNet and Other Lexical Resources (Pittsburgh, USA), 2001.
11. Xiaobin Li, Stan Szpakowicz, and Stan Matwin, A wordnet-based algorithm for word sense disambiguation, 14th International Joint Conference on Artificial Intelligence (IJCAI’95) (Montreal, Canada), 1995, pp. 1368–1374.
12. R. Mihalcea and D. Moldovan, A method for word sense disambiguation of unrestricted text, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99) (Maryland, NY, USA), 1999.
13. Rada Mihalcea and Dan I. Moldovan, Word sense disambiguation based on semantic density, COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems (Montreal, Canada), 1998.
14. G. Miller, M. Chodorow, S. Landes, C. Leacock, and R. Thomas, Using a semantic concordance for sense identification, Proceedings of ARPA Human Language Technology Workshop, 1994, pp. 240–243.
15. George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller, Introduction to wordnet: an on-line lexical database, International Journal of Lexicography 3 (1990), no. 4, 235–244.
16. Vivi Nastase and Stan Szpakowicz, Word sense disambiguation in roget’s thesaurus using wordnet, NAACL 2001 Workshop on WordNet and other Lexical Resources (Pittsburgh, USA), June 2001, pp. 17–22.
17. Rubén Prieto-Diaz, Implementing faceted classification for software reuse, Communications of the ACM 34 (1991), no. 5, 88–97.
18. Philip Resnik, Disambiguating noun groupings with respect to Wordnet senses, Proceedings of the Third Workshop on Very Large Corpora (Somerset, New Jersey) (David Yarowsky and Kenneth Church, eds.), Association for Computational Linguistics, 1995, pp. 54–68.
19. Philip Resnik, Using information content to evaluate semantic similarity in a taxonomy, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (San Mateo) (Chris S. Mellish, ed.), Morgan Kaufmann, August 20–25 1995, pp. 448–453.
20. German Rigau, Jordi Atserias, and Eneko Agirre, Combining unsupervised lexical knowledge methods for word sense disambiguation, Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (Somerset, New Jersey) (Philip R. Cohen and Wolfgang Wahlster, eds.), Association for Computational Linguistics, 1997, pp. 48–55.
21. J. Rumbaugh, I. Jacobson, and G. Booch, The unified modeling language reference manual, Addison-Wesley, Reading, MA, 1998.
22. Alan F. Smeaton and Ian Quigley, Experiments on using semantic distances between words in image caption retrieval, 19th International Conference on Research and Development in Information Retrieval (Zurich, Switzerland), 1996, pp. 174–180.
23. Carsten Tautz and Klaus-Dieter Althoff, Using case-based reasoning for reusing software knowledge, International Conference on Case-Based Reasoning (ICCBR’97) (Providence, RI, USA) (David Leake and Enric Plaza, eds.), Springer-Verlag, 1997, pp. 156–165.
24. Yorick Wilks and Mark Stevenson, The grammar of sense: Is word-sense tagging much more than part-of-speech tagging?, Tech. Report CS-96-05, Department of Computer Science, University of Sheffield, Sheffield, UK, 1996.
25. David Yarowsky, Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora, Proceedings of COLING-92 (Nantes, France), July 1992, pp. 454–460.
