Textual Similarity: Comparing texts in order to discover how closely they discuss the same topics

Andreas Schmidt Jensen & Niklas Skamriis Boss

Kongens Lyngby 2008 IMM-BSc-2008-15

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 45253351, Fax +45 45882673
[email protected]
www.imm.dtu.dk

Summary

This thesis describes the design and implementation of a tool for measuring textual similarity. The thesis looks into different aspects of text processing and graph searching in an attempt to define similarity. Furthermore, a solution for measuring textual similarity is proposed and implemented. Challenges such as word sense disambiguation and part-of-speech tagging, as well as several graph searching algorithms, are described and used in the measurements. The developed tool is tested against human evaluations of textual similarity, and it is concluded that the tool is, to some degree, able to measure textual similarity with the same results as a human being.


Resumé

Denne afhandling beskriver design og implementering af et værktøj til bestemmelse af tekstmæssig lighed. Afhandlingen kigger på forskellige aspekter af tekstprocessering og søgning i grafer, i et forsøg på at definere lighed. En løsning til bestemmelse af tekstmæssig lighed foreslås og implementeres. Udfordringer som tvetydige betydninger af ord, bestemmelse af ordklasser og forskellige grafsøgningsalgoritmer beskrives og benyttes i beregningerne. Værktøjet testes ved at bruge menneskelige vurderinger af tekstmæssig lighed, og det konkluderes at værktøjet i nogen grad er i stand til at bestemme tekstmæssig lighed med samme resultat som et menneske.


Preface

This thesis was prepared at DTU Informatics at the Technical University of Denmark from February through June 2008, as part of the requirements for acquiring the B.Sc. degree in engineering. The goal of the thesis was to implement a tool which is able to compare texts in order to discover how closely they discuss the same topics. Different aspects of text processing and graph searching had to be dealt with, such as word sense disambiguation and the maximum flow problem. This report documents the work on the tool, from the early design phase through implementation and optimization. Finally, the tool's results are compared to how a human compares texts, and the results are discussed.

Lyngby, June 2008

Andreas Schmidt Jensen
Niklas Skamriis Boss


Acknowledgements

We would like to thank our supervisor, Robin Sharp, who has provided help and advice through the five months the project has lasted.


Contents

Summary
Resumé
Preface
Acknowledgements
1 Introduction
2 Requirements
  2.1 Functional requirements
  2.2 Additional requirements
3 Defining similarity
  3.1 Textual similarity
  3.2 Relations between words
  3.3 Semantic similarity
  3.4 Taxonomy
  3.5 WordNet
  3.6 Similarity of words
  3.7 Similarity of two sentences
  3.8 Similarity of texts
  3.9 Summary
4 Designing the tool
  4.1 Similarity engine
  4.2 Graphical User Interface
  4.3 Connecting model and view
5 Algorithms
  5.1 Dynamic programming
  5.2 Levenshtein distance
  5.3 Longest common subsequence
  5.4 Lowest common ancestor
  5.5 Maximum flow
  5.6 Genetic algorithms
  5.7 The Lesk algorithm
6 Implementation in short
  6.1 Model
  6.2 View
  6.3 Control
  6.4 Considerations
  6.5 Result
7 Optimization
  7.1 Bottlenecks
  7.2 Threads
8 Testing
  8.1 General performance
  8.2 Human evaluation
  8.3 Efficiency
  8.4 Functional test
  8.5 Graphical User Interface
9 Discussion
  9.1 File formats
  9.2 Command line usage
  9.3 Adjusting evaluations
  9.4 Neural networks
  9.5 Additional optimizations
  9.6 Similarity using WordNet
10 Conclusion
A User manual
  A.1 Comparison of two texts
  A.2 Performance comparison
  A.3 Folder comparison
  A.4 Other possibilities
B Implementation
  B.1 Model
  B.2 View
  B.3 Control
C Test results
  C.1 Human evaluation
  C.2 Correlation
  C.3 Performance
  C.4 Functional test cases
  C.5 Use cases
D Pseudocode
  D.1 Genetic algorithms
  D.2 Levenshtein distance
  D.3 Longest common subsequence
  D.4 Lowest Common Ancestor
  D.5 Adapted Lesk algorithm
E Source code

List of Figures

3.1 WordNet noun structure
3.2 WordNet verb structure
3.3 Bipartite graph
3.4 The sigmoid function
5.1 Bipartite matching
6.1 UML Diagram of the model
7.1 Performance before optimization
7.2 Performance after structural optimization of Lesk algorithm
A.1 Main window of the tool
A.2 Main window and result window of a comparison
A.3 Result window for performance comparison
A.4 Result window for folder comparison
B.1 UML Diagram of graph part of the model
B.2 Flow network
B.3 UML Diagram of text processing part of the model
B.4 UML Diagram of the model
B.5 UML Diagram of the view
B.6 UML Diagram of the control
B.7 UML Diagram of the tool
C.1 WordNet: Hypernymy disabled for WSD
C.2 WordNet: Hypernymy enabled for WSD
C.3 Edit distance
C.4 Combination of WordNet and edit distance
C.5 Relation between number of words and performance (WordNet)
C.6 Relation between number of words and performance (WordNet)
C.7 Relation between number of words and performance (Levenshtein)
D.1 Steps of LCA

Chapter

1 Introduction

Textual similarity is a concept which can be seen as a way of describing the similarity between strings. A string can carry meaning, i.e. semantics can be derived from it. The semantic similarity of strings is a special case of semantic relatedness, which has its roots in artificial intelligence and dates back to 1968 [14]. Semantic similarity is used as a tool to find similar concepts, whereas semantic relatedness is used to find related concepts. For instance, a car and a steering wheel are more related than a car and a bicycle, but the latter pair would be considered more similar. Semantic relatedness is used in the Google search engine, where it determines whether strings are related in terms of occurrences on a web page [3].

An often-used tool for similarity measurements is the WordNet project developed at Princeton University [2][13][14]. It consists of words from the English language, where each word is related to other words through relations such as synonymy and antonymy, and more advanced concepts such as hyponymy and hypernymy. These relations enable the user of WordNet to discover that a car and a steering wheel are related, that dark and light are antonyms, that bike and bicycle are synonyms, and so on.

This project proposes two different solutions for measuring textual similarity: one using the WordNet project to measure semantic similarity, and one using


the so-called edit distance between strings. These solutions are very different by nature, so a comparison between them will be made as well.

This report describes the development of a tool for measuring textual similarity. Chapter 2 contains a description of the requirements for the tool; we describe the minimum requirements and propose several extensions. In chapter 3 we define a method for measuring similarity between texts. The design of the tool is described in chapter 4. Chapter 5 describes in detail the algorithms used in the comparison. In chapter 6 we discuss the implementation of the tool. Chapter 7 discusses optimization and proposes different optimizations which increase the performance of the tool. Chapter 8 documents the tests we have run and discusses the results. In chapter 9 we discuss some of the extensions which did not make it into the final tool. Chapter 10 concludes the report.

Other projects concerning similarity have also been described in the literature. Most of the literature is concerned with similarity between single words using WordNet, and different methods for measuring this similarity have been proposed. Most articles focus on a method called information content, which uses the frequency of a word to measure the probability of encountering the word in a given corpus. Our project is more concerned with the similarity of entire texts, and information content has not been used. Instead, our primary source of inspiration is a project by Thanh Ngoc Dao and Troy Simpson, who propose a method for measuring similarity between sentences [6]. They have implemented an application in C#.NET which is able to measure similarity between sentences, and some of the ideas from their project have been reused in ours. The main inspiration is the idea of measuring similarity between sentences, i.e. collecting similarity values for the words in the sentences and combining these values into a similarity score for the sentences, which is something we have not found elsewhere. We have extended this idea by enabling our tool to combine similarity of sentences into a similarity score for entire texts.

Chapter

2 Requirements

Before we can decide on how the similarity tool is going to work, we need to specify what will be required of the final tool. In this chapter, we state what is required of each part of the tool as a minimum, i.e. the goals that must be reached in order to get a working tool. Since these are only the minimum requirements, we also discuss several ways of extending the tool. These extensions are proposed in the initial phases, so it is not yet clear whether they are realistic or too optimistic for a project like this.

2.1

Functional requirements

This project has been given some external requirements. These include the ability to compare two texts using at least two different algorithms; furthermore, it should be possible to make a performance comparison between the different approaches. The performance measurements should be based on the accuracy of the measurements and the computation times. The given requirements form the basis for our own additional requirements.


Our own additions are based on previous experience with software development and on what we believe makes a good and useful application.

2.1.1

Software requirements

The subject of this project is similarity between texts. Therefore, the resulting application should, as an absolute minimum, be a tool which, given two texts, computes a score that describes how much the two texts are alike. The similarity should be based on whether the words mean the same, but should not consider whether the overall meaning of the sentences in the texts is similar. Similar words are words which to some extent are described by the same subject. For example, red and blue are similar because they are both colors. A car and an apple are not very similar, since the subject they have in common would be something like “object”. We say that the more specific the common subject of a pair of words is, the more similar the words are. Deciding these similarities manually would be a huge task, so a word database is needed. The requirements for this database are that it must contain a large part of the English language, and each word in the database must have relations to more and less specific descriptions of the same topic.

2.1.2

Graphical User Interface

The Graphical User Interface (GUI) should enable the user to select two texts for comparison. The GUI is created to provide the user with an easy and intuitive tool, enabling them to compare texts using the mouse and keyboard. There should be a progress bar which informs the user of how much progress has been made during a comparison. This also avoids confusion about whether a computation has been started. The GUI should consist of several windows. The main window enables the user to compare texts. It must be possible to specify different engine settings, specify texts, and view intermediate results while the engine is working. When the texts have been compared and a score has been obtained, a result window should appear. This window must provide the user with information about several things:

• Similarity result, as a number within some range


• Performance, including timing of
  – Preprocessing
  – Database lookup
  – Computations
  – Total time, possibly shown as a bar chart

2.1.3

Functional extensions

The discussed requirements are what the tool, as a minimum, should consist of. We now discuss several extensions which could be interesting. At this time it is unknown whether these ideas are realistic. An extension should provide more usability or additional functionality to the user of the program.

Adjusting the evaluations The methods for computing similarity could contain parameters which can be adjusted to change the result of a computation. This suggests an extension where the user is able to adjust these parameters using a component in the graphical user interface.

Command line usage It could be convenient to compare many texts in situations where mutual relations and performance are not relevant, meaning that the main functionality provided by the GUI is not needed. The tool could be extended with the possibility of providing command line parameters when it is launched, i.e. specifying texts and settings without showing the GUI. The application should then compute the similarity and show the results in the terminal.

Comparing sentences After comparing two texts, it could be interesting to narrow down to the sentences in the texts to find the most related sentences, and to see which words they have in common. This could for instance be done by allowing the user to choose a sentence from each of the two texts for comparison.

File formats The application should as a minimum be able to handle plain text (txt) files, so an obvious extension is to allow other formats, such as PDF, DOC and ODF. One way to implement this is to use an API for reading these formats and then handle the documents as normal text.


Finding best matches This extension suggests providing a folder with multiple texts along with a single text, which is then compared to the contents of the folder. The result should be a sorted list of the texts in the folder, allowing the user to see which texts are most related.

Heatmap A way of visualizing the result of a comparison is to create a so-called heatmap, where different values are represented as different colors. This could enable the user to see which parts of two texts are most alike. For instance, sentences which are very similar could be represented by green, while sentences which are very different could be red.
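The score-to-color mapping behind such a heatmap can be sketched as follows. This is only an illustration of the idea; the choice of a linear red-to-green gradient, and the class and method names, are our own assumptions.

```java
import java.awt.Color;

public class Heatmap {
    // Maps a similarity score in [0, 1] to a color between red (very different)
    // and green (very similar). The linear gradient is an assumed color scheme.
    public static Color colorFor(double similarity) {
        double s = Math.max(0.0, Math.min(1.0, similarity)); // clamp to [0, 1]
        int red = (int) Math.round(255 * (1.0 - s));
        int green = (int) Math.round(255 * s);
        return new Color(red, green, 0);
    }

    public static void main(String[] args) {
        System.out.println(colorFor(1.0)); // pure green: identical sentences
        System.out.println(colorFor(0.0)); // pure red: completely different
    }
}
```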

Multiple texts One idea is to compare more than two texts at once to discover how they mutually relate. This means computing a similarity score for every combination of the given texts. The idea is that the user provides a folder with a number of texts, and the application shows which texts are related. This could for instance be shown as a 2D graph, where closely related texts are grouped together.
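The multiple-texts idea amounts to computing a score for every pair of texts. A minimal sketch is shown below; the class name, the pair-key format and the toy shared-words measure used in main are assumptions for illustration, and the real tool would plug in its WordNet-based or edit-distance measure instead.

```java
import java.util.*;
import java.util.function.BiFunction;

public class PairwiseComparison {
    // Computes a similarity score for every unordered pair of texts.
    // The similarity measure is pluggable, so any of the tool's measures fits.
    public static Map<String, Double> compareAll(List<String> texts,
                                                 BiFunction<String, String, Double> similarity) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int i = 0; i < texts.size(); i++)
            for (int j = i + 1; j < texts.size(); j++)
                scores.put(i + "-" + j, similarity.apply(texts.get(i), texts.get(j)));
        return scores;
    }

    public static void main(String[] args) {
        // Toy measure for illustration only: fraction of shared words (Jaccard).
        BiFunction<String, String, Double> sharedWords = (a, b) -> {
            Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> both = new HashSet<>(wa); both.retainAll(wb);
            Set<String> all = new HashSet<>(wa); all.addAll(wb);
            return all.isEmpty() ? 1.0 : (double) both.size() / all.size();
        };
        System.out.println(compareAll(List.of("a b c", "a b d", "x y"), sharedWords));
    }
}
```

For n texts this yields n(n−1)/2 scores, which is exactly the input a 2D grouping of related texts would need.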

2.2

Additional requirements

The functional requirements give rise to a range of technical requirements for the underlying system. These requirements can be met using many different tools. First of all, we need some kind of word database which is able to communicate easily with Java, since this is our chosen programming language. The lexical word database WordNet meets the lexical requirements we have. In order to communicate with Java, an API called JAWS (Java API for WordNet Search) is used. WordNet includes nouns, verbs, adjectives and adverbs. As a minimum requirement regarding words and word classes, we will include nouns and verbs in the similarity computations.

2.2.1

Engine

The engine is the main part of the tool, and is the part which does the actual computations. It must be able to communicate with WordNet through


JAWS. Furthermore, it must be able to use techniques for computing a similarity score using WordNet. The engine must also be able to compute a similarity score using a different measure called edit distance. Edit distance describes how much a word needs to be changed in order to become another word. Having two ways of measuring similarity, an obvious addition to the application is an option to compare the performance of the two methods.

2.2.2

Possible extensions

Additional languages It should be possible to specify another WordNet database, i.e. make the program able to work with languages other than English.

Adjectives and adverbs The minimum requirement is to use nouns and verbs for the computations. An extension could be to include adjectives and adverbs as well, since WordNet supports these classes too.

Combining the measures One of the minimum requirements is that different methods of comparison must be available. We will be implementing a measurement using WordNet, and one using edit distance. These approaches can be combined by using for instance the average value.

Optimizations Optimizations should not be seen as an extension in the same way as the other proposed extensions. If the implementation shows that a part of the tool can be optimized in some way, the tool could be extended with this optimization. We look into optimizations in chapter 7.


Chapter

3

Defining similarity

This chapter introduces the concepts used to implement the similarity engine. Similarity is computed using different techniques which, when combined, can give an estimate of the similarity of texts. The process of computing the similarity of a text can be split into several parts, each of which is described in detail in this chapter. The idea is to describe the similarity between two words by using a lexical word database. By finding a similarity value for every combination of words in two sentences, a similarity value for the sentences can be obtained. Finally, the similarity of two texts is found by computing a similarity value for each combination of sentences. To understand the techniques used for the computation, the reader must have knowledge of taxonomies, and especially the lexical taxonomy WordNet. This chapter explains the basics of these concepts. The techniques we use include part-of-speech tagging, word sense disambiguation and graph search.

3.1

Textual similarity

Textual similarity refers to the concept of similarity between texts. This is a general concept, since similarity can be computed in many different ways. This chapter defines the special case called semantic similarity, which uses the meaning of words to decide upon similarity. A more general form of textual similarity uses an algorithm such as edit distance, which looks at the number of steps needed to change one word into another. This method discards meaning and as such could be error-prone. This will be tested and discussed.

3.2

Relations between words

A basis for describing the similarity between two words is to use the relations hyponymy and hypernymy. A hypernym is the opposite of a hyponym. These relations describe how concepts relate in terms of generalization. For example, car is a hyponym of vehicle because a car is a specific kind of vehicle. Likewise, vehicle is a hypernym of car since it is more general. This relation between words is often called an IS-A relation (e.g. car IS-A vehicle). Another kind of relation is the PART-OF relation, which consists of the classes meronym and holonym. Like the previous relation, meronym is the opposite of holonym. The word pair bicycle and saddle is an example of meronymy: the saddle is a meronym of the bicycle because a saddle is PART-OF a bicycle, while the bicycle is a holonym of the saddle because of the same relation. Verbs are related using troponyms. A troponym is a particular case of the meaning of another verb. For instance, to duel is a troponym of to fight.
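As an illustration, the IS-A and PART-OF relations can be modelled as simple parent maps. The words and the transitive isA helper below are hypothetical examples chosen to match the car/bicycle/saddle discussion, not data taken from WordNet.

```java
import java.util.Map;

public class Relations {
    // Hypothetical toy data: IS-A maps a hyponym to its hypernym,
    // PART-OF maps a meronym to its holonym.
    static final Map<String, String> IS_A = Map.of(
        "car", "vehicle",
        "bicycle", "vehicle",
        "vehicle", "object"
    );
    static final Map<String, String> PART_OF = Map.of(
        "saddle", "bicycle",
        "steering wheel", "car"
    );

    // Transitive IS-A check: car IS-A vehicle and vehicle IS-A object,
    // so car IS-A object also holds.
    public static boolean isA(String word, String concept) {
        for (String cur = IS_A.get(word); cur != null; cur = IS_A.get(cur))
            if (cur.equals(concept)) return true;
        return false;
    }

    public static boolean isPartOf(String meronym, String holonym) {
        return holonym.equals(PART_OF.get(meronym));
    }

    public static void main(String[] args) {
        System.out.println(isA("car", "vehicle"));         // true
        System.out.println(isPartOf("saddle", "bicycle")); // true
    }
}
```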

3.3

Semantic similarity

A specialized case of textual similarity is called semantic similarity. Semantic similarity uses the actual meaning of words to decide similarity and is a special case of the concept of semantic relatedness. Relatedness between words can be determined by use of hyponymy and meronymy. For example, a car and a steering wheel are related, i.e. the steering wheel is a meronym of car, but they are not very similar. Semantic similarity only uses hyponymy to determine similarity. Since a car and a bicycle are both hyponyms of vehicle, they are considered similar.


This knowledge can be used to determine a similarity value for a pair of words. Different methods are available for computing a value, several of which will be discussed in this chapter.

3.4

Taxonomy

The term taxonomy covers classification of concepts. Concepts are classified by their mutual relations into a hierarchical structure. The relation between the concepts is typically subtype-supertype. Using words as concepts, it can be seen as a hypernym-hyponym relation. Using the previous car example, car is a subtype of vehicle, and vehicle is a supertype of car. This can also be seen from the fact that any car is a vehicle, but not every vehicle is a car. The definition of a taxonomy is somewhat vague, since it is really just a way of classifying concepts hierarchically. This means that basically anything can be classified using a taxonomy.

3.5

WordNet

WordNet is a lexical database containing English words, including a description of the words, grouping of synonyms and semantic relations between words. The words in WordNet are structured hierarchically using hyponymy and hypernymy and the words can easily be seen as concepts. In this way WordNet can be interpreted as a taxonomy.

3.5.1

Definition

A vocabulary of any given language can be defined as a set of forms, each of which is associated with one or more senses. If a form has more than one sense, it is polysemous. If two words share a sense, they are synonymous [13]. In WordNet, synonymous words are grouped together in synsets. This means that two senses of one word will be put in different synsets. Apart from the words, a synset contains relations to other synsets. These relations are based on hyponymy, hypernymy, meronymy and holonymy. Since this project is about semantic similarity, our focus will be on hyponymy and hypernymy.
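A minimal sketch of how synsets capture synonymy and polysemy is given below. The miniature synsets are invented for illustration; real WordNet synsets also carry glosses and relations to other synsets.

```java
import java.util.List;
import java.util.Set;

public class Synsets {
    // Each synset is a set of word forms sharing one sense (hypothetical data).
    static final List<Set<String>> SYNSETS = List.of(
        Set.of("bicycle", "bike"),    // the two-wheeled-vehicle sense
        Set.of("bike", "motorcycle"), // the motorcycle sense: "bike" is polysemous
        Set.of("car", "auto", "automobile")
    );

    // A form is polysemous if it appears in more than one synset.
    public static boolean isPolysemous(String form) {
        return SYNSETS.stream().filter(s -> s.contains(form)).count() > 1;
    }

    // Two forms are synonymous if they share at least one synset, i.e. one sense.
    public static boolean areSynonyms(String a, String b) {
        return SYNSETS.stream().anyMatch(s -> s.contains(a) && s.contains(b));
    }

    public static void main(String[] args) {
        System.out.println(isPolysemous("bike"));       // true
        System.out.println(areSynonyms("car", "auto")); // true
    }
}
```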

3.5.2

Word classes

WordNet contains words from the four syntactic word classes: noun, verb, adjective and adverb. Except for a connection between nouns and adjectives through attributes, there is generally no connection between the classes, so a comparison between different classes is not possible using WordNet. This means a similarity measure will only consist of noun-noun, verb-verb, etc. comparisons.

3.5.3

Morphology

A human will easily observe that words such as run, runs, ran and running are all different forms of the same word, i.e. they have the same base form, run. The WordNet database contains only the base form of each word, so preprocessing is needed to transform every word into its base form. Fortunately, WordNet comes with a set of morphology functions called Morphy, which use a set of predefined rules and a table of exceptions to perform this preprocessing [9]. These functions are used automatically through JAWS, so we will not go into further detail about them.

3.5.4

Structure

WordNet is structured by means of the words in the database; that is, the relations between synsets form the structure of WordNet. The structure varies for each word class. Since this project only employs similarity on nouns and verbs, only the structure of these classes will be described.

3.5.4.1

Nouns

The first thing to notice when examining the noun taxonomy of WordNet is that every synset is connected to at least one other synset, and the taxonomy contains a root; i.e. it is structured as a tree. The root element of the tree is called entity. This structure makes it easy to provide methods for computing the similarity between two synsets in the tree. The closer two words are in the tree, the more similar they are, so the distance between them should be taken into consideration. Should there be more than one path between two synsets, the shortest one is picked [14].

Figure 3.1: Simplified WordNet noun graph. The most similar synset for bicycle and car is wheeled vehicle.

Figure 3.1 shows a simplified graph of the relation between bicycle and car. The shortest path between these synsets is 4, defined as the number of edges between the two nodes. The figure also shows a common ancestor, wheeled vehicle. This ancestor is the one closest to both synsets and is called the lowest common ancestor (LCA).

The length of the shortest path could be chosen as the similarity score for these words, but this would lead to a result where a high score means low similarity, which is not what we are interested in. The multiplicative inverse could be used instead, but it would make the similarity degrade very quickly. What we would really like is a value between a lower and an upper bound, where a small value indicates low similarity between synsets. For instance, we could use 0 and 1, since values within this range can be read as percentages in decimal form; a value of 1 would then mean equality. Several ideas have been proposed in the literature. One method, proposed by Wu and Palmer, takes the depth of each concept as well as their LCA into account [18]:

simWP(S1, S2) = (2 · D_LCA) / (D_S1 + D_S2)    (3.1)

where S1 and S2 are the two compared synsets, LCA is their lowest common ancestor, and D_S is the distance from synset S to the root (i.e. the entity node). (There are several synsets containing car, but they are omitted from figure 3.1 for the sake of simplicity.)

Another method, proposed by Philip Resnik [14], uses the information content of the lowest common ancestor of the compared words. Information content is approximated by the frequency of occurrence of a synset, which is calculated by counting the occurrences of a concept in a large amount of text. Because the frequency must be derived from a large corpus, we have chosen not to go into further detail with this method. A frequency count is available in WordNet, but using it through JAWS is very complicated, so this is not a method we will investigate further.

A third method is based on the distance between the words, scaled by the overall depth D of the tree. It was proposed by Leacock and Chodorow [2]:

simLC(S1, S2) = − log(len(S1, S2) / (2 · D))    (3.2)

This method is based solely on the two words being compared, so there is no need to find the LCA. A problem is that in order to use this method, we need to know the overall depth of the tree. Since this depth is not necessarily static, a full search through the tree is needed to find it.
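Both measures can be sketched against a toy hypernym tree. The small map below mimics the bicycle/car fragment of figure 3.1 (including the shortest path of length 4 between the two synsets); the class and method names are our own, and a real implementation would read the tree through JAWS rather than hard-code it.

```java
import java.util.*;

public class WordNetSimilarity {
    // Toy hypernym map (child -> parent), a stand-in for the WordNet noun tree.
    static final Map<String, String> HYPERNYM = Map.of(
        "car", "motor vehicle",
        "motor vehicle", "self-propelled vehicle",
        "self-propelled vehicle", "wheeled vehicle",
        "bicycle", "wheeled vehicle",
        "wheeled vehicle", "vehicle",
        "vehicle", "entity"   // "entity" is the root and has no hypernym
    );

    // Path from a synset up to the root, starting with the synset itself.
    static List<String> pathToRoot(String s) {
        List<String> path = new ArrayList<>();
        for (String cur = s; cur != null; cur = HYPERNYM.get(cur)) path.add(cur);
        return path;
    }

    // Depth D_S = number of edges from the synset to the root ("entity").
    public static int depth(String s) { return pathToRoot(s).size() - 1; }

    // Lowest common ancestor: first ancestor of s1 that is also an ancestor of s2.
    public static String lca(String s1, String s2) {
        Set<String> ancestors = new HashSet<>(pathToRoot(s2));
        for (String a : pathToRoot(s1)) if (ancestors.contains(a)) return a;
        return null;
    }

    // Wu-Palmer similarity (equation 3.1): 2 * D_LCA / (D_S1 + D_S2).
    public static double wuPalmer(String s1, String s2) {
        return 2.0 * depth(lca(s1, s2)) / (depth(s1) + depth(s2));
    }

    // Leacock-Chodorow similarity (equation 3.2): -log(len(S1,S2) / (2 * D)),
    // where D is the overall depth of the tree and len goes via the LCA.
    public static double leacockChodorow(String s1, String s2, int treeDepth) {
        String a = lca(s1, s2);
        int len = (depth(s1) - depth(a)) + (depth(s2) - depth(a));
        return -Math.log((double) len / (2 * treeDepth));
    }

    public static void main(String[] args) {
        System.out.println(lca("car", "bicycle")); // wheeled vehicle
        System.out.println(wuPalmer("car", "bicycle"));
    }
}
```

With this fragment, car has depth 5 and bicycle depth 3, their LCA wheeled vehicle has depth 2, so simWP = 2 · 2 / (5 + 3) = 0.5.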

3.5.4.2

Verbs

The structure of verbs in the WordNet taxonomy is quite different. The major difference is that not all synsets in the verb part of WordNet are connected; this can be seen in figure 3.2. The fact that these subtaxonomies are not connected means that there is no similarity between words in different subtaxonomies. Since this leads to many word pairs with a similarity score of 0, it will generally lead to a lower similarity score when the overall similarity between two texts is measured. The similarity measures discussed in the previous section (equations 3.1 and 3.2) are also valid for verbs, since it is still possible to find a root node in each subtaxonomy. An issue is that there can be more than one root node (e.g. in the graph for sing, there are three root nodes: utter, perform and act). A solution could be to find the LCA between the synsets and use the root node closest to the LCA as the root node for each synset.


Figure 3.2: Simplified WordNet verb graph. Observe that the subtaxonomies for grow and sing are not connected.

3.6 Similarity of words

The technique for computing a similarity value between two texts uses the similarity of the words in each text. Therefore, we need to be able to represent how closely the words relate.

3.6.1 Unknown words

Some of the words presented to WordNet may not be present in the database, and a similarity score for such a word cannot be measured. There can be many reasons why a word is missing. It could be from a foreign language, or it could be a "new" word (i.e. a word accepted into the language after the current WordNet database was deployed). It could also be an unknown abbreviation or a misspelled word. Unless it is caused by a misspelling, there is little to do about missing words in the database. A misspelled word could in theory be corrected by looking it up in a database of commonly misspelled words. Another possibility would be to look at the text as a whole and see what themes it discusses. If an unknown word looks like it could be a misspelling of a word relevant to the theme, that might be the correct word. This idea is similar to the Google spell checker [7], which uses a database of common spellings of a word and suggests that, in this context, another spelling is more common.


Edit distance There are different ways of handling unknown words. The simplest solution would be to ignore these words, i.e. every word compared to an unknown word would receive a similarity score of 0. Since this could make the final result inaccurate, another solution using edit distance can be used, as proposed by Levenshtein [11]. The Levenshtein distance measures the minimum number of character insertions, deletions and substitutions needed to transform one string into another. For instance, the Levenshtein distance between the words apple and maple is 2, because one insertion and one deletion are needed to get from apple to maple. The result needs to be transformed in order to be used in a similarity context, since it will be a number within the range of the lengths of the words. The proposed function [5] scales the result to a number between 0 and 1, where 1 means equal and 0 means no similarity at all:

simLD(W1, W2) = 1 − edit(W1, W2) / max(|W1|, |W2|)    (3.3)

An advantage of the edit distance is that it is less sensitive to spelling errors, since a word with a single typo compared to its correct spelling leads to an edit distance of 1.
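A minimal sketch of equation 3.3, with the Levenshtein distance computed by the classic dynamic programming table (the same approach described in chapter 5):

```java
public class EditSimilarity {

    // Levenshtein distance computed with the classic dynamic programming
    // table d, where d[i][j] is the distance between the first i characters
    // of a and the first j characters of b.
    public static int edit(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,         // deletion
                        d[i][j - 1] + 1),        // insertion
                        d[i - 1][j - 1] + cost); // substitution
            }
        return d[a.length()][b.length()];
    }

    // Equation 3.3: scale the distance to a similarity between 0 and 1.
    public static double simLD(String a, String b) {
        return 1.0 - (double) edit(a, b) / Math.max(a.length(), b.length());
    }
}
```

For the example above, edit("apple", "maple") is 2, so simLD gives 1 − 2/5 = 0.6.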

3.7 Similarity of two sentences

We have described how to compute the similarity of two words using WordNet and the Levenshtein distance. This will now be extended with techniques that can help compute the similarity of two sentences. This includes being able to distinguish between nouns and verbs and finding out which sense of a word is used.

3.7.1 Part of Speech tagging

We described in section 3.5.2 that we will only consider similarity between two words of the same part of speech (POS); that is, the similarity between a noun and a verb is always 0. To do this, we need to find out which words are nouns and which are verbs. Since many words can be both noun and verb, we need to "tag" every word with the correct part of speech. POS tagging is a technique used to identify each word in a sentence as noun, verb, adjective, etc. Implementing a tagger is a huge task, so we will not go into details with it, but instead use an available POS tagger implemented in Java. See appendix B for more information about this tagger.

3.7.2 Stop words

In any correct English sentence, many words will be syntactic words, i.e. they do not contribute to the semantics of the sentence. Words like for, of and to are considered stop words: they are needed for the sentence to make sense as a whole, but they do not describe the actual meaning. For example, in the sentence the car is fast, the words car and fast are the core part of the sentence, while the and is describe the connection between the words in the core part. Because stop words are very common and appear in most sentences, they could lead to a generally higher score; a suggestion is therefore to remove these words before the sentences are compared [6].
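A minimal sketch of the stop-word removal step; the stop list here is a stand-in for the frequency-ordered word file described later:

```java
import java.util.Set;

public class StopWords {

    // Removes every word contained in the stop list; the comparison is
    // case-insensitive, so "The" matches the stop word "the".
    public static String removeStopWords(String sentence, Set<String> stopWords) {
        StringBuilder result = new StringBuilder();
        for (String word : sentence.split("\\s+")) {
            if (!stopWords.contains(word.toLowerCase())) {
                if (result.length() > 0) result.append(' ');
                result.append(word);
            }
        }
        return result.toString();
    }
}
```

With the stop list {the, is}, the example sentence The car is fast is reduced to its core part car fast.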

3.7.3 Word sense disambiguation

A POS taxonomy in WordNet is divided into synsets containing synonymous words. We have also seen that a word can be polysemous, i.e. have different senses. These senses will all be contained in the same taxonomy, but in different synsets. When computing a similarity score between two words, it is necessary to ensure that the right senses of the words are compared. This is done by finding an unambiguous sense in the given context. An algorithm often used for word sense disambiguation is the Lesk algorithm, proposed by Michael E. Lesk in 1986 [10]. The algorithm uses the definition of a word to decide whether it has something in common with another word. Basically, the algorithm compares the definitions of each sense of every word in a sentence and computes a score; the pair of definitions with the highest score is picked as the correct pair. For example, when disambiguating the sentence time flies like an arrow, the algorithm compares every sense of time with those of fly and arrow. Next, every sense of fly is compared with those of time and arrow, etc. Since previous combinations have already been compared, they are not used in new iterations [1]. In WordNet, each synset contains a definition of the words it consists of. This makes it possible to implement an algorithm which uses WordNet for disambiguating word senses.
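The core of the Lesk score can be sketched as the number of distinct words shared by two sense definitions. The glosses used in the example below are invented for illustration, not actual WordNet definitions:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Lesk {

    // Scores a pair of sense definitions by the number of distinct words
    // they share; the pair with the highest score is picked as correct.
    public static int overlap(String gloss1, String gloss2) {
        Set<String> words1 = new HashSet<>(Arrays.asList(gloss1.toLowerCase().split("\\W+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(gloss2.toLowerCase().split("\\W+")));
        words2.retainAll(words1); // keep only the shared words
        return words2.size();
    }
}
```

For instance, the invented glosses "a device that keeps time" and "an instrument that measures time" share the two words that and time and would score 2.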

3.7.3.1 Extending the Lesk algorithm using hyponymy

An extension proposed by Banerjee and Pedersen [1] suggests also using the definitions of the hypernyms and hyponyms of a word. This approach has both advantages and disadvantages. Using the definitions of the hyponyms and hypernyms of the compared words could lead to a more accurate result, since these words to some degree describe the word we are looking at. This means that there is a greater possibility of getting a high score when looking at the correct senses. When adding these definitions, the computation time will obviously increase, since the algorithm has to search through more text. Depending on the algorithm and the number of hyponyms and hypernyms, the computation time may become so long that a slightly less accurate result is preferable because it can be computed faster. Of course, this cannot be concluded yet; it will be discussed in chapter 7.

3.7.4 Similarity matrix

In order to compute a similarity score between two sentences, the similarity scores for each word pair need to be collected in a data structure, so that we can compute an overall score for the pairs. Since every word in a sentence is compared with every word in the other sentence, a solution is to create a similarity matrix of size N × M, where N is the number of words in the first sentence and M is the number of words in the second sentence [6]. Table 3.1 shows an example of a similarity matrix between two simple sentences. The most common words have been removed, so these words will not create noise in the results. The next step is to use these similarity values to compute an overall similarity for the sentences.

            apple   important   making   pie
chocolate   0.31    0.11        0.00     0.00
cake        0.62    0.11        0.00     0.77
unhealthy   0.11    0.00        0.11     0.11

Table 3.1: This table shows the similarity matrix for the two sentences A chocolate cake is unhealthy and The apple is important when making pie
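Building the matrix is a straightforward double loop over the word pairs. In the sketch below the actual word measure is passed in as a function, so either the WordNet measure or the edit distance can be plugged in:

```java
import java.util.function.BiFunction;

public class SimilarityMatrix {

    // Fills an N x M matrix with pairwise word similarities; wordSim is the
    // word-level measure (WordNet-based or edit distance).
    public static double[][] build(String[] words1, String[] words2,
                                   BiFunction<String, String, Double> wordSim) {
        double[][] matrix = new double[words1.length][words2.length];
        for (int i = 0; i < words1.length; i++)
            for (int j = 0; j < words2.length; j++)
                matrix[i][j] = wordSim.apply(words1[i], words2[j]);
        return matrix;
    }
}
```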

3.7.5 Similarity of the matrix

Using the values in the matrix, we would now like to compute a total similarity value between 0 and 1. This can be done by transforming the matrix into a weighted bipartite graph [6]. A bipartite graph is an undirected graph which is partitioned into two parts. There are no edges between nodes within a part, i.e. edges go only between the two parts. A similarity matrix can be transformed into a bipartite graph by adding a node for each word from the texts. The words from the first text are contained in the first part of the graph, and the words from the second text are contained in the other part. The similarity values from the matrix become the weights of the edges going from every node in the first part to every node in the second part. This is illustrated in figure 3.3. The bipartite graph can be used to compute an overall value of the similarity matrix by finding the maximum bipartite matching. A matching consists of edges in the bipartite graph such that each node is connected to at most one edge in the matching. The sum of the weights of these edges is the value of the maximum bipartite matching. To scale the value between 0 and 1, we calculate the average of the matching. The maximum bipartite matching can be found by transforming the bipartite graph into a flow network: the maximum flow through this network equals the value of the maximum bipartite matching. A graph-theoretic approach to this problem is described in chapter 5.
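The matching itself is found via a flow network in our implementation (chapter 5). As a sketch of what is being computed, the small exhaustive search below tries every assignment of rows to distinct columns and averages the best total, which is feasible for sentence-sized matrices:

```java
public class MatrixMatching {

    // Exhaustively tries every assignment of rows to distinct columns and
    // returns the best total weight. Exponential, but fine for the small
    // matrices produced by a pair of sentences.
    private static double best(double[][] sim, int row, boolean[] usedCol) {
        if (row == sim.length) return 0.0;
        double max = best(sim, row + 1, usedCol); // leave this row unmatched
        for (int col = 0; col < sim[row].length; col++) {
            if (!usedCol[col]) {
                usedCol[col] = true;
                max = Math.max(max, sim[row][col] + best(sim, row + 1, usedCol));
                usedCol[col] = false;
            }
        }
        return max;
    }

    // The matching value averaged over the larger side, scaling it to [0, 1].
    public static double averageMatching(double[][] sim) {
        double total = best(sim, 0, new boolean[sim[0].length]);
        return total / Math.max(sim.length, sim[0].length);
    }
}
```

On the matrix from table 3.1 the best matching picks cake-pie (0.77), chocolate-apple (0.31) and unhealthy-making (0.11), giving an average of 1.19 / 4 ≈ 0.30.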

3.7.6 Sigmoid function

Using the methods for similarity measurement proposed above, the similarity will always be scaled between 0 and 1. It will often be the case that two sentences share some words, even though their actual meanings are quite different.


Figure 3.3: Similarity matrix transformed into a bipartite graph. Similarity values of 0 have been removed.

This problem is partially solved by removing the most common words from the text. There is still a possibility that less common words exist in both sentences, even though they do not contribute towards the actual meaning of the texts. This means that the computed similarity value will be somewhat higher than expected. Our proposal for this problem is to use a function where high values get rewarded and low values receive a penalty. This way, even if some words are similar, the value will be accordingly low when there is a lot of difference overall. The sigmoid function is ideal for this purpose:

f(x) = 1 / (1 + e^(−x))

The sigmoid function is shown in figure 3.4. It has been scaled to cover values from 0 to 1. It can be seen that a small value will become even smaller when put into the sigmoid function, and a large value will become even larger. If the standard sigmoid function does not reward and punish enough, we can extend it with a linear function. The following function uses different linear functions depending on the similarity score. This makes it possible to, for instance, reward high values more than low values get punished.

Figure 3.4: The standard sigmoid function scaled from 0 to 1

g(x) = ax − b    if x ≤ 0.5
g(x) = cx − d    if x > 0.5

The overall sigmoid function becomes

f(x) = 1 / (1 + e^(−(ax−b)))    if x ≤ 0.5
f(x) = 1 / (1 + e^(−(cx−d)))    if x > 0.5

The actual values can be adjusted to fit some goal. This could for instance be to make sure the results of the tool are closely related to human evaluation of semantic similarity.
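The piecewise function can be sketched as follows; the parameter values a = c = 10 and b = d = 5 used in the example are an illustrative assumption, chosen so that f(0.5) = 0.5:

```java
public class ScaledSigmoid {

    // Piecewise sigmoid: values at or below 0.5 are pushed down, values
    // above 0.5 are pushed up. The parameters a, b, c, d correspond to the
    // two linear functions g(x) = ax - b and g(x) = cx - d from the text.
    public static double f(double x, double a, double b, double c, double d) {
        if (x <= 0.5) return 1.0 / (1.0 + Math.exp(-(a * x - b)));
        return 1.0 / (1.0 + Math.exp(-(c * x - d)));
    }
}
```

With the assumed values, f(0.5) = 0.5, while f(0.1) ≈ 0.02 and f(0.9) ≈ 0.98, so low scores are penalized and high scores rewarded.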

3.8 Similarity of texts

The methods described above measure an overall similarity score for two sentences. Some preprocessing, such as POS tagging and removal of stop words, must be done before the actual value can be computed. The measurement involves a maximum bipartite matching in a bipartite graph consisting of the words in the sentences. Measuring overall text similarity is based on the same idea.

3.8.1 Dividing the text into sentences

A sentence is in this context defined as a collection of words followed by one or more special characters. The special characters are . (dot), ! (exclamation mark) and ? (question mark). By ending a sentence every time one of these special characters occurs, a collection of sentences is obtained. This leads to a problem when a text contains abbreviations: an abbreviation ends with a dot but should not be treated as the end of a sentence, so special care must be taken when removing dots from a sentence. We have decided that other non-alphanumeric characters are removed, which means that sentences in parentheses will not be seen as independent sentences, but as part of the sentence which contains them.
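A naive splitter following this definition can be sketched as below; note that it does not yet handle the abbreviation problem, so a dot after an abbreviation is still treated as a sentence end:

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    // Splits a text on the special characters '.', '!' and '?'.
    // Empty fragments (e.g. after "?!") are discarded.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("[.!?]+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) sentences.add(trimmed);
        }
        return sentences;
    }
}
```

For example, the text "The car is fast. Is it red? Yes!" is split into three sentences.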

3.8.2 Similarity score

A similarity score is computed for every pair of sentences in the texts and filled into a similarity matrix. The matrix can be transformed into a bipartite graph, and a maximum bipartite matching is obtained using a flow network. The final similarity score is obtained by applying the sigmoid function to reward higher values.

3.9 Summary

This chapter has described textual similarity and suggested a method for measuring it. Textual similarity covers the more specific concept of semantic similarity, and a method for measuring this has been described, along with a more general method using edit distance. The suggested method for computing similarity between texts partitions the text into sentences, which are further partitioned into words. Similarity between words is computed using either the lexical database WordNet or the edit distance, and by using bipartite graphs a similarity value for a text is obtained.

Chapter 4

Designing the tool

Before implementation can start, some decisions have to be made regarding the design of the similarity tool. As specified in the requirements specification, the similarity tool will consist of a similarity engine and a graphical user interface connected to the engine. We need to answer the many questions that may arise during this phase. This chapter describes the ideas, questions and decisions involved. The system will be written in Java, which means an object-oriented view of the problems is needed in the design phase. The system will be designed using the model-view-control design pattern, which means that the application will consist of three parts:

• Model: The model will contain the similarity engine, i.e. take care of the actual computations. The model consists of the techniques described in chapter 3. Techniques and principles needed for implementing them will be discussed.

• View: The view is the graphical user interface. It takes care of every interaction from the user. It will consist of a main frame, a settings dialog and different result windows.

• Control: The control is the link between the model and the view. Communication between the model and the view goes through this part. The idea is to ease implementation of each part, so that changes in the model should not deprecate the view, and the other way around.

The sections in this chapter will describe each of these parts.

4.1 Similarity engine

The engine is the core part of the tool, meaning that there are several decisions to be made during the design phase of this part. First of all, we need to decide which of the proposed methods for similarity measurement to use. It was already decided during the description of the methods that any method using information content is discarded. The two other methods discussed use distances in the taxonomy tree of a word class: one requires finding the lowest common ancestor (equation 3.1), and the other requires expanding the entire tree in order to find the maximum depth (equation 3.2). We have decided to use the method proposed by Wu and Palmer (equation 3.1), since it takes not only the actual words into account but also their most descriptive ancestor, the LCA. Additionally, the other method requires that the WordNet tree is expanded entirely to find its maximum depth, which makes the solution less dynamic in terms of changing the WordNet database.

4.1.1 Taxonomy

WordNet is, as described, structured like a tree. A node in the tree contains references to its hyponyms and hypernyms. While this overall structure is ideal for using graph searching as a tool for measurement, the access to WordNet is not optimized for searching. For this, a data structure which makes the task easy must be implemented. By looking at the relations needed by the similarity method, it can be seen that the data structure can be made very simple. The nodes needed to compute a similarity value are the nodes containing each synset and the node containing the LCA. The root node is also needed for computing each node's distance to the root. It follows that searching only needs to be done in one direction: from each synset towards the root. By definition, the LCA cannot be further away from the root than the synsets it describes, so a node is only required to contain references to its hypernyms. Figure 3.1 shows this relation. The connection to WordNet is obtained by creating a reference to each synset in the corresponding node. This way graph search is easy to perform, and the actual comparison between synsets is done using the references in the nodes.
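The node structure can be sketched as a class holding only upward (hypernym) references; the string identifier below stands in for the actual JAWS synset reference:

```java
import java.util.ArrayList;
import java.util.List;

public class SynsetNode {

    // Minimal taxonomy node: only upward (hypernym) references are stored,
    // plus an identifier standing in for the actual WordNet synset reference.
    public final String synsetId;
    public final List<SynsetNode> hypernyms = new ArrayList<>();

    public SynsetNode(String synsetId) {
        this.synsetId = synsetId;
    }

    public void addHypernym(SynsetNode parent) {
        hypernyms.add(parent);
    }

    // Shortest distance to a root, i.e. a node with no hypernyms.
    public int depthToRoot() {
        if (hypernyms.isEmpty()) return 0;
        int best = Integer.MAX_VALUE;
        for (SynsetNode h : hypernyms)
            best = Math.min(best, 1 + h.depthToRoot());
        return best;
    }
}
```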

4.1.2 Text processing

When a comparison is started, the specified texts are processed and the sentences from each text are collected. In chapter 3 we described a technique which compares every pair of words in two sentences to get an overall sentence similarity; this is then extended to compare every pair of sentences in two texts to get the final similarity between the texts. This leads to a structure where a collection of methods for computing similarity between sentences is combined in a single method, which computes the similarity between a pair of sentences. This method can in turn be used by a collection of methods which compute similarity between texts: a method taking two texts as input uses it to compute a similarity score between the texts. One obvious advantage is that this allows us to implement some of the extensions discussed in chapter 2. For instance, by dividing the structure into a part computing similarity between sentences and one computing similarity between texts, the extension which provides similarity between sentences is already implemented. It also enables us to implement the other extensions, i.e. the folder comparison and the folder-text comparison, since we can call the method computing a similarity score between two texts for each pair of texts.

Compound words Some words in English are compound words, i.e. they actually consist of multiple words. For instance, science fiction is a compound word consisting of the words science and fiction. When a sentence is split into words, it is assumed that a word is surrounded by whitespace. This means a compound word like science fiction will be split into science and fiction.


This problem could be solved by splitting in steps and, in each step, checking in the WordNet database whether the string is a compound word. This would decrease performance a lot, so it is a feature we will not be implementing.

4.2 Graphical User Interface

Designing a graphical user interface is in our case an important part of the software development project. It should provide both an easy and intuitive approach to similarity computations and more advanced possibilities for text comparisons and performance measurements. This allows both experienced and less experienced users to use the tool: while an experienced user might need many different methods of comparison, another is satisfied with the possibility of comparing two texts. Designing the GUI early in the development process also helps the developer test the model visually.

4.2.1 Outline

In chapter 2 we defined a range of requirements that the GUI should satisfy. First of all, the GUI consists of several windows which serve different purposes in terms of interaction and results. The main purpose of the GUI is to provide a simple graphical tool that makes it possible for the user to compare two texts and make performance comparisons of the similarity engine. The results of the text comparison and the performance measurement should be displayed in a clear way that makes them easy to interpret. For this, we define a main window which enables the user to perform similarity measures of different kinds, a settings dialog for specifying rarely changed technical settings (e.g. the location of the WordNet database) and several result windows.

4.2.2 Main window

The main window is the first window the user sees when launching the tool. It should enable the user to perform a similarity measure. This means it should contain the possibility of choosing files, deciding upon a similarity method (WordNet or edit distance), and a button which starts the comparison. This design is very simple, so there are of course many possible extensions, which should provide access to more advanced techniques. It should be possible for the more experienced user to manipulate the settings to produce more specific results.

File browser The user should be able to choose files for comparison. It would be possible to let the user type the location of a file manually, but a more convenient solution is to let him browse for files. The obvious solution is to use the built-in Swing component JFileChooser, since it provides exactly the functionality we need.

Similarity tool The main window should provide the ability to choose between the different similarity measures. Since this project uses WordNet and edit distance as measures, it should be possible for the user to choose which method to use. Since we will make it possible to combine the methods of comparison, it is necessary to use checkboxes. Care must be taken with this approach, since it is possible to uncheck both boxes; this erroneous combination must be handled correctly.

Stop words Many of the most common words are removed by default when performing comparisons. As we have described, they are helpful for a human being in understanding the meaning, but can confuse a machine trying to compare subjects. On the other hand, a user could be interested in computing a similarity value where stop words are not removed, perhaps because in a specific context they actually are descriptive. It can also be used to discover whether the results are better when stop words are removed. The words are taken from a text file containing the most common words ordered by frequency, with the most common words at the beginning of the file. An obvious approach is thus to remove the most common words first. The solution is to provide a slider specifying what percentage of the words in the file should be removed: specifying 50% means that the first half of the words in the stop-words file will be removed from the specified text.

Word sense disambiguation In order to decide the meaning of each word in a sentence, a solution using hypernyms and hyponyms is proposed. The common user will probably just use this default setting, but in some cases it could be convenient to specify which definitions are to be used. This leads to adding checkboxes in the main window allowing the user to specify which definitions should be used when deciding word senses. It should also be possible to disable WSD, so a checkbox must be added for this.

Progress The user should always be able to follow the progress of a computation. A common user will mostly be interested in knowing that the tool is working, i.e. that some progress is happening. The solution is to create a progress bar, which is continuously updated by the engine, so the user can see that the tool makes progress.

Console output An experienced user would probably be interested in more than just how much progress has been made. For this, a console should be provided to which the engine can send text. For instance, every time two texts have been compared, the console should output the similarity matrix between the texts, so the user can follow the progress in more detail.

4.2.2.1 Extensions

These design choices are all part of the minimum requirements of the graphical user interface. We have also discussed some possible extensions of the tool, and some of these have been chosen for inclusion as well. This section describes the design of these extensions.

Sentences The general design of the engine includes a method for comparing sentences, since the similarity of every combination of sentences in two texts must be measured. This means that comparing two specific sentences is not a very large extension, since the engine already contains the needed functionality. In the GUI, this extension means that a list of every sentence in each text must be added. It should then be possible to pick a sentence from each list and compute the similarity.

Methods We have proposed two extensions which enable the user to compare more than two texts. Once the design of the engine is completed, an extension allowing comparison of multiple texts is only a matter of reusing functionality, and it will therefore be easy to implement.


The main window should contain an option for choosing which comparison method to use. There are of course different approaches: one is to use a radio button for each method, another is a drop-down list. We have decided that selecting the right comparison method is more obvious when a drop-down list is used.

Optimizations As discussed in chapter 2, optimizations might be needed once the tool has been implemented. If an optimization is implemented, we will also add appropriate components to the GUI to ease performance testing of the optimization.

4.2.3 Settings dialog

The main window contains the settings for the actual comparison. The settings dialog should only contain settings which are changed rarely. These are settings such as the location of the WordNet database, since this location is considered static and should only be changed if the user wants to provide the tool with a different WordNet database; this could make it possible to use the tool with other languages. Other settings considered static are the location of the file containing stop words and the location of the POS tagger.

4.2.4 Result windows

Since the various methods provide very different kinds of results, a window for each kind of result is needed. The results could all be shown in different tabs of the same window, but since the methods are so different, a separate window for each kind of result seems better.

4.2.4.1 Text-text comparison

The simplest kind of comparison is the similarity measure between two texts, which results in a single similarity value. The output from the engine can be seen in the console output in the main window, but for a common user a more visual result would be appropriate. A window showing the result should be displayed when a comparison is done. The user should be able to see the result as a percentage, where 100% means the texts are equal. The processing time should also be provided, as well as a performance chart showing how well each part of the computation performs.

Heatmap We have chosen to implement the heatmap extension as well. This extension should help visualize which parts of the texts are equal. To implement it, the result window will be extended with tabs, where the first tab is the general comparison result and the second contains the heatmap.

4.2.4.2 Folder comparison

By comparing every text in a folder with each other, a similarity matrix is obtained. This matrix should be visualized in a way that shows the relations between the texts. When the texts have been compared, a result window should appear with a graph showing the relations between the texts. A list of the texts should be visible, enabling the user to choose which of the relations should be drawn. We have tried using a neural network with a clustering algorithm, the Kohonen network. While possible, it is not suitable for this tool; we will discuss this in chapter 9. Instead, we propose a solution using genetic algorithms. We will implement an algorithm which decides the location of each text in a 2D graph, such that the relative distances between the texts are maintained from the multidimensional graph. The genetic algorithm attempts to maintain these relations using principles described in chapter 5.

4.2.4.3 Folder-text comparison

The idea of comparing a text with the contents of a folder is to find the texts in the folder which are most similar to the single text. When performing a folder-text comparison, the console output will contain a similarity value for every text in the folder. For the common user, this output is not very useful; what he wants is a visualization of the result, for instance showing the top 10 most similar texts, since these will be the most relevant. The result window should contain a table of the most similar texts as well as a bar visualizing the similarity between the texts.

4.2.4.4 Performance comparison

The performance comparison should compare both time and results, i.e. overall efficiency. For this, the text-text comparison window can be reused. All that needs to be done is to show a graph for each of the methods, to provide an easy overview of the results.

4.3 Connecting model and view

The model performs the comparisons and updates the view accordingly. When a user changes settings in the GUI, the correct parts of the model should be updated to hold the values the GUI shows. This means that a unit controlling the communication is needed to synchronize the two parts. The control also contains the methods used for starting the application, i.e. it creates the view (by creating a main window) and connects it with the model. When a user clicks a button in the GUI to compare texts, a method in the control is invoked, which starts the comparison in the model.


Chapter 5

Algorithms

In chapter 3 we discussed how different algorithms can be used to measure similarity between texts. This chapter describes each of the algorithms we have used. Technicalities such as pseudocode are included in appendix D.

5.1 Dynamic programming

The concept dynamic programming refers to a programming technique which solves a problem by combining the solutions to subproblems [4, chapter 15]. It is applicable when subproblems share sub-subproblems. When subproblems are shared, a normal divide-and-conquer algorithm does too much work by solving the same subproblem several times [4, chapter 2]. An algorithm using a dynamic programming approach stores the solutions to subproblems in some kind of data structure, often a table, and then reuses the previously calculated solutions. Dynamic programming is often used to solve optimization problems. In an optimization problem there might exist more than one solution, and the goal is to find an optimal one.


5.2 Levenshtein distance

The Levenshtein distance is an implementation of the edit distance which is a measurement of the difference between two strings. The measurement is done by computing the number of edit operations, that has to be done in order to transform one string into another. An edit operation is defined as an operation that either changes a character into another, deletes or inserts a character in a string. These edit operations are as follows:

• Substitution: Replace a character, i.e. change 'a' to 'b' in a string containing the character 'a'.
• Deletion: Delete a character, i.e. remove 'a' from a string containing the character 'a'.
• Insertion: Insert a character, i.e. insert 'a', so that the new string contains 'a'.

For example, the edit distance between the strings “football” and “basketball” is 5, since “foo” is changed into “bas” (three substitutions) and “ke” is inserted (two insertions). If the edit distance between two texts is zero, the texts are identical, since no changes have to be carried out before the texts match each other. Our implementation uses the common dynamic-programming approach to solve the problem, storing the calculations in a table. The solution is found by backtracking. An example of such a table, along with pseudocode for the algorithm, is given in appendix D.
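The dynamic-programming table described above can be sketched as follows. This is a minimal illustration in modern Java with our own class and method names, not the thesis's actual code; it computes only the distance and omits the backtracking step.

```java
// Minimal dynamic-programming sketch of the Levenshtein distance.
class Levenshtein {
    static int distance(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] d = new int[n + 1][m + 1];  // d[i][j] = distance between prefixes s[0..i) and t[0..j)
        for (int i = 0; i <= n; i++) d[i][0] = i;  // transform s[0..i) into "" by i deletions
        for (int j = 0; j <= m; j++) d[0][j] = j;  // transform "" into t[0..j) by j insertions
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + cost);     // substitution or match
            }
        }
        return d[n][m];
    }
}
```

With the example from the text, `distance("football", "basketball")` yields 5.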

5.3 Longest common subsequence

Different algorithms exist for measuring how much two strings have in common. One of these is the edit distance, which was explained in the previous section; another is determining whether one string is a substring of the other. The longest common subsequence is similar, but the characters in the subsequence do not need to appear consecutively. For example, the length of the longest common substring of “abcdefg” and “achge” is 1, since no two consecutive characters exist in both strings, while the length of the longest common subsequence is 3: either the sequence ⟨a, c, e⟩ or ⟨a, c, g⟩. The longest common subsequence is often used in biology to determine how similar DNA strings are. For a formal definition of the algorithm, we refer to [4, p. 350]. The implementation uses the same dynamic-programming approach as the Levenshtein distance. See appendix D for pseudocode and a table showing the principles of the backtracking.
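The table-filling and backtracking can be sketched like this. This is an illustration in modern Java with our own names, not the project's code:

```java
// Longest common subsequence: DP table plus backtracking to recover one LCS.
class Lcs {
    // c[i][j] = length of the LCS of the prefixes s[0..i) and t[0..j)
    static int[][] table(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] c = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                c[i][j] = (s.charAt(i - 1) == t.charAt(j - 1))
                        ? c[i - 1][j - 1] + 1
                        : Math.max(c[i - 1][j], c[i][j - 1]);
        return c;
    }

    static int length(String s, String t) {
        return table(s, t)[s.length()][t.length()];
    }

    // walk back through the table to recover one longest common subsequence
    static String backtrack(String s, String t) {
        int[][] c = table(s, t);
        StringBuilder sb = new StringBuilder();
        int i = s.length(), j = t.length();
        while (i > 0 && j > 0) {
            if (s.charAt(i - 1) == t.charAt(j - 1)) { sb.append(s.charAt(i - 1)); i--; j--; }
            else if (c[i - 1][j] >= c[i][j - 1]) i--;
            else j--;
        }
        return sb.reverse().toString();
    }
}
```

For the example strings, `length("abcdefg", "achge")` is 3, and `backtrack` returns one of the two length-3 subsequences.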

5.4 Lowest common ancestor

In the computation of the similarity of two words, we need to find the depth of the lowest common ancestor of the two words. In WordNet a word can be polysemous and thereby be contained in several synsets. The connections between synsets in WordNet can be considered a directed search tree, where each synset is represented by a node. The lowest common ancestor is a node containing a synset with a general description of both words. The search for the lowest common ancestor is complicated by the fact that a word can be polysemous. This leaves us with two search cases: if the user has chosen not to disambiguate word senses, each word can have more than one meaning, which means we have to find the ancestor of the most common subject. This leaves us with a search tree where the search must start from several nodes. If word sense disambiguation is enabled, the search has only two starting points. We created an algorithm based on the technique from breadth-first search [4, chapter 22]. It uses a queue, which makes it easy to handle the several starting points. The exploration of new nodes is done in steps from the next node in the queue. The idea of the queue is to alternate between the different search branches, so that the lowest common ancestor is found as close as possible to the two closest start nodes of different words. This approach is shown in figure D.1 in appendix D. When a new node is explored, every hypernym of the synset is added to the queue. The new node is also assigned a reference to the node where the search started, and a color indicating whether it has already been visited. When the search reaches a node that has already been visited and is an ancestor of the other word, the search terminates, and the lowest common ancestor has been found.
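The queue-based search might be sketched as follows. The hypernym relation is modelled as a plain map from synset name to hypernym names, and the meeting rule is simply "first node reached from both words"; this is a simplified illustration with our own names, not the thesis's WordNet-backed implementation.

```java
import java.util.*;

// Sketch of a multi-start breadth-first search for a lowest common ancestor.
class LowestCommonAncestor {
    // startsA/startsB: the start synsets of the two words (several per word
    // when word sense disambiguation is disabled). A single FIFO queue
    // alternates between the branches, so the first node reached from both
    // sides is an ancestor close to the start nodes.
    static String find(Map<String, List<String>> hypernyms,
                       List<String> startsA, List<String> startsB) {
        Map<String, Integer> owner = new HashMap<>();  // 1 = reached from A, 2 = from B
        Deque<String> queue = new ArrayDeque<>();
        for (String s : startsA) { owner.put(s, 1); queue.add(s); }
        for (String s : startsB) {
            if (owner.containsKey(s)) return s;        // a sense shared by both words
            owner.put(s, 2); queue.add(s);
        }
        while (!queue.isEmpty()) {
            String node = queue.remove();
            int side = owner.get(node);
            for (String h : hypernyms.getOrDefault(node, Collections.emptyList())) {
                Integer seen = owner.get(h);
                if (seen == null) { owner.put(h, side); queue.add(h); }
                else if (seen != side) return h;       // reached from both words
            }
        }
        return null;  // no common ancestor found
    }
}
```

With a toy taxonomy where both "dog" and "cat" lead up to "carnivore", the search meets there.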

5.5 Maximum flow

A directed graph can be considered a flow network, which makes the maximum flow problem interesting. Given a weighted directed graph we are able to calculate the maximum flow through the graph. The graph is considered a flow network with a certain capacity. The capacity is determined by the weights assigned to the edges in the graph, and these capacities constrain the flow through the network. The flow can be illustrated by adding a source, where material is produced and enters the network, and a sink, where the material is consumed and exits the network. The maximum flow can now be calculated as the amount of material that can flow through the network in a given period of time.

5.5.1 The Ford-Fulkerson algorithm

The maximum flow in a given network is computed with the Ford-Fulkerson algorithm. The algorithm works by creating a new network, called a residual network, which describes how much flow we can still add along a path from the source to the sink. Such a path is called an augmenting path, because the final flow is augmented by the available flow along this path in the network. The final flow is only increased if there exists a path from the source to the sink in the residual network. If no path can be found, we have found the maximum flow through the network. Pseudocode for this algorithm is included in appendix D.
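The algorithm can be sketched as follows. This illustrative implementation finds augmenting paths with breadth-first search (the Edmonds-Karp variant of Ford-Fulkerson) on a capacity matrix; the thesis's own code may differ.

```java
import java.util.*;

// Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp), on an adjacency matrix.
class MaxFlow {
    // capacity[u][v] is the capacity of edge u -> v; returns the max flow source -> sink.
    static int maxFlow(int[][] capacity, int source, int sink) {
        int n = capacity.length;
        int[][] residual = new int[n][];
        for (int u = 0; u < n; u++) residual[u] = capacity[u].clone();
        int flow = 0;
        while (true) {
            // BFS in the residual network to find an augmenting path
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[source] = source;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(source);
            while (!queue.isEmpty() && parent[sink] == -1) {
                int u = queue.remove();
                for (int v = 0; v < n; v++)
                    if (parent[v] == -1 && residual[u][v] > 0) { parent[v] = u; queue.add(v); }
            }
            if (parent[sink] == -1) return flow;  // no augmenting path left: flow is maximal
            int bottleneck = Integer.MAX_VALUE;   // smallest residual capacity on the path
            for (int v = sink; v != source; v = parent[v])
                bottleneck = Math.min(bottleneck, residual[parent[v]][v]);
            for (int v = sink; v != source; v = parent[v]) {
                residual[parent[v]][v] -= bottleneck;
                residual[v][parent[v]] += bottleneck; // reverse edge: flow can be "undone" later
            }
            flow += bottleneck;
        }
    }
}
```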

5.5.2 Bipartite graph

A bipartite graph is an undirected graph G = (V, E) in which the vertices V can be partitioned into two sets V1 and V2 such that (u, v) ∈ E implies that u ∈ V1 and v ∈ V2, or u ∈ V2 and v ∈ V1. This means that every edge must go between the two sets V1 and V2 [4, p. 1083]. In a standard bipartite graph the weight of an edge is implicitly set to unity; a weighted bipartite graph is then made by placing an explicit weight on each edge. In a bipartite graph it is possible to calculate the maximum bipartite matching. A matching in an undirected graph G = (V, E) is a subset of edges M ⊆ E such that for every vertex v ∈ V, at most one edge in M is connected to v. A vertex v is matched by the matching M if an edge in M is connected to v; otherwise it is unmatched. A maximum matching is a matching of maximum cardinality, i.e. the number of edges in M is the maximum possible [4, p. 664]. Figure 5.1 shows an example of bipartite matching. In a weighted bipartite graph, the maximum matching is the matching where the sum of edge weights is maximal. The maximum bipartite matching problem can be solved by transforming the bipartite graph into a flow network and calculating the maximum flow.

Figure 5.1: A bipartite graph G = (V, E) with vertex partition V = L ∪ R. (a) A matching with cardinality 2. (b) A maximum matching with cardinality 3. [4]

5.5.3 Maximum flow with multiple sources and sinks

We might want to find the maximum flow in a graph with multiple sources and sinks. Fortunately, we are able to use the same technique as for a graph with only one source and one sink: a directed graph containing multiple sources and sinks can easily be transformed into a directed graph with only one of each, which allows us to solve it as a normal flow problem. The transformation from a multiple-source/sink network to a single-source/sink network is done by adding a new supersource and supersink. The supersource provides all of the previous sources with unlimited flow, and the supersink consumes as much flow as needed from the previous sinks. A bipartite graph can be interpreted as a directed graph representing a flow network with multiple sources on one side (V1) and multiple sinks on the other side (V2). This way we can calculate the final maximum flow through the graph by using the transformation technique previously described: the bipartite graph is transformed into a traditional flow network by adding a supersource connected to the vertices in V1 and a supersink connected to the vertices in V2.
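The transformation might be sketched like this, assuming the bipartite graph is given as an n-by-m weight matrix. All names are our own; the resulting capacity matrix can then be fed to any maximum-flow routine.

```java
// Sketch: turn an n-by-m weighted bipartite graph into a single-source,
// single-sink flow network by adding a supersource (node 0) and a supersink
// (last node). V1 vertices become nodes 1..n, V2 vertices nodes n+1..n+m.
class BipartiteToFlowNetwork {
    static int[][] build(int[][] weights) {  // weights[i][j]: edge V1[i] -> V2[j], 0 = no edge
        int n = weights.length, m = weights[0].length;
        int size = n + m + 2;
        int source = 0, sink = size - 1;
        int[][] capacity = new int[size][size];
        for (int i = 0; i < n; i++) {
            capacity[source][1 + i] = Integer.MAX_VALUE;    // "unlimited" flow into each old source
            for (int j = 0; j < m; j++)
                capacity[1 + i][1 + n + j] = weights[i][j]; // the original bipartite edges
        }
        for (int j = 0; j < m; j++)
            capacity[1 + n + j][sink] = Integer.MAX_VALUE;  // supersink consumes everything
        return capacity;
    }
}
```

For the maximum bipartite matching problem specifically, the supersource and supersink edges would be given capacity 1 instead of unlimited capacity, so that each vertex participates in at most one matched edge.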

5.6 Genetic algorithms

Genetic algorithms are a subtopic of evolutionary computing and are often used to find or approximate a solution to an optimization problem [8, chapter 8]. The evolution of a solution to a problem is done using techniques inspired by evolutionary biology. The evolution of individuals in nature was first described by the English naturalist Charles Darwin in 1859 [17], where he also used the well-known phrases “natural selection” and “survival of the fittest”. These two phrases describe the basic principles of genetic algorithms. The evolution is usually started by creating a population of randomly generated individuals. Each individual represents a solution, which is characterized by its genes. The genes represent a range of parameters used to solve the problem. The population evolves in generations, where only the genes from the best individuals are passed on to the next generation. Each individual is rated by a fitness function in order to determine which individuals should survive. New individuals are created by selecting the best individuals as the base for the new ones. The genes of a new individual are reproduced through mutation of a previously selected individual, a crossover of genes from two selected individuals, or are left intact.

Initialization The genetic algorithm is initialized by forming a population. This is done by creating a large number of random individuals.

Selection When a new population of individuals has been created, each individual is rated using a fitness function to determine how well each solution solves the given problem. The best individuals are selected for reproduction to form a new population.

Reproduction Reproduction is usually done in three different ways: mutation, crossover, or leaving the individual intact. Leaving several individuals intact is important to ensure that the new generation will not perform worse than the previous one, since we want to find an optimal solution. New individuals can also be formed by mutation of one of the previously good individuals. The mutation is done by altering some of the characteristics that gave the previous individual its ability to solve a given problem; this might lead to an even better solution. A new individual can also be created by a crossover between two of the best individuals from the previous generation. This is done in an attempt to pass on the best properties from both parents to the new individual.

Termination The evolution can be very slow, and the running time often depends on the complexity of the given problem. Because of this there can be different stop criteria. Often the evolution is terminated after a given number of generations or a given time period. Another option is to stop the evolution when no better individual has been produced for a number of generations.
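The initialization-selection-reproduction-termination loop above can be sketched on a toy problem: maximizing the number of 1-bits in a bit string. This is purely an illustration in modern Java; the fitness function, parameters and termination criterion here are our own, not those used in the tool.

```java
import java.util.*;

// Toy genetic algorithm: evolve a bit string towards all ones ("OneMax").
class GeneticOneMax {
    static int fitness(boolean[] g) { int f = 0; for (boolean b : g) if (b) f++; return f; }

    // returns the best fitness found after a fixed number of generations
    static int run(int bits, int popSize, int generations, long seed) {
        Random rnd = new Random(seed);
        // initialization: a population of random individuals
        boolean[][] pop = new boolean[popSize][bits];
        for (boolean[] g : pop) for (int i = 0; i < bits; i++) g[i] = rnd.nextBoolean();
        for (int gen = 0; gen < generations; gen++) {
            // selection: rank by fitness, best first
            Arrays.sort(pop, (a, b) -> fitness(b) - fitness(a));
            boolean[][] next = new boolean[popSize][];
            next[0] = pop[0].clone();  // the best individual is left intact (elitism)
            for (int k = 1; k < popSize; k++) {
                // reproduction: one-point crossover of two parents from the better half
                boolean[] p1 = pop[rnd.nextInt(popSize / 2)];
                boolean[] p2 = pop[rnd.nextInt(popSize / 2)];
                boolean[] child = new boolean[bits];
                int cut = rnd.nextInt(bits);
                for (int i = 0; i < bits; i++) child[i] = (i < cut ? p1[i] : p2[i]);
                if (rnd.nextInt(10) == 0)            // occasional mutation: flip one gene
                    child[rnd.nextInt(bits)] ^= true;
                next[k] = child;
            }
            pop = next;
        }
        int best = 0;
        for (boolean[] g : pop) best = Math.max(best, fitness(g));
        return best;
    }
}
```

Because the best individual always survives intact, the best fitness can never decrease from one generation to the next.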

5.7 The Lesk algorithm

The Lesk algorithm is used for word sense disambiguation. The algorithm is based on finding the longest common subsequence between the definitions of two words. The algorithm takes a sentence as input, and the result is a list of synsets from WordNet containing the suggested senses. The synsets are found by choosing the most fitting synsets for each pair of words in the sentence. The synset which occurs most often for each word is chosen as the best sense. The pseudocode for the algorithm is found in appendix D. The pseudocode for the method getDefinition(Synset) has been omitted; it finds the definition of the specified synset as well as the definitions of its hyponyms and hypernyms. These definitions are then used in the LCS algorithm, which was explained earlier in this chapter.
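The core gloss-overlap idea can be sketched as follows, with a hard-coded, hypothetical gloss map standing in for the WordNet definitions. This is a heavy simplification: the real algorithm scores every pair of words in the sentence and also includes hyponym and hypernym definitions via getDefinition.

```java
import java.util.*;

// Sketch of Lesk-style sense selection: score each candidate sense by the
// LCS overlap between its gloss and the glosses of the context words.
class LeskSketch {
    // longest-common-subsequence length, used as the overlap score
    static int lcs(String s, String t) {
        int[][] c = new int[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                c[i][j] = (s.charAt(i - 1) == t.charAt(j - 1))
                        ? c[i - 1][j - 1] + 1
                        : Math.max(c[i - 1][j], c[i][j - 1]);
        return c[s.length()][t.length()];
    }

    // pick the sense whose gloss overlaps most with the context glosses
    static String bestSense(Map<String, String> senseGlosses, List<String> contextGlosses) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, String> sense : senseGlosses.entrySet()) {
            int score = 0;
            for (String ctx : contextGlosses)
                score += lcs(sense.getValue(), ctx);
            if (score > bestScore) { bestScore = score; best = sense.getKey(); }
        }
        return best;
    }
}
```

With made-up glosses for two senses of "bank", a financial context gloss selects the financial sense, since its gloss shares the long subsequence "a financial institution" with the context.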


Chapter 6

Implementation in short

The implementation is a result of the decisions made in the design phase. The actual coding of each part is not something which will be discussed, but this chapter outlines the flow between the parts. For a more thorough description of the implementation and the different APIs we have been using, see appendix B. The tool consists of a model, which takes care of text processing and graph searching, a view, which contains visual components used for interacting with the model, and a control, which takes care of interactions between the model and the view.

6.1 Model

The main purpose of the model is to compute similarity values. This is done using the text processing and graph searching algorithms we have described previously. These algorithms have been implemented in different packages, since they are naturally divided into the two kinds of algorithms we are using. Figure 6.1 shows the relation between the parts, containing the classes which are used to implement the algorithms. The graph package contains such things as the WordNet tree used for searching, the bipartite graph and flow network, and the genetic algorithms used for a similarity relations map. The similarity package contains classes used for each part of the text processing and similarity computations.

Figure 6.1: UML diagram showing the relation between the parts of the model.

6.2 View

The view consists of the windows presented to the user, which enable him to interact with the model. This includes windows for visualizing results, manipulating settings etc. The implementation is done using a GUI editor available for Eclipse, i.e. by drag-and-drop. Some classes contain specialized methods for creating graphs and other visual components used to show results.

6.3 Control

The control consists of a class which takes care of communication between the model and the view. The idea is to avoid the view becoming out of date when the model is changed, and to ensure that settings specified in the view are set correctly in the model.

6.4 Considerations

The tool has been implemented using Java 5.0 on Mac OS X. The main reason for using Java is that it is platform independent. This means that the implementation can be done on different operating systems, and the final result will also be runnable on different operating systems. The approach to teaching programming at DTU has been Java, so it came naturally to us to use it as the programming language. The fact that we have developed the tool on Mac OS X has restricted the use of the newest version of Java: Java 6.0 is only available for the 64-bit version of Mac OS X, which is currently not available to us. These facts have made Java 5.0 an easy choice. If the main priority of the implementation were performance, it could have been relevant to look at other programming languages such as C or C++. Java includes an automatic garbage collector, which means that the programmer cannot control when garbage collection occurs. In C and C++ the programmer is able to control this himself, which makes it possible to optimize code and thereby increase performance. Since performance is not a primary concern in this project, these languages are discarded in favor of Java. Another advantage of Java is the included GUI library Swing, which enables us to create a graphical user interface. This is not a part of standard C and C++, and since one of the requirements is the creation of a GUI, Java seems more attractive.

6.5 Result

The implementation has resulted in a jar-file which contains our code as well as the APIs used for access to WordNet and part-of-speech tagging. The actual WordNet database has not been included in the jar-file, partly because of its size, but also because it should be possible for the user to specify another location of a WordNet database. The jar-file can be run using the command: java -jar Similarity.jar. It might be a good idea to increase the Java memory size by using the parameters -XmsNNNm -XmxNNNm. The jar is available from the internet address: www.damn.dk/similarity/.

Chapter 7

Optimization

During implementation it became clear that the similarity measure using WordNet was very slow. We have seen examples of comparisons taking an unreasonable amount of time, making the processing very tedious. In this chapter we will look into these issues and try to discover which part of the computation is slowing it down. We will be looking for bottlenecks and propose a solution using threads, since this enables the tool to use several CPUs (if available). We will also be looking at other ways of optimizing, such as “re-implementing” features of Java instead of using the built-in methods. Restructuring some parts of the tool could also lead to a large performance boost.

7.1 Bottlenecks

A bottleneck is a phenomenon where the performance of the entire tool is greatly reduced because of a single part. This can be caused by the design of the part, but it is also possible that the bottleneck is caused by methods which are fundamental to the programming language or part of an API. This section discusses some of these possibilities.


Figure 7.1: The figure shows how well the tool performs before any optimizations have been made.

7.1.1 Lookups in WordNet

The first thing to discover is whether the low performance is caused by the WordNet API. If a request to WordNet is not implemented efficiently in the API, this could lead to the low performance we are observing, since WordNet calls are executed all the time. The result can be observed in the performance chart shown in the result window of the tool; figure 7.1 shows an example of such a chart. In this example, the computation time was 38.42 seconds. We can see that the WordNet lookups (the green bar) are performed very fast compared to the total amount of time. This means that the WordNet lookups are probably not the bottleneck we are looking for. Instead, we observe that the word sense disambiguation seems to be the part of the computation that takes the longest. It actually seems that the LCS algorithm, which is a large part of the Lesk algorithm, is the bottleneck we are looking for.


Figure 7.2: The figure shows how well the tool performs after the Lesk algorithm has been moved from computation into preprocessing.

7.1.2 Examining the Lesk algorithm

The algorithm is executed on a sentence to find the correct sense of each word using the other words in the sentence. Looking at the structure of the code which uses the Lesk algorithm, one optimization can quickly be found. The algorithm is used in every instance of CompareSentences, which means that the Lesk algorithm is executed for every pair of sentences. This causes overlap, since every sentence is contained in more than one sentence pair. We have optimized the Lesk algorithm by moving it from the comparison into the preprocessing. Since it is executed on a sentence, it makes sense to move it into the class Sentence. When this is done, the creation of a sentence automatically finds the correct synsets using WordNet. The result of this optimization can be seen in figure 7.2. Observe that the preprocessing now takes almost all the time, but the total computation time was only 9.46 seconds. The optimized tool thus performs four times as fast as before the optimization.

Built-in String methods The LCS algorithm compares each character of a string to find the subsequence. The Java String class contains several built-in methods for manipulating strings. One of these is used to select a character at a given position in the string: String.charAt(i). Because this method is called on every character in each of the strings, it could decrease performance if it is not implemented efficiently. The most efficient approach seems to be to create a character array containing the characters of a string once and access this array each time. We have implemented this optimization, and the result has been approximately a 10% performance increase of the LCS. While this is only an increase of the LCS, and not a very large one, every optimization of the LCS matters, since it is considered a bottleneck and thus could improve the overall performance.
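The optimization might look like this in the LCS inner loop. This is a sketch with our own names, not the tool's actual code:

```java
// LCS where each string is converted to a char[] once, so the inner loop
// only does plain array indexing instead of calling String.charAt(i).
class LcsCharArray {
    static int lcs(String s, String t) {
        char[] a = s.toCharArray();  // one copy per string ...
        char[] b = t.toCharArray();
        int[][] c = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                c[i][j] = (a[i - 1] == b[j - 1])  // ... then only array lookups here
                        ? c[i - 1][j - 1] + 1
                        : Math.max(c[i - 1][j], c[i][j - 1]);
        return c[a.length][b.length];
    }
}
```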

7.2 Threads

Threads can be used to increase the performance of a program if the computer on which it is executed has multiple processing units (CPUs). Each thread is then distributed to a CPU which performs the computations of that thread. If the program creates more threads than there are CPUs available, multiple threads will share a CPU. To increase the performance of the computation of word similarity, we have implemented threads in which these tasks are executed. The reason for using threads for this part is that many word lookups are done here, and even though it may not be a bottleneck, it is a part of the tool which cannot itself be optimized, since the requests are handled by an external API. The threads are connected to a monitor which contains a list of tasks, i.e. pairs of words to compare. A thread requests a task, executes it, sends the result back to the monitor and requests a new task, until every task has been executed. The reason for using a monitor is that only one thread at a time can use the functions contained by the monitor. If we did not use a monitor, we could risk two threads computing the similarity of the same word pair, or other unexpected problems.
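The monitor and worker threads described above might be sketched as follows. All names are illustrative, and a placeholder similarity measure (length of the shared prefix) stands in for the WordNet-based one so the sketch is self-contained:

```java
import java.util.*;

// Monitor holding a task list of word pairs; worker threads repeatedly ask
// for the next pair and report the result back. The synchronized methods
// guarantee that no pair is handed out or recorded twice.
class SimilarityMonitor {
    private final Deque<String[]> tasks = new ArrayDeque<>();
    private final Map<String, Integer> results = new HashMap<>();

    SimilarityMonitor(List<String[]> pairs) { tasks.addAll(pairs); }

    synchronized String[] nextTask() { return tasks.poll(); }  // null when no tasks are left
    synchronized void report(String[] pair, int score) { results.put(pair[0] + "/" + pair[1], score); }
    synchronized Map<String, Integer> results() { return new HashMap<>(results); }

    // placeholder similarity: length of the common prefix of the two words
    static int similarity(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    static Map<String, Integer> compareAll(List<String[]> pairs, int threadCount) {
        SimilarityMonitor monitor = new SimilarityMonitor(pairs);
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread worker = new Thread(() -> {
                String[] task;
                while ((task = monitor.nextTask()) != null)  // each pair is handled exactly once
                    monitor.report(task, similarity(task[0], task[1]));
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return monitor.results();
    }
}
```

The final result map is the same regardless of how the scheduler interleaves the workers, since each task is removed from the queue exactly once.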

7.2.1 Results

The usage of threads has increased the overall performance of a comparison between two texts by a few percent when two threads are used. An additional increase in performance could be achieved by using a computer with more than two processing units. A folder comparison receives a 15% increase in performance when two threads are used. It is also possible that implementing threads in other parts of the tool could improve it as well. For instance, assigning a thread to each pair of texts in a folder comparison should in theory improve this comparison proportionally to the number of processing units, but this is left to further development.


Chapter 8

Testing

Testing cannot be seen as one independent part of a software development process; every part undergoes tests during implementation. The point of this approach is to discover errors early in the process, because the earlier an error is found and corrected, the less expensive it is. This is obvious, since when correcting an error in the core parts, other parts may experience new problems, since they depended on the incorrect code. Testing need not only be a test of correctness. For example, in a project like this, the result can be compared to human evaluation, and a test can be conducted from this. It is also possible to do some testing of efficiency. This chapter describes the tests we have conducted and discusses the results. The results themselves are not part of the chapter, but can be viewed in appendix C.

8.1 General performance

The performance of the tool is important, because a computation could take a huge amount of time, meaning that the tool would not be very useful to the user. It seems obvious that a text of greater length requires more computation time, but it is not so obvious how much more time is required. It could be interesting to see how the performance reacts when longer texts are provided. To determine this, we have created a graph for each method which shows the relation between the total number of words in the texts chosen for comparison and the computation time. The graphs are included in appendix C. All of the graphs tend to be polynomial. The graph for the edit distance is easily explained: the running time of the Levenshtein algorithm is O(nm), where n and m are the lengths of the given strings. Every word in the first sentence is compared to every word in the other sentence, so the running time of a similarity measurement of two sentences is also polynomial. This also applies to the similarity measurement of two texts, since every sentence in one text is compared to every sentence in the other text. This means that the running time of measuring similarity using edit distance is polynomial. The computations using WordNet are a bit different, since they use the adapted Lesk algorithm, which uses LCS. The running time of LCS is also O(nm). The Lesk algorithm is run for every pair of words in a sentence. From WordNet, the senses of each word are obtained. For each pair of senses, definitions are obtained from WordNet, and the LCS of each pair of definitions is calculated. All of these measurements are done in polynomial time, thus the running time of the Lesk algorithm is polynomial. From this, we conclude that the running time of measuring similarity using WordNet is polynomial. This also holds when WSD is disabled, since the comparison of sentences runs in polynomial time.

8.1.1 Timing in Java

An easy method for performance comparison is to measure the running time of one or more central algorithms in an application. Timing in Java is not straightforward because of how memory is managed in the Java Virtual Machine (JVM). The JVM uses a data structure to keep track of the objects being used by the running application and of the memory allocated by the different objects. Each time a new object is instantiated, a chunk of memory is automatically allocated for it by the JVM. Likewise, it is up to the JVM to free unused memory allocated by objects that are no longer in use. This task is called garbage collection (GC) and refers to “throwing away” unused objects. The GC in the JVM provides several advantages, but also some disadvantages. Among the advantages is that memory management is no longer left as a task for the programmer. Therefore a lot of time-consuming bugs like memory leaks are avoided, which leads to an easier and faster development process. Unfortunately, tasks like timing an application become more complicated, since the JVM runs a thread handling the GC concurrently with the chosen application. This will inevitably result in some loss of CPU performance in the application, because the hardware has to handle several tasks. Nevertheless, we have chosen to perform time measurements in Java. Timing in Java may not give a very precise measure of the time consumption of an algorithm, but it still gives an idea of how well an application performs. A measurement of running time is to some degree usable, as long as it is not used to compare an algorithm implemented in two different programming languages.

8.2 Human evaluation

The results obtained using the tool are not very useful without knowledge of how accurate they are; that is, whether the results are comparable to how similar a human would consider the texts. This comparison is a kind of validation of the tool, since it provides an evaluation of how accurate the tool is. We have made a test paper, which has been sent to 15 people. A test person is supposed to read 7 texts and fill out a table with values between 0 and 10. A value of 10 means the texts are basically equal, while 0 means they are not equal at all. The reason for using so few test persons is that the point of this test is not to perform a statistical analysis of the tool, but rather to show the tendency of the results from the tool. By taking the average of each test person's values, a similarity matrix based on human evaluation is obtained. The result is shown in table C.1.

8.3 Efficiency

Different methods have been implemented for measuring similarity between texts. The methods are very different in nature, thus there will be differences in result and performance. In the previous section we discussed testing using human evaluation. By comparing the evaluations to the result from each method, and also taking computation time into account, it is possible to discover which method is the most efficient. The results of comparing the same texts using WordNet and edit distance (table C.4) are shown in appendix C. Two similarity matrices have been created using WordNet: one where hyponymy is used in the Lesk algorithm (table C.3), and one where it isn't (table C.2). This allows us to discover whether hyponymy improves Lesk. We have also created a similarity matrix for a computation using both measures (table C.5). In this computation, hyponymy has been disabled. To get a better view of the results, a scatter plot has been created for each of the similarity tools in the application. These plots can be seen in appendix C.2. A correlation coefficient can be measured for each of these plots to determine how closely the results relate. The coefficients are shown in table 8.1. The correlation coefficient indicates how strong the relation between the two similarity matrices is: the closer the correlation coefficient is to 1, the stronger the relationship. The results are uncorrelated if the correlation coefficient is 0. The results in the table show that the correlation between edit distance and the human evaluation is very high. This means that using edit distance, we get results similar to the human results. It is also seen that the computation time is much higher when using WordNet. Comparing the performance of the WordNet computation including hyponymy in Lesk with the performance of edit distance, it is seen that the tool performs 200 times faster using edit distance. By disabling hyponymy, the performance is increased 10 times, meaning edit distance performs 20 times faster.

Table 8.1: Result of the comparison

  Method                        Correlation coefficient   Total time
  WordNet: hyponymy disabled    0.7515                    98.57 seconds
  WordNet: hyponymy enabled     0.7615                    1005.70 seconds
  Edit distance                 0.9571                    5.44 seconds
  Combination                   0.8603                    102.11 seconds

8.3.1 Improving the results

The sigmoid function, which was discussed in chapter 3, is implemented such that it contains several variables which can be modified to try to change the result of a similarity computation. By changing these variables, we might be able to improve the results obtained using WordNet, compared to the human evaluation. Finding good parameters for the sigmoid function can be hard, since it requires many similarity measurements and comparisons with the human evaluation. Instead, we have changed the parameters manually in order to determine whether it is possible to improve the results. Table 8.2 shows the results of the folder comparison after changing the parameters of the sigmoid function. The table shows that it is indeed possible to improve the results of the WordNet computations, but a perfect parameter combination might be hard to find, since the other results are actually worse when using the new parameters.
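As an illustration of such a parameterization, a sigmoid with a steepness parameter a and a midpoint parameter b could look as follows. The exact formula and parameter names from chapter 3 are not reproduced here, so these names are our own:

```java
// Hypothetical parameterized sigmoid: a controls steepness, b the midpoint.
class Sigmoid {
    static double value(double x, double a, double b) {
        return 1.0 / (1.0 + Math.exp(-a * (x - b)));  // maps any x into (0, 1)
    }
}
```

Increasing a makes the transition between "dissimilar" and "similar" sharper, while shifting b moves the point at which the output crosses 0.5.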

Table 8.2: New values in the sigmoid function

  Method                        Deviation
  WordNet: hyponymy disabled    0.7970
  WordNet: hyponymy enabled     0.7459
  Edit distance                 0.8453
  Combination                   0.8395

8.3.2 Choosing the best method

The test results show very clearly that, with regard to the texts used, the Levenshtein distance is the most accurate of the methods. Since it also performs much faster than the WordNet method, it seems easy to conclude that this method is the best. This is a bit surprising, since the WordNet method uses semantic similarity and should be able to find similarities in texts where the meaning is the same but the words are different. We could argue that the reason for this result is that the human evaluation has been done too coarsely. A human might conclude very quickly that texts are alike when many words are contained in both texts. If each participant had been required to spend at least an hour comparing the texts, differences which are not obvious when the text is skimmed could have been found, which could lead to a result closer to the results of the WordNet method.


8.4 Functional test

Functional testing is also known as black box testing. This kind of testing is used to make sure that all parts work as intended. To do this we design our test cases so that they cover as much as possible. This is a cost/benefit decision, as it is always possible to find new input for a method without gaining much more information about the correctness of the program. A rule of thumb is to test each method with:

• normal inputs, like what will normally be fed to the method
• extreme inputs, like very high or low values, or trying to enter a state that is not legal
• border inputs: if the method has one or more border values, they should be tested
• exception handling: can the system handle all the exceptions thrown from the method?

8.4.1 Tests

We have implemented JUnit test cases to test the functionality of several parts of the application. The following functionalities are tested using JUnit:

• LCS
• Removal of stop words
• Removal of non-alphanumerics
• Splitting a text into sentences
• Splitting a sentence into words
• Finding POS tags
• Adapted Lesk algorithm
• LCA
• Word similarity using WordNet
• Word similarity using Edit distance
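As an illustration of what one of these test cases covers, here is a sketch of an LCS length computation with the kind of assertions a JUnit test would make (plain assertions are used here to keep the example self-contained; the class and method names are ours):

```java
public class Lcs {
    // Classic dynamic-programming LCS length: dp[i][j] is the LCS
    // length of the first i characters of a and the first j of b.
    static int length(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.length()][b.length()];
    }
}
```

A test would then assert on known inputs, e.g. that "ABCBDAB" and "BDCABA" share a subsequence of length 4, and cover border cases such as empty strings.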


Appendix C contains further explanation of the inputs used in the tests. The tests can be run in Eclipse by creating a test suite, which will run every test case. A view in Eclipse shows the progress of a run and its results.

8.4.2 Results

The tests should be run every time changes are made to the functionality, since a change can affect the functionality of other parts. These JUnit test cases have helped us find errors in different parts and allowed us to ensure that the model works as expected.

8.5 Graphical User Interface

When the functional tests have shown that the parts of the tool work correctly, they must be tested together. The graphical user interface can be used for this, since the GUI uses the model as a whole. An automatic test is difficult to conduct, so instead we describe use cases for testing the GUI. If the results are as expected, we say the tool is working correctly. The use cases can be seen in appendix C.

8.5.1 Results

The use cases have been used for testing the GUI, and through this test some errors were discovered. These errors were corrected and the use cases were run again. This way, we ensure that errors do not occur in the cases covered by the use cases.


Chapter 9: Discussion

The work with the similarity tool has been described in the previous chapters. Several extensions have been proposed, and some are implemented in the final tool. Not every extension has been implemented: some were left out because of the time schedule, and some because the implementation turned out to be too difficult and the final result would not have been satisfactory. This chapter discusses these extensions and how they could have been implemented.

9.1 File formats

One of the primary objectives of the final application is to provide the user with a utility for text comparison. Obviously, the user has to provide some textual material to the application in order to get a result. To improve usability, it would be a good idea to support several document types. We have chosen to look into the PDF file format, since this is what we would find most useful. One advantage of the PDF format is that most popular text editors, such as Microsoft Word and OpenOffice, are able to convert a written document into a PDF file. This would lead to indirect support of several document types.


PDF is an abbreviation for Portable Document Format. The format is very commonly used, especially on the Internet, because it provides several benefits. One of the largest is that a document is shown in exactly the same way in all viewers across all hardware platforms, which is one of the main reasons for the extensive use of the format. Unfortunately, it is more complicated to import text from PDF files than from ordinary text files. Because of this we would need some kind of Java API to make programmatic reading of PDF files easy. Several APIs provide this ability, but none of them are free. One option could be to implement this functionality ourselves, but because it would be a large and irrelevant task in this project, we have left the support of additional file formats to further development.

9.2 Command line usage

Many applications function without a graphical user interface. Instead, they allow the user to specify settings at startup using the command line. This could be seen as a huge disadvantage, since the ordinary user is accustomed to a GUI and would not be satisfied with a command line tool. On the other hand, a terminal does not offer the same tools for visualizing results, so progress is difficult to display, and if progress is sent to the GUI often, this can slow down the overall computation. A silent command line tool running in the background would not be subject to this decrease in performance. As a background process it could be used to search for optimal parameters, for instance for the sigmoid function, by instantiating the tool many times and, once the computations have finished, choosing the parameters with the best results. Implementation of command line usage should probably be placed in the control part of the tool: a class taking care of extracting settings from the parameters and starting the computations could be created. Since the model uses the control to send progress and results to the GUI, it should not be a huge task to send these results to the command line instead. This is left to further development.
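Such a class could start as a simple loop over --key value pairs; the option names below are hypothetical, and the control layer would translate the resulting map into Settings fields:

```java
import java.util.HashMap;
import java.util.Map;

public class CliSettings {
    // Parses arguments of the form "--key value" into a map.
    // Unrecognized shapes fail fast so the user sees the mistake.
    static Map<String, String> parse(String[] args) {
        Map<String, String> settings = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (!args[i].startsWith("--"))
                throw new IllegalArgumentException("Expected an option, got: " + args[i]);
            settings.put(args[i].substring(2), args[i + 1]);
        }
        return settings;
    }
}
```

For example, `java Similarity --tool wordnet --hyponyms true` would yield a map with tool=wordnet and hyponyms=true.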

9.3 Adjusting evaluations

The sigmoid function contains parameters which can be changed in order to manipulate the function and thereby adjust the evaluations. It was shown in chapter 8 that this manipulation can indeed adjust the evaluations. To find an optimal setting, many tests need to be run. This suggests an addition to the GUI where the user can adjust the parameters and perhaps at the same time see the resulting sigmoid function. Another possibility would be to implement an algorithm that finds optimal parameters when given a similarity matrix of human evaluations. The application should then compare this similarity matrix with the resulting similarity matrix from the tool and adjust the sigmoid function until the results are satisfactory. This could for instance be done using genetic algorithms.
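Short of genetic algorithms, even a coarse grid search over the two sigmoid parameters would automate the manual tuning from chapter 8. A sketch, where the parameter names, ranges and step sizes are our assumptions:

```java
public class SigmoidFit {
    static double sigmoid(double x, double a, double b) {
        return 1.0 / (1.0 + Math.exp(-a * (x - b)));
    }

    // Returns {a, b} minimizing the squared deviation between the
    // sigmoid-adjusted raw scores and the human evaluations.
    static double[] fit(double[] raw, double[] human) {
        double bestA = 1, bestB = 0, bestErr = Double.MAX_VALUE;
        for (double a = 1; a <= 20; a += 0.5) {
            for (double b = 0; b <= 1; b += 0.05) {
                double err = 0;
                for (int i = 0; i < raw.length; i++) {
                    double d = sigmoid(raw[i], a, b) - human[i];
                    err += d * d;
                }
                if (err < bestErr) { bestErr = err; bestA = a; bestB = b; }
            }
        }
        return new double[]{bestA, bestB};
    }
}
```

With the similarity matrices flattened into the raw and human arrays, the returned pair could be used directly as the new sigmoid setting, or as a starting point for a finer search.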

9.4 Neural networks

One of the extensions we have built into the application is a graphical representation of the relation between a collection of texts, as described in chapter 4. The main problem when creating such a graph is that each text in the similarity matrix has a comparison score related to every other text, including itself. This makes the relations between the texts very hard to visualize, since the dimension of the matrix is equal to its size. In order to visualize it, we need to reduce the number of dimensions of the similarity matrix; a graphical representation of such a relation is usually drawn in two or three dimensions. There exist several methods for reducing a multidimensional space to fewer dimensions. A commonly used approach is a clustering algorithm, which groups data with similar characteristics; in this case we would like to group similar texts and separate less similar texts. One way to achieve data clustering, and thereby a reduction of the multidimensional space, is an artificial neural clustering network like the self-organizing map, also called a Kohonen map or Kohonen network [8]. The self-organizing map produces a low-dimensional representation, typically two or three dimensions, of the input space, while preserving its topological relations. By doing this, the algorithm can group input data into a predetermined number of groups. Unfortunately, the use of a Kohonen map has some disadvantages. A user who provides a collection of texts and expects a graphical representation of the relation between them


does not always have an idea of how many groups the texts should form, which would be needed to create a reasonable clustering. Another disadvantage of a Kohonen map is that the technique requires a very large data set to perform an accurate clustering. This would lead to a very poor result if the user wants to compare a smaller collection of texts.

9.5 Additional optimizations

We have discussed optimizations of different parts of the tool. Overall the tool has improved performance-wise, but additional optimizations are still worth discussing. As already discussed, threads could be implemented in other parts of the tool to improve performance there as well. Since the LCS computation still takes a lot of time, threads could improve it. This only helps on computers with more than one processing unit, but since this is often the case, it would be a reasonable optimization.
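One way to divide such work is a fixed thread pool with one task per row of a similarity matrix; a sketch, where wordSimilarity is a cheap stand-in for the real, expensive computation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompare {
    // Fills a similarity matrix for all pairs, one task per row,
    // using as many worker threads as there are processors.
    static double[][] compare(String[] s1, String[] s2) {
        double[][] sim = new double[s1.length][s2.length];
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        Future<?>[] rows = new Future<?>[s1.length];
        for (int i = 0; i < s1.length; i++) {
            final int row = i;
            rows[i] = pool.submit(() -> {
                for (int j = 0; j < s2.length; j++)
                    sim[row][j] = wordSimilarity(s1[row], s2[j]);
            });
        }
        for (Future<?> f : rows) {
            try { f.get(); } catch (Exception e) { throw new RuntimeException(e); }
        }
        pool.shutdown();
        return sim;
    }

    // Stand-in for the real similarity computation.
    static double wordSimilarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}
```

Because each task writes only to its own row, no locking is needed on the matrix itself; the Future.get() calls act as the barrier before the result is used.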

9.6 Similarity using WordNet

Semantic similarity has been computed by performing graph searches in WordNet. The results from the graph search are then fed into a method which computes the similarity. We have discussed different methods which can be used to compute similarity using WordNet, but have only used the method proposed by Wu and Palmer. It could have been interesting to compare the results of other methods, for instance to discover which of them computes the best results with regard to the human evaluations. If other methods were implemented, the user should be able to choose which method to use and compare the performance and the results, just as is possible with the current WordNet method and the edit distance.
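The Wu and Palmer measure [18] relates the depth of the lowest common ancestor to the depths of the two synsets. In one common formulation (the thesis uses equation 3.1, which is not reproduced here):

```java
public class WuPalmer {
    // Similarity = 2 * depth(LCA) / (depth(a) + depth(b)), where
    // depth counts edges from the root of the hypernym hierarchy.
    static double similarity(int depthLca, int depthA, int depthB) {
        return 2.0 * depthLca / (depthA + depthB);
    }
}
```

The measure is 1.0 when both synsets coincide with their common ancestor and falls towards 0 the deeper the synsets lie below it, which makes alternative measures such as Resnik's [14] directly comparable on the same LCA data.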

Chapter 10: Conclusion

The main goal of this project was to implement a tool for measuring textual similarity between texts. Two methods have been proposed as solutions to this problem: measuring the semantic similarity, a special case of textual similarity where the meaning of words is taken into account, using the lexical database WordNet; and measuring the similarity using edit distance. Both methods have been successfully implemented in the final tool. As a minimum requirement, it was specified that it must be possible to compare the performance of the methods. This functionality has been implemented and visualized in the graphical user interface as two bar charts showing the performance of each method. We have conducted several tests of the functionality and the code to ensure that the tool works correctly. These tests ran successfully, so we conclude that all known errors are removed and the functionality of each method is as expected. This alone is, however, not enough for the final tool to be satisfactory: if the measurements are inaccurate in terms of what a human would consider similar, the tool is not very useful. Therefore, several test persons have evaluated the similarity of a number of texts. The average similarity value for each pair of texts has then been compared to the results given by the tool, to validate


that the tool provides useful data. The results have shown that the use of edit distance yields the best results, with a correlation coefficient of 0.96, while the use of WordNet gave a coefficient of 0.76. It was also shown that the results of a WordNet comparison could be improved by adjusting the sigmoid function, which is the final step of a measurement.

Several of the proposed extensions have been implemented. The design of the similarity engine made it easy to implement extensions which enable the user to compare more than two texts at a time. By using genetic algorithms we have visualized the results of a comparison between multiple texts, creating a map where similar texts are grouped together.

The tool has been subject to different optimizations, all of which have improved the overall performance. The most effective optimization was a restructuring of the text processing code, which led to an increase in performance of up to 4 times. Threads were implemented to allow the use of multiple processing units; this optimization increased performance by up to 15% on a computer with two processing units.

The final result has been satisfactory. The tool meets the minimum requirements and several extensions have been implemented. By comparing the measurements performed by the tool with human evaluations, we can conclude that the tool is able to measure textual similarity with little deviation from what a human would consider similar.

Appendix A: User manual

This appendix describes how to use the main features of the application, including comparing two or more texts. The main window of the tool consists of four areas (see figure A.1). The usage starts in area A, where the user must specify the method used in the comparison and the texts which should be compared. Area B contains the console and progress bar, which are updated during a comparison. In area C the user can specify various settings for the comparison, such as which tool to use. Area D is only used for comparisons between two texts; here the sentences of each text appear, so that the user can compare them individually.

A.1 Comparison of two texts

To compare the similarity of two texts, the method must be set to “Compare two texts”. The user must specify the location of the texts to compare and then has the possibility of specifying different settings, such as whether hyponyms and hypernyms should be used for word sense disambiguation. To begin the comparison, the user must click “Compare texts”.

Figure A.1: The main window of the tool.

The progress bar and the console output will then begin to update, to show the user that the comparison has started. This is seen in figure A.2(a). When the comparison is done, a result window appears. This window shows the result of the comparison as a percentage describing how similar the texts are. The processing time is also visible, along with a graph showing which parts of the computation took the most time. Figure A.2(b) shows this window.

A.2 Performance comparison

Figure A.2: The main window when a comparison has been started is shown in (a); the result window of the comparison is shown in (b).

When the method “Compare two texts” is selected, it is possible to compare the performance of the two methods. This includes the resulting similarity score for each method as well as the computation time and a performance chart. The settings for WordNet can be chosen in area C, which makes it possible to test the performance of edit distance and WordNet without using WSD. It is not possible to compare different WordNet settings. The comparison is started by clicking the button “Compare performance”. The console will state which of the methods is being used, and the progress bar will show how far along the computation is. When the comparison is finished, the window shown in figure A.3 appears. Both charts are scaled to the same range, which makes it easier to compare the performance.

A.3 Folder comparison

A comparison of the contents of a folder is done by choosing the method “Compare folder content”. The user must specify the location of the folder to be used for the comparison and can specify settings in the same way as when comparing two single texts. To begin the comparison, the button “Compare texts” is used. During the comparison, a similarity matrix is shown in the console output with the results so far. When the comparison is done, a result window showing the relation between the texts in the folder appears. This window is seen in figure A.4.

A.4 Other possibilities

The tool allows other comparisons, such as comparing a single text with the contents of a folder, or finding the similarity between single sentences. These methods are started using the same principles, so a description is not included in this appendix.


Figure A.3: The result window for a performance comparison.

Figure A.4: The result window for a comparison of folder content.

Appendix B: Implementation

The choices made in the design phase lead to the actual implementation of the application; in this phase, the decisions made during the design are realized. This appendix goes into detail about the implementation; an overview can be found in chapter 6. Methods which are needed to understand decisions are explained in detail, while other methods are only outlined or even omitted. The appendix also describes the application programming interfaces used for POS tagging and for access to the WordNet database from Java.

B.1 Model

The model takes care of computations and is thus the core part of the tool. As described in chapter 4, the model consists of two parts: graph searching and text processing. This leads to a natural partition of the model into these two parts. The graph searching part implements algorithms like LCA and maximum flow. The text processing part takes care of splitting texts into sentences, which are in turn split into words; algorithms like LCS and the Levenshtein distance are implemented in this part.


Figure B.1: UML diagram showing the part of the model containing graph searching.

B.1.1 Graph search

To perform a graph search through the WordNet database, we have created a class called Net. This class performs the general tasks of communicating with the GUI through the control and initiates the actual similarity computations. It is implemented as a Thread to avoid the GUI hanging while the model is working. It contains every synset used in the current context, i.e. every synset contained in the texts being compared and every hypernym of these. This means that the root synset, entity, will always be contained in the net. To be able to search the graph, the contained synsets must know their ancestors. For this purpose, we have created a class Node, which contains a synset, a list of hypernyms and some variables used for graph searching, such as the depth of the node.
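A minimal sketch of such a node, with the depth computed from the hypernym links; the real class caches more state, and synsets are reduced to plain strings here:

```java
import java.util.ArrayList;
import java.util.List;

public class Node {
    final String synset;
    final List<Node> hypernyms = new ArrayList<>();

    Node(String synset) { this.synset = synset; }

    // Depth = shortest distance to the root; the root ("entity")
    // has no hypernyms and therefore depth 0.
    int getDepth() {
        if (hypernyms.isEmpty()) return 0;
        int best = Integer.MAX_VALUE;
        for (Node h : hypernyms) best = Math.min(best, h.getDepth() + 1);
        return best;
    }
}
```

Taking the minimum over the hypernyms matters because a synset can have more than one hypernym, so several paths to the root may exist.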

Bipartite graph and flow network

Maximum flow in flow networks uses the principles of graph search; for a description of the maximum flow problem, refer to chapter 5. The class BipartiteGraph implements the structure of a bipartite graph and methods for transforming the graph into a corresponding flow network and finding the maximum flow of this network. A bipartite graph is created by instantiating the class using the constructor BipartiteGraph(String[], String[], double[][]). The String arrays contain the vertices V1 and V2, and the double array contains the edges E. Since each vertex on the left side is connected to every vertex on the right side, the edge matrix is simply a similarity matrix.

Figure B.2: Bipartite graph transformed into a flow network. The number on a node is its capacity.

The flow network is created by calling the method createFlowNetwork(). It initializes a number of nodes, two more than the number of vertices, since a source and a sink are added. The capacity of each node is decided by choosing the maximum similarity going to or coming from the specific node; see figure B.2 for an example. The maximum flow is found by reducing capacity while a new path between the source and the sink can be found (using breadth-first search).
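The capacity rule can be sketched directly from the similarity matrix; this is a simplified view of what createFlowNetwork() does, with class and method names of our own:

```java
public class Capacities {
    // For the left-hand vertices the capacity is the row maximum
    // of the similarity matrix (cf. figure B.2).
    static double[] leftCapacities(double[][] sim) {
        double[] cap = new double[sim.length];
        for (int i = 0; i < sim.length; i++)
            for (double s : sim[i]) cap[i] = Math.max(cap[i], s);
        return cap;
    }

    // For the right-hand vertices it is the column maximum.
    static double[] rightCapacities(double[][] sim) {
        double[] cap = new double[sim[0].length];
        for (double[] row : sim)
            for (int j = 0; j < row.length; j++) cap[j] = Math.max(cap[j], row[j]);
        return cap;
    }
}
```

The source is then connected to the left vertices and the right vertices to the sink, and the breadth-first augmenting-path search runs on the resulting network.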

Genetic algorithm

The genetic algorithm for creating a similarity map is implemented using three classes: GeneticAlgorithm, Population and Individual. The actual computations are made in Population and Individual, while GeneticAlgorithm contains the methods which can be called from other parts of the tool. The method getOptimalSolution(double[][]) is used to get a similarity map for a collection of texts. It takes a similarity matrix as a parameter and instantiates a random population. This population is then altered a number of times using the best individuals; when this is done, the best individual is taken as the solution.
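As a hedged illustration of the generational loop behind getOptimalSolution, here is a toy genetic algorithm maximizing the number of 1-bits; the real fitness function instead scores how well distances on the map match the similarity matrix:

```java
import java.util.Arrays;
import java.util.Random;

public class ToyGa {
    // One generation: keep the fitter half, refill the rest with
    // mutated copies of the survivors (one random bit flipped).
    static int[] evolve(Random rnd, int bits, int popSize, int generations) {
        int[][] pop = new int[popSize][bits];
        for (int[] ind : pop)
            for (int i = 0; i < bits; i++) ind[i] = rnd.nextInt(2);
        for (int g = 0; g < generations; g++) {
            Arrays.sort(pop, (a, b) -> fitness(b) - fitness(a));
            for (int i = popSize / 2; i < popSize; i++) {
                pop[i] = pop[i - popSize / 2].clone();
                pop[i][rnd.nextInt(bits)] ^= 1; // mutate one bit
            }
        }
        Arrays.sort(pop, (a, b) -> fitness(b) - fitness(a));
        return pop[0];
    }

    // Toy fitness: the number of 1-bits in the individual.
    static int fitness(int[] ind) {
        int sum = 0;
        for (int b : ind) sum += b;
        return sum;
    }
}
```

Because the top half is carried over unchanged, the best fitness never decreases from one generation to the next, which is the elitism the tool relies on when it finally picks the best individual.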


Figure B.3: UML diagram showing the part of the model containing text processing.

Figure B.1 shows the UML diagram for the graph searching part of the model.

B.1.2 Text processing

The text processing consists of three classes concerning the parts of a text (the entire text, sentences and words) and three classes concerning the similarity of those parts.

Processing the texts

The processing of a text is done using the class FullText. It is instantiated using the constructor FullText(Net, String, String). The reference to the Net is used to access words and to update the GUI console; each String is a text. The class contains a method for preprocessing, preprocess(boolean), and a method for computing the similarity value between the specified texts, compute(boolean). The boolean value determines whether progress should be shown in the GUI. This is not always wanted: for instance, when computing similarity between texts in a folder, the standard progress output is not wanted, so false is given as a parameter. The preprocessing consists of splitting the text into sentences using the techniques described in chapter 3. The method for computing a similarity value is basically a loop which computes the similarity for every combination of sentences. The result is saved in a similarity matrix. Afterwards, a bipartite graph is created using the matrix and the similarity is computed.


Processing sentences

Comparing sentences uses two classes: Sentence and CompareSentences. Sentence is used to preprocess the sentences, i.e. split them into words, determine the POS of each word and perform word sense disambiguation using the Lesk algorithm. The Lesk algorithm is implemented in the class Lesk, which contains a method for computing and a method for collecting word definitions from WordNet. Whenever a new Sentence is created, the sentence is split into words, and the correct sense of each word is chosen. CompareSentences is used to compare every pair of sentences in a text; it can also be used to find the similarity between two single sentences. The structure is much like that of FullText, since the computation is done by creating a similarity matrix using the words in each sentence and finding the overall similarity using a bipartite graph.
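The heart of the Lesk computation is an overlap count between word definitions (glosses). A minimal sketch; the adapted algorithm of Banerjee and Pedersen [1] also considers glosses of related synsets, and stop words would normally be removed first:

```java
import java.util.HashSet;
import java.util.Set;

public class LeskOverlap {
    // Counts the distinct words that two glosses share; the sense
    // whose gloss overlaps most with the context is then chosen.
    static int overlap(String gloss1, String gloss2) {
        Set<String> words = new HashSet<>();
        for (String w : gloss1.toLowerCase().split("\\s+")) words.add(w);
        int count = 0;
        Set<String> counted = new HashSet<>();
        for (String w : gloss2.toLowerCase().split("\\s+"))
            if (words.contains(w) && counted.add(w)) count++;
        return count;
    }
}
```

For each candidate sense of an ambiguous word, the gloss is scored against the glosses of the surrounding words, and the sense with the highest total overlap wins.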

Processing words Word are processed in the classes Words and SimilarityThread. Instances of SimilarityThread is created in CompareSentences. The number of instances depends on the number of threads specified.1 Each instance is connected to a monitor, which gives each thread a task, i.e. a word pair. The thread creates an instance of Words by using the constructor Words(Net) and depending on the settings call either similarity(Synset, Synset) (if WordNet is chosen as tool) or editDistSimilarity(String, String) (if edit distance is chosen). similarity(Synset, Synset) finds the lowest common ancestor using the method LCA(Synset[], Synset[]). When the LCA has been found, the depth of each node is found by calling getDepth() in Node. The similarity can then be found using equation 3.1 editDistSimilarity(String, String) calls levenshteinDistance(String, String) which returns an integer value. The similarity is then found using equation 3.3. The resulting similarity value is inserted into a similarity matrix which is what CompareSentences will use in the bipartite graph to compute the final value for the sentences. See figure B.3 for a UML diagram of the text processing part.

1 It was decided that the tool could be optimized using threads to divide computation to multiple processing units. See chapter 7 for more information.
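The edit-distance branch can be sketched end to end. The normalization shown here (one minus distance over the longer string length) is a common choice and stands in for equation 3.3, which is not reproduced:

```java
public class EditDist {
    // Standard dynamic-programming Levenshtein distance [11].
    static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + cost,
                           Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        return dp[a.length()][b.length()];
    }

    // Maps the distance into [0, 1]: identical strings give 1.0.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }
}
```

The distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other, so similar surface forms score high even when WordNet knows nothing about the words.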


Figure B.4: UML diagram showing the entire model.

B.1.3 Connection between parts

The implementation of each part of the model has been described, but interaction exists between the two parts. It has already been mentioned that the class Words uses methods from Node, which is in a different part; another core interaction is the use of BipartiteGraph. What remains is the interaction between Net and the processing of texts. The interactions can be seen in figure B.4. When a similarity computation is requested by a user, methods in Net are called. Because it works as a Thread, flags are set in the control when the GUI requests something; as soon as the Net sees a flag, it starts the requested task. There are several different flags which can be set:

• fullTextSim – requests similarity between two texts
• sentenceSim – requests similarity between two sentences
• doCompare – requests comparison of tools
• fullFolderCompare – requests similarity between texts in a folder


• folderAndTextCompare – requests similarity between a single text and a folder

Depending on which flag is set, different methods are called. Obviously, if the flag requesting similarity between two sentences is set, the method called will only use CompareSentences, while the other methods use the entire model, i.e. instantiate FullText. When a computation is finished, a method is called which sends the results to the GUI. Different methods exist for sending results, since the results are different by nature.

B.1.4 Application programming interfaces

Some of the features of the tool would be too great a task to implement in a project like this, which means we need other software to perform some of the tasks for us. This can be done using an application programming interface (API): a source code interface providing standardized requests, which lets a programmer use the functionality of other tools in his own programs.

Java API for WordNet Searching

The WordNet database consists of several data files containing the actual data. The files are cross-linked, and it would require a lot of work to be able to search the database efficiently. The API called Java API for WordNet Searching (JAWS) [15] provides exactly the functionality we need. It consists of a JAR file which contains the requests needed for lookups in the database; the user must provide the database and specify its location in his own program. An instance of a WordnetDatabase is created, and a synset can be looked up using the method getSynset(String). The result is an array of synsets containing the different senses of the word.

QTAG

Tagging words to find the correct part of speech is, as mentioned, a huge task, so we use an API called QTAG, described as "a probabilistic parts-of-speech tagger" [12], meaning that it uses statistical methods. It does make mistakes, but is generally accurate.


The tagger is initialized by creating an instance of a Tagger. A sentence can be tagged using the provided method tag(String[]), which returns an array containing the suggested POS for each word in the sentence. The tags are generally strings of 2-3 characters, describing the POS and whether the word is singular or plural; we only use the POS.

B.1.5 Other functionality

The functionality described so far is concerned with the computation of similarity. Some of it uses methods from other classes, which have not been described yet:

• Functions – contains general functions used for the computations
• Preprocessing – contains functions used for preprocessing the texts
• Vars – contains different variables
• Settings – contains settings, which are saved to a file and loaded when the application is started

The class Functions contains implementations of LCS and the Levenshtein distance as well as the adapted Lesk algorithm; for a description of these algorithms, see chapter 5. Preprocessing contains methods for removing stop words, splitting a sentence into words and stripping non-alphanumeric characters from a string. The method removeMostUsedWords(String[]) removes the most used words from a list of words. stripSentence(String) uses a regular expression to remove invalid characters: [^a-zA-Z0-9 .!?]. splitSentence(String) splits a sentence when a full stop, exclamation mark or question mark appears. Because abbreviations can occur, the method checks the resulting array for very short strings, which could indicate that an abbreviation has been split; if two short strings appear immediately after each other, they are concatenated. The class Vars contains different variables used in the tool, including an instance of the WordNetDatabase and an instance of Tagger.
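A sketch of the two string helpers described above; the abbreviation handling is simplified to a short-fragment merge rule, and the class name is ours:

```java
import java.util.ArrayList;
import java.util.List;

public class Preprocess {
    // Removes every character that is not alphanumeric, a space
    // or a sentence terminator.
    static String stripSentence(String text) {
        return text.replaceAll("[^a-zA-Z0-9 .!?]", "");
    }

    // Splits on sentence terminators; very short fragments are
    // taken to be split abbreviations and merged with the
    // previous fragment.
    static List<String> splitSentences(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("[.!?]")) {
            String s = part.trim();
            if (s.isEmpty()) continue;
            if (s.length() <= 2 && !result.isEmpty())
                result.set(result.size() - 1, result.get(result.size() - 1) + " " + s);
            else
                result.add(s);
        }
        return result;
    }
}
```

The length threshold for the abbreviation heuristic is a guess; it trades missed abbreviations against wrongly merged short sentences.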


Figure B.5: UML diagram showing the view.

Settings contains variables with the settings that can be changed in the GUI. These settings include word sense disambiguation settings, the similarity tool, the similarity method and the locations of external files such as the WordNet database.

B.2 View

The GUI consists of six classes:

• MainFrame – the main window
• Results – shows results of similarity between texts and of performance comparisons
• SimilarityMap – shows the result of similarity between texts in a folder
• FolderTextResult – shows the result of similarity between a text and the contents of a folder
• SettingsDialog – dialog used for specifying path settings
• AboutDialog – about dialog

Each window is built using a GUI editor for Eclipse called Jigloo GUI Builder, which means that most of the implementation of the view is made using “drag and drop”. Only a few methods have been implemented to display the more advanced parts of the GUI; these methods are discussed below.


Figure B.6: UML diagram showing the control.

Results window

Within the Results window, two private classes are implemented, PerformanceCanvas and HeatMapCanvas, both extending Canvas. As the names suggest, they are used to draw the performance chart and the heat map. Both contain the method paint(Graphics), which implements the drawing.

PerformanceCanvas contains a list of timings and labels. It is set using drawGraph(ArrayList

[The remainder of appendix B and most of the pseudocode appendix are missing here; only the tail of the word sense disambiguation pseudocode survives:]

    if overall > max then
        update count table for each word
        max ← overall
    end
    ...
    select word sense with highest count from each table

Appendix E: Source code

The source code has not been included in the report, since it contains approximately 6000 lines of code, which would consume more than 100 pages. The source code, along with javadoc and a compiled version of the tool, is available at www.damn.dk/similarity/. The application is licensed under the GNU General Public License, version 3.


Bibliography

[1] Satanjeev Banerjee and Ted Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pages 136–145, 2002.

[2] Alexander Budanitsky and Graeme Hirst. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics, 2001.

[3] Rudi Cilibrasi and Paul M. B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, page 370, 2007.

[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press, 2001.

[5] Thanh Ngoc Dao. An improvement on capturing similarity between strings [online]. July 2005. Available from: http://www.codeproject.com/KB/recipes/improvestringsimilarity.aspx [cited 2008-06-03].

[6] Thanh Ngoc Dao and Troy Simpson. Measuring similarity between sentences [online]. October 2005. Available from: http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx [cited 2008-05-28].

[7] Google. Google search features – spell checker [online]. 2008. Available from: http://www.google.com/help/features.html#spell [cited 2008-06-23].

[8] Fakhreddine O. Karray and Clarence De Silva. Soft Computing and Intelligent Systems Design: Theory, Tools and Applications. Addison Wesley, 2004.


[9] Princeton University Cognitive Science Laboratory. Morphy(7WN) manual page [online]. Available from: http://wordnet.princeton.edu/man/morphy.7WN [cited 2008-06-02].

[10] Michael E. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. Proceedings of the 5th Annual International Conference on Systems Documentation (ACM SIGDOC), pages 24–26, 1986.

[11] Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, pages 707–710, 1966.

[12] Oliver Mason. QTAG [online]. Available from: http://www.english.bham.ac.uk/staff/omason/software/qtag.html [cited 2008-06-10].

[13] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.

[14] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, pages 95–130, 1999.

[15] Brett Spell. Java API for WordNet Searching (JAWS) [online]. June 2007. Available from: http://engr.smu.edu/cse/dbgroup/sw/jaws.htm [cited 2008-06-10].

[16] Wikipedia, the free encyclopedia. Matching [online]. 2008. Available from: http://en.wikipedia.org/wiki/Matching [cited 2008-06-02].

[17] Wikipedia, the free encyclopedia. Natural selection [online]. 2008. Available from: http://en.wikipedia.org/wiki/Natural_selection [cited 2008-06-25].

[18] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.