Measuring the Similarity and Relatedness of Concepts : a MICAI 2013 Tutorial Ted Pedersen, Ph.D. University of Minnesota Department of Computer Science, Duluth http://www.d.umn.edu/~tpederse
[email protected]
What (I hope) you will learn! ●
●
● ●
●
The distinction between semantic similarity and relatedness (and why both are useful) How to measure using information from ontologies, definitions, and corpora How to use freely available software How to conduct experiments using freely available reference standards Some applications where these measures are used or could be useful
November 25, 2013
MICAI-2013 Tutorial
2
Orientation ●
●
We focus on methods that measure similarity and relatedness using information found in an ontology, which may be possibly augmented with statistics from corpora or other resources We will not discuss purely distributional methods –
November 25, 2013
Very interesting and useful, and deserve their own separate tutorial
MICAI-2013 Tutorial
3
Just a few distributional methods ●
Latent Semantic Analysis –
●
SenseClusters –
●
http://senseclusters.sourceforge.net
Clustering by Committee –
●
http://lsa.colorado.edu
http://demo.patrickpantel.com
Disco –
November 25, 2013
http://www.linguatools.de/disco/disco_en.html MICAI-2013 Tutorial
4
Outline ●
Measures of Similarity and Relatedness –
●
Using Open Source Software –
●
75 minutes + 10 minutes of questions 45 minutes + 10 minutes of questions
Similarity and Relatedness in the Wild –
November 25, 2013
60 minutes + 10 minutes of questions
MICAI-2013 Tutorial
5
Measures of Semantic Similarity and Relatedness (without tears)
November 25, 2013
MICAI-2013 Tutorial
6
What are we measuring? ●
Concept pairs (word senses) –
●
Assign a numeric value that quantifies how similar or related two concepts are
Not words –
Cold may be temperature or illness ●
–
This tutorial assumes senses assigned ●
November 25, 2013
Word Sense Disambiguation But, can also use these measures for WSD!
MICAI-2013 Tutorial
7
Why? ●
●
Being able to organize concepts by their similarity or relatedness to each other is a fundamental operation in the human mind, and in many problems in Natural Language Processing and Artificial Intelligence If we know a lot about X, and if we know Y is similar to X, then a lot of what we know about X may apply to Y –
November 25, 2013
Use X to explain or categorize Y MICAI-2013 Tutorial
8
What is lefse?
November 25, 2013
MICAI-2013 Tutorial
9
Well, it's like a tortilla, except made with potatoes.
November 25, 2013
MICAI-2013 Tutorial
10
Lefse is a traditional soft, Norwegian flatbread. Lefse is made out of flour, and milk or cream (or sometimes lard), and cooked on a griddle. Traditional lefse does not include potato, but it is commonly added to make a thicker dough that is easier to work with. Special tools are available for lefse baking, including long wooden turning sticks and special rolling pins with deep grooves.
November 25, 2013
MICAI-2013 Tutorial
11
Lefse
November 25, 2013
MICAI-2013 Tutorial
12
Similar or Related? ●
Similarity based on is-a relations
November 25, 2013
–
How much is X like Y?
–
Share ancestor in is-a hierarchy
MICAI-2013 Tutorial
13
Is-a hierarchy
November 25, 2013
MICAI-2013 Tutorial
14
Similar or Related? ●
Similarity based on is-a relations –
Share ancestor in is-a hierarchy ● ●
–
A miter saw and a sander are similar ●
November 25, 2013
LCS : least common subsumer Closer / deeper the ancestor the more similar both are kinds-of power tools (LCS)
MICAI-2013 Tutorial
15
Least Common Subsumer (LCS)
November 25, 2013
MICAI-2013 Tutorial
16
Saws!
November 25, 2013
MICAI-2013 Tutorial
17
Sanders!
November 25, 2013
MICAI-2013 Tutorial
18
Similar or Related? ●
Relatedness more general –
How much is X related to Y?
–
Many ways to be related ●
●
Hammer and nail are related but they really aren't similar –
●
is-a, part-of, treats, affects, symptom-of, ...
(use hammer to drive nails)
All similar concepts are related, but not all related concepts are similar
November 25, 2013
MICAI-2013 Tutorial
19
“Standard” Measures of Similarity ●
Path Based –
●
●
Rada et al., 1989 (path)
Path + Depth –
Wu & Palmer, 1994 (wup)
–
Leacock & Chodorow, 1998 (lch)
Path + Information Content –
Resnik, 1995 (res)
–
Jiang & Conrath, 1997 (jcn)
–
Lin, 1998 (lin)
November 25, 2013
MICAI-2013 Tutorial
20
Path Based Measures ●
●
●
Distance between concepts (nodes) in tree intuitively appealing Spatial orientation, good for networks or maps but not is-a hierarchies –
Reasonable approximation sometimes
–
Assumes all paths have same “weight”
–
But, more specific (deeper) paths tend to travel less semantic distance
Shortest path a good start, needs corrections
November 25, 2013
MICAI-2013 Tutorial
21
Shortest is-a Path 1 ●
path(a,b) =
-----------------------------shortest is-a path(a,b)
November 25, 2013
MICAI-2013 Tutorial
22
We count nodes... ●
●
Maximum = 1 –
self similarity
–
path(miter saw, miter saw) = 1
Minimum = 1 / (longest path in isa tree) –
path(miter saw, claw hammer) = 1/7
–
path(table saw, flat tip screwdriver) = 1/7
–
etc...
November 25, 2013
MICAI-2013 Tutorial
23
path(miter saw, sander) = .25
November 25, 2013
MICAI-2013 Tutorial
24
miter saw & sander
November 25, 2013
MICAI-2013 Tutorial
25
path (hammer, power tool) = .25
November 25, 2013
MICAI-2013 Tutorial
26
hammers and power tools
November 25, 2013
MICAI-2013 Tutorial
27
? ●
●
Are hammer and power tool similar to the same degree as are mitre saw and sander? The path measure reports “yes, they are.”
November 25, 2013
MICAI-2013 Tutorial
28
Path + Depth ●
Path only doesn't account for specificity
●
Deeper concepts more specific
●
Paths between deeper concepts travel less semantic distance
November 25, 2013
MICAI-2013 Tutorial
29
Wu and Palmer, 1994 2 * depth (LCS (a,b)) ●
wup(a,b) = ---------------------------depth (a) + depth (b)
●
depth(x) = shortest is-a path(root,x)
November 25, 2013
MICAI-2013 Tutorial
30
wup(miter saw, sander) = (2*2)/(4+3) = .57
November 25, 2013
MICAI-2013 Tutorial
31
wup (hammer, power tool) = (2*1)/(2+3) = .4
November 25, 2013
MICAI-2013 Tutorial
32
? ●
●
●
Wu and Palmer reports that sander and miter saw (.57) are more similar than are power tool and hammer (.4) Path reports that sander and miter saw (.25) are equally similar as are power tool and hammer (.25) Note that measures are scaled differently and so should compare relative rankings between measures (and not exact scores)
November 25, 2013
MICAI-2013 Tutorial
33
Information Content ●
● ●
ic(concept) = -log p(concept) [Resnik 1995] –
Need to count concepts
–
Term frequency +Inherited frequency
–
p(concept) = tf + if / N
Depth shows specificity but not frequency Low frequency concepts often much more specific than high frequency ones
November 25, 2013
MICAI-2013 Tutorial
34
Information Content term frequency (tf)
November 25, 2013
MICAI-2013 Tutorial
35
Information Content inherited frequency (if)
November 25, 2013
MICAI-2013 Tutorial
36
Information Content (IC = -log (f/N) final count (f = tf + if, N = 365,820)
November 25, 2013
MICAI-2013 Tutorial
37
Information Content (IC = -log (f/N) final count (f = tf + if, N = 365,820)
November 25, 2013
MICAI-2013 Tutorial
38
Lin, 1998 2 * IC (LCS (a,b)) ●
lin(a,b) = -------------------------IC (a) + IC (b)
●
Look familiar?
November 25, 2013
MICAI-2013 Tutorial
39
Wu & Palmer, 1994 ●
●
2 * depth (LCS (a,b)) wup(a,b) = -------------------------depth (a) + depth (b) wup and lin are identical except that lin uses info content instead of depth – Info content provides a measure of depth (based on specificity)
November 25, 2013
MICAI-2013 Tutorial
40
lin (miter saw, sander) = 2 * 2.26 / (3.60 + 3.66) = 0.62
November 25, 2013
MICAI-2013 Tutorial
41
lin (hammer, power tool) = 2 * 0.71 / (2.26+2.81) = 0.28
November 25, 2013
MICAI-2013 Tutorial
42
? ●
●
●
Lin : miter saw and sander (.62) more similar than hammer and power tool (.28) Wu and Palmer : miter saw and sander (.57) more similar than hammer and power tool (.4) Path miter saw and sander (.25) equally similar to hammer and power tool (.25)
November 25, 2013
MICAI-2013 Tutorial
43
What about concepts not connected via is-a relations? ●
Connected via other relations? –
●
●
Part-of, treatment-of, causes, etc.
Not connected at all? –
In different sections (axes) of an ontology
–
In different ontologies entirely
Relatedness! –
Use definition information
–
No is-a relations so can't be similarity
November 25, 2013
MICAI-2013 Tutorial
44
Measures of relatedness ●
Path based –
●
Hirst & St-Onge, 1998 (hso)
Definition based –
Lesk, 1986
–
Adapted lesk (lesk) ●
●
Banerjee & Pedersen, 2003
Definition + corpus –
Gloss Vector (vector, vector_pairs) ●
November 25, 2013
Patwardhan & Pedersen, 2006 MICAI-2013 Tutorial
45
Path based relatedness ● ●
Ontologies include relations other than is-a These can be used to find shortest paths between concepts –
However, a path made up of different kinds of relations can lead to big semantic jumps
–
A hammer is used to drive nails which are made of iron which comes from mines in Minnesota ●
November 25, 2013
…. so hammer and Minnesota are related ?? MICAI-2013 Tutorial
46
Measuring relatedness with definitions ●
● ●
Related concepts defined using many of the same terms But, definitions are short, inconsistent Concepts don't need to be connected via relations or paths to measure them –
Lesk, 1986
–
Adapted Lesk, Banerjee & Pedersen, 2003
November 25, 2013
MICAI-2013 Tutorial
47
Two separate ontologies...
November 25, 2013
MICAI-2013 Tutorial
48
Could join them together … ?
November 25, 2013
MICAI-2013 Tutorial
49
Each concept has definition
November 25, 2013
MICAI-2013 Tutorial
50
Each concept has definition
November 25, 2013
MICAI-2013 Tutorial
51
Each concept has definition
November 25, 2013
MICAI-2013 Tutorial
52
Overlaps ●
Claw hammer and carpenter –
Related by working with wood ● ●
●
Can't see this in structure of is-a hierarchies Claw hammer and iron worker just as similar
Ball peen hammer and claw hammer –
Reflects structure of is-a hierarchies
–
If you start with text like this maybe you can build is-a hierarchies automatically! ●
November 25, 2013
Another tutorial... MICAI-2013 Tutorial
53
Lesk and Adapted Lesk ●
Lesk, 1986 : measure overlaps in definitions to assign senses to words –
●
The more overlaps between two senses (concepts), the more related
Banerjee & Pedersen, 2003, Adapted Lesk –
Augment definition of each concept with definitions of related concepts ●
– November 25, 2013
Build a super gloss
Increase chance of finding overlaps MICAI-2013 Tutorial
54
The problem with definitions ... ●
Definitions contain variations of terminology that make it impossible to find exact overlaps
●
spatula : an instrument for spreading material
●
spreader : a hand tool for smoothing compounds
●
No matches??! How can we see that “hand tool” and “instrument” are similar, as are “spreading material” and “smoothing compound” ?
November 25, 2013
MICAI-2013 Tutorial
55
Gloss Vector Measure of Semantic Relatedness ●
Rely on co-occurrences of terms –
Terms that occur within some given number of terms of each other other
●
Allows for a fuzzier notion of matching
●
Exploits second order co-occurrences –
November 25, 2013
Friend of a friend relation
MICAI-2013 Tutorial
56
Gloss Vector Measure of Semantic Relatedness ●
Friend of a friend relation –
Suppose hand tool and instrument don't occur in text with each other. But, suppose that “repair” occurs with each.
–
Hand tool and instrument are second order co-occurrences via “repair”
November 25, 2013
MICAI-2013 Tutorial
57
Gloss Vector Measure of Semantic Relatedness ●
●
●
●
Replace words or terms in definitions with vector of co-occurrences observed in corpus Defined concept now represented by an averaged vector of co-occurrences Measure relatedness of concepts via cosine between their respective vectors Patwardhan and Pedersen, 2006 –
November 25, 2013
Inspired by Schutze, 1998 MICAI-2013 Tutorial
58
Experimental Results ●
Vector > Lesk > Info Content > Depth > Path –
●
●
Clear trend across various studies
Big differences in intrinsic evaluations (Vector > Lesk >> Info Content > Depth > Path) –
Banerjee and Pedersen, 2003 (IJCAI)
–
Pedersen, et al. 2007 (JBI)
Smaller differences in extrinsic evaluations –
November 25, 2013
Human raters mix up similarity & relatedness? MICAI-2013 Tutorial
59
Questions?
November 25, 2013
MICAI-2013 Tutorial
60
References ●
●
●
S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805810, Acapulco, August 2003. (lesk) J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings on International Conference on Research in Computational Linguistics, pages 1933, Taiwan, 1997. (jcn) C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265-283. MIT Press, 1998. (lch)
November 25, 2013
MICAI-2013 Tutorial
61
References ●
●
●
M.E. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine code from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24-26. ACM Press, 1986. D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August 1998. (lin). S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006. (vector, vector_pairs)
November 25, 2013
MICAI-2013 Tutorial
62
References ●
●
●
R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17-30, 1989. (path) P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal, August 1995. (res) H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.
November 25, 2013
MICAI-2013 Tutorial
63
Using Open Source Software
November 25, 2013
MICAI-2013 Tutorial
64
Using Open Source Software ●
Packages providing the “standard” measures
●
Implementations of specific measures
●
Overview of WordNet::Similarity usage
November 25, 2013
MICAI-2013 Tutorial
65
Software that provides some of the “standard” measures
November 25, 2013
MICAI-2013 Tutorial
66
WordNet::Similarity ●
Similarity and Relatedness for WordNet –
http://wordnet.princeton.edu
●
Written in Perl (starting in 2002)
●
Offers command line, web interface, and API –
●
http://wn-similarity.sourceforge.net
We'll come back to this for some examples
November 25, 2013
MICAI-2013 Tutorial
67
ws4j ●
Java Re-implementation of WordNet::Similarity –
●
●
https://code.google.com/p/ws4j/
Includes path, depth, info content, hso, and lesk measures Online demo –
November 25, 2013
http://ws4jdemo.appspot.com/
MICAI-2013 Tutorial
68
NLTK ●
Natural Language Toolkit –
●
Includes path, depth, and information content measures
Written in Python –
General purpose NLP toolkit ●
–
November 25, 2013
Parsers, part of speech taggers, and more
http://nltk.org/
MICAI-2013 Tutorial
69
DKPro Similarity ●
●
●
Semantic similarity using vector space models like LSA and ESA, and also WordNet –
Implemented using UIMA
–
https://code.google.com/p/dkpro-similarity-asl/
Part of the much larger DKPro project, which provides UIMA wrappers for many existing tools and models Supports measuring similarity of short texts and concept pairs
November 25, 2013
MICAI-2013 Tutorial
70
Semilar ●
Semantic similarity using WordNet and LSA –
●
● ●
http://semanticsimilarity.org
Supports measuring similarity of short texts and concept pairs Provides many pre-built models using LSA Includes a web service and API in addition to downloadable libraries
November 25, 2013
MICAI-2013 Tutorial
71
UMLS::Similarity ●
Ports WordNet::Similarity to the UMLS –
Unified Medical Language System from NLM, a data warehouse of medical sources ●
– ●
Freely available, license required
http://umls-similarity.sourceforge.net
Perl and mySQL
November 25, 2013
MICAI-2013 Tutorial
72
ProteInOn ●
Computes Semantic Similarity for the Gene Ontology (GO) using path and information content measures –
●
http://geneontology.org/
Protein Interactions and Ontology –
November 25, 2013
http://lasige.di.fc.ul.pt/webtools/proteinon/
MICAI-2013 Tutorial
73
Measure Specific Software
November 25, 2013
MICAI-2013 Tutorial
74
UKB ●
Graph based similarity and relatedness measures, using WordNet –
●
http://ixa2.si.ehu.es/ukb/
Applies Personalized Page Rank to semantic similarity and relatedness measures, as well as to word sense disambiguation
November 25, 2013
MICAI-2013 Tutorial
75
WMFVEC ●
High dimensional vector approach using definitions from WordNet and Wiktionary –
●
http://www.cs.columbia.edu/~weiwei/code.h tml#wmfvec
Supports similarity measurements of short texts and concept pairs
November 25, 2013
MICAI-2013 Tutorial
76
olesk ●
Shortest path in weighted semantic network –
●
http://olesk.com/#SemanticRelatedness
Supports measuring similarity of short texts and concept pairs
November 25, 2013
MICAI-2013 Tutorial
77
Illinois WNSim ●
WordNet-based Similarity Metric – – –
●
https://cogcomp.cs.illinois.edu/page/softwa re_view/Illinois%20WNSim Also provides Java version https://cogcomp.cs.illinois.edu/page/softw are_view/Illinois%20WNSim%20(Java )
Measures similarity of short texts, provides support for similarity of named entities
November 25, 2013
MICAI-2013 Tutorial
78
WordNet::Similarity Usage
November 25, 2013
MICAI-2013 Tutorial
79
WordNet::Similarity ●
Install WordNet (http://princeton.wordnet.edu) –
●
●
Make sure to set $WNHOME
Install WordNet::QueryData –
cpan
–
> install WordNet::QueryData
Install WordNet::Similarity –
cpan
–
> install WordNet::Similarity
November 25, 2013
MICAI-2013 Tutorial
80
WordNet::Similarity
November 25, 2013
MICAI-2013 Tutorial
81
Command line ●
similarity.pl --type WordNet::Similarity::path dog cat –
●
similarity.pl --type WordNet::Similarity::path dog#n cat#n –
●
dog#n#1 cat#n#1 0.2
similarity.pl --type WordNet::Similarity::path dog#n#2 cat#n#3 –
●
dog#n#1 cat#n#1 0.2
dog#n#2 cat#n#3 0.125
Similarity.pl –type WordNet::Similarity::path dog cat#n#3 –
November 25, 2013
dog#n#3 cat#n#3 0.142857142857143
MICAI-2013 Tutorial
82
Command line ●
similarity.pl --type WordNet::Similarity::path dog cat --allsenses
November 25, 2013
–
dog#n#1 cat#n#1 0.2
–
dog#n#3 cat#n#2 0.2
–
dog#n#1 cat#n#7 0.2
–
dog#n#7 cat#n#5 0.166666666666667
–
dog#n#4 cat#n#2 0.142857142857143
–
dog#n#3 cat#n#3 0.142857142857143
–
dog#v#1 cat#v#2 0.142857142857143
–
dog#n#4 cat#n#3 0.142857142857143
–
dog#n#6 cat#n#5 0.142857142857143
–
dog#v#1 cat#v#1 0.142857142857143
–
dog#n#2 cat#n#2 0.125
–
Etc...
MICAI-2013 Tutorial
83
WordNet senses ●
wn cat -over
●
Overview of noun cat
●
The noun cat has 8 senses (first 1 from tagged texts)
●
●
●
●
1. (18) cat, true cat -- (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats) 2. guy, cat, hombre, bozo -- (an informal term for a youth or man; "a nice guy"; "the guy's only doing it for some doll") 3. cat -- (a spiteful woman gossip; "what a cat she is!") 4. kat, khat, qat, quat, cat, Arabian tea, African tea -- (the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant; "in Yemen kat is used daily by 85% of adults")
November 25, 2013
MICAI-2013 Tutorial
84
wn cat -over ●
●
●
●
5. cat-o'-nine-tails, cat -- (a whip with nine knotted cords; "British sailors feared the cat") 6. Caterpillar, cat -- (a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work) 7. big cat, cat -- (any of several large cats typically able to roar and living in the wild) 8. computerized tomography, computed tomography, CT, computerized axial tomography, computed axial tomography, CAT -- (a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis
November 25, 2013
MICAI-2013 Tutorial
85
wn cat -over ●
Overview of verb cat
●
The verb cat has 2 senses (no senses from tagged texts)
●
1. cat -- (beat with a cat-o'-nine-tails)
●
2. vomit, vomit up, purge, cast, sick, cat, be sick, disgorge, regorge, retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate, throw up -- (eject the contents of the stomach through the mouth; "After drinking too much, the students vomited"; "He purged continuously"; "The patient regurgitated the food we gave him last night")
November 25, 2013
MICAI-2013 Tutorial
86
Measure type (-type WordNet::Similarity::X) ●
Similarity –
Shortest Path ●
–
Depth ●
–
wup, lch
Information Content ●
November 25, 2013
path
res, lin, jcn
MICAI-2013 Tutorial
87
Measure type (-type WordNet::Similarity::X) ●
Relatedness –
Definition ●
–
Definition + Corpus ● ●
–
vector vector_pairs
Path finding ●
November 25, 2013
lesk
hso MICAI-2013 Tutorial
88
Measure type (-type WordNet::Similarity::X) ●
Random –
rand
–
Very useful for experimental comparisons ●
November 25, 2013
How do your results compare with a random measure of similarity or relatedness??
MICAI-2013 Tutorial
89
Similarity measures don't cross part of speech tags ●
similarity.pl --type WordNet::Similarity::path dog#n cat#v –
Warning (WordNet::Similarity::path::parseWps()) - dog#n and cat#v belong to different parts of speech.
–
dog#n#2 cat#v#1 -1000000
November 25, 2013
MICAI-2013 Tutorial
90
Relatedness measures cross POS boundaries ●
similarity.pl --type WordNet::Similarity::lesk dog#n#1 cat#v#1 –
●
dog#n#1 cat#v#1 20
similarity.pl --type WordNet::Similarity::vector dog#n#1 cat#v#1 –
November 25, 2013
dog#n#1 cat#v#1 0.146581307649965
MICAI-2013 Tutorial
91
API ●
use WordNet::Similarity::wup;
●
use WordNet::QueryData;
●
my $wn = WordNet::QueryData->new();
●
my $wup = WordNet::Similarity::wup->new($wn);
●
●
my $value = $wup->getRelatedness('dog#n#1', 'cat#n#1');
●
my ($error, $errorString) = $wup->getError();
●
die $errorString if $error;
●
print "dog (sense 1) cat (sense 1) = $value\n";
●
dog (sense 1) cat (sense 1) = 0.866666666666667
November 25, 2013
MICAI-2013 Tutorial
92
API ●
my $wn = WordNet::QueryData->new;
●
use WordNet::Similarity::PathFinder;
●
my $obj = WordNet::Similarity::PathFinder->new ($wn);
●
my $wps1 = 'winston_churchill#n#1';
●
my $wps2 = 'england#n#1';
●
my @paths = $obj->getShortestPath($wps1, $wps2, 'n', 'wps');
●
my ($length, $path) = @{shift @paths};
●
defined $path or die "No path between synsets";
●
print "shortest path between $wps1 and $wps2 is $length edges long\n";
●
print "@$path\n";
●
shortest path between winston_churchill#n#1 and england#n#1 is 14 edges long winston_churchill#n#1 writer#n#1 communicator#n#1 person#n#1 causal_agent#n#1 physical_entity#n#1 object#n#1 location#n#1 region#n#3 district#n#1 administrative_district#n#1 country#n#2 European_country#n#1 england#n#1
November 25, 2013
MICAI-2013 Tutorial
93
Web Interface
November 25, 2013
MICAI-2013 Tutorial
94
Web Interface
November 25, 2013
MICAI-2013 Tutorial
95
Web Interface
November 25, 2013
MICAI-2013 Tutorial
96
Web Interface
November 25, 2013
MICAI-2013 Tutorial
97
Web Interface
November 25, 2013
MICAI-2013 Tutorial
98
Web Interface
November 25, 2013
MICAI-2013 Tutorial
99
Web Interface ●
If you like the web interface, you can run your own version! –
similarity_server.pl
–
All necessary html and cgi files included
November 25, 2013
MICAI-2013 Tutorial
100
Other Utilities ●
Build new information content files – by default counts come from SemCor –
BNCFreq.pl
–
brownFreq.pl
–
treebankFreq.pl
–
rawtextFreq.pl
●
compounds.pl – list all WordNet compounds
●
wnDepths.pl – list all WordNet depths
November 25, 2013
MICAI-2013 Tutorial
101
Questions?
November 25, 2013
MICAI-2013 Tutorial
102
References ●
●
●
●
Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (ukb) Daniel Bär, Torsten Zesch, and Iryna Gurevych. DKPro Similarity: An Open Source Framework for Text Similarity, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121-126, August 2013, Sofia, Bulgaria. (pdf) (bib) (dkpro-similarity) Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. (nltk) Q. Do and D. Roth and M. Sammons and Y. Tu and V. Vydiswaran, Robust, Light-weight Approaches to compute Lexical Similarity. Computer Science Research and Technical Reports, University of Illinois (2009) (Illionois WNSim)
November 25, 2013
MICAI-2013 Tutorial
103
References ●
●
●
Weiwei Guo and Mona Diab. "Improving Lexical Semantics for Sentential Semantics: Modeling Selectional Preference and Similar Words in a Latent Variable Model". In Proceedings of NAACL, 2013, Atlanta, Georgia, USA. (wmfvec) Bridget McInnes, Ted Pedersen, and Serguei Pakhomov, UMLS-Interface and UMLS-Similarity : Open Source Software for Measuring Paths and Semantic Similarity - Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, Nov 14-18, 2009, pp. 431-435, San Francisco, CA (umls-similarity) Ted Pedersen, Siddharth Patwardhan and Jason Michelizzi, WordNet::Similarity - Measuring the Relatedness of Concepts - Appears in the Proceedings of Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-04), pp. 38-41, May 3-5, 2004, Boston, MA. (wordnet-similarity)
November 25, 2013
MICAI-2013 Tutorial
104
References ●
●
Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D. (2013). SEMILAR: The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, Sofia, Bulgaria. (semilar) Reda Siblini and Leila Kosseim (2013). Using a Weighted Semantic Network for Lexical Semantic Relatedness. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2013), September, Hissar, Bulgaria. (olesk)
November 25, 2013
MICAI-2013 Tutorial
105
Similarity and Relatedness in the Wild : How do we know it's working?
November 25, 2013
MICAI-2013 Tutorial
106
Intrinsic Evaluation ● ●
●
Develop your own measure Score it using pairs for which human reference standard is available Compare correlation between your measure and established measures –
Spearman's rank correlation often used
–
rank.pl in Ngram Statistics Package ●
November 25, 2013
http://ngram.sourceforge.net
MICAI-2013 Tutorial
107
Intrinsic Evaluation ●
Replication proves to be very difficult!
●
Many factors, see ACL 2013 paper –
Offspring from Reproduction Problems: What Replication Failure Teaches Us (Fokkens, van Erp, Postma, Pedersen, Vossen, and Freire) - Appears in the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, pp. 1691-1701, Sofia, Bulgaria.
–
http://aclweb.org/anthology//P/P13/P13-1166.pdf
November 25, 2013
MICAI-2013 Tutorial
108
Reference Standards ●
●
Rubenstein and Goodenough, 1965 –
65 pairs
–
Assessed by 50 undergraduate students
–
http://www.d.umn.edu/~tpederse/Data/ruben stein-goodenough-1965.txt
Miller and Charles, 1991 –
30 pair subset of R&G
–
Re-assessed by 38 undergraduate students
–
http://www.d.umn.edu/~tpederse/Data/millercharles-1991.txt
November 25, 2013
MICAI-2013 Tutorial
109
Rubenstein and Goodenough pairs no similarity (0.0) – synonyms (4.0) ●
3.94 gem jewel
●
3.92 car automobile
●
3.92 automobile car
●
3.84 gem jewel
●
2.41 brother lad
●
1.66 lad brother
●
2.37 crane implement
●
1.68 crane implement
●
0.04 rooster voyage
●
0.08 rooster voyage
●
0.04 noon string
●
0.08 noon string
– November 25, 2013
Rubenstein & Goodenough
MICAI-2013 Tutorial
–
Miller & Charles 110
Reference Standards •
●
WordSim-353, 2002 –
153 pairs assessed by 13 subjects
–
200 pairs assessed by 16 subjects
–
Includes the Miller and Charles pairs (reassessed)
http://www.cs.technion.ac.il/~gabr/resources/ data/wordsim353/
November 25, 2013
MICAI-2013 Tutorial
111
Reference Standards ●
Yang and Powers, 2006
●
130 verb pairs –
Assessed by 2 academic staff and 4 graduate students
–
How related in meaning is the pair? ● ●
–
November 25, 2013
0 for not at all 4 for inseperably related
http://david.wardpowers.info/Research/AI/p apers/200601-GWC-130verbpairs.txt MICAI-2013 Tutorial
112
Reference standards ● ●
●
Mturk 771, 2012 771 word pairs scored for relatedness by Mechanical Turkers At least 20 judgements per pair –
1 for not related, 5 for highly related
–
50 ratings per Turker
–
http://www2.mta.ac.il/~gideon/mturk771.html
November 25, 2013
MICAI-2013 Tutorial
113
Reference Standards ●
MWE-300, 2012
November 25, 2013
–
300 pairs where 216 are multi-word expressions and 84 are word pairs
–
Assessed by 5 native speakers on scale of 0 to 1
–
http://adapt.seiee.sjtu.edu.cn/similarity/
MICAI-2013 Tutorial
114
Reference Standards ● ●
Rel-122, 2013 Relatedness scores for 122 noun pairs, created at University of Central Florida –
Each pair assessed by at least 20 undergraduate students
–
0 for completely unrelated, 4 for strongly related
–
http://www.cs.ucf.edu/~seansz/rel-122/
November 25, 2013
MICAI-2013 Tutorial
115
Reference standards ●
MayoSRS, 2007 –
101 pairs of medical concepts
–
Assessed by 13 medical coders and 3 physicians, all from Mayo Clinic ●
–
●
1 for not at all related, 4 for nearly synonymous
MiniMayoSRS – a highly reliable subset of 29 pairs
http://rxinformatics.umn.edu/SemanticRelate dnessResources.html
November 25, 2013
MICAI-2013 Tutorial
116
Reference standards UMNSRS, 2010
● ●
–
566 pairs of medical concepts assessed for similarity by 8 medical students / residents
–
587 pairs of medical concepts assessed for relatedness by 8 medical students / residents
Assessed on a continuous scale (0 – 1500) http://rxinformatics.umn.edu/SemanticRelated nessResources.html
November 25, 2013
MICAI-2013 Tutorial
117
Reference Standards ●
Lexical & Distributional Semantics Evaluation Benchmarks, maintained by Manaal Faruqui –
●
http://www.cs.cmu.edu/~mfaruqui/suite.html
ACL Wiki (various datasets for related tasks) http://aclweb.org/aclwiki/index.php?title=Simi larity_(State_of_the_art) http://aclweb.org/aclwiki/index.php?title=Kn owledge_collections_and_datasets_(English) SemEval (many related tasks with data) –
●
– November 25, 2013
http://aclweb.org/aclwiki/index.php?title=Se mEval_Portal MICAI-2013 Tutorial
118
Extrinsic Evaluation ●
Synonym Tests
●
Word Sense Disambiguation
●
Semantic Textual Similarity
●
Recognizing Textual Entailment
November 25, 2013
MICAI-2013 Tutorial
119
ESL Synonym Tests ●
Provide one target word in context
●
Select “closest” synonym from a list of 4
●
●
●
Used in previous versions of TOEFL and other standardized tests http://aclweb.org/aclwiki/index.php?title=ESL_Synonym_Questions_(State_ of_the_art)
50 question data set available from Peter Turney –
November 25, 2013
http://www.apperceptual.com/ MICAI-2013 Tutorial
120
ESL Synonym Tests ●
●
Stem: "A rusty nail is not as strong as a clean, new one." Choices: –
(a) corroded
–
(b) black
–
(c) dirty
–
(d) painted
November 25, 2013
MICAI-2013 Tutorial
121
ESL Synonym Tests ●
similarity.pl --type WordNet::Similarity::vector --file rusty.txt
●
rusty#a#1 dirty#a#1 0.175879883782967
●
rusty#a#1 painted#a#1 0.0844532311114079
●
rusty#a#1 black#a#2 0.0656019491836669
●
rusty#a#1 corroded#a#1 0.0357324093083641 –
November 25, 2013
:(
MICAI-2013 Tutorial
122
TOEFL Synonym Tests ●
Rusty and other words are adjectives
●
Must used relatedness measure lesk – vector – vector_pairs – hso Should do word sense disambiguation first –
●
November 25, 2013
MICAI-2013 Tutorial
123
Word Sense Disambiguation ●
The meanings of words that occur together in a context will likely be related –
If a word has multiple senses, it will most likely be used in the sense that is most related to the senses of it's neighbors
–
Relatedness seems to matter more than similarity, unless you have a list ●
November 25, 2013
I have a horse, a cat and a cow at my farm.
MICAI-2013 Tutorial
124
Word Sense Disambiguation ●
SenseRelate Hypothesis : Most words in text will have multiple possible senses and will often be used with the sense most related to those of surrounding words –
He either has a cold or the flu ●
November 25, 2013
Cold not likely to mean air temperature
MICAI-2013 Tutorial
125
SenseRelate ●
●
In coherent text words will be used in similar or related senses, and these will also be related to the overall topic or mood of a text First applied to WSD in 2002 –
Banerjee and Pedersen, 2002 (WordNet)
–
Patwardhan et al., 2003 (WordNet)
–
Pedersen and Kolhatkar 2009 (WordNet)
–
McInnes et al., 2011 (UMLS)
November 25, 2013
MICAI-2013 Tutorial
126
Implementations ●
WordNet::SenseRelate –
AllWords, TargetWord, WordToSet
–
http://senserelate.sourceforge.net ●
●
Includes command line, API, and web interface
UMLS::SenseRelate –
AllWords
–
http://search.cpan.org/dist/UMLS-SenseRelat e/
November 25, 2013
MICAI-2013 Tutorial
127
Web Interface
November 25, 2013
MICAI-2013 Tutorial
128
Web Interface
November 25, 2013
MICAI-2013 Tutorial
129
SenseRelate for WSD ●
●
Assign each word the sense which is most similar or related to one or more of its neighbors –
Pairwise
–
2 or more neighbors
Pairwise algorithm results in a trellis much like in HMMs –
November 25, 2013
More neighbors adds lots of information and a lot of computational complexity MICAI-2013 Tutorial
130
SenseRelate - pairwise
November 25, 2013
MICAI-2013 Tutorial
131
SenseRelate – 2 neighbors
November 25, 2013
MICAI-2013 Tutorial
132
General Observations on WSD Results ●
●
●
●
Nouns more accurate; verbs, adjectives, and adverbs less so Increasing the window size nearly always improves performance Jiang-Conrath measure often a high performer for nouns (e.g., Patwardhan et al. 2003) Vector and lesk have coverage advantage –
November 25, 2013
handle mixed pairs while others don't MICAI-2013 Tutorial
133
SenseRelate Sentiment Classification ●
●
The underlying sentiment of a text can be discovered by determining which emotion is most related to the words in that text. –
Related to happy? : love, food, success, ...
–
Similar to happy? : joyful, ecstatic, ...
–
Pairwise comparisons between emotion and senses of words in context
Same form as Naive Bayesian model –
November 25, 2013
WordNet::SenseRelate::WordToSet MICAI-2013 Tutorial
134
SenseRelate - WordToSet
November 25, 2013
MICAI-2013 Tutorial
135
Experimental Results ●
Sentiment classification results in 2011 i2b2 suicide notes challenge were disappointing (Pedersen, 2012) –
Suicide notes not very emotional!
–
In many cases reflect a decision made and focus on settling affairs
November 25, 2013
MICAI-2013 Tutorial
136
Semantic Textual Similarity (STS) ●
●
How similar (semantically) are 2 texts? –
The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
–
American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today
http://www-nlp.stanford.edu/wiki/STS
November 25, 2013
MICAI-2013 Tutorial
137
Semantic Textual Similarity (STS) ●
Combined distributional and WordNet information to learn a model from training data –
●
UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures, Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch, Semeval 2012
LSA Boosted with WordNet –
November 25, 2013
UMBC EBIQUITY-CORE: Semantic Textual Similarity Sy stems Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Johnathan Weese, *Sem 2013
MICAI-2013 Tutorial
138
Recognizing Textual Entailment (RTE) ●
A text entails a hypothesis if a human reading the text would infer that the hypothesis is true –
Text : The Christian Science Monitor named a US journalist kidnapped in Iraq as freelancer Jill Carroll.
–
Hypothesis: Jill Carroll was abducted in Iraq.
–
Hypothesis: The Christian Science Monitor kidnapped a freelancer.
November 25, 2013
MICAI-2013 Tutorial
139
RTE methods and data ●
●
Long series of shared tasks –
2004 to present
–
http://aclweb.org/aclwiki/index.php?title=T extual_Entailment_Resource_Pool
Recognizing that T and H are similar is helpful, although does not really solve the problem –
November 25, 2013
Hybrid approaches (like with STS)
MICAI-2013 Tutorial
140
Applications ●
Semantic similarity and relatedness are important components of many NLP applications –
Crucial building blocks
–
Interesting to study in their own right
November 25, 2013
MICAI-2013 Tutorial
141
Thank you! If you have any suggestions for content that should be added to or changed in this tutorial, please let me know! Any other comments are welcome too.
[email protected] Questions? November 25, 2013
MICAI-2013 Tutorial
142
References ●
●
●
●
S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pages 136—145, Mexico City, February 2002. (wsd result) D. Faria, C. Pesquita, F. M. Couto, and A. Falcão, ProteInOn: A Web Tool for Protein Semantic Similarity, Technical Report, Department of Informatics, University of Lisbon, 2007 (proteinon) L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan and G. Wolfman (2002). Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1), 116-131. (wordsim-353) B. McInnes, T. Pedersen, Y. Liu, G. Melton and S. Pakhomov. Knowledgebased Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, pages 895-904, Washington, DC, October 2011. (wsd result)
November 25, 2013
MICAI-2013 Tutorial
143
References ●
●
●
●
G. A. Miller and W. G. Charles (1991). Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 6(1), 1-28. S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G Melton, Semantic Similarity and Relatedness between Clinical Terms : An Experimental Study - Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, November 13-17, 2010, pp. 572 - 576, Washington, DC. (umnsrs) S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 241—257, Mexico City, February 2003. (wsd result) S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006. (wsd result)
November 25, 2013
MICAI-2013 Tutorial
144
References ●
●
●
●
T. Pedersen and V. Kolhatkar. WordNet :: SenseRelate :: AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies 2009 Conference, pages 17-20, Boulder, CO, June 2009. (wsd result) T. Pedersen, S. Pakhomov, S. Patwardhan, and C. Chute. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3) : 288-299, June 2007. (mayosrs) T. Pedersen. Rule-based and lightly supervised methods to predict emotions in suicide notes. Biomedical Informatics Insights, 2012:5 (Suppl. 1):185-193, January 2012. (sentiment result) H. Rubenstein and J. B. Goodenough (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10), 627-633.
November 25, 2013
MICAI-2013 Tutorial
145
References ●
●
●
●
●
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity, Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Semeval 2012 (sts shared task) S. Szumlanski, F. Gomez and V.K. Sims (2013). A New Set of Norms for Semantic Relatedness Measures. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 890-895). Sofia, Bulgaria. (rel-122) P. D. Turney (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502. (toefl synonyms) W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In Proceedings of SIGMOD'12, pages 481-492, 2012. (mwe-300) D. Yang and D.M. W. Powers (2006). Verb Similarity on the Taxonomy of WordNet. Proceedings of the Third International WordNet Conference (GWC-06) (pp. 121-128). Jeju Island, Korea.
November 25, 2013
MICAI-2013 Tutorial
146
The End!
November 25, 2013
MICAI-2013 Tutorial
147