Distributional Memory: A Generalized Framework for Corpus-based Semantics
Alessandro Lenci, University of Pisa, Department of Linguistics
IMS Stuttgart, 2 February 2010
Credits
Distributional Memory is joint research with Marco Baroni (CIMeC, University of Trento)
Main references:
Marco Baroni, Alessandro Lenci (2009) "One distributional memory, many semantic tasks", Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS), 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, 31 March 2009
Marco Baroni, Alessandro Lenci (submitted) "Distributional Memory: A General Framework for Corpus-based Semantics", Computational Linguistics
Outline
1 Background and motivation
2 The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3 Implementing DM
4 Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5 Summary and conclusions
Corpus-based semantics
Distributional Semantic Models (DSMs) aim at characterizing the meaning of linguistic expressions in terms of their distributional properties
DSMs all rely on some version of the distributional hypothesis (Harris 1954, Miller & Charles 1991): the degree of semantic similarity between two words (or other linguistic units) can be modeled as a function of the degree of overlap among their linguistic contexts
The format of distributional representations varies greatly depending on the specific aspects of meaning they are designed to model
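The distributional hypothesis can be made concrete with a toy computation. This is a hedged sketch with made-up co-occurrence counts (the words and numbers are mine, not from the talk): each word is a vector of counts over a few context words, and similarity is the cosine of those vectors.

```python
import math

# Hypothetical co-occurrence counts (illustrative only): each word is
# described by how often it co-occurs with a few context words.
contexts = ["bark", "leash", "engine", "wheel"]
dog   = [12.0, 7.0, 0.0, 1.0]
puppy = [10.0, 5.0, 1.0, 0.0]
car   = [0.0, 1.0, 9.0, 11.0]

def cosine(u, v):
    # degree of overlap between two context-count vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# dog shares contexts with puppy, not with car
print(cosine(dog, puppy) > cosine(dog, car))  # True
```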
Unstructured DSMs
Unstructured DSMs represent distributional data in terms of unstructured co-occurrence relations between an element and a context: contexts as documents (Landauer & Dumais 1997, Griffiths et al. 2007), or contexts as lexical collocates within a certain distance from the target (Bullinaria & Levy 2007, Lund & Burgess 1996, Rapp 2003, Schütze 1997)
Unstructured DSMs do not use the linguistic structure of texts to compute co-occurrences: in The teacher eats a red apple, eat is a legitimate context for both apple and red, because they appear in the same window
Structured DSMs
In structured DSMs, co-occurrence statistics are collected in the form of corpus-derived triples (Almuhareb & Poesio 2004, Curran & Moens 2002, Erk & Padó 2008, Grefenstette 1994, Lin 1998, Padó & Lapata 2007, Rothenhäusler & Schütze 2009, Turney 2006): word pairs plus the syntactic relation or lexico-syntactic pattern that links them
To qualify as a context of a target item, a word must be linked to it by some (interesting) lexico-syntactic relation, which can also be used to distinguish the type of co-occurrence: in The teacher eats a red apple, eat is not a legitimate context for red, and the object relation connecting eat and apple is treated as a different type of co-occurrence from the modifier relation linking red and apple
Structured DSMs seem to have a slight edge on unstructured models (Padó & Lapata 2007, Rothenhäusler & Schütze 2009), but the picture is not totally clear
Binary models of distributional data
Both structured and unstructured DSMs represent distributional data in terms of 2-way structures: matrices M of size |B|×|T|, with B the set of basis elements representing the contexts used to compare the distributional similarity of the target elements T (Padó & Lapata 2007)
Structured DSMs also map the corpus-derived ternary data directly onto a 2-way matrix:
the dependency information in the tuple can be dropped (Padó & Lapata 2007): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine, shoot⟩
the two words can be concatenated, treating the links as basis elements (Turney 2006): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine-shoot, sbj⟩
pairs formed by the link and one word can be concatenated and treated as attributes of target words (Almuhareb & Poesio 2004, Curran & Moens 2002, Grefenstette 1994, Lin 1998, Rothenhäusler & Schütze 2009): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine, shoot-sbj⟩
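The three reductions of a triple to a 2-way (target, basis-element) pair can be sketched as simple projections. The function names below are mine, for illustration only:

```python
# Three ways a corpus-derived triple (w1, l, w2) gets flattened into a
# 2-way (target, basis-element) pair; function names are illustrative.
def drop_link(w1, l, w2):
    # Padó & Lapata 2007: keep only the word pair
    return (w1, w2)

def pair_as_target(w1, l, w2):
    # Turney 2006: concatenated word pair as target, link as basis element
    return (w1 + "-" + w2, l)

def link_word_attribute(w1, l, w2):
    # Lin 1998 and others: link-word pair as attribute of the target word
    return (w1, w2 + "-" + l)

triple = ("marine", "sbj", "shoot")
print(drop_link(*triple))            # ('marine', 'shoot')
print(pair_as_target(*triple))       # ('marine-shoot', 'sbj')
print(link_word_attribute(*triple))  # ('marine', 'shoot-sbj')
```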
“One semantic task, one distributional model”
The choice to represent co-occurrence statistics directly as matrices produces prima facie incompatible semantic spaces: we lose sight of the fact that different semantic spaces actually rely on the same kind of underlying distributional information
This results in the development of ad hoc models geared towards specific aspects of meaning: taxonomic similarity, relation identification, selectional preferences, etc.
Excellent empirical results, but... not what humans do (human semantic memory is general-purpose); computationally inefficient; resources rarely reusable; prone to overfitting; not adaptive
The current landscape of distributional semantics
attributional similarity tasks: synonym detection, categorization, etc. Words like dog and puppy are attributionally similar in the sense that their meanings share a large number of attributes (they are animals, they bark, etc.). Attributional similarity is typically addressed by DSMs based on word collocates as proxies to concept attributes (Bullinaria & Levy 2007, Grefenstette 1994, Lund & Burgess 1996, Padó & Lapata 2007, Schütze 1997)
relational similarity tasks: analogy recognition, relation extraction, etc. Relational similarity is the property shared by pairs of words (dog–animal and car–vehicle) linked by similar semantic relations (here, hypernymy). DSMs tackle relational similarity by representing pairs of words in the space of the patterns that connect them in the corpus (Turney 2006, Girju et al. 2006, Hearst 1992, Pantel & Pennacchiotti 2006)
others: selectional preferences (Erk 2007), argument alternations (Merlo & Stevenson 2001, Joanis et al. 2008), commonsense knowledge extraction (Almuhareb 2006, Cimiano & Wenderoth 2007), etc.
Distributional Memory (DM) towards a unified framework for corpus-based semantics
The core geometrical structure of DM is a 3-way object, namely a third order tensor. Like structured DSMs, DM represents distributional facts as word-link-word tuples; differently from current approaches, the tuples are formalized as a ternary structure, and can become the backbone of a unified model for distributional semantics
Different semantic spaces are generated "on demand" through tensor matricization, projecting the third order tensor onto 2-way matrices: all these different semantic spaces are now alternative views of the same underlying distributional object
Apparently unrelated semantic tasks can be addressed in terms of the same distributional memory, harvested only once from the corpus: distributional data can be turned into a general-purpose resource for semantic modeling
Weighted distributional tuples
W1, W2: sets of strings representing content words
L: a set of strings representing syntagmatic co-occurrence links between words
T: a set of corpus-derived tuples t = ⟨w1, l, w2⟩, such that T ⊆ W1 × L × W2, where w1 co-occurs with w2 and l represents the type of this co-occurrence relation
vt: a tuple weight, assigned by a scoring function σ : W1 × L × W2 → R
Weighted tuple structure: a set TW of weighted distributional tuples tw = ⟨t, vt⟩ for all t ∈ T, with σ(t) = vt
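A weighted tuple structure can be rendered as a mapping from tuples to scores. This is a sketch of mine (the representation is not prescribed by DM); the weights are a fragment of the talk's running example:

```python
# Weighted tuple structure TW as a dict from tuples t = (w1, l, w2) to
# the weight sigma(t); a fragment of the running example.
TW = {
    ("marine",   "own", "bomb"): 40.0,
    ("marine",   "use", "bomb"): 82.1,
    ("marine",   "own", "gun"):  85.3,
    ("marine",   "use", "gun"):  44.8,
    ("sergeant", "own", "bomb"): 16.7,
    ("sergeant", "use", "bomb"): 69.5,
    ("teacher",  "own", "book"): 48.4,
    ("teacher",  "use", "book"): 53.6,
}

def sigma(t):
    # scoring function: weight of a tuple, 0.0 if unattested
    return TW.get(t, 0.0)

print(sigma(("marine", "use", "bomb")))  # 82.1
```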
Weighted tuple structure

w1        l     w2     σ         w1        l     w2     σ
marine    own   bomb   40.0      sergeant  use   gun    51.9
marine    use   bomb   82.1      sergeant  own   book   8.0
marine    own   gun    85.3      sergeant  use   book   10.1
marine    use   gun    44.8      teacher   own   bomb   5.2
marine    own   book   3.2       teacher   use   bomb   7.0
marine    use   book   3.3       teacher   own   gun    9.3
sergeant  own   bomb   16.7      teacher   use   gun    4.7
sergeant  use   bomb   69.5      teacher   own   book   48.4
sergeant  own   gun    73.4      teacher   use   book   53.6

Constraints on TW
W1 = W2
inverse link constraint: for any link l ∈ L, there is a k ∈ L such that for each tuple tw = ⟨⟨wi, l, wj⟩, vt⟩ ∈ TW, tw⁻¹ = ⟨⟨wj, k, wi⟩, vt⟩ ∈ TW (k is the inverse link of l)
⟨⟨marine, use, bomb⟩, vt⟩ ⇒ ⟨⟨bomb, use⁻¹, marine⟩, vt⟩
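The inverse link constraint can be enforced by closing the tuple structure under inversion. A minimal sketch, assuming tuples stored in a dict; the "-1" suffix marking inverse links is my naming convention:

```python
# Close a weighted tuple structure under the inverse link constraint:
# for every ((w1, l, w2), v), add ((w2, l + "-1", w1), v) with the same weight.
def add_inverses(TW):
    closed = dict(TW)
    for (w1, l, w2), v in TW.items():
        closed[(w2, l + "-1", w1)] = v
    return closed

TW = {("marine", "use", "bomb"): 82.1, ("marine", "own", "gun"): 85.3}
closed = add_inverses(TW)
print(closed[("bomb", "use-1", "marine")])  # 82.1
```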
Third order tensors
A tensor X is a multi-way array (Kolda & Bader 2009, Turney 2007). The order (or n-way) of a tensor is the number of indices needed to identify its elements; tensors are a generalization of vectors (first order tensors) and matrices (second order tensors)
An array with 3 indices is a third order (or 3-way) tensor: x_ijk is the element (i, j, k) of a third order tensor X. The dimensionality of a third order tensor is the product of the dimensionalities of its indices, I × J × K; an index has dimensionality I if it ranges over the integers from 1 to I

The running example is a 3 × 2 × 3 tensor, shown as three frontal slices (rows i = 1..3, columns j = 1, 2):

         k=1               k=2               k=3
      j=1    j=2        j=1    j=2        j=1    j=2
i=1   40.0   82.1       85.3   44.8       3.2    3.3
i=2   16.7   69.5       73.4   51.9       8.0    10.1
i=3   5.2    7.0        9.3    4.7        48.4   53.6
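The example tensor can be built directly with numpy. A sketch: 0-based numpy indices replace the slides' 1-based ones, and the index-to-word correspondence is the running example's:

```python
import numpy as np

# The 3 x 2 x 3 example tensor: index i over words w1 (marine, sergeant,
# teacher), j over links (own, use), k over words w2 (bomb, gun, book).
X = np.zeros((3, 2, 3))
X[:, :, 0] = [[40.0, 82.1], [16.7, 69.5], [5.2, 7.0]]    # k: bomb
X[:, :, 1] = [[85.3, 44.8], [73.4, 51.9], [9.3, 4.7]]    # k: gun
X[:, :, 2] = [[3.2, 3.3], [8.0, 10.1], [48.4, 53.6]]     # k: book

print(X[0, 1, 0])  # x_121, the weight of (marine, use, bomb): 82.1
```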
Tensor fibers
A fiber is the higher order equivalent of matrix rows and columns. A mode-n fiber is a fiber where only the n-th index has not been fixed:
x∗11 = (40.0, 16.7, 5.2) (mode-1 fiber)
x2∗3 = (8.0, 10.1) (mode-2 fiber)
x32∗ = (7.0, 4.7, 53.6) (mode-3 fiber)
[Figure: the three fibers highlighted in the 3 × 2 × 3 example tensor]
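In numpy terms, fibers are just slices with one free index. The same toy tensor as above, with 0-based indexing standing in for the slides' 1-based subscripts:

```python
import numpy as np

# The 3 x 2 x 3 example tensor (0-based numpy indices).
X = np.zeros((3, 2, 3))
X[:, :, 0] = [[40.0, 82.1], [16.7, 69.5], [5.2, 7.0]]
X[:, :, 1] = [[85.3, 44.8], [73.4, 51.9], [9.3, 4.7]]
X[:, :, 2] = [[3.2, 3.3], [8.0, 10.1], [48.4, 53.6]]

# A mode-n fiber fixes every index except the n-th:
mode1 = X[:, 0, 0]   # x_*11 = (40.0, 16.7, 5.2)
mode2 = X[1, :, 2]   # x_2*3 = (8.0, 10.1)
mode3 = X[2, 1, :]   # x_32* = (7.0, 4.7, 53.6)
print(mode2.tolist())  # [8.0, 10.1]
```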
Tuple structures as third order labeled tensors
A labeled tensor X^λ is a tensor such that for each of its indices there is a one-to-one mapping of the integers from 1 to I (the dimensionality of the index) to I distinct strings (index labels); i : λ denotes an index element labeled with the string λ
Tuple labeled tensor: every weighted tuple structure TW built from W1, L and W2 is represented as a labeled third order tensor X^λ with its 3 indices labeled by W1, L and W2, respectively, such that for each weighted tuple ⟨⟨w1, l, w2⟩, vt⟩ ∈ TW there is a tensor entry (i : w1, j : l, k : w2) = vt
Tuple structures as third order labeled tensors
The weighted tuple structure of the running example corresponds to the labeled tensor slices:

                k=1:bomb           k=2:gun            k=3:book
              j=1:own  j=2:use   j=1:own  j=2:use   j=1:own  j=2:use
i=1:marine     40.0     82.1      85.3     44.8      3.2      3.3
i=2:sergeant   16.7     69.5      73.4     51.9      8.0      10.1
i=3:teacher    5.2      7.0       9.3      4.7       48.4     53.6
Tensor matricization
Matricization rearranges a higher order tensor into a matrix (Kolda 2006, Kolda & Bader 2009). Mode-n matricization arranges the mode-n fibers to be the columns of the resulting Dn × Dj matrix, where Dn is the dimensionality of the n-th index and Dj is the product of the dimensionalities of the other indices
Mode-n matricization: each tensor entry (i1, i2, ..., iN) is mapped to a matrix entry (in, j), where j is computed as:

j = 1 + Σ_{k=1, k≠n}^{N} (i_k − 1) · Π_{m=1, m≠n}^{k−1} D_m     (1)
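Equation (1) amounts to a Fortran-order reshape after moving the n-th axis to the front. A minimal numpy sketch of my own (with 0-based mode numbers, vs. the slides' 1-based modes):

```python
import numpy as np

def unfold(X, n):
    # Mode-n matricization: mode-n fibers become columns, with earlier
    # remaining indices varying fastest, matching equation (1).
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

# The 3 x 2 x 3 example tensor:
X = np.zeros((3, 2, 3))
X[:, :, 0] = [[40.0, 82.1], [16.7, 69.5], [5.2, 7.0]]
X[:, :, 1] = [[85.3, 44.8], [73.4, 51.9], [9.3, 4.7]]
X[:, :, 2] = [[3.2, 3.3], [8.0, 10.1], [48.4, 53.6]]

A = unfold(X, 0)        # 3 x 6 mode-1 matrix
print(A[0].tolist())    # [40.0, 82.1, 85.3, 44.8, 3.2, 3.3]
```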
Tensor matricization: mode-1 matricization
Arranging the mode-1 fibers of the example tensor as columns yields the 3 × 6 matrix:

A_mode-1   j=1    j=2    j=3    j=4    j=5    j=6
i=1        40.0   82.1   85.3   44.8   3.2    3.3
i=2        16.7   69.5   73.4   51.9   8.0    10.1
i=3        5.2    7.0    9.3    4.7    48.4   53.6
Labeled tensor matricization
In DM, mode-n matricization is applied to labeled tensors, and its outcome is a labeled matrix: the row labels are the labels of the n-th index in the tensor, and the column labels are the labels of the mode-n tensor fibers. Each mode-n fiber of a tensor X^λ is labeled with the binary tuple whose elements are the labels of the corresponding fixed index elements:
x∗11 = (40.0, 16.7, 5.2) : ⟨own, bomb⟩
x2∗1 = (16.7, 69.5) : ⟨sergeant, bomb⟩
x32∗ = (7.0, 4.7, 53.6) : ⟨teacher, use⟩
Labeled mode-n matricization: given a labeled third order tensor X^λ, labeled mode-n matricization maps each entry (i1 : λ1, i2 : λ2, i3 : λ3) to the labeled entry (in : λn, j : λj), such that j is obtained according to equation (1), and λj is the binary tuple obtained from the triple ⟨λ1, λ2, λ3⟩ by removing λn
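The column labels of a labeled mode-n matricization can be generated from the index label sets alone. A sketch of mine, keeping the column order of equation (1), where the first remaining index varies fastest:

```python
from itertools import product

# Index label sets of the running example.
W1 = ["marine", "sergeant", "teacher"]
L  = ["own", "use"]
W2 = ["bomb", "gun", "book"]

def column_labels(index_labels, n):
    # Drop the n-th label set; enumerate the remaining combinations with
    # the first remaining index varying fastest (itertools.product varies
    # the last factor fastest, hence the two reversals).
    rest = [labels for m, labels in enumerate(index_labels) if m != n]
    return [tuple(reversed(c)) for c in product(*reversed(rest))]

cols = column_labels([W1, L, W2], 0)
print(cols[:2])  # [('own', 'bomb'), ('use', 'bomb')]
```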
Labeled tensor matricization
The mode-1, mode-2 and mode-3 matrices obtained from the tuple labeled tensor X^λ:

A_mode-1   ⟨own,bomb⟩  ⟨use,bomb⟩  ⟨own,gun⟩  ⟨use,gun⟩  ⟨own,book⟩  ⟨use,book⟩
marine      40.0        82.1        85.3       44.8       3.2         3.3
sergeant    16.7        69.5        73.4       51.9       8.0         10.1
teacher     5.2         7.0         9.3        4.7        48.4        53.6

B_mode-2   ⟨marine,bomb⟩  ⟨sergeant,bomb⟩  ⟨teacher,bomb⟩  ⟨marine,gun⟩  ⟨sergeant,gun⟩  ⟨teacher,gun⟩  ⟨marine,book⟩  ⟨sergeant,book⟩  ⟨teacher,book⟩
own         40.0           16.7             5.2             85.3          73.4            9.3            3.2            8.0              48.4
use         82.1           69.5             7.0             44.8          51.9            4.7            3.3            10.1             53.6

C_mode-3   ⟨marine,own⟩  ⟨marine,use⟩  ⟨sergeant,own⟩  ⟨sergeant,use⟩  ⟨teacher,own⟩  ⟨teacher,use⟩
bomb        40.0          82.1          16.7            69.5            5.2            7.0
gun         85.3          44.8          73.4            51.9            9.3            4.7
book        3.2           3.3           8.0             10.1            48.4           53.6
The DM semantic spaces
The rows and columns of the 3 matrices resulting from mode-n matricization of a third order tensor are vectors in semantic spaces; the vector dimensions are the corresponding column (resp. row) elements
Given the constraints on the tuple structure TW:
for each column of the mode-1 matrix labeled by ⟨l, w2⟩, there is an identical column in the mode-3 matrix labeled by ⟨w1, k⟩, where k is the inverse link of l and w1 = w2
for any row w2 in the mode-3 matrix, there is an identical row w1 in the mode-1 matrix
The DM semantic spaces
Given a weighted tuple structure TW, the matricization of the corresponding labeled third order tensor X^λ generates 4 distinct semantic vector spaces:
word by link-word (W1×LW2): vectors labeled with words w1, dimensions labeled with tuples of type ⟨l, w2⟩
word-word by link (W1W2×L): vectors labeled with tuples of type ⟨w1, w2⟩, dimensions labeled with links l
word-link by word (W1L×W2): vectors labeled with tuples of type ⟨w1, l⟩, dimensions labeled with words w2
link by word-word (L×W1W2): vectors labeled with links l, dimensions labeled with tuples of type ⟨w1, w2⟩
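The four spaces can all be read off one tensor by unfolding and, where needed, transposing. A hedged numpy sketch (my own variable names, 0-based mode numbers):

```python
import numpy as np

def unfold(X, n):
    # mode-n matricization (0-based n): mode-n fibers become columns
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

# The example tensor: 3 words w1, 2 links, 3 words w2.
X = np.zeros((3, 2, 3))
X[:, :, 0] = [[40.0, 82.1], [16.7, 69.5], [5.2, 7.0]]
X[:, :, 1] = [[85.3, 44.8], [73.4, 51.9], [9.3, 4.7]]
X[:, :, 2] = [[3.2, 3.3], [8.0, 10.1], [48.4, 53.6]]

W1xLW2 = unfold(X, 0)      # word by link-word:  3 x 6
LxW1W2 = unfold(X, 1)      # link by word-word:  2 x 9
W1W2xL = unfold(X, 1).T    # word-word by link:  9 x 2
W1LxW2 = unfold(X, 2).T    # word-link by word:  6 x 3

print(W1xLW2.shape, W1W2xL.shape, W1LxW2.shape)
```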
Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5. Summary and conclusions
The DM models
DM models correspond to different ways to construct the underlying weighted tuple structure.
DepDM — unlexicalized model: links as dependency paths (Curran & Moens 2002, Grefenstette 1994, Padó & Lapata 2007, Rothenhäusler & Schütze 2009)
LexDM heavily lexicalized model links as lexicalized dependency paths and lexico-syntactic shallow patterns (Hearst 1992, Pantel & Pennacchiotti 2006, Turney 2006)
TypeDM mildly lexicalized model links as lexicalized dependency paths and lexico-syntactic shallow patterns, but with a different scoring function based on pattern type frequency (Baroni et al. 2010, Davidov & Rappoport 2008a, Davidov & Rappoport 2008b)
All models share the same corpus and the same W1 = W2 sets, but differ in their links (L) and their scoring function
The DM models
The DM corpus: 2.830 billion tokens, resulting from concatenating
ukWaC, about 1.915 billion tokens of Web-derived texts
English Wikipedia, a mid-2009 dump of about 820 million tokens
the British National Corpus, about 95 million tokens
the corpus was tokenized, POS-tagged and lemmatized with the TreeTagger, and dependency-parsed with the MaltParser (Nivre et al. 2007)
The label sets: W1 = W2 = 30,693 lemmas (20,410 nouns, 5,026 verbs and 5,257 adjectives), i.e. the top 20,000 most frequent nouns and the top 5,000 most frequent verbs and adjectives, augmented with lemmas in various standard test sets, such as the TOEFL and SAT lists.
DepDM
L_DepDM contains 796 direct and inverse links formed by N-V, N-N and A-N dependencies:
sbj intr: The teacher is singing → ⟨teacher, sbj intr, sing⟩
sbj tr: The soldier is reading a book → ⟨soldier, sbj tr, read⟩
iobj: The soldier gave the woman a book → ⟨woman, iobj, give⟩
nmod: good teacher → ⟨good, nmod, teacher⟩
coord: teachers and soldiers → ⟨teacher, coord, soldier⟩
prd: The soldier became sergeant → ⟨sergeant, prd, become⟩
verb: The soldier is reading a book → ⟨soldier, verb, book⟩
preposition: I saw a soldier with the gun → ⟨gun, with, soldier⟩
The scoring function σ is the Local Mutual Information (LMI) (Evert 2005), computed on the word-link-word co-occurrence counts (negative LMI values are raised to 0):

    LMI = O_ijk log (O_ijk / E_ijk)    (2)

The DepDM tensor contains about 110M non-zero tuples.
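A minimal sketch of LMI scoring over word-link-word counts. Estimating the expected count E_ijk as the product of the three marginals over N² is an assumption of this sketch (one common independence estimate), not necessarily the exact estimator used by the authors.

```python
import math
from collections import Counter

def lmi_weights(counts):
    """LMI = O * log(O / E) per tuple (Evert 2005), with E the expected
    count under independence of the three marginals; negative values
    are raised to 0, as in the DM models."""
    n = sum(counts.values())
    f1, fl, f2 = Counter(), Counter(), Counter()
    for (w1, l, w2), o in counts.items():
        f1[w1] += o
        fl[l] += o
        f2[w2] += o
    weights = {}
    for (w1, l, w2), o in counts.items():
        e = f1[w1] * fl[l] * f2[w2] / (n * n)   # assumed marginal-product estimate
        weights[(w1, l, w2)] = max(0.0, o * math.log(o / e))
    return weights
```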
LexDM
L_LexDM contains 3,352,148 direct and inverse complex links, each with the structure pattern+suffix.
The suffix is formed by two substrings separated by a +, each encoding various features of w1 and w2 respectively:
their POS and morphological features (number for N, number and tense for V)
the presence of an article (further specified with its definiteness value) and of adjectives for N
the presence of adverbs for A, and the presence of adverbs, modals and auxiliaries for V, together with their diathesis (for passive only)
If the adjective (adverb) modifying w1 or w2 belongs to a list of 10 (250) high-frequency adjectives (adverbs), the suffix string contains the adjective (adverb) itself, otherwise only its POS.
The tall soldier has already shot → ⟨soldier, sbj intr+n-the-j+vn-aux-already, shoot⟩
LexDM
The patterns in the LexDM links include:
L_DepDM: The man shot → ⟨man, sbj intr+n-the+vn, shoot⟩
verb: 52 high-frequency verbs are lexicalized: The soldier used a gun → ⟨soldier, use+n-the+n-a, gun⟩
is: The soldier is tall → ⟨tall, is+j+n-the, soldier⟩
preposition-link noun-preposition: the arrival of a number of soldiers → ⟨soldier, of-number-of+ns+n-the, arrival⟩
attribute noun: "(the) attribute noun of (a|the) NOUN is ADJ" (Almuhareb & Poesio 2004) and "(a|the) ADJ attribute noun of NOUN" (Veale & Hao 2007): the colour of strawberries is red → ⟨red, colour+j+ns, strawberry⟩
as adj as: "as ADJ as (a|the) NOUN" (Veale & Hao 2007): as sharp as a knife → ⟨sharp, as adj as+j+n-a, knife⟩
such as: "NOUN such as NOUN" and "such NOUN as NOUN" (Hearst 1992): animals such as cats → ⟨animal, such as+ns+ns, cat⟩
The scoring function σ is LMI, and the LexDM tensor contains about 355M non-zero tuples.
TypeDM
L_TypeDM contains 25,336 direct and inverse links that correspond to the patterns in the LexDM links.
The LexDM pattern suffixes are used to count the number of distinct surface realizations of each pattern: the two LexDM links of⁻¹+n-a+n-the and of⁻¹+ns-j+n-the are counted as two occurrences of the same TypeDM link of⁻¹.
The scoring function σ computes LMI on the number of distinct suffix types displayed by a link. The TypeDM tensor contains about 130M non-zero tuples.
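The type-frequency counting behind TypeDM can be sketched as follows. The `pattern+suffix` string format follows the suffix convention described above; the function name and list-based input are illustrative assumptions.

```python
from collections import defaultdict

def typedm_counts(lexdm_tuples):
    """Collapse LexDM links of the form 'pattern+suffix' onto their bare
    pattern and count DISTINCT suffix types per (w1, pattern, w2) triple,
    as in TypeDM's type-based scoring."""
    suffixes = defaultdict(set)
    for (w1, link, w2) in lexdm_tuples:
        pattern, _, suffix = link.partition("+")  # split at the first '+'
        suffixes[(w1, pattern, w2)].add(suffix)
    # the count feeding LMI is the number of distinct suffix types
    return {triple: len(s) for triple, s in suffixes.items()}
```

Repeated surface realizations of the same suffix do not increase the score; only new suffix types do, which is what makes the scoring reward pattern variety rather than raw frequency.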
Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5. Summary and conclusions
Semantic experiments with DM
For each space, DM has been tested on semantic experiments, modeled by applying (some combination of) a small number of geometric operations:
vector length and normalization

    ||v|| = sqrt( Σ_{i=1}^{n} v_i² )    (3)

similarity as vector cosine

    cos(x, y) = ( Σ_{i=1}^{n} x_i y_i ) / ( ||x|| ||y|| )    (4)
vector sum (centroid) two or more normalized vectors are summed by adding their values on each dimension
projection onto a subspace a vector with i dimensions is projected onto a subspace with k < i dimensions through multiplication by a square diagonal matrix with 1s in the diagonal cells corresponding to the k dimensions, and 0s elsewhere
Semantic experiments with the DM spaces preliminary observations
The experiments correspond to key semantic tasks in computational linguistics and/or cognitive science, so far typically addressed by distinct DSMs.
To support the view of DM as a generalized model, we have maximized the variety of aspects of meaning covered by the experiments.
The choice of the DM semantic space used to tackle a particular task is essentially based on the "naturalness" with which the task can be modeled in that space: many alternatives are conceivable, both with respect to the space selection and to the type of operations performed on the space.
Our current aim is to prove that each space derived through tensor matricization is semantically interesting
Semantic experiments with the DM spaces preliminary observations
No feature selection/reweighting, dimensionality reduction or task-specific optimization have been used in the experiments the same underlying tuple tensor is used in all the experiments the experiment results should be regarded as a sort of “baseline” performance to be enhanced by task-specific parameter tuning
DM performance is compared to the results available in the literature and to our implementation of state-of-the-art DSMs alternative models trained on the same DM corpus (with the same linguistic pre-processing)
Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5. Summary and conclusions
The word by link-word (W1×LW2) space
vectors labeled with words w1 (rows of the mode-1 matrix)
dimensions labeled with binary tuples of type ⟨l, w2⟩ (columns of the mode-1 matrix)

              ⟨own,bomb⟩  ⟨use,bomb⟩  ⟨own,gun⟩  ⟨use,gun⟩  ⟨own,book⟩  ⟨use,book⟩
    marine       40.0        82.1       85.3       44.8        3.2         3.3
    sergeant     16.7        69.5       73.4       51.9        8.0        10.1
    teacher       5.2         7.0        9.3        4.7       48.4        53.6

The space dimensions represent attributes of words. The semantic tasks addressed with the W1×LW2 space involve measuring the attributional similarity among words:
1. similarity judgments
2. synonym detection
3. noun categorization
4. selectional preferences
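Attributional similarity means that rows of the matrix above should be close in cosine exactly when the words share attributes. A quick check on the toy values (the scores are the example figures from the slide, nothing more):

```python
import math

# Row vectors from the example W1×LW2 matrix; dimension order:
# <own,bomb>, <use,bomb>, <own,gun>, <use,gun>, <own,book>, <use,book>
rows = {
    "marine":   [40.0, 82.1, 85.3, 44.8, 3.2, 3.3],
    "sergeant": [16.7, 69.5, 73.4, 51.9, 8.0, 10.1],
    "teacher":  [5.2, 7.0, 9.3, 4.7, 48.4, 53.6],
}

def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(a * a for a in y))
    return dot / (nx * ny)
```

As expected, marine is far more similar to sergeant (both load on the weapon dimensions) than to teacher (who loads on the book dimensions).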
Alternative models for the W1×LW2 space
Win – unstructured DSM that relies on target-context linear proximity (Bullinaria & Levy 2007, Lund & Burgess 1996, Schütze 1997)
based on co-occurrences of the same 30,000 W1 (W2) used for DM, within a window of maximally 5 content words
counts converted to LMI weights (negative LMI values are raised to 0)
the Win matrix has about 110 million non-zero entries
DV – a structured DSM, but dependency paths are not part of the attributes (implementation of the Dependency Vectors approach of Padó & Lapata 2007)
DV is obtained from the same co-occurrence data as DepDM
counts converted to LMI weights (negative LMI values are raised to 0)
the DV matrix contains about 38 million non-zero values
Similarity judgments
Data set: Rubenstein and Goodenough (1965) (R&G), 65 noun pairs rated by 51 subjects on a 0-4 similarity scale:

    car    automobile  3.9
    food   fruit       2.7
    cord   smile       0.0

Correlation between noun distances (cosines) in the W1×LW2 space and R&G ratings, evaluated with Pearson's r (Padó & Lapata 2007):

    model         r      model     r      model       r
    DoubleCheck¹  85     Win       65     DV          57
    TypeDM        82     DV-07³    62     LexDM       53
    SVD-09²       80     DepDM     57     cosDV-07³   47

¹ Chen et al. (2006); ² Herdağdelen et al. (2009); ³ Padó & Lapata (2007)
Synonym detection
Data set: TOEFL (Landauer & Dumais 1997), 80 multiple-choice questions:

    target: levied    candidates: imposed, believed, requested, correlated

DM picks the candidate with the highest cosine to the target item as its guess of the right synonym.

    model        accuracy     model       accuracy     model      accuracy
    LSA-03¹       92.50       DV           76.87       DV-07⁹      73.00
    GLSA²         86.25       TypeDM       76.87       Win         69.37
    PPMIC³        85.00       PairClass⁷   76.25       Human¹⁰     64.50
    CWO⁴          82.55       DepDM        75.01       LSA-97¹⁰    64.38
    PMI-IR-03⁵    81.25       LexDM        74.37       Random      25.00
    BagPack⁶      80.00       PMI-IR-01⁸   73.75

¹ Rapp (2003); ² Matveeva et al. (2005); ³ Bullinaria & Levy (2007); ⁴ Ruiz-Casado et al. (2005); ⁵ Terra & Clarke (2003); ⁶ Herdağdelen & Baroni (2009); ⁷ Turney (2008); ⁸ Turney (2001); ⁹ Padó & Lapata (2007); ¹⁰ Landauer & Dumais (1997)
Noun categorization Categorization tasks are a crucial probe into the semantic organization of the lexicon e.g. to investigate the human ability to arrange concepts hierarchically into taxonomies (Murphy 2002)
Corpus-based semantics is interested in investigating whether distributional (attributional) similarity could be used to group words into semantically coherent categories e.g. for semantic typing
Categorization as an unsupervised clustering task nouns are clustered with CLUTO (Karypis 2003), using their similarity matrix based on pairwise cosines repeated bisections algorithm with global optimization method (parameters with default values in CLUTO)
cluster quality is evaluated by percentage purity (Zhao & Karypis 2001):

    Purity = (1/n) Σ_{r=1}^{k} max_i (n_r^i)

where n_r^i is the number of items of gold class i in cluster r
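The purity formula above amounts to letting each cluster vote for its majority gold class; a minimal sketch (the dict-based input format is an assumption of this example):

```python
from collections import Counter

def purity(clusters, gold):
    """Percentage purity (Zhao & Karypis 2001).
    `clusters` maps item -> cluster id, `gold` maps item -> gold class."""
    by_cluster = {}
    for item, c in clusters.items():
        by_cluster.setdefault(c, []).append(gold[item])
    # each cluster contributes the size of its largest gold-class block
    hits = sum(Counter(labels).most_common(1)[0][1]
               for labels in by_cluster.values())
    return 100.0 * hits / len(clusters)
```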
Noun categorization: Almuhareb & Poesio (AP)
Data set: 402 noun concepts from WordNet, balanced in terms of frequency and ambiguity. Concepts must be clustered into 21 classes, corresponding to the 21 unique WordNet beginners (13-21 nouns per class):

    VEHICLES:   helicopter, motorcycle, ...
    MOTIVATION: ethics, incitement, ...

    model          purity     model    purity
    DepPath¹         79       DV         65
    TypeDM           76       DepDM      62
    AttrValue-06²    71       LexDM      59
    Win              71       Random      5
    VSM³             70

¹ Rothenhäusler & Schütze (2009); ² Almuhareb (2006); ³ Herdağdelen et al. (2009)
Noun categorization: Battig
Data set: 83 concepts from the expanded Battig and Montague norms of Van Overschelde et al. (2004) (cf. Baroni et al. 2010). Nouns are highly prototypical instances of 10 common concrete categories (up to 10 concepts per class):

    LAND MAMMALS: dog, elephant, ...
    TOOLS:        screwdriver, hammer, ...

    model     purity     model       purity
    Win         96       DV-10¹        79
    TypeDM      94       LexDM         78
    Strudel¹    91       SVD-10¹       71
    DepDM       90       AttrValue¹    45
    DV          84       Random        12

¹ Baroni et al. (2010)
Noun categorization: ESSLLI 2008
Data set: 44 concrete nouns grouped into hierarchically organized classes (ESSLLI 2008 shared task)
6 lower classes: BIRDS, LAND ANIMALS, FRUIT, GREENS, TOOLS, VEHICLES
3 middle classes: ANIMALS, VEGETABLES, ARTIFACTS
2 top classes: LIVING BEINGS, OBJECTS

    model      6-way purity   3-way purity   2-way purity   avg purity
    TypeDM          84             98             100           94.0
    Katrenko¹       91            100              80           90.3
    DepDM           75             93             100           89.4
    DV              75             93             100           89.3
    LexDM           75             87             100           87.3
    Peirsman¹       82             84              86           84.0
    Win             75             86              59           73.3
    Shaoul¹         41             52              55           49.3
    Random          29             45              54           42.7

¹ ESSLLI 2008 shared task
Selectional preferences
The W1×LW2 space can be used to work with more abstract notions, such as that of a typical filler of a verb argument slot. The selectional preferences of a predicate cannot be reduced simply to the set of its attested arguments in a corpus: we must account for the possibility of generalization to unseen arguments.
kill the aardvark - OK, since aardvark is a living entity
kill the serendipity - BAD, since serendipity is not a living entity
Data set: human plausibility judgments (on a 7-point scale) of noun-verb pairs from McRae et al. (1997) (100 pairs, 36 raters) and Padó (2007) (211 pairs, ~20 raters per pair):

    shoot  deer  obj   6.4
    shoot  deer  subj  1.0
Selectional preferences in the W1×LW2 space
1. Select a set of prototypical subj (obj) nouns of the verb v
   project the W1×LW2 vectors onto the subspace defined by the dimensions labeled with ⟨l_sbj, v⟩ (⟨l_obj, v⟩); l_sbj is any link containing either the string sbj intr or the string sbj tr, l_obj is any link containing the string obj
   measure the length of the noun vectors in this subspace and pick the top n longest ones as prototypical subj (obj) of v (n = 20)
2. Build prototype subj (obj) argument vectors for v
   the vectors (in the full W1×LW2 space) of the picked nouns are normalized and summed
   the result is a centroid vector representing an abstract "subj (obj) prototype" for v
3. Measure the plausibility of an arbitrary noun n as the subj (obj) of v
   plausibility is modeled with the distance between the n vector and the subj (obj) prototype vectors
The DM approach is directly inspired by the model of Erk et al. 2007; in DM, all the steps are carried out in the same W1×LW2 matrix.
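The three steps above can be sketched as follows. The sparse dict representation, the toy vectors in the test, and the function names are illustrative assumptions; only the pipeline (subspace length → centroid → cosine) follows the procedure described here.

```python
import math

def _length(vec):
    return math.sqrt(sum(x * x for x in vec.values()))

def argument_prototype(space, verb, link_test, n=20):
    """Steps 1-2: pick the top-n nouns by vector length in the <l, verb>
    subspace, then sum their normalized FULL vectors into a centroid.
    `space` maps each noun to a {(link, word): score} dict."""
    def sub_length(vec):
        return math.sqrt(sum(x * x for (l, w), x in vec.items()
                             if w == verb and link_test(l)))
    fillers = sorted(space, key=lambda noun: sub_length(space[noun]),
                     reverse=True)[:n]
    proto = {}
    for noun in fillers:
        ln = _length(space[noun]) or 1.0
        for dim, x in space[noun].items():
            proto[dim] = proto.get(dim, 0.0) + x / ln
    return fillers, proto

def plausibility(space, noun, proto):
    """Step 3: cosine between the noun vector and the prototype centroid."""
    dot = sum(space[noun].get(dim, 0.0) * x for dim, x in proto.items())
    return dot / ((_length(space[noun]) or 1.0) * (_length(proto) or 1.0))
```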
Selectional preferences: results
Performance measured with the Spearman ρ correlation coefficient between the average human ratings and the model predictions (Padó et al. 2007):

    McRae                            Padó
    model    coverage   ρ           model      coverage   ρ
    Padó¹       56      41          BagPack²     100      60
    DepDM       97      32          TypeDM       100      51
    LexDM       97      29          Padó¹         97      51
    TypeDM      97      28          ParCos¹       98      48
    ParCos¹     91      21          DepDM        100      35
    Resnik¹     94       3          LexDM        100      34
                                    Resnik¹       98      24

¹ Padó et al. (2007); ² Herdağdelen & Baroni (2009)
Some conclusions on the W1×LW2 space
DM models perform very well in tasks involving attributional similarity, but the performance of unstructured DSMs (including Win) is equally high. The best DM model (TypeDM) also achieves excellent results in capturing selectional preferences, a task that is not directly addressable by unstructured DSMs.
The real advantage of structured DSMs (like DM) lies in their versatility: they can address a much larger and more varied range of semantic tasks.
Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1 W2×L space, the W1 L×W2 space, the L×W1 W2 space
5. Summary and conclusions
The word-word by link (W1 W2×L) space
vectors labeled with word pair tuples ⟨w1, w2⟩ (columns of the mode-2 matrix)
dimensions labeled with links l (rows of the mode-2 matrix)

        1:⟨marine,bomb⟩   2:⟨sergeant,bomb⟩   3:⟨teacher,gun⟩
1:own        40.0               16.7                5.2
2:use        82.1               69.5                7.0

The space dimensions represent links as attributes of word pairs. The W1 W2×L space can be used to solve semantic tasks based on relational similarity...
1. recognizing analogies
2. relation classification
...but also problems not traditionally defined in terms of a word-pair-by-link matrix:
3. qualia extraction
4. predicting characteristic properties of concepts
Smoothing W1 W2×L with the W1×LW2 space
For the analogy and relation classification tasks (where the target pairs are known in advance), target pair vectors are smoothed with new pairs containing their attributional neighbors:
each word of a target pair is combined in turn with the top 20 nearest W1×LW2 neighbors of the other word, obtaining a total of 41 pairs (including the original)
the centroid of the W1 W2×L vectors of these pairs is then taken to represent the target pair
e.g., the smoothed ⟨automobile, wheel⟩ vector is an average of the ⟨automobile, wheel⟩, ⟨car, wheel⟩, ⟨automobile, circle⟩, etc., vectors
Nearest neighbors are searched in the W1×LW2 matrix compressed to 5,000 dimensions via Random Indexing (with the parameters suggested by Sahlgren 2005)
DM smoothing is similar to the method proposed by Turney (2006); in DM, however, the attributional and relational spaces are both derived from the same tensor
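The smoothing step can be sketched roughly as follows; the pair vectors and neighbor lists are invented toy data, and `pair_space` stands in for the W1 W2×L matrix:

```python
def centroid(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def smoothed_pair_vector(w1, w2, pair_space, neighbors, k=20):
    """Average the pair's own W1W2xL vector with the vectors of pairs
    obtained by swapping in the top-k attributional neighbors of each word."""
    candidates = [(w1, w2)]
    candidates += [(n, w2) for n in neighbors.get(w1, [])[:k]]
    candidates += [(w1, n) for n in neighbors.get(w2, [])[:k]]
    vecs = [pair_space[p] for p in candidates if p in pair_space]
    return centroid(vecs)

# Toy example: smoothing <automobile, wheel> with one neighbor of each word.
pair_space = {
    ("automobile", "wheel"):  [2.0, 1.0],
    ("car", "wheel"):         [4.0, 3.0],
    ("automobile", "circle"): [0.0, 2.0],
}
neighbors = {"automobile": ["car"], "wheel": ["circle"]}
vec = smoothed_pair_vector("automobile", "wheel", pair_space, neighbors)
# vec is the average of the three vectors above: [2.0, 2.0]
```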
Alternative models for the W1 W2×L space

LRA: reimplementation of Latent Relational Analysis (Turney 2006, "baseline LRA system") using the DM corpus
for a given set of target pairs, count all the patterns that connect them, in either order, in the corpus
patterns are sequences of 1 to 3 words occurring between the targets, with all, none or any subset of the elements replaced by wildcards (with the, with *, * the, * *)
only the top 4,000 most frequent patterns are preserved
a target-pair-by-pattern matrix is constructed (with 8,000 dimensions, to account for directionality); values in the matrix are log- and entropy-transformed according to Turney's formula
SVD is applied, reducing the columns to the top 300 latent dimensions; target pairs are smoothed with the same DM method, with the neighbors for target pair expansion taken from the best attributional DM model (TypeDM)
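The wildcard expansion of the intervening word sequences can be sketched as follows; this is a toy illustration of the pattern-generation step only, with an invented example sequence:

```python
from itertools import combinations

def wildcard_patterns(words):
    """All variants of a 1-to-3 word sequence with any subset of the
    words replaced by the * wildcard (including none and all of them)."""
    patterns = set()
    for r in range(len(words) + 1):
        for idx in combinations(range(len(words)), r):
            patterns.add(" ".join("*" if i in idx else w
                                  for i, w in enumerate(words)))
    return patterns

# The sequence "with the" yields 2^2 = 4 patterns.
print(sorted(wildcard_patterns(["with", "the"])))
# ['* *', '* the', 'with *', 'with the']
```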
Recognizing analogies
Data set: 374 SAT multiple-choice questions (Turney 2006); each question includes 1 target pair (the stem) and 5 answer pairs; the task is to choose the pair most analogous to the stem

stem:     mason : stone
answers:  teacher : chalk
          carpenter : wood
          soldier : gun
          photograph : camera
          book : word
Recognizing SAT analogies: results
The answer pair with the highest cosine to the stem pair in the W1 W2×L space is selected as the correct analogy.

model        accuracy
Human¹         57.0
LRA-06²        56.1
PERT³          53.3
PairClass⁴     52.1
VSM¹           47.1
BagPack⁵       44.1
k-means⁶       44.0
TypeDM         42.4
LSA⁷           42.0
LRA            37.7
PMI-IR-06²     35.0
DepDM          31.4
LexDM          29.3
Random         20.0

¹ Turney & Littman (2005); ² Turney (2006a); ³ Turney (2006b); ⁴ Turney (2008); ⁵ Herdağdelen & Baroni (2009); ⁶ Biçici & Yuret (2006); ⁷ Mangalath et al. (2004)
Classifying relations with DM
Relation classification requires grouping word pairs into classes that instantiate the same relation (e.g. CAUSE, PART-OF, etc.); the common approach to this task is supervised.
Nearest centroid method
when both positive and negative examples are available for a relation type R, a positive centroid is created by summing the W1 W2×L vectors of the positive example pairs of R, and a negative centroid by summing the W1 W2×L vectors of the negative example pairs; whether a test pair instantiates R is decided by whether its cosine is higher with the positive or the negative centroid
when there are no negative examples, a centroid is created for each class, and test items are classified according to their nearest centroid
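A minimal sketch of the nearest-centroid classifier, with invented toy vectors in place of W1 W2×L pair vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    # Summing (rather than averaging) changes only the length of the
    # centroid, not its direction, so it is equivalent under the cosine.
    return [sum(col) for col in zip(*vectors)]

def nearest_centroid(test_vec, labeled_examples):
    """labeled_examples: {class_label: [vectors of example pairs]}.
    Returns the class whose centroid is closest (by cosine) to test_vec."""
    return max(labeled_examples,
               key=lambda c: cosine(test_vec, centroid(labeled_examples[c])))

# Toy binary case: positive vs negative examples of one relation.
examples = {
    "positive": [[3.0, 1.0], [4.0, 0.5]],
    "negative": [[0.2, 5.0], [0.1, 4.0]],
}
assert nearest_centroid([2.0, 0.3], examples) == "positive"
```

With more than two classes and no negative examples, the same `max` over per-class centroids implements the multi-class variant.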
Relation classification: SEMEVAL 2007 data set
7 relation types between nominals from SemEval-2007 Task 04 (Girju et al. 2007): CAUSE-EFFECT, INSTRUMENT-AGENCY, PRODUCT-PRODUCER, ORIGIN-ENTITY, THEME-TOOL, PART-WHOLE, CONTENT-CONTAINER
Instances consist of Web snippets containing word pairs connected by a certain pattern, e.g. "* causes *" for the CAUSE-EFFECT relation
The retrieved snippets were manually classified by the SEMEVAL organizers as positive (cycling-happiness) or negative (customer-satisfaction) instances of a relation (CAUSE-EFFECT); for each relation, 140 training examples and about 80 test cases (ca. 50% positive)
The contexts of the target word pairs (provided with the test set) are not used by the DM models
Relation classification: SEMEVAL 2007 results

Baseline models
Majority: always guesses the majority class in the test set
AllTrue: always assigns an item to the target class
ProbMatch: randomly guesses classes matching their distribution in the test data

model       acc    prec   recall    F
TypeDM      70.2   71.7    62.5    66.4
UCD-FC¹     66.0   66.1    66.7    64.8
AllTrue     48.5   48.5   100.0    64.8
ILK¹        63.5   60.5    69.5    63.8
UCB¹        65.4   62.7    63.0    62.7
LexDM       65.4   64.7    61.3    62.5
LRA         62.0   62.7    59.3    60.2
DepDM       61.8   61.0    57.3    58.9
UMELB-B¹    62.7   61.5    55.7    57.8
UTH¹        58.8   56.1    57.1    55.9
ProbMatch   51.7   48.5    48.5    48.5
UC3M¹       49.9   48.2    40.3    43.1
Majority    57.0   81.3    42.9    30.8

all measures are macro-averaged
¹ SEMEVAL 2007 Task 4 (models in group A: WordNet = NO & Query = NO)
Relation classification: Nastase & Szpakowicz (NS) data set
600 modifier-noun pairs classified by Nastase & Szpakowicz (2003) into 30 relations: CAUSE (cloud-storm), PURPOSE (album-picture), LOCATION-AT (pain-chest), LOCATION-FROM (visitor-country), etc.

model       global acc   prec   recall    F
LRA-06¹        39.8      41.0    35.9    36.6
VSM-AV²        27.8      27.9    26.8    26.5
VSM-WMTS¹      24.7      24.0    20.9    20.3
LRA            22.8      20.3    21.1    18.8
TypeDM         15.4      19.5    20.2    13.7
LexDM          12.1       7.5    14.1     8.1
DepDM           8.7      11.6    14.5     8.1
AllTrue        NA         3.3   100       6.4
ProbMatch       4.7       3.3     3.3     3.3
Majority        8.2       0.3     3.3     0.5

all measures are macro-averaged except accuracy
¹ Turney (2006a); ² Turney & Littman (2005)
Relation classification: Ó Séaghdha & Copestake (OC) data set
1,443 noun-noun compounds classified by Ó Séaghdha & Copestake (2009) into 6 relations: BE (celebrity-winner), HAVE (door-latch), IN (air-disaster), ACTOR (school-inspector), INSTRUMENT (freight-train), ABOUT (bank-panic)

model       global acc   prec   recall    F
OC-Comb¹       63.1      NA      NA      61.6
OC-Rel¹        52.1      NA      NA      49.9
TypeDM         32.1      33.8    33.5    31.4
LexDM          29.7      29.9    28.9    28.7
AllTrue        NA        16.7   100      28.5
LRA            28.2      27.6    27.4    27.2
DepDM          27.6      28.2    28.2    27.0
ProbMatch      17.1      16.7    16.7    16.7
Majority       21.3       3.6    16.7     5.9

all measures are macro-averaged except accuracy
¹ Ó Séaghdha & Copestake (2009)
The W1 W2×L space: interim summary
TypeDM achieves competitive results in semantic tasks involving relational similarity, and generally outperforms our LRA reimplementation; the large advantage of Turney's original LRA might be due to its gigantic training corpus (ca. 50 billion words) and/or to its more sophisticated smoothing technique
While LRA is trained separately for each test set, the structure of the W1 W2×L space is completely task-independent
Pattern-based relation extraction
Pattern-based approaches to semantic relation extraction pick a set of lexico-syntactic patterns that should capture the relation of interest and harvest the word pairs they connect in text; cf. Hearst (1992) for the hyponymy relation
In DM, the same approach can be pursued by exploiting the information already available in the W1 W2×L space: promising links are selected as the DM equivalent of patterns, and relation instances are identified by measuring the length of word pair vectors in the W1 W2×L subspace defined by the selected links
Qualia extraction
Data set: 1,487 noun-quale pairs corresponding to the qualia structures (Pustejovsky 1995) of 30 concrete (door) and abstract (imagination) nouns (Cimiano & Wenderoth 2007); each noun-quale pair was rated by 3 subjects and instantiates one of the four qualia roles defined by Pustejovsky (1995):
Formal: door-barrier; Constitutive: food-fat; Agentive: letter-write; Telic: novel-entertain
Extracting qualia in the W1 W2×L space
1. Selecting the links for qualia extraction: the patterns proposed by Cimiano & Wenderoth (2007) are approximated by manually selecting links that are already in the DM tensors

FORMAL: n as-form-of q, q as-form-of n, n as-kind-of q, n as-sort-of q, n be q, q such as n
AGENTIVE: n as-result-of q, q obj n
CONSTITUTIVE: q as-member-of n, q as-part-of n, n with q, n with-lot-of q, n with-majority-of q, n with-number-of q, n with-sort-of q, n with-variety-of q
TELIC: n for-use-as q, n for-use-in q, n sbj_tr q, n sbj_intr q
Extracting qualia in the W1 W2×L space
2. Creating qualia subspaces of W1 W2×L: for each role r, the W1 W2×L vectors containing a target noun n are projected onto the subspace determined by the link set associated with role r; the lengths of the vectors ⟨n, q⟩ are measured in these subspaces, with q a potential quale for n
3. Ranking potential qualia: the length in the subspace associated with role r is used to rank all ⟨n, q⟩ pairs relevant to r; e.g., the length of ⟨book, read⟩ in the subspace defined by the Telic links is the DM measure of the fitness of read as the Telic role of book
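Steps 2 and 3 can be sketched as follows; the pair vectors are invented toy data, and `TELIC_LINKS` is a stand-in for the Telic link set listed above:

```python
import math

def subspace_length(vec, dims, keep):
    """Euclidean length of vec restricted to the dimensions whose label
    is in `keep` (the links selected for a given qualia role)."""
    return math.sqrt(sum(v * v for v, d in zip(vec, dims) if d in keep))

# Toy W1W2xL vectors for pairs <book, q>; dimensions are link labels.
dims = ["for-use-in", "sbj_tr", "as-part-of"]
pairs = {
    ("book", "read"):  [6.0, 5.0, 0.1],
    ("book", "paper"): [0.2, 0.1, 7.0],
}
TELIC_LINKS = {"for-use-in", "sbj_tr"}

# Rank candidate qualia by length in the Telic subspace.
ranked = sorted(pairs,
                key=lambda p: subspace_length(pairs[p], dims, TELIC_LINKS),
                reverse=True)
assert ranked[0] == ("book", "read")  # read outranks paper as a Telic quale
```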
Qualia extraction: results
For each noun, precision over the ranked list is computed at 11 equally spaced recall levels from 0% to 100%, separately for each role; the precision, recall and F values at the recall level that yields the highest F score are averaged across roles, and then across target nouns.

model       precision   recall    F
TypeDM        26.2       22.7    18.4
P¹            NA         NA      17.1
WebP¹         NA         NA      16.7
LexDM         19.9       23.6    16.2
WebJac¹       NA         NA      15.2
DepDM         17.8       16.9    12.8
Verb-PMI¹     NA         NA      10.7
Base¹         NA         NA       7.6

¹ Cimiano & Wenderoth (2007)
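The best-F cutoff selection over a ranked list can be sketched as follows; this illustrates the scoring of a single noun and role against a gold set, not the authors' exact averaging script, and the candidate list is invented:

```python
def best_f(ranked, gold):
    """Scan a ranked candidate list; at each cutoff compute precision,
    recall and F against the gold set, and keep the cutoff with the best F."""
    hits, best = 0, (0.0, 0.0, 0.0)
    for i, item in enumerate(ranked, 1):
        if item in gold:
            hits += 1
        p, r = hits / i, hits / len(gold)
        f = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f, p, r))
    return best  # (F, precision, recall) at the best-F cutoff

# Toy ranked candidate qualia for one noun, against a 2-item gold set.
f, p, r = best_f(["read", "write", "paper", "sell"], {"read", "sell"})
```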
Describing concept properties
Corpus-based semantic methods have been applied to generate commonsense concept descriptions in terms of intuitively salient properties (Almuhareb 2006, Baroni & Lenci 2008, Baroni et al. 2010): a dog is a mammal, it barks, it has a tail, etc.
Semantic feature norms (property lists collected from subjects in elicitation tasks) are widely used in cognitive science as surrogates of mental features (Garrard et al. 2001, McRae et al. 2005, Vinson & Vigliocco 2008). Large-scale collections of property-based concept descriptions are also carried out in AI, where they are important for commonsense reasoning; cf. Open Mind Common Sense (Liu & Singh 2004)
Predicting characteristic properties with DM
The W1 W2×L space is used to predict the characteristic properties of noun concepts:
all the ⟨n, w2⟩ pairs that have the target nominal concept n as first element are ranked by length in the W1 W2×L space
the longest ⟨n, w2⟩ vectors should correspond to salient properties of the target concept, since we expect a concept to co-occur often in text with its important properties
properties with different POS are normalized by dividing the length of the vector representing a pair by the length of the longest vector with the same POS in the harvested concept-property set
e.g., ⟨car, drive⟩, ⟨car, park⟩ and ⟨car, engine⟩ can be found among the longest W1 W2×L vectors with car as first item
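The ranking-and-normalization step can be sketched as follows; the pair vectors are invented toy data, and `pos_of` is a hypothetical POS lookup for the harvested properties:

```python
import math

def length(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))

def normalized_property_scores(pairs, pos_of):
    """Divide each pair vector's length by the longest length among
    harvested properties with the same POS, so that verb, noun and
    adjective properties become comparable."""
    raw = {p: length(v) for p, v in pairs.items()}
    max_by_pos = {}
    for (concept, prop), score in raw.items():
        pos = pos_of[prop]
        max_by_pos[pos] = max(max_by_pos.get(pos, 0.0), score)
    return {p: s / max_by_pos[pos_of[p[1]]] for p, s in raw.items()}

# Toy <car, property> vectors in the W1W2xL space.
pairs = {
    ("car", "drive"):  [8.0, 6.0],   # verb property
    ("car", "park"):   [4.0, 3.0],   # verb property
    ("car", "engine"): [0.3, 0.4],   # noun property
}
pos_of = {"drive": "v", "park": "v", "engine": "n"}
scores = normalized_property_scores(pairs, pos_of)
# drive and engine each score 1.0 (longest of their POS); park scores 0.5
```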
Predicting characteristic properties: data set
gold standard lists of 10 properties for each of 44 concrete noun concepts (cf. the ESSLLI 2008 unconstrained property generation challenge): the properties most frequently produced by subjects in the elicitation experiment of McRae et al. (2005)
algorithms must generate lists of 10 properties per concept; performance is measured by the cross-concept average proportion of properties in the generated lists that are also in the corresponding gold standard lists

model       overlap   s.d.
Strudel¹     23.9     11.3
TypeDM       19.5     12.4
DepDM        16.1     12.6
LexDM        14.5     12.1
DV-10¹       14.1     10.3
AttrValue¹    8.8      9.9
SVD-10¹       4.1      6.1
Shaoul²       1.8      3.9

¹ Baroni et al. (2010); ² ESSLLI 2008 shared task
Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1 W2×L space, the W1 L×W2 space, the L×W1 W2 space
5. Summary and conclusions
The word-link by word (W1 L×W2) space
vectors labeled with binary tuples of type ⟨w1, l⟩ (columns of the mode-3 matrix)
dimensions labeled with words w2 (rows of the mode-3 matrix)

         1:⟨marine,own⟩   2:⟨marine,use⟩   3:⟨sergeant,own⟩
1:bomb       40.0              82.1              16.7
2:gun        85.3              44.8              73.4
3:book        3.2               3.3               8.0

The W1 L×W2 vectors also represent syntactic slots of verb frames: the vector labeled with the tuple ⟨read, sbj⁻¹⟩ represents the subject slot of the verb read in terms of the distribution of its noun fillers, which label the dimensions of the space
The W1 L×W2 space is used to classify verbs participating in different argument alternations
Argument alternations
Alternations involve the expression of the same semantic argument in two different syntactic slots (Levin & Rappaport-Hovav 2005). Measures of "slot overlap" have been used by Joanis et al. (2008) as features to classify verbs on the basis of their argument alternations: the sets of nouns that appear in two alternating slots should overlap to a certain degree, and the cosine between the vectors of different syntactic slots of the same verb measures the amount of fillers they share
The W1 L×W2 space is used to carry out the automatic classification of verbs that participate in different types of transitivity alternations: in transitivity alternations, verbs allow both for a transitive NP V NP variant and for an intransitive NP V (PP) variant (Levin 1993)
Transitivity alternations in the W1 L×W2 space
Causative/inchoative alternation: with alternating verbs, the object argument (John broke the vase) can also be realized as an intransitive subject (The vase broke)
data set: 402 verbs extracted from Levin's classes (Levin 1993): 232 alternating causative/inchoative verbs (break) and 170 non-alternating transitives (mince)
Merlo & Stevenson (2001) classification task: discriminate 3 classes of verbs, each characterized by a different transitivity alternation
data set: 58 verbs from Merlo & Stevenson (2001): 19 unergative verbs undergoing the "induced action alternation" (race), 19 unaccusative verbs undergoing the "causative/inchoative alternation" (break), and 20 object-drop verbs participating in the "unexpressed object alternation" (play)
Transitivity alternations in the W1 L×W2 space
The similarities between the W1 L×W2 vectors of the transitive subject, intransitive subject, and direct object slots of a verb are used to classify the verbs the W1 L×W2 slot vectors hv , li whose links are sbj intr, sbj tr and obj are extracted for each verb v in a data set for LexDM, we sum the vectors with links beginning with one of these three patterns
a 3-dimensional vector with the cosines between the three slot vectors is built for each v; these second-order vectors encode the profile of similarity across the slots of a verb
verb classification is performed using the nearest centroid method on the 3-dimensional vectors, with leave-one-out cross-validation
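The classification step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the ordering of the three cosines in the second-order vector are assumptions, and the slot vectors here would in practice be the W1 L×W2 rows extracted from the DM tensor.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def second_order_vector(sbj_tr, sbj_intr, obj):
    """Second-order profile of a verb: cosines between its three slot
    vectors (transitive subject, intransitive subject, direct object)."""
    return np.array([cosine(sbj_tr, sbj_intr),
                     cosine(sbj_tr, obj),
                     cosine(sbj_intr, obj)])

def nearest_centroid_loo(profiles, labels):
    """Leave-one-out nearest-centroid classification on the 3-dimensional
    second-order vectors; returns the LOO accuracy."""
    profiles = np.asarray(profiles)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(profiles)):
        mask = np.ones(len(profiles), dtype=bool)
        mask[i] = False  # hold out the i-th verb
        classes = sorted(set(labels[mask]))
        centroids = {c: profiles[mask & (labels == c)].mean(axis=0)
                     for c in classes}
        pred = min(centroids,
                   key=lambda c: np.linalg.norm(profiles[i] - centroids[c]))
        correct += (pred == labels[i])
    return correct / len(profiles)
```

The design point is that the classifier never sees the raw slot vectors, only each verb's similarity profile across its own slots, which is what makes alternating and non-alternating verbs separable.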
Causative/inchoative alternation results
Binary classification of the C/I data set (with non-alternating verbs as negative examples)

model       acc    prec   recall  F
AllTrue     57.7   57.7   100     73.2
LexDM       69.9   76.0   69.9    72.8
TypeDM     69.1   75.7   68.5    71.9
DepDM       65.7   72.8   64.6    68.4
ProbMatch   51.2   57.7   57.7    57.7
Merlo & Stevenson (2001) classification task results
3-way classification of the MS data set

model         acc    prec   recall  F
NoPass1       71.2   NA     NA      71.2
AllFeatures1  69.5   NA     NA      69.1
NoTrans1      64.0   NA     NA      63.8
NoCaus1       62.7   NA     NA      62.6
NoVBN1        61.0   NA     NA      61.0
TypeDM        61.5   60.7   61.7    60.8
NoAnim1       61.0   NA     NA      59.9
LexDM         56.4   55.3   56.7    55.8
DepDM         54.7   52.9   55.0    53.2
AllTrue       NA     33.3   100     50.0
ProbMatch     33.3   33.3   33.3    33.3
Majority      33.9   11.3   33.3    16.9
all measures are macro-averaged except accuracy 1 Merlo & Stevenson (2001)
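The macro-averaging used in the table can be made concrete with a short sketch (an illustration of the standard definition, not the authors' evaluation script): per-class precision, recall and F are computed one-vs-rest and then averaged uniformly over the classes.

```python
import numpy as np

def macro_prf(y_true, y_pred, classes):
    """Macro-averaged precision, recall and F1: compute each measure
    per class (one-vs-rest), then take the unweighted mean over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ps, rs, fs = [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    return float(np.mean(ps)), float(np.mean(rs)), float(np.mean(fs))
```

Unlike micro-averaging, this gives each verb class equal weight regardless of its size, which matters for the slightly unbalanced 19/19/20 data set.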
The link by word-word (L×W1 W2) space
vectors labeled with links l (rows of the mode-2 matrix)
dimensions labeled with word-pair tuples ⟨w1, w2⟩ (columns of the mode-2 matrix)

       ⟨marine,bomb⟩  ⟨sergeant,bomb⟩  ⟨teacher,gun⟩
own    40.0           16.7             5.2
use    82.1           69.5             7.0
Links are represented in terms of the word pairs they connect The L×W1 W2 space supports tasks involving the semantics of links characterizing prepositions (Baldwin et al. 2009) or measuring the relative similarity of different kinds of V-N relations, etc.
The L×W1 W2 vectors are currently used for the automatic selection of links for the W1 W2×L task of qualia extraction
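Measuring the similarity of two links in this space reduces to a cosine between rows of the mode-2 matrix. A toy sketch using the example weights from the slide (the dictionary layout is an illustrative assumption; a real L×W1 W2 matrix has millions of word-pair columns):

```python
import numpy as np

# Toy rows of the L×W1W2 matrix: each link is a vector over word-pair
# dimensions ⟨marine,bomb⟩, ⟨sergeant,bomb⟩, ⟨teacher,gun⟩ (slide example).
links = {
    "own": np.array([40.0, 16.7, 5.2]),
    "use": np.array([82.1, 69.5, 7.0]),
}

def link_similarity(l1, l2):
    """Cosine between two link vectors in the L×W1W2 space."""
    u, v = links[l1], links[l2]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

On these toy values, own and use come out highly similar, reflecting the fact that they tend to connect the same word pairs.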
Automatic link selection for qualia extraction For each of the 30 noun concepts in the Cimiano & Wenderoth (2007) data set, the noun-quale pairs pertaining to the remaining 29 concepts are used as training examples, to select a set of 20 links for qualia extraction for each role r, two L×W1 W2 subspaces are constructed a positive subspace, whose only non-zero dimensions are the example pairs ⟨n, qr⟩ a negative subspace, whose non-zero dimensions correspond to all ⟨w1, w2⟩ pairs such that w1 is one of the training nominal concepts and w2 is not a quale qr in the example pairs
the length of each link vector is measured in both subspaces e.g., the length of the obj link is measured in a subspace characterized by ⟨n, qTelic⟩ example pairs (positive subspace), and the length of obj in a subspace characterized by ⟨n, w2⟩ pairs that are not Telic examples (negative subspace)
the pointwise mutual information (PMI) is computed on these lengths to find the links that are most typical of the positive subspace corresponding to each qualia role links with fewer than 10 non-zero dimensions in the positive subspace are filtered out
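The ranking step can be sketched as below. This is a simplified reconstruction: the slide does not spell out the exact PMI formula over lengths, so the weighting here (positive-subspace mass of a link vs. its overall mass) is an assumption, and the function and argument names are illustrative.

```python
import math

def select_links(pos_lengths, neg_lengths, pos_nnz, k=20, min_nnz=10):
    """Rank links by a PMI-style association with the positive subspace.
    pos_lengths / neg_lengths: dicts mapping each link to its vector length
    in the positive / negative subspace; pos_nnz: per-link counts of
    non-zero dimensions in the positive subspace (used for the filter)."""
    pos_total = sum(pos_lengths.values())
    all_total = pos_total + sum(neg_lengths.values())
    scored = []
    for link, lp in pos_lengths.items():
        # filter out links too sparse in the positive subspace
        if pos_nnz.get(link, 0) < min_nnz or lp == 0:
            continue
        l_all = lp + neg_lengths.get(link, 0.0)
        # PMI computed on lengths: how much more of the link's mass falls
        # in the positive subspace than chance would predict
        pmi = math.log((lp / pos_total) / (l_all / all_total))
        scored.append((pmi, link))
    return [link for _, link in sorted(scored, reverse=True)[:k]]
```

A link like obj that is long in the Telic positive subspace but short in the negative one gets a high PMI and is kept for that role.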
Automatic link selection for qualia extraction
Links selected in all folds of the leave-one-out procedure to extract links typical of each qualia role
FORMAL: n is q, q is n, q become n, n coord q, q coord n, q have n, n in q, n provide q, q such as n
AGENTIVE: q after n, q alongside n, q as n, q before n, q besides n, q during n, q in n, q obj n, q out n, q over n, q since n, q unlike n
CONSTITUTIVE: n have q, n use q, n with q, n without q
TELIC: q behind n, q by n, q like n, q obj n, n sbj intr q, q through n, q via n
Qualia extraction results with automatically selected links
Qualia extraction using DM subspaces defined by the automatically selected links

model      precision  recall  F
TypeDM*    24.2       26.7    19.1
TypeDM     26.2       22.7    18.4
P1         NA         NA      17.1
WebP1      NA         NA      16.7
LexDM      19.9       23.6    16.2
WebJac1    NA         NA      15.2
DepDM*     18.4       27.0    15.1
LexDM*     22.6       18.1    14.8
DepDM      17.8       16.9    12.8
Verb-PMI1  NA         NA      10.7
Base1      NA         NA      7.6

green: DM models trained with manually selected links
1 Cimiano & Wenderoth (2007)
Conclusions two requirements for a general framework for distributional semantics
common representation for distributional semantics representing corpus-derived data to capture aspects of meaning that have so far been modeled with different, prima facie incompatible data structures
versatility in modeling semantic tasks using the common representation to address a large battery of semantic experiments, achieving a performance at least comparable to that of state-of-the-art, task-specific DSMs
Conclusions DM approach to distributional representation
DM models distributional data as a structure of weighted tuples that is formalized as a labeled third order tensor a generalization with respect to the common approach of many corpus-based semantic models, which still couch distributional information directly in binary structures the third order tensor formalization allows DM to fully exploit the potential of corpus-derived tuples
Semantic spaces are generated from the same underlying third order tensor, by the standard linear algebraic operation of tensor matricization
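Mode-n matricization can be illustrated with a toy third-order tensor (the function name and toy sizes are illustrative; the column ordering follows NumPy's row-major reshape, which may differ from other matricization conventions, cf. Kolda & Bader 2009):

```python
import numpy as np

def matricize(tensor, mode):
    """Mode-n matricization of a 3rd-order tensor: rows are indexed by the
    chosen mode, columns run over the two remaining modes combined."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Toy W1 × L × W2 tensor: 2 words × 2 links × 2 words
T = np.arange(8).reshape(2, 2, 2)
W1_LW2 = matricize(T, 0)  # word-by-link-word space (2 × 4 matrix)
L_W1W2 = matricize(T, 1)  # link-by-word-word space (2 × 4 matrix)
```

The point the slides make is exactly this: all the semantic spaces are different unfoldings of one and the same tensor, so no new corpus processing is needed to switch tasks.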
Conclusions DM approach to distributional representation
DM addresses a large battery of semantic experiments with good performance In nearly all test sets the best implementation of DM (TypeDM) is at least as good as state-of-the-art algorithms the models that outperform TypeDM by a large margin have been trained on much larger corpora, rely on special knowledge resources, or use sophisticated machine learning algorithms TypeDM consistently outperforms alternative models reimplemented to be fully comparable to DM (Win, DV, LRA)
No task-specific optimization was performed
DM as a model for meaning
Consistent with what is commonly assumed in cognitive science and formal linguistics, DM clearly distinguishes between:
acquisition phase: corpus-based tuple extraction and weighting
declarative structure: the common underlying distributional memory
procedural problem-solving components: the procedures to perform different semantic tasks
DM as a model for meaning
The third order tensor formalization of corpus-based tuples allows distributional information to be represented in a similar way to other types of knowledge In linguistics, cognitive science, and AI, semantic and conceptual knowledge is represented in terms of structures built around typed relations between elements, such as synsets, concepts, properties, etc. lexical networks like WordNet (Fellbaum 1998), commonsense resources like ConceptNet (Liu & Singh 2004), cognitive models of semantic memory (Rogers & McClelland 2004)
The tensor representation of distributional data promises to lay new bridges across existing approaches to semantic representation
References
Almuhareb, A. 2006. Attributes in Lexical Acquisition. PhD thesis, University of Essex.
Almuhareb, A. and M. Poesio. 2004. Attribute-based and value-based clustering: An evaluation. In Proceedings of EMNLP, pages 158–165.
Baldwin, T., V. Kordoni, and A. Villavicencio. 2009. Prepositions in applications: A survey and introduction to the special issue. Computational Linguistics, 35(2):119–149.
Baroni, M., E. Barbu, B. Murphy, and M. Poesio. 2010. Strudel: A distributional semantic model based on properties and types. Cognitive Science. In press.
Baroni, M., S. Evert, and A. Lenci, eds. 2008. Bridging the Gap between Semantic Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.
Baroni, M. and A. Lenci. 2008. Concepts and properties in word spaces. Italian Journal of Linguistics, 20(1):55–88.
Biçici, E. and D. Yuret. 2006. Clustering word pairs to answer analogy questions. In Proceedings of the Fifteenth Turkish Symposium on Artificial Intelligence and Neural Networks, pages 277–284.
Bullinaria, J. and J. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
Chen, H., M.-S. Lin, and Y.-C. Wei. 2006. Novel association measures using web search with double checking. In Proceedings of COLING-ACL, pages 1009–1016.
Cimiano, P. and J. Wenderoth. 2007. Automatic acquisition of ranked qualia structures from the web. In Proceedings of ACL, pages 888–895.
Curran, J. and M. Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, pages 59–66.
Erk, K. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of ACL, pages 216–223.
References
Erk, K. and S. Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of EMNLP, pages 897–906.
Evert, S. 2005. The Statistics of Word Cooccurrences. Dissertation, Stuttgart University.
Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge.
Garrard, P., M. L. Ralph, J. Hodges, and K. Patterson. 2001. Prototypicality, distinctiveness, and intercorrelation: Analyses of the semantic attributes of living and nonliving concepts. Cognitive Neuropsychology, 18(2):125–174.
Girju, R., A. Badulescu, and D. Moldovan. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
Girju, R., P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2007. SemEval-2007 task 04: Classification of semantic relations between nominals. In Proceedings of SemEval 2007, pages 13–18.
Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer, Boston.
Griffiths, T., M. Steyvers, and J. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114:211–244.
Harris, Z. 1954. Distributional structure. Word, 10(2-3):146–162.
Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING, pages 539–545.
Herdağdelen, A. and M. Baroni. 2009. BagPack: A general framework to represent semantic relations. In Proceedings of the EACL GEMS Workshop, pages 33–40.
Herdağdelen, A., K. Erk, and M. Baroni. 2009. Measuring semantic relatedness with vector space models and random walks. In Proceedings of TextGraphs-4, pages 50–53.
Joanis, E., S. Stevenson, and D. James. 2008. A general feature space for automatic verb classification. Natural Language Engineering, 14(3):337–367.
References
Karypis, G. 2003. CLUTO: A clustering toolkit. Technical Report 02-017, University of Minnesota Department of Computer Science.
Kolda, T. 2006. Multilinear operators for higher-order decompositions. Technical Report 2081, SANDIA.
Kolda, T. and B. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.
Landauer, T. and S. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.
Lenci, A. 2008. Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL.
Levin, B. and M. Rappaport-Hovav. 2005. Argument Realization. Cambridge University Press, Cambridge.
Lin, D. 1998. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304.
Liu, H. and P. Singh. 2004. ConceptNet: A practical commonsense reasoning toolkit. BT Technology Journal, pages 211–226.
Lund, K. and C. Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203–208.
Matveeva, I., G.-A. Levow, A. Farahat, and C. Royer. 2005. Generalized latent semantic analysis for term representation. In Proceedings of RANLP, pages 60–68.
References
McRae, K., G. Cree, M. Seidenberg, and C. McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.
McRae, K., M. Spivey-Knowlton, and M. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.
Merlo, P. and S. Stevenson. 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373–408.
Miller, G. and W. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6:1–28.
Murphy, G. 2002. The Big Book of Concepts. MIT Press, Cambridge, MA.
Nastase, V. and S. Szpakowicz. 2003. Exploring noun-modifier semantic relations. In Proceedings of the Fifth International Workshop on Computational Semantics, pages 285–301, Tilburg, The Netherlands.
Ó Séaghdha, D. and A. Copestake. 2009. Using lexical and relational similarity to classify semantic relations. In Proceedings of EACL, pages 621–629, Athens, Greece.
Padó, S. and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Padó, U. 2007. The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.
Padó, U., S. Padó, and K. Erk. 2007. Flexible, corpus-based modelling of human plausibility judgements. In Proceedings of EMNLP, pages 400–409.
References
Pantel, P. and M. Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of COLING-ACL, pages 113–120.
Pustejovsky, J. 1995. The Generative Lexicon. MIT Press, Cambridge, MA.
Quesada, J., P. Mangalath, and W. Kintsch. 2004. Analogy-making as predication using relational information and LSA vectors. In Proceedings of CogSci, page 1623.
Rapp, R. 2003. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the 9th MT Summit, pages 315–322.
Rogers, T. and J. McClelland. 2004. Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press, Cambridge, MA.
Rothenhäusler, K. and H. Schütze. 2009. Unsupervised classification with dependency based word spaces. In Proceedings of the EACL GEMS Workshop, pages 17–24.
Rubenstein, H. and J. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
Ruiz-Casado, M., E. Alfonseca, and P. Castells. 2005. Using context-window overlapping in synonym discovery and ontology extension. In Proceedings of RANLP.
Sahlgren, M. 2005. An introduction to random indexing. http://www.sics.se/∼mange/papers/RI intro.pdf
Schütze, H. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.
Terra, E. and C. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of HLT-NAACL, pages 244–251.
Turney, P. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML, pages 491–502.
Turney, P. 2006a. Expressing implicit semantic relations without supervision. In Proceedings of COLING-ACL, pages 313–320.
References
Turney, P. 2006b. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.
Turney, P. 2007. Empirical evaluation of four tensor decomposition algorithms. Technical Report ERB-1152, NRC.
Turney, P. 2008. A uniform approach to analogies, synonyms, antonyms and associations. In Proceedings of COLING, pages 905–912.
Turney, P. and M. Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.
Van Overschelde, J., K. Rawson, and J. Dunlosky. 2004. Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50:289–335.
Veale, T. and Y. Hao. 2008. Acquiring naturalistic concept descriptions from the web. In Proceedings of LREC, pages 1121–1124.
Vinson, D. and G. Vigliocco. 2008. Semantic feature production norms for a large set of objects and events. Behavior Research Methods, 40(1):183–190.
Zhao, Y. and G. Karypis. 2003. Criterion functions for document clustering: Experiments and analysis. Technical Report 01-40, University of Minnesota Department of Computer Science.