Distributional Memory: A Generalized Framework for Corpus-based Semantics
Alessandro Lenci, University of Pisa, Department of Linguistics
IMS Stuttgart, 2 February 2010

Credits

Distributional Memory is joint research with Marco Baroni (CIMeC, University of Trento).

Main references
Marco Baroni, Alessandro Lenci (2009). “One distributional memory, many semantic tasks”. Proceedings of the Workshop on Geometrical Models for Natural Language Semantics (GEMS), 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, 31 March 2009.
Marco Baroni, Alessandro Lenci (submitted). “Distributional Memory: A General Framework for Corpus-based Semantics”, Computational Linguistics.

Outline

1 Background and motivation
2 The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3 Implementing DM
4 Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5 Summary and conclusions

Corpus-based semantics

Distributional Semantic Models (DSMs) aim at characterizing the meaning of linguistic expressions in terms of their distributional properties. All DSMs rely on some version of the distributional hypothesis (Harris 1954, Miller & Charles 1991): the degree of semantic similarity between two words (or other linguistic units) can be modeled as a function of the degree of overlap among their linguistic contexts.

The format of distributional representations varies greatly depending on the specific aspects of meaning they are designed to model.


Unstructured DSMs

Unstructured DSMs represent distributional data in terms of unstructured co-occurrence relations between an element and a context:
- contexts as documents (Landauer & Dumais 1997, Griffiths et al. 2007)
- contexts as lexical collocates within a certain distance from the target (Bullinaria & Levy 2007, Lund & Burgess 1996, Rapp 2003, Schütze 1997)

Unstructured DSMs do not use the linguistic structure of texts to compute co-occurrences:
The teacher eats a red apple ⇒ eat is a legitimate context for apple and for red, because they appear in the same window.


Structured DSMs

In structured DSMs, co-occurrence statistics are collected in the form of corpus-derived triples: word pairs and the syntactic relation or lexico-syntactic pattern that links them (Almuhareb & Poesio 2004, Curran & Moens 2002, Erk & Padó 2008, Grefenstette 1994, Lin 1998, Padó & Lapata 2007, Rothenhäusler & Schütze 2009, Turney 2006).

To qualify as a context of a target item, a word must be linked to it by some (interesting) lexico-syntactic relation, and this relation can be used to distinguish the type of co-occurrence:
The teacher eats a red apple ⇒ eat is not a legitimate context for red; the object relation connecting eat and apple is treated as a different type of co-occurrence from the modifier relation linking red and apple.

Structured DSMs seem to have a slight edge over unstructured models (Padó & Lapata 2007, Rothenhäusler & Schütze 2009), but the picture is not totally clear.


Binary models of distributional data

Both structured and unstructured DSMs represent distributional data in terms of 2-way structures: matrices M_{|B|×|T|}, with B the set of basis elements representing the contexts used to compare the distributional similarity of the target elements T (Padó & Lapata 2007).

Structured DSMs also map the corpus-derived ternary data directly onto a 2-way matrix:
- the dependency information in the tuple can be dropped (Padó & Lapata 2007): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine, shoot⟩
- the two words can be concatenated, treating the links as basis elements (Turney 2006): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine-shoot, sbj⟩
- pairs formed by the link and one word are concatenated and treated as attributes of target words (Almuhareb & Poesio 2004, Curran & Moens 2002, Grefenstette 1994, Lin 1998, Rothenhäusler & Schütze 2009): ⟨marine, sbj, shoot⟩ ⇒ ⟨marine, shoot-sbj⟩


“One semantic task, one distributional model”

The choice to represent co-occurrence statistics directly as matrices produces prima facie incompatible semantic spaces:
- we lose sight of the fact that different semantic spaces actually rely on the same kind of underlying distributional information
- this results in the development of ad hoc models geared towards specific aspects of meaning: taxonomic similarity, relation identification, selectional preferences, etc.

Excellent empirical results, but...
- not what humans do (human semantic memory is general-purpose)
- computationally inefficient, resources rarely reusable, prone to overfitting, not adaptive


The current landscape of distributional semantics

attributional similarity tasks: synonym detection, categorization, etc.
- words like dog and puppy are attributionally similar in the sense that their meanings share a large number of attributes (they are animals, they bark, etc.)
- attributional similarity is typically addressed by DSMs that use word collocates as proxies for concept attributes (Bullinaria & Levy 2007, Grefenstette 1994, Lund & Burgess 1996, Padó & Lapata 2007, Schütze 1997)

relational similarity tasks: analogy recognition, relation extraction, etc.
- relational similarity is the property shared by pairs of words (dog–animal and car–vehicle) linked by similar semantic relations (here, hypernymy)
- DSMs tackle relational similarity by representing pairs of words in the space of the patterns that connect them in the corpus (Turney 2006, Girju et al. 2006, Hearst 1992, Pantel & Pennacchiotti 2006)

others: selectional preferences (Erk 2007), argument alternations (Merlo & Stevenson 2001, Joanis et al. 2008), commonsense knowledge extraction (Almuhareb 2006, Cimiano & Wenderoth 2007), etc.


Distributional Memory (DM): towards a unified framework for corpus-based semantics

The core geometrical structure of DM is a 3-way object, namely a third order tensor:
- like structured DSMs, DM represents distributional facts as word-link-word tuples
- differently from current approaches, tuples are formalized as a ternary structure and can become the backbone of a unified model for distributional semantics

Different semantic spaces are generated “on demand” through tensor matricization, projecting the third order tensor onto 2-way matrices: all these different semantic spaces are alternative views of the same underlying distributional object.

Apparently unrelated semantic tasks can be addressed in terms of the same distributional memory, harvested only once from the corpus: distributional data can be turned into a general-purpose resource for semantic modeling.



Weighted distributional tuples

W1, W2: sets of strings representing content words
L: a set of strings representing syntagmatic co-occurrence links between words
T: a set of corpus-derived tuples t = ⟨w1, l, w2⟩, such that T ⊆ W1 × L × W2, where w1 co-occurs with w2 and l represents the type of this co-occurrence relation
v_t: a tuple weight, assigned by a scoring function σ: W1 × L × W2 → ℝ

Weighted tuple structure
A set T_W of weighted distributional tuples t_w = ⟨t, v_t⟩, for all t ∈ T and σ(t) = v_t


Weighted tuple structure

w1        l    w2    σ        w1        l    w2    σ
marine    own  bomb  40.0     sergeant  use  gun   51.9
marine    use  bomb  82.1     sergeant  own  book   8.0
marine    own  gun   85.3     sergeant  use  book  10.1
marine    use  gun   44.8     teacher   own  bomb   5.2
marine    own  book   3.2     teacher   use  bomb   7.0
marine    use  book   3.3     teacher   own  gun    9.3
sergeant  own  bomb  16.7     teacher   use  gun    4.7
sergeant  use  bomb  69.5     teacher   own  book  48.4
sergeant  own  gun   73.4     teacher   use  book  53.6

Constraints on T_W
- W1 = W2
- inverse link constraint: for any link l ∈ L, there is a k ∈ L such that for each tuple t_w = ⟨⟨wi, l, wj⟩, v_t⟩ ∈ T_W, t_w⁻¹ = ⟨⟨wj, k, wi⟩, v_t⟩ ∈ T_W (k is the inverse link of l)
  ⟨⟨marine, use, bomb⟩, v_t⟩ ⇒ ⟨⟨bomb, use⁻¹, marine⟩, v_t⟩

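As an illustration, a weighted tuple structure of this kind can be stored as a mapping from ⟨w1, l, w2⟩ triples to scores. The Python sketch below uses the toy scores from the table above; the "-1" suffix for naming inverse links is an ad hoc convention of the sketch, not part of the DM definition.

```python
# A minimal sketch of a weighted tuple structure T_W, using the toy example above.
weighted_tuples = {
    ("marine", "own", "bomb"): 40.0,
    ("marine", "use", "bomb"): 82.1,
    ("sergeant", "own", "gun"): 73.4,
    ("teacher", "use", "book"): 53.6,
    # ... the remaining tuples of the toy table
}

def add_inverse_links(tuples, inverse_suffix="-1"):
    """Enforce the inverse-link constraint: for every <w1, l, w2> with weight v,
    also store <w2, l_inverse, w1> with the same weight."""
    inverted = {(w2, link + inverse_suffix, w1): weight
                for (w1, link, w2), weight in tuples.items()}
    tuples.update(inverted)
    return tuples

add_inverse_links(weighted_tuples)
# e.g. weighted_tuples[("bomb", "use-1", "marine")] == 82.1
```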

Third order tensors

A tensor X is a multi-way array (Kolda & Bader 2009, Turney 2007):
- the order (or n-way) of a tensor is the number of indices needed to identify its elements
- tensors are a generalization of vectors (first order tensors) and matrices (second order tensors)

An array with 3 indices is a third order (or 3-way) tensor:
- x_ijk = the element (i, j, k) of a third order tensor X
- the dimensionality of a third order tensor is the product of the dimensionalities of its indices, I × J × K
- an index has dimensionality I if it ranges over the integers from 1 to I

Example tensor (rows i = 1..3, columns j = 1, 2, slices k = 1..3):

        k=1            k=2            k=3
      j=1   j=2      j=1   j=2      j=1   j=2
i=1  40.0  82.1     85.3  44.8      3.2   3.3
i=2  16.7  69.5     73.4  51.9      8.0  10.1
i=3   5.2   7.0      9.3   4.7     48.4  53.6


Tensor fibers

A fiber is the equivalent of matrix rows and columns in higher order tensors: a mode-n fiber is a fiber where only the n-th index has not been fixed.

Examples from the tensor above:
- mode-1 fiber: x_*11 = (40.0, 16.7, 5.2)
- mode-2 fiber: x_2*3 = (8.0, 10.1)
- mode-3 fiber: x_32* = (7.0, 4.7, 53.6)
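For concreteness, the toy tensor and its fibers can be written down with NumPy; the nesting order (i, j, k) = (w1, link, w2) follows the slides (0-based in the code, 1-based on the slides).

```python
import numpy as np

# The 3x2x3 toy tensor, indexed as X[i, j, k] with i = w1, j = link, k = w2.
X = np.array([
    [[40.0, 85.3,  3.2],    # i=1 (marine):   j=1 (own),  k = bomb, gun, book
     [82.1, 44.8,  3.3]],   # i=1 (marine):   j=2 (use)
    [[16.7, 73.4,  8.0],    # i=2 (sergeant): j=1 (own)
     [69.5, 51.9, 10.1]],   # i=2 (sergeant): j=2 (use)
    [[ 5.2,  9.3, 48.4],    # i=3 (teacher):  j=1 (own)
     [ 7.0,  4.7, 53.6]],   # i=3 (teacher):  j=2 (use)
])

# Fibers: fix every index except one.
mode1_fiber = X[:, 0, 0]   # x_*11 -> [40.0, 16.7, 5.2]
mode2_fiber = X[1, :, 2]   # x_2*3 -> [8.0, 10.1]
mode3_fiber = X[2, 1, :]   # x_32* -> [7.0, 4.7, 53.6]
```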

Tuple structures as third order labeled tensors

A labeled tensor X^λ is a tensor such that for each of its indices there is a one-to-one mapping of the integers from 1 to I (the dimensionality of the index) to I distinct strings (index labels); i:λ denotes an index element labeled with the string λ.

Tuple labeled tensor
Every weighted tuple structure T_W built from W1, L and W2 is represented as a labeled third order tensor X^λ with its 3 indices labeled by W1, L and W2, respectively, such that for each weighted tuple ⟨⟨w1, l, w2⟩, v_t⟩ ∈ T_W there is a tensor entry (i:w1, j:l, k:w2) = v_t


Tuple structures as third order labeled tensors

The weighted tuple structure above corresponds to the following labeled tensor (index labels: i=1:marine, i=2:sergeant, i=3:teacher; j=1:own, j=2:use; k=1:bomb, k=2:gun, k=3:book):

              k=1:bomb          k=2:gun           k=3:book
             j=1:own j=2:use   j=1:own j=2:use   j=1:own j=2:use
i=1:marine     40.0    82.1      85.3    44.8       3.2     3.3
i=2:sergeant   16.7    69.5      73.4    51.9       8.0    10.1
i=3:teacher     5.2     7.0       9.3     4.7      48.4    53.6


Tensor matricization

Matricization rearranges a higher order tensor into a matrix (Kolda 2006, Kolda & Bader 2009). Mode-n matricization arranges the mode-n fibers to be the columns of the resulting D_n × D_j matrix:
- D_n is the dimensionality of the n-th index
- D_j is the product of the dimensionalities of the other indices

Mode-n matricization
Each tensor entry (i_1, i_2, ..., i_N) is mapped to a matrix entry (i_n, j), where j is computed as:

j = 1 + \sum_{k=1,\, k \neq n}^{N} \Big( (i_k - 1) \prod_{m=1,\, m \neq n}^{k-1} D_m \Big)    (1)


Tensor matricization: mode-1 matricization

Applying mode-1 matricization to the example tensor yields the 3 × 6 matrix A_mode-1:

       j=1   j=2   j=3   j=4   j=5   j=6
i=1   40.0  82.1  85.3  44.8   3.2   3.3
i=2   16.7  69.5  73.4  51.9   8.0  10.1
i=3    5.2   7.0   9.3   4.7  48.4  53.6
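Mode-n matricization along the lines of equation (1) can be sketched in a few lines of NumPy; Fortran-order ('F') reshaping makes the first remaining index vary fastest, as in equation (1).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization (Kolda & Bader 2009): arrange the mode-n fibers as
    the columns of a D_n x (product of the other dimensionalities) matrix.
    Fortran-order reshaping reproduces the column ordering of equation (1)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1, order="F")

# With X the toy tensor from the earlier sketch:
# unfold(X, 0) reproduces the 3 x 6 matrix A_mode-1 shown above;
# unfold(X, 1) and unfold(X, 2) give the mode-2 and mode-3 matricizations
# (columns are identified by their labels, so the exact column order is immaterial).
```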

Labeled tensor matricization

In DM, mode-n matricization is applied to labeled tensors, and its outcome is a set of labeled matrices:
- row labels: the labels of the n-th index of the tensor
- column labels: the labels of the mode-n tensor fibers

Each mode-n fiber of a tensor X^λ is labeled with the binary tuple whose elements are the labels of the corresponding fixed index elements:
- x_*11 = (40.0, 16.7, 5.2): ⟨own, bomb⟩
- x_2*1 = (16.7, 69.5): ⟨sergeant, bomb⟩
- x_32* = (7.0, 4.7, 53.6): ⟨teacher, use⟩

Labeled mode-n matricization
Given a labeled third order tensor X^λ, labeled mode-n matricization maps each entry (i_1:λ_1, i_2:λ_2, i_3:λ_3) to the labeled entry (i_n:λ_n, j:λ_j), such that j is obtained according to equation (1), and λ_j is the binary tuple obtained from the triple ⟨λ_1, λ_2, λ_3⟩ by removing λ_n.


Labeled tensor matricization

Mode-1, mode-2 and mode-3 matrices obtained from the tuple labeled tensor X^λ:

A_mode-1   ⟨own,bomb⟩ ⟨use,bomb⟩ ⟨own,gun⟩ ⟨use,gun⟩ ⟨own,book⟩ ⟨use,book⟩
marine          40.0       82.1      85.3      44.8        3.2        3.3
sergeant        16.7       69.5      73.4      51.9        8.0       10.1
teacher          5.2        7.0       9.3       4.7       48.4       53.6

B_mode-2   ⟨marine,bomb⟩ ⟨sergeant,bomb⟩ ⟨teacher,bomb⟩ ⟨marine,gun⟩ ⟨sergeant,gun⟩ ⟨teacher,gun⟩ ⟨marine,book⟩ ⟨sergeant,book⟩ ⟨teacher,book⟩
own                 40.0            16.7            5.2         85.3           73.4           9.3           3.2             8.0           48.4
use                 82.1            69.5            7.0         44.8           51.9           4.7           3.3            10.1           53.6

C_mode-3   ⟨marine,own⟩ ⟨marine,use⟩ ⟨sergeant,own⟩ ⟨sergeant,use⟩ ⟨teacher,own⟩ ⟨teacher,use⟩
bomb               40.0         82.1           16.7           69.5           5.2           7.0
gun                85.3         44.8           73.4           51.9           9.3           4.7
book                3.2          3.3            8.0           10.1          48.4          53.6
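A labeled version simply carries the index labels along. The sketch below is a hedged illustration (not the DM implementation); it returns the matrix together with its row and column labels, with columns ordered as in equation (1), which may differ from the equivalent ordering used for display in the tables above.

```python
import numpy as np
from itertools import product

def labeled_unfold(tensor, labels, mode):
    """Labeled mode-n matricization: return (matrix, row_labels, column_labels).
    `labels` is a list of label lists, one per tensor index."""
    matrix = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1, order="F")
    row_labels = labels[mode]
    other = [labels[m] for m in range(tensor.ndim) if m != mode]
    # Equation (1): the first remaining index varies fastest, so iterate the
    # later remaining indices in the outer loops and restore the original order.
    col_labels = [tuple(reversed(combo)) for combo in product(*reversed(other))]
    return matrix, row_labels, col_labels

# Example with the toy tensor X:
# labels = [["marine", "sergeant", "teacher"], ["own", "use"], ["bomb", "gun", "book"]]
# A, rows, cols = labeled_unfold(X, labels, 0)
# rows -> ['marine', 'sergeant', 'teacher']; cols[0] -> ('own', 'bomb')
```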

The DM semantic spaces

The rows and columns of the 3 matrices resulting from mode-n matricization of a third order tensor are vectors in semantic spaces; the vector dimensions are the corresponding column (resp. row) elements.

Given the constraints on the tuple structure T_W:
- for each column of the mode-1 matrix labeled by ⟨l, w2⟩, there is an identical column in the mode-3 matrix labeled by ⟨w1, k⟩, where k is the inverse link of l and w1 = w2
- for any row w2 in the mode-3 matrix, there is an identical row w1 in the mode-1 matrix


The DM semantic spaces

Given a weighted tuple structure T_W, the matricization of the corresponding labeled third order tensor X^λ generates 4 distinct semantic vector spaces:
- word by link-word (W1×LW2): vectors labeled with words w1, dimensions labeled with tuples of type ⟨l, w2⟩
- word-word by link (W1W2×L): vectors labeled with tuples of type ⟨w1, w2⟩, dimensions labeled with links l
- word-link by word (W1L×W2): vectors labeled with tuples of type ⟨w1, l⟩, dimensions labeled with words w2
- link by word-word (L×W1W2): vectors labeled with links l, dimensions labeled with tuples of type ⟨w1, w2⟩


The DM models

DM models correspond to different ways of constructing the underlying weighted tuple structure:

DepDM: unlexicalized model; links are dependency paths (Curran & Moens 2002, Grefenstette 1994, Padó & Lapata 2007, Rothenhäusler & Schütze 2009)

LexDM: heavily lexicalized model; links are lexicalized dependency paths and shallow lexico-syntactic patterns (Hearst 1992, Pantel & Pennacchiotti 2006, Turney 2006)

TypeDM: mildly lexicalized model; links are lexicalized dependency paths and shallow lexico-syntactic patterns, but with a different scoring function based on pattern type frequency (Baroni et al. 2010, Davidov & Rappoport 2008a, Davidov & Rappoport 2008b)

All models share the same corpus and the same W1 = W2 sets, but differ in the links (L) and the scoring function.

The DM models

The DM corpus: about 2.83 billion tokens, resulting from concatenating
- ukWaC, about 1.915 billion tokens of Web-derived text
- English Wikipedia, a mid-2009 dump of about 820 million tokens
- the British National Corpus, about 95 million tokens

The corpus was tokenized, POS-tagged and lemmatized with the TreeTagger, and dependency-parsed with MaltParser (Nivre et al. 2007).

The label sets: W1 = W2 = 30,693 lemmas (20,410 nouns, 5,026 verbs and 5,257 adjectives): the top 20,000 most frequent nouns and the top 5,000 most frequent verbs and adjectives, augmented with lemmas in various standard test sets, such as the TOEFL and SAT lists.


DepDM

L_DepDM contains 796 direct and inverse links formed by N-V, N-N and A-N dependencies:
- sbj_intr: The teacher is singing → ⟨teacher, sbj_intr, sing⟩
- sbj_tr: The soldier is reading a book → ⟨soldier, sbj_tr, read⟩
- iobj: The soldier gave the woman a book → ⟨woman, iobj, give⟩
- nmod: good teacher → ⟨good, nmod, teacher⟩
- coord: teachers and soldiers → ⟨teacher, coord, soldier⟩
- prd: The soldier became sergeant → ⟨sergeant, prd, become⟩
- verb: The soldier is reading a book → ⟨soldier, verb, book⟩
- preposition: I saw a soldier with the gun → ⟨gun, with, soldier⟩

The scoring function σ is Local Mutual Information (LMI) (Evert 2005), computed on the word-link-word co-occurrence counts (negative LMI values are raised to 0):

LMI = O_{ijk} \log \frac{O_{ijk}}{E_{ijk}}    (2)

The DepDM tensor contains about 110M non-zero tuples.

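A possible implementation of the LMI scoring of equation (2) is sketched below. The expected count E_ijk is computed here from the marginal frequencies under full independence of w1, l and w2; this particular expected-frequency model is an assumption of the sketch.

```python
import math
from collections import defaultdict

def lmi_scores(counts):
    """counts maps <w1, l, w2> triples to observed co-occurrence counts O_ijk.
    Returns LMI = O * log(O / E), with negative values raised to 0 (equation (2)).
    E is estimated from the marginals, assuming independence of w1, l and w2."""
    total = float(sum(counts.values()))
    f_w1, f_l, f_w2 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (w1, link, w2), o in counts.items():
        f_w1[w1] += o
        f_l[link] += o
        f_w2[w2] += o
    scores = {}
    for (w1, link, w2), o in counts.items():
        expected = f_w1[w1] * f_l[link] * f_w2[w2] / (total ** 2)
        scores[(w1, link, w2)] = max(o * math.log(o / expected), 0.0)
    return scores
```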

LexDM

L_LexDM contains 3,352,148 direct and inverse complex links, each with the structure pattern+suffix. The suffix is formed by two substrings separated by a +, encoding various features of w1 and w2, respectively:
- their POS and morphological features (number for N, number and tense for V)
- the presence of an article (further specified with its definiteness value) and of adjectives for N
- the presence of adverbs for A, and the presence of adverbs, modals and auxiliaries for V, together with their diathesis (for passives only)

If the adjective (adverb) modifying w1 or w2 belongs to a list of 10 (250) high frequency adjectives (adverbs), the suffix string contains the adjective (adverb) itself, otherwise only its POS.

The tall soldier has already shot → ⟨soldier, sbj_intr+n-the-j+vn-aux-already, shoot⟩

LexDM

The patterns in the LexDM links include:
- L_DepDM: The man shot → ⟨man, sbj_intr+n-the+vn, shoot⟩
- verb: 52 high frequency verbs are lexicalized: The soldier used a gun → ⟨soldier, use+n-the+n-a, gun⟩
- is: The soldier is tall → ⟨tall, is+j+n-the, soldier⟩
- preposition-link noun-preposition: the arrival of a number of soldiers → ⟨soldier, of-number-of+ns+n-the, arrival⟩
- attribute noun: “(the) attribute_noun of (a|the) NOUN is ADJ” (Almuhareb & Poesio 2004) and “(a|the) ADJ attribute_noun of NOUN” (Veale & Hao 2008): the colour of strawberries is red → ⟨red, colour+j+ns, strawberry⟩
- as_adj_as: “as ADJ as (a|the) NOUN” (Veale & Hao 2008): as sharp as a knife → ⟨sharp, as_adj_as+j+n-a, knife⟩
- such_as: “NOUN such as NOUN” and “such NOUN as NOUN” (Hearst 1992): animals such as cats → ⟨animal, such_as+ns+ns, cat⟩

The scoring function σ is LMI; the LexDM tensor contains about 355M non-zero tuples.


TypeDM

L_TypeDM contains 25,336 direct and inverse links that correspond to the patterns in the LexDM links. The LexDM pattern suffixes are used to count the number of distinct surface realizations of each pattern: the two LexDM links of⁻¹+n-a+n-the and of⁻¹+ns-j+n-the are counted as two occurrences of the same TypeDM link of⁻¹.

The scoring function σ computes LMI on the number of distinct suffix types displayed by a link. The TypeDM tensor contains about 130M non-zero tuples.



Semantic experiments with DM

For each space, DM has been tested on semantic experiments modeled by applying (some combination of) a small number of geometric operations:

vector length and normalization:

||v|| = \sqrt{\sum_{i=1}^{n} v_i^2}    (3)

similarity as vector cosine:

cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{||x|| \, ||y||}    (4)

vector sum (centroid): two or more normalized vectors are summed by adding their values on each dimension

projection onto a subspace: a vector with i dimensions is projected onto a subspace with k < i dimensions through multiplication by a square diagonal matrix with 1s in the diagonal cells corresponding to the k dimensions, and 0s elsewhere

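The four operations can be implemented in a handful of lines; a minimal NumPy sketch:

```python
import numpy as np

def length(v):
    """Vector length, equation (3)."""
    return np.sqrt(np.sum(v ** 2))

def cosine(x, y):
    """Similarity as vector cosine, equation (4)."""
    return np.dot(x, y) / (length(x) * length(y))

def centroid(vectors):
    """Vector sum: normalize each vector, then add them dimension by dimension."""
    return np.sum([v / length(v) for v in vectors], axis=0)

def project(v, kept_dims):
    """Projection onto a subspace: equivalent to multiplying v by a square
    diagonal matrix with 1s on the kept dimensions and 0s elsewhere."""
    mask = np.zeros(len(v))
    mask[list(kept_dims)] = 1.0
    return v * mask
```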

Semantic experiments with the DM spaces: preliminary observations

The experiments correspond to key semantic tasks in computational linguistics and/or cognitive science, typically addressed by distinct DSMs so far. To support the view of DM as a generalized model, we have maximized the variety of aspects of meaning covered by the experiments.

The choice of the DM semantic space to tackle a particular task is essentially based on the “naturalness” with which the task can be modeled in that space; many alternatives are conceivable, both with respect to the space selection and to the type of operations performed on the space.

Our current aim is to prove that each space derived through tensor matricization is semantically interesting.


Semantic experiments with the DM spaces: preliminary observations

No feature selection/reweighting, dimensionality reduction or task-specific optimization has been used in the experiments: the same underlying tuple tensor is used in all the experiments, and the results should be regarded as a sort of “baseline” performance, to be enhanced by task-specific parameter tuning.

DM performance is compared to the results available in the literature and to our implementation of state-of-the-art DSMs: alternative models trained on the same DM corpus (with the same linguistic pre-processing).



The word by link-word (W1×LW2) space

Vectors are labeled with words w1 (rows of the mode-1 matrix); dimensions are labeled with binary tuples of type ⟨l, w2⟩ (columns of the mode-1 matrix):

           ⟨own,bomb⟩ ⟨use,bomb⟩ ⟨own,gun⟩ ⟨use,gun⟩ ⟨own,book⟩ ⟨use,book⟩
marine          40.0       82.1      85.3      44.8        3.2        3.3
sergeant        16.7       69.5      73.4      51.9        8.0       10.1
teacher          5.2        7.0       9.3       4.7       48.4       53.6

The space dimensions represent attributes of words. The semantic tasks addressed with the W1×LW2 space involve measuring the attributional similarity among words:
1 similarity judgments
2 synonym detection
3 noun categorization
4 selectional preferences

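Using the toy matrix above as the W1×LW2 space, attributional similarity is just the cosine between row vectors; the numbers confirm the intuition that marine is closer to sergeant than to teacher.

```python
import numpy as np

# Row vectors of the toy mode-1 matrix; dimensions: <own,bomb>, <use,bomb>,
# <own,gun>, <use,gun>, <own,book>, <use,book>.
space = {
    "marine":   np.array([40.0, 82.1, 85.3, 44.8,  3.2,  3.3]),
    "sergeant": np.array([16.7, 69.5, 73.4, 51.9,  8.0, 10.1]),
    "teacher":  np.array([ 5.2,  7.0,  9.3,  4.7, 48.4, 53.6]),
}

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(space["marine"], space["sergeant"]))  # high: shared "weapon" attributes
print(cosine(space["marine"], space["teacher"]))   # low: few shared attributes
```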

Alternative models for the W1×LW2 space

Win – an unstructured DSM that relies on target-context linear proximity (Bullinaria & Levy 2007, Lund & Burgess 1996, Schütze 1997)
- based on co-occurrences of the same 30,000 W1 (= W2) words used for DM, within a window of maximally 5 content words
- counts converted to LMI weights (negative LMI values are raised to 0)
- the Win matrix has about 110 million non-zero entries

DV – a structured DSM in which dependency paths are not part of the attributes (our implementation of the Dependency Vectors approach of Padó & Lapata 2007)
- DV is obtained from the same co-occurrence data as DepDM
- counts converted to LMI weights (negative LMI values are raised to 0)
- the DV matrix contains about 38 million non-zero values


Similarity judgments

Data set: Rubenstein and Goodenough (1965) (R&G), 65 noun pairs rated by 51 subjects on a 0-4 similarity scale:

car    automobile   3.9
food   fruit        2.7
cord   smile        0.0

Correlation between noun distances (cosines) in the W1×LW2 space and the R&G ratings, evaluated with Pearson's r (Padó and Lapata 2007):

model           r
DoubleCheck¹   85
TypeDM         82
SVD-09²        80
Win            65
DV-07³         62
DepDM          57
DV             57
LexDM          53
cosDV-07³      47

¹ Chen et al. (2006); ² Herdağdelen et al. (2009); ³ Padó & Lapata (2007)

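A sketch of the evaluation protocol for this task, assuming a `space` of word vectors and the `cosine` function from the earlier sketch; the score is Pearson's r between model cosines and the human ratings.

```python
from scipy.stats import pearsonr

# R&G-style items: (word1, word2, human rating); the three pairs above as example.
ratings = [("car", "automobile", 3.9), ("food", "fruit", 2.7), ("cord", "smile", 0.0)]

def evaluate_similarity(space, cosine, ratings):
    """Pearson correlation between model cosines and human similarity ratings."""
    human = [score for _, _, score in ratings]
    model = [cosine(space[w1], space[w2]) for w1, w2, _ in ratings]
    r, _p_value = pearsonr(human, model)
    return r
```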

Synonym detection

Data set: TOEFL (Landauer & Dumais 1997), 80 multiple-choice questions:

target: levied    candidates: imposed, believed, requested, correlated

DM picks the candidate with the highest cosine to the target item as its guess of the right synonym.

model         accuracy
LSA-03¹          92.50
GLSA²            86.25
PPMIC³           85.00
CWO⁴             82.55
PMI-IR-03⁵       81.25
BagPack⁶         80.00
DV               76.87
TypeDM           76.87
PairClass⁷       76.25
DepDM            75.01
LexDM            74.37
PMI-IR-01⁸       73.75
DV-07⁹           73.00
Win              69.37
Human¹⁰          64.50
LSA-97¹⁰         64.38
Random           25.00

¹ Rapp (2003); ² Matveeva et al. (2005); ³ Bullinaria & Levy (2007); ⁴ Ruiz-Casado et al. (2005); ⁵ Terra & Clarke (2003); ⁶ Herdağdelen & Baroni (2009); ⁷ Turney (2008); ⁸ Turney (2001); ⁹ Padó & Lapata (2007); ¹⁰ Landauer & Dumais (1997)

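The TOEFL procedure described above reduces to an argmax over cosines; a minimal sketch (space and cosine as in the earlier sketches):

```python
def choose_synonym(space, cosine, target, candidates):
    """Pick the candidate with the highest cosine to the target."""
    return max(candidates, key=lambda cand: cosine(space[target], space[cand]))

def toefl_accuracy(space, cosine, questions):
    """questions: iterable of (target, candidates, gold_answer) triples."""
    questions = list(questions)
    hits = sum(choose_synonym(space, cosine, target, cands) == gold
               for target, cands, gold in questions)
    return 100.0 * hits / len(questions)
```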

Noun categorization

Categorization tasks are a crucial probe into the semantic organization of the lexicon, e.g. to investigate the human ability to arrange concepts hierarchically into taxonomies (Murphy 2002). Corpus-based semantics is interested in investigating whether distributional (attributional) similarity can be used to group words into semantically coherent categories, e.g. for semantic typing.

Categorization as an unsupervised clustering task:
- nouns are clustered with CLUTO (Karypis 2003), using their similarity matrix based on pairwise cosines
- repeated bisections algorithm with global optimization method (parameters at their default values in CLUTO)
- cluster quality is evaluated by percentage purity (Zhao & Karypis 2001):

Purity = \frac{1}{n} \sum_{r=1}^{k} \max_i (n_r^i)

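Percentage purity can be computed as follows (a sketch; `clusters` and `gold` are assumed to map each noun to its induced cluster and to its gold class, respectively).

```python
from collections import Counter

def percentage_purity(clusters, gold):
    """Purity = (1/n) * sum over clusters of the size of the majority gold class,
    expressed as a percentage (Zhao & Karypis 2001)."""
    by_cluster = {}
    for item, cluster_id in clusters.items():
        by_cluster.setdefault(cluster_id, []).append(gold[item])
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in by_cluster.values())
    return 100.0 * majority_total / len(clusters)
```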

Noun categorization: Almuhareb & Poesio (AP)

Data set: 402 noun concepts from WordNet, balanced in terms of frequency and ambiguity. Concepts must be clustered into 21 classes, corresponding to the 21 unique WordNet beginners (13–21 nouns per class):
VEHICLES: helicopter, motorcycle, ...
MOTIVATION: ethics, incitement, ...

Almuhareb & Poesio (AP)
model           purity
DepPath¹            79
TypeDM              76
AttrValue-06²       71
Win                 71
VSM³                70
DV                  65
DepDM               62
LexDM               59
Random               5

¹ Rothenhäusler & Schütze (2009); ² Almuhareb (2006); ³ Herdağdelen et al. (2009)


Noun categorization: Battig

Data set: 83 concepts from the expanded Battig and Montague norms of Van Overschelde et al. (2004) (cf. Baroni et al. 2010). Nouns are highly prototypical instances of 10 common concrete categories (up to 10 concepts per class):
LAND MAMMALS: dog, elephant, ...
TOOLS: screwdriver, hammer, ...

Battig
model        purity
Win              96
TypeDM           94
Strudel¹         91
DepDM            90
DV               84
DV-10¹           79
LexDM            78
SVD-10¹          71
AttrValue¹       45
Random           12

¹ Baroni et al. (2010)

Noun categorization: Battig data set

83 concepts from the expanded Battig and Montague norms of Van Overschelde et al. (2004) (cf. Baroni et al. 2010)
Nouns are highly prototypical instances of 10 common concrete categories (up to 10 concepts per class)
LAND MAMMALS: dog, elephant, ...
TOOLS: screwdriver, hammer, ...

Battig results (percentage purity):

model           purity
Win             96
TypeDM          94
Strudel [1]     91
DepDM           90
DV              84
DV-10 [1]       79
LexDM           78
SVD-10 [1]      71
AttrValue [1]   45
Random          12

[1] Baroni et al. (2010)

Noun categorization: ESSLLI 2008 data set

44 concrete nouns grouped into hierarchically organized classes (ESSLLI 2008 shared task)
6 lower classes: BIRDS, LAND ANIMALS, FRUIT, GREENS, TOOLS, VEHICLES
3 middle classes: ANIMALS, VEGETABLES, ARTIFACTS
2 top classes: LIVING BEINGS, OBJECTS

ESSLLI 2008 results (percentage purity):

model          6-way   3-way   2-way   avg
TypeDM         84      98      100     94.0
Katrenko [1]   91      100     80      90.3
DepDM          75      93      100     89.4
DV             75      93      100     89.3
LexDM          75      87      100     87.3
Peirsman [1]   82      84      86      84.0
Win            75      86      59      73.3
Shaoul [1]     41      52      55      49.3
Random         29      45      54      42.7

[1] ESSLLI 2008 shared task

Selectional preferences

The W1×LW2 space can also be used to work with more abstract notions, such as the typical filler of a verb argument slot
The selectional preferences of a predicate cannot be reduced to the set of its attested arguments in a corpus: we must account for the possibility of generalization to unseen arguments
kill the aardvark – OK, since aardvark is a living entity
kill the serendipity – BAD, since serendipity is not a living entity

Data set: human plausibility judgments (on a 7-point scale) of noun-verb pairs from McRae et al. (1997) (100 pairs, 36 raters) and Padó (2007) (211 pairs, ~20 raters per pair):
shoot deer obj 6.4
shoot deer subj 1.0

Selectional preferences in the W1×LW2 space

1. Select a set of prototypical subj (obj) nouns of the verb v
   project the W1×LW2 vectors onto the subspace defined by the dimensions labeled with ⟨l_sbj, v⟩ (⟨l_obj, v⟩)
   l_sbj is any link containing either the string sbj_intr or the string sbj_tr; l_obj is any link containing the string obj
   measure the length of the noun vectors in this subspace and pick the top n longest ones as prototypical subj (obj) of v (n = 20)
2. Build prototype subj (obj) argument vectors for v
   the vectors (in the full W1×LW2 space) of the picked nouns are normalized and summed
   the result is a centroid vector representing an abstract "subj (obj) prototype" for v
3. Measure the plausibility of an arbitrary noun n as the subj (obj) of v
   plausibility is modeled as the distance between the n vector and the subj (obj) prototype vector

The DM approach is directly inspired by the model of Erk (2007); in DM, all the steps are carried out in the same W1×LW2 matrix
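To make the three steps concrete, here is a minimal dense-matrix sketch in Python (numpy). The variable names (`space`, `row_words`, `col_labels`) and the link strings `sbj_intr`, `sbj_tr`, `obj` follow the slide, but the data layout is an illustrative assumption: the real DM matrices are large and sparse, and only noun rows would be considered as prototype candidates.

```python
import numpy as np

def selectional_preference(space, row_words, col_labels, verb, noun, slot="obj", n_proto=20):
    """Plausibility of `noun` as the subj/obj of `verb` in a W1xLW2 matrix.
    `space` is a dense (|W1| x |LW2|) array, `row_words` the W1 labels,
    `col_labels` a list of (link, w2) tuples labelling the columns."""
    rows = {w: i for i, w in enumerate(row_words)}
    # 1. Subspace of dimensions <l, verb> whose link marks the chosen slot
    slot_strings = ("sbj_intr", "sbj_tr") if slot == "subj" else ("obj",)
    cols = [j for j, (link, w2) in enumerate(col_labels)
            if w2 == verb and any(s in link for s in slot_strings)]
    lengths = np.linalg.norm(space[:, cols], axis=1)
    proto_ids = np.argsort(-lengths)[:n_proto]       # top-n longest vectors
    # 2. Centroid of the normalized full-space vectors of the prototypes
    protos = space[proto_ids]
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    centroid = protos.sum(axis=0)
    # 3. Plausibility ~ cosine between the candidate noun and the prototype
    v = space[rows[noun]]
    return float(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid) + 1e-12))
```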

Selectional preferences: results

Performance is measured with the Spearman ρ correlation coefficient between the average human ratings and the model predictions (Padó et al. 2007)

McRae data set:
model        coverage   ρ
Padó [1]     56         41
DepDM        97         32
LexDM        97         29
TypeDM       97         28
ParCos [1]   91         21
Resnik [1]   94          3

Padó data set:
model         coverage   ρ
BagPack [2]   100        60
TypeDM        100        51
Padó [1]      97         51
ParCos [1]    98         48
DepDM         100        35
LexDM         100        34
Resnik [1]    98         24

[1] Padó et al. (2007); [2] Herdağdelen & Baroni (2009)

Some conclusions on the W1×LW2 space

DM models perform very well in tasks involving attributional similarity; the performance of unstructured DSMs (including Win) is equally high
The best DM model (TypeDM) also achieves excellent results in capturing selectional preferences, a task that is not directly addressable by unstructured DSMs
The real advantage of structured DSMs (like DM) lies in their versatility, i.e. in addressing a much larger and more varied range of semantic tasks

Outline
1. Background and motivation
2. The Distributional Memory framework: weighted tuple structures, labeled tensors, labeled matricization
3. Implementing DM
4. Semantic experiments with the DM spaces: the W1×LW2 space, the W1W2×L space, the W1L×W2 space, the L×W1W2 space
5. Summary and conclusions

The word-word by link (W1W2×L) space

vectors labeled with word pair tuples ⟨w1, w2⟩ (columns of the mode-2 matrix)
dimensions labeled with links l (rows of the mode-2 matrix)

                     1:own   2:use
1:⟨marine, bomb⟩     40.0    82.1
2:⟨sergeant, bomb⟩   16.7    69.5
3:⟨teacher, gun⟩      5.2     7.0

The space dimensions represent links as attributes of word pairs
The W1W2×L space can be used to solve semantic tasks based on relational similarity ...
1. recognizing analogies
2. relation classification
... but also problems not traditionally defined in terms of a word-pair-by-link matrix
3. qualia extraction
4. predicting characteristic properties of concepts

Smoothing W1W2×L with the W1×LW2 space

For the analogy and relation classification tasks (where the target pairs are known in advance), target pair vectors are smoothed with new pairs containing their attributional neighbors
each word of a target pair is combined in turn with the top 20 nearest W1×LW2 neighbors of the other word, obtaining a total of 41 pairs (including the original)
the centroid of the W1W2×L vectors of these pairs is then taken to represent the target pair
the smoothed ⟨automobile, wheel⟩ vector is an average of the ⟨automobile, wheel⟩, ⟨car, wheel⟩, ⟨automobile, circle⟩, etc., vectors
nearest neighbors are searched in the W1×LW2 matrix compressed to 5,000 dimensions via Random Indexing (with the parameters suggested by Sahlgren 2005)

DM smoothing is similar to the method proposed by Turney (2006); in DM, however, the attributional and relational spaces are both derived from the same tensor
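A minimal sketch of the pair-smoothing step, assuming a function `nearest_neighbors(word, k)` over the (Random-Indexing-compressed) W1×LW2 space and a lookup `pair_vector(w1, w2)` returning the W1W2×L vector of a pair; both are hypothetical helpers, not part of any released DM code.

```python
import numpy as np

def smoothed_pair_vector(w1, w2, pair_vector, nearest_neighbors, k=20):
    """Represent the target pair <w1, w2> as the centroid of its own W1W2xL
    vector plus the vectors of pairs built with attributional neighbors:
    <neighbor-of-w1, w2> and <w1, neighbor-of-w2> (20 each -> 41 pairs)."""
    pairs = [(w1, w2)]
    pairs += [(n, w2) for n in nearest_neighbors(w1, k)]   # e.g. <car, wheel>
    pairs += [(w1, n) for n in nearest_neighbors(w2, k)]   # e.g. <automobile, circle>
    vectors = []
    for a, b in pairs:
        v = pair_vector(a, b)
        if v is not None:          # pairs unattested in the corpus are skipped
            vectors.append(v)
    return np.mean(vectors, axis=0)
```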

Alternative models for the W1W2×L space

LRA – reimplementation of Latent Relational Analysis (Turney 2006, "baseline LRA system") using the DM corpus
for a given set of target pairs, count all the patterns that connect them, in either order, in the corpus
patterns are sequences of 1-to-3 words occurring between the targets, with all, none or any subset of the elements replaced by wildcards (with the, with *, * the, * *)
only the 4,000 most frequent patterns are preserved
a target-pair-by-pattern matrix is constructed (with 8,000 dimensions, to account for directionality)
values in the matrix are log- and entropy-transformed according to Turney's formula
SVD is applied, reducing the columns to the top 300 latent dimensions
target pairs are smoothed with the same DM method; the neighbors for target pair expansion are taken from the best attributional DM model (TypeDM)

Recognizing analogies

Data set: 374 SAT multiple-choice questions (Turney 2006)
each question includes 1 target pair (the stem) and 5 answer pairs
the task is to choose the pair most analogous to the stem

stem:      mason : stone
choices:   teacher : chalk
           carpenter : wood
           soldier : gun
           photograph : camera
           book : word

Recognizing SAT analogies: results

The answer pair with the highest cosine to the target pair in the W1W2×L space is selected as the right analogy

model           accuracy
Human [1]       57.0
LRA-06 [2]      56.1
PERT [3]        53.3
PairClass [4]   52.1
VSM [1]         47.1
BagPack [5]     44.1
k-means [6]     44.0
TypeDM          42.4
LSA [7]         42.0
LRA             37.7
PMI-IR-06 [2]   35.0
DepDM           31.4
LexDM           29.3
Random          20.0

[1] Turney & Littman (2005); [2] Turney (2006a); [3] Turney (2006b); [4] Turney (2008); [5] Herdağdelen & Baroni (2009); [6] Biçici & Yuret (2006); [7] Mangalath et al. (2004)
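A minimal sketch of the selection rule, assuming the smoothed W1W2×L vectors of the stem and of the candidate pairs are already available as numpy arrays (`stem_vec` and `choice_vecs` are hypothetical names).

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def answer_sat_question(stem_vec, choice_vecs):
    """Pick the index of the answer pair whose (smoothed) W1W2xL vector
    has the highest cosine with the stem pair's vector."""
    return int(np.argmax([cosine(stem_vec, v) for v in choice_vecs]))
```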

Classifying relations with DM

Relation classification requires grouping pairs of words into classes that instantiate the same relation (e.g. CAUSE, PART-OF, etc.); the common approach to this task is supervised

Nearest centroid method
when both positive and negative examples are available for a relation type R, a positive centroid is created by summing the W1W2×L vectors of the positive example pairs of R, and a negative centroid by summing the W1W2×L vectors of the negative example pairs of R; a test pair is classified as an instance of R if its cosine to the positive centroid is higher than its cosine to the negative centroid
when there are no negative examples, a centroid is created for each class and test items are classified according to their nearest centroid
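A minimal sketch of the nearest-centroid classifier over pair vectors (plain numpy); the training pair vectors are assumed to come from the smoothed W1W2×L space, and the label scheme is only illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def train_centroids(examples):
    """examples: dict mapping a class label (e.g. 'CAUSE-EFFECT-pos',
    'CAUSE-EFFECT-neg', or a relation name) to a list of pair vectors."""
    return {label: np.sum(vectors, axis=0) for label, vectors in examples.items()}

def classify(pair_vec, centroids):
    """Assign the test pair to the class of its nearest (highest-cosine) centroid."""
    return max(centroids, key=lambda label: cosine(pair_vec, centroids[label]))
```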

Relation classification: SemEval-2007 data set

7 relation types between nominals from SemEval-2007 Task 04 (Girju et al. 2007): CAUSE-EFFECT, INSTRUMENT-AGENCY, PRODUCT-PRODUCER, ORIGIN-ENTITY, THEME-TOOL, PART-WHOLE, CONTENT-CONTAINER
Instances consist of Web snippets containing word pairs connected by a certain pattern, e.g. "* causes *" for the CAUSE-EFFECT relation
The retrieved snippets were manually classified by the SemEval organizers as positive (cycling-happiness) or negative (customer-satisfaction) instances of a relation (CAUSE-EFFECT)
For each relation there are 140 training examples and about 80 test cases (ca. 50% positive)
The contexts of the target word pairs (provided with the test set) are not used by the DM models

Relation classification: SemEval-2007 results

Baseline models
Majority always guesses the majority class in the test set
AllTrue always assigns an item to the target class
ProbMatch randomly guesses classes matching their distribution in the test data

model         acc    prec   recall   F
TypeDM        70.2   71.7   62.5     66.4
UCD-FC [1]    66.0   66.1   66.7     64.8
AllTrue       48.5   48.5   100.0    64.8
ILK [1]       63.5   60.5   69.5     63.8
UCB [1]       65.4   62.7   63.0     62.7
LexDM         65.4   64.7   61.3     62.5
LRA           62.0   62.7   59.3     60.2
DepDM         61.8   61.0   57.3     58.9
UMELB-B [1]   62.7   61.5   55.7     57.8
UTH [1]       58.8   56.1   57.1     55.9
ProbMatch     51.7   48.5   48.5     48.5
UC3M [1]      49.9   48.2   40.3     43.1
Majority      57.0   81.3   42.9     30.8

all measures are macro-averaged
[1] SemEval-2007 Task 4 (models in group A: WordNet = NO & Query = NO)

Relation classification: Nastase & Szpakowicz (NS) data set

600 modifier-noun pairs classified by Nastase & Szpakowicz (2003) into 30 relations
CAUSE (cloud-storm), PURPOSE (album-picture), LOCATION-AT (pain-chest), LOCATION-FROM (visitor-country), etc.

model          global acc   prec   recall   F
LRA-06 [1]     39.8         41.0   35.9     36.6
VSM-AV [2]     27.8         27.9   26.8     26.5
VSM-WMTS [1]   24.7         24.0   20.9     20.3
LRA            22.8         20.3   21.1     18.8
TypeDM         15.4         19.5   20.2     13.7
LexDM          12.1          7.5   14.1      8.1
DepDM           8.7         11.6   14.5      8.1
AllTrue        NA            3.3   100       6.4
ProbMatch       4.7          3.3    3.3      3.3
Majority        8.2          0.3    3.3      0.5

all measures except (global) accuracy are macro-averaged
[1] Turney (2006a); [2] Turney & Littman (2005)

Relation classification: Ó Séaghdha & Copestake (OC) data set

1,443 noun-noun compounds classified by Ó Séaghdha & Copestake (2009) into 6 relations
BE (celebrity-winner), HAVE (door-latch), IN (air-disaster), ACTOR (school-inspector), INSTRUMENT (freight-train), ABOUT (bank-panic)

model         global acc   prec   recall   F
OC-Comb [1]   63.1         NA     NA       61.6
OC-Rel [1]    52.1         NA     NA       49.9
TypeDM        32.1         33.8   33.5     31.4
LexDM         29.7         29.9   28.9     28.7
AllTrue       NA           16.7   100      28.5
LRA           28.2         27.6   27.4     27.2
DepDM         27.6         28.2   28.2     27.0
ProbMatch     17.1         16.7   16.7     16.7
Majority      21.3          3.6   16.7      5.9

all measures except (global) accuracy are macro-averaged
[1] Ó Séaghdha & Copestake (2009)

The W1W2×L space: interim summary

TypeDM achieves competitive results in semantic tasks involving relational similarity
TypeDM generally outperforms our LRA implementation; the large advantage of Turney's original LRA might be due to its far larger training corpus (ca. 50 billion words) and/or to its more sophisticated smoothing technique
While LRA is trained separately for each test set, the structure of the W1W2×L space is completely task-independent

Pattern-based relation extraction

Pattern-based approaches to extracting semantic relations pick a set of lexico-syntactic patterns that should capture the relation of interest and harvest the word pairs they connect in text (cf. Hearst 1992 for the hyponymy relation)
In DM, the same approach can be pursued by exploiting the information already available in the W1W2×L space
promising links are selected as the DM equivalent of patterns
relation instances are identified by measuring the length of word pair vectors in the W1W2×L subspace defined by the selected links

Qualia extraction: data set

1,487 noun-quale pairs corresponding to the qualia structures (Pustejovsky 1995) of 30 concrete (door) and abstract (imagination) nouns (Cimiano & Wenderoth 2007)
each noun-quale pair was rated by 3 subjects and instantiates one of the four qualia roles defined by Pustejovsky (1995)
Formal: door-barrier
Constitutive: food-fat
Agentive: letter-write
Telic: novel-entertain

Extracting qualia in the W1W2×L space

1. Selecting the patterns for qualia extraction
the patterns proposed by Cimiano & Wenderoth (2007) are approximated by manually selecting links that are already in the DM tensors

FORMAL: n as-form-of q, q as-form-of n, n as-kind-of q, n as-sort-of q, n be q, q such as n
CONSTITUTIVE: q as-member-of n, q as-part-of n, n with q, n with-lot-of q, n with-majority-of q, n with-number-of q, n with-sort-of q, n with-variety-of q
AGENTIVE: n as-result-of q, q obj n
TELIC: n for-use-as q, n for-use-in q, n sbj_tr q, n sbj_intr q

Extracting qualia in the W1W2×L space

2. Creating qualia subspaces of W1W2×L
for each role r, the W1W2×L vectors containing a target noun n are projected onto the subspace determined by the link set associated with role r
the lengths of the vectors ⟨n, q⟩ are measured in these subspaces, with q a potential quale for n

3. Ranking potential qualia
the length in the subspace associated with qualia role r is used to rank all ⟨n, q⟩ pairs relevant to r
e.g., the length of ⟨book, read⟩ in the subspace defined by the Telic links is the DM measure of the fitness of read as the Telic role of book
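A minimal sketch of steps 2–3, assuming the `pair_by_link` dictionary layout from the toy example above and a set of link names per role (e.g. those listed on the previous slide); the real link names and weights would come from the DM tensor.

```python
import math

def rank_qualia(pair_by_link, noun, role_links, top_k=10):
    """Rank candidate qualia q for `noun` by the length of the <noun, q>
    vector restricted to the links associated with one qualia role."""
    scores = {}
    for (w1, w2), vec in pair_by_link.items():
        if w1 != noun:
            continue
        # length of <noun, w2> in the subspace defined by the role's links
        length = math.sqrt(sum(vec[l] ** 2 for l in role_links if l in vec))
        if length > 0:
            scores[w2] = length
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# e.g. rank_qualia(pair_by_link, "book", {"sbj_tr", "obj", "for-use-in"}) might
# rank "read" highly if <book, read> has most of its mass on Telic-like links
```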

Qualia extraction: results

For each noun, the precision of the ranked list is computed at 11 equally spaced recall levels from 0% to 100%, separately for each role; the precision, recall and F values at the recall level that yields the highest F score are averaged across the roles, and then across target nouns

model          precision   recall   F
TypeDM         26.2        22.7     18.4
P [1]          NA          NA       17.1
WebP [1]       NA          NA       16.7
LexDM          19.9        23.6     16.2
WebJac [1]     NA          NA       15.2
DepDM          17.8        16.9     12.8
Verb-PMI [1]   NA          NA       10.7
Base [1]       NA          NA        7.6

[1] Cimiano & Wenderoth (2007)

Describing concept properties

Corpus-based semantic methods have been applied to generate commonsense concept descriptions in terms of intuitively salient properties (Almuhareb 2006, Baroni & Lenci 2008, Baroni et al. 2010)
a dog is a mammal, it barks, it has a tail, etc.
Semantic feature norms (property lists collected from subjects in elicitation tasks) are widely used in cognitive science as surrogates of mental features (Garrard et al. 2001, McRae et al. 2005, Vinson & Vigliocco 2008)
Large-scale collections of property-based concept descriptions are also carried out in AI, where they are important for commonsense reasoning, cf. Open Mind Common Sense (Liu & Singh 2004)

Predicting characteristic properties with DM

The W1W2×L space is used to predict the characteristic properties of noun concepts
all the ⟨n, w2⟩ pairs that have the target nominal concept n as first element are ranked by length in the W1W2×L space
the longest ⟨n, w2⟩ vectors in this space should correspond to salient properties of the target concept, since we expect a concept to co-occur often in texts with its important properties
properties with different POS are normalized by dividing the length of the vector representing a pair by the length of the longest vector in the harvested concept-property set that has the same POS
the pairs ⟨car, drive⟩, ⟨car, park⟩ and ⟨car, engine⟩ can be found among the longest W1W2×L vectors with car as first item
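A minimal sketch of the ranking with the POS normalization step, assuming a hypothetical `pos_of(word)` lookup and the same pair-by-link layout used in the earlier toy example.

```python
import math
from collections import defaultdict

def salient_properties(pair_by_link, concept, pos_of, top_k=10):
    """Rank <concept, w2> pairs by full-space length, then normalize each
    length by the longest vector of the same POS so that, e.g., verb and
    noun properties become comparable."""
    lengths = {}
    for (w1, w2), vec in pair_by_link.items():
        if w1 == concept:
            lengths[w2] = math.sqrt(sum(v ** 2 for v in vec.values()))
    # longest vector per POS among the harvested candidates
    max_by_pos = defaultdict(float)
    for w2, length in lengths.items():
        max_by_pos[pos_of(w2)] = max(max_by_pos[pos_of(w2)], length)
    normalized = {w2: length / max_by_pos[pos_of(w2)]
                  for w2, length in lengths.items()}
    return sorted(normalized, key=normalized.get, reverse=True)[:top_k]
```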

Predicting characteristic properties: data set

gold standard lists of 10 properties for each of 44 concrete noun concepts (cf. the ESSLLI 2008 unconstrained property generation challenge)
properties most frequently produced by subjects in the elicitation experiment of McRae et al. (2005)
Algorithms must generate lists of 10 properties per concept
performance is measured by the cross-concept average proportion of properties in the generated lists that are also in the corresponding gold standard lists

model           overlap   s.d.
Strudel [1]     23.9      11.3
TypeDM          19.5      12.4
DepDM           16.1      12.6
LexDM           14.5      12.1
DV-10 [1]       14.1      10.3
AttrValue [1]    8.8       9.9
SVD-10 [1]       4.1       6.1
Shaoul [2]       1.8       3.9

[1] Baroni et al. (2010); [2] ESSLLI 2008 shared task

The word-link by word (W1L×W2) space

vectors labeled with binary tuples of type ⟨w1, l⟩ (columns of the mode-3 matrix)
dimensions labeled with words w2 (rows of the mode-3 matrix)

                    1:bomb   2:gun   3:book
1:⟨marine, own⟩     40.0     85.3    3.2
2:⟨marine, use⟩     82.1     44.8    3.3
3:⟨sergeant, own⟩   16.7     73.4    8.0

The W1L×W2 vectors also represent syntactic slots of verb frames
the vector labeled with the tuple ⟨read, sbj⁻¹⟩ represents the subject slot of the verb read in terms of the distribution of its noun fillers, which label the dimensions of the space
The W1L×W2 space is used to classify verbs participating in different argument alternations

Argument alternations

Alternations involve the expression of the same semantic argument in two different syntactic slots (Levin & Rappaport-Hovav 2005)
Measures of "slot overlap" have been used by Joanis et al. (2008) as features to classify verbs on the basis of their argument alternations
the set of nouns that appear in two alternating slots should overlap to a certain degree
the cosine between the vectors of different syntactic slots of the same verb measures the amount of fillers they share
The W1L×W2 space is used to carry out the automatic classification of verbs that participate in different types of transitivity alternations
in transitivity alternations, verbs allow both a transitive NP V NP variant and an intransitive NP V (PP) variant (Levin 1993)

Transitivity alternations in the W1L×W2 space

Causative/inchoative alternation
with alternating verbs, the object argument (John broke the vase) can also be realized as an intransitive subject (The vase broke)
data set: 402 verbs extracted from Levin's classes (Levin 1993)
232 alternating causative/inchoative verbs (break)
170 non-alternating transitive verbs (mince)

Merlo & Stevenson (2001) classification task
discriminate 3 classes of verbs, each characterized by a different transitivity alternation
data set: 58 verbs from Merlo & Stevenson (2001)
19 unergative verbs undergoing the "induced action alternation" (race)
19 unaccusative verbs undergoing the "causative/inchoative alternation" (break)
20 object-drop verbs participating in the "unexpressed object alternation" (play)

Transitivity alternations in the W1L×W2 space

The similarities between the W1L×W2 vectors of the transitive subject, intransitive subject and direct object slots of a verb are used to classify the verbs
the W1L×W2 slot vectors ⟨v, l⟩ whose links are sbj_intr, sbj_tr and obj are extracted for each verb v in a data set (for LexDM, we sum the vectors whose links begin with one of these three patterns)
a 3-dimensional vector with the cosines between the three slot vectors is built for each v; these second-order vectors encode the profile of similarity across the slots of a verb
verb classification is performed using the nearest centroid method on the 3-dimensional vectors, with leave-one-out cross-validation
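A minimal sketch of the second-order representation and the leave-one-out nearest-centroid classification (plain numpy); `slots` is a hypothetical mapping from slot names to the verb's W1L×W2 slot vectors.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def alternation_profile(slots):
    """slots: dict with the 'sbj_tr', 'sbj_intr' and 'obj' slot vectors of a verb.
    Returns the 3-d second-order vector of pairwise cosines between the slots."""
    return np.array([
        cosine(slots["sbj_tr"], slots["sbj_intr"]),
        cosine(slots["sbj_tr"], slots["obj"]),
        cosine(slots["sbj_intr"], slots["obj"]),
    ])

def leave_one_out(profiles, labels):
    """Nearest-centroid classification with leave-one-out cross-validation.
    profiles: (n, 3) array of second-order vectors; labels: list of class names."""
    labels = np.array(labels)
    correct = 0
    for i in range(len(profiles)):
        mask = np.arange(len(profiles)) != i          # hold out verb i
        centroids = {c: profiles[mask & (labels == c)].mean(axis=0)
                     for c in set(labels[mask])}
        pred = max(centroids, key=lambda c: cosine(profiles[i], centroids[c]))
        correct += (pred == labels[i])
    return correct / len(profiles)
```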

Causative/inchoative alternation: results

Binary classification of the C/I data set (with non-alternating verbs as negative examples)

model       acc    prec   recall   F
AllTrue     57.7   57.7   100      73.2
LexDM       69.9   76.0   69.9     72.8
TypeDM      69.1   75.7   68.5     71.9
DepDM       65.7   72.8   64.6     68.4
ProbMatch   51.2   57.7   57.7     57.7

Merlo & Stevenson (2001) classification task: results

3-way classification of the MS data set

model             acc    prec   recall   F
NoPass [1]        71.2   NA     NA       71.2
AllFeatures [1]   69.5   NA     NA       69.1
NoTrans [1]       64.0   NA     NA       63.8
NoCaus [1]        62.7   NA     NA       62.6
NoVBN [1]         61.0   NA     NA       61.0
TypeDM            61.5   60.7   61.7     60.8
NoAnim [1]        61.0   NA     NA       59.9
LexDM             56.4   55.3   56.7     55.8
DepDM             54.7   52.9   55.0     53.2
AllTrue           NA     33.3   100      50.0
ProbMatch         33.3   33.3   33.3     33.3
Majority          33.9   11.3   33.3     16.9

all measures except accuracy are macro-averaged
[1] Merlo & Stevenson (2001)

The link by word-word (L×W1W2) space

vectors labeled with links l (rows of the mode-2 matrix)
dimensions labeled with word pair tuples ⟨w1, w2⟩ (columns of the mode-2 matrix)

        1:⟨marine, bomb⟩   2:⟨sergeant, bomb⟩   3:⟨teacher, gun⟩
1:own   40.0               16.7                 5.2
2:use   82.1               69.5                 7.0

Links are represented in terms of the word pairs they connect
The L×W1W2 space supports tasks involving the semantics of links: characterizing prepositions (Baldwin et al. 2009), measuring the relative similarity of different kinds of V-N relations, etc.
The L×W1W2 vectors are currently used for the automatic selection of links for the W1W2×L task of qualia extraction

Automatic link selection for qualia extraction

For each of the 30 noun concepts in the Cimiano & Wenderoth (2007) data set, the noun-quale pairs pertaining to the remaining 29 concepts are used as training examples to select a set of 20 links for qualia extraction
for each role r, two L×W1W2 subspaces are constructed
a positive subspace, whose only non-zero dimensions are the example pairs ⟨n, q_r⟩
a negative subspace, whose non-zero dimensions correspond to all ⟨w1, w2⟩ pairs such that w1 is one of the training nominal concepts and w2 is not a quale q_r in the example pairs
the length of each link is measured in both subspaces
e.g., the length of the obj link is measured in a subspace characterized by ⟨n, q_telic⟩ example pairs (positive subspace) and in a subspace characterized by ⟨n, w2⟩ pairs that are not Telic examples (negative subspace)
the pointwise mutual information (PMI) computed on these lengths identifies the links that are most typical of the positive subspace corresponding to each qualia role
links with fewer than 10 non-zero dimensions in the positive subspace are filtered out
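A minimal sketch of the link-scoring step, assuming `link_by_pair` maps each link to a {(w1, w2): weight} dict (as in the earlier toy regrouping) and that the positive and negative example pairs are given as sets. The PMI-style score over subspace lengths is one plausible reading of the slide, not the exact DM formula.

```python
import math

def link_scores(link_by_pair, pos_pairs, neg_pairs, min_dims=10):
    """Score each link by a PMI-style association with the positive subspace:
    compare the link's length over the positive pairs with its length over
    positives + negatives, relative to the same ratio over all links."""
    def length(link, pairs):
        vec = link_by_pair.get(link, {})
        return math.sqrt(sum(vec[p] ** 2 for p in pairs if p in vec))

    pos_len, all_len = {}, {}
    for link, vec in link_by_pair.items():
        if sum(1 for p in pos_pairs if p in vec) < min_dims:
            continue                     # too few positive dimensions: filter out
        lp, ln = length(link, pos_pairs), length(link, neg_pairs)
        pos_len[link], all_len[link] = lp, lp + ln

    total_pos, total_all = sum(pos_len.values()), sum(all_len.values())
    if total_pos == 0 or total_all == 0:
        return []
    scores = {link: math.log((pos_len[link] / all_len[link]) / (total_pos / total_all))
              for link in pos_len if pos_len[link] > 0 and all_len[link] > 0}
    return sorted(scores, key=scores.get, reverse=True)
```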

Automatic link selection for qualia extraction

Links selected in all folds of the leave-one-out procedure as typical of each qualia role:

FORMAL: n is q, q is n, q become n, n coord q, q coord n, q have n, n in q, n provide q, q such as n
AGENTIVE: q after n, q alongside n, q as n, q before n, q besides n, q during n, q in n, q obj n, q out n, q over n, q since n, q unlike n
CONSTITUTIVE: n have q, n use q, n with q, n without q
TELIC: q behind n, q by n, q like n, q obj n, n sbj_intr q, q through n, q via n

Qualia extraction: results with automatically selected links

Qualia extraction using DM subspaces defined by the automatically selected links (models marked with *); the unmarked DM models repeat the results obtained with the manually selected links

model          precision   recall   F
TypeDM*        24.2        26.7     19.1
TypeDM         26.2        22.7     18.4
P [1]          NA          NA       17.1
WebP [1]       NA          NA       16.7
LexDM          19.9        23.6     16.2
WebJac [1]     NA          NA       15.2
DepDM*         18.4        27.0     15.1
LexDM*         22.6        18.1     14.8
DepDM          17.8        16.9     12.8
Verb-PMI [1]   NA          NA       10.7
Base [1]       NA          NA        7.6

[1] Cimiano & Wenderoth (2007)

Conclusions: two requirements for a general framework for distributional semantics

common representation for distributional semantics
representing corpus-derived data so as to capture aspects of meaning that have so far been modeled with different, prima facie incompatible data structures
versatility in modeling semantic tasks
using the common representation to address a large battery of semantic experiments, achieving a performance at least comparable to that of state-of-the-art, task-specific DSMs

Conclusions: the DM approach to distributional representation

DM models distributional data as a structure of weighted tuples, formalized as a labeled third-order tensor
a generalization with respect to the common approach of many corpus-based semantic models, which still couch distributional information directly in binary structures
the third-order tensor formalization allows DM to fully exploit the potential of corpus-derived tuples
Semantic spaces are generated from the same underlying third-order tensor, by the standard linear-algebraic operation of tensor matricization

Conclusions: DM performance on semantic tasks

DM addresses a large battery of semantic experiments with good performance
in nearly all test sets the best implementation of DM (TypeDM) is at least as good as state-of-the-art algorithms
models outperforming TypeDM by a large margin have been trained on much larger corpora, rely on special knowledge resources, or use sophisticated machine learning algorithms
TypeDM consistently outperforms the alternative models reimplemented to be fully comparable to DM (Win, DV, LRA)
No task-specific optimization was performed

DM as a model for meaning

Consistent with what is commonly assumed in cognitive science and formal linguistics, DM clearly distinguishes between:
an acquisition phase: corpus-based tuple extraction and weighting
a declarative structure: the common underlying distributional memory
procedural problem-solving components: the procedures used to perform different semantic tasks

DM as a model for meaning

The third-order tensor formalization of corpus-based tuples allows distributional information to be represented in a way similar to other types of knowledge
in linguistics, cognitive science, and AI, semantic and conceptual knowledge is represented in terms of structures built around typed relations between elements, such as synsets, concepts, properties, etc.
lexical networks like WordNet (Fellbaum 1998), commonsense resources like ConceptNet (Liu & Singh 2004), cognitive models of semantic memory (Rogers & McClelland 2004)
The tensor representation of distributional data promises to build new bridges across existing approaches to semantic representation

References

Almuhareb, A. 2006. Attributes in Lexical Acquisition. PhD thesis, University of Essex.
Almuhareb, A. and M. Poesio. 2004. Attribute-based and value-based clustering: An evaluation. In Proceedings of EMNLP, pages 158–165.
Baldwin, T., V. Kordoni, and A. Villavicencio. 2009. Prepositions in applications: A survey and introduction to the special issue. Computational Linguistics, 35(2):119–149.
Baroni, M., E. Barbu, B. Murphy, and M. Poesio. 2010. Strudel: A distributional semantic model based on properties and types. Cognitive Science. In press.
Baroni, M., S. Evert, and A. Lenci, eds. 2008. Bridging the Gap between Semantic Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.
Baroni, M. and A. Lenci. 2008. Concepts and properties in word spaces. Italian Journal of Linguistics, 20(1):55–88.
Biçici, E. and D. Yuret. 2006. Clustering word pairs to answer analogy questions. In Proceedings of the Fifteenth Turkish Symposium on Artificial Intelligence and Neural Networks, pages 277–284.
Bullinaria, J. and J. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
Chen, H., M.-S. Lin, and Y.-C. Wei. 2006. Novel association measures using web search with double checking. In Proceedings of COLING-ACL, pages 1009–1016.
Cimiano, P. and J. Wenderoth. 2007. Automatic acquisition of ranked qualia structures from the web. In Proceedings of ACL, pages 888–895.
Curran, J. and M. Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, pages 59–66.
Erk, K. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of ACL, pages 216–223.

Erk, K. and S. Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of EMNLP, pages 897–906.
Evert, S. 2005. The Statistics of Word Cooccurrences. Dissertation, Stuttgart University.
Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge.
Garrard, P., M. L. Ralph, J. Hodges, and K. Patterson. 2001. Prototypicality, distinctiveness, and intercorrelation: Analyses of the semantic attributes of living and nonliving concepts. Cognitive Neuropsychology, 18(2):25–174.
Girju, R., A. Badulescu, and D. Moldovan. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
Girju, R., P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2007. SemEval-2007 task 04: Classification of semantic relations between nominals. In Proceedings of SemEval 2007, pages 13–18.
Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer, Boston.
Griffiths, T., M. Steyvers, and J. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114:211–244.
Harris, Z. 1954. Distributional structure. Word, 10(2-3):146–162.
Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING, pages 539–545.
Herdağdelen, A. and M. Baroni. 2009. BagPack: A general framework to represent semantic relations. In Proceedings of the EACL GEMS Workshop, pages 33–40.
Herdağdelen, A., K. Erk, and M. Baroni. 2009. Measuring semantic relatedness with vector space models and random walks. In Proceedings of TextGraphs-4, pages 50–53.
Joanis, E., S. Stevenson, and D. James. 2008. A general feature space for automatic verb classification. Natural Language Engineering, 14(3):337–367.

Karypis, G. 2003. CLUTO: A clustering toolkit. Technical Report 02-017, University of Minnesota Department of Computer Science.
Kolda, T. 2006. Multilinear operators for higher-order decompositions. Technical Report 2081, SANDIA.
Kolda, T. and B. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.
Landauer, T. and S. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.
Lenci, A. 2008. Distributional approaches in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL.
Levin, B. and M. Rappaport-Hovav. 2005. Argument Realization. Cambridge University Press, Cambridge.
Lin, D. 1998. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304.
Liu, H. and P. Singh. 2004. ConceptNet: A practical commonsense reasoning toolkit. BT Technology Journal, pages 211–226.
Lund, K. and C. Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203–208.
Matveeva, I., G.-A. Levow, A. Farahat, and C. Royer. 2005. Generalized latent semantic analysis for term representation. In Proceedings of RANLP, pages 60–68.

McRae, K., G. Cree, M. Seidenberg, and C. McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.
McRae, K., M. Spivey-Knowlton, and M. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.
Merlo, P. and S. Stevenson. 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373–408.
Miller, G. and W. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6:1–28.
Murphy, G. 2002. The Big Book of Concepts. MIT Press, Cambridge, MA.
Nastase, V. and S. Szpakowicz. 2003. Exploring noun-modifier semantic relations. In Proceedings of the Fifth International Workshop on Computational Semantics, pages 285–301, Tilburg, The Netherlands.
Ó Séaghdha, D. and A. Copestake. 2009. Using lexical and relational similarity to classify semantic relations. In Proceedings of EACL, pages 621–629, Athens, Greece.
Padó, S. and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
Padó, U. 2007. The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.
Padó, U., S. Padó, and K. Erk. 2007. Flexible, corpus-based modelling of human plausibility judgements. In Proceedings of EMNLP, pages 400–409.

Pantel, P. and M. Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of COLING-ACL, pages 113–120.
Pustejovsky, J. 1995. The Generative Lexicon. MIT Press, Cambridge, MA.
Quesada, J., P. Mangalath, and W. Kintsch. 2004. Analogy-making as predication using relational information and LSA vectors. In Proceedings of CogSci, page 1623.
Rapp, R. 2003. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the 9th MT Summit, pages 315–322.
Rogers, T. and J. McClelland. 2004. Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press, Cambridge, MA.
Rothenhäusler, K. and H. Schütze. 2009. Unsupervised classification with dependency based word spaces. In Proceedings of the EACL GEMS Workshop, pages 17–24.
Rubenstein, H. and J. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
Ruiz-Casado, M., E. Alfonseca, and P. Castells. 2005. Using context-window overlapping in synonym discovery and ontology extension. In Proceedings of RANLP.
Sahlgren, M. 2005. An introduction to random indexing. http://www.sics.se/∼mange/papers/RI intro.pdf
Schütze, H. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.
Terra, E. and C. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of HLT-NAACL, pages 244–251.
Turney, P. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML, pages 491–502.
Turney, P. 2006a. Expressing implicit semantic relations without supervision. In Proceedings of COLING-ACL, pages 313–320.

Turney, P. 2006b. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.
Turney, P. 2007. Empirical evaluation of four tensor decomposition algorithms. Technical Report ERB-1152, NRC.
Turney, P. 2008. A uniform approach to analogies, synonyms, antonyms and associations. In Proceedings of COLING, pages 905–912.
Turney, P. and M. Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.
Van Overschelde, J., K. Rawson, and J. Dunlosky. 2004. Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50:289–335.
Veale, T. and Y. Hao. 2008. Acquiring naturalistic concept descriptions from the web. In Proceedings of LREC, pages 1121–1124.
Vinson, D. and G. Vigliocco. 2008. Semantic feature production norms for a large set of objects and events. Behavior Research Methods, 40(1):183–190.
Zhao, Y. and G. Karypis. 2003. Criterion functions for document clustering: Experiments and analysis. Technical Report 01-40, University of Minnesota Department of Computer Science.