Convolutional 2D Knowledge Graph Embeddings

Convolutional 2D Knowledge Graph Embeddings
Mehdi Ali
13.07.18

Contents
• Introduction
• ConvE
• Experiments
• Conclusion


Introduction – Knowledge Graph & Link Prediction

[Figure: example knowledge graph whose entities are connected by relations such as studied_in, works_at, lives_in_country, located_in, CEO_of, has_a and friend_of; link prediction infers missing edges of this kind.]

Introduction – Link Prediction

• A knowledge graph contains millions of facts
• A link predictor should therefore [1]:
  • scale well in its number of parameters
  • scale well in its computational cost
  • still compute expressive features


Existing Approaches

In the scoring component, the two entity embeddings e_s and e_o are scored by a function ψ_r; the score of a triple (s, r, o) is defined as ψ(s, r, o) = ψ_r(e_s, e_o) ∈ R. Here e_s, e_o ∈ C^k in ComplEx and e_s, e_o ∈ R^k in all other models, ⟨x, y, z⟩ = Σ_i x_i y_i z_i denotes the tri-linear dot product, ∗ denotes the convolution operator, f denotes a non-linear function, and ē_s, r̄_r denote 2D reshapings of e_s and r_r [1].

Table 1 (from [1]): Scoring functions ψ_r(e_s, e_o) of neural link predictors in the literature, their relation-dependent parameters and space complexity; n_e and n_r denote the number of entities and relation types, i.e. n_e = |E| and n_r = |R|.

| Model                           | Scoring function ψ_r(e_s, e_o)  | Relation parameters      | Space complexity  |
| SE (Bordes et al. 2014)         | ‖W_r^L e_s − W_r^R e_o‖_p       | W_r^L, W_r^R ∈ R^{k×k}   | O(n_e k + n_r k²) |
| TransE (Bordes et al. 2013a)    | ‖e_s + r_r − e_o‖_p             | r_r ∈ R^k                | O(n_e k + n_r k)  |
| DistMult (Yang et al. 2015)     | ⟨e_s, r_r, e_o⟩                 | r_r ∈ R^k                | O(n_e k + n_r k)  |
| ComplEx (Trouillon et al. 2016) | ⟨e_s, r_r, e_o⟩                 | r_r ∈ C^k                | O(n_e k + n_r k)  |
| ConvE                           | f(vec(f([ē_s; r̄_r] ∗ ω))W) e_o | r_r ∈ R^{k'}             | O(n_e k + n_r k') |

• Current approaches focus on shallow and fast models → less expressive features
• The only way to increase expressiveness is to increase the embedding size
  → Problem: the number of embedding parameters is proportional to the number of entities and relations
• A simple model like DistMult with an embedding size of 200 already needs roughly 33GB of memory for its parameters on Freebase (a minimal scoring sketch follows below)
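As a concrete illustration of such a shallow scorer, here is a minimal PyTorch sketch of the DistMult tri-linear score; the FB15k-237-like sizes are illustrative assumptions, and this is not code from [1]:

```python
import torch

n_entities, n_relations, k = 14_541, 237, 200        # FB15k-237-like sizes (illustrative)
entity_emb = torch.nn.Embedding(n_entities, k)
relation_emb = torch.nn.Embedding(n_relations, k)

def distmult_score(s_idx, r_idx, o_idx):
    """DistMult: <e_s, r_r, e_o> = sum_i e_s[i] * r_r[i] * e_o[i]."""
    e_s, r_r, e_o = entity_emb(s_idx), relation_emb(r_idx), entity_emb(o_idx)
    return (e_s * r_r * e_o).sum(dim=-1)

print(distmult_score(torch.tensor([0]), torch.tensor([5]), torch.tensor([7])))

# The embedding tables are the only parameters, so the count grows linearly with n_entities:
print(f"{(n_entities + n_relations) * k:,} embedding parameters")
```

Because the embedding tables are the only parameters, the memory footprint of such shallow models is tied directly to the number of entities and relations.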


Convolution Operator

• Parameter efficient [1]
• Fast to compute thanks to highly optimised GPU implementations [1]
• Due to its widespread use, robust methodologies for training convolutional networks have been established [1]

[Figure: illustration of the convolution operation, from [2]]


1D vs 2D Convolution

[Figure: 1D vs 2D convolution over the subject-entity and relation embeddings [1], [2]]

• 1D convolution over the concatenated embedding vectors: depending on the kernel width, there is little interaction between the features of the two embeddings
• 2D convolution over the reshaped and stacked embeddings: more interactions between features
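To make the difference concrete, here is a small shape-only sketch (illustrative sizes, not the exact configuration of [1]): a 1D convolution slides along the concatenated vectors, so only the few kernel positions around the single concatenation point mix features of both embeddings, whereas reshaping both embeddings into matrices and stacking them gives the 2D kernel a whole boundary row of positions where entity and relation features interact.

```python
import torch
import torch.nn as nn

k = 200                       # embedding size (illustrative)
e_s = torch.randn(1, k)       # subject entity embedding
r_r = torch.randn(1, k)       # relation embedding

# 1D: convolve along the concatenated vectors; only kernel positions around the
# single concatenation point mix features of e_s and r_r
x1d = torch.cat([e_s, r_r], dim=1).unsqueeze(1)       # (batch=1, channels=1, 2k)
print(nn.Conv1d(1, 32, kernel_size=3)(x1d).shape)     # torch.Size([1, 32, 398])

# 2D: reshape each embedding to 10x20 and stack them; every kernel position straddling
# the 20-column-wide boundary between the two blocks mixes entity and relation features
x2d = torch.cat([e_s.view(1, 1, 10, 20),
                 r_r.view(1, 1, 10, 20)], dim=2)      # (1, 1, 20, 20)
print(nn.Conv2d(1, 32, kernel_size=3)(x2d).shape)     # torch.Size([1, 32, 18, 18])
```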

Contents
• Introduction
• ConvE
• Experiments
• Conclusion


ConvE (1)

• Multi-layer architecture for link prediction [1]:
  • convolution layer
  • projection layer
  • inner product layer
• Scoring function: ψ_r(e_s, e_o) = f(vec(f([ē_s; r̄_r] ∗ ω))W) e_o
• Trained by minimising the binary cross-entropy loss
  L(p, t) = −(1/N) Σ_i (t_i · log(p_i) + (1 − t_i) · log(1 − p_i)),
  where t is the label vector with dimension R^{1×1} for 1-1 scoring and R^{1×N} for 1-N scoring; its elements are one for relationships that exist and zero otherwise [1]
• Rectified linear units are used as the non-linearity f, and batch normalisation is applied after each layer to stabilise, regularise and increase the rate of convergence [1]

ConvE (2) – System Architecture

Figure 1 [1]: In the ConvE model, the entity and relation embeddings are first reshaped and concatenated (steps 1, 2); the resulting matrix is then used as input to a convolutional layer (step 3); the resulting feature map tensor is vectorised and projected into a k-dimensional space (step 4) and matched with all candidate object embeddings (step 5).

Note on 1-N scoring [1]: if, instead of 1-N scoring, 1-(0.1N) scoring is used – that is, scoring against only 10% of the entities – a forward-backward pass is about 25% faster, but convergence on the training set is roughly 230% slower. 1-N scoring thus has an additional effect akin to batch normalisation (Ioffe and Szegedy 2015): some computational performance is traded for faster convergence.
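A minimal sketch of the five steps in Figure 1, written as a PyTorch module; the 10×20 reshaping, 32 filters and 3×3 kernel are illustrative assumptions rather than the exact hyperparameters of [1], and dropout, batch normalisation and the per-entity bias term are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvESketch(nn.Module):
    def __init__(self, n_entities, n_relations, k=200, h=10, w=20, n_filters=32):
        super().__init__()
        assert h * w == k
        self.h, self.w = h, w
        self.entity_emb = nn.Embedding(n_entities, k)
        self.relation_emb = nn.Embedding(n_relations, k)
        self.conv = nn.Conv2d(1, n_filters, kernel_size=3)           # step 3
        conv_out = n_filters * (2 * h - 2) * (w - 2)
        self.proj = nn.Linear(conv_out, k)                           # step 4

    def forward(self, s_idx, r_idx):
        # steps 1-2: reshape embeddings to 2D and stack ("concatenate") them
        e_s = self.entity_emb(s_idx).view(-1, 1, self.h, self.w)
        r_r = self.relation_emb(r_idx).view(-1, 1, self.h, self.w)
        x = torch.cat([e_s, r_r], dim=2)                             # (B, 1, 2h, w)
        # step 3: convolution + non-linearity
        x = F.relu(self.conv(x))
        # step 4: vectorise and project back to k dimensions
        x = F.relu(self.proj(x.view(x.size(0), -1)))
        # step 5: match against all candidate object embeddings (1-N scoring)
        return x @ self.entity_emb.weight.t()                        # (B, n_entities)
```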

ConvE (3)

• More expressive features due to the multi-layer architecture
• The convolution filters and the projection matrix are additional parameters that are independent of the number of entities and relations
• Each (s, r) pair is scored against all entities at once (1:N scoring); existing models usually score one triple (s, r, o) at a time
• Trained with the binary cross-entropy loss and the Adam optimiser (see the training-step sketch below)
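A self-contained sketch of one 1:N training step with binary cross-entropy and Adam; for brevity the scorer below is DistMult-style rather than ConvE's convolutional scorer, and the label smoothing used in [1] is omitted:

```python
import torch
import torch.nn as nn

n_entities, n_relations, k = 14_541, 237, 200               # illustrative sizes
entity_emb = nn.Embedding(n_entities, k)
relation_emb = nn.Embedding(n_relations, k)

def score_1_to_N(s_idx, r_idx):
    """Score each (s, r) pair against ALL entities at once (DistMult-style for brevity)."""
    e_s, r_r = entity_emb(s_idx), relation_emb(r_idx)        # (B, k) each
    return (e_s * r_r) @ entity_emb.weight.t()               # (B, n_entities)

optimiser = torch.optim.Adam(
    list(entity_emb.parameters()) + list(relation_emb.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                             # binary cross-entropy

def train_step(s_idx, r_idx, object_multi_hot):
    """object_multi_hot: (B, n_entities) floats, 1 for every o with (s, r, o) in the KG."""
    logits = score_1_to_N(s_idx, r_idx)
    loss = loss_fn(logits, object_multi_hot)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```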


Contents
• Introduction
• ConvE
• Experiments
• Conclusion


Experiments(1)

WN18:
| Model                             | MR   | MRR  | Hits@10 | Hits@3 | Hits@1 |
| DistMult (Yang et al. 2015)       | 902  | .822 | .936    | .914   | .728   |
| ComplEx (Trouillon et al. 2016)   | –    | .941 | .947    | .936   | .936   |
| Gaifman (Niepert 2016)            | 352  | –    | .939    | –      | .761   |
| ANALOGY (Liu, Wu, and Yang 2017)  | –    | .942 | .947    | .944   | .939   |
| R-GCN (Schlichtkrull et al. 2017) | –    | .814 | .964    | .929   | .697   |
| ConvE                             | 504  | .942 | .955    | .947   | .935   |
| Inverse Model                     | 740  | .963 | .964    | .964   | .953   |

FB15k:
| Model                             | MR   | MRR  | Hits@10 | Hits@3 | Hits@1 |
| DistMult (Yang et al. 2015)       | 97   | .654 | .824    | .733   | .546   |
| ComplEx (Trouillon et al. 2016)   | –    | .692 | .840    | .759   | .599   |
| Gaifman (Niepert 2016)            | 75   | –    | .842    | –      | .692   |
| ANALOGY (Liu, Wu, and Yang 2017)  | –    | .725 | .854    | .785   | .646   |
| R-GCN (Schlichtkrull et al. 2017) | –    | .696 | .842    | .760   | .601   |
| ConvE                             | 64   | .745 | .873    | .801   | .670   |
| Inverse Model                     | 2501 | .660 | .660    | .659   | .658   |

Table 1: Link prediction results on WN18 and FB15k [1]


Experiments(2) – Removed Inverse Relations

• WN18 and FB15k leak inverse relations: many test triples can be obtained simply by inverting training triples, e.g. (cat, hypernym, feline) in the training set and (feline, hyponym, cat) in the test set [1]
• WN18RR and FB15k-237 are variants with such inverse relations removed (FB15k-237: Toutanova and Chen 2015; WN18RR: introduced in [1])

WN18RR:
| Model                             | MR    | MRR | Hits@10 | Hits@3 | Hits@1 |
| DistMult (Yang et al. 2015)       | 5110  | .43 | .49     | .44    | .39    |
| ComplEx (Trouillon et al. 2016)   | 5261  | .44 | .51     | .46    | .41    |
| R-GCN (Schlichtkrull et al. 2017) | –     | –   | –       | –      | –      |
| ConvE                             | 5277  | .46 | .48     | .43    | .39    |
| Inverse Model                     | 13526 | .35 | .35     | .35    | .35    |

FB15k-237:
| Model                             | MR   | MRR  | Hits@10 | Hits@3 | Hits@1 |
| DistMult (Yang et al. 2015)       | 254  | .241 | .419    | .263   | .155   |
| ComplEx (Trouillon et al. 2016)   | 339  | .247 | .428    | .275   | .158   |
| R-GCN (Schlichtkrull et al. 2017) | –    | .248 | .417    | .258   | .153   |
| ConvE                             | 246  | .316 | .491    | .350   | .239   |
| Inverse Model                     | 7030 | .010 | .014    | .011   | .007   |

Table 2: Link prediction results with inverse relations removed [1]

On FB15k-237, ConvE improves on the previous best model, R-GCN (Schlichtkrull et al. 2017), which achieves 0.417 Hits@10 with more than 8M parameters. Overall, ConvE is more than 17x more parameter efficient than R-GCN and 8x more parameter efficient than DistMult; for the entirety of Freebase, these models would require more than 82GB (R-GCN) and 21GB (DistMult) of memory, compared to 5.2GB for ConvE [1].
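A back-of-the-envelope sketch of where model sizes of that order come from (pure Python; the Freebase-scale entity and relation counts below are rough, illustrative assumptions, not figures from [1]):

```python
def embedding_memory_gb(n_entities, n_relations, emb_size, bytes_per_param=4):
    """Rough float32 memory footprint of the embedding tables alone."""
    return (n_entities + n_relations) * emb_size * bytes_per_param / 1e9

# FB15k-237-scale graph vs a Freebase-scale graph (counts are illustrative assumptions):
print(embedding_memory_gb(14_541, 237, emb_size=200))          # ~0.01 GB
print(embedding_memory_gb(50_000_000, 30_000, emb_size=200))   # ~40 GB
```

Because the embedding tables dominate, memory grows linearly with the number of entities, which is why a model that reaches the same accuracy with a smaller embedding size scales much better to Freebase-sized graphs.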

Experiments(3) – Removed Inverse Relations

Parameter scaling of DistMult vs ConvE on FB15k-237:
| Model    | Param. count | Emb. size | MRR | Hits@10 | Hits@3 | Hits@1 |
| DistMult | 1.89M        | 128       | .23 | .41     | .25    | .15    |
| DistMult | 0.95M        | 64        | .22 | .39     | .25    | .14    |
| DistMult | 0.23M        | 16        | .16 | .31     | .17    | .09    |
| ConvE    | 5.05M        | 200       | .32 | .49     | .35    | .23    |
| ConvE    | 1.89M        | 96        | .32 | .49     | .35    | .23    |
| ConvE    | 0.95M        | 54        | .30 | .46     | .33    | .22    |
| ConvE    | 0.46M        | 28        | .28 | .43     | .30    | .20    |
| ConvE    | 0.23M        | 14        | .26 | .40     | .28    | .19    |

Table 3: Parameter scaling results on FB15k-237 [1]

• Ablation: replacing the 2D convolution with fully connected layers or with 1D convolution consistently reduced the predictive accuracy of the model [1]
• Inverse Model baseline [1]: to gauge the severity of the inverse-relation problem, a simple rule-based model is used. Two relations r1, r2 are taken to be inverses of each other if the presence of (s, r1, o) co-occurs with the presence of (o, r2, s) with a frequency of at least 0.99 − (f_v + f_t), where f_v and f_t are the fractions of the validation and test sets. At test time, the model checks whether a test triple has inverse matches outside the test set; if k matches are found, a permutation of the top k ranks is assigned to them, otherwise a random rank is predicted.
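A simplified sketch of that rule-based inverse-relation check (the threshold handling and test-time ranking are simplified relative to [1]):

```python
from collections import defaultdict
from itertools import combinations

def detect_inverse_pairs(train_triples, threshold=0.99):
    """Return relation pairs (r1, r2) where (s, r1, o) nearly always co-occurs with (o, r2, s)."""
    by_relation = defaultdict(set)
    for s, r, o in train_triples:
        by_relation[r].add((s, o))

    inverse_pairs = []
    for r1, r2 in combinations(by_relation, 2):
        pairs_r1, pairs_r2 = by_relation[r1], by_relation[r2]
        inverted_r2 = {(o, s) for s, o in pairs_r2}
        overlap = len(pairs_r1 & inverted_r2)
        # fraction of r1 facts whose inverse appears under r2, and vice versa
        if (overlap / len(pairs_r1) >= threshold and
                overlap / len(pairs_r2) >= threshold):
            inverse_pairs.append((r1, r2))
    return inverse_pairs

triples = [("cat", "hypernym", "feline"), ("feline", "hyponym", "cat"),
           ("dog", "hypernym", "canine"), ("canine", "hyponym", "dog")]
print(detect_inverse_pairs(triples))   # [('hypernym', 'hyponym')]
```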

Experiments(4) – Removed Inverse Relations

YAGO3-10:
| Model                           | MR    | MRR | Hits@10 | Hits@3 | Hits@1 |
| DistMult (Yang et al. 2015)     | 5926  | .34 | .54     | .38    | .24    |
| ComplEx (Trouillon et al. 2016) | 6351  | .36 | .55     | .40    | .26    |
| ConvE                           | 2792  | .52 | .66     | .56    | .45    |
| Inverse Model                   | 59448 | .01 | .02     | .02    | .01    |

Countries (AUC-PR):
| Model                           | S1        | S2        | S3        |
| DistMult (Yang et al. 2015)     | 1.00±0.00 | 0.72±0.12 | 0.52±0.07 |
| ComplEx (Trouillon et al. 2016) | 0.97±0.02 | 0.57±0.10 | 0.43±0.07 |
| ConvE                           | 1.00±0.00 | 0.99±0.01 | 0.86±0.05 |
| Inverse Model                   | –         | –         | –         |

Table 4: Link prediction results on YAGO3-10 and Countries (inverse relations removed) [1]

The mean PageRank of nodes in the test set correlates with the reduction in error (AUC-PR or Hits@10) of ConvE with respect to DistMult (r = 0.83), giving additional evidence that deeper models have an advantage for modelling nodes with high (recursive) indegree [1].

Observation (1)

• Good performance on YAGO3-10 and FB15k-237 compared to WN18RR (cf. Tables 2 and 4)
• YAGO3-10 and FB15k-237 contain nodes with very high relation-specific indegree

Observation (2)

• Hypothesis 1: modelling nodes with very high indegree requires deeper models; models like ConvE therefore have an advantage over shallow models like DistMult
• Hypothesis 2: deeper models are harder to optimise; for datasets with low relation-specific indegree, shallow models like DistMult may suffice [1]


Analysis of Indegree

• To test the hypotheses, a dataset with low average relation-specific indegree (WN18) and one with high indegree (FB15k) were each reversed into a high-indegree (high-WN18) and a low-indegree (low-FB15k) variant
• Both hypotheses held: ConvE performed better on the high-indegree dataset, DistMult better on the low-indegree dataset (Hits@10, Table 5 below; a sketch of how relation-specific indegree is computed follows the table)

| Model    | Low-FB15k | High-WN18 |
| ConvE    | 0.586     | 0.952     |
| DistMult | 0.728     | 0.938     |

Table 5: Hits@10 on the reversed datasets [1]
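To make the quantity being manipulated concrete, here is a small sketch (not from [1]) that computes the average relation-specific indegree of a triple set, i.e. the average number of incoming edges a node receives under a single relation:

```python
from collections import Counter

def avg_relation_specific_indegree(triples):
    """Average number of subjects per (relation, object) pair."""
    indegree = Counter((r, o) for s, r, o in triples)
    return sum(indegree.values()) / len(indegree)

# Toy examples: WordNet-like facts spread over many objects vs Freebase-like facts
# concentrated on one high-indegree node (both are illustrative, not real data).
wn18_like  = [("cat", "hypernym", "feline"), ("dog", "hypernym", "canine")]
fb15k_like = [("alice", "profession", "actor"), ("bob", "profession", "actor"),
              ("carol", "profession", "actor")]
print(avg_relation_specific_indegree(wn18_like))    # 1.0 -> low indegree (WN18-like)
print(avg_relation_specific_indegree(fb15k_like))   # 3.0 -> high indegree (FB15k-like)
```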


Conclusion

• ConvE: a link prediction model based on a convolutional neural network
• Expressive features due to its multi-layer architecture
• Uses fewer parameters than comparable models
• Fast thanks to 1:N scoring
• Hypothesis: ConvE is better at modelling datasets that contain nodes with high indegree [1]

Thank You
Questions?


Literature & Resources

[1]: Dettmers, Tim, et al. "Convolutional 2D Knowledge Graph Embeddings." arXiv preprint arXiv:1707.01476 (2017).

[2]: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
