Automatic Keyphrase Extraction via Topic Decomposition
Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun
Presenter: Wenyi Huang
Department of Computer Science and Technology
State Key Lab on Intelligent Technology and Systems
National Lab for Information Science and Technology
Tsinghua University

Oct 9, 2010

Wenyi Huang Dept. CS&T, THU

Automatic Keyphrase Extraction via Topic Decomposition

Introduction

What is keyphrase extraction?

Methods
Supervised: Learning algorithms for keyphrase extraction (Turney, 2000)
Unsupervised: TFIDF; TextRank: Bringing order into texts (Rada Mihalcea and Paul Tarau, 2004)

Motivation

What about topics?
Relevance: Good keyphrases should be relevant to the major topics of the given document.
Coverage: An appropriate set of keyphrases should also have good coverage of a document’s major topics.

Building Topic Interpreters
Method: Latent Dirichlet Allocation (LDA)
Dataset: Wikipedia snapshot from March 2008

Figure: An example of probabilistic topic model

Topic-Decomposed PageRank

Figure: Topical PageRank (TPR) for keyphrase extraction.

Calculate Ranking Scores by TPR

$$ R_z(w_i) = \lambda \sum_{j: w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R_z(w_j) + (1 - \lambda)\, p_z(w_i). \tag{1} $$

Three settings of the preference value p_z(w):
p_z(w) = pr(w|z), probability of word w given topic z.
p_z(w) = pr(z|w), probability of topic z given word w.
p_z(w) = pr(w|z) × pr(z|w), product of hub and authority.
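
The update in Eq. (1) can be sketched as a plain power iteration for a single topic z. This is a minimal illustration, not the authors' implementation: it reads e(w_j, w_i) as the weight of the edge from w_j to w_i and O(w_j) as the summed weight of w_j's outgoing edges, which the slide does not spell out. The function name `topical_pagerank`, the dictionary-based graph format, the default damping value and the stopping rule are all assumptions made for the example.

```python
from collections import defaultdict


def topical_pagerank(edges, preference, damping=0.3, iterations=100, tol=1e-8):
    """Iterate Eq. (1) for one topic z.

    edges:      dict mapping (w_j, w_i) -> edge weight e(w_j, w_i)  (assumed format)
    preference: dict mapping word -> p_z(w); normalized to sum to 1 here
    damping:    lambda in Eq. (1); 0.3 is an arbitrary default for this sketch
    """
    words = set(preference)
    for src, dst in edges:
        words.update((src, dst))

    out_weight = defaultdict(float)   # O(w_j): total outgoing weight (assumed meaning)
    incoming = defaultdict(list)      # w_i -> [(w_j, e(w_j, w_i)), ...]
    for (src, dst), weight in edges.items():
        out_weight[src] += weight
        incoming[dst].append((src, weight))

    total_pref = sum(preference.get(w, 0.0) for w in words)
    pref = {w: preference.get(w, 0.0) / total_pref for w in words}

    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        new_scores = {}
        for w in words:
            walk = sum(weight / out_weight[src] * scores[src]
                       for src, weight in incoming[w])
            new_scores[w] = damping * walk + (1.0 - damping) * pref[w]
        delta = sum(abs(new_scores[w] - scores[w]) for w in words)
        scores = new_scores
        if delta < tol:   # stop once the scores no longer change noticeably
            break
    return scores
```

Running the function once per topic, each time with that topic's preference values, gives one score table R_z per topic. Any of the three preference settings listed above can be passed in as `preference`.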

Extract Keyphrases Using Ranking Scores

Candidate phrases: noun phrases (Hulth, 2003) matching the pattern (adjective)*(noun)+
Document topic distribution: pr(z|d) for each topic z.
Phrase score:

$$ R(p) = \sum_{z=1}^{K} R_z(p) \times pr(z|d). \tag{2} $$
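
The two steps on this slide can be sketched as follows. This is a hedged illustration under stated assumptions: the input is assumed to be already POS-tagged with Penn Treebank style tags (`JJ*` for adjectives, `NN*` for nouns), and the per-topic phrase score R_z(p) is taken to be the sum of the word scores R_z(w) of the words in the phrase, which the slide does not state. The function names and data layouts are hypothetical.

```python
def candidate_phrases(tagged_tokens):
    """Collect maximal runs matching (adjective)*(noun)+ from (word, tag) pairs."""
    phrases, current, has_noun = [], [], False
    # A sentinel token flushes the last open candidate.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            # An adjective that follows a noun run opens a new candidate.
            current = [word] if tag.startswith("JJ") else []
            has_noun = False
    return phrases


def phrase_score(phrase, word_scores_by_topic, doc_topic_dist):
    """Eq. (2): R(p) = sum over z of R_z(p) * pr(z|d).

    word_scores_by_topic: dict topic -> {word: R_z(word)}
    doc_topic_dist:       dict topic -> pr(z|d)
    R_z(p) is assumed to be the sum of R_z(w) over the words of the phrase.
    """
    total = 0.0
    for topic, topic_prob in doc_topic_dist.items():
        word_scores = word_scores_by_topic[topic]
        r_z_p = sum(word_scores.get(w, 0.0) for w in phrase.split())
        total += r_z_p * topic_prob
    return total
```

Ranking all candidates by `phrase_score` and keeping the top M gives the extracted keyphrases.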

Examples
Document: "Arafat Says U.S. Threatening to Kill PLO Officials"

Figure panels: (a) Topic on “Terrorism”; (b) Topic on “Israel”; (c) Topic on “U.S.”; (d) TPR Result

Experiments

1. Datasets
NEWS: 308 news articles in DUC2001
RESEARCH: 2,000 abstracts of research articles (Hulth, 2003)

2. Evaluation Metrics
Precision, recall, F-measure:

$$ p = \frac{c_{correct}}{c_{extract}}, \quad r = \frac{c_{correct}}{c_{standard}}, \quad f = \frac{2pr}{p + r} \tag{3} $$

Binary preference measure (Bpref):

$$ \mathrm{Bpref} = \frac{1}{R} \sum_{r \in R} \left( 1 - \frac{|n \text{ ranked higher than } r|}{M} \right) \tag{4} $$

Mean reciprocal rank (MRR):

$$ \mathrm{MRR} = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d} \tag{5} $$
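
The metrics in Eqs. (3) to (5) can be sketched in a few lines. This is an illustrative reading rather than the evaluation script used for the reported numbers: it assumes that R is the set of correct phrases among the M extracted ones, that n ranges over the incorrect extracted phrases, and that rank_d is the rank of the first correct phrase for document d. It also scores a single document in Eqs. (3) and (4), whereas counts could equally be pooled over a corpus. All function names are hypothetical.

```python
def precision_recall_f(extracted, gold):
    """Eq. (3) for one document: extracted is a ranked list, gold the reference set."""
    gold = set(gold)
    c_correct = sum(1 for phrase in extracted if phrase in gold)
    p = c_correct / len(extracted) if extracted else 0.0
    r = c_correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


def bpref(extracted, gold):
    """Eq. (4): penalize each correct phrase by the incorrect ones ranked above it."""
    gold = set(gold)
    m = len(extracted)
    incorrect_seen = 0
    n_correct = 0
    total = 0.0
    for phrase in extracted:
        if phrase in gold:
            n_correct += 1
            total += 1.0 - incorrect_seen / m
        else:
            incorrect_seen += 1
    return total / n_correct if n_correct else 0.0


def mean_reciprocal_rank(extracted_per_doc, gold_per_doc):
    """Eq. (5): average of 1 / rank of the first correct phrase per document."""
    total = 0.0
    for extracted, gold in zip(extracted_per_doc, gold_per_doc):
        gold = set(gold)
        for rank, phrase in enumerate(extracted, start=1):
            if phrase in gold:
                total += 1.0 / rank
                break   # only the first correct phrase counts
    return total / len(extracted_per_doc) if extracted_per_doc else 0.0
```

Precision, recall and F-measure ignore the order of the extracted phrases. Bpref and MRR are sensitive to the ranking.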

Influences of Parameters - The Number of Topics K

K      Pre.   Rec.   F.     Bpref  MRR
50     0.268  0.330  0.296  0.204  0.632
100    0.276  0.340  0.304  0.208  0.632
500    0.284  0.350  0.313  0.215  0.648
1000   0.282  0.348  0.312  0.214  0.638
1500   0.282  0.348  0.311  0.214  0.631

Table: Influence of the number of topics K when the number of keyphrases M = 10 on NEWS.

Influences of Parameters - Damping Factor λ

Figure: F-measure of TPR with λ = 0.1, 0.3, 0.5, 0.7 and 0.9 when M ranges from 1 to 20 on NEWS. (Axes: keyphrase number vs. F-measure.)

Different Preference Values

Pref      Pre.   Rec.   F.     Bpref  MRR
pr(w|z)   0.256  0.316  0.283  0.192  0.584
pr(z|w)   0.282  0.348  0.312  0.214  0.638
prod      0.259  0.320  0.286  0.193  0.587

Table: Influence of three preference value settings when the number of keyphrases M = 10 on NEWS.

Comparing with Baseline Methods

Method    Pre.   Rec.   F.     Bpref  MRR
TFIDF     0.239  0.295  0.264  0.179  0.576
PageRank  0.242  0.299  0.267  0.184  0.564
LDA       0.259  0.320  0.286  0.194  0.518
TPR       0.282  0.348  0.312  0.214  0.638

Table: Comparing results on NEWS when the number of keyphrases M = 10.

Method    Pre.   Rec.   F.     Bpref  MRR
TFIDF     0.333  0.173  0.227  0.255  0.565
PageRank  0.330  0.171  0.225  0.263  0.575
LDA       0.332  0.172  0.227  0.254  0.548
TPR       0.354  0.183  0.242  0.274  0.583

Table: Comparing results on RESEARCH when the number of keyphrases M = 5.

Comparing with Baseline Methods

Figure: Precision-recall results on NEWS, M ranges from 1 to 20. (Axes: precision vs. recall; curves: TFIDF, PageRank, LDA, TPR.)

Figure: Precision-recall results on RESEARCH, M ranges from 1 to 10. (Axes: precision vs. recall; curves: TFIDF, PageRank, LDA, TPR.)

Conclusion

TPR outperforms all baselines on both datasets.
TPR enjoys the advantages of both LDA and TFIDF/PageRank methods.
Bpref and MRR serve as supplemental metrics for evaluation.

Thank you! Questions?
My homepage: http://nlp.csai.tsinghua.edu.cn/~hwy/
