Automatic Keyphrase Extraction via Topic Decomposition
Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun
Presenter: Wenyi Huang
Department of Computer Science and Technology
State Key Lab on Intelligent Technology and Systems
National Lab for Information Science and Technology
Tsinghua University
Oct 9, 2010
Wenyi Huang Dept. CS&T, THU
Introduction
What is keyphrase extraction?
Methods:
Supervised: Learning algorithms for keyphrase extraction (Turney, 2000)
Unsupervised: TFIDF; TextRank: Bringing order into texts (Mihalcea and Tarau, 2004)
Motivation
What about topics?
Relevance: Good keyphrases should be relevant to the major topics of the given document.
Coverage: An appropriate set of keyphrases should also have good coverage of the document's major topics.
Building Topic Interpreters
Method: Latent Dirichlet Allocation (LDA)
Dataset: Wikipedia snapshot of March 2008
Figure: An example of probabilistic topic model
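The talk later uses both pr(w|z) and pr(z|w) as topic-specific preference values. Given the topic-word distributions pr(w|z) that a trained LDA model provides, the reverse direction pr(z|w) follows from Bayes' rule, pr(z|w) ∝ pr(w|z)·pr(z). A minimal sketch with toy numbers (not trained parameters, just assumed values for illustration):

```python
# Derive pr(z|w) from pr(w|z) via Bayes' rule: pr(z|w) ∝ pr(w|z) * pr(z).
# The topic-word probabilities and topic prior below are toy values.

def pr_z_given_w(word, pr_w_given_z, pr_z):
    """pr_w_given_z: list of {word: prob} per topic; pr_z: topic prior."""
    scores = [pr_w_given_z[z].get(word, 0.0) * pr_z[z] for z in range(len(pr_z))]
    total = sum(scores)
    return [s / total for s in scores] if total else scores

pr_w_given_z = [{"bank": 0.08, "money": 0.12},   # toy topic 0: finance
                {"bank": 0.05, "river": 0.10}]   # toy topic 1: nature
pr_z = [0.5, 0.5]                                # uniform topic prior (assumed)
posterior = pr_z_given_w("bank", pr_w_given_z, pr_z)
```

With these toy numbers "bank" leans toward the finance topic, since 0.08 > 0.05 under a uniform prior.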
Topic-Decomposed PageRank
Figure: Topical PageRank (TPR) for keyphrase extraction.
Calculate Ranking Scores by TPR
R_z(w_i) = \lambda \sum_{j: w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R_z(w_j) + (1 - \lambda) p_z(w_i).   (1)

Three settings for the topic-specific preference value p_z(w):
p_z(w) = pr(w|z), the probability of word w given topic z.
p_z(w) = pr(z|w), the probability of topic z given word w.
p_z(w) = pr(w|z) × pr(z|w), the product of hub and authority.
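The per-topic random walk of Eq. (1) is a personalized PageRank whose jump distribution is the topic-specific preference p_z(w). A minimal sketch via power iteration; the word graph, edge weights, and preference values below are illustrative toy inputs, not the paper's data:

```python
# Topical PageRank (TPR), Eq. (1): for topic z, iterate
#   R_z(w_i) = lam * sum_{w_j -> w_i} e(w_j, w_i)/O(w_j) * R_z(w_j)
#              + (1 - lam) * p_z(w_i)
# where O(w_j) is the total outgoing edge weight of w_j.

def topical_pagerank(words, edges, p_z, lam=0.9, iters=100):
    """edges: dict mapping (w_j, w_i) -> weight e(w_j, w_i)."""
    out = {w: 0.0 for w in words}          # out-degree O(w_j)
    for (wj, wi), e in edges.items():
        out[wj] += e
    R = {w: 1.0 / len(words) for w in words}  # uniform start
    for _ in range(iters):
        R = {wi: lam * sum(e / out[wj] * R[wj]
                           for (wj, w), e in edges.items()
                           if w == wi and out[wj] > 0)
                 + (1 - lam) * p_z.get(wi, 0.0)
             for wi in words}
    return R

# Toy co-occurrence graph over three words and a toy preference pr(w|z).
words = ["topic", "model", "keyphrase"]
edges = {("topic", "model"): 1.0, ("model", "topic"): 1.0,
         ("model", "keyphrase"): 1.0, ("keyphrase", "model"): 1.0}
p_z = {"topic": 0.5, "model": 0.3, "keyphrase": 0.2}
R = topical_pagerank(words, edges, p_z)
```

Because p_z sums to 1 and every node has outgoing edges, the scores also sum to 1; with this toy graph "model" (the best-connected word) ranks highest.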
Extract Keyphrases Using Ranking Scores
Candidate phrases: noun phrases matching (adjective)*(noun)+ (Hulth, 2003).
Document topic distribution: pr(z|d) for each topic z.
Phrase score:

R(p) = \sum_{z=1}^{K} R_z(p) \times pr(z|d).   (2)
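Eq. (2) can be sketched as follows. A phrase's topic-specific rank R_z(p) is assumed here to be the sum of its words' TPR scores; the POS pattern, per-topic word ranks, and document topic distribution are all toy values for illustration:

```python
import re

# Candidate filter: (adjective)*(noun)+ as a schematic regex over
# space-separated POS tags (JJ = adjective, NN = noun).
CANDIDATE = re.compile(r"(?:JJ\s)*(?:NN\s?)+")

def phrase_score(phrase_words, R_by_topic, pr_z_given_d):
    """Eq. (2): R(p) = sum_z R_z(p) * pr(z|d), with R_z(p) taken as
    the sum of the phrase's word scores under topic z (assumption)."""
    score = 0.0
    for z, pr_zd in enumerate(pr_z_given_d):
        R_z_p = sum(R_by_topic[z].get(w, 0.0) for w in phrase_words)
        score += R_z_p * pr_zd
    return score

# Two toy topics with toy per-word TPR scores, and a toy pr(z|d).
R_by_topic = [
    {"topic": 0.27, "model": 0.49, "keyphrase": 0.24},
    {"topic": 0.10, "model": 0.20, "keyphrase": 0.70},
]
pr_z_given_d = [0.6, 0.4]
s = phrase_score(["topic", "model"], R_by_topic, pr_z_given_d)
```

Weighting by pr(z|d) is what gives the coverage property from the Motivation slide: phrases strong only in a minor topic of the document are discounted.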
Examples
Example document: "Arafat Says U.S. Threatening to Kill PLO Officials"
Figure: (a) Topic on "Terrorism"; (b) Topic on "Israel"; (c) Topic on "U.S."; (d) TPR result.
Experiments
1. Datasets
   NEWS: 308 news articles in DUC2001.
   RESEARCH: 2,000 abstracts of research articles (Hulth, 2003).
2. Evaluation Metrics
   Precision, recall, F-measure:

   p = \frac{c_{correct}}{c_{extract}}, \quad r = \frac{c_{correct}}{c_{standard}}, \quad f = \frac{2pr}{p + r}.   (3)

   Binary preference measure (Bpref):

   Bpref = \frac{1}{R} \sum_{r \in R} \left(1 - \frac{|n \text{ ranked higher than } r|}{M}\right).   (4)

   Mean reciprocal rank (MRR):

   MRR = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d}.   (5)
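The three metrics in Eqs. (3)-(5) can be sketched directly on toy ranked outputs. The Bpref reading below (M = number of extracted candidates, n ranging over incorrect extractions ranked above a correct one) is one interpretation of the slide's notation:

```python
def prf(c_correct, c_extract, c_standard):
    """Eq. (3): precision, recall, F-measure from raw counts."""
    p = c_correct / c_extract
    r = c_correct / c_standard
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def bpref(ranked, gold):
    """Eq. (4), one reading: for each correct phrase, penalize by the
    fraction of incorrect phrases ranked above it (out of M extracted)."""
    correct = [x for x in ranked if x in gold]
    if not correct:
        return 0.0
    M = len(ranked)
    total = 0.0
    for x in correct:
        n_above = sum(1 for y in ranked[:ranked.index(x)] if y not in gold)
        total += 1 - n_above / M
    return total / len(correct)

def mrr(first_ranks):
    """Eq. (5): mean reciprocal rank of the first correct keyphrase
    per document, averaged over the document set D."""
    return sum(1.0 / r for r in first_ranks) / len(first_ranks)
```

For example, with toy output ["a", "x", "b", "y"] and gold set {"a", "b"}: "a" has no incorrect phrase above it, "b" has one of four, so Bpref = (1 + 0.75) / 2 = 0.875.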
Influences of Parameters - The Number of Topics K
K      Pre.    Rec.    F.      Bpref   MRR
50     0.268   0.330   0.296   0.204   0.632
100    0.276   0.340   0.304   0.208   0.632
500    0.284   0.350   0.313   0.215   0.648
1000   0.282   0.348   0.312   0.214   0.638
1500   0.282   0.348   0.311   0.214   0.631
Table: Influence of the number of topics K when the number of keyphrases M = 10 on NEWS.
Influences of Parameters - Damping Factor λ
Figure: F-measure of TPR with λ = 0.1, 0.3, 0.5, 0.7 and 0.9 when M ranges from 1 to 20 on NEWS.
Different Preference Values
Pref               Pre.    Rec.    F.      Bpref   MRR
pr(w|z)            0.256   0.316   0.283   0.192   0.584
pr(z|w)            0.282   0.348   0.312   0.214   0.638
pr(w|z) × pr(z|w)  0.259   0.320   0.286   0.193   0.587
Table: Influence of three preference value settings when the number of keyphrases M = 10 on NEWS.
Comparing with Baseline Methods

Method     Pre.    Rec.    F.      Bpref   MRR
TFIDF      0.239   0.295   0.264   0.179   0.576
PageRank   0.242   0.299   0.267   0.184   0.564
LDA        0.259   0.320   0.286   0.194   0.518
TPR        0.282   0.348   0.312   0.214   0.638

Table: Comparing results on NEWS when the number of keyphrases M = 10.

Method     Pre.    Rec.    F.      Bpref   MRR
TFIDF      0.333   0.173   0.227   0.255   0.565
PageRank   0.330   0.171   0.225   0.263   0.575
LDA        0.332   0.172   0.227   0.254   0.548
TPR        0.354   0.183   0.242   0.274   0.583

Table: Comparing results on RESEARCH when the number of keyphrases M = 5.
Comparing with Baseline Methods
Figure: Precision-recall results on NEWS, M ranges from 1 to 20.
Figure: Precision-recall results on RESEARCH, M ranges from 1 to 10.
Conclusion
TPR outperforms all baselines on both datasets.
TPR combines the advantages of both LDA and the TFIDF/PageRank methods.
Bpref and MRR serve as supplemental evaluation metrics.
Thank You! Questions?
Homepage: http://nlp.csai.tsinghua.edu.cn/~hwy/