An Innovative Biological Named Entity Recognition System Using Support Vector Machine and Web Evidence

Qiong Wu, Michael Gribskov
Department of Biological Sciences, Purdue University, West Lafayette, IN 47906

Abstract: The number of electronic biological publications is growing rapidly, motivating the development of effective automated information retrieval systems. Identifying named entities in literature is an essential part of many projects. We propose an SVM-based biological named entity recognition system that utilizes a small set of corpus-specific contextual features and supportive web evidence. Both the features and supportive evidence are obtained with little human intervention. It has been observed that gene/protein names frequently co-occur with a restricted set of local contextual words, and therefore the significance of the co-occurrence with those contextual words indicates the likelihood that a candidate word is a gene/protein name. We collected 35,384 sentences from 724 full-text articles that have at least one of 1153 pre-identified gene/protein names, and selected 43 contextual features from meaningful words that most frequently co-occur with these gene/protein names at sentence level. For each selected contextual term, supportive web evidence, i.e., the ratio of the number of web PDF documents that contain both the candidate and contextual terms compared to the number containing only the candidate term, is identified. An SVM model is then trained on each set of such ratio vectors. During the prediction stage, we also consider the TF-IDF value, which limits the search for gene names to the most meaningful terms in the texts, to further refine the SVM-predicted positives. Our system’s performance is comparable to ABNER on unseen texts and achieves an F1-score of 0.496, while requiring far fewer features and allowing simple adaptation to any corpus.

INTRODUCTION

[Fig 2 plot residue. Panel (d), "ROC on Unseen Abstract after Feature Reduction" (y-axis: True Positive Rate): Top 13: AUC=0.5482; Top 19: AUC=0.5807; Top 25: AUC=0.7539; All: AUC=0.8129. A second legend reads: Top 13: AUC=0.9490; Top 19: AUC=0.9916; Top 25: AUC=0.9796; All: AUC=0.9844. One panel uses "Rank" as an axis label.]

1. Contextual feature extraction
35,384 sentences were collected from 724 full-text articles, each of which contains at least one of 1153 pre-identified gene/protein names. Contextual features with various word stems were hand-selected from the pool of meaningful words that occur most frequently in the 5-word windows centered on the gene/protein names.

2. Training set preparation
1000 AGI names (e.g. AT1G01080), 1000 gene aliases (e.g. COP1) and 1000 enzymes (e.g. dehydratase) in Arabidopsis were randomly selected from the TAIR [1] database. All of them are treated as known positive cases and are assumed to share similar contextual patterns with genes/proteins in the plant UBQ system. Random words from articles in a knowledge domain other than plant genomics served as negative cases, whose contextual patterns should be distinct from those of genes/proteins.

3. Search for web evidence
Instead of human-annotated values, supportive web evidence [2] was automatically retrieved to characterize each feature. The whole web was searched through the Yahoo! Search BOSS service [3]. Each term in the dataset was queried with and without a context, and the feature value is defined as the ratio of the two hit counts:

feature value = (number of web documents containing both the term and the context) / (number of web documents returned for the term alone)
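The web-evidence feature value from step 3 can be illustrated with a minimal Python sketch. The `hit_count` callable stands in for whatever search backend returns the number of matching documents. It, the function names, the zero-hit convention, and the toy counts are assumptions made for illustration; they are not the original system's code, which queried Yahoo! Search BOSS.

```python
from typing import Callable, Dict, List


def evidence_ratio(term: str, context: str, hit_count: Callable[[str], int]) -> float:
    """Hits for 'term + context' divided by hits for 'term' alone."""
    alone = hit_count(term)
    if alone == 0:
        return 0.0  # assumption: a term with no hits carries no evidence
    together = hit_count(f"{term} {context}")
    return together / alone


def ratio_vector(term: str, contexts: List[str], hit_count: Callable[[str], int]) -> List[float]:
    """One ratio per contextual feature; this vector is what the SVM would see."""
    return [evidence_ratio(term, c, hit_count) for c in contexts]


# Made-up hit counts standing in for a real search service (hypothetical data).
FAKE_COUNTS: Dict[str, int] = {
    "COP1": 200,
    "COP1 gene": 50,
    "COP1 protein": 100,
    "COP1 mutant": 20,
}


def fake_hit_count(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


vec = ratio_vector("COP1", ["gene", "protein", "mutant"], fake_hit_count)
print(vec)  # [0.25, 0.5, 0.1]
```

Passing the search function in as a parameter keeps the ratio logic independent of any particular search API, which matters here because the feature definition only needs two hit counts per term/context pair.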

[Figure residue. Fig 2 panel (c): "ROC on Mixed-Type after Feature Reduction" (axes: False Positive Rate, True Positive Rate). Panel titled "Feature F-score" (axis: F-score). Fig 1 flowchart labels: Unseen abstract/full text; Web search by BOSS (gene + context, gene alone); Web Docs; Ratio vectors; Feature selection; Support Vector Machine; TFIDF; Predicted gene/protein.]

The prototype of the proposed system was implemented for a collection of literature on the plant ubiquitin/proteasome (UBQ) system. Two key steps in the learning process involve different training data: extraction of contextual features from the corpus uses known UBQ genes/proteins, while characterization of feature patterns relies on another set of genes/proteins from a public plant genomics database.
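The first of these two steps, counting words in 5-word windows around known names, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the tokenizer, the stop-word list, the example sentences, and the reading of "5-word window" as the name plus two words on each side are all hypothetical. In the original work the final features were hand-selected from the most frequent words.

```python
import re
from collections import Counter
from typing import Iterable, Set


def context_counts(sentences: Iterable[str], known_names: Iterable[str],
                   window: int = 5, stop_words: Set[str] = frozenset()) -> Counter:
    """Count words that fall inside a window centered on a known name."""
    half = window // 2  # window=5 means 2 words on each side of the name
    names = {n.lower() for n in known_names}
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok not in names:
                continue
            lo = max(0, i - half)
            hi = min(len(tokens), i + half + 1)
            for j in range(lo, hi):
                word = tokens[j]
                # Skip the name itself, other known names, and stop words.
                if j != i and word not in names and word not in stop_words:
                    counts[word] += 1
    return counts


# Hypothetical sentences and stop words, for illustration only.
example_sentences = [
    "The COP1 protein interacts with HY5 in the dark.",
    "Expression of the COP1 gene is regulated by light.",
]
counts = context_counts(example_sentences, ["COP1"],
                        stop_words={"the", "of", "is", "with", "in"})
print(counts.most_common())
```

Words such as "protein" and "gene" rise to the top of such a count, which is the kind of candidate pool from which contextual features would then be chosen.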

[Fig 1 flowchart label: Known plant genes/proteins]

METHODS

Fig 1. System Flowchart

[Fig 2 legend: agi: AUC=1; alias: AUC=0.9867; enzyme: AUC=0.9985; mixed type: AUC=0.9844; abstract: AUC=0.8129 (y-axis: True Positive Rate)]

Identifying key biological concepts (e.g., gene, protein, and chemical names) in full-text literature is a crucial part of many projects. Previous work on recognizing biological named entities relies heavily on expert curation and complicated rule-based systems, and the lack of deeply annotated corpora is a major obstacle to applying such systems to literature in a new biological area. Moreover, existing NER systems were trained on specific corpora and therefore may not transfer easily to other biological corpora if the representative features differ significantly. To address these issues, we developed an innovative SVM-based biological named entity recognition system that uses corpus-specific contextual features and web-retrieved feature values, and that can serve as a starting point for information extraction and knowledge organization in emerging fields in genomics.

[Fig 1 flowchart label: Context extraction using known UBQ genes/proteins]
[Fig 2 panel (a): "ROC with All Features" (x-axis: False Positive Rate)]

Fig 2. (a) Model performance for different types of names/texts. 5-fold CV accuracies for agi, alias, enzyme and mixed types are 99.88%, 93.81%, 99.56% and 95.3%, respectively. (b) Plot of feature F-scores. Features are ranked by their ability to distinguish the two classes in order to identify the most informative ones and to reduce redundancy caused by multiple features sharing the same word stem. Three potential cutoff ranks are identified: 13, 19 and 25. (c), (d) Effects of feature reduction on identifying gene/protein mentions in mixed-type names and UBQ abstracts.

                ABNER               SVM Full            SVM Full+TFIDF      Best Full+TFIDF
Precision       42.40% (511/1206)   32.47% (664/2045)   37.24% (510/1373)   37.14% (585/1575)
Recall          69.20% (511/738)    89.97% (664/738)    69.11% (510/738)    79.27% (585/738)
F1 measure      0.526               0.477               0.4832              0.5058
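The relationship between the counts and the F1 values reported above can be checked with a short Python sketch. It assumes the fractions are (true positives / predicted positives) for precision and (true positives / actual positives) for recall; the function and variable names are illustrative.

```python
from typing import Tuple


def precision_recall_f1(true_pos: int, predicted_pos: int, actual_pos: int) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = true_pos / predicted_pos
    recall = true_pos / actual_pos
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts as they appear in the comparison above:
# (true positives, predicted positives, actual positives)
reported = {
    "ABNER": (511, 1206, 738),
    "SVM Full": (664, 2045, 738),
    "SVM Full+TFIDF": (510, 1373, 738),
    "Best Full+TFIDF": (585, 1575, 738),
}

for name, (tp, pred, actual) in reported.items():
    p, r, f1 = precision_recall_f1(tp, pred, actual)
    print(f"{name}: precision={p:.4f} recall={r:.4f} F1={f1:.3f}")
```

Because F1 is the harmonic mean, the large recall gain of the SVM models is offset by their lower precision, which is why the F1 values end up close to ABNER's.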

RESULTS

4. Model training and feature selection
A binary SVM [4] classifier is trained separately on each of the three types of biological entities mentioned above and on all types mixed. Features are ranked by F-score, which measures how well each feature distinguishes the two classes.
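Feature ranking by F-score can be sketched in Python as below. The source does not spell out the formula, so this assumes the F-score commonly used for feature selection alongside LIBSVM: between-class separation of the feature means divided by the sum of the within-class sample variances. The function names and the toy ratio vectors are hypothetical.

```python
from statistics import mean, variance
from typing import List, Sequence, Tuple


def f_score(pos_values: List[float], neg_values: List[float]) -> float:
    """Between-class separation over within-class spread for one feature."""
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)  # sample variances (n - 1)
    # Assumption: a feature with zero spread in both classes is ranked first.
    return between / within if within > 0 else float("inf")


def rank_features(pos_rows: Sequence[Sequence[float]],
                  neg_rows: Sequence[Sequence[float]],
                  names: Sequence[str]) -> List[Tuple[str, float]]:
    """Return (feature name, F-score) pairs, most discriminative first."""
    scored = []
    for j, name in enumerate(names):
        pos = [row[j] for row in pos_rows]
        neg = [row[j] for row in neg_rows]
        scored.append((name, f_score(pos, neg)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy ratio vectors (made up): columns are the "gene" and "function" contexts.
positives = [[0.30, 0.10], [0.40, 0.12], [0.35, 0.08]]
negatives = [[0.02, 0.09], [0.05, 0.11], [0.03, 0.10]]
ranking = rank_features(positives, negatives, ["gene", "function"])
print(ranking)
```

A cutoff rank then keeps only the top-scoring features before the classifier is retrained.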

5. Model assessment
The SVM classifier trained on mixed types was tested on 50 abstracts from the UBQ system literature. Positives predicted by the model were further filtered by TFIDF rank. Model performance was compared with that of ABNER [5] on the same texts.
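The TFIDF filtering step can be sketched as follows. The source does not state the exact TFIDF variant or the rank cutoff, so the formula used here (relative term frequency times the natural log of inverse document frequency), the `top_k` parameter, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(doc_tokens: List[str], corpus_tokens: List[List[str]]) -> Dict[str, float]:
    """TF-IDF of each term in one document relative to a corpus of token lists."""
    n_docs = len(corpus_tokens)
    doc_freq: Counter = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(doc_tokens)
    return {
        term: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
        if doc_freq[term] > 0
    }


def filter_by_tfidf(predicted: List[str], scores: Dict[str, float], top_k: int) -> List[str]:
    """Keep only predicted positives that rank among the top_k TF-IDF terms."""
    top_terms = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return [term for term in predicted if term in top_terms]


# Toy corpus (made up); the first document is the one being tagged.
corpus = [
    ["cop1", "protein", "binds", "the", "target"],
    ["the", "gene", "is", "expressed"],
    ["the", "protein", "is", "stable"],
]
scores = tfidf_scores(corpus[0], corpus)
kept = filter_by_tfidf(["cop1", "the", "protein"], scores, top_k=3)
print(kept)  # ['cop1']
```

Terms that appear in every document, such as "the", receive a score of zero and are dropped, which is how the filter trades some recall for precision.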

CONCLUSIONS

This work proposes a new and simpler algorithmic approach to identifying biological named entities, with performance comparable to that of a state-of-the-art NER package. With objectively defined features and automatically web-retrieved feature values, our model provides a promising way to handle the lack of training cases and annotated texts in many biological fields, and can serve as a starting point for more sophisticated learning schemes or manual curation.

CONTACT

Please contact [email protected] or [email protected] with questions.

Extracted contextual features
The feature set includes the 43 contextual words that occur most frequently in the close neighborhood of genes/proteins. These can be represented by 12 keywords: “gene”, “protein”, “mutant”, “transcript”, “function”, “bind”, “interact”, “express”, “over-expression”, “activate”, “promote” and “regulate”.

Performance on single types
The three types of positive cases (AGI, gene alias, and enzyme) represent three levels of difficulty in recognizing biological name mentions, with AGI being the most formulaic and easiest to identify and gene alias the hardest. The classifier for AGIs achieves the highest accuracy and AUC, followed by the classifiers for enzyme and mixed types; prediction accuracy is lowest for gene aliases. Accuracy above 90% for all types indicates the potential of the proposed model.

Performance on real abstracts
AGI, gene alias, and enzyme names are common in plant genomics articles, and the feature patterns observed in the mixture of the three are expected to mimic the patterns of target names in real literature. Compared with the benchmark package, our model offers a higher recall at the expense of reduced precision. Incorporating the TFIDF measure improved the precision by 14.7%, while the recall remained comparable to that of ABNER. The best-performing model achieved an F1 score similar to ABNER's and a much higher recall.

REFERENCES

1. Huala, E., et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant, Nucleic Acids Research, 29, 102-105.
2. Brewster, C., et al. (2009) Issues in learning an ontology from text, BMC Bioinformatics, 10, S1.
3. BOSS (Build your Own Search Service): http://developer.yahoo.com/search/boss/
4. Chang, C.-C. and Lin, C.-J. (2011) LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2, 1-27.
5. Settles, B. (2005) ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text, Bioinformatics, 21, 3191-3192.