Special Topics in Computer Science
NLP in a Nutshell
CS492B Spring Semester 2009
Speaker: Hodong Lee
Professor: Jong C. Park
Computer Science Department, Korea Advanced Institute of Science and Technology

Source: Jong C. Park and Jung-jae Kim, "Named Entity Recognition," Chapter 6 of the book "Text Mining for Biology and Biomedicine", Artech House, 2006.

TEXT MINING APPLICATIONS: TERMINOLOGY AND NER
Jong C. Park, CS Dept., KAIST
CS492B: Spring 2009

Introduction
What is Named Entity Recognition (NER)?
  - The task of recognizing entity-denoting expressions, i.e., Named Entities (NEs), in natural language documents
  - Examples of NEs: genes, proteins, cells, and diseases
Cf. Automatic Term Recognition (ATR)
  - The goal of ATR is to associate a given term with a concept in a well-defined semantic framework
  - The goal of NER is to relate each named entity of importance to an individual in the real world

Introduction: Difficulties
Task-independent issues
  - Ambiguities: the protein name ARF may denote either a small GTP-binding protein or a tumor suppressor gene product
  - Coinages: the protein name p53 does not describe the function of the protein, only its molecular weight
  - Aliases: the gene officially designated as SELL, or selectin L, is currently known to have as many as 15 aliases
Task-dependent issues
  - NER for Information Extraction (IE) systems is more complicated than NER for Text Retrieval (TR) systems

Introduction: NER for IE
Anaphora resolution
  - Arfophilin is an ADP ribosylation factor (Arf) binding protein of unknown function. It is identical to the Rab11 binding protein eferin/Rab11-FIP3, and we show it binds both Arf 5 and Rab11. (PMID:12857874)
  - In the extracted tuple (It, binds, Arf 5 and Rab11), 'It' should be replaced with Arfophilin
Species information identification for proteins
  - Plant DNA polymerases and E. coli DNA polymerase I, but not animal DNA polymerases or avian reverse transcriptase, are strongly stimulated by ethidium bromide (EdBr) … (PMID:6821157)
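
The tuple correction in the anaphora example can be sketched in a few lines of Python. This is only a naive illustration: the function names are hypothetical, and taking the first token of the preceding sentence as the antecedent is an assumption that happens to work for this example, not a general anaphora resolution method.

```python
# Naive sketch: replace a pronoun subject in an extracted tuple with the
# first token of the preceding sentence. Real anaphora resolution is far
# more involved; this heuristic is an assumption made for illustration.
PRONOUNS = {"it", "they", "this", "these"}


def guess_antecedent(previous_sentence):
    """Return the first token of the previous sentence, or None if it is empty."""
    tokens = previous_sentence.split()
    return tokens[0] if tokens else None


def resolve_subject(extracted, previous_sentence):
    """Replace a pronoun subject of a (subject, relation, object) tuple if possible."""
    subject, relation, obj = extracted
    if subject.lower() in PRONOUNS:
        antecedent = guess_antecedent(previous_sentence)
        if antecedent is not None:
            subject = antecedent
    return (subject, relation, obj)


previous = ("Arfophilin is an ADP ribosylation factor (Arf) binding protein "
            "of unknown function.")
resolved = resolve_subject(("It", "binds", "Arf 5 and Rab11"), previous)
print(resolved)
```

Tuples whose subject is not a pronoun are returned unchanged.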

Biomedical Named Entities
What is a Named Entity (NE)?
  - A phrase or a combination of phrases that refers to a specific object or a group of objects
Examples
  - Newspapers: persons, organizations, places, and artifacts
  - Court opinions: courts and parties
  - Biomedical literature: genes, proteins, cells, drugs, chemicals, and diseases

Table 1. Example NEs of Biomedical Objects
  1. Genes: Tp53, agaR
  2. Proteins: p53, 'galactosidase, alpha (GLA)'
  3. Cells: CD4+-cells, Human malignant mesothelioma (HMMME)
  4. Drugs: Cyclosporine, herbimycin
  5. Chemicals: 5'-(N-ethylcarboxamido)adenosine (NECA)

Biomedical Named Entities: Resources

Table 2. Biomedical Databases and Resources
  1. Genes: Human Genome Nomenclature (http://www.gene.ucl.ac.kr/nomenclature/), GenBank (http://www.ncbi.nlm.nih.gov/Genebank/)
  2. Proteins: UniProt (http://www.expasy.org/sprot/), IPI (http://www.ensembl.org/IPI/)
  3. Cells: Cell database of Riken Bioresource Center (http://www.brc.riken.jp/inf/en/)
  4. Drugs: MedMaster (http://www.ashp.org/), USP DI (http://www.usp.org/)
  5. Chemicals: UMLS Metathesaurus (http://www.nlm.nih.gov/research/umls)
  6. Diseases: NCBI Genes and Diseases (http://www.ncbi.nlm.nih.gov/disease/), Disease Database (http://www.diseasesdatabase.com/)

Biomedical Named Entities: Gene and Protein Names
Gene and protein names in biomedical databases
  - Include special characters: uppercase letters, commas, hyphens, slashes, brackets, and digits
  - Use descriptive terms that suggest the characteristics of proteins

Table 3. Example Protein Names with Descriptive Terms
  Semantic Type of Descriptive Term    Example Protein Name
  Protein function                     Growth hormone
  Localization                         Nuclear protein
  Species origin                       HIV-1 envelope glycoprotein
  Physical property                    Salivary acidic protein
  Similarity to other proteins         Rho-like protein

Biomedical Named Entities: Gene and Protein Names
Gene and protein names in biomedical literature
  - Appear in various linguistic forms
      - Abbreviations, plurals, compounds, and anaphoric expressions
      - Prepositional phrases, relative phrases, and even expressions across sentences

Table 4. Example Gene and Protein Names in Various Linguistic Forms
  Linguistic Form    Example Gene and Protein Names
  Abbreviation       GLA (as in Table 6.2)
  Plural             p38 MAPKs, ERK1/2
  Compound           Rpg1p/Tif32p
  Coordination       91 and 84 kDa proteins
  Cascade            Kappa 3 binding factor (such that kappa 3 is a gene name)

Biomedical Named Entities: Gene and Protein Names

Table 5. Example Gene and Protein Names in Various Linguistic Forms
  Linguistic Form         Example Gene and Protein Names
  Anaphoric expression    it; this enzyme; a protein that does not bind RNA directly but inhibits the activity of eIF4f
  Acronym                 phospholipase D (PLD); C-Jun N-terminal kinase (JNK)
  Apposition              PD98059, specific MEK1/2 inhibitor; U0126 (known as the ERKs inhibitor)

Gene and Protein Name Recognition: Issues
Open issues
  - Ambiguous names
  - Synonyms
  - Variations
  - Names of newly discovered genes and proteins
  - Varying range of target names

Gene and Protein Name Recognition: Ambiguous Names
Ambiguous gene and protein names may denote:
  - Different genes and proteins
  - Common English words
      - Simple pattern matching for gene names shows extremely low precision: 2% for full texts and 7% for abstracts
      - The largest source of errors is gene names that share their form with common English words
  - Different classes of biomedical entities
      - Myc-c can be a gene name as well as a protein name, as in myc gene and myc-c protein
      - CD4 can be a protein name as well as a cell name, as in CD4 protein and CD4+-cells
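
To illustrate why simple pattern matching loses precision when gene names share their form with common English words, here is a small Python sketch. The lexicon, the sentence, and the gold annotation are toy data invented for illustration (WAS and CAT stand in for gene symbols that collide with English words); the figures it produces are unrelated to the 2% and 7% reported above.

```python
def match_names(tokens, lexicon):
    """Return indices of tokens that match the lexicon, ignoring case."""
    lowered = {name.lower() for name in lexicon}
    return [i for i, tok in enumerate(tokens) if tok.lower() in lowered]


def precision(predicted, gold):
    """Fraction of predicted indices that are also in the gold annotation."""
    if not predicted:
        return 0.0
    true_positives = len(set(predicted) & set(gold))
    return true_positives / len(predicted)


# Toy data (assumption): only the token at index 4 is a real gene mention.
toy_lexicon = {"p53", "WAS", "CAT"}
toy_tokens = "It was shown that p53 was induced in the cat liver".split()
toy_gold = [4]

toy_matches = match_names(toy_tokens, toy_lexicon)
toy_precision = precision(toy_matches, toy_gold)
print(toy_matches, toy_precision)
```

Here three of the four matches are ordinary English words, so case-insensitive matching alone yields a precision of 0.25 on this toy sentence.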

Gene/Protein Name Recognition: Ambiguous Names
What a gene/protein name refers to depends on:
  - Publication time
      - P21 formerly denoted a macromolecule associated with a cascade of signals from receptors at cell surfaces to the nucleus, but it currently denotes a different protein that inhibits the cell cycle
  - Species
      - The yeast homologue of the human gene PMS1 is called PMS2, whereas yeast PMS1 corresponds to human PMS2

Gene/Protein Name Recognition: Synonyms
Many aliases
  - HUGO Nomenclature includes more than 23,000 aliases among more than 21,000 human genes
  - Release 47.0 of Swiss-Prot contains more than 26,000 synonyms of protein names among approximately 180,000 entries
Some gene/protein names denote the same gene/protein that is homologous in different species
  - Drosophila and mouse genetics agree that armadillo from fruit flies and beta-catenin from mice are basically the same

Table 6. Example Synonyms of Gene and Protein Names
  1. caspase-3 or CASP3 or apoptosis-related cysteine protease or CPP32
  2. p21 or WAF1 or CIP1 or SDI1 or CAP20
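
One simple way to handle synonyms is an alias index that maps every known name to one representative name. The Python sketch below uses the two synonym groups of Table 6. Using the first name of each group as the representative is an arbitrary assumption made here, not an official nomenclature decision, and the function names are hypothetical.

```python
# Synonym groups taken from Table 6.
SYNONYM_GROUPS = [
    ["caspase-3", "CASP3", "apoptosis-related cysteine protease", "CPP32"],
    ["p21", "WAF1", "CIP1", "SDI1", "CAP20"],
]


def build_alias_index(groups):
    """Map each lowercased alias to the first name of its group."""
    index = {}
    for group in groups:
        representative = group[0]  # arbitrary choice of representative
        for alias in group:
            index[alias.lower()] = representative
    return index


ALIAS_INDEX = build_alias_index(SYNONYM_GROUPS)


def representative_name(name):
    """Return the representative name for an alias, or None if unknown."""
    return ALIAS_INDEX.get(name.lower())


print(representative_name("CPP32"))
print(representative_name("waf1"))
print(representative_name("ARF"))
```

Names missing from the index, such as ARF here, return None, which is the false-negative case that a fixed synonym list cannot avoid.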

Gene/Protein Name Recognition: Variations
Variations
  - Character-level variations
  - Word-level variations
  - Word-order variations
  - Syntactic variations
  - Variations with abbreviations

Gene/Protein Name Recognition: Variations

Table 7. Example Variations of Gene and Protein Names
(1) Character-level variations
  (a) D(2) or D2
  (b) SYT4 or SYT IV
  (c) CGA or IG alpha
  (d) S receptor kinase or S-receptor kinase
  (e) Thioredoxin h-type 1 or Thioredoxin h (THL1)
(2) Word-level variations
  (a) Rnase P protein or Rnase P
  (b) Interleukin-1 beta precursor or INTERLEUKIN 1-beta PROTEIN or INTERLEUKIN 1 beta
  (c) transcription intermediary factor-2 or transcriptional intermediate factor 2
  (d) the Ras guanine nucleotide exchange factor Sos or the Ras guanine nucleotide releasing protein Sos
(3) Word-order variations
  (a) Collagen type XIII alpha 1 or Alpha 1 type XIII collagen
  (b) integrin alpha 4 or alpha4 integrin
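
Some of the character-level and word-order variations in Table 7 can be collapsed by normalizing names before comparison. The Python sketch below is one possible normalization under stated assumptions: the Roman-numeral map is deliberately partial, the function names are hypothetical, and word-level variations such as "precursor" versus "PROTEIN" are not handled at all.

```python
import re

# Partial Roman-numeral map, enough for the examples (an assumption,
# not a full converter).
ROMAN_TO_ARABIC = {"ii": "2", "iii": "3", "iv": "4", "vi": "6"}


def normalize_tokens(name):
    """Lowercase, drop brackets/hyphens/slashes, split letter-digit boundaries."""
    text = name.lower()
    text = re.sub(r"[()\-/]", " ", text)
    # Insert a space wherever a letter meets a digit, e.g. "syt4" -> "syt 4".
    text = re.sub(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", " ", text)
    return [ROMAN_TO_ARABIC.get(tok, tok) for tok in text.split()]


def char_key(name):
    """Key that ignores character-level variation but keeps word order."""
    return " ".join(normalize_tokens(name))


def order_key(name):
    """Key that additionally ignores word order."""
    return tuple(sorted(normalize_tokens(name)))


print(char_key("D(2)") == char_key("D2"))
print(char_key("SYT4") == char_key("SYT IV"))
print(order_key("integrin alpha 4") == order_key("alpha4 integrin"))
```

Ignoring word order is a trade-off: it merges the variants in (3) but may also merge names that are genuinely distinct.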

Gene/Protein Name Recognition: New Gene/Protein Names
The great number of new genes and proteins makes it hard to register them on time
  - We need to develop rules and models that recognize novel gene and protein names by their common characteristics
  - Note that the guidelines for gene and protein nomenclature cannot distinguish gene and protein names from other terms and common words

Gene/Protein Name Recognition: Varying Range of Target Names
The linguistic forms to handle depend on the system
  - Indefinite phrases
      - Needed by systems extracting protein-protein interactions
      - Not needed by systems constructing an index of gene and protein names from biomedical documents
  - Names of protein families
      - Needed by systems extracting general biomedical interactions
      - Not needed by systems extracting only names of individual proteins
  - Adjectives modifying noun phrases that denote genes and proteins
      - e.g., eukaryotic in eukaryotic RhoA-binding kinases
  - A substring of a noun phrase as a gene or protein name
      - e.g., RhoA in eukaryotic RhoA-binding kinases

Approaches to NER: An Overview
Dictionary-based approaches
  - Find names listed in well-known nomenclatures
Rule-based approaches
  - Construct rules and patterns and match them against NEs
Machine learning approaches
  - Employ machine learning techniques to develop statistical models for gene and protein name recognition
Hybrid approaches
  - Merge two or more of the above approaches so that they complement one another

Approaches to NER: Dictionary-based Approach
Why a dictionary-based approach?
  - Fast development
  - Low cost
Terminological resources
  - Gene/protein databases
  - Biological ontologies
Limitations
  - False positive recognition: ambiguous names
  - False negative recognition: synonyms, variations, and the lack of a unified resource for newly published names
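
A dictionary-based recognizer can be sketched as a greedy longest-match lookup over the token sequence. The Python code below is a minimal illustration rather than the method of any particular system: the function name and the `max_len` window are assumptions, and the tiny dictionary reuses names that appear in the tables above.

```python
def tag_with_dictionary(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup; returns (start, end, name) spans."""
    entries = {tuple(name.lower().split()) for name in dictionary}
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if window in entries:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end
    return spans


toy_dictionary = {"growth hormone", "nuclear protein", "p53"}
sentence = "The growth hormone and p53 were measured".split()
spans = tag_with_dictionary(sentence, toy_dictionary)
print(spans)
```

A variant spelling such as "growth-hormone" would not match any entry, which is the false-negative limitation listed above.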

Approaches to NER: Rule-based Approach
Why a rule-based approach?
  - Covers a broader range of variations than dictionary-based approaches
Rules and patterns
  - Surface clues on character strings, e.g., Ras guanine nucleotide exchange factor Sos
  - Morphological clues, e.g., Hrp54, Laer\mt, p53, ligase
  - Syntactic clues, e.g., calmodulin N-methyltransferase

Table 8. Example Rules for EMPathIE and PASTA
  enzyme → enzyme_modifier enzyme.
  enzyme → character, '-', enzyme_head.
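
The two rules of Table 8 can be read as a small recursive grammar. The Python sketch below is one hedged interpretation of them: the modifier list is hypothetical, and treating enzyme_head as a word ending in "-ase" is an assumption borrowed from the morphological clue "ligase", not something the table specifies.

```python
import re

# Hypothetical modifier list; a real system would use a much larger resource.
ENZYME_MODIFIERS = {"calmodulin"}

# Assumption: "character, '-', enzyme_head" is one letter, a hyphen, and a
# head word ending in "-ase".
CHAR_HYPHEN_HEAD = re.compile(r"^[A-Za-z]-[A-Za-z]+ase$")


def is_enzyme(tokens):
    """Check a token list against the two rules, applied recursively."""
    if not tokens:
        return False
    if len(tokens) == 1:
        # enzyme -> character, '-', enzyme_head
        return bool(CHAR_HYPHEN_HEAD.match(tokens[0]))
    # enzyme -> enzyme_modifier enzyme
    return tokens[0].lower() in ENZYME_MODIFIERS and is_enzyme(tokens[1:])


print(is_enzyme(["calmodulin", "N-methyltransferase"]))
print(is_enzyme(["growth", "hormone"]))
```

Under these assumptions the syntactic-clue example calmodulin N-methyltransferase is accepted, while growth hormone is rejected.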

Approaches to NER: Machine Learning Approach
Why machine learning?
  - Naming conventions differ across research domains
  - Creating hand-made rules and patterns is labor-intensive
Statistical models
  - Lexical and character features
  - Probability of a sequence of words (n-grams)
Limitation
  - Lack of training corpora, i.e., the data sparseness problem
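
As a rough illustration of the ingredients listed above, the Python sketch below extracts a few lexical and character features from a token and estimates a bigram probability by maximum likelihood. The feature set and function names are assumptions chosen for illustration, and the five-token corpus is toy data.

```python
from collections import Counter


def token_features(token):
    """A few lexical and character features of a single token."""
    return {
        "lower": token.lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
        "suffix3": token[-3:].lower(),
    }


def bigram_probability(corpus_tokens, first, second):
    """Maximum-likelihood estimate of P(second | first) from raw counts."""
    # Count only tokens that have a successor, so the estimate is consistent.
    firsts = Counter(corpus_tokens[:-1])
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    if firsts[first] == 0:
        return 0.0
    return bigrams[(first, second)] / firsts[first]


toy_corpus = "p53 protein binds p53 gene".split()
print(token_features("HIV-1"))
print(bigram_probability(toy_corpus, "p53", "protein"))
print(bigram_probability(toy_corpus, "p53", "kinase"))
```

The last call returns 0.0 because the bigram never occurs in the toy corpus, a small-scale view of the data sparseness problem.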

Approaches to NER: Hybrid Approach
Why a hybrid approach?
  - Machine learning approaches suffer from the data sparseness problem
  - Each of the three kinds of approaches has its own strengths
Combining strategies
  - Task-purpose combining
      - Dictionary-based method for classifying gene/protein names
      - Rule-based method for identifying the boundaries of name phrases
      - Machine learning method for disambiguation of relevant names
  - Sequential combining
      - Pre-processing (dictionary), filtering (rule), and selection (machine learning) steps
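
The sequential combining strategy can be sketched as three functions applied in order. Everything in the Python code below is illustrative: the dictionary, the stopword rule, and especially `toy_score`, which is a hand-written stand-in for a trained model rather than a real classifier.

```python
def dictionary_step(tokens, dictionary):
    """Pre-processing: propose every token found in the dictionary."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in dictionary]


def rule_step(tokens, candidates, stopwords):
    """Filtering: drop all-lowercase candidates that are common English words."""
    return [i for i in candidates
            if not (tokens[i].islower() and tokens[i] in stopwords)]


def toy_score(tokens, i):
    """Stand-in for a trained model (assumption): 1.0 if the token has a
    digit or is followed by 'gene' or 'protein', else 0.0."""
    has_digit = any(c.isdigit() for c in tokens[i])
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return 1.0 if has_digit or next_word in {"gene", "protein"} else 0.0


def selection_step(tokens, candidates, score, threshold=0.5):
    """Selection: keep candidates whose score reaches the threshold."""
    return [i for i in candidates if score(tokens, i) >= threshold]


toy_tokens = "It was found that the WAS gene and p53 interact".split()
proposed = dictionary_step(toy_tokens, {"was", "p53"})
filtered = rule_step(toy_tokens, proposed, {"was"})
selected = selection_step(toy_tokens, filtered, toy_score)
print([toy_tokens[i] for i in selected])
```

Each step only narrows the candidate list produced by the previous one, so an error made by the dictionary step (a missed name) cannot be recovered later in this arrangement.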

Approaches to NER: Other Issues
Class identification
  - NEs are often ambiguous with respect to their class information
Named entity grounding
  - NEs grounded into entry IDs of biomedical resources can be used as indices or references
Acronym disambiguation
  - Acronyms often have multiple meanings
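
One common heuristic for acronym disambiguation is to look for a local definition of the form "long form (ACRONYM)" in the same document. The Python sketch below is a simplified illustration of that idea, not a method taken from the chapter: the function names, the five-word window, and the matching criteria are all assumptions.

```python
import re


def is_subsequence(short, long_form):
    """True if the characters of short occur in order within long_form."""
    remaining = iter(long_form.lower())
    return all(ch in remaining for ch in short.lower())


def local_definitions(text, max_words=5):
    """Map each parenthesized acronym to the shortest preceding word window
    that contains its letters in order and has a word part starting with
    its first letter."""
    definitions = {}
    for match in re.finditer(r"\(([A-Za-z0-9]{2,10})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        for n in range(1, min(max_words, len(words)) + 1):
            window = words[-n:]
            # Hyphen-separated parts count as word starts, e.g. "C-Jun".
            parts = re.split(r"[-\s]+", window[0])
            starts_ok = any(p[:1].lower() == short[0].lower() for p in parts)
            candidate = " ".join(window)
            if starts_ok and is_subsequence(short, candidate):
                definitions[short] = candidate
                break
    return definitions


text = ("Activation of phospholipase D (PLD) and "
        "C-Jun N-terminal kinase (JNK) was observed.")
found = local_definitions(text)
print(found)
```

When a document defines its acronyms this way, the local definition selects the intended meaning; acronyms that are never defined locally need other evidence.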

Biomedical NER Approaches vs. General-purpose NER Approaches
The approaches to biomedical NER have evolved similarly to those in general domains
Contrasts
  - Hybrid approaches show state-of-the-art performance in general domains, but not in the biomedical domain
  - The approaches to biomedical NER focus on gene and protein names, while approaches to general NER deal with language-independent issues as well

Summary
Current research focus of biomedical NER
  - Gene and protein names
Characteristics of biomedical named entities
  - Ambiguous names, synonyms, and variations
Biomedical NER methods
  - Dictionary-based approaches
  - Rule-based approaches
  - Machine learning approaches
  - Hybrid approaches