A registry of disease-causing mutations in protein kinase domains

HUMAN MUTATION 25:435–442 (2005)

DATABASES

KinMutBase: A Registry of Disease-Causing Mutations in Protein Kinase Domains

Csaba Ortutay,1 Jouni Väliaho,1 Kaj Stenberg,2 and Mauno Vihinen1,3*

1Institute of Medical Technology, University of Tampere, Tampere, Finland; 2Department of Biosciences, Division of Biochemistry, University of Helsinki, Helsinki, Finland; 3Research Unit, Tampere University Hospital, Tampere, Finland

Communicated by A. Jaimie Cuticchia

A large number of disease-causing mutations have been identified in several protein kinases. KinMutBase is a comprehensive knowledge base for human disease-related mutations in protein kinase domains (http://bioinf.uta.fi/KinMutBase/). The latest version contains 582 different mutations for 1,790 cases in 1,322 families. KinMutBase entries are described at the DNA, mRNA, and protein levels. Numbers of affected patients and families are also provided. KinMutBase has an extensive set of links and cross-references to literature, other databases, and information sources. There are numerous interactive pages about sequences, structures, mutation statistics, and diseases. A detailed statistical study was carried out on the frequencies of different types of mutations at both the DNA and protein levels in serine/threonine kinases (PSKs) and tyrosine kinases (PTKs). Three-dimensional structures indicate clustering of disease-related mutations mainly to conserved subdomains and to substrate and coligand binding amino acids, although mutations appear throughout the sequences. CpG-containing codons, especially for arginine, constitute the majority of mutational hotspots. There are certain clear differences in mutation patterns and types between PSKs and PTKs. Hum Mutat 25:435–442, 2005. © 2005 Wiley-Liss, Inc.

KEY WORDS:

database; protein tyrosine kinase; protein serine-threonine kinase; kinase domain; families

DATABASES:

http://bioinf.uta.fi/KinMutBase/ (KinMutBase)

INTRODUCTION

Recent analyses suggest that the human genome encodes some 430 catalytically active protein kinases [Hanks, 2003]. The vast majority of them are serine/threonine and tyrosine kinases, but histidine and arginine kinases also exist. These enzymes catalyze protein phosphorylation: the transfer of a phosphate group to serine, threonine, tyrosine, histidine, or acidic residues of the target proteins, which is one of the most frequent posttranslational modifications of proteins. Kinase activity is modulated by interactions with activator or inhibitor proteins, cofactors, ligands, auto- and transphosphorylation, and phosphatases. Many kinases act in crucial steps, and are therefore key regulators of biological control and signaling cascades. Phosphatases are antagonists of kinases, since they can remove the phosphate group from the target molecule. Together, phosphatases and kinases are switches of cellular signaling and networking [Bhaduri and Sowdhamini, 2003].

Since defects in kinases frequently cause diseases, mutation data are valuable for researchers from several fields. Both activating and inactivating mutations have been detected. Activating mutations turn the enzymes constitutively on, leading to constant activation of signaling cascades, which often leads to tumors. In contrast, inactivation of kinases makes the cells unable to respond to certain signals. Depending on the type and position of mutations, a single protein kinase can be linked to more than one disease. For example, mutations of the KIT protooncogene can cause various types of mastocytosis and other hematological diseases [Boissan et al., 2000] and tumors [de Silva and Reid, 2003].

The importance of protein kinases, and the increasing amount of mutation data related to them, raised the need for an integrated database of mutations and related information in kinase domains. Such a database should join the available knowledge about mutations, diseases, sequences, and structures. The database should also provide links to the literature and other databases. KinMutBase (http://bioinf.uta.fi/KinMutBase/) was developed to fulfill these needs. The database has been completely reengineered and it has grown substantially since its previous release [Stenberg et al., 2000], which contained 170 mutations in nine kinases.

ORGANIZATION AND CONTENTS OF KinMutBase

KinMutBase entries describe mutations at the genomic, RNA, and protein levels, and contain information on occurrences of mutations such as the number of patients, the number of unrelated families, and the number of patients homozygous for a mutation.

Received 28 September 2004; accepted revised manuscript 11 December 2004.
*Correspondence to: Mauno Vihinen, Institute of Medical Technology, FI-33014 University of Tampere, Finland. E-mail: mauno.vihinen@uta.fi
DOI 10.1002/humu.20166
Published online in Wiley InterScience (www.interscience.wiley.com).


MUTbase software [Riikonen and Vihinen, 1999], developed to build and maintain locus-specific mutation databases, was modified to maintain data simultaneously for multiple kinases and to accept mutations only in the kinase domain. New mutation data are inserted into the database over the Internet through a Perl CGI submission form. The data are handled by Perl scripts, which are capable of interpreting the most common mutation types, such as point mutations, insertions, and deletions. Other cases are coded manually. The submission form provides several validation and data integrity services that include the main features of the European Bioinformatics Institute (EBI) Mutation Checker (www.ebi.ac.uk/cgi-bin/mutations/check.cgi). The submission form also has a PubMed (www.ncbi.nlm.nih.gov/entrez/query.fcgi) reference agent, which fetches reference information from PubMed and inserts it into the database. The primary storage format of the data is Extensible Markup Language (XML); database entries are also converted to a flat file format for export. Each entry has two unique key values: ID and accession number. The ID codes follow the conventions used in the PIN naming system for immunodeficiency mutation databases [Vihinen et al., 1999]. The entries are linked with other databases such as the European Molecular Biology Laboratory (EMBL) (www.embl.org), SwissProt (www.expasy.org/sprot/), PubMed, and locus-specific databases via their accession numbers.
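To make the storage scheme described above more concrete, the following minimal sketch builds one entry, writes it as XML, reads it back, and exports it as a flat-file line. It is an illustration only: the `MutationEntry` class, the XML element and attribute names, the tab-separated layout, and the placeholder ID `EXAMPLE_ID` are all hypothetical and are not taken from MUTbase or KinMutBase. Only the kinds of fields (two keys, a DNA-level description, and occurrence counts) follow the text, and the example values are those reported in this article for accession K00595 in GUCY2D (six patients from one family).

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class MutationEntry:
    """One database entry; all field and element names are illustrative."""
    entry_id: str                     # ID key (placeholder, not a real PIN-style ID)
    accession: str                    # accession number key
    gene: str
    dna_change: str                   # free-text description at the DNA level
    families: int                     # number of unrelated families
    patients: int                     # number of affected individuals
    homozygous: Optional[int] = None  # None when the count is not known

    def to_xml(self) -> ET.Element:
        """Serialize the entry to a (hypothetical) XML element."""
        root = ET.Element("entry", id=self.entry_id, accession=self.accession)
        ET.SubElement(root, "gene").text = self.gene
        ET.SubElement(root, "dna").text = self.dna_change
        occurrence = ET.SubElement(root, "occurrence")
        occurrence.set("families", str(self.families))
        occurrence.set("patients", str(self.patients))
        if self.homozygous is not None:
            occurrence.set("homozygous", str(self.homozygous))
        return root

    def to_flat(self) -> str:
        """Export the entry as one tab-separated flat-file line."""
        homozygous = "" if self.homozygous is None else str(self.homozygous)
        fields = [self.entry_id, self.accession, self.gene, self.dna_change,
                  str(self.families), str(self.patients), homozygous]
        return "\t".join(fields)


def from_xml(element: ET.Element) -> MutationEntry:
    """Rebuild an entry from the XML produced by MutationEntry.to_xml()."""
    occurrence = element.find("occurrence")
    homozygous = occurrence.get("homozygous")
    return MutationEntry(
        entry_id=element.get("id"),
        accession=element.get("accession"),
        gene=element.findtext("gene"),
        dna_change=element.findtext("dna"),
        families=int(occurrence.get("families")),
        patients=int(occurrence.get("patients")),
        homozygous=None if homozygous is None else int(homozygous),
    )


entry = MutationEntry("EXAMPLE_ID", "K00595", "GUCY2D",
                      "gcgcac replaced by ctgcat", families=1, patients=6)
xml_text = ET.tostring(entry.to_xml(), encoding="unicode")
round_trip = from_xml(ET.fromstring(xml_text))
flat_line = entry.to_flat()
```

Keeping XML as the primary format and deriving the flat file from it, as in this sketch, means that the export can be regenerated at any time without a second copy of the data to maintain.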

The database has a specific page for each disease-related kinase. The page contains basic information on related genes and diseases, such as disease names, reference DNA, cDNA, and protein sequences, exon/intron structure, and kinase characteristics.

KinMutBase FEATURES

KinMutBase combines genetic, mutation, literature, and structural information into a knowledge base that allows users from many fields to search and analyze kinase domain–related data. A user-friendly web interface was developed to enable quick navigation. In addition to mutations, KinMutBase provides several other types of information. The first group of data describes the disease-related protein kinase genes. DNA, cDNA, and protein reference sequences are available, and links are provided to other integrated databases such as LocusLink (www.ncbi.nlm.nih.gov/LocusLink) [Pruitt and Maglott, 2001], UniGene (www.ncbi.nlm.nih.gov/UniGene) [Pontius et al., 2003], and GeneLynx (www.genelynx.org) [Lenhard et al., 2001]. SNP information can be found via dbSNP (www.ncbi.nlm.nih.gov/projects/SNP) [Kitts and Sherry, 2003]. Since most of the kinase genes have homologs in mouse, Drosophila melanogaster, and yeast, there are links to the Mouse Genome Informatics Database (www.informatics.jax.org) [Bult et al., 2004], FlyBase (http://flybase.bio.indiana.edu) [FlyBase Consortium, 2003], the Saccharomyces Genome Database (www.yeastgenome.org) [Cherry et al., 1997], and to the mouse models in The Jackson Laboratory (www.jax.org). Usually there are one or two mouse models for a particular gene, but mast/stem cell growth factor receptor (KIT) has as many as 47 models.

FIGURE 1. Colored character visualization of the alignment of PTKs. The figure was created with MultiDisp (http://bioinf.uta.fi/cgi-bin/MultiDisp.cgi). The character height is proportional to the frequency of a residue in a position. The number of independent missense mutations is indicated by gray boxes under the residues. The darker the box, the more mutations there are. (White boxes show no mutation; black boxes show at least 15 independent mutations.) The boxes indicate the number of mutations in a certain position in all the studied kinases.

Structural information for the kinases is presented on separate pages. The three-dimensional structure has been solved for six kinases, and theoretical models calculated by the Center for Biological Sequence Analysis (CBS) CPHmodels 2.0 homology modeling server (www.cbs.dtu.dk/services/CPHmodels/) are available for all 33 kinases. One of the notable features of KinMutBase is its sequence alignments. Separate alignments are presented for tyrosine kinases (PTKs) and serine/threonine kinases (PSKs), together with mutations. The alignments are visualized in various formats and have links directly to the mutation entries in the database. These pages can be used in many ways, e.g., to search for frequently mutated positions. The alignments are further visualized with MultiDisp (http://bioinf.uta.fi/cgi-bin/MultiDisp.cgi) (P. Riikonen

TABLE 1A.

Kinase ACVRL1 AMHR2 BMPR1A BMPR2 CDK4 CHEK2 PAK3 PHKG2 RHOK RPS6KA3 RPS6KA3a STK11 TGFBR2 Total Total (%)

Missense

Nonsense

7/21/0 7/9/2 3/3/0 8/21/0 15/18/0 6/6/0 1/19/0 8/8/0 4/4/0 18/18/0 9/12/0 28/37/0 10/10/0 124/186/2 34/38/33.3

1/1/0


and M. Vihinen, unpublished results) (Fig. 1). These graphs can display common features of sequences in the alignments: not just identities but also common properties, such as hydrophobicity and polarity of residues. Tree representations of the alignments are also available.

Diseases are listed on a separate page along with links to the OMIM database (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). Mutations in most of the genes are described in only one disease, but there are examples in which distinct mutations of a gene are found in various diseases. For example, mutations of serine/threonine-protein kinase 11 (STK11) are described in melanomas [Rowan et al., 1999], pancreatic cancer [Su et al., 1999], colon cancer [Dong et al., 1998], testicular carcinoma [Ylikorkala et al., 1999], hepatocellular carcinoma [Kim et al., 2004], and adenocarcinoma [Kuragaki et al., 2003].

Statistics of mutation types in KinMutBase are shown in Table 1. The largest number of cases has been detected in Bruton tyrosine kinase (BTK, 364 cases in 317 independent families), whereas the largest

Mutation Types in Serine/Threonine Kinases*

Deletion

Insertion

Splice site

1/1/0 11/11/4 2/2/0 5/10/0

5/21/0 7/18/0

1/1/0 2/2/0

1/1/0 1/4/0 2/2/0

155/173/0

12/12/0 2/2/0 8/8/0

4/4/0 16/23/0

3/5/0

39/69/0 10.7/14.1/0

195/225/4 53.4/46/66.7

6/8/0 1.6/1.6/0

2/2/0

1/1/0 0.2/0.2/0

Total

Total (%)

9/23/0 19/21/6 10/26/0 22/51/0 15/18/0 162/180/0 2/23/0 12/12/0 4/4/0 34/34/0 11/14/0 55/73/0 10/10/0 365/489/6

2.5/4.7/0 5.2/4.3/100 2.7/5.3/0 6/10.4/0 4.1/3.7/0 44.4/36.8/0 0.5/4.7/0 3.3/2.5/0 1.1/0.8/0 9.3/6.9/0 3/2.9/0 15.1/14.9/0 2.7/2/0 100/100/100

*Numbers refer to unrelated families, affected individuals, and homozygous cases, respectively.
a Carboxy terminal domain.

TABLE 1B. Mutation Types of Tyrosine Kinases*

Kinase BTK FGFR1 FGFR2 FGFR3 FLT3 FLT4 GUCY2D INSR IRAK4 JAK3 JAK3a KIT LTK MERTK MET NTRK1 RET ROR2 TEK ZAP70 Total Total (%)

Missense

Nonsense

Deletion

Insertion

Splice site

Other

Total

Total (%)

170/195/0 3/5/0 8/8/0 53/57/0 138/138/0 7/40/0 24/84/4 26/37/0

51/58/0 1/5/0

35/36/0 1/3/0

20/25/0

40/49/0 1/1/0

1/1/0

14/14/0

5/5/0

317/364/0 6/14/0 8/8/0 53/57/0 157/157/0 7/40/0 27/92/5 29/40/0 7/7/3 8/9/3 4/4/1 88/94/2 11/11/0 2/2/1 35/55/0 45/65/19 130/165/2 9/19/19 4/81/0 10/17/10 957/1301/65

10.4/28/0 0.6/1.1/0 0.8/0.6/0 5.5/4.4/0 16.4/12.1/0 0.7/3.1/0 2.8/7.1/7.7 3/3.1/0 0.7/0.5/4.6 0.8/0.7/4.6 0.4/0.3/1.5 9.2/7.3/3.1 1.1/0.8/0 0.2/0.2/1.5 3.6/4.2/0 4.7/5/29.2 13.6/12.7/3.1 0.9/1.5/29.2 0.4/6.2/0 1/1.3/15.4

4/4/1 1/1/0 75/79/2 11/11/0 35/55/0 15/25/2 127/162/2 1/1/1 4/81/0 6/12/5 708/995/17 74/76.5/26.2

1/1/0 1/1/0 4/4/2 1/1/0 1/1/0

1/1/1 2/2/0 3/3/1 1/1/0 1/1/0 13/15/0

1/1/0

1/1/1

2/2/0 3/3/0 8/18/18

17/27/7

11/11/10

2/2/2 91/106/12 9.5/8.1/18.5

36/41/10 3.7/3.2/15.4

74/95/20 7.7/7.3/30.8

1/6/0 2/3/2/ 1/1/1

2/3/3 46/57/6 4.8/4.4/9.2

*Numbers refer to unrelated families, affected individuals, and homozygous cases, respectively.
a Pseudokinase domain.

2/7/0 0.2/0.5/0

100/100/100


TABLE 2. Frequencies of Disease-Causing Nucleotide Substitutions in Coding Sequences*

                      Serine/threonine kinases                  Tyrosine kinases
From           To     Affected residues   Unrelated families    Affected residues   Unrelated families
A              C      3 (2.2%)            3 (1.9%)              18 (5.6%)           20 (2.5%)
A              G †    13 (9.6%)           14 (8.6%)             31 (9.7%)           61 (7.6%)
A              T      5 (3.7%)            5 (3.1%)              11 (3.4%)           67 (8.3%)
C              A      7 (5.2%)            7 (4.3%)              24 (7.5%)           46 (5.7%)
C              G      1 (0.7%)            1 (0.6%)              10 (3.1%)           12 (1.5%)
C              T †    34 (25.2%)          55 (34.0%)            52 (16.2%)          114 (14.1%)
G              A †    28 (20.7%)          30 (18.5%)            61 (18.9%)          133 (16.5%)
G              C      7 (5.2%)            7 (4.3%)              25 (7.8%)           71 (8.8%)
G              T      12 (8.9%)           12 (7.4%)             31 (9.6%)           130 (16.1%)
T              A      6 (4.4%)            6 (3.7%)              11 (3.4%)           17 (2.1%)
T              C †    11 (8.2%)           13 (8.0%)             35 (10.9%)          88 (10.9%)
T              G      8 (5.9%)            9 (5.6%)              13 (4.0%)           47 (5.8%)
Transitions           86 (63.7%)          112 (69.1%)           179 (55.6%)         396 (49.1%)
Transversions         49 (36.3%)          50 (30.9%)            143 (44.4%)         410 (50.9%)

*Transitions are marked with †.
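The transition and transversion totals in Table 2 can be recomputed from the twelve substitution classes. The short sketch below does this for the "affected residues" columns; the variable and function names are illustrative, and the counts are copied from the table. Pooling the two kinase groups gives the transition share of about 58% quoted in the Data Analysis section, and the combined C to T and G to A share comes out above the 30% mentioned there.

```python
# Substitution classes in the row order of Table 2, as (from, to) pairs.
SUBSTITUTIONS = [
    ("A", "C"), ("A", "G"), ("A", "T"),
    ("C", "A"), ("C", "G"), ("C", "T"),
    ("G", "A"), ("G", "C"), ("G", "T"),
    ("T", "A"), ("T", "C"), ("T", "G"),
]

# Numbers of affected residues from Table 2.
PSK_RESIDUES = [3, 13, 5, 7, 1, 34, 28, 7, 12, 6, 11, 8]
PTK_RESIDUES = [18, 31, 11, 24, 10, 52, 61, 25, 31, 11, 35, 13]

PURINES = {"A", "G"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition keeps the base class: purine to purine or pyrimidine to pyrimidine."""
    return (ref in PURINES) == (alt in PURINES)


def transition_share(counts):
    """Return (number of transitions, total number of substitutions)."""
    transitions = sum(n for (ref, alt), n in zip(SUBSTITUTIONS, counts)
                      if is_transition(ref, alt))
    return transitions, sum(counts)


psk = transition_share(PSK_RESIDUES)   # (86, 135)
ptk = transition_share(PTK_RESIDUES)   # (179, 322)
pooled_transitions = psk[0] + ptk[0]   # 265
pooled_total = psk[1] + ptk[1]         # 457

# C>T and G>A, the changes expected from deamination at CpG sites.
cpg_type = sum(n for (ref, alt), n in zip(SUBSTITUTIONS, PSK_RESIDUES + PTK_RESIDUES)
               if False)  # placeholder replaced below for clarity
cpg_type = sum(
    psk_n + ptk_n
    for (ref, alt), psk_n, ptk_n in zip(SUBSTITUTIONS, PSK_RESIDUES, PTK_RESIDUES)
    if (ref, alt) in {("C", "T"), ("G", "A")}
)  # 175

print(f"transitions: {100 * pooled_transitions / pooled_total:.1f}%")  # 58.0%
print(f"C>T plus G>A: {100 * cpg_type / pooled_total:.1f}%")           # 38.3%
```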

FIGURE 2. Distribution of point mutations in the coding sequences by codon position. Left, number of affected residues; right, number of independent mutation events in serine/threonine kinases, in tyrosine kinases, and in both.
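The codon-position bias in Figure 2 follows from the degeneracy of the genetic code, and the figures quoted in the Data Analysis section (175, 183, and 57 out of 183 substitutions at the first, second, and third positions) can be checked by enumeration. The sketch below assumes the standard genetic code and counts, over the 61 sense codons, the single-nucleotide substitutions that change the encoded residue or create a stop codon; the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def changing_substitutions_by_position():
    """Count, per codon position, substitutions that alter the encoded residue.

    Only sense codons are used as starting points; a change to a stop codon
    (nonsense mutation) counts as altering the residue.
    """
    changing = [0, 0, 0]
    total = [0, 0, 0]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                total[position] += 1
                if CODON_TABLE[mutant] != amino_acid:
                    changing[position] += 1
    return changing, total


changing, total = changing_substitutions_by_position()
for position in range(3):
    print(f"codon position {position + 1}: {changing[position]} of {total[position]}")
# codon position 1: 175 of 183
# codon position 2: 183 of 183
# codon position 3: 57 of 183
```

The eight silent first-position changes are the leucine (TTA/CTA, TTG/CTG) and arginine (AGA/CGA, AGG/CGG) pairs, counted in both directions.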

number of homozygous cases appears in high affinity nerve growth factor receptor (NTRK1, 19 homozygous cases out of the total of 65).

DATA ANALYSIS

KinMutBase contains 582 different mutations in 20 PTK domains and in 13 PSK domains. The database refers to 1,790 cases from 1,322 families (Table 1). Mutations appear both in conserved hallmark residues of the kinases and in nonhomologous sites. Mutations from 489 affected individuals in 365 families are found in PSKs; mutations from 1,301 affected individuals in 957 families are found in PTKs. Differences in the numbers for kinase types may be an artifact arising from the incidence of diseases and research interest in the kinases. Deletions are more common in PSKs than in PTKs, in which missense mutations account for about 75% of the total. The bias in PSKs is due to a single kinase, CHEK2, in which more than 90% of cases are deletions. CHEK2 cases account for more than 40% of PSK mutations.

The frequencies of nucleotide substitutions (Table 2) show that the majority of the substitutions are transitions (58% of the affected residues and 52% of the unrelated events). The distributions of mutations in PSKs and PTKs are very similar. Mutations in CpG dinucleotides are the most common mutation

events also in immunodeficiencies [Ollila et al., 1996]. Accordingly, C to T and G to A mutations are clearly overrepresented, totaling over 30% of point mutations. The distribution of mutations by codon position shows that mutations at the third position account for only 10% of the disease-causing point mutations (Fig. 2). The first and second codon positions are affected with about the same frequency. The result is in line with the known outcome of mutations in different positions: in the third position only 57 out of 183 possible single nucleotide substitutions lead to an amino acid substitution or a nonsense mutation, while the numbers are 175 out of 183 for the first codon position and 183 out of 183 for the second codon position.

There are 463 mutations in coding sequences: 76 are deletions (16.4%), 27 are insertions (5.8%), and the rest are nucleotide substitutions (360; 77.6%). The majority of substitutions are single nucleotide alterations; one of the rare exceptions is the mutation K00595, in which the hexanucleotide "gcgcac" in the gene for retinal guanylyl cyclase 1 (GUCY2D) is replaced by "ctgcat." This cone-rod dystrophy-causing mutation [Perrault et al., 1998] has been described in six patients from one family.

In the current database release, 72.6% of the mutations are unique, detected only in members of a single family. There are some very frequently mutated residues, in which mutations are

TABLE 3. Correlation Between Amino Acid Substitution Frequencies in PTKs (Upper Half) and PSKs (Lower Half)*

                                                                          Original
Mutant (%)   Aa     Ca     D      E      Fa     Ga     H      Ib     K      L      M      Nb     Pa     Qa     Ra     S      T      V      Wa     Y
A            0      –      0.14   0.16   –      0      –      –      –      –      –      –      0.12   –      –      0.9    0      0      –      –
C            –      0      –      –      0      0      –      –      –      –      –      –      –      –      2.69   0.15   –      –      0.04   0.66
D            0.47   –      0      2.45   –      0.98   0.06   –      –      –      –      0      –      –      –      –      –      0      –      2.27
E            0.47   –      1.92   0      –      1.47   –      –      3.8    –      –      –      –      0      –      –      –      0.17   –      –
F            –      0.22   –      –      0      –      –      0      –      1.35   –      –      –      –      –      0.15   –      –      –      0.19
G            0      0      0.82   0.33   –      0.16c  –      –      –      –      –      –      –      –      1.05   0      –      0.17   0      –
H            –      –      4.81   –      –      –      0      –      –      0      –      0.22   0      0      1.34   –      –      –      –      0.19
I            –      –      –      –      0      –      –      0      0      0      0.46   0      –      –      0      0      0.19   0.5    –      –
K            –      –      –      2.62   –      –      –      0      0.14c  –      0.18   1.87   –      0      0.3    –      0      –      –      –
L            –      –      –      –      0.19   –      0      0      –      0      0.37   –      0.95   0      0.15   0.15   –      1.67   0.15   –
M            –      –      –      –      –      –      –      0      0.7    0      0      –      –      –      0      –      0.09   1.84   –      –
N            –      –      0.96   –      –      –      0      0.12   0.7    –      –      0      –      –      –      0      0      –      –      0
P            0.31   –      –      –      –      –      0.06   –      –      2.71   –      –      0      0.09   0.9    0.6    0.19   –      –      –
Q            –      –      –      0.16   –      –      0.06   –      0.28   0      –      –      0      0      4.03   –      –      –      –      –
R            –      0.16   –      –      –      1.14   0.26   0      0.56   0.54   0.18   –      0      0      0      0.6    0      –      0.15   –
S            0      0.05   –      –      0.87   0.33   –      0      –      0.27   –      0.07   0.71   –      0.3    0      0      –      0.08   0.09
T            0.31   –      –      –      –      –      –      0.35   0.14   –      4.65   0.07   0.36   –      0      0      0      –      –      –
V            1.26   –      7.42   0      0      0.81   –      0.12   –      0.27   0.27   –      –      –      –      –      –      0      –      –
W            –      0.11   –      –      –      0.16   –      –      –      0      –      –      –      –      3.58   0      –      –      0      –
X            –      0.11   –      0.49   –      0      –      –      0.28   0.54   –      –      –      1.44   3.58   0.15   –      –      0.42   0.76
Y            –      0.43   11.4   –      0      –      0.19   –      –      –      –      0      –      –      –      0.3    –      –      –      0
Correlation  0.83   0.78   0.32   0.42   1      0.77   0.4    0.15   0.62   0.68   0.42   0.26   0.97   1      0.78   0.53   0.65   0.33   0.82   0.33
Mutant (%)   Aa     Ca     D      E      Fa     Ga     H      Ib     K      L      M      Nb     Pa     Qa     Ra     S      T      V      Wa     Y
A            0      –      0      0.81   –      0      –      –      –      –      –      –      0      –      –      0.61   0.53   1.57   –      –
C            –      0      –      –      0      0      –      –      –      –      –      –      –      –      9.42   0      –      –      0.3    0.83
D            0.71   –      0      0      –      2.35   0      –      –      –      –      0.69   –      –      –      –      –      0.79   –      0
E            0.71   –      0      0      –      0.78   –      –      0.68   –      –      –      –      0      –      –      –      0.79   –      –
F            –      0      –      –      0      –      –      0.63   –      1.25   –      –      –      –      –      0.61   –      0.79   –      0
G            0      0      2.53   0      –      0      –      –      –      –      –      –      –      –      0      0      –      0.79   0      –
H            –      –      0.63   –      –      –      0      –      –      0      –      0      0      0      2.02   –      –      –      –      0
I            –      –      –      –      0      –      –      0      0      0      0.33   0      –      –      0      0.61   0.53   0      –      –
K            –      –      –      2.43   –      –      –      0.63   0      –      0      0.34   –      0      0.67   –      0.53   –      –      –
L            –      –      –      –      0.48   –      0.33   0      –      0      0      –      1.08   0      0      0      –      0      0      –
M            –      –      –      –      –      –      –      0      0      0      0      –      –      –      0      –      1.06   1.57   –      –
N            –      –      3.16   –      –      –      0      1.26   0.68   –      –      0      –      –      –      0      0      –      –      0
P            0      –      –      –      –      –      0      –      –      2.5    –      –      0      0      1.35   0.61   0.53   –      –      –
Q            –      –      –      0.81   –      –      0.66   –      0      0      –      –      0      0      2.02   –      –      –      –      –
R            –      0.48   –      –      –      1.57   0      0      0.68   3.75   0.65   –      0      0      0      0      0      –      0      –
S            0      0      –      –      1.9    1.57   –      0      –      0      –      0.69   0.54   –      0      0      0      –      0      0
T            0      –      –      –      –      –      –      0      0      –      0.33   0      0.54   –      0      0      0      –      –      –
V            0.71   –      0.63   0      0      0.78   –      0      –      0      0      –      –      –      –      –      –      0      –      –
W            –      0      –      –      –      0      –      –      –      0      –      –      –      –      5.38   0      –      –      0      –
X            –      0.25   –      3.81   –      0      –      –      0.66   1.26   –      –      –      1.98   12.51  0      –      –      0.72   0.88
Y            –      0.5    1.28   –      0      –      0.6    –      –      –      –      0.35   –      –      –      0      –      –      –      0

*Original amino acids are in columns and replaced ones in rows (X denotes stop codon). Values are substitution frequencies in percent corrected by amino acid frequencies. The correlation coefficients between the PTKs and PSKs are in the middle row. Dashes indicate substitutions that are impossible with only one nucleotide substitution in the codon.
a Amino acid substitution patterns having a correlation coefficient higher than 0.75 (good correlation).
b Substitution patterns having a correlation coefficient lower than 0.26 (poor correlation).
c Splice site mutations.
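The correlation row of Table 3 compares, for each original amino acid, the PTK and PSK substitution profiles. The article does not state which correlation coefficient was used or exactly which cells entered the calculation, so the sketch below rests on two assumptions: a Pearson coefficient, computed over the cells of a column that are not dashes. Under these assumptions the phenylalanine column (mutant residues C, F, I, L, S, V, Y) gives a value that rounds to the tabulated 1; other columns may differ slightly in the second decimal because the tabulated frequencies are themselves rounded.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Phenylalanine column of Table 3; rows C, F, I, L, S, V, Y (non-dash cells only).
PHE_PTK = [0, 0, 0, 0.19, 0.87, 0, 0]
PHE_PSK = [0, 0, 0, 0.48, 1.9, 0, 0]

r_phe = pearson(PHE_PTK, PHE_PSK)
print(f"Phe: r = {r_phe:.2f}")  # Phe: r = 1.00
```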

reported from many unrelated families. The well-known 1100delC mutation (K00362) of the CHK2 checkpoint homolog (CHEK2) has been described in at least 154 families, and the D835 mutations (K00567–571, K00573–74, and K00577) of fms-related tyrosine kinase 3 (FLT3) have been described in 136 families. A total of 13% of the affected residues have more than one mutation. In residue K650 of fibroblast growth factor receptor 3 (FGFR3), six distinct amino acid substitutions (K00109–110, K00356–358, and K00497) in 29 unrelated families have been reported.

Amino acid substitutions were analyzed for all the possible combinations. There are some differences between PTKs and PSKs (Table 3). The strongest similarity is that phenylalanine is always changed to leucine or serine (correlation coefficient 1), although single-base replacements in the codons for phenylalanine would make other substitutions possible. Otherwise, the lowest correlation is detected in mutation patterns of asparagine and isoleucine (correlation coefficients 0.26 and 0.15, respectively). Only very few mutations appear in codons for cysteine, histidine, isoleucine, and serine. On the other hand, just a few mutations are alterations to alanine or isoleucine. By far the largest frequencies are for mutations from arginine to stop codon, arginine to cysteine, aspartate to tyrosine, and aspartate to threonine. In all these instances a charged residue is replaced. When looking at the mutation frequencies between groups of amino acid types, it is evident that mutations in polar and charged residues are clearly overrepresented. In PSKs, nonsense mutations are frequent for polar and charged amino acids (Table 4).

The distribution of mutations within the domains is visualized in Figure 3 to investigate mutational hotspots. The numbers are for unrelated families, representing the number of unrelated


TABLE 4. Comparison of Amino Acid Substitution Frequencies in PTKs and PSKs*

                                             Original
Mutant                     Hydrophobic  Polar   Small   Charged  Aromatic  Aliphatic
Tyrosine kinases
  Hydrophobic              7.56         14.54   3.64    14.54    1.24      7.97
  Polar                    1.51         19.81   5.13    17.94    2.81      1.77
  Small                    4.15         3.85    3.13    3.69     1.11      4.29
  Charged                  1.39         13.67   5.13    11.8     2.74      1.65
  Aromatic                 2.11         21.36   0.61    21.14    0.57      1.35
  Aliphatic                6.71         8.85    5.1     8.77     0.41      7.91
  Stop codon               1.07         5.8     0.15    4.35     1.18      0.54
Serine/threonine kinases
  Hydrophobic              5.34         15.43   3.8     15.43    1.94      6.04
  Polar                    8.34         12.16   6.14    11.13    0.66      8.64
  Small                    6.76         5.37    3.33    4.68     1.9       5.4
  Charged                  7.08         5.5     6.14    4.46     0         7.38
  Aromatic                 3.17         9.66    0.61    9.31     0.6       2.67
  Aliphatic                4.88         2.79    4.42    2.79     0.81      5.87
  Stop codon               2.23         18.96   0       16.98    1.6       1.26

*Columns represent the original amino acids, and rows the mutant forms. Values are observed substitution frequencies corrected by amino acid frequencies. Grouping of amino acids is according to Livingstone and Barton [1993].

FIGURE 3. Distribution of KinMutBase mutations according to multiple sequence alignment in tyrosine kinases (top) and serine/threonine kinases (bottom). The number of missense mutations is displayed on the upper part of the bar charts and the number of nonsense mutations on the lower part. Locations of kinase-specific motifs and subdomains are presented as black bars.

mutational events. A large proportion of the mutations affect the conserved kinase motifs of Hanks [Hanks, 2003], many of which are involved in substrate and coligand binding. The bias is also evident from the visualization of the multiple sequence alignment (Fig. 1). Many of the mutations are in subdomains VIB and VIII, which are responsible for substrate recognition [Johnson et al., 1998; Taylor et al., 1995]. There are also certain hotspots outside the conserved subdomains and motifs, for example the R24C mutation (K00113) of cyclin-dependent kinase 4 (CDK4) and L67P (K00055) of serine/threonine kinase 11 (STK11), is reported


FIGURE 4. Location of frequently mutated residues in PTKs indicated in the insulin receptor kinase domain structure (PDB code: 1GAG). The figure was created with PyMOL (www.pymol.org) [DeLano, 2002]. A: Cartoon representation of the structure with substrate (gray stick model) and Mg2+ (yellow balls). B: Residues affected frequently by missense mutations (blue), nonsense mutations (red), or both (magenta). C: Location of kinase-specific motifs (blue backbone) and tyrosine kinase-specific subdomains (red). (Residues with overlapping definitions are red.) D: Residues affected by mutations causing constitutive activation (red).

from at least 14 unrelated families. A total of 21 of the frequently mutated sites are arginines, which are coded by CpG dinucleotide-containing codons.

The most frequently mutated residues were investigated in three-dimensional protein structures. Structures have been determined for several PTK kinase domains. The kinase domain structure of insulin receptor kinase (PDB ID: 1GAG) [Hubbard, 1997] was used to visualize the localization of mutations in PTKs (Fig. 4). For the disease-related PSKs, no structure has been determined. The locations of missense and nonsense mutations were compared to the locations of conserved motifs. Numerous missense mutations are located around the substrate-binding pocket near the substrate, Mg2+, and ATP binding sites. The sequence and structural studies reveal that disease-causing mutations are widely distributed within the domain, indicating that kinases are vulnerable to alterations in many locations. Interestingly, the activating mutations cluster to the activation loop and ATP/ligand binding region (Fig. 4D).

The growth of KinMutBase allows some statistical analysis of mutations. Differences in mutation types between PSKs and PTKs may be an artifact arising from different numbers of cases. However, the amino acid substitution patterns differ to such an extent that the differences are likely to remain even when more mutations and kinases are included. Most of the kinases lead to several diseases when mutated, and the disease phenotype depends on the mutation position and type. A total of 27 out of the 33 kinases are linked to cancers and tumors, further underlining the importance of signal pathway control, which is lost through either inactivation or activation when the kinases are mutated.

REFERENCES

Bhaduri A, Sowdhamini R. 2003. A genome-wide survey of human tyrosine phosphatases. Protein Eng 16:881–888.
Boissan M, Feger F, Guillosson JJ, Arock M. 2000. c-Kit and c-kit mutations in mastocytosis and other hematological diseases. J Leukoc Biol 67:135–148.
Bult CJ, Blake JA, Richardson JE, Kadin JA, Eppig JT, the Mouse Genome Database Group. 2004. The Mouse Genome Database (MGD): integrating biology with the genome. Nucleic Acids Res 32:D476–D481. [Database issue]
Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK, Botstein D. 1997. Genetic and physical maps of Saccharomyces cerevisiae. Nature 387:67–73.
de Silva CM, Reid R. 2003. Gastrointestinal stromal tumors (GIST): C-kit mutations, CD117 expression, differential diagnosis and targeted cancer therapy with Imatinib. Pathol Oncol Res 9:13–19.
DeLano WL. 2002. The PyMOL molecular graphics system. San Carlos, CA: DeLano Scientific.
DeLano WL. 2002. The PyMOL user's manual. San Carlos, CA: DeLano Scientific.
Dong SM, Kim KM, Kim SY, Shin MS, Na EY, Lee SH, Park WS, Yoo NJ, Jang JJ, Yoon CY, Kim JW, Kim SY, Yang YM, Kim SH, Kim CS, Lee JY. 1998. Frequent somatic mutations in serine/threonine kinase 11/Peutz-Jeghers syndrome gene in left-sided colon cancer. Cancer Res 58:3787–3790.
FlyBase Consortium. 2003. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 31:172–175.
Hanks SK. 2003. Genomic analysis of the eukaryotic protein kinase superfamily: a perspective. Genome Biol 4:111.
Hubbard SR. 1997. Crystal structure of the activated insulin receptor tyrosine kinase in complex with peptide substrate and ATP analog. EMBO J 16:5572–5581.
Johnson LN, Lowe ED, Noble ME, Owen DJ. 1998. The Eleventh Datta Lecture. The structural basis for substrate recognition and control by protein kinases. FEBS Lett 430:1–11.
Kim CJ, Cho YG, Park JY, Kim TY, Lee JH, Kim HS, Lee JW, Song YH, Nam SW, Lee SH, Yoo NJ, Lee JY, Park WS. 2004. Genetic analysis of the LKB1/STK11 gene in hepatocellular carcinomas. Eur J Cancer 40:136–141.
Kitts A, Sherry S. 2003. The single nucleotide polymorphism database (dbSNP) of nucleotide sequence variation. The NCBI handbook. Bethesda, MD: National Library of Medicine, NCBI. p 5–1–30. (online: www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.1143, date accessed: 1 September 2004)
Kuragaki C, Enomoto T, Ueno Y, Sun H, Fujita M, Nakashima R, Ueda Y, Wada H, Murata Y, Toki T, Konishi I, Fujii S. 2003. Mutations in the STK11 gene characterize minimal deviation adenocarcinoma of the uterine cervix. Lab Invest 83:35–45.
Lenhard B, Hayes WS, Wasserman WW. 2001. GeneLynx: a gene-centric portal to the human genome. Genome Res 11:2151–2157.


Livingstone CD, Barton GJ. 1993. Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput Appl Biosci 9:745–756.
Ollila J, Lappalainen I, Vihinen M. 1996. Sequence specificity in CpG mutation hotspots. FEBS Lett 396:119–122.
Perrault I, Rozet JM, Gerber S, Kelsell RE, Souied E, Cabot A, Hunt DM, Munnich A, Kaplan J. 1998. A retGC-1 mutation in autosomal dominant cone-rod dystrophy. Am J Hum Genet 63:651–654.
Pontius JU, Wagner L, Schuler GD. 2003. UniGene: a unified view of the transcriptome. The NCBI handbook. Bethesda, MD: National Library of Medicine, NCBI. p 21–1–12. (online: www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.857, date accessed: 1 September 2004)
Pruitt KD, Maglott DR. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 29:137–140.
Riikonen P, Vihinen M. 1999. MUTbase: maintenance and analysis of distributed mutation databases. Bioinformatics 15:852–859.
Rowan A, Bataille V, MacKie R, Healy E, Bicknell D, Bodmer W, Tomlinson I. 1999. Somatic mutations in the Peutz-Jeghers (LKB1/STKII) gene in sporadic malignant melanomas. J Invest Dermatol 112:509–511.
Stenberg KA, Riikonen PT, Vihinen M. 2000. KinMutBase, a database of human disease-causing protein kinase mutations. Nucleic Acids Res 28:369–371.
Su GH, Hruban RH, Bansal RK, Bova GS, Tang DJ, Shekher MC, Westerman AM, Entius MM, Goggins M, Yeo CJ, Kern SE. 1999. Germline and somatic mutations of the STK11/LKB1 Peutz-Jeghers gene in pancreatic and biliary cancers. Am J Pathol 154:1835–1840.
Taylor SS, Radzio-Andzelm E, Hunter T. 1995. How do protein kinases discriminate between serine/threonine and tyrosine? Structural insights from the insulin receptor protein-tyrosine kinase. FASEB J 9:1255–1266.
Vihinen M, Lehväslaiho H, Cotton RD. 1999. Immunodeficiency mutation databases. In: Ochs HD, Smith CIE, Puck M, editors. Primary immunodeficiency diseases. A molecular and genetic approach. Oxford: Oxford University Press. p 443–447.
Ylikorkala A, Avizienyte E, Tomlinson IP, Tiainen M, Roth S, Loukola A, Hemminki A, Johansson M, Sistonen P, Markie D, Neale K, Phillips R, Zauber P, Twama T, Sampson J, Jarvinen H, Mäkelä TP, Aaltonen LA. 1999. Mutations and impaired function of LKB1 in familial and non-familial Peutz-Jeghers syndrome and a sporadic testicular cancer. Hum Mol Genet 8:45–51.