Proceedings of the 36th Hawaii International Conference on System Sciences - 2003
Comparative Genome Annotation for Mapping, Prediction and Discovery of Genes

Claudia Kappen & J. Michael Salbaum
Center for Human Molecular Genetics, Munroe-Meyer Institute, and Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198-5455.

Abstract
We have used comparative genome analyses to produce annotated maps for large genomic loci. The first example is a locus on mouse chromosome 9 that is syntenic to human chromosome 15. This effort relied on draft sequences from the human genome project, on our own draft sequences from mouse genomic DNA and, more recently, on sequences from the mouse genome project. Our strategy used reiterative searches of DNA, protein, STS and EST databases, as well as genome maps. In this fashion, we were able to assemble sequence contigs over a large region that comprises 14 genes. We will present our framework for data interpretation and demonstrate that unfinished sequences can be used to assemble maps of complex genomic loci with good accuracy. By focusing on one model locus, we will discuss limitations and advantages of this approach and provide criteria for the implementation of automated genome annotation strategies.
1. Introduction
With the sequencing of multiple complex genomes, such as the human genome, enormous amounts of information have become available. The tasks of completion, quality control, assembly and interpretation of this information may turn out to take longer than the time it took to generate the raw data. Thus, the available drafts of the human and the soon-to-be completed mouse genomes continue to be annotated on various levels. Even before completely finished versions are produced, however, limited regions of genomes have been studied and annotated in greater detail [1;2]. To date, the quality and accuracy of annotation for small regions far exceed those of whole-genome assemblies and annotation strategies. We here report our experience with genome annotation on the basis of draft sequences, as applied to a small region of human chromosome 15, and the progress of annotation of
this region over the past two years. Using this example as a model, we assess feasibility and strategy, technical limitations and possibilities for automation, quality and remaining problems, as well as the outlook for progress of genome annotation. Our interest in the specific region of human chromosome 15/mouse chromosome 9 stems from our discovery that it harbors three closely related genes of the immunoglobulin superfamily, Neogenin, putative neural cell adhesion molecule PUNC, and NOPE (Neighbor of PUNC/E11) [3]. This suggested that this genomic region might harbor a gene cluster. With this guiding hypothesis, we used genome annotation to investigate the location and order of the three genes and the possibility that other related genes might exist in this region. A second motivation came from the fact that the region was known to be located within a large interval associated with the complex genetic disease Bardet-Biedl syndrome [4]. Bardet-Biedl syndrome is characterized by retinal degeneration, hypogonadism, kidney abnormalities, digit and tooth abnormalities, obesity and diabetes (OMIM #209900). Identifying genes that map in the disease interval might identify the cause of this disease or at least provide broader information for the underlying causes of one or more of its pathologies.
2. Establishment of syntenic regions on mouse chromosome 9 / human chromosome 15
We started our investigations by compiling known information on gene location on chromosomes 9 (mouse) and 15 (human) to better characterize the extent of synteny. Sources for data were OMIM [5], the chromosome 15 map (through the Sanger Centre) at the Human Genetics Division of the University of Southampton School of Medicine [6], Gene Map'99 [7], MGI [8], and the Whitehead Institute Maps [9], as well as human-mouse homology maps available from NCBI,
0-7695-1874-5/03 $17.00 (C) 2003 IEEE
Figure 1: Gene map for a syntenic region on mouse chromosome 9/human chromosome 15
Microsatellite framework markers are given as coordinates for human chromosome 15; for mouse chromosome 9, the MGD map location is given. Intervals do not reflect physical scale. The three starting genes of interest, NEOGENIN, NOPE and PUNC, are colored in red (where experimental mapping information was available); the BBS4 critical region is also indicated in red. Turquoise marks map positions taken from the NCBI database, light green those from the UK database, and yellow those from the literature. Tan marks unmapped human genes for which mouse counterparts (purple) were mapped. Lines establish direct correspondence of orthologous genes. It is interesting to note that the mouse locus is in opposite direction from the centromere compared to human, and that the mouse locus is discontinuous relative to the human target region, indicating large-scale chromosomal rearrangements during the divergence of human and mouse species.
MGD, MGI (mammalian homology), and the Davis Human/Mouse Map (now available through NCBI). The summary is shown schematically in Figure 1. As is characteristic for syntenic regions, many genes are found in both human and mouse and, generally, in similar order. Interestingly, left to right in the schematic appears to represent the direction centromere to telomere on human chromosome 15, but the opposite direction on mouse chromosome 9. For the mouse, mapping information was scarcer than for human, and therefore such a map must be considered tentative. However, it clearly establishes the overlap of the critical region for Bardet-Biedl Syndrome with the chromosomal location of the three immunoglobulin superfamily genes Neogenin, Nope and Punc. This provided support for our hypothesis that any of these three or other genes in the critical interval could be candidate disease genes. On this basis, we constructed a genome annotation map for the region surrounding the three Ig-superfamily genes.

3. Generation of Gene Maps
DNA sequence reads typically run in stretches of 500-800 characters, with decreasing quality towards the end. If a given stretch of DNA were sequenced only once, the genome would consist of high and low quality areas intermingled. This is typically overcome by repeat sequencing, after which overlapping sequences are assembled into larger stretches of contiguous sequence, or “contigs” for short. Yet, matching the correct overlapping sequences depends on the length of overlap and on sequence quality, as any undetermined nucleotide with the character designation N increases the ambiguity of alignments. In theory, it then becomes possible that two different sequences, at an overlap, may be equally similar to a query sequence, thus producing two links when in reality
Figure 2: Genome annotation: Data levels and components
Individual BACs are marked by different colors. BACs were aligned according to overlap established by the presence of a common marker (arrows). Arrow colors represent: black: sequence tagged sites (STS), nucleotide sequence markers; burgundy: nucleotide sequence marker that is part of an open reading frame; light blue: microsatellite marker previously mapped to this interval. Only some of these could be mapped to the BACs (see text for explanation).
only one link can be correct because DNA is a linear molecule. In alignments of DNA sequences from databases that cover the whole genome, therefore, high cut-off levels need to be chosen to restrict contig formation to strictly linear assemblies. This leads to gaps between contigs that cannot be closed unless other strategies are employed. One computational strategy is to allow alignments with lower scores for selected ends, but this does not improve the quality of the underlying data. Quality improvement of DNA sequence data could only be achieved by repeat sequencing so many times that all ambiguities can be eliminated. This, on the other hand, is considered infeasible for large genomes such as human. Experimental strategies that increase data quality are (a) mate-pair selection and (b) marker mapping. a) Mate-pair selection relies on the knowledge that DNA fragments prior to sequencing are cloned into circular vectors, and the inserts are sequenced from both ends (see Figure 2). The information which two sequences belong to the same clone insert greatly reduces complexity, because, by definition, those two sequences can only reside on one contig [10]. Thus, the number of possible combinations for any one DNA sequence is significantly restricted. Secondly, pairing mates may enable gap closure between two previously unlinked contigs across ambiguous quality sequence or even in the absence of sequence information about the ambiguous region. b) Marker mapping was employed long before sequencing [11], by taking into account the existence of multiple but distinct stretches of dinucleotide repeats. With such “microsatellite” markers, it became possible to derive a map of genomes, on the basis of anonymous landmarks for each chromosome. These "sequence tagged sites" (STS), of course, should
then be present on individual clones, and in this way, entire clone sets have been mapped and ordered relative to each other, producing clone contigs (Figure 2). The anonymous microsatellite markers thus serve as a framework for ordering and annotating gene location and sequence data [12].

However, the assembly of different types of data (DNA sequence, mate-pairing data and framework markers) on a whole-genome scale is computationally complex enough that different algorithms and strategies produce maps that may be quite different from each other (compare NCBI [13], ENSEMBL [14] and GoldenPath [15], for example, or the two initial draft maps published by the public [16] and commercial [17] sequencing consortia for the human genome). While there are disagreements over the best strategies for improvement, most annotation specialists agree that the two most challenging tasks revolve around interpretation of data at multiple levels and around formalizing points of judgement, where human judgement is still superior. We have employed “annotation by hand” to map and discover genes in a selected region of human chromosome 15.

4. Establishment of a BAC contig for the region of interest on human chromosome 15
We established a map of BAC contigs by using the htgs division of GenBank with information up to December 2000, and Gene Map 99 for chromosome 15. In order to obtain information on the content of each BAC, its sequence was submitted to GenScan analysis [9;18], which returned probabilities for coding potential (open reading frames, ORF), as well as predictions of polypeptides/proteins encoded on a BAC. Even though
the DNA sequence information for each BAC is discontinuous and may be in either orientation, stretches of open reading frames can be used as unique markers. This allowed us to ask whether such signatures were also present on other BACs suspected to lie in that region. By reiterating the process of GenScan analysis and comparison of ORFs, we were able to establish overlaps between contigs. The process is schematically depicted in Figure 4, and made use of existing database information (blue lines), protein predictions (green lines), and constructed sequences (purple lines).

Difficulties encountered included partial coding sequences on one BAC corresponding to a larger ORF on another, and the presence of repetitive elements. Generally, repetitive elements with coding potential represent endogenous retroviruses or remnants thereof, and therefore protein predictions like “similar to DNA polymerase…” or “similar retrovirus…” are indicative of repetitive elements. Nevertheless, each such prediction was independently evaluated by searching protein sequence databases for perfect matches. Imperfect matches to the known cloned polymerases would support the classification as repetitive element, endogenous retrovirus or fragment thereof. Such elements can be misleading because of their frequent occurrence in the genome, and we therefore generally did not consider them useful markers.

Secondly, we used sequence tagged sites (STS) from Gene Map 99 for human chromosome 15 as markers. All 336 STSs between microsatellite markers D15S159 and D15S114 were screened against the non-redundant (nr) and high throughput genome sequencing (htgs) divisions of GenBank. For matches of individual STSs to BACs, we evaluated those hits with two criteria: (i) location of the BACs on chromosome 15, and (ii) potential overlap with the already identified BACs.

Third, we used BLAST [19] to search with the sequence of one BAC for overlap with other BACs in GenBank. Again, the presence of repetitive elements can misidentify a BAC on the basis of this shared sequence. We therefore only considered search hits on BACs that also shared other sequence similarities with the query. From over 137 BACs (the complete list known by December 2000) in the target interval, we were able to construct two contigs, one containing NEOGENIN and the other containing the two other genes of interest (PUNC and NOPE).

Considerable time was spent on attempts to link these two BAC contigs. The presence of extended repeat sequences proved to be the major obstacle. The second problem was that searches only identified other BACs that were not located on chromosome 15 (but on chromosomes 1 [5x], 3 [2x], 4 [2x], 6 [2x], 8 [2x], 9, 10, 11, 18 [1x each]). By performing several quality control assessments on those BACs (GenScan predictions, presence of cloned genes, previous mapping information), we were able to exclude the possibility that these matching BACs had been erroneously assigned to the wrong chromosome. Out of a total of 137 BAC matches evaluated, 18 (13%) were from non-chromosome 15 locations.

Finally, because BACs are composites and searches therefore yield only overall composite scores, we disassembled each query BAC into its components and used those individually to scan for matches. These searches have a convenient intrinsic quality control: the piece must find its parent BAC at 100% identity. Only in one case did this yield a BAC that further extended the contig. This strategy thus yielded some advance but may be limited in incremental usefulness. In summary, using coding potential and protein predictions as markers, we were able to construct two BAC contigs (Figure 3).
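The marker-based grouping of BACs described in this section can be illustrated with a minimal Python sketch. It is not the procedure we ran (the work was done by hand); the BAC and marker names, the function `build_contigs`, and the threshold `min_shared=2` are hypothetical and only mimic the rule that a hit was accepted when the BACs also shared other sequence similarities, with repeat-derived markers ignored.

```python
from collections import defaultdict

def build_contigs(bac_markers, repeat_markers=frozenset(), min_shared=2):
    """Group BACs into candidate contigs from shared, non-repetitive markers."""
    # Repeat-derived markers occur too often in the genome to be informative.
    cleaned = {bac: set(m) - set(repeat_markers) for bac, m in bac_markers.items()}
    bacs = sorted(cleaned)
    # Link two BACs only if they share at least `min_shared` informative markers.
    links = defaultdict(set)
    for i, a in enumerate(bacs):
        for b in bacs[i + 1:]:
            if len(cleaned[a] & cleaned[b]) >= min_shared:
                links[a].add(b)
                links[b].add(a)
    # Connected components of the link graph are the candidate contigs.
    seen, contigs = set(), []
    for start in bacs:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in sorted(links[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        contigs.append(sorted(component))
    return contigs

# Hypothetical marker content (ORF signatures, STSs, one retroviral repeat).
example = {
    "BAC_A": {"orf1", "sts1", "retro"},
    "BAC_B": {"orf1", "sts1", "sts2", "retro"},
    "BAC_C": {"sts2", "orf2", "retro"},
    "BAC_D": {"orf2", "sts2", "orf3"},
    "BAC_E": {"retro", "orf9"},
}
contigs = build_contigs(example, repeat_markers={"retro"})
print(contigs)
```

In this toy input, the shared repeat alone would chain all five BACs into one group; once it is excluded and two shared markers are required, two separate contigs and one unplaced BAC remain, which is the kind of situation described above for the NEOGENIN and PUNC/NOPE contigs.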
Figure 3: Establishment of BAC contigs for a small segment of human chromosome 15 BACs are outlined as colored bars. Gaps indicate regions/markers not covered by the BAC. Markers are given in black (STS), dark red (EST), and blue (microsatellite). In only three cases was a microsatellite marker recovered on a BAC (see text for explanation), and the availability of EST markers considerably increased marker density and mapping resolution. Overall, of 336 markers evaluated, 58 were useful in construction of this BAC contig map.
Table 1. Gene Predictions for the 09/15 region.
Kappen & Salbaum, 2000: RIS, PDCD7, CLPX, CILP, DOP, PUNC, NOPE, BNNO, BIND/HSPC121, GLOB, SLC24A1, IRLB, NEOGENIN, HCN4
NCBI build 30: RIS, HP1, PDCD7, CLPX, HP2 (=DOP), CILP, HP3, PITSLRE, PUNC, HP4 (=NOPE), DPP8 (=BNNO), HSPC121/BIN, HP5 (=GLOB), SLC24A1, IRLB, NEOGENIN, HCN4
UCSC: RIS, PDCD7, CLPX, FLJ20509 (=DOP), CILP, DDM36 (=NOPE), DPP8, BIND, HP (=GLOB), SLC24A1, NEOGENIN, HCN4, SDFR1
ENSEMBL: RIS, TX1, TX2, PDCD7, CLPX, Q9NXO3 (=DOP), CILP, PUNC, TX2 (=NOPE), DPP8, BIND, TX3 (=GLOB), SLC24A1, IRLB, DENN, NEOGENIN, HCN4, TX4 (not SDFR1)
Legend: Gene names, predicted hypothetical proteins (HP) or predicted transcripts (TX) are listed. Discordance is evident for the location of DOP, for the presence and identity of additional coding regions / transcripts, and two genes are not featured in the UCSC version. Interestingly, the controversial regions are those for which only draft sequence is currently available. Overall, the transcriptional orientations and the majority of annotations were concordant between all databases.
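The comparison summarized in the legend can be sketched in a few lines of Python. The gene lists below are one reading of Table 1, with each database-specific label replaced by the gene it is equated with in the table (for example HP2, FLJ20509 and Q9NXO3 become DOP, and DPP8 becomes BNNO); the assignment of entries to columns, the variable names and the helper functions are assumptions made for illustration only.

```python
# Gene content per source, read from Table 1 (an assumed column assignment);
# labels marked "(=X)" in the table are already replaced by X.
REFERENCE = ["RIS", "PDCD7", "CLPX", "CILP", "DOP", "PUNC", "NOPE", "BNNO",
             "BIND", "GLOB", "SLC24A1", "IRLB", "NEOGENIN", "HCN4"]

SOURCES = {
    "NCBI build 30": ["RIS", "HP1", "PDCD7", "CLPX", "DOP", "CILP", "HP3",
                      "PITSLRE", "PUNC", "NOPE", "BNNO", "BIND", "GLOB",
                      "SLC24A1", "IRLB", "NEOGENIN", "HCN4"],
    "UCSC": ["RIS", "PDCD7", "CLPX", "DOP", "CILP", "NOPE", "BNNO", "BIND",
             "GLOB", "SLC24A1", "NEOGENIN", "HCN4", "SDFR1"],
    "ENSEMBL": ["RIS", "TX1", "TX2", "PDCD7", "CLPX", "DOP", "CILP", "PUNC",
                "NOPE", "BNNO", "BIND", "GLOB", "SLC24A1", "IRLB", "DENN",
                "NEOGENIN", "HCN4", "TX4"],
}

def missing_from(reference, other):
    """Genes of the reference map that the other annotation does not list."""
    return [gene for gene in reference if gene not in other]

def extra_in(reference, other):
    """Entries of the other annotation that are absent from the reference map."""
    return [gene for gene in other if gene not in reference]

def dop_before_cilp(genes):
    """True if DOP is listed before CILP (the discordant placement)."""
    return genes.index("DOP") < genes.index("CILP")

for name, genes in SOURCES.items():
    print(name, "missing:", missing_from(REFERENCE, genes),
          "extra:", extra_in(REFERENCE, genes),
          "DOP before CILP:", dop_before_cilp(genes))
```

Under this reading, the UCSC list lacks PUNC and IRLB, all three databases place DOP on the other side of CILP relative to our map, and the extra entries are the additional hypothetical proteins and transcripts mentioned in the legend.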
5. Establishment of DNA sequence contigs
The above strategy was extremely useful for mapping the order and orientation of pieces in a given BAC. For this, all BACs with overlaps (a total of 11, see Figure 3) were decomposed into their pieces (on average 20 per BAC) and all pieces were compared with each other to identify overlaps. These were used to construct sequence contigs. While not all pieces could be joined into a single assembly, and about 30 could not be joined at all, we were able to order and orient pieces within a BAC and generally reduce complexity by a factor of at least 2 to 3. Where DNA sequence pieces encoded open reading frames, the quality of assembly was validated by repeated GenScan analysis of sequence contigs.

6. Genes in the assembled region: discovery, order and orientation
From the BAC coding potential predictions and the assembled sequence contigs, several open reading frames were predicted in addition to previously cloned genes (Table 1). Genes SLC24A1, NEOGENIN and PUNC had been previously cloned and mapped to the region of interest by either physical or molecular methods. In addition, genes RIS and HCN4 were known but had no map annotation at the time of our initial analysis. This suggested that the remaining uncharacterized ORFs may represent novel genes. The assembly of BAC contigs had established the location of each gene relative to its neighbors, and the assembly of the DNA sequence contigs allowed us to determine the putative orientation of transcription for each gene. To confirm this orientation, GenScan analyses and prediction of splice donor and acceptor sites were used. For previously unknown genes, we then undertook more detailed analyses to define exon-intron structure and the complete cDNA sequence. Here, our analyses were greatly aided by comparison to mouse cDNA sequences (and genomic DNA sequences) that we had generated experimentally ([3], and Salbaum, unpublished data).

7. Discovery of genes by genome annotation
Expressed Sequence Tags (ESTs) are short sequence reads derived from cDNA produced from the mRNA of the tissue of origin. This DNA therefore represents actively transcribed genes, that is, the genes that are expressed in the tissue of origin. By now, many such ESTs have been produced from tissues, cell lines, cells or libraries. Even without detailed information on the source (the original libraries were pools of tissue, so that specific information is lacking on the cell type from which a given EST may originate), ESTs are genuine tags for an expressed gene and can be used to assemble the coding cDNA/mRNA for a gene. As long as the whole cDNA is
covered by ESTs, overlapping ESTs can be assembled into a contig that extends to the most outlying ends of all ESTs. Reiterative searches of EST databases will be required, and these may identify multiple matches (EST databases may contain up to 1,000 ESTs for one gene), but only matches of 100% identity represent the identical gene. Lower scores identify either (i) allelic variants, (ii) related members of a gene family, or ESTs derived from (iii) incompletely or (iv) alternatively spliced RNA. When this is the case, reiterative searches will distinguish between possibilities (i;ii) and (iii;iv): in the first case, the sequence similarity will be high along the entire length of query and match; in the latter case, sequences will be 100% identical but interrupted by stretches of non-identity (which signify introns or untranslated regions of an RNA). This process is depicted in Figure 4, where red lines indicate the reiterative construction of a cDNA from ESTs, and green lines indicate computer-aided predictions of the proteins encoded by the open reading frames (ORFs).

In this fashion, we identified the complete coding sequences for three novel genes and termed them (according to location) BNNO (Between NEOGENIN and NOPE), GLOB (Gene left of BNNO) and DOP (Downstream of PUNC), respectively. Since then, BNNO has been independently verified by experiment (termed DPP8 [20]) in the human genome. In all cases, the final cDNA structure was validated against genomic DNA predictions. In addition, we used cDNA information for all ORFs to find out whether there might be exon matches on BAC pieces that we had previously been unable to map. This could be the case when one exon was present on a BAC/BAC piece and might have been missed by GenScan or misincorporated into the prediction of another ORF. The latter case would have been missed in the initial screen, but found upon more extensive evaluation and searches at the protein level.
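The rules for interpreting EST matches given in this section lend themselves to a simple decision procedure, sketched here in Python. This is an illustration rather than a tool we used: the input format (a list of aligned blocks, each given as matching positions and block length), the function name `classify_est_hit`, and the two thresholds are hypothetical choices.

```python
def classify_est_hit(blocks, high_similarity=0.90, low_similarity=0.50):
    """Classify an EST match from per-block identity along the alignment.

    blocks: list of (matching_positions, block_length) tuples in query order.
    The thresholds are illustrative assumptions, not values from the study.
    """
    if not blocks:
        raise ValueError("at least one aligned block is required")
    identities = [matches / length for matches, length in blocks]
    perfect = [i for i in identities if i == 1.0]
    poor = [i for i in identities if i < low_similarity]

    # 100% identity over the whole match: the EST represents the identical gene.
    if len(perfect) == len(identities):
        return "identical gene"
    # Perfect blocks interrupted by stretches of non-identity: introns or
    # untranslated regions, i.e. incompletely or alternatively spliced RNA.
    if perfect and poor and len(perfect) + len(poor) == len(identities):
        return "incompletely or alternatively spliced RNA"
    # High but imperfect similarity along the entire length.
    if all(i >= high_similarity for i in identities):
        return "allelic variant or related family member"
    return "unclassified"

print(classify_est_hit([(300, 300)]))
print(classify_est_hit([(120, 120), (10, 80), (200, 200)]))
print(classify_est_hit([(290, 300)]))
```

Distinguishing an allelic variant from a related family member, or incomplete from alternative splicing, still requires the reiterative searches and the judgement described above; the sketch only separates the two broad groups (i;ii) and (iii;iv).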
Figure 4: Integrated Framework for Comparative Genome Annotation
Existing databases are labeled green, experimental data from our own sequencing efforts in turquoise, computer-aided predictions are labeled red, the assembly process in orange, and results/newly generated information in yellow. Lines indicate information flow and reiterative searches at the level of nucleotide sequences (blue for human, turquoise for mouse input), cDNA sequences (black), EST sequences (red), and computer-aided predictions for DNA sequences and proteins (purple).
8. Conclusions and Outlook for future developments in genome annotation
Our combined analyses resulted in a composite map far more accurate than those provided by either the public or the commercial genome project. While employing manual annotation and human judgement, we also developed a process framework that should be amenable to automation (Figure 4). A very useful feature of our approach was to use information for human and mouse species comparatively, which is possible, at least for gene-encoding regions, due to the strong evolutionary conservation of gene sequence and structure between mammals. We have here described this framework and its application, which resulted in discovery of three previously unknown genes and a high quality annotation map from draft sequences.

Since we submitted our annotation for publication in August (and updated prior to printing in December) 2000, two developments allow us now to assess accuracy and quality of the annotation, and identify factors that will improve quality in genome annotation:

1) Updated annotations of the human genome have been produced; as the basis for this article, we chose the June 2002 freeze of goldenPath ([15]; this server came online in August 2001 and was therefore unavailable for our initial analyses), the build 30 version of the NCBI database [13] and the 28 August 2002 update of ENSEMBL [14].

2) The Trans-NIH Mouse Initiative selected part of the mouse region on chromosome 9 that corresponds to human chromosome 15 for priority sequencing in the Mouse Genome project with the quality level of “finished sequence” (Salbaum, unpublished data). The Mouse Genome Map is also in the process of being assembled, and we here refer to the July 12, 2002 freeze of ENSEMBL for the mouse.

These two developments allow us to evaluate both experimental factors and computational strategies in genome annotation.
(i) The map position of our region of interest has now been refined to between bases 60191726 and 61912400 for the contig encompassing NOPE and PUNC, and bases 69109000 and 70101000 for the contig encompassing NEOGENIN and HCN4 in the NCBI build 30 map. The assembled super contig NT_010265 (bases 58301338 to 63984708) contains finished as well as draft sequence between CLPX and PUNC, between BNNO and GLOB, and beyond IRLB.

(ii) The two separate contigs for the human locus also appear to be separated in the mouse genome (ENSEMBL). Experimental support for the close linkage of Neogenin and Glob in the mouse came from BAC 286 [3]. Since the mouse genome has been sequenced at much higher coverage by now (the region was essentially unsequenced two years ago), it appears that other BAC clones span a larger distance, and we therefore have to assume that BAC 286 had undergone a rearrangement. This is an experimental possibility, and highlights the importance of quality of BAC libraries, as well as the benefit of using either multiple BACs (which were not available for the original experiments) or multiple mapping strategies.

(iii) Both BAC contigs that we assembled overlapped with the critical genomic region for the inherited disorder Bardet-Biedl Syndrome. Since our study, the causative gene for BBS4 has been cloned [21], and was found to encode a protein with similarity to O-linked acetylglucosamine transferase. This gene is localized at about 68 Mb, between our two BAC contigs. While the motivation for our original study was to identify candidate genes within the BBS4 region, we were unable to close the gap between the two BAC contigs and thus missed the gene that causes this disorder. The corresponding orthologous gene in the mouse also maps between Neogenin and Irlb (ENSEMBL).

(iv) When the first drafts of the Human Genome were produced by the public consortium [22] and a commercial firm [17], both versions differed in their predictions from our published map. Even today, the three major publicly accessible genome browsers differ in their gene assignments for the region (see Table 1). For example, the UCSC browser is missing PUNC entirely and, possibly due to the close relationship of PUNC and NOPE, groups ESTs for both under NOPE. Other explanations are that the prediction algorithms differ slightly, and that the region still contains areas where finished sequence is not yet available. An alternative possibility is that there may be rearrangements or polymorphisms in the region, making final mapping more complex. In this regard, it is helpful to evaluate the extent to which a prediction of a transcript is supported by EST data (which is not the case for TX1, for example). The inclusion of independently derived data will add important criteria in establishing validity for any prediction.
Generally, however, the maps available two years later conform very well with the one we originally produced entirely on the basis of draft sequences, both with regard to gene content and transcriptional orientation. This provides strong support for the validity of our approach and its feasibility for annotation of other regions in the human genome or for genomes of other organisms.

(v) Comparative genome analysis is particularly valid for coding regions / cDNAs, where mouse and human show strong conservation. Our sequence analyses revealed such low similarity in non-coding regions that mouse genomic sequence information (in the form of our own STS data) was not helpful for ordering and orientating the human locus. On the other hand, this low degree of similarity can be exploited to locate functionally significant (and presumably conserved) regions in non-coding DNA (see accompanying manuscript on this topic).
(vi) The discrepancies in the 09/15 region between a manual annotation and whole-genome annotation suggest that other genomic regions could also be annotated with greater accuracy by hand. Alternatively, the fidelity of automatic annotation should improve with focus on smaller regions, suggesting that the precision of annotation can be improved through inclusion and evaluation of data at different levels (integration of multiple databases) even in the absence of human judgement. Nevertheless, cases will remain where curation may be the only solution to discrepancies.

(vii) For the experimentalist, the current status of genome annotation suggests that critical evaluation is required before accepting predictions, and sometimes even map positions. It can be expected that the focus on smaller regions, and inclusion of data at all possible levels, will increase the resolution and precision of genome maps. For the computational scientist, the ongoing need for genome annotation presents the challenge to develop tools that integrate multiple databases. Of particular use would be reiterative approaches that may be able to approximate manual annotation and human judgement.
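The coordinates quoted under points (i) and (iii) allow a quick arithmetic check of how large the unresolved gap between the two contigs was and whether the BBS4 position falls into it. The short Python sketch below uses the NCBI build 30 coordinates from point (i); the variable and function names are hypothetical, and the BBS4 position is only the rounded value of "about 68 Mb" given in point (iii).

```python
# Contig coordinates from point (i), NCBI build 30 (base positions).
PUNC_NOPE_CONTIG = (60_191_726, 61_912_400)
NEOGENIN_HCN4_CONTIG = (69_109_000, 70_101_000)
# "About 68 Mb" from point (iii); an approximate position, not an exact one.
BBS4_APPROX = 68_000_000

def gap_between(left, right):
    """Return the uncovered interval between two non-overlapping contigs."""
    if left[1] >= right[0]:
        raise ValueError("contigs overlap or are given in the wrong order")
    return (left[1], right[0])

def lies_in(position, interval):
    """True if the position falls strictly inside the interval."""
    start, end = interval
    return start < position < end

gap = gap_between(PUNC_NOPE_CONTIG, NEOGENIN_HCN4_CONTIG)
gap_size = gap[1] - gap[0]
bbs4_in_gap = lies_in(BBS4_APPROX, gap)
print(f"gap: {gap[0]}-{gap[1]} ({gap_size} bases), BBS4 in gap: {bbs4_in_gap}")
```

With these numbers the gap spans roughly 7.2 Mb and contains the approximate BBS4 position, consistent with the statement in point (iii) that the causative gene lies between the two BAC contigs.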
9. Acknowledgements:
This work was performed on two iMacs using the databases and Web-based servers mentioned in the text and MacMolly (SoftGene, Berlin, Germany) for off-line sequence comparisons and translation to protein. It was funded in part by grants NIH-RO1-HD37408 (to C.K.) and NIH-R21-DK59280 (to J.M.S.). We thank Saralyn Fisher and Melanie Schrack for help with the preparation of the manuscript.

10. References:
[1] Loots, G.G., R.M. Locksley, C.M. Blankespoor, Z.E. Wang, W. Miller, E.M. Rubin, and K.A. Frazer (2000): Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136-140.
[2] Crabtree, J., T. Wiltshire, B. Brunk, S. Zhao, J. Schug, C.J. Stoeckert, Jr., and M. Bucan (2001): High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res. 11, 1746-1757.
[3] Kappen, C., and J.M. Salbaum (2001): 09/15: Comparative genomics of a conserved chromosomal region associated with a complex human phenotype. Genomics 73, 171-178.
[4] Bruford, E.A., R. Riise, P.W. Teague, K. Porter, K.L. Thomson, A.T. Moore, M. Jay, M. Warburg, A. Schinzel, N. Tommerup, K. Tornqvist, T. Rosenberg, M. Patton, D.C. Mansfield, and A.F. Wright (1997): Linkage mapping in 29 Bardet-Biedl syndrome families confirms loci in chromosomal regions 11q13, 15q22.3-q23, and 16q21. Genomics 41, 93-99.
[5] http://www.ncbi.nlm.nih.gov/omim/
[6] http://ceder.genetics.sothon.ac.uk/pub/chrom15/map.html
[7] http://www.ncbi.nlm.nih.gov/genemap/map.cgi?CHR=15
[8] http://www.informatics.jax.org
[9] http://www-genome.wi.mit.edu/cgi-bin/mouse/index
[10] Dietrich, W.F., N.G. Copeland, D.J. Gilbert, J.C. Miller, N.A. Jenkins, and E.S. Lander (1995): Mapping the mouse genome: current status and future prospects. Proc. Natl. Acad. Sci. USA 92, 10849-10853.
[11] Jeffreys, A.J., V. Wilson, and S.L. Thein (1985): Hypervariable 'minisatellite' regions in human DNA. Nature 314, 67-73.
[12] Cox, D.R., E.D. Green, E.S. Lander, D. Cohen, and R.M. Myers (1994): Assessing mapping progress in the Human Genome Project. Science 265, 2031-2.
[13] http://www.ncbi.nlm.nih.gov/cgibin/Entrez/maps.cgi?org=hum&chr=15
[14] http://www.ensembl.org/Homo_sapiens/
[15] http://genome.ucsc.edu
[16] McPherson, J.D,. M. Marra, L. Hillier, R.H. Waterston, A. Chinwalla, J. Wallis, M. Sekhon, K. Wylie, E.R. Mardis, R.K. Wilson, R. Fulton, T.A. Kucaba, C. Wagner-McPherson, W.B. Barbazuk, S.G. Gregory, S.J. Humphray, L. French, R.S. Evans, G. Bethel, A. Whittaker, J,L. Holden, O.T. McCann, A. Dunham, C. Soderlund, C.E. Scott, D.R. Bentley, G. Schuler, H.C. Chen, W. Jang, E.D. Green, J.R. Idol, V.V. Maduro, K.T. Montgomery, E. Lee, A. Miller, S. Emerling, R. Kucherlapati, R. Gibbs, S. Scherer, J.H. Gorrell, E. Sodergren, K. ClercBlankenburg, P. Tabor, S. Naylor, D. Garcia, P.J. de Jong, J.J. Catanese, N. Nowak, K. Osoegawa, S. Qin, L. Rowen, A. Madan, M. Dors, L. Hood, B. Trask, C. Friedman, H. Massa, V.G. Cheung, I.R. Kirsch, T. Reid, R. Yonescu, J. Weissenbach, T. Bruls, R. Heilig, E. Branscomb, A. Olsen, N. Doggett, J.F. Cheng, T. Hawkins, R.M. Myers, J. Shang, L. Ramirez, J. Schmutz, O. Velasquez, K. Dixon, N.E. Stone, D.R. Cox, D. Haussler, W.J. Kent, T. Furey, S. Rogic, S. Kennedy, S. Jones, A. Rosenthal, G. Wen, M. Schilhabel, G. Gloeckner, G. Nyakatura, R. Siebert, B. Schlegelberger, J. Korenberg, X.N. Chen, A. Fujiyama, M. Hattori, A. Toyoda, T. Yada, H.S. Park, Y. Sakaki, N. Shimizu, S. Asakawa (2001): A physical map of the human genome. Nature 409, 934-41. [17] Venter, J.C., M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, C.A. Evans, R.A. Holt, J.D. Gocayne, P. Amanatides, R.M. Ballew, D.H. Huson, J.R. Wortman, Q. Zhang, C.D. Kodira, X.H. Zheng, L. Chen, M. Skupski, G. Subramanian, P.D. Thomas, J. Zhang, G.L. G. Miklos, C. Nelson, S. Broder, A.G. Clark, J. Nadeau, V.A. McKusick, N. Zinder, A.J. Levine, R.J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. 
Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C.
Evangelista, A.E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T.J. Heiman, M.E. Higgins, R.R. Ji, Z. Ke, K.A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G.V. Merkulov, N. Milshina, H.M. Moore, A.K. Naik, V.A. Narayan, B. Neelam, D. Nusskern, D.B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan (2001): The sequence of the human genome. Science 291, 1304-1351.
[18] Burge, C.B., and S. Karlin (1998): Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354.
[19] Altschul, S.F., W.R. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990): Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
[20] Abbott, C.A., D.M. Yu, E. Woollatt, G.R. Sutherland, G.W. McCaughan, and M.D. Gorrell (2000): Cloning, expression and chromosomal localization of a novel human dipeptidyl peptidase (DPP) IV homolog, DPP8. Eur. J. Biochem. 267, 6140-6150.
[21] Mykytyn, K., T. Braun, R. Carmi, N.B. Haider, C.C. Searby, M. Shastri, G. Beck, A.F. Wright, A. Iannaccone, K. Elbedour, R. Riise, A. Baldi, A. Raas-Rothschild, S.W. Gorman, D.M. Duhl, S.G. Jacobson, T. Casavant, E.M. Stone, and V.C. Sheffield (2001): Identification of the gene that, when
mutated, causes the human obesity syndrome BBS4. Nat Genet 28, 188-91. [22] Lander, E.S., L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J.P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J.C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R.H. Waterston, R.K. Wilson, L.W. Hillier, J.D. McPherson, M.A. Marra, E.R. Mardis, L.A. Fulton, A.T. Chinwalla, K.H. Pepin, W.R. Gish, S.L. Chissoe, M.C. Wendl, K.D. Delehaunty, T.L. Miner, A. Delehaunty, J.B. Kramer, L.L. Cook, R.S. Fulton, D.L. Johnson, P.J. Minx, S.W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J.F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier (2001): Initial sequencing and analysis of the human genome. Nature 409, 860-921.