Computational Challenges in Comparative Genomics: A Tutorial

BERNARD M.E. MORET, WITH WEBB C. MILLER, PAVEL A. PEVZNER, AND DAVID SANKOFF

1 Introduction

Comparative approaches have long been a mainstay of biology and medicine. In part, this is due to necessity: many organisms and systems are difficult to procure or to maintain in the laboratory, while ethical concerns have prevented experimentation with humans and are reducing experimentation with higher mammals. Thus, in particular, much of what we know about humans has been learned through animal models. More importantly, comparative approaches embody an evolutionary approach to biology—and that, ever since Darwin, is what has enabled biologists to make sense of the extremely complex systems they study. The great pioneer Theodosius Dobzhansky famously wrote an essay entitled "Nothing in biology makes sense except in the light of evolution," in which he argues that the large amounts of data collected by field and bench biologists can only reveal their structure through an analysis based on evolution. Since evolutionary processes can only be understood through the comparison of various products of these processes, comparative approaches must form the foundation of any biological research method. The case is even more compelling today than when Dobzhansky wrote his essay: with the advent of high-throughput instruments for molecular biology, and now for other aspects of biology and the life sciences, the amount of data collected has exploded and continues to grow at an exponential rate. The kind of meticulous craft used in the early study of, e.g., genomic sequences simply cannot keep up with the rate of data collection, nor have experimental validation methods kept up with high-throughput instruments. We are thus faced with the necessity of using computational methods to make sense of the massive data accumulating in genomic, proteomic, metabolomic, morphological, physiological, neurological, clinical, and other databases. These computational methods, be they data mining, machine learning, or combinatorial optimization, all rely on basic models derived from our knowledge of a few well-studied organisms, and thus all remain comparative, even when the comparison is not explicit—changing model parameters is tantamount in most cases to a quantification of differences between the system under study and that used to derive the original model.

Comparative genomics is faced with what is, for now, the most daunting of these avalanches of data: genomic data, in the form of sequence data, accumulates at an exponential rate, doubling approximately every year and a half. (That rate, incidentally, even exceeds Moore's law, the observation made by Intel's cofounder G.E. Moore that the density of transistors on a commodity semiconductor chip doubles every two years—so that hardware capabilities are, in effect, falling farther and farther behind what may be needed to process the new data.) Whole-genome sequences are now routinely produced for bacteria and are getting increasingly easy to produce for vertebrate genomes, but a full understanding of even a single genome, that is, its structure, how its parts interact and are controlled, and the nature of the evolutionary processes that shape genomes, remains far away. In the case of the heavily studied human genome, for instance, we have some understanding of the coding genes (a bit under 2% of the entire genome by length) and a fair start on noncoding genes and other conserved elements (a bit under 3% of the entire genome), but we remain nearly clueless about the other 95%. Yet getting to even that apparently modest level has only been possible through comparative approaches.
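A back-of-the-envelope computation makes this widening gap concrete. This is a minimal sketch using only the two doubling times quoted above; everything else is arithmetic:

```python
def fold_increase(years, doubling_time):
    """Fold increase over `years` for a quantity with a fixed doubling time."""
    return 2 ** (years / doubling_time)

data = fold_increase(10, 1.5)      # sequence data: roughly 100x per decade
hardware = fold_increase(10, 2.0)  # transistor density: 32x per decade
print(f"data {data:.0f}x vs hardware {hardware:.0f}x "
      f"-> the gap grows {data / hardware:.1f}x per decade")
```

At these rates, the volume of data per unit of processing power roughly triples every ten years.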

Comparative approaches are based on the identification of conserved patterns—the study of evolution is indeed just as much the study of conservation. In the case of comparative genomics, positive or negative selection helps conserve regions of the genome, while regions evolving neutrally are free to vary and thus expected to diverge more rapidly. Thus the basis of comparative genomics is the identification and mapping of conserved regions. Since selection pressure for conservation is linked to function, the identification of regions conserved across a range of genomes leads naturally to the conjecture that these conserved regions play similar functional roles. (Such conjectures, at least for now, need to be verified experimentally.) The conservation of certain groups of genes forming similar pathways in several organisms leads us to conjecture that the pathways themselves may be conserved, in which case other related organisms should possess a similar group of genes; hence, if some, but not all, of these genes have been identified in a related organism, it is reasonable to conclude that the other genes are also present and to conduct a search targeted at these specific "missing" genes. This principle is widely used in the identification of genes in related species and can enrich both ends of a pairwise comparison. On the other hand, finding genes in one subgroup of organisms that appear to have no similarity to any genes in other related organisms can lead us to conjecture the occurrence of a lateral gene transfer (especially in bacteria) or a gene duplication at some past time; finding these genes in some unrelated group of organisms can strengthen the conjecture of a lateral gene transfer. Since lateral gene transfer is thought to play a major role in the acquisition of drug resistance or virulence in pathogenic bacteria, identifying the event and the group of genes thus transferred is of crucial importance to human health; and since transferring useful groups of genes through artificial means can lead to improved crops, the same tools are very important in genetic engineering. Finally, the sequencing of eukaryotic genomes led to the discovery that these genomes include numerous duplicated regions, some very large and some apparently made of nested duplications. These duplications represent both a serious obstacle for genome sequencing and, more importantly, a chance to witness evolution in action: because copies may escape selection pressure, gene duplication is thought to be the key to the development of novel gene functions.

Now, DNA sequences have been compared ever since the beginning of DNA sequencing, and genetic maps have been compared since the beginning of the 20th century, yet neither constitutes what is generally viewed today as comparative genomics. It is the availability of whole-genome sequences that has given rise to this area of research, which is distinguished both by its potential (since it addresses entire genomes, it can in principle identify patterns not present at local scales and thus elucidate complex mechanisms that involve many areas of the genome) and by its scale and complexity. The latter mandate the use of computational methods; the former provides the impetus that has caused this area to grow enormously in the last few years. For those researchers working in comparative genomics, the much quoted phrase "postgenomic era" simply means that, now that we can get our hands on complete genomes for a variety of organisms and even, for simpler genomes, a variety of individuals, the task of understanding genomes can finally begin. Comparative genomics can be used even when the whole-genome sequence is not known in detail for each organism under study: for instance, in human genetics, one can use the "generic" whole-genome sequence of the human along with dense SNP array data for specific individuals to study patterns of change within the human species and how these patterns relate to phenotypic traits, especially those linked to inherited disorders or predispositions to certain diseases.

The primary goal of comparative genomics is thus to delimit regions of a genome and tag each region with a label such as "under positive selection," "exon," "promoter region," etc., and to do so by using comparisons with other genomes. This is similar to working with several very large magazines describing closely related actions, written in the same alphabet, but in somewhat different languages, languages that are for the most part not understood. These magazines also contain a very large amount of advertising and other nonspecific content that can vary quite a bit from one magazine to the other; we do not understand any of this content (the so-called "junk" DNA) and often have serious trouble telling it apart from the "text."

From some limited understanding of a few text passages in each magazine, we attempt to identify punctuation marks, words, sentences, and eventually larger motifs (these pages have to do with processing sugars, those with controlling replication, etc.), all by comparing the texts back and forth, building models using statistics and combinatorics, and running optimization and machine-learning algorithms on the data. The models attempt to characterize the structure of words and the syntax of sentences (e.g., how a gene can be formed of exons and introns and surrounded by control elements), but also how these constructs change from one magazine to the other. In this tutorial, we focus on the computational challenges, that is, on the development of models and algorithms for the analysis of complete genomes. Biological results obtained in comparative genomics are discussed in the references.

2 Whole Genomes

Whole-genome sequences are now thought of as almost routine and may indeed become so in the near future, but it is worth considering what such data really consist of. What most of us would envision, that is, for each chromosome, a linear (or circular, as the case may be) sequence of nucleotides, is known as a finished genome sequence. Finished sequences are only available for a few eukaryotic organisms so far, such as human and mouse, and even these still have some gaps. (Prokaryotic genomes being much simpler, they are available in much larger numbers and with fewer gaps.) For most organisms, what is actually available is a close approximation, in which chromosomal sequences have gaps, even some fairly large ones, as well as pieces that cannot be reliably placed along a chromosome; these are known as draft assemblies—most of the genome is sequenced, but some regions with very long repeats may not be resolved and other regions may not have had sufficient coverage for reliable assembly. For many organisms, the overall coverage is fairly low; the resulting sequences suffer from more numerous and larger gaps than in draft assemblies and also have a higher error probability at each position. The new high-throughput sequencing technologies give very high coverage, but very short reads (as low as 30–50 bases), leading to more severe problems with repeats than when using traditional Sanger sequencing. Yet the latest technologies can sequence most prokaryotic genomes de novo in a few days at most and for a very modest cost; much remains to be done in order to extend this performance to eukaryotic genomes, but the "$1,000 genome" no longer appears unrealistic.
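The tradeoff between coverage, read length, and gaps can be made concrete with the classical Lander-Waterman statistic, which is not developed in this tutorial and which ignores repeats (precisely the complication that short reads aggravate): with N randomly placed reads of length L on a genome of length G, the coverage is c = NL/G and roughly N·e^(-c) contigs are expected. A minimal sketch, with a made-up 5-Mb genome and made-up read counts:

```python
import math

def expected_contigs(genome_size, read_length, num_reads):
    """Lander-Waterman estimate: with coverage c = N*L/G, assembly of
    randomly placed reads is expected to leave about N*exp(-c) contigs."""
    c = num_reads * read_length / genome_size
    return c, num_reads * math.exp(-c)

# Same 8x coverage of a hypothetical 5-Mb genome, with long and short reads:
for name, length, count in [("Sanger reads", 800, 50_000),
                            ("short reads", 40, 1_000_000)]:
    cov, contigs = expected_contigs(5_000_000, length, count)
    print(f"{name}: {cov:.0f}x coverage, ~{contigs:.0f} contigs expected")
```

At identical coverage, the short reads leave about twenty times as many contigs, even before repeats are taken into account.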

At the very least, we can expect an avalanche of new data, with thousands of new prokaryotic genomes and at least partial assemblies of hundreds of new eukaryotic ones within the next few years.

So what do we want to know about these sequences? First, we want to partition the genome into regions under positive, neutral, and negative selection. Next, of course, we want to identify all genes, coding and noncoding; in eukaryotes, we want to identify all exons making up each gene. Next, we want to group these genes into gene families, that is, we want to identify homologs. Once that is done, we want to reconstruct (some parts of) the history of the gene families, identifying duplications and losses, and figuring out which parts are the result of lateral transfers and where the foreign DNA came from. As part of this process, which is almost invariably comparative, we also want to determine homologies and orthologies across the organisms under study.

What happened to make these genomes different if, as all findings to date seem to indicate, all life on this planet can be traced back to a common origin? That is, how do genomes evolve? It is not the purpose of this tutorial to tackle this complex topic in depth (the bibliography suggests some good introductions to it), but we do need some basic notion of the evolutionary events that can affect the structure and composition of a genome in order to devise models and compute with them. A simplified list of such events includes:

- Nucleotide-level changes, such as point mutations and short indels (insertions and deletions): these are reasonably well understood and, especially for coding sequences, have well-supported, detailed models, but their effect is strictly local and not at a true genomic scale, except inasmuch as they may disable duplicate genes and eventually mutate them beyond recognition. However, point mutations account for most of the individual variation within a population—for instance, single-nucleotide polymorphisms (SNPs) are thought to form over 80% of the genetic diversity within the human species.

- Duplications: ranging from short repeats to very large segmental duplications to spectacular events such as a doubling of the entire genome (creating a tetraploid genome from a regular diploid one, for instance), these are the object of intense study today. Duplication of genes is viewed as the most common cause for the development of new function and thus, under positive selection, of divergence, since gene copies are freed from functional constraints. Significant differences in the size of a gene family can also occur among individuals, sometimes with clinical consequences, which has given rise to the study of the copy numbers of certain genes.

- Losses: again ranging from short deletions to excisions of large regions, these prevent duplications from enlarging the genome beyond all reason. They are particularly important in prokaryotic genomes, which are very trim, consisting mostly of coding and regulatory regions. Since losses can thus cancel duplications and mask earlier mutations, they make it impossible to reconstruct exactly the detailed evolutionary history of a genome and force us to use statistical estimates of the number of evolutionary events that have left no trace.

- Lateral gene transfers: common in prokaryotes and, to a lesser extent, in some eukaryotic groups, lateral (also called horizontal) gene transfer denotes the process by which an organism acquires "foreign" DNA and incorporates it into its own genome. It is seen as a key mechanism for pathogenic bacteria, one through which they may acquire drug resistance or virulence traits. Distinguishing such genes from those created through duplication and subsequent mutations is an important step in the analysis of a genome, as the two types obviously have very different histories.

- Genomic rearrangements: including both fission and fusion of chromosomes (for instance, the domestic horse has 32 chromosome pairs while its feral cousin, Przewalski's horse, has 33, but two of the pairs in the latter appear to make up one in the former) as well as more limited events such as the transposition of a region to another location on the genome, genomic rearrangements affect evolution, but are also a common feature of cancerous tissue and have proved an equally fascinating study for oncologists, geneticists, evolutionary biologists, and computer scientists. (A small sketch after this list shows how such events are modeled on signed gene orders.)

- Recombinations: through the mechanism of crossover, recombinations can exchange the tails of the two chromatids that form a chromosome pair and thus, across many generations, form along each chromosome a mosaic of regions inherited from different ancestors. This mosaic structure is of crucial importance in genetic studies and lies at the heart of the haplotype mapping (HapMap) project. (Hybridization of two species to form a new one can be viewed as an extreme form of recombination.) Through uneven crossover, recombinations can also cause duplications and various rearrangements.

Thus some of these mechanisms play a role mostly in evolutionary terms (whole-genome duplications, hybridization, lateral gene transfer, some types of rearrangements), some mostly in genetics (SNPs), and some at all scales (most rearrangements, recombinations, duplications). While some may leave obvious traces that can be recognized even when looking at a single genome (such as recent whole-genome duplications), most can only be identified by comparison with other genomes; even then, self-cancelling events (a deletion that removes a prior duplication, a point mutation that reverses a prior one, rearrangements that end up returning a genomic region to its starting location, etc.) will remain undetectable, although their existence and even their frequency may be inferred.
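Most rearrangement events are modeled on a genome abstracted as a signed ordering of genes or syntenic blocks, the sign recording the strand. The sketch below is pure illustration, with made-up gene numbers, of how an inversion and a transposition act on such an ordering:

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene
    in it (an inversion also switches the strand of the segment)."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def transposition(genome, i, j, k):
    """Move segment genome[i:j] so that it lands just before position k
    (with k >= j, in the original coordinates); signs are unchanged."""
    segment = genome[i:j]
    rest = genome[:i] + genome[j:]
    insert_at = k - (j - i)
    return rest[:insert_at] + segment + rest[insert_at:]

g = [1, 2, 3, 4, 5]               # five genes, all on the forward strand
print(inversion(g, 1, 4))          # [1, -4, -3, -2, 5]
print(transposition(g, 0, 2, 5))   # [3, 4, 5, 1, 2]
```

This representation underlies the distance and median computations discussed in Sections 3 and 4.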

Comparative genomics thus requires us to model these events in order to identify their occurrences among a group of genomes and to infer the evolutionary history of the group. For some of these events, the underlying mechanisms are known (crossover, various mechanisms for lateral gene transfer, several types of duplication, etc.); for others, the mechanism remains more obscure (segmental duplication, whole-genome duplication); for most, the exact parameters are very difficult to infer from the data at hand. Thus current computational research includes the validation of different models and the optimization of parameters.

The basic tool, as discussed earlier, is comparison with other whole genomes, preferably of closely related species. Such comparisons are not new, of course: well before the "age of DNA," biologists compared genetic maps of various organisms, a technique pioneered by the fly geneticists of Morgan's laboratory in the late 1920s and early 1930s. Sturtevant and Dobzhansky published a series of papers in the mid-1930s in which they discussed chromosomal rearrangements and even took a stab at ancestral reconstruction. What makes whole-genome comparisons particularly productive, as well as challenging, today is the availability of the entire genome sequence, not just a genetic map with a dozen markers per chromosome. However, pairwise comparison is most effective when the two species are closely related, such as a pair of mammals; when the two species are only distantly related, even regions under selection pressure have had time to diverge considerably, while rearrangements, duplications, and losses are far more numerous, so that it becomes very difficult to identify homologous regions. In order to handle more distantly related genomes, one must move beyond pairwise comparisons to multiple-genome analysis in a phylogenetic context, as the phylogeny provides evolutionary paths that relate the genomes through their ancestors and thus breaks what appear to be very large evolutionary distances into a series of more manageable steps. However, in such a move, many computational problems that were tractable in a pairwise context become NP-hard—exactly as in the case of sequence alignment.


3 Pairwise Genome Comparisons

Perhaps the most successful early use of comparative genomics was in gene hunting and annotation. By comparing the mouse genome with the human genome, for instance, researchers were able to add well over a thousand genes from each species to the overall gene complement of the other. Much of this work was relatively simple, using BLAST to search for gene pieces from one species in the genome of the other: its success was due entirely to the fact that mammals appear to share a nearly identical complement of genes. Gene annotation was also very successful, as gene function (although not gene regulation) is well preserved across mammals. However, reconstructing the evolution of these mammalian genomes turned out to be very much harder. The scale was one problem: dealing with 3 billion base pairs and nearly 20,000 genes requires very efficient algorithms and large, "genome-scale" computers (at the very least, with enough memory to hold two full genome sequences, on the order of 100 GB). Yet many of the computational problems had no efficient solution; indeed, many of them had no solution at all. Thus began many research programs targeted at relatively narrow questions in comparative genomics.

- Given two whole genomes, how do we identify syntenic blocks, that is, sequence blocks that are reasonably conserved in the two genomes? These blocks should be as large as possible to simplify later computations. Many algorithms were already known for variations of this problem on text strings, but only those requiring perfect identity would scale to genomic sizes. The addition of a range of possible mutations and other changes made many existing algorithms inapplicable and caused others to run extremely slowly. (A toy version of the match-seeding step used in practice is sketched after this list.)

- Given a simplified version of the genomes as lists of syntenic blocks, how do we identify the specific rearrangement events that caused these lists (often with identical overall contents among mammals) to differ in individual content and ordering? This was an entirely new problem, and today it has solutions only for a few specific rearrangement operations. We also lack strong models for these rearrangements—for instance, we suspect that short regions are more easily moved or reversed than long ones, but lack the data to parameterize this notion.

- Given a more detailed version of the genomes as collections of genes sorted into gene families, how do we reconstruct the history of these families so as to be able to pair up corresponding genes and determine orthologies? This problem is tangled with that of rearrangements and also lacks a strong model, as the same data could be explained by very different scenarios, including duplications, lateral gene transfers, hybridizations, etc. In the presence of recombinations, lateral gene transfers, or hybridizations (collectively known as reticulation events), evolutionary trees turn into evolutionary networks, with their own set of models and problems.

- Given two whole genomes along with their genes, how do we identify and characterize large-scale segmental duplication events? The larger such duplications in humans are extremely complex, as they appear to consist of nested duplications, in which smaller duplications create blocks of repeats and a consecutive collection of these blocks is in turn duplicated.

- Given as much data as possible about two contemporary organisms, how do we reconstruct a candidate ancestral genome? In addition to solving all four previous problems, we need to restrict the space of possible solutions by adding plausible biological constraints—otherwise, especially for more distant organisms, competing optimal choices (according to any reasonable optimization criterion) will be far too numerous. This is a problem that is best posed in a multiple-genome setting rather than in a pairwise setting, as the additional genomes will themselves restrict the space of optimal solutions.

- Given the sequence of the model organism, SNP statistics for the population, and dense SNP maps for a number of individuals, how do we reconstruct a series of crossover events that best explains the different SNP maps? This is but one of the many problems that arise when attempting to combine genomics and genetics—one of the key approaches to leveraging genomic research results to benefit human health.

- Given a mix of sequences and an established phylogeny of whole genomes, how do we place the sequences within the phylogeny? Some of the sequences may belong to existing species, some to new species in a known genus or family, and a few may prove to be the first representatives of entirely new families of organisms. Solving this problem is the key to a metagenomic analysis.
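As promised for the first problem in this list: practical tools typically seed candidate syntenic blocks with short exact matches found by hashing, then chain and extend collinear seeds. The sketch below covers the seeding step only; k-mer indexing is standard practice, but the sequences and the parameter k are arbitrary here, and none of the scoring, gapped seeds, or repeat filtering that real aligners such as BLASTZ employ is included:

```python
from collections import defaultdict

def kmer_anchors(a, b, k=8):
    """Toy anchor finder: index every k-mer of sequence `a`, then report
    exact k-mer matches in `b` as (pos_in_a, pos_in_b) seed pairs.
    Real pipelines chain collinear seeds into candidate syntenic blocks."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]

# Made-up sequences sharing two short blocks in swapped order:
a = "ACGTACGTTTGACCAA" + "GGGCCCATTACG"
b = "GGGCCCATTACG" + "ACGTACGTTTGACCAA"
print(kmer_anchors(a, b))
```

Plotted as points in the (position in a, position in b) plane, the seeds form diagonal runs, and each run is a candidate block; the swapped order of the runs is exactly the signal that a rearrangement analysis must then explain.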

As is typically the case in computational biology and bioinformatics, these questions are receiving two types of answers. One type is practical, aimed at delivering reasonable, useful answers as soon as possible; methods are specialized for the data at hand and refined as the need arises. For example, the Genome Browser at UC Santa Cruz provides precomputed genomic alignments in addition to the standard facilities for browsing annotations; in addition to the human genome, the genomes of 28 (as of fall 2009) vertebrates are included: five fishes, one frog, one lizard, and 21 mammals (including one monotreme and one marsupial). These are generated by first creating independent pairwise genomic alignments with the human genome (ignoring repeats), then applying various refinement heuristics, and finally placing them into a multiple alignment using a progressive multiple-alignment strategy. The aligners used in pairwise alignment and in progressive multiple alignment are capable of limited handling of rearrangements (BLASTZ for the pairwise step, multiz for the progressive one). The comparative genomics pipeline used to produce these alignments, developed in a collaboration between UC Santa Cruz (David Haussler) and Pennsylvania State U. (Webb Miller), is aimed at vertebrates and works best on mammals; its tools (BLASTZ, multiz, etc.) do not offer foundational solutions to the problems raised above, but they provide high-quality output within a restricted scope. Similar approaches have been used in tackling other problems in comparative genomics, often using general-purpose computational frameworks such as hidden Markov models (HMMs). We do not describe these solutions in detail (if only because, as is typical of actual tools, they include a large variety of ad hoc measures as well as a long list of parameters), but list several in the bibliography.

The other type of answer comes from foundational research in combinatorics, statistics, and algorithms. For instance, the problem of finding a minimum number of inversions to turn one permutation of oriented genes into another has been the object of intense study for over 20 years, ever since David Sankoff first formalized it. It received a first solution in breakthrough work by the group of Pavel Pevzner, which established a mathematical framework and provided a polynomial-time (if inefficient) algorithm; it has since been ceaselessly refined by several groups, with that of Bernard Moret delivering a linear-time algorithm for the distance computation and, recently, a randomized O(n log n)-time algorithm for finding a minimal sequence of operations (thus incidentally placing the overall problem on the same level as the trivial one of sorting a list of numbers in increasing order). Lineage sorting, which attempts to reconstruct a history of gene duplications and losses to explain incompatible gene-based phylogenetic trees, is another example of a problem that has been studied very formally and for which a broad range of solutions (both combinatorial and likelihood-based) has been developed. The problems solved in these approaches are the formal, abstract versions of the rather messier problems arising in biology and typically address a single issue (inversion, but not all rearrangements; duplication and loss, but not hybridization and lateral gene transfer; etc.). Thus, while they provide a long-lasting foundation for future tools, they often do not immediately contribute to data analysis.
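Computing the inversion distance itself requires the Hannenhalli-Pevzner machinery cited in the bibliography; a far simpler quantity, often used in this literature as a lower-bound proxy, is the breakpoint distance: the number of adjacencies of one gene order that do not survive in the other. A minimal sketch for unichromosomal, signed gene orders (capping the ends with 0 and n+1 is one standard convention, assumed here):

```python
def breakpoints(g1, g2):
    """Breakpoint distance between two signed gene orders on the same
    gene set {1..n}: the number of adjacencies of g1 absent from g2,
    where adjacency (a, b) survives if g2 contains (a, b) or (-b, -a).
    Both orders are capped with 0 and n+1 so end adjacencies count."""
    n = len(g1)
    e1 = [0, *g1, n + 1]
    e2 = [0, *g2, n + 1]
    adj2 = set(zip(e2, e2[1:]))
    adj2 |= {(-b, -a) for a, b in adj2}
    return sum(ab not in adj2 for ab in zip(e1, e1[1:]))

# A single inversion creates at most two breakpoints:
print(breakpoints([1, 2, 3, 4, 5], [1, -4, -3, -2, 5]))  # -> 2
```

Since one inversion can remove at most two breakpoints, half the breakpoint distance bounds the inversion distance from below, which is one reason this crude measure remains useful.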

4 Multiple Genome Comparisons

Once we move to multiple genomes, the setting changes and becomes even more explicitly based on evolution. While it is of course possible to compare a collection of genomes by carrying out all pairwise comparisons, or by comparing all of them to a fixed "reference" genome (the first pass in the UCSC genome aligner), the first approach is not only expensive (entailing a quadratic number of comparisons), but hard to interpret (the independent pairwise comparisons are likely to return incompatible results), while the "star-shaped" comparison of all genomes to a single reference suffers if the reference genome is not sufficiently close to every other genome in the comparison and also deliberately ignores the underlying evolutionary history of the group, focusing solely on the evolutionary path to the reference. In order to take advantage of the full evolutionary model, we must use a phylogenetic approach: place the modern genomes at the leaves of a (typically known) phylogenetic tree and carry out the comparisons along the edges of the tree. There are two major advantages to this approach. One is that the pairwise distance along an edge of the tree is typically smaller than that between two leaves of the tree, especially if the collection of genomes is a reasonable sampling of the clade to which these organisms belong. The other, more important, advantage is that the phylogenetic relationships carry a significant amount of information and can be used to great effect in guiding the computations. However, there is one major disadvantage: every single comparison (along an edge of the tree) involves at least one internal node, that is, a putative ancestral genome.

In classical phylogenetic reconstruction from sequence data, new sequences are often inferred at internal nodes (in all methods based on parsimony or on Bayesian statistics, and in many of the methods based on maximum likelihood). The same situation obtains here, which partially explains the interest in the reconstruction of ancestral genomes. One should, however, carefully distinguish between two activities: the inference of data at internal nodes in the course of running a tree-reconstruction algorithm, in which case the data obey some requirements of the algorithm, but perhaps no particular requirement of biological plausibility, and so do not represent a bona fide attempt at reconstructing some distant ancestral organism; and ancestral reconstruction per se, in which a plausible ancestral genome is explicitly sought. The first is just the storage of intermediate values in the course of optimizing the choice of tree topology, branch lengths, and other parameters; the second is a goal in itself. For the most part, researchers to date have focused on the first, not on the second. In comparative genomics, however, the interest has shifted to the second—which is much harder to formalize.
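To make the first kind of internal-node inference concrete, here is a minimal sketch of the bottom-up pass of Fitch's small-parsimony rule for a single character; the three-leaf tree, node names, and nucleotide states below are invented for illustration, and real reconstructions work with vastly richer data and models:

```python
def fitch_sets(tree, leaf_states):
    """Bottom-up pass of Fitch's small-parsimony algorithm for one
    character: an internal node receives the intersection of its
    children's candidate sets if nonempty, else their union (which
    costs one mutation). `tree` maps each internal node to its two
    children; `leaf_states` maps each leaf to a singleton state set.
    Returns all candidate-state sets and the minimum number of changes."""
    sets, changes = dict(leaf_states), 0
    def visit(node):
        nonlocal changes
        if node in sets:                      # a leaf: state is given
            return sets[node]
        left, right = (visit(c) for c in tree[node])
        common = left & right
        if not common:                        # disagreement: union, +1 change
            common = left | right
            changes += 1
        sets[node] = common
        return common
    visit("root")
    return sets, changes

# Toy rooted tree: root -> (x, C), x -> (A, B); one nucleotide per leaf.
tree = {"root": ("x", "C"), "x": ("A", "B")}
leaves = {"A": {"G"}, "B": {"G"}, "C": {"T"}}
print(fitch_sets(tree, leaves))  # x gets {G}; root gets {G, T}; 1 change
```

The sets computed at internal nodes are exactly the "intermediate values" of the optimization; nothing in the rule asks whether the inferred states form a biologically plausible ancestor, which is the distinction drawn above.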


Some positive results on ancestral reconstruction in mammals (and more recently in plants) have appeared in the literature, but these are limited partial reconstructions for very closely related organisms; more general attempts to date have yielded negative results. However, rapid progress is being made, some of which is illustrated in the Comparative Genomics session of PSB 2010.

Anyone familiar with phylogenetic reconstruction knows that it is, on its own, a complex and computationally demanding problem, even when working with simple sequence data. (Not that sequence data are that simple, but the assumptions usually made in such analyses do make them simple, e.g., by assuming that every position evolves independently of all others.) What are we then to make of a problem where the data are a million times larger, the evolutionary events far more diverse and complex, and the results far more difficult to evaluate? (It is possible to look at a tree of several hundred species and check it thoroughly "by hand," but the same cannot be done when the output is a multiple alignment with tens of thousands of blocks, to say nothing of an output that actually aligns the genomic sequences themselves.) As before, we can tackle the entire problem by tailoring our approach to the specifics of the data at hand, but without attempting to provide fundamental solutions; or we can formalize subproblems and work on those. At least, in most cases, we do not need to build the tree along with the genomic alignment—the tree for model organisms is usually known and accepted; and if we study organisms not yet placed in such a tree, or where some doubts remain as to their proper place, we can construct gene-based phylogenies of our own to place these organisms. We cannot, however, trust edge lengths: the rate of evolution for the genes used in building the tree need not be closely related to the overall rate of evolution for the genome, nor for any of its components. This is why current attempts at using phylogenetic information in comparative genomics use the phylogenetic tree for scheduling computations (as in a progressive alignment) or as the backbone of a hidden Markov model (HMM)—in both cases, only the topology of the tree is used. If, however, we can assume the presence of a molecular clock (or have some other means of assigning dates to internal nodes), then we can use the dates to good effect. In particular, regions of the genome evolving neutrally display a fairly uniform drift (the result of randomly chosen evolutionary changes) that can be accurately measured and serve as a timing device, at least for short to medium time scales. We can use such a measuring stick to verify that our computational inferences are compatible with the established dates of divergence in the tree.

We can also return to fundamentals and ask very basic questions, such as "Given three genomes (the simplest possible tree, with a single internal node), how do we reconstruct an ancestral genome for the internal node?"

Principles of parsimony suggest that the ancestral genome be chosen so as to minimize the sum of its distances (in a genomic alignment) to the three given genomes. Finding some x that, for given a, b, and c, minimizes the sum d(a, x) + d(b, x) + d(c, x) is, of course, the problem of finding a median. It is also the core problem in the reconstruction of ancestral genomes. And it is generally NP-hard, even for distances that can be computed in linear time. (This is an instance of the complexity explosion that frequently occurs when moving from two-way comparisons to three-way comparisons.) A brute-force illustration of the median problem appears at the end of this section. Another deceptively simple question, this time about gene families, asks, given three homologous gene families (one in each of three genomes), which genes in each family are unique to that family. (Even if all three families have exactly the same size, it does not automatically follow that every gene in each family was inherited from a common ancestor.) The problem here is to model the evolutionary process and properly assess the respective importance of sequence divergence and positioning on the genome; ironically, the simplest model designed requires the construction of an ancestral genome—indeed, a median, with all its attending complexities. Here again, however, much progress has been made in a few years, and we now have tools that can compute near-optimal medians under inversion or so-called "double-cut-and-join" (DCJ) models of rearrangement for genomes of mammalian size (20,000 genes) in a few seconds, provided the genomes are not too far diverged.

In recent years, the notion of genomic signatures has gained popularity. Originally based on the spectrum of dinucleotide frequencies (nearly invariant across a prokaryotic genome), the term has since been used for gene-family-based comparisons, for genome-wide localization of transcription factors (signature tags), and for characteristic sets of rearrangements used in ancestral reconstruction. Signatures, viewed as simple summaries that capture patterns of evolution (conservation or divergence) in the genome, can provide insight into a very complicated process as well as form the basis of computational techniques. In a recent paper, Swenson and Moret argue that, in spite of the theoretical increase in complexity, the use of three rather than two genomes, by enabling the computation of genomic signatures, actually reduces the computational burden, in addition to improving the accuracy of reconstruction.
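As promised above, the median problem can at least be stated operationally: enumerate candidate gene orders x and keep one minimizing d(a,x) + d(b,x) + d(c,x). The sketch below does this by brute force under the breakpoint distance (repeating the helper from the earlier sketch so the block stands alone); with n! * 2^n signed orders on n genes, it is hopeless beyond toy sizes, which gives a feel for the combinatorial explosion behind the NP-hardness results mentioned above:

```python
from itertools import permutations, product

def breakpoints(g1, g2):
    """Breakpoint distance between signed gene orders (ends capped)."""
    n = len(g1)
    e1, e2 = [0, *g1, n + 1], [0, *g2, n + 1]
    adj2 = set(zip(e2, e2[1:]))
    adj2 |= {(-b, -a) for a, b in adj2}
    return sum(ab not in adj2 for ab in zip(e1, e1[1:]))

def breakpoint_median(a, b, c):
    """Exhaustive median: try every signed gene order x and keep one
    minimizing d(a,x) + d(b,x) + d(c,x). Feasible only for tiny n."""
    n = len(a)
    best, best_score = None, float("inf")
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            x = [s * g for s, g in zip(signs, perm)]
            score = (breakpoints(a, x) + breakpoints(b, x)
                     + breakpoints(c, x))
            if score < best_score:
                best, best_score = x, score
    return best, best_score

print(breakpoint_median([1, 2, 3, 4], [1, -3, -2, 4], [2, 1, 3, 4]))
```

The tools cited in the bibliography avoid this enumeration entirely, using bounding and decomposition to find near-optimal medians for genomes four orders of magnitude larger.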


5 Conclusion

While comparative genomics has already proved its worth over and over, we are only at the beginning of its flowering. To a large extent, the results obtained to date through comparative genomics have been the low-hanging fruit. As methods get refined and some of the foundational research in models and algorithms gets incorporated into bioinformatics tools, more difficult targets will be reached. The modelling and computational research needed includes, in no particular order:

- fusing rearrangements and gene-family evolution;

- better, faster, and finer identification of syntenic blocks;

- orthology assignments based on the full history of genomic evolution;

- fine-grained comparison of entire genomes (not just at the level of syntenic blocks), using both sequence-level and genome-level tools;

- merging population genetics and phylogenetics;

- modelling abnormal processes, such as microrearrangements in the genomes of cancerous cells;

- reliable reconstruction of the main features of ancestral genomes.

Moving theoretical results from the Math or CS research lab into the hands of end users in the Biology research lab remains a serious issue: turning a research prototype into a usable tool and keeping that tool up to date is not well suited to the academic research model, and CS researchers who attempt to develop a tool tend to underestimate the effort required to make such an endeavor successful. As a result, most current tools have in fact been built by the researchers most invested in their use, the biologists themselves; the tools are thus perfectly well adapted to their user community, but tend not to make use of the latest results from the Math and CS computational biology community.


6 Annotated Bibliography

This section is organized into subsections, with general references (surveys, textbooks, and monographs), references on particular topics (rearrangements, lineage sorting, recombination, etc.), and a few references to applications other than genome annotation. This bibliography is not meant to be representative or balanced—the textbooks and surveys listed in the first part do that. Its intent is to provide various paths to exploration, some highly abstract (as in sorting signed permutations by inversions), some purely algorithmic, some touching more closely on evolutionary biology, etc. Comparative genomics today affects nearly all areas of biology and much of medicine, so that its "perimeter" is very long and varied, making for fascinating reading.

6.1 References of General Interest

These are not about comparative genomics, but are mentioned in the text, or help set the stage, or enable further exploration.

Brown, T.A. Genomes (3rd edition). Garland Science, 2006. Everything you ever wanted to know (and more) about genomes: their organization, functions, evolution, etc. Beautifully done, with a consistent level of presentation throughout, and almost completely self-contained—freshman biology is more than enough background. The second edition of this text is available in its entirety online at the NCBI web site.

Dobzhansky, Th., "Nothing in biology makes sense except in the light of evolution," The American Biology Teacher 35 (1973), 125–129. The famous essay by Dobzhansky.

Moore, G.E., "Cramming more components onto integrated circuits," 1965. This article was published in the now-defunct trade magazine Electronics, but is widely available on the Web; it is at the origin of Moore's law, although that term was coined several years later by Carver Mead of Caltech.

6.2 Texts and Surveys

J.R. Brown, ed. Comparative Genomics: Basic and Applied Research. CRC Press/Taylor & Francis, 2007. A very nice collection of survey papers, half on methodology and half on applications across a wide swath of the life sciences.

Koonin, E.V., and Galperin, M.Y. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Springer Verlag, 2002.


A nearly pure bioinformatics text with excellent coverage, but do not expect any algorithmic content.

D. Sankoff and J.H. Nadeau, eds. Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment, and Evolution of Gene Families. Kluwer, 2000. While getting a bit older, this monograph (proceedings from a workshop) remains the only one targeted at computational methodologies and has served as a foundation for much subsequent work on genome rearrangements.

Boffelli, D., Nobrega, M.A., and Rubin, E.M., "Comparative genomics at the vertebrate extremes," Nature Reviews Genetics 5 (2004), 456–465. Contrasts the uses of comparing the human genome with other primate genomes and with fish genomes.

Eisen, J.A., "Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes," PLoS Biology 5, 3 (2007), e82. An excellent short essay on metagenomics conducted using shotgun sequencing technologies, with a good list of important papers, from one of the authors of the breakthrough paper that created the field.

Hardison, R.C., "Comparative genomics," PLoS Biology 1, 2 (2003), e58. A wide-ranging survey emphasizing bioinformatics and applications.

Koonin, E.V., Aravind, L., and Kondrashov, A., "The impact of comparative genomics on our understanding of evolution," Cell 101, 6 (2000), 573–576. Among many survey papers that address comparative genomics as a tool to understand function, this one is aimed at evolution.

O'Brien, S.J., et al., "The promise of comparative genomics in mammals," Science 286, 5439 (1999), 458–481. An early survey that is nearly an advertisement for the power of comparative genomics, discussing advances already made on the basis of simple genetic maps and anticipating the utility of whole-genome sequences to come.

Paterson, A.H., et al., "Comparative genomics of plant chromosomes," Plant Cell 12 (2000), 1523–1540. An early survey for plant biologists, with an emphasis on polyploidy and hybridization.

Rubin, G.M., et al., "Comparative genomics of the eukaryotes," Science 287, 5461 (2000), 2204–2215. A relatively early paper comparing three model organisms (fly, worm, and yeast) and discussing, among other things, the evolution of gene families.

Ureta-Vidal, A., Ettwiller, L., and Birney, E., "Comparative genomics: genome-wide analysis in metazoan eukaryotes," Nature Reviews Genetics 4 (2003), 251–262.


A paper from the EnsEMBL group, the European counterpart to the UCSC Genome Browser, focusing on genomic alignment and on gene and regulatory region prediction.

6.3 What to Read

For methodological papers (models, algorithms, and their computational validation), the best source is refereed annual conferences. In addition to PSB itself, they include, in alphabetical order:

APBC: the Asia-Pacific Bioinformatics Conference, started in 2003.

CPM: the Symposium on Combinatorial Pattern Matching, started in 1990 to showcase research on string algorithms (more theoretical).

ISMB: the Conference on Intelligent Systems for Molecular Biology, started in 1993 (more applied).

RECOMB: the Conference on Research in Computational Molecular Biology, started in 1997.

RECOMB-CG: the RECOMB Workshop on Comparative Genomics, started in 2003.

WABI: the Workshop on Algorithms in Bioinformatics, started in 2001.

The proceedings of CPM, RECOMB-CG, and WABI are published by Springer Verlag in the Lecture Notes in Computer Science/Lecture Notes in Bioinformatics (LNCS/LNBI) series. Those of ISMB are currently published as a supplement to the journal Bioinformatics; those of RECOMB and those of APBC have had a varied history, but those of APBC are currently published as a supplement to BMC Bioinformatics, while those of RECOMB have been published since 2008 by Springer Verlag in the LNCS/LNBI series.

Finally, four journals regularly feature work of this type as well; they are:

Genome Research

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Journal of Bioinformatics and Computational Biology

Journal of Computational Biology

In addition, relevant articles appear with some regularity in Bioinformatics, BMC Bioinformatics, and the BMC journal Algorithms for Molecular Biology, and, naturally, other articles are scattered across Science, Nature, PNAS USA, PLoS journals, Mol. Biol. and Evol., Nucleic Acids Res., as well as more specialized journals.


6.4 References on Genome Rearrangements and Ancestral Reconstruction

Bader, D.A., Moret, B.M.E., and Yan, M., "A linear-time algorithm for computing inversion distances between signed permutations with an experimental study," J. Comput. Biol. 8, 5 (2001), 483–491. The linear-time distance computation for inversions.

Bergeron, A., Heber, S., and Stoye, J., "Common intervals and sorting by reversals: A marriage of necessity," in Proc. 1st European Conf. Comput. Biol. ECCB'02, in Bioinformatics 18 (2002), S54–S63. One of a line of papers from the first author that discusses her "common intervals," the main theoretical tool today for the study of inversions.

Chaisson, M.J., Raphael, B.J., and Pevzner, P.A., "Microinversions in mammalian evolution," Proc. Nat'l Acad. Sci. USA 103, 52 (2006), 19824–19829. Proposes the use of microinversions as a phylogenetic character for mammals, estimating that hundreds of thousands of them can be found in mammalian genomes.

Earnest-DeYoung, J.V., Lerat, E., and Moret, B.M.E., "Reversing gene erosion: Reconstructing ancestral bacterial genomes from gene-content and order data," in Proc. 4th Workshop on Algorithms in Bioinformatics WABI'04, in LNCS 3240, 1–13, Springer Verlag (2004). Shows why reconstructing ancestral genomes for a fairly divergent set of genomes (gamma proteobacteria) is not doable without new biological constraints.

Hannenhalli, S., and Pevzner, P.A., "Transforming mice into men (polynomial algorithm for genomic distance problems)," in Proc. 36th IEEE Symp. Foundations of Comput. Sci. FOCS'95, IEEE Press (1995), 581–592. See the paper below.

Hannenhalli, S., and Pevzner, P.A., "Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals)," in Proc. 27th ACM Symp. Theory of Comput. STOC'95, ACM Press (1995), 178–189. These two papers are the breakthrough results on sorting by inversions (called reversals by many CS researchers).

Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Miller, W., and Haussler, D., "The infinite sites model of genome evolution," Proc. Nat'l Acad. Sci. USA 105, 38 (2008), 14254–14261. Shows how to reconstruct a most parsimonious evolutionary history under the assumption that each base pair is affected by at most one event.

Ma, J., Ratan, A., Raney, B.J., Suh, B.B., Zhang, L., Miller, W., and Haussler, D., "DUPCAR: Reconstructing contiguous ancestral regions with duplications," J. Comput. Biol. 15, 8 (2008), 1–21. A simple greedy heuristic based on preserving adjacencies, with an extensive validation suite.


Moret, B.M.E., and Warnow, T., "Advances in phylogeny reconstruction from gene order and content data," in Molecular Evolution: Producing the Biochemical Data, Part B, E.A. Zimmer and E.H. Roalson, eds., Vol. 395 of Methods in Enzymology, Elsevier (2005), 673–700. An overview of the state of the art in algorithmic development for gene-order data, especially in the context of phylogenetic reconstruction.

Murphy, W.J., et al., "Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps," Science 309, 5734 (2005), 613–617. An analysis of breakpoints and rearrangements in eight mammalian genomes, indicating a prevalence of segmental duplications.

Peng, Q., Pevzner, P.A., and Tesler, G., "The fragile breakage versus random breakage models of chromosome evolution," PLoS Comput. Biol. 2, 2 (2006). Are there locations along the genome that are more likely to "break" and thus form the boundary of a rearrangement? This basic question has proved remarkably difficult to answer over the years—this is one of the more recent assessments.

Sankoff, D., Lefebvre, J.F., Tillier, E., Maler, A., and El-Mabrouk, N., "The distribution of inversion lengths in bacteria," in Proc. 3rd RECOMB Workshop on Comparative Genomics RECOMB-CG'05, in LNCS 3388, 97–108, Springer Verlag (2005). A comprehensive attempt at understanding one of the key parameters in any model of genomic rearrangements.

Sankoff, D., Zheng, C., Wall, P.K., dePamphilis, C., Leebens-Mack, J., and Albert, V., "Internal validation of ancestral gene order reconstruction in angiosperm phylogeny," in Proc. 6th RECOMB Workshop on Comparative Genomics RECOMB-CG'08, Paris (France), in LNCS 5267, Springer Verlag (2008). The best work to date on ancestral reconstruction in plants.

Sebat, J., et al., "Large-scale copy number polymorphism in the human genome," Science 305, 5683 (2004), 525–528. The first publication after the completion of the human genome reporting the prevalence of gene copy number variation.

Swenson, K.M., and Moret, B.M.E., "Inversion-based genomic signatures," in Proc. 7th Asia-Pacific Bioinformatics Conf. APBC'09, in BMC Bioinformatics 10 (Suppl. 1):S7. Introduces these signatures and uses them to improve ancestral reconstruction.

Swenson, K.M., Rajan, V., Lin, Y., and Moret, B.M.E., "Sorting signed permutations by inversions in O(n log n) time," in Proc. 13th Int'l Conf. on Research in Comput. Molecular Biol. RECOMB'09, in LNCS 5541, 386–399, Springer Verlag (2009). The fast algorithm for sorting by inversions.


Tesler, G., "Efficient algorithms for multichromosomal genome rearrangements," J. Computer and System Sci. 65, 3 (2002), 587–609. The first practical approach to handling multichromosomal rearrangements.

Warren, R., and Sankoff, D., "Genome halving with general operations," in Proc. 6th Asia-Pacific Bioinformatics Conf. APBC'08, Kyoto (Japan), in Advances in Bioinformatics and Computational Biology, vol. 6, 231–240, Imperial College Press (2008). How to handle whole-genome duplications—which, in a reconstruction scenario, are reversed into whole-genome halving.

Yin, P., and Hartemink, A.J., "Theoretical and practical advances in genome halving," Bioinformatics 21, 7 (2005), 869–879. A somewhat different take on the same problem.

6.5 References on Models of Conservation, Duplication, etc.

Dumas, L., et al., "Gene copy number variation spanning 60 million years of human and primate evolution," Genome Res. 17 (2007), 1266–1277. Studying copy numbers in primates without looking at clinical aspects.

Jiang, Z., et al., "Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution," Nat. Genet. 39 (2007), 1361–1368. On the track of segmental duplications in humans, through comparisons with chimpanzees and rhesus macaques.

Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D., "Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes," Proc. Nat'l Acad. Sci. USA 100, 20 (2003), 11484–11489. Modelling, testing the model, and analyzing results—a classic paper.

Siepel, A., et al., "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes," Genome Res. 15, 8 (2005), 1034–1050. How to detect conserved elements in a purely comparative manner.

6.6 References on Lineage Sorting, Lateral Gene Transfer, and Reticulate Evolution

Arvestad, L., Berglund, A.-C., Lagergren, J., and Sennblad, B., "Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution," in Proc. 8th Conf. on Research in Comput. Mol. Biol. RECOMB'04, ACM Press (2004), 326–335. Integrates gene duplication and sequence mutations and indels in a Bayesian framework.


Gusfield, D., Eddhu, S., and Langley, C., "Efficient reconstruction of phylogenetic networks with constrained recombination," in Proc. 2nd IEEE Comput'l Systems Bioinformatics Conf. CSB'03, IEEE Press (2003), 363–374. Introduces galled trees, a natural constraint on phylogenetic networks that has proved very successful in terms of algorithm development.

Jain, R., Rivera, M.C., and Lake, J.A., "Horizontal gene transfer among genomes: The complexity hypothesis," Proc. Nat'l Acad. Sci. USA 96, 7 (1999), 3801–3806. A very early whole-genome approach to the problem, using six complete prokaryotic genomes.

Linder, C.R., and Rieseberg, L.H., "Reconstructing patterns of reticulate evolution in plants," American J. Botany 91 (2004), 1700–1708. Offers reflections on the combined effects of recombination, gene duplication, and hybridization, as well as a model.

Maddison, W.P., "Gene trees in species trees," Syst. Biol. 46 (1997), 523–536. See the paper below.

Page, R.D.M., and Charleston, M.A., "From gene to organismal phylogeny: Reconciled trees and the gene tree/species tree problem," Mol. Phyl. and Evol. 7 (1997), 231–240. The two foundational papers in the area of lineage sorting.

Ochman, H., Lawrence, J.G., and Groisman, E.A., "Lateral gene transfer and the nature of bacterial innovation," Nature 405, 6784 (2000), 299–304. The classic paper on the topic.

Than, C., Ruths, D., Innan, H., and Nakhleh, L., "Confounding factors in HGT detection: Statistical error, coalescent effects, and multiple solutions," J. Comput. Biol. 14, 4 (2007), 517–535. The first framework unifying lineage sorting and lateral gene transfer.

6.7 Some Tools in Comparative Genomics

Tools are to be distinguished from algorithms, in the sense that they are already in a form usable by researchers in biology.

Blanchette, M., et al., "Aligning multiple genomic sequences with the threaded blockset aligner," Genome Res. 14, 4 (2004), 708–715. The paper describing multiz, a progressive genomic alignment tool.


Chiaromonte, F., Yap, V.B., and Miller, W., "Scoring pairwise genomic sequence alignments," in Proc. 7th Pacific Symp. Biocomput. PSB'02, World Scientific Pub. (2002), 115–126. The original paper on BLASTZ.

Miller, W., et al., "28-way vertebrate alignment and conservation track in the UCSC Genome Browser," Genome Res. 17 (2007), 1797–1808. A fascinating article detailing the genomic multiple alignment mentioned in the text—a real tour de force!

Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W., "Human-mouse alignments with BLASTZ," Genome Res. 13, 1 (2003), 103–107. Shows how to use BLASTZ for complex genomes.

Siepel, A., and Haussler, D., "Phylogenetic hidden Markov models," in R. Nielsen, ed., Statistical Methods in Molecular Evolution, Springer, New York (2005), 325–351. The phylogenetic HMMs used at UCSC and elsewhere for exon hunting and other types of genomic annotation.

Tesler, G., "GRIMM: genome rearrangements web server," Bioinformatics 18, 3 (2002), 492–493. An early implementation of the Hannenhalli-Pevzner results; later implementations such as GRAPPA (from Moret's group) are faster, and MGR (also from Tesler) handles multichromosomal genomes.

6.8 A (Very) Few Applications of Comparative Genomics

Basically every genome annotation is the product of comparative genomics, so we give instead applications in other areas.

Bourque, G., Pevzner, P.A., and Tesler, G., "Reconstructing the genomic architecture of ancestral mammals: Lessons from human, mouse, and rat genomes," Genome Res. 14 (2004), 507–516. One in a series of papers by Pevzner's group and others, using increasing numbers of vertebrate genomes—this was one of the first such papers.

Cui, X., Vinar, T., Brejova, B., Shasha, D., and Li, M., "Homology search for genes," in Proc. 15th Intelligent Systems for Mol. Biol. ISMB'07, in Bioinformatics 23, 13 (2007), i97–i103. Gene hunting by statistical means, using comparative approaches.

Osterman, A., and Overbeek, R., "Missing genes in metabolic pathways: a comparative genomics approach," Current Opinion in Chemical Biol. 7, 2 (2003), 238–251.


A typical comparative gene-hunting approach based on comparing genomes and pathways.

Sebat, J., et al., "Strong association of de novo copy number mutations with autism," Science 316, 5823 (2007), 445–449. Copy number is now viewed as the second (after SNPs) major source of genetic variation among humans, and its effects on health are the subject of intense research.

Volik, S., et al., "Decoding the fine-scale structure of a breast cancer genome and transcriptome," Genome Res. 16 (2006), 394–404. Multiple small rearrangements and duplications are common in many cancerous tissues; this paper details an analysis from multiple sources of data. Not comparative genomics per se, but it illustrates genomic changes in pathology.
