Genetic Programming and Evolvable Machines, 5, 121–144, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Multiple Sequence Alignment with Evolutionary Computation

CONRAD SHYU, LUKE SHENEMAN, JAMES A. FOSTER
Initiatives for Bioinformatics and Evolutionary Studies (IBEST), Department of Bioinformatics and Computational Biology, University of Idaho, Moscow, Idaho 83844-1010, USA

Submitted February 28, 2003; Revised November 6, 2003

Abstract. In this paper we provide a brief review of current work in the area of multiple sequence alignment (MSA) for DNA and protein sequences using evolutionary computation (EC). We detail the strengths and weaknesses of EC techniques for MSA. In addition, we present two novel approaches for inferring MSA using genetic algorithms (GAs). Our first novel approach utilizes a GA to evolve an optimal guide tree in a progressive alignment algorithm and serves as an alternative to the more traditional heuristic techniques such as neighbor-joining. The second novel approach facilitates the optimization of a consensus sequence with a GA using a vertically scalable encoding scheme in which the number of iterations needed to find the optimal solution is approximately the same regardless of the number of sequences being aligned. We compare both of our novel approaches to the popular progressive alignment program Clustal W. Experiments have confirmed that EC constitutes an attractive and promising alternative to traditional heuristic algorithms for MSA.

Keywords: multiple sequence alignment, genetic algorithm, progressive alignments, DNA sequences

1. Introduction

Living things diverge from common ancestors through changes in deoxyribonucleic acid (DNA) over millions of years of evolution [5]. DNA plays a fundamental role in the processes of life. It contains the template for the synthesis of proteins, which are crucial molecules for living systems. Moreover, DNA is essential to life because it functions as a medium to transmit information from one generation to another [10]. The most important regions in DNA are generally conserved to ensure survival. Sequence alignment is commonly used to detect and quantify similarities in DNA or protein sequences. Alignments of biological sequences generated by computational algorithms are routinely used as a basis for inference about sequences whose structures or functions are not well known. The most common approach is to find the best-scoring alignment between a pair of sequences, where the alignment score is a measure of the edit distance between the sequences in the context of a particular evolutionary model. An evolutionary model can be represented as a scoring system which penalizes substitutions and gaps [5, 7]. The best-scoring (optimal) alignment can be found through the use of dynamic programming (DP) algorithms such as the Smith-Waterman [28, 37] and Needleman-Wunsch [20] algorithms. However, the cost of DP grows exponentially with the number of sequences being aligned. Specifically, computing an optimal multiple sequence alignment (MSA) has been shown to be NP-hard [36]. Several heuristic approaches, such as Clustal W [32–34], are frequently used to quickly approximate optimal alignments.

In this paper, we briefly review the current work in sequence alignment with evolutionary computation (EC). In addition, we present two novel approaches that utilize EC to optimize multiple alignments.

Our first new approach employs a steady-state GA [13, 39] to evolve guide trees, which are a fundamental component of progressive alignment algorithms [8]. The population in the GA consists of viable guide trees that are represented in an efficient, coalescing binary tree structure. This enables fast and meaningful crossover and mutation. Variability operators such as crossover and mutation are constructed such that the viability of an individual tree is never compromised. Fitness is computed objectively: the progressive alignment is performed in the pairwise order specified by a guide tree, and the fitness of that tree is the natural log of the alignment score of the resulting final alignment. In this way, the fitness of a guide tree is optimized only with respect to the most important result: the quality of the final multiple sequence alignment.

The second MSA approach facilitates the optimization of a consensus sequence [6] with a GA, using an encoding scheme designed so that the search complexity is independent of the number of sequences being aligned. The search complexity of this approach primarily depends on the length of the consensus sequence and the degree of similarity between sequences. The scheme encodes each possible matching nucleotide at a given column with binary masks. This compact representation greatly reduces the space requirement as well as the search complexity.
The objective or evaluation function gives the sum-of-pairs (SP) [4] score to determine the fitness of each chromosome in the population. The SP score has been widely used to detect and quantify similarities between sequences; however, it has no probabilistic or biological justification [7]. To further improve the performance of the GA, we have developed a sequence profiling formulation that reduces the complexity of calculating the SP scores.

2. Sequence alignment

There are diverse motivations behind the alignment of biological sequences. Genetic sequences are inherited from common ancestors through millions of years of evolution. Therefore, it is of interest to trace the evolutionary history of mutations and other evolutionary changes through sequencing [1, 5]. Alignment of biological sequences, in this context, is generally understood as a comparison based on the criteria of evolution. For example, the number of mutations, insertions, and deletions of residues necessary to transform one DNA sequence into another is a measure of phylogeny or evolutionary relatedness. On the other hand, a comparison may pinpoint regions of common origin, which may in turn coincide with regions of similar structure or function [10].

Pairwise sequence alignment is a technique for arranging two sequences so that the residues in certain positions are deemed to have a common evolutionary origin. In other words, if the same residue occurs in both sequences at the same position, then it may have been conserved during the course of evolution. If, however, the two residues differ, then it is generally assumed that they were nevertheless derived from a common ancestral residue through substitution. Homologous sequences, those related by common descent, might have different lengths, which is generally explained through insertions or deletions [27].


Statistical approaches, such as hidden Markov models, have been commonly used to detect homologous sequences and subsequently infer the alignments [7, 22]. A hidden Markov model consists of a set of states connected by probabilistic transitions. Each transition indicates the probability of moving from one state to another. The transition structure consists of repeated elements, each made up of a match, an insert, and a silent delete state. The number of repeated elements is the length of the model. Each element models a position in the consensus sequence of the sequence family and describes sequence homology.

Another commonly used approach is dynamic programming. Dynamic programming is a mathematically rigorous technique in that it is guaranteed to find the optimal alignment under a given scoring scheme [26].

MSA is an extension of pairwise sequence alignment. MSA is the process of aligning three or more sequences simultaneously to bring as many similar residues into register as possible [4, 25]. The resulting alignments are commonly interpreted in two contexts: (a) to find regions that define a conserved pattern or domain; and (b) to derive the possible phylogeny or evolutionary relationships among the sequences [12]. The presence of similar domains across multiple sequences implies a similar biochemical function or higher-level structure that may be used as the basis for further experimental investigation.

2.1. Dynamic programming

DP is a commonly used recurrence method for solving sequential or multi-stage decision problems [11, 22]. The essence of DP is the principle of optimality. DP has long been used to solve a variety of discrete optimization problems such as scheduling, string-editing, packaging, and inventory management [11]. It views a problem as a set of interdependent sub-problems, solves these sub-problems, and uses the results to solve ever-larger sub-problems. The solution to a sub-problem is expressed as a function of solutions to one or more sub-problems at the preceding levels; that is, DP expresses the problem in a recurrence formulation. To make optimal decisions for the next and all future states, DP only needs to know the current state and the state of its immediate predecessors. This is also known as the Markovian property [7]. For a process to be Markovian, future states must depend only on the present state, and the past should not have any effect on the future. The term programming in the name refers to the mathematical rules that can be followed to solve a problem; it has nothing to do with writing a computer program. DP is known to be an efficient technique for solving certain combinatorial problems. It is particularly important in bioinformatics [27] as it is the basis of sequence alignments for comparing DNA and protein sequences.

For pairwise sequence alignments, DP begins with the construction of an alignment matrix F(i, j), indexed by the positions i and j in the two sequences S_x and S_y. The matrix is first initialized with F(0, 0) = 0. The recurrence equation (Eq. (1)) is then applied repeatedly to fill the matrix of F(i, j) values. This particular formulation gives the global alignment of two sequences. F(i, j) is the maximum of three candidate scores, each derived from one of the previously computed values F(i − 1, j − 1), F(i − 1, j), and F(i, j − 1). The value s(x_i, y_j) is the score for aligning the characters x_i and y_j, while d is the penalty for gap insertion.
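The matrix-filling procedure of Eq. (1) can be illustrated with a minimal Python sketch. It is not the implementation used in this paper. The function name `global_alignment_matrix`, the match/mismatch scores standing in for s(x_i, y_j), and the gap penalty d = 2 are all assumptions chosen for illustration. The text only states F(0, 0) = 0; the sketch additionally assumes the usual global-alignment boundary values F(i, 0) = −i·d and F(0, j) = −j·d.

```python
def global_alignment_matrix(sx, sy, match=1, mismatch=-1, d=2):
    """Fill the matrix F of Eq. (1) for sequences sx and sy.

    match/mismatch stand in for s(x_i, y_j) and d is the linear gap
    penalty; all three values are illustrative assumptions.
    """
    def s(a, b):
        return match if a == b else mismatch

    rows, cols = len(sx) + 1, len(sy) + 1
    F = [[0] * cols for _ in range(rows)]

    # Assumed boundary conditions: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] - d
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] - d

    # Each cell needs only its three immediate predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s(sx[i - 1], sy[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                            # x_i aligned with a gap
                F[i][j - 1] - d,                            # y_j aligned with a gap
            )
    return F


F = global_alignment_matrix("ACGT", "AGT")
print(F[-1][-1])  # score of the best global alignment: 1
```

The bottom-right cell holds the optimal global score. Recovering the alignment itself would require a traceback through the matrix, which is omitted here to keep the sketch short.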
The value of F(i, j) is the score of the best alignment of the prefix x_1 ... x_i of sequence S_x with the prefix y_1 ... y_j of sequence S_y. There are three possible ways that x_i and y_j can be aligned: (a) x_i can align with y_j, which gives a match or mismatch; (b) x_i is aligned with a gap; or (c) y_j is aligned with a gap. Since the matrix is built recursively, the previous states F(i − 1, j − 1), F(i − 1, j), and F(i, j − 1) must be known before F(i, j) can be calculated. The following equation shows the recurrence formulation of DP for sequence alignment.

\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\tag{1}
\]

Simultaneous alignment of three or more sequences with DP, however, poses a difficult algorithmic challenge [30]. Determining the optimal alignment of more than a handful of sequences has a prohibitive time complexity [36]. Because of this, various heuristic approaches have been developed, many of which are capable of producing good alignments in a relatively short period of time. The most commonly used heuristic technique is known as progressive multiple sequence alignment [8, 32–34].

2.2. Progressive alignment

Traditional progressive multiple sequence alignment algorithms involve at least a three-step process. First, the input sequences are compared to one another using dynamic programming (DP) [8] to determine the edit distances between all possible pairs of sequences. The use of DP for computing pairwise distances guarantees an optimal result for each pairwise comparison, but has a time complexity of O(m^2) for comparing just two sequences of length m [36]. For n input sequences, the number of pairwise distance measurements that must be taken is:

\[
\text{Number of pairwise distances} = \binom{n}{2} = \frac{n(n-1)}{2}
\tag{2}
\]

For example, 10 input sequences require 45 pairwise alignments. To counter the obvious scalability issues of performing so many optimal pairwise alignments, systems such as Clustal W offer the option of using faster, less accurate forms of pairwise distance measurement. This ultimately results in the construction of less accurate guide trees, which can have a deleterious impact on the overall quality of the entire multiple sequence alignment.

After all pairwise distances have been computed, the distances are used to construct a guide tree using techniques such as Neighbor-Joining (NJ) [24, 31]. The process of constructing a guide tree [8] based on pairwise distances is simple and reasonably scalable, but it is subject to certain limitations (see Figure 1). NJ is a simplistic iterative clustering algorithm that uses pairwise edit distance information to decompose an initial star-shaped tree into a fully resolved tree representing the phylogenetic relationships between all of the taxa [24, 31]. In such a tree, the most similar sequences are clustered together first, followed by the most similar sub-alignments, and so on. Eventually, an entire tree is built which represents the similarity relationships between all of the sequences.
The tree built by NJ is subsequently used as the guide tree, which determines the order in which sequences and sub-alignments are aligned. The quality of the final alignment is typically quantified by a sum-of-pairs (SP) score.
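To make the role of the guide tree concrete, the following Python sketch shows how a tree can be read as an order of alignment operations. It is only an illustration under stated assumptions: the nested-tuple tree representation, the function name `merge_order`, and the example sequence labels are hypothetical and are not taken from Clustal W or from the approaches presented in this paper. A post-order traversal is one valid ordering, since it merges the children of every node before the node itself; actual progressive aligners may choose a different order that is still consistent with the tree.

```python
import math


def merge_order(tree):
    """Return (steps, members) for a guide tree given as nested 2-tuples.

    steps lists the pairwise merges in post-order (children before parent);
    members is the tuple of leaf labels below this node.
    """
    if isinstance(tree, str):          # a leaf: a single input sequence
        return [], (tree,)
    left, right = tree
    left_steps, left_members = merge_order(left)
    right_steps, right_members = merge_order(right)
    # Align the two sub-alignments only after both have been built.
    steps = left_steps + right_steps + [(left_members, right_members)]
    return steps, left_members + right_members


# Hypothetical guide tree over five sequences labelled A to E.
guide_tree = ((("A", "B"), "C"), ("D", "E"))
steps, members = merge_order(guide_tree)

# Eq. (2): pairwise distances needed before a tree like this can be built.
print("pairwise distances:", math.comb(len(members), 2))
for group1, group2 in steps:
    print("align", group1, "with", group2)
```

For n sequences the tree always yields n − 1 merge steps, whereas building it requires the n(n − 1)/2 pairwise distances of Eq. (2).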


Figure 1. The traditional progressive alignment algorithm. (a) All possible pairs of sequences are optimally aligned using dynamic programming to determine their edit distance. Then, (b) edit distance information is used by a neighbor-joining algorithm to estimate and construct a guide tree. (c) Finally, the sequences are progressively aligned using the guide tree in order to produce an alignment.

2.3. Clustal W

Clustal W is a popular progressive alignment system. Since progressive alignment is a heuristic algorithm, Clustal W is not guaranteed to find optimal alignments [8, 32–34]. Clustal W exploits the fact that homologous sequences are evolutionarily related. It builds up multiple alignments progressively with a series of pairwise alignments, moving from the leaves upward in a guide tree that estimates the phylogeny of the sequences [8]. Although Clustal W does not always find optimal alignments, in most cases its alignments give a good starting point for further automatic or manual refinement. This type of alignment is generally useful for identifying regions that are highly conserved. The alignment can be further improved through sequence weighting, position-specific gap penalties, and choice of weight matrix [2].

Progressive alignment suffers from a local maxima problem that stems from the nature of the strategy itself. As the algorithm follows the guide tree and merges sequences together, the solution is never guaranteed to be globally optimal, as defined by some overall measure of alignment quality. Any regions misaligned early in the alignment process cannot be corrected later as new information from other sequences is introduced. This problem is frequently a result of an incorrect branching order in the guide tree. One way to correct this is to use an iterative or stochastic sampling procedure such as bootstrapping [33].

The choice of alignment parameters is also problematic in Clustal W. If parameters are not chosen appropriately, alignments will not converge to a globally optimal solution. For closely related sequences, any reasonable scoring matrix should work well because matches usually receive the most weight; when matches dominate an alignment, almost any weight matrix will find a good solution. However, when aligning more divergent sequences, gaps and mismatches occur more frequently, so the range of workable scores for them narrows and the exact values become critical.
Moreover, for highly conserved sequences, the range of gap penalties that will find the correct or best possible solution can be very broad. As more and more divergent sequences are added, however, the exact values for gap penalties become critical for success [31]. Our observations have confirmed that this is a common problem in most MSA algorithms. As the number of sequences in an alignment increases, the expected number of matches in each column also increases. For example, the probability of finding a matching nucleotide in a column of ten sequences is much higher than in a column of three sequences. In general, it is difficult to justify why one scoring matrix is better than the others [7].

2.4. Sum-of-pairs (SP) scores and substitution matrices

Carrillo and Lipman [4] first introduced the sum-of-pairs (SP) score function, which defines the score of a multiple alignment of N sequences as the sum of the scores of the N(N − 1)/2 pairwise alignments. Although the SP score function has been widely used to evaluate MSA, it does not provide any biological or probabilistic justification [7]. Each sequence is scored as if it were descended from the N − 1 other sequences, instead of from a single ancestor. As a result, evolutionary events are often overestimated. The problem worsens as the number of sequences increases. A weighted SP score function [2] has been proposed to partially compensate for this effect. Moreover, despite the simplicity of the SP score function, the running time and space needed to find an SP-optimal alignment make exact optimization impractical even for modestly sized sets of short sequences. It has been shown that the problem of computing an MSA with optimal SP score is NP-hard [36]. Several fast approximations and divide-and-conquer approaches [30] have been proposed to overcome the computational complexity.

In [2] and [6], the SP function, w(M), sums all the pairwise substitution scores in the columns for the sequence pairs p and q. Each column is evaluated with a scoring matrix. The substitution scoring function, s(m_{pj}, m_{qj}), defines all possible alignments for nucleotides p_j and q_j; it gives the score of the alignment at column j for sequences p and q. The weight, α_{p,q}, is intended to balance the overestimation problem in the SP score function [2, 6, 7]. The following equation (Eq. (3)) shows the mathematical formulation of the weighted SP score function. w(M) =

 1≤ p