A Clustering Approach to Generalized Tree

A Clustering Approach to Generalized Tree Alignment with Application to Alu Repeats

Benno Schwikowski (1,2) and Martin Vingron (1)

(1) DKFZ, Abt. Theoretische Bioinformatik, INF 280, D-69120 Heidelberg, Germany
(2) GMD-SCAI, D-53754 St. Augustin, Germany

Abstract. A formalization of the multiple sequence alignment problem that emphasizes the problem’s evolutionary aspect is the Generalized Tree Alignment Problem. Given a set of sequences, this formalization asks for a phylogenetic tree and ancestral sequences such that the implied amount of change necessary to explain the given data is minimal. The problem is computationally hard and we present a heuristic algorithm for it. Our procedure mimics agglomerative clustering techniques as used for phylogenetic trees while at the same time aligning the sequences using the data structure of sequence graphs. The approach achieves good results in terms of the underlying scoring function. It produces biologically meaningful answers, which we demonstrate in this paper on a set of Alu repeats.

1 Introduction

The comparison of biological sequences is one of the areas where molecular biology has profited from contributions by mathematics and computer science. Pairwise sequence alignment programs in particular have become routine tools in the study of DNA and protein sequences and in searching sequence databases. Programs that simultaneously align several sequences have been much harder to develop, and there still is no definitive answer to this problem. In practice, many multiple alignment programs today use procedures that are successful in that they frequently reproduce the experimentalist’s intuition, while a formalization of the problem is still lacking in these approaches.

In this paper we focus on a formalization that models the evolutionary aspect of multiple sequence alignment. Assume that a phylogenetic tree for a set of sequences is given. Then one formalization of the problem is to ask for a set of sequences to assign to the inner nodes of this phylogenetic tree such that the sum of the alignment distances along all edges of the tree is minimized. This version of the problem is called the Tree Alignment Problem and was introduced by David Sankoff [11]. However, frequently the phylogenetic tree is not known and should be derived from the set of sequences, too. The resulting problem is called the Generalized Tree Alignment Problem: Given a set of sequences, find a tree such that, with suitably chosen ancestral sequences at the inner nodes, the sum of the alignment distances along all edges is minimal. The Generalized Tree Alignment Problem has been proven to be MAX SNP-hard. This means,

? E-mail: fschwikowski,[email protected] Work supported by DFG grant Vi–160/1

roughly speaking, that an arbitrarily good practical approximation algorithm could only be designed if the complexity classes P and NP were equal.

In the current paper we present a clustering-based, iterative approach to the problem. Before pointing out the specific advantages of the new procedure we briefly summarize the hierarchic alignment approach as found in many implementations. Generally, in this kind of approach the given sequences are initially clustered to obtain a hierarchic tree. Then they are aligned by iteratively applying the pairwise alignment algorithm. Given a set of unaligned sequences and the hierarchic clustering, the sequences in two-element clusters are aligned to each other. These alignments are subsequently treated as single entities and remain fixed throughout the remainder of the procedure. Doolittle and Feng [2] coined the phrase “once a gap always a gap” to signify this. Two groups, each of which has already been aligned, are then aligned with each other, keeping the alignments fixed and using average values as scores.

The hierarchic clustering on which this procedure is based is typically calculated by an agglomerative clustering algorithm [1, 14]. These procedures successively collapse more and more sequences into clusters. What distinguishes different clustering algorithms is how they define the distance between a newly formed cluster and the other clusters or sequences. To be more precise, let a matrix of distances between objects be given. Select the smallest distance, say between objects i and j. Collapse the two objects into a cluster. To treat this cluster as a new object, a distance between it and the other objects needs to be defined. Single linkage clustering at this point chooses, for each object not in the cluster, the minimal distance to either of the two objects in the newly formed cluster. Maximum linkage clustering at that point chooses the maximum.
The newly formed cluster now becomes an object like any other one and the procedure continues. More successful clustering methods have used averaging in setting the distances to new clusters [14, 10]. In sequence alignment this averaging amounts to using average scores between the pre-aligned groups of sequences in the course of the computation. This elegant method has been suggested by Taylor [15] and later by Wong et al. [16].

In our current approach we want to further develop this idea in two respects. First, we are dealing with Generalized Tree Alignment and thus want to minimize a well-defined scoring function. Hence we aim for a data structure that allows us to closely assess the effect of heuristic decisions on the final value of the scoring function. As a consequence we will use a structure different from pre-aligned sequence groups to represent the stages of the clustering. Our data structure is called a sequence graph and serves to represent sets of aligned sequences and candidates for their common ancestors. The second major modification refers to when this assignment is made: the final choice of an ancestral sequence is made not at the time of comparing a sequence graph to other sequences or sequence graphs, but in a backtracking step after the topology of the tree has been chosen.

In our presentation an emphasis will be placed on some of the practical aspects of the algorithm, such as the size and structure of sequence graphs. Furthermore we will present the application of our program to the study of a set of Alu repeats. Another variant of the algorithm with a guaranteed error bound is based on a description of Generalized

Tree Alignment as a Steiner Tree Problem. It is described in detail elsewhere [13].
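The agglomerative scheme summarized in the introduction can be illustrated with a small sketch. It is not the authors' implementation: the function name `agglomerate`, the dictionary-based distance table and the tie-breaking rule are assumptions made for illustration only. Passing `min` gives single linkage and passing `max` gives maximum linkage.

```python
from itertools import combinations


def agglomerate(dist, linkage=min):
    """Agglomerative clustering on a symmetric distance table (illustrative sketch).

    dist maps frozenset({a, b}) to the distance between objects a and b.
    linkage is min (single linkage) or max (maximum linkage).
    Returns the merge history as (object_a, object_b, distance) triples.
    """
    dist = dict(dist)  # work on a copy so the caller's table is untouched
    objects = set()
    for pair in dist:
        objects |= pair
    merges = []
    while len(objects) > 1:
        # Select the pair of objects with the smallest distance
        # (ties are broken by the string order of the objects).
        a, b = min(combinations(sorted(objects, key=str), 2),
                   key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        new = (a, b)  # the new cluster becomes an object like any other
        objects -= {a, b}
        for other in objects:
            # Distance from the new cluster to every remaining object.
            dist[frozenset((new, other))] = linkage(
                dist[frozenset((a, other))], dist[frozenset((b, other))])
        objects.add(new)
        merges.append((a, b, d_ab))
    return merges
```

With three objects at distances d(A,B) = 1, d(A,C) = 4 and d(B,C) = 2, both variants first merge A and B; single linkage then joins C at distance 2, maximum linkage at distance 4.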

2 Sequence Graphs

When posed for two individual sequences, the Generalized Tree Alignment problem asks for a most parsimonious explanation of their history since their divergence in evolution. The answer is a series of mutations and indels constituting an evolutionary pathway in the space of all sequences. For two sequences, the most parsimonious solutions are all shortest paths between them. Our heuristic uses the sequences on shortest paths between pairs to form the set of candidate ancestral sequences of the two sequences. Sequence graphs, introduced by Hein [4] to generate a tree alignment, efficiently support operations on shortest paths. Given two sequence graphs, all sequences on all shortest paths between the represented sequence sets can be efficiently calculated and be represented by a new sequence graph.

We consider sequences $a_1 a_2 \ldots a_n$ over an alphabet $\Sigma$ ($|\Sigma| = 4$ for the nucleotides, $|\Sigma| = 20$ for the amino acids). The set of all finite sequences is denoted by $\Sigma^*$. The symbol “$-$” denotes a letter that is neutral for sequence catenation, e.g. $s- = s$ for any sequence $s$. We let $\Sigma' := \Sigma \cup \{-\}$.

Definition 1. A sequence graph is a directed, acyclic graph $G = (V, E)$ with labeled edges, $E \subseteq V \times V \times \Sigma'$. An edge $e = (i, j, l)$ runs from $i$ to $j$ and is labeled with $l \in \Sigma'$. Since $G$ is acyclic, there exists a topological ordering on its vertices; w.l.o.g. we can assume that $V = \{1, \ldots, n\}$ and $i < j$ for all $(i, j, l) \in E$. $G$ has one node without incoming edges (its source) and one without outgoing edges (its sink). The edge labels of a path $P = (e_1, \ldots, e_k)$ in $G$ generate a sequence $s(P) := l_1 l_2 \ldots l_k$. When $P$ runs from source to sink, the sequence $s(P)$ is said to be represented by $G$; $S(G)$ denotes the set of all represented sequences. When the sequences in $S(G)$ represent a cluster $j$ we will also refer to $S(G)$ as the image of $j$ and denote it by $I_j$. Note that, in $G$, multiple edges between one node and another are possible.
Any single sequence $s$ can also be represented by a linear sequence graph. We will denote it by $G(s)$.

Definition 2. Let $d(s, t)$ be the edit distance between two sequences $s$ and $t$. We say that a sequence $u$ is between $s$ and $t$ when $d(s, t) = d(s, u) + d(u, t)$. Let $d(I_l, I_m)$ be the minimal $d(s, t)$ of any $s \in I_l$ and $t \in I_m$ and call two sequences $s$ and $t$ for which the minimum is assumed a closest pair. We say a sequence $u$ is between $I_l$ and $I_m$ when it is between any such closest pair $s$ and $t$.

Let $w : \Sigma \times \Sigma \to \mathbb{R}_{\geq 0}$ be a metric mutation penalty function and $g : \mathbb{R} \to \mathbb{R}_{>0}$ the gap penalty function. We assume that $g$ is an affine linear function $g(k) = a + b \cdot k$ with $a, b \in \mathbb{R}_{\geq 0}$ and $w(x, y) < 2 \cdot b$ for any $\{x, y, z\} \subseteq \Sigma$. Only to facilitate the exposition in the next paragraph we additionally demand that the mutation penalty function $w : \Sigma \times \Sigma \to \mathbb{R}$ obeys the strict triangle inequality $w(x, z) < w(x, y) + w(y, z)$.
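Definition 2 can be made concrete for two plain sequences. The sketch below assumes the simplest admissible parameters, a uniform mutation penalty of 1 and a linear gap penalty $g(k) = k$ (so $w(x, y) < 2 \cdot b$ holds); the names `edit_distance` and `is_between` are hypothetical helpers, not part of the original program.

```python
def edit_distance(s, t, mismatch=1, gap=1):
    """Edit distance with uniform mismatch cost and linear gap cost g(k) = gap * k."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def is_between(u, s, t):
    """True when u lies on a shortest path between s and t (Definition 2)."""
    return edit_distance(s, t) == edit_distance(s, u) + edit_distance(u, t)
```

Under these assumed costs, TCA is between TCTA and TAC (1 + 2 = 3), whereas GGG is not.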

Shortest Paths between Sequence Graphs. Given two sequence sets $I_1, I_2 \subseteq \Sigma^*$ that can be represented by sequence graphs $G_1$ and $G_2$, we want to identify all shortest paths between $I_1 = S(G_1)$ and $I_2 = S(G_2)$. This involves finding all “closest pairs” $(s_1, s_2) \in I_1 \times I_2$ with minimal $d(s_1, s_2) = d(I_1, I_2)$. Kruskal and Sankoff [7] have generalized the dynamic programming approach for the alignment of two sequences [9] to the optimal alignment of two directed networks. Directed networks represent sequences with node labels instead of edge labels. Given two networks $N_1$ and $N_2$ representing sequence sets $I_1$ and $I_2$, their algorithm calculates $d(I_1, I_2)$ for linear gap penalty functions $g(k) = b \cdot k$. Hein [4] has rephrased this approach for sequence graphs, in combination with the reasoning of Gotoh [3], in order to handle affine linear gap penalty functions $g(k) = a + b \cdot k$ and represent all sequences on any shortest path between $I_1$ and $I_2$ in a new sequence graph.

For the purpose of exposition, we present the alignment algorithm for the simpler case of a linear gap penalty function $g(k) = b \cdot k$. Besides the non-linearity of sequence graphs, the presence of “$-$” as an edge label implies another slight difference to the classical dynamic programming approach. We therefore consider extended alignments where columns $\binom{-}{-}$ can occur. We extend the mutation penalty function $w$ to $w' : \Sigma' \times \Sigma' \to \mathbb{R}_{\geq 0}$ by defining $w'(-, -) := 0$ and $w'(-, x) := w'(x, -) := g(1) = b$ for all $x \in \Sigma$, so that the score of an extended alignment is the sum of $w'(x, y)$ over all columns $\binom{x}{y}$ and this score is identical to the score of the corresponding reduced alignment. Assuming that $V(G_1) = \{1, \ldots, m\}$ and $V(G_2) = \{1, \ldots, n\}$, the following algorithm calculates $d(G_1, G_2)$, i.e. the score of an optimal alignment between two optimally close sequences $s_1 \in S(G_1)$ and $s_2 \in S(G_2)$. It works analogously to the alignment of networks [7].

Algorithm ALIGN($G_1$, $G_2$)

for i := 1 to m do
    for j := 1 to n do
        $d(i, j) := \min_{e_1, e_2} \{\, d(i', j') + w'(l_1, l_2),\; d(i', j) + w'(l_1, -),\; d(i, j') + w'(-, l_2) \,\}$
    end for
end for
output d(m, n);

The minimization in the inner loop is performed over all edges $e_1 = (i', i, l_1) \in E(G_1)$ and $e_2 = (j', j, l_2) \in E(G_2)$. For the special case $i = j = 1$ we let $d(i, j) := 0$. Algorithm ALIGN provides the basis for the representation of all shortest paths between $I_1$ and $I_2$. In this computation each pair $(e_1, e_2) \in E_1 \times E_2$ is considered exactly once in the minimization of ALIGN. Thus the run time is $O(|E_1| \cdot |E_2|)$. Since $d(i, j)$ can be stored in a two-dimensional array, the space requirement is in $O(|V_1| \cdot |V_2|)$.

Optimal Alignment Graph and Shortest Paths Graph. Once the matrix $d$ is calculated by ALIGN, a graph representing all optimal alignments can be generated by

backtracking in $O(|E_1| \cdot |E_2|)$ time. The procedure is analogous to the two-sequence case. The vertices of the graph are a subset of $\{(i, j) \mid i \in \{1, \ldots, m\},\ j \in \{1, \ldots, n\}\}$. Edges are labeled with alignment columns $\binom{l_1}{l_2}$, $\binom{l_1}{-}$, or $\binom{-}{l_2}$, corresponding to the first, second or third term in the minimization of ALIGN that leads to the minimum $d(i, j)$. The paths from source to sink correspond to the optimal alignments. We call this graph the Optimal Alignment Graph of $G_1$ and $G_2$ and denote it by $\mathcal{A}(G_1, G_2)$. Figure 1 gives an example.
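The ALIGN recurrence for the linear gap penalty case can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++/LEDA implementation: a sequence graph is taken to be a pair `(n, edges)` with vertices `1..n` in topological order (source 1, sink n) and edges `(i, j, label)`, the neutral letter is written `"-"`, and the costs default to a uniform mutation penalty of 1 with $b = 1$. The names `w_ext`, `linear_graph`, `incoming` and `align` are hypothetical.

```python
import math

GAP = "-"  # neutral letter of the extended alphabet


def w_ext(x, y, mismatch=1, b=1):
    """Extended penalty w' for a linear gap cost g(k) = b * k (assumed costs)."""
    if x == y:
        return 0  # covers w(x, x) = 0 and w'(GAP, GAP) = 0
    if GAP in (x, y):
        return b
    return mismatch


def linear_graph(s):
    """G(s): vertices 1..len(s)+1, one edge per letter of s."""
    return len(s) + 1, [(k + 1, k + 2, c) for k, c in enumerate(s)]


def incoming(n, edges):
    """Map every vertex to the list of edges that end in it."""
    table = {v: [] for v in range(1, n + 1)}
    for e in edges:
        table[e[1]].append(e)
    return table


def align(g1, g2):
    """Return d(G1, G2), the score of an optimal alignment of two closest sequences."""
    (m, e1), (n, e2) = g1, g2
    in1, in2 = incoming(m, e1), incoming(n, e2)
    d = {(1, 1): 0}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if (i, j) == (1, 1):
                continue
            best = math.inf
            for ip, _, l1 in in1[i]:
                best = min(best, d[ip, j] + w_ext(l1, GAP))        # second term
                for jp, _, l2 in in2[j]:
                    best = min(best, d[ip, jp] + w_ext(l1, l2))    # first term
            for jp, _, l2 in in2[j]:
                best = min(best, d[i, jp] + w_ext(GAP, l2))        # third term
            d[i, j] = best
    return d[m, n]
```

For two linear graphs the function reduces to the classical pairwise recurrence. A graph with a `"-"` edge, such as one representing both TCA and TA, aligns to G(TA) at cost 0 because the neutral letter can be paired with an empty column.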

Fig. 1. Optimal Alignment Graph for $G_1 = G(\mathrm{TCTA})$ and $G_2 = G(\mathrm{TAC})$. The paths from left to right correspond to the optimal alignments of TCTA and TAC. [Figure not reproduced; its edges are labeled with alignment columns such as T/T, C/–, T/A and A/C.]

Based on the Optimal Alignment Graph the sequences on shortest paths can be determined. Any sequence $s$ on a shortest path between $S(G_1)$ and $S(G_2)$ can be found by (a) choosing an optimal alignment of $s_1 \in S(G_1)$ with $s_2 \in S(G_2)$ with minimal distance $d(G_1, G_2)$, and (b) performing any subset of the indels and mutations, implied by the alignment, on $s_1$. Choice (a) can be performed in $\mathcal{A}(G_1, G_2)$ by choosing any path from source to sink. We can incorporate choice (b) by replacing each edge $e$ in $\mathcal{A}(G_1, G_2)$ by one or two labeled edges. When only one letter $x \in \Sigma'$ occurs in the alignment column associated with $e$ there will be one edge labeled $x$. When another letter $y \neq x$ occurs there will be two edges labeled $x$ and $y$. Since the new edges are labeled with elements of $\Sigma'$, the resulting graph, denoted by $\mathcal{P}(G_1, G_2)$, is again a sequence graph. We call $\mathcal{P} = \mathcal{P}(G_1, G_2)$ the Shortest Paths Graph of $G_1$ and $G_2$ since it exactly represents the set

$S(\mathcal{P}) = \{\, s \in \Sigma^* \mid s \text{ lies on a shortest path between } S(G_1) \text{ and } S(G_2) \,\}.$
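The edge replacement that turns an Optimal Alignment Graph into a Shortest Paths Graph can be illustrated as follows. The sketch assumes that the alignment graph is given as a list of edges `(u, v, (x, y))` labeled with alignment columns; the example input is a single optimal alignment of TCTA and TAC under unit costs rather than the full graph of Fig. 1, and the helper names `expand_columns` and `represented` are hypothetical.

```python
GAP = "-"  # neutral letter; it contributes nothing to a represented sequence


def expand_columns(alignment_edges):
    """Replace each edge (u, v, (x, y)) by one or two single-letter edges."""
    out = []
    for u, v, (x, y) in alignment_edges:
        out.append((u, v, x))
        if y != x:
            out.append((u, v, y))  # a second edge only when the letters differ
    return out


def represented(edges, source, sink):
    """Enumerate S(G): the label sequences of all source-to-sink paths."""
    if source == sink:
        return {""}
    seqs = set()
    for u, v, label in edges:
        if u == source:
            letter = "" if label == GAP else label
            seqs |= {letter + rest for rest in represented(edges, v, sink)}
    return seqs


# One optimal alignment of TCTA with TAC (columns T/T, C/-, T/A, A/C).
alignment = [(1, 2, ("T", "T")), (2, 3, ("C", GAP)),
             (3, 4, ("T", "A")), (4, 5, ("A", "C"))]
paths_graph = expand_columns(alignment)
sequences = represented(paths_graph, 1, 5)
```

The expanded graph has seven edges and represents eight sequences, among them both endpoints TCTA and TAC; each of the eight results from applying a subset of the three implied edit operations to TCTA. The exhaustive enumeration is only meant for small examples, since the number of paths can grow exponentially.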

$\mathcal{P}(G_1, G_2)$ is computed from $\mathcal{A}(G_1, G_2)$ by performing the above replacement once for each edge. Hence the total time complexity for the computation of $\mathcal{P}(G_1, G_2)$ is still $O(|E_1| \cdot |E_2|)$.

3 Using Sequence Graphs for Clustering

The following algorithm realizes the clustering approach described in the introduction while representing clusters $S_i$ as sequence graphs $G_i$: $I_i = S(G_i)$.

Algorithm DPH-AV($s_1, \ldots, s_n$)

start with n singleton clusters $S_1 = \{s_1\}, \ldots, S_n = \{s_n\}$;
let $G_1, \ldots, G_n$ be sequence graphs, $G_i = G(s_i)$;
while there is more than one cluster left do
    find two sequence graphs $G_l, G_m$ with minimal distance ALIGN($G_l, G_m$);
    eliminate clusters $S_l$ and $S_m$ and create a new cluster $S_k = S_l \cup S_m$;
    let $G_k$ be the Shortest Paths Graph $\mathcal{P}(G_l, G_m)$;
end while
labeling stage: for each non-terminal cluster $S_i$ select one sequence $s_i$ from $I_i$;
output the hierarchy of clusters, each cluster $S_i$ labeled with $s_i$;

Labeling Stage. It remains to specify how, in the labeling stage, suitable sequences from non-terminal cluster images are chosen. We let the labeling stage start with the root cluster $S_k$. Initially, one sequence $s_k \in I_k$ is chosen arbitrarily. We proceed recursively as follows. $s_k$ lies on a shortest path $P_k$ between two sequences $s_l$ and $s_m$ of the two subcluster images $I_l$ and $I_m$. In order to recover $s_l$ and $s_m$, the corresponding sequence graphs $G_l$ and $G_m$ must still be accessible in the labeling stage. Each edge on the directed path $P$ that represents $s_k$ in $G_k$ has two associated pointers into the edge sets of the sequence graphs $G_l$ and $G_m$, so that $P$ reveals the directed paths $P_1$ and $P_2$ in these sequence graphs that gave rise to $P$ during the construction of $G_k$. Thus, $s_l$ and $s_m$ are the sequences generated by the directed paths $P_1$ and $P_2$, respectively. Since $s_l \in I_l$, $s_l$ itself lies on a shortest path $P_l$ connecting two sequences in the images of the subclusters of $S_l$. The procedure is now repeated for all clusters and their subclusters, until one sequence $s_i$ from each cluster $S_i$ and a shortest path $P_k$ from each non-terminal cluster $S_k$ have been chosen.

Multiple Alignment and Tree Length. Finally a multiple alignment can be constructed from the tree. It is chosen compatibly with the implied pairwise alignments along the edges of the tree. This results in a tree alignment score equal to the length of the tree. Since the pairwise alignments have been computed during the execution of DPH-AV and are contained in the respective sequence graphs, they need not be computed again. The resulting tree $T$ consists of the union of the $P_k$ over all non-terminal clusters $S_k$. Thus the tree length is the sum of $l(P_k)$ over these $k$. Each $P_k$ is a shortest path connecting its two subclusters $S_l$ and $S_m$. Its length is $l(P_k) = d(I_l, I_m) = \mathrm{ALIGN}(G_l, G_m)$, which is the quantity that is minimized during the execution of DPH-AV.
Thus the total tree length is independent of the choices made in the labeling stage.

3.1 A Variant with a Guaranteed Error Bound

The algorithm DPH-AV can be modified slightly so that it satisfies a worst-case error bound of $(2 - \frac{2}{n})$ for $n$ sequences, compared to the optimum. To obtain this bound it suffices to enlarge the image of each cluster by the sequences of the leaves in the corresponding tree. This can be accomplished by adding to the sequence graph $G_k$ separate paths from source to sink, additionally representing each $s \in S_k$. The modified graph $G'_k$ then represents not only all sequences on any shortest path between $G_l$ and $G_m$, but also the sequences in $S_l$ and $S_m$. The modified algorithm, called DPH-F2, calculates a score that is at most $(2 - \frac{2}{n})$ times the score of the optimal solution. The proof for this can be found in [13].

Computational Complexity. A governing factor in the overall complexity of the algorithms DPH-AV and DPH-F2 is the size of the sequence graphs generated. In terms of memory needed, a sequence graph $\mathcal{P}(G_1, G_2)$ can, in the worst case, grow to prohibitive sizes. This might happen if, e.g., $G_1$ and $G_2$ always represent only distantly related sequences. For our examples on biological data the sequence graphs stay moderately small.

4 Experimental Results

Implementation. We have implemented DPH-AV and DPH-F2 in C++, using the LEDA library for efficient data structures and algorithms [8]. We compare the results of our algorithms to the latest publicly available version of J. Hein’s TREEALIGN [4, 5, 6]. We evaluated TREEALIGN on the basis of its tree output, i.e. we calculated the tree length from the multiple alignment and the tree topology that TREEALIGN delivers. Execution times of our program for the following examples are in the range of 3 minutes on a Sun UltraSparc machine.

Alu Repeats. We extracted from the Genbank database the entry for the gene region of the human iduronate-2-sulfate sulfatase gene (accession number L43581). This sequence is 130000 base pairs long and rich in Alu repeats. The feature table of the database entry identifies the position of 19 Alu repeats. We extracted exactly these sequences and ran our alignment program on this data set. The first test run employed a mutation penalty of 4 and an affine linear gap penalty function of $g(k) = 10 + 3k$. The complete sequence graph representing the root cluster has 1210 nodes, a longest path of length 247, and there are $4.70 \cdot 10^{31}$ different paths from source to sink. Note that different paths do not necessarily represent different sequences. The total tree length of 4183 constitutes an improvement of 18.1% over the tree of length 5113 computed by TREEALIGN for the same data and parameters. Figure 2 depicts the topology of the tree yielded by the labeling stage and Fig. 3 shows the multiple alignment. The table below summarizes the improvements relative to the results of TREEALIGN. These data are given for both algorithm DPH-F2 and algorithm DPH-AV in combination with several gap penalty functions $g(k)$.

Improvement in tree length relative to TREEALIGN:

g(k)       10 + 3k    15 + 2k    15 + 3k
DPH-AV     18.1%      38.8%      10.8%
DPH-F2     19.7%      39.3%      12.3%

5 Discussion and Outlook

Interpretation of Alu results. From the viewpoint of data analysis there are some striking features about the tree of the Alu sequences. The sequences are taken from one gene region and named in order of occurrence. Thus there is a linear order in these data. Visually the tree has a rather linear structure, i.e. to a large degree the tree resembles a “hedgehog tree”. Is there a correspondence between the linear order of the Alus on the gene and this linear order in the tree? Inspecting the tree one finds, e.g., Alu09 and Alu10 on neighboring branches of the tree. Alu08 is nearby, in fact separated by the pair Alu18 and Alu19. Such an observation virtually demands that a formal model of the order of duplications on the gene be made. It is conceivable that duplication events produced successively more copies of the repeat. As a second mechanism one or more adjacent copies might have reinserted themselves elsewhere in the gene. The combination of these two aspects should give rise to interesting new formal problems in the study of evolution.

Postoptimization. On the algorithmic level several improvements of our algorithm are possible. So far we have not made use of any postoptimization operations that could further reduce the length of the final tree. For example, the method of Sankoff et al. [12] locally improves the position of an inner node of degree 3 by replacing it with a sequence that minimizes the sum of the edit distances to its three neighbors. The procedure can be iterated over the inner nodes until no further reduction in total tree length is achieved. Further improvement can be obtained by modifying the tree topology. Hein [5, 6] performs nearest neighbor interchanges in a greedy fashion. Another greedy strategy rearranges the tree by inserting one edge and breaking the resulting cycle by removing another edge.
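The local improvement of an inner node mentioned under Postoptimization can be sketched as below. This is a simplified illustration and not the method of Sankoff et al. [12] itself: it assumes unit mutation and gap costs and searches only an explicitly supplied candidate list instead of the full sequence space. The names `edit_distance` and `improve_inner_node` are hypothetical.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (assumed scoring, for illustration only)."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (s[i - 1] != t[j - 1])
            cur.append(min(sub, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def improve_inner_node(current, neighbors, candidates):
    """Pick the label with the smallest summed distance to the three neighbors.

    The current label is always among the options, so the local tree
    length never increases.
    """
    def cost(u):
        return sum(edit_distance(u, v) for v in neighbors)
    return min([current, *candidates], key=cost)
```

For an inner node labeled TTA with neighbors TAC, TTC and TAA, the summed distance is 4; relabeling the node with TAC lowers it to 2. Iterating such a step over all inner nodes until nothing changes corresponds to the loop described above.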

Fig. 2. Unrooted phylogeny of 19 Alu sequences (Alu01 to Alu19) as inferred by DPH-AV. Branch lengths are proportional to edit distance. [Figure not reproduced.]

Gap penalty: 10 + 3 k, mutation penalty: 4 Overall tree length = tree alignment score: 4183

Fig. 3. Alignment of 19 Alu sequences, aligned by DPH-AV

--------------------------------------------------------------------------------------------------------------TTTGAGACCAGCCTGG--CCAACATGGTGAAAC-CCTGTCTCTACTAAAAA------------------------------------------------------------------------------------------------------------------CCAG-CATCATCCTGATACCAAAATCTGGCAAA----GTCATGACAAGAAA----GGCCAGGCACGGTGGCTCATGCCTGTAATCCCAGAACTT-TGGGAGGCCGAGGCGGGCGGATCACCT--GAGGTCAGGAGT-----------------------------TCAA-GACCAGCCTGA--CCAACATGGAGAAAC-CCTGTCTCTACTAAAAA-----GCCAGCCACAGTGGCTCATGCCTGTAATCCCAGCACTT-TGGGAGGCTGAGGCAGGTGGATCACTT--GAGATCAGGAGT-----------------------------TCGA-GGCCAGCCTGA--TCAACATGGAGAAAC-CCCATCTCTACTAAAAA----GGCCAGGCGCGGTGACTCATGCCTGTAATCCCAGCACTT-TGGGAGGCCGAGGTGGGCGGATCACCT--GAGATCAGGAGT-----------------------------TCGT-GATCACCCCAG--CCAACATGGCGAAAT-CCCACCTCTACTAAAAA-----GCCGGGCGCGGT-GCTCACGCCTGTAATCCCAGCACTT-TGGGAGGCTGAGGCGGGTGGATCAC----GAGGTCAGGAGT-----------------------------TCAA-GACCAGCCTGG--CCAAGATGGTGAAAT-CCCGTCTCTACTAAAAA----GGCTGGGCGTGGTGGCTCACGCCAGTAATCCCAGCACTT-TGGGAGGCCAAGGTGGGCGGATCAC----GAGGTCAGGAGA-----------------------------TCAA-GACCATCCTGG--CCAACAAGGAGAAAC-CCCATCTCTACTAAAAA----GGCCAGGCGTGGTGGCTCACGTCTGTAATCCCAGCATTT-TAGGAGGTTGAGGCGGGTGGACCAC----GAGGTCAAGAGA-----------------------------TCGA-GACCATTCTGC--CCAACATGGTGAAAC-CCCGTCTCTACTAAAAC------CTGGGTGCAGCGGCTCACACCTGTAATCCCAGCACTT-TGGGAGGCCAAGGCAGGCGGATCAC--CTGAGGTGAGGAGT-----------------------------TCAA-GACCAGACTGA--CCAACATGGTGAAAC-CCTGTCTCTACTAAAAA----GGCTGGGTGCGGTGGCTCACACCTCTAATCCCAGCACTC-TGGGAGGTGA------GTGGATTGC--CTGAGGTAAGGAGT-----------------------------TCGA-GACCAGCCTGA--CCAACATGGTGAAAC-CCCGTCTCTACTAAAAA----TGTT-GGCACAATGTCTCACACCTGTAATCCCAGCAGGCGTGGGAGGGCAAGGCAGGAGGATCAC--CTGAGGTCAGGAGT-----------------------------TCAA-GATCAGCCTGA--CCAACATAGTGAAAC-TCCGTCTCTACTAAACA----AGTCGGATGCGGTGGCTCACACCTGTAATCCCAGCACTT-TGGGAGGCCCAGGCAGGCGGATCAC--TTGAGGTTAGGAGT-----------------------------TCAA-GACCAGCCTGG--CCAACATGGTGAAAC-CCCGTCTCTACTAAAAA----GGCCGGGTGTGGTGGCTCAT
GCCTGTAATCCTAGCACTT-TGGGAGGCCAAGGCGGGTGGATTGC--CTGAGCTCAGGAGT-----------------------------TCGA-GACCAGCCTGG--GCAGCTTGGTGAAAC-CCTGTCTCTACTAAAAT----GACTGGGCACAATGACTCACGCCTGTAATCTTAGCACTT-TGGGAGGCCGAGGCGGGCAGATAAC--TTGAGCTCCGGAGT-----------------------------TCAT-GACCAGCCTGG--GCAACATAGTGAGAC-CCTGTCTCTACTAAAAAGACAA ---------------------------------------------------------------------------------ATGAGGGAGGATTGCTTGAGGCCAGAAGCTCAA-GACTAGCCAGG--GCAACATAACAAGACACCCATCTTTACACAAAT-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Alu11 Alu14 Alu02 Alu16 Alu15 Alu06 Alu08 Alu19 Alu18 Alu09 Alu10 Alu07 Alu04 Alu13 Alu05 Alu01 Alu17 Alu12 Alu03

---T----------------ACAAAAA--TTAG-------------------------------------CCAGGCATGGTGG--TGCA-A-------GC------TTGTAATCCCAGCTACT-CA------GGAGGCTGAGAAGGAAGAATC---GCTTGAACCT Alu11 ---A----------------AAAAAGA--AG-G-------------------------------------CCTGGCACAATGG--CTCA-T-------GC------CTGTAATCCCAGCTACT-TG------GGAGGCTGAGGCATGATAATC---ACCTGAAACC Alu14 ---T----------------ACAAAA---TTAG-------------------------------------CTGGGCGTGGTGG--TGCA-T-------GC------CTGTAATCCCAGCTACT-TG------GAAGTCTGAGGCAGGAGAATC---GCTTGAACCC Alu02 ---T----------------ACAAAA---TTAG-------------------------------------CTGGGTGTGGTGG--TGCA-T-------GC------CTGTAATCCCAGCTACT-CA------AGAGGCTGAGGCAGGAGAACC---GCTTGAACCC Alu16 ---T----------------ACAAAA-AATCAG-------------------------------------CTGGGCGTGGTGG--CGGG-C-------AC------CTGTAATCCCAGCTACT-CT------GGAGGCTGAGGCAGGAGAATT---GCTTGAACCT Alu15 ---T----------------ACAAAAA--TTAG-------------------------------------CCGGGCGTGGTGG--TGGG-C-------GC------CTGTAATCCCAGCTACT-TG------GGAGGCTGAGGCAG-AGAATT---GCTTGAACCC Alu06 ---T----------------ACAAAAA--TTAG-------------------------------------CTGGGTGTGGCGG--TGCA-T-------GC------CTGTAATCCCAGCTACT-CA------GGAGGCTGAGGCAGGAGAATC---ACTTGAACCT Alu08 ---T----------------ACAAAAA--TTAG-------------------------------------CTGGGCATGCTGG--CACA-TAATCCCAGC------ATGTAATCCCAGTTACT-CG------GGAGGCTAAGGCAGGAGAATC---ACTTGAACCT Alu19 ---T----------------ACAAAAA--TTAG-------------------------------------CCATGCGTGGTGGCATGCA-T-------GCAC----CTGTAATCCCAGCTACT-TG------GGAGGCTGAGGCAGGAGAATT---GCTTGAACTC Alu18 ---T----------------ACAAACA--TTAG-------------------------------------CTGGGTGTGGTGG--TGGG-C-------AC------CTGTAATCCCAGCTACT-CG------GAAGGCTGAGGCAGGAGAATCATCGCTTGAACCT Alu09 ---C----------------AAAAAAA--TTAG-------------------------------------CGTGGCATGGTGG--TGGG-C-------GC------CTGTAATTCCAGCTACT-TA------GGAGGCTGAGGCAGGAGAATC---GCTTGAACCC Alu10 
---T----------------ACAAAAG--TTAG-------------------------------------CCGGGTGTGGTGG--TGCA-T-------GC------CTGTAATCCCAGCTTCT-TG------GGAGGCCAAGGCACAAGAATC---GCTTGAACCA Alu07 ---ACAAAAAAAAACCA---AAAAAAA--TTAG-------------------------------------CCAGGCTTGGTGG--CGTG-C-------AC------CTGTAGTCCCAGCTACT-TG------GGTGGCTGAGGCAGGAGAATT---GCTTGAACCT Alu04 AAAA----------------AAAAAAA--ATAG-------------------------------------CCAGGCGTTGTGG--TGTA-C-------AC------CTGTAGTCCCAGCTACT-AA------GGAGGCTGAAGTGGGAGGATG---GCTTGAGCCT Alu13 ---T-------------TTTAAAGAAA--TTAG-------------------------------------CTGGGTGTGGTGG--CATGGC-------AC------CTCTAGTCCCAGCTACT-CG------GGAGGCACAGGCAGGAGGATC---ACTTAAGCCT Alu05 ---------------------------------TAAAAATGGGCAAATGGCTGGGCTCTGTGCTGGGTGCCTGTAATCCCAGC--TACT-G-------GG--GAGGTTGAAGCAGGAGAATGGCTTACCCCCGAACCCAGGAGGTGGAGGTTG---CAGTGAGCCA Alu01 ----------------------------------------------------------------------CTGTAGCCCTAGC--TACC-T-------GG--GAGGCTGAGGTGGGAGGATTGCTT------GAGCCCAAAAGTTCAAGGTTA---CAGTCAGCTA Alu17 ----------------------------------------------------------------------CTGGGCACGGTGG--CTCA-T-------GC------CTGTAATCCCAGCACTT-TG------CAACACTGAGGCAGGTGGATT---GCTTGAG--- Alu12 ----------------------------------------------------------------------AGCTGCA--GTGG--TGCA-T-------GC------CTGTAGTCCCAGCGACT-TG------GGAGGCTGAGGTAGGAGAATC---GCTTGAGCCC Alu03 . 
GGGAGGCAGAGGT-------TGCAGTGAGC-----CGAGATCAT-------GCCACTGCACTCCAGTCTGGG-TTACC--GCGTGAGACCCTGTCTC-------------------AAAGAAAA---------------------------------------- Alu11 AGGATGCAGAGGT-------TGCAGTGAAC-----CAAGATTGCACCTTGCGCCACTGCACTCCAGCCTGGG-CGACA--GAGTAAGACTCTGTCTC-------------------AAAGAAAA---------------------------------------- Alu14 AGGAGGCAGAGGT-------TGCAGTGAGC-----TGGGATCAT-------GCCGTTGTACTCCAGCCTGG-GCAACAA-GAGCAAAACTCCATCTC-------------------AAACA--A---------------------------------------- Alu02 GGGAGGTGGAGGT-------TGTGGTGAGC-----CGAGACTGC-------GCCATTGCACTCCAGCCTGG-GCAACAA-GAGTGAAACTCCATTTC-------------------AAAAAGAA---------------------------------------- Alu16 GGGAGGCAGAGGT-------TGCAGTGAGC-----CAAGATCGC-------ACCACTGCACTCTAGCCTGG-GCGAGA--GAGCGAGACTCCATCTC-------------------AAAAAAAA---------------------------------------- Alu15 AGGAGGCGGAGGT-------TGCAGTGAGC-----CAAGATCGT-------GCCACTGCACTCCAGCCTGG-GTGACA--GAGCGAGACTCCGTCTC-------------------AAATAAAA---------------------------------------- Alu06 GGGAGACGGAGGT-------TGCAGTGAGCTGAGCCGAGATTGC-------GCCACTGAACTCCATCCTGG--AGACA--GGGCTAGACCCCGTCTA-------------------AAAAAAAA---------------------------------------- Alu08 GGAAGGCGGAGGT-------TGCAGTGAGC-----CAAGATCAC-------ACCACTGCATTCCAGCCTGG--CTACA--GAGCGAGACTCTGTCTC-------------------CAAAAAAA---------------------------------------- Alu19 AGGAGGCAAAGGT-------TGCAGTGAGC-----AGAGATTGC-------CCCATTGCACTACAGCCTGGG-CCACA--GAGTGAGACTCTGTCTC-------------------AAAAAAAA---------------------------------------- Alu18 GGGAGGTGGAGGC-------TACAGTGAGC-----TGAGATTGC-------GCCATTGCACTCCAGCCTGGG-TGACA-AGAGTGAAACTCCATCTC-------------------AAAAAAAA---------------------------------------- Alu09 GGGAGGCAGAGGT-------TGCAGTGAAC-----CGAGATTGC-------ACCACTGAACTCCAGCCTGGG-CAACA-ACAGCGAAACTCTGTCTC-------------------AAAGAAAA---------------------------------------- Alu10 
GG-AGACGGAGGT-------TTCAGTGAAT-----GAAGATCGT-------GCCATTGTATTCCAGCCTCGG-CAACA--CAGCAGGACT-------------------------------------------------------------------------- Alu07 GGGGGGTGGAGGG-------TGCGGTGAGC-----TGAGATCGT-------GCCACTGCACTCCAGCCTGG--TGACA--GAGTGAGACTCCGCCAA-------------------AAAAAAAA---------------------------------------- Alu04 GTGGGTCAGAGGT-------TGCAGTAAGC-----TGAATTTGC-------ACCACTGTACTCCAGCCTGCG-TGACA--AAGCGAGACCC------------------------------------------------------------------------- Alu13 GGGAGTTCAAGAC-------TGCAGTGAGC-----TCTGATCAT-------GTCACTGCGCTCTGGCCTGAG-TGACA--GAGTGGGACCTTGTCTC-------------------AAACAAA----------------------------------------- Alu05 AGGTCGT-GCAACTGCATTTC-------------------------------------------AGCCTGGG-CGACA--GAGTGAGACTTTGTCTCTA-------------------AATAAA---------------------------------------- Alu01 TGATTGC-AATGCTGCTCTGC-------------------------------------------AGCCAGGG-CAATG--AAGGGAGACCCTGTCTTTA-------------------AAATAA---------------------------------------- Alu17 ----TTT-GAGAC-------C-------------------------------------------AGCCTGGG-CAACA--TAGTGAGACCCTGTCTCTATT----------GAAAAAAAAAAAA---------------------------------------- Alu12 AGGAAATCGAGGC-------T-------------------------------------------AGCCTGGG-CAACA--TAGGGAGATCCTGTCTCTAAAAAAGCAAAAGGAAAATAAACAAATAAGGAGGTGGATGAGTGTCCTTCCAACTCCCAGTCTCTG Alu03

References

1. R. O. Duda, P. E. Hart. Pattern Classification and Scene Analysis. Wiley & Sons, 1973.
2. D.-F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351–360, 1987.
3. O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162:705–708, 1982.
4. J. Hein. A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Molecular Biology and Evolution 6:649–668, 1989.
5. J. Hein. A Tree Reconstruction Method That Is Economical in the Number of Pairwise Comparisons Used. Molecular Biology and Evolution 6:669–684, 1989.
6. J. Hein. Unified Approach to Alignment and Phylogenies. Methods in Enzymology 183:626–645, 1990.
7. J. B. Kruskal and D. Sankoff. An Anthology of Algorithms and Concepts for Sequence Comparison. In: Time Warps, String Edits and Macromolecules: the Theory and Practice of Sequence Comparison. Addison Wesley, 1983.
8. K. Mehlhorn and S. Näher. LEDA, a Platform for Combinatorial and Geometric Computing. Communications of the ACM 38:1, 96–102, 1995.
9. S. B. Needleman, C. D. Wunsch. A general method applicable to the search for similarities in the amino-acid sequence of two proteins. Journal of Molecular Biology 48:443–453, 1970.
10. N. Saitou and M. Nei. The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Molecular Biology and Evolution 4:406–425, 1987.
11. D. Sankoff. Minimal Mutation Trees of sequences. SIAM Journal of Applied Mathematics 28:35–42, 1975.
12. D. Sankoff, R. Cedergren and G. Lapalme. Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA. Journal of Molecular Evolution 7:133–149, 1976.
13. B. Schwikowski and M. Vingron. The Deferred Path Heuristic for the Generalized Tree Alignment Problem. To appear in: Proceedings of the First Annual International Conference on Computational Molecular Biology, ACM 1997.
14. D. L. Swofford and G. J. Olsen. Phylogeny Reconstruction. In: Molecular Systematics. Sinauer, 1990.
15. Willie R. Taylor. A Flexible Method to Align Large Numbers of Biological Sequences. J. Mol. Evol. 28:161–169, 1988.
16. A. K. C. Wong, S. C. Chan and D. K. Y. Chiu. A Multiple Sequence Comparison Method. Bull. Math. Biol. 55:465–486, 1993.
