Nomadic genetic algorithm for multiple sequence alignment (MSANGA)


Int. J. Adaptive and Innovative Systems, Vol. 1, No. 1, 2009

S. Siva Sathya*, S. Kuppuswami and K. Syam Babu
Department of Computer Science, Ramanujan School of Mathematics & Computer Science, Pondicherry University, Puducherry, 605 014, India
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: Genetic algorithms (GA) are adaptive search procedures that try to produce a globally optimum solution for problems with a huge search space. This paper presents a variant of the standard genetic algorithm (SGA) called the nomadic genetic algorithm (NGA), which is based on the concept of ‘birds of the same feather flock together’. NGA is found to maintain the diversity of individuals in the population by intelligently adapting to its environment, and also to converge faster. The objective of this paper is to demonstrate the merits of NGA over SGA for problems with a large search space, such as multiple sequence alignment (MSA) in bioinformatics. NGA was applied to MSA (MSANGA), and the convergence of NGA is compared with that of SGA and the results tabulated. Also, the accuracy of the alignment produced using MSANGA is compared with nine other popular tools for data sets chosen from the standard BAliBASE benchmark alignment suite, illustrating the superiority of NGA over SGA and all other tools in producing quality alignments at a faster rate.

Keywords: genetic algorithm; GA; multiple sequence alignment; MSA; selection; adaptive search; nomadic genetic algorithm; NGA.

Reference to this paper should be made as follows: Siva Sathya, S., Kuppuswami, S. and Syam Babu, K. (2009) ‘Nomadic Genetic Algorithm for multiple sequence alignment (MSANGA)’, Int. J. Adaptive and Innovative Systems, Vol. 1, No. 1, pp.44–59.

Biographical notes: S. Siva Sathya received her BTech and MTech in Computer Science and Engineering from Pondicherry University.
She is a Research Scholar and also holds the position of Senior Lecturer in the Department of Computer Science, Pondicherry University, Puducherry, India. She has published a number of papers in international conferences and her research interests include genetic algorithms, bioinformatics, object oriented system design and grid computing.

S. Kuppuswami obtained his BE and MSc Engg. in Electronics from the University of Madras. He also received his Doctorate in Engineering in Computer Science from the University of Rennes I, France. He worked as a Faculty in Anna University, University of Rennes I, Pondicherry Engineering College and then at Pondicherry University. Currently, he holds the position of Director of Studies, Educational Innovations & Rural Reconstruction in Pondicherry University. His area of research also includes multilingual computing, software engineering, agent technology, network engineering and smart systems. He has published more than 50 research papers in reputed international journals and conferences.

K. Syam Babu received his MTech in Computer Science and Engineering from Pondicherry University. His research interests include genetic algorithms and bioinformatics.

Copyright © 2009 Inderscience Enterprises Ltd.

1 Introduction

A variety of genetic algorithms (GA) suiting different application needs have been proposed in the literature. Still, the basic architecture of any genetic algorithm remains the same; the variants differ mainly in fitness evaluation, selection or the application of genetic operators. This paper describes a variant of the standard genetic algorithm (SGA) called the nomadic genetic algorithm (NGA), which is an adaptive search procedure capable of intelligently adapting to its own group of individuals. It employs most of the principles of SGA, except that it groups the individuals of the population into communities and allows individuals to migrate between them. The selection procedure followed in NGA insists on mating within the same community, thus providing equal chances of mating even to the weakest section of the population. Once the fitness of an individual improves, the individual migrates to a different group that suits its fitness range. This allows the NGA to maintain the diversity of individuals in the population while also ensuring faster convergence.

Multiple sequence alignment (MSA) is a very important problem in molecular biology that is considered to be NP-hard. The complexity of the problem (Wang and Jiang, 1994) increases tremendously with the number and the length of the sequences. An exact solution cannot be guaranteed in practice; the goal is instead to find the best possible solution, and hence MSA falls under the class of optimisation problems. The MSA of nucleotide or protein sequences is critical to the understanding of many diseases and the identification of many drugs. Hence, finding an optimal solution to the MSA problem is crucial to the medical field. A number of algorithms and techniques have been employed to solve the problem.
For instance, the most popular tool, ClustalW (Thompson et al., 1994), employs a progressive search strategy. DCA employs a divide and conquer strategy, and MUSCLE is yet another popular tool. Several kinds of GAs (Zhang and Wong, 1997) have also been used earlier; SAGA is one such tool that employs GA for MSA. Although GAs are good at optimisation problems, the time taken to converge plays a very important role, especially in problems of bioinformatics origin. NGA has previously been applied to the timetable generation problem (Siva Sathya et al., 2007), where it was found to converge very fast compared to SGA. Hence, it has been applied to MSA in this paper and the tool is named MSANGA. The rest of the paper is organised into the following sections. Section 2 gives the definition and applications of MSA along with the previous work done to solve MSA.
Section 3 outlines GA in the perspective of population diversity, convergence and migration factors applicable to NGA and the related work in that area. Section 4 describes the proposed NGA with an architecture and pseudo-code. Section 5 gives the proposed design of MSANGA, Section 6 illustrates the results and statistics and finally Section 7 gives the conclusion followed by an exhaustive list of references in Section 8.

2 Multiple sequence alignment

The wealth of nucleotide and protein sequence information currently available demands better means of data interpretation. This interpretation is made significantly easier when the sequences are viewed in comparison rather than in isolation. MSA of nucleic acid or amino acid sequences plays a central role in the advancement of understanding in molecular biology (Altschul et al., 1990; Krane and Raymer, 2001).

MSA (Carrillo and Lipman, 1988) is the representation of the given sequences in a way that reflects their relationship. If the alignment is correct, each column will contain homologous residues. The definition of homology depends on the criterion used for the alignment. If a given sequence lacks one residue, a gap will be inserted in its place at the corresponding position. Gaps usually take the form of strings of nulls. In an evolutionary context, a null sign means that a residue was inserted in one of the sequences or deleted in the other while the sequences were diverging from their common ancestor.

Consider a family of $k$ ($k \geq 3$) sequences over $\Sigma$, where $\Sigma$ is a set of DNA/protein residues, to be aligned (Horng et al., 2004):

$$X_1 = X_{1,1} X_{1,2} X_{1,3} \ldots X_{1,n_1}, \quad \ldots, \quad X_k = X_{k,1} X_{k,2} X_{k,3} \ldots X_{k,n_k}$$

where $X_{i,j} \in \Sigma$ is the $j$-th element of the $i$-th sequence, $1 \leq j \leq n_i$, $n_i$ is the length of the $i$-th sequence, $1 \leq i \leq k$, and $k$ is the number of sequences. An MSA of these sequences, denoted $X_1 \# X_2 \# X_3 \# \ldots \# X_k$, is given by a $k \times N$ matrix for some $N$ with $\max\{n_1, n_2, n_3, \ldots, n_k\} \leq N \leq \sum_{i=1}^{k} n_i$:

$$\begin{pmatrix} X^*_{1,1} & X^*_{1,2} & \cdots & X^*_{1,N} \\ X^*_{2,1} & X^*_{2,2} & \cdots & X^*_{2,N} \\ \vdots & \vdots & & \vdots \\ X^*_{k,1} & X^*_{k,2} & \cdots & X^*_{k,N} \end{pmatrix}$$

where $X^*_{i,j} \in \Sigma \cup \{-\}$ for all $1 \leq i \leq k$, $1 \leq j \leq N$; for each $i = 1, \ldots, k$, the row $X^*_i = X^*_{i,1} X^*_{i,2} \ldots X^*_{i,N}$ reproduces the sequence $X_i$ when all of its gaps are ignored; the alignment does not contain any column consisting of gaps only; and $N$ is the length of the alignment.

The applicability of MSA ranges from selecting homologues to structure and function prediction to the discovery of evolutionary relationships among various species. It is used in medicine in a number of ways. Some of them are:

• To identify the similarities between different species or different genes.
• To identify the conserved regions between different gene sequences or species.
• To identify the phylogenetic relationship between different species.
• To target the cause of diseases by aligning multiple gene sequences.
• To find new drugs for diseases based on the similarity or dissimilarity between disease causing genes.
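The formal definition of an alignment given earlier in this section can be turned into a small validity check. The following is only an illustrative sketch: the function name `is_valid_alignment`, the use of `-` as the gap symbol and the toy sequences are assumptions made here, not part of the original system.

```python
GAP = "-"  # assumed gap symbol


def is_valid_alignment(sequences, rows):
    """Check the conditions of the MSA definition for a k x N matrix `rows`."""
    if not rows or len(rows) != len(sequences):
        return False
    n = len(rows[0])
    # Every row must have the same length N.
    if any(len(row) != n for row in rows):
        return False
    # Ignoring gaps, row i must reproduce input sequence i.
    if any(row.replace(GAP, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # No column may consist of gaps only.
    if any(all(row[j] == GAP for row in rows) for j in range(n)):
        return False
    # max(n_i) <= N <= sum(n_i)
    lengths = [len(seq) for seq in sequences]
    return max(lengths) <= n <= sum(lengths)


# Hypothetical toy input, for illustration only.
seqs = ["ACGT", "AGT", "CGT"]
print(is_valid_alignment(seqs, ["ACGT", "A-GT", "-CGT"]))
```

The upper bound on N is implied by the no-all-gap-column condition, so the last check is redundant; it is kept only to mirror the definition.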


A number of methods have been proposed in the literature, but each comes with its own limitations on the length and number of sequences that can be aligned in a specific amount of time. Straightforward dynamic programming solves the multiple alignment problem for k sequences of length n in O(n^k) time. For large n and k this is computationally infeasible, and MSA is considered to be an NP-hard problem. The most popular of the proposed methods is progressive alignment (Feng and Doolittle, 1987). The accuracy of progressive alignment depends on the relation between the sequences aligned and the order in which the sequences are aligned. A number of stochastic methods like simulated annealing, Gibbs sampling and GAs (Chellapilla and Foegel, 1999; Isokawa et al., 1996) have been employed to solve MSA. GAs are probably one of the most interesting stochastic optimisation tools available today. SAGA is one such package designed to perform MSA using GA (Notredame and Higgins, 1996). To further enhance the speed and computational efficiency of the algorithm, the use of NGA is suggested in this paper.

3 Genetic algorithms

GAs (Goldberg, 1989) are a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. They are adaptive methods used to solve search and optimisation problems, and they are based on the genetic processes of biological organisms. By mimicking the principles of natural evolution, i.e., ‘survival of the fittest’, GAs are able to evolve solutions to real world problems. The basic steps of a GA are initial population generation, fitness evaluation and breeding, which involves selection and the application of the genetic operators, namely crossover and mutation, to produce new offspring in the next generation. The process is iterated for a fixed number of generations or until convergence. Applying GA to solve an optimisation problem involves the following tasks:
• encode (represent) a solution to the problem as a chromosome (by a bit string)
• formulate the fitness evaluation function
• select or create the appropriate genetic operators (crossover, mutation, selection etc.)
• select run parameters (population size, crossover rate, mutation rate, generation gap, convergence criteria etc.)
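The basic steps listed above can be sketched as a minimal generational GA. This is an illustration only and not the GAMSA implementation: the bit-string encoding, binary tournament selection, single-point crossover, the parameter defaults and the toy fitness function (number of 1 bits) are all assumptions chosen to keep the example short.

```python
import random


def run_sga(fitness, n_bits=20, pop_size=30, generations=50,
            crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Minimal standard GA over bit strings; returns the best individual seen."""
    rng = random.Random(seed)
    # Initial population generation.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop + [best], key=fitness)
    return best


best = run_sga(fitness=sum)  # toy fitness: count of 1 bits
print(sum(best), best)
```

Each of the four tasks maps to one piece of the sketch: the list of bits is the encoding, `fitness` is the evaluation function, `select` together with the crossover and mutation steps are the operators, and the keyword arguments are the run parameters.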

GAs (Holland, 1975, 1992) are considered to be adaptive search procedures that work stochastically to find an optimal solution in a large solution space. Different kinds of selection mechanisms and genetic operators have been employed to guide the random
adaptive procedure to explore all possible solutions. In the case of GAs, achieving diversity (Oei, 1991) is considered to be an important goal to reach a global optimum solution. Some GAs rely primarily on mutation or mutation like mechanisms for diversification (Mahfoud, 1995). Simple GA’s selection mechanism replicates higher fitness solutions and discards lower fitness solutions leading to convergence of the population. For instance, Brindle (1981) has proved the inferior performance of roulette wheel selection on several test functions. Also, Baker (1987) has analysed various fitness proportionate selection methods. A number of mechanisms for restricting the mating of individuals have been proposed earlier (Gorges-Schleuter, 1992). Generally, mating is restricted among similar individuals with the notion that similar parents produce similar offspring which will not produce diversity in the population. Booker (1982) and Goldberg (1989) have explored various approaches in which a mating tag is added to each individual. The tag must match before a cross is permitted. Another type of mating restriction is introduced by Spears (1994) which adds a one dimensional ring topology and restricts mating to neighbours with identical tags. To maintain the diversity of individuals in a population, migration has also been attempted earlier, but with parallel GAs like in Genitor II by Whitley and Starkweather (1990), wherein individuals migrate from one processor to another. According to Tanese (1989a, 1989b), GAs that incorporate migration are reported to produce more population diversity. There is always a trade-off between convergence and diversity in GA. To balance both these aspects, the NGA has been proposed which allows beneficial search as well as controlled convergence.

4 Proposed NGA

NGAs are specialised forms of GAs that work on the principle of ‘birds of the same feather flock together’. Generally, in SGA, different kinds of selection mechanisms like roulette wheel selection, rank based selection, tournament selection, etc. are employed based on the type of application. All these selection mechanisms aim to select highly fit individuals in different proportions for the purpose of mating. Individuals of low fitness are given very little chance of mating, or are totally discarded in some selection schemes. But even the worst individuals, if given a chance, may produce better offspring in the next generation. This phenomenon is given importance in this variant of SGA. Here, the individuals in the population are grouped into different communities based on their fitness value. The size and the number of the groups depend on several factors and are currently an area of research. Individuals in a community mate with each other; here again, different kinds of selection mechanisms could be used within the community. Migration follows mating: if any offspring comes up with a better fitness, it leaves its community and joins a different community, i.e., the group of similar fitness value. This is an instance of the intelligent adaptive behaviour exhibited by NGA. Thus, equal opportunities are given to all individuals in the population, whether they are of high or low fitness. Individuals constantly improve their fitness value and keep migrating through successive generations of the GA until convergence or some other stopping criterion is reached. Since the individuals do not stay in one place and keep migrating from
group to group, the term NGA has been coined. The following is the pseudo code of the NGA:
1 generate the initial population randomly
2 evaluate the fitness of each individual
3 sort the individuals in non-increasing order of their fitness values
4 arrange the population into groups based on their fitness range; then, for each group:
  a select individuals from the group
  b apply crossover/mutation operators
  c evaluate the fitness of the offspring
  d add the offspring to the same group
5 combine all the groups into a single list
6 sort the list in non-increasing order of fitness values and trim the list to the size of the initial population
7 repeat the process from Step 4 for the given number of generations
8 select the best individual (the best individual is the one which gives the best sequence alignment of the whole population).
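The pseudo code above can be sketched in Python as follows. This is a hedged illustration rather than the MSANGA source: the function and parameter names are hypothetical, groups are formed as equal-sized slices of the sorted population (the paper leaves group size and count open), parents are drawn uniformly within a group, and the toy problem (maximising the number of 1 bits) stands in for alignment.

```python
import random


def run_nga(fitness, random_individual, crossover, mutate,
            pop_size=40, n_groups=4, generations=50, seed=0):
    """Sketch of the nomadic GA loop; returns the fittest individual."""
    rng = random.Random(seed)
    # Steps 1-3: random population, evaluated and sorted best-first.
    pop = sorted((random_individual(rng) for _ in range(pop_size)),
                 key=fitness, reverse=True)
    size = -(-pop_size // n_groups)  # ceiling division
    for _ in range(generations):
        # Step 4: slice the sorted list into fitness-ordered groups
        # (assumption: equal-sized groups by rank).
        groups = [pop[i:i + size] for i in range(0, len(pop), size)]
        for group in groups:
            offspring = []
            for _ in range(len(group)):
                # 4a: both parents come from the same community.
                p1, p2 = rng.choice(group), rng.choice(group)
                # 4b-4c: breed; fitness is evaluated when sorting below.
                offspring.append(mutate(crossover(p1, p2, rng), rng))
            group.extend(offspring)  # 4d: children join the parents' group
        # Steps 5-6: merge, sort and trim. Regrouping on the next pass is
        # what moves ("migrates") improved individuals to a new community.
        merged = [ind for group in groups for ind in group]
        pop = sorted(merged, key=fitness, reverse=True)[:pop_size]
    return pop[0]  # Step 8


# Toy problem (assumption): maximise the number of 1 bits in 20 bits.
def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(20)]


def one_point(a, b, rng):
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def flip_bits(ind, rng, rate=0.02):
    return [1 - g if rng.random() < rate else g for g in ind]


best = run_nga(sum, random_bits, one_point, flip_bits)
print(sum(best), best)
```

Because Step 6 keeps the best individuals of parents and offspring together, the best fitness in this sketch never decreases from one generation to the next.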

5 Proposed MSA using NGA (MSANGA)

Figure 1 Architecture of MSANGA
[Flowchart: Read input sequences → Do pairwise alignment → Determine the number of gaps to be inserted in each sequence → Apply NGA → Select the best alignment]

This section details the application of the NGA to MSA. The architecture of MSANGA is shown in Figure 1. It first reads the input sequences in FASTA format from the input file and calculates the number of gaps to be inserted in each sequence by performing a pairwise alignment between every pair of input sequences using the popular global pairwise alignment method (Needleman and Wunsch, 1970). Then it applies NGA and selects the best individual, which gives the best alignment. The representation, fitness evaluation and genetic operators used follow Horng et al. (2004) to some extent, though the implementation procedure varies.
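The pairwise step described above can be sketched with a textbook Needleman-Wunsch implementation. The scoring values (match 1, mismatch -1, linear gap penalty -1), the function names and the toy sequences are assumptions for illustration; the paper does not state the parameters it used. The helper `gaps_to_insert` takes the longest pairwise alignment as the target length, in line with the description in Section 5.1.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def gaps_to_insert(sequences):
    """Target length = longest pairwise alignment; gaps = target - length."""
    target = max(len(needleman_wunsch(x, y)[0])
                 for x, y in combinations(sequences, 2))
    return target, [target - len(s) for s in sequences]


print(gaps_to_insert(["ACGT", "AGT", "CGT"]))  # hypothetical toy input
```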

5.1 Representation
The representation of the individual plays an important role in any GA. Each individual in the population, also termed a chromosome, is a candidate solution to the problem and hence should be represented appropriately. In this case, each chromosome represents an alignment and is encoded as a string of numbers that correspond to the gap positions in the alignment.

Figure 2 Chromosome representation

Figure 2 shows an example of how a chromosome consisting of randomly generated numbers is translated into gap positions in the corresponding sequences. For example, (0,5,2,2,6) means that there is a gap after the 0th, 5th, 2nd, 2nd and 6th residues (character positions) of the first sequence in the corresponding MSA; position 0 denotes a gap before the first residue, and a repeated position denotes consecutive gaps. The number of gaps to be inserted in each sequence is calculated from the pairwise alignments of the input sequences. For instance, in the above example, the maximum length of a sequence in the pairwise library is 11, so the number of gaps to be inserted in each sequence is (11 – length of the sequence without gaps).
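The decoding of gap positions into a gapped row can be sketched as follows. The function name `insert_gaps` and the six-residue sequence `ABCDEF` are hypothetical (the actual sequences of Figure 2 are not reproduced here); only the gap list (0,5,2,2,6) and the target length 11 come from the text.

```python
from collections import Counter


def insert_gaps(sequence, gap_positions, gap="-"):
    """Insert one gap after the p-th residue for every p in gap_positions.

    Position 0 means "before the first residue"; repeated positions
    produce consecutive gaps.
    """
    counts = Counter(gap_positions)
    if any(p < 0 or p > len(sequence) for p in counts):
        raise ValueError("gap position outside 0..len(sequence)")
    out = [gap * counts[0]]
    for idx, residue in enumerate(sequence, start=1):
        out.append(residue)
        out.append(gap * counts[idx])
    return "".join(out)


row = insert_gaps("ABCDEF", (0, 5, 2, 2, 6))
print(row, len(row))  # -AB--CDE-F- 11
```

With a sequence of length 6 and 11 − 6 = 5 gaps, the decoded row has the target length 11, as described above.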

5.2 Fitness evaluation
The sum-of-pairs function (Setubal and Meidanis, 1997) is used to evaluate the fitness of chromosomes. To compute the fitness, a chromosome must first be converted to its alignment form. The sum-of-pairs score (SPS) of a column is defined as the sum of the scores of all pairs of symbols in that column, and the score of an alignment is the sum over all of its columns. If the two symbols of a pair match, a match score is assigned; otherwise a mismatch score is assigned. For nucleotide or DNA sequences, the match and mismatch scores may be as simple as 1 and –1 respectively, or they may be based on a simple identity or transition/transversion matrix that considers the biological nature of the sequences. When protein sequences are aligned, however, fixing the match and mismatch scores becomes more complex, and the scores may be taken from a PAM or BLOSUM matrix.
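A minimal sketch of the sum-of-pairs fitness for DNA-like sequences is given below, using the simple scores of 1 and –1 mentioned above. The treatment of gaps is an assumption, since the text does not specify it: a residue paired with a gap costs `gap_penalty`, and a gap paired with a gap scores 0.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=-1, gap_penalty=-1, gap="-"):
    """Sum the pair scores over every column of an alignment."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == gap and y == gap:
                continue  # assumption: gap-gap pairs score 0
            if x == gap or y == gap:
                total += gap_penalty  # assumption: residue-gap penalty
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["ACGT", "A-GT", "-CGT"]))  # 4
```

For protein sequences, the `match`/`mismatch` branch would be replaced by a lookup in a substitution matrix such as PAM or BLOSUM.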

5.3 Genetic operators
Crossover and mutation are the basic operators applied. The principle of crossover is to exchange information among chromosomes to produce offspring; chromosomes with better performance are produced by preserving the good structures of the parent chromosomes. In a crossover process, two parent chromosomes, denoted X and Y, are randomly selected and used to produce two child chromosomes. Cutting points are then randomly selected in the parent chromosomes. The blocks between the cutting points are called crossover blocks. Four kinds of crossover operators are used in this system. They are:
• singlepoint crossover
• twopoint crossover
• multipoint crossover
• uniform crossover

One of the four operators is selected randomly. The frequency of selection of each operator is controlled by the probability assigned to it; in this system, each crossover operator has an equal probability of selection. Similarly, four kinds of mutation operators have been applied in this system. They are:
• MoveGap
• MergeSpace
• MoveGroupGap
• BypassGap

In the mutation process, an input chromosome is mutated n times by operators randomly selected from the four mutation operators listed here. The frequency of selection of each operator is controlled by the probability assigned to each mutation operator. The value of n is proportional to the length of the number-strings in the chromosome.
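The four crossover operators and their equal-probability selection can be sketched as below. This is an illustration under assumptions: the chromosome is treated as a flat list of integers, the helper names are hypothetical, and multipoint crossover is taken to mean three cutting points. The mutation operators are not sketched because the text names them without defining their behaviour.

```python
import random


def k_point(x, y, rng, k):
    """Exchange alternating blocks between k random cutting points."""
    cuts = sorted(rng.sample(range(1, len(x)), k)) + [len(x)]
    c1, c2, start, swap = [], [], 0, False
    for cut in cuts:
        a, b = (y, x) if swap else (x, y)
        c1 += a[start:cut]
        c2 += b[start:cut]
        start, swap = cut, not swap
    return c1, c2


def uniform(x, y, rng):
    """Swap each gene between the two children with probability 0.5."""
    c1, c2 = [], []
    for a, b in zip(x, y):
        if rng.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2


OPERATORS = (
    lambda x, y, rng: k_point(x, y, rng, 1),  # singlepoint
    lambda x, y, rng: k_point(x, y, rng, 2),  # twopoint
    lambda x, y, rng: k_point(x, y, rng, 3),  # multipoint (assumed k = 3)
    uniform,
)


def crossover(x, y, rng):
    """Pick one of the four operators with equal probability and apply it."""
    return rng.choice(OPERATORS)(x, y, rng)


rng = random.Random(0)
print(crossover([0, 5, 2, 2, 6, 1, 3, 4], [1, 1, 4, 0, 2, 6, 6, 3], rng))
```

All four operators keep each gene at its original index, so a gap position never moves from one sequence's segment of the chromosome to another's.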

6 Results and statistics

This section shows the results of applying the NGA to solve the problem of MSA. To evaluate the performance of the proposed system, it has been compared with nine existing popular MSA programs, namely T-COFFEE, ClustalW, MUSCLE, DCA, DIALIGN, MultAlin, ClustalX, MAFFT and GAMSA.

ClustalW, which uses a progressive alignment algorithm, is a general purpose MSA program for DNA or proteins. It produces biologically meaningful MSAs of divergent sequences, but its drawback is the enormous amount of time it takes for long sequences. Also, it follows the ‘once a gap, always a gap’ policy.

MUSCLE is a program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using k-mer counting, progressive alignment using a new profile function, i.e., the log-expectation score, and refinement using tree-dependent restricted partitioning.

DIALIGN is a software program for multiple alignment developed by Burkhard Morgenstern et al. DIALIGN constructs pairwise and multiple alignments by comparing whole segments of the sequences. No gap penalty is used. This approach is especially efficient where sequences are not globally related but share only local similarities, as is the case with genomic DNA and with many protein families.

MultAlin performs a progressive multiple alignment for a set of sequences. Pairwise distances between sequences are computed after pairwise alignment with the Gonnet scoring matrix and then by counting the proportion of sites at which each pair of sequences differs (ignoring gaps). The guide tree is calculated by the neighbour-joining method assuming equal variance and independence of evolutionary distance estimates.

MAFFT includes two novel techniques:
1 Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue.
2 A simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.
Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT.

Divide and conquer multiple sequence alignment (DCA) is a program for producing fast, high quality simultaneous MSAs of amino acid, RNA, or DNA sequences. The program is based on the DCA algorithm, a heuristic approach to sum-of-pairs (SP) optimal alignment. ClustalX is a variation of the ClustalW MSA program with a graphical user interface. The display colours allow conserved features to be highlighted for easy viewing in the alignment. GAMSA is our own MSA using the SGA.

6.1 Input sequence data
The sample data sets have been taken from the standard BAliBASE database, which is a publicly available suite of alignment benchmarks. Since each tool has its own format for input and output, the input sequences have to be converted to an appropriate format. Whatever the source and format of the input sequences, they are converted to a uniform format, in this case the FASTA format. Hence, a module to convert sequences from any format to FASTA format has been developed.

6.2 BAliBASE testing
The BAliBASE SP-score calculation (Thompson et al., 1999) takes the input alignment and the reference alignment in MSF format, so the conversion of MSA program output to MSF has been done under this module. BAliBASE SP-scores of various alignment tools for various data sets are shown in Table 1. The SP score is calculated as follows: given a candidate alignment (individual) of $N$ sequences containing $M$ columns, the $i$th column in the alignment is designated by $A_{i1}, A_{i2}, \ldots, A_{iN}$. For each pair of residues $A_{ij}$ and $A_{ik}$, $p_{ijk} = 1$ if residues $A_{ij}$ and $A_{ik}$ from the candidate alignment are aligned with each other in the reference alignment; otherwise, $p_{ijk} = 0$. The score for the $i$th column is thus:

$$S_i = \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq j}}^{N} p_{ijk}$$

The overall SPS for the candidate alignment is:

$$\mathrm{SPS} = \frac{\sum_{i=1}^{M} S_i}{\sum_{i=1}^{M_r} S_{ri}}$$

where $M_r$ is the number of columns in the reference alignment and $S_{ri}$ is the score $S_i$ for the $i$th column in the reference alignment. The range of SPS is 0.0–1.0, where higher values indicate closer resemblance with the BAliBASE reference alignment. The MSAs constructed from this system and from other programs are almost identical. With respect to the BAliBASE SP-score, the performance of the system is better than other existing sequence alignment tools. The performance of the system is shown in Figure 3, which corresponds to the values in Table 1. The better values of MSANGA compared to other tools are highlighted in Table 1.

Figure 3 Comparison of MSANGA with other MSA programs (see online version for colours)
[Bar chart ‘Comparison of various MSA Programs’: x-axis, MSA Programs (T-COFFEE, ClustalW, MUSCLE, DCA, DIALIGN, MultAlin, ClustalX, MAFFT, GAMSA, MSANGA); y-axis, SP-Scores (0 to 0.6); series, data sets 1aho, 1hpi, 1tvxA, 1plc, 2mhr, 3cyr, 9rnt]

Table 1 Comparison of the SP scores of MSANGA with other popular MSA tools
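The SPS defined in Section 6.2 can be computed with a short sketch such as the one below. It is not the BAliBASE scoring program: the function names and toy alignments are hypothetical, and rows are assumed to be given in the same sequence order in both alignments. Each residue is identified by its sequence number and its index within the ungapped sequence, so p_ijk = 1 corresponds to a residue pair that appears in both alignments.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Set of residue pairs that share a column in the alignment."""
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for seq_id, symbol in enumerate(column):
            if symbol != gap:
                present.append((seq_id, counters[seq_id]))
                counters[seq_id] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(candidate, reference):
    """Fraction of reference residue pairs reproduced by the candidate.

    The formula counts ordered pairs (j != k); counting unordered pairs
    halves numerator and denominator alike, so the ratio is unchanged.
    """
    ref = aligned_pairs(reference)
    if not ref:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(candidate) & ref) / len(ref)


reference = ["ACGT", "A-GT", "-CGT"]   # hypothetical toy alignments
candidate = ["ACGT", "AGT-", "CGT-"]
print(sps(candidate, reference))  # 0.375
```

In the toy example the reference contains 8 aligned residue pairs, of which the candidate reproduces 3.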

6.3 Comparison of NGA and SGA
In the previous section, MSANGA was compared with other tools for the accuracy of the results in terms of SP scores. To compare NGA with SGA in terms of accuracy and rate of convergence, Table 2 and Table 3 are provided. Table 2 shows the SP scores obtained by both NGA and SGA for 16 different BAliBASE data sets, which allows the quality of the alignments obtained by SGA and NGA to be compared. The SP scores for which NGA gives better performance are marked with an asterisk. Even in cases where SGA shows better scores, the NGA scores do not deviate drastically from them, and hence the scores of both are of acceptable quality. This is shown in Figure 4.

Table 2 SP scores of NGA and SGA for different data sets

DataSet   NGA      SGA
1aho      0.106*   0.1
1hpi      0.126    0.136
1tvxA     0.076    0.102
2mhr      0.33     0.365
3cyr      0.156*   0.145
9rnt      0.189*   0.171
2fxb      0.246*   0.235
1ycc      0.053*   0.033
1tgxA     0.237    0.24
1ar5A     0.116    0.181
1ad2      0.049    0.062
1pgtA     0.171*   0.117
1zin      0.111    0.121
1led      0.067    0.092
5ptp      0.088    0.088
1amk      0.166    0.183

Figure 4 Comparison of alignment quality of NGA and SGA (see online version for colours)

In order to demonstrate the efficiency of NGA with respect to the time of convergence, the number of generations has been taken into consideration. Table 3 gives the number of generations needed for convergence of NGA and SGA. The graph in Figure 5, which plots the values in Table 3, shows that NGA converges in fewer generations than SGA for most of the data sets.

Table 3 No. of generations to converge for NGA and SGA

DataSet   NGA   SGA
1aho      99    219
1hpi      84    47
1tvxA     29    140
2mhr      141   203
3cyr      53    41
9rnt      70    170
2fxb      94    98
1ycc      24    205
1tgxA     246   209
1ar5A     7     6
1ad2      15    144
1pgtA     196   38
1zin      105   234
1led      118   217
5ptp      57    54
1amk      52    169

Figure 5 Comparison of convergence rate of NGA and SGA (see online version for colours)

For some data sets, such as those given in Table 4, NGA gave better results than SGA in terms of both SP scores and time of convergence. Since no alignment program or tool is best for all kinds of sequences, the selection of a program depends on the nature of the sequences to be aligned. For some data sets, though SGA gave better SP scores than NGA, the NGA score does not deviate much from the best result. Overall, the results show that, for the given BAliBASE data sets, the quality of alignment of NGA measured through the SP scores is comparable to and at times better than that of SGA, while the convergence rate of NGA is far better than that of SGA in most of the cases.

Table 4 Datasets for which NGA gave better performance in terms of SP scores as well as rate of convergence

DataSet   NGA SP-score   NGA no. of gen.   SGA SP-score   SGA no. of gen.
1ycc      0.053          24                0.033          205
9rnt      0.189          70                0.171          170
2fxb      0.246          94                0.235          98
1aho      0.106          99                0.100          219
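As a cross-check, the rows of Table 4 can be recomputed from the values in Tables 2 and 3. The sketch below only re-uses the numbers reported in those tables; the variable names are arbitrary.

```python
# (NGA, SGA) SP scores from Table 2.
SP = {
    "1aho": (0.106, 0.1), "1hpi": (0.126, 0.136), "1tvxA": (0.076, 0.102),
    "2mhr": (0.33, 0.365), "3cyr": (0.156, 0.145), "9rnt": (0.189, 0.171),
    "2fxb": (0.246, 0.235), "1ycc": (0.053, 0.033), "1tgxA": (0.237, 0.24),
    "1ar5A": (0.116, 0.181), "1ad2": (0.049, 0.062), "1pgtA": (0.171, 0.117),
    "1zin": (0.111, 0.121), "1led": (0.067, 0.092), "5ptp": (0.088, 0.088),
    "1amk": (0.166, 0.183),
}
# (NGA, SGA) generations to converge from Table 3.
GEN = {
    "1aho": (99, 219), "1hpi": (84, 47), "1tvxA": (29, 140),
    "2mhr": (141, 203), "3cyr": (53, 41), "9rnt": (70, 170),
    "2fxb": (94, 98), "1ycc": (24, 205), "1tgxA": (246, 209),
    "1ar5A": (7, 6), "1ad2": (15, 144), "1pgtA": (196, 38),
    "1zin": (105, 234), "1led": (118, 217), "5ptp": (57, 54),
    "1amk": (52, 169),
}

better_sp = {d for d, (nga, sga) in SP.items() if nga > sga}
fewer_gen = {d for d, (nga, sga) in GEN.items() if nga < sga}
both = sorted(better_sp & fewer_gen)

print(len(better_sp), "of", len(SP), "data sets: NGA has the higher SP score")
print(len(fewer_gen), "of", len(GEN), "data sets: NGA needs fewer generations")
print("better on both:", both)
```

On these values, NGA has the higher SP score on 6 of the 16 data sets and needs fewer generations on 10 of them; the four data sets on which it is better on both counts are exactly those listed in Table 4.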

7 Conclusions

This paper has two goals: the first is to demonstrate the efficiency of NGA with respect to SGA, and the second is to solve the MSA problem with NGA and compare the result with popular tools for MSA. For validation, MSANGA has been compared with the existing popular MSA tools, namely T-COFFEE, ClustalW, MUSCLE, DCA, DIALIGN, MultAlin, ClustalX, MAFFT and GAMSA. The data sets have been taken from the BAliBASE database, which is a publicly available suite of alignment benchmarks. The results are compared in terms of the quality of alignment and the rate of convergence, and are found to be better than those of SGA and other existing tools in most of the cases. From the results obtained, it is concluded that NGAs converge faster while also preserving the diversity of individuals by giving equal chances of mating to every individual in the population. NGA is very versatile and can be applied to any problem that can be solved by SGA. As NGA lends itself easily to parallelism, this could be exploited for further performance enhancement on problems that have a huge search space. Currently, the effect of various GA parameters is being tested on NGA and compared with SGA.

References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) ‘Basic local alignment search tool’, Journal of Molecular Biology, Vol. 215, pp.403–410.
Baker, J.E. (1987) ‘Reducing bias and inefficiency in the selection algorithm’, Genetic Algorithm and their applications, Proceedings of the Second International Conference on Genetic Algorithms, pp.14–21.
Booker, L.B. (1982) ‘Intelligent behaviour as an adaptation to the task environment’, Doctoral dissertation, University of Michigan.
Brindle, A. (1981) Genetic Algorithms for function optimization, unpublished Doctoral Dissertation, University of Alberta, Edmonton.
Carrillo, H. and Lipman, D.J. (1988) ‘The multiple sequence alignment problem in biology’, SIAM J. Appl. Math., Vol. 48, pp.1073–1082.
Chellapilla, K. and Foegel, G.B. (1999) ‘Multiple sequence alignment using evolutionary programming’, Proceedings of the 1999 Congress on Evolutionary Computation, Washington D.C., pp.445–452.
Feng, D.F. and Doolittle, R.F. (1987) ‘Progressive sequence alignment as a prerequisite to correct phylogenetic trees’, J. Mol. Evolution, Vol. 25, pp.351–360.
Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley.
Gorges-Schleuter, M. (1992) ‘Comparison of local mating strategies in massively parallel genetic algorithms’, in R. Manner and B. Manderick (Eds.): Parallel Problem Solving from Nature, Elsevier, Amsterdam, Vol. 2, pp.553–562.
Holland, J.H. (1975) Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.
Holland, J.H. (1992) Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA.
Horng, J-T., Lin, C-M., Yang, B-H. and Kao, C-Y. (2004) ‘A genetic algorithm for multiple sequence alignment’, Soft Computing: A Fusion of Foundations, Methodologies and Applications, Springer, Vol. 9, No. 6, pp.407–420.
Isokawa, M., Wayama, M. and Shimizu, T. (1996) ‘Multiple sequence alignment using a genetic algorithm’, Proceedings of the Seventh Workshop on Genome Informatics, Vol. 7, pp.176–177.
Krane, D.E. and Raymer, M.L. (2001) Fundamental Concepts of Bioinformatics, Benjamin Cummings, New York, USA.
Mahfoud, S.W. (1995) ‘Niching methods for genetic algorithms’, PhD Thesis, University of Illinois, Urbana-Champaign.
Needleman, S.B. and Wunsch, C.D. (1970) ‘A general method applicable to the search for similarities in the amino acid sequence of two proteins’, J. Mol. Biol., Vol. 48, pp.443–453.
Notredame, C. and Higgins, D.G. (1996) ‘SAGA: sequence alignment by genetic algorithm’, Nucleic Acids Research, Vol. 24, No. 8, pp.1515–1524.
Oei, C.K., Goldberg, D.E. and Chang, S.J. (1991) ‘Tournament selection, niching and the preservation of diversity’, IlliGAL Report, University of Illinois, Illinois Genetic Algorithms Laboratory.
Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology, PWS Publishing Company.
Siva Sathya, S., Kuppuswami, S. and Rajashekar, K. (2007) ‘Nomadic genetic algorithm for course time tabling problem’, Proceedings of the International Conference on Science Technology and Management (CISTM 07), Hyderabad, India.
Spears, W.M. (1994) ‘Simple subpopulation schemes’, Proceedings of the Third Annual Conference on Evolutionary Programming, pp.296–307.
Tanese, R. (1989a) ‘Distributed genetic algorithm’, Proceedings of the Third International Conference on Genetic Algorithms, pp.434–439.
Tanese, R. (1989b) ‘Parallel genetic algorithm for a hypercube’, Genetic algorithms and their applications, Proceedings of the Second International Conference on Genetic Algorithms, pp.177–183.
Thompson, J., Higgins, D. and Gibson, T. (1994) ‘CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice’, Nucleic Acids Res., Vol. 22, pp.4673–4690.
Thompson, J.D., Plewniak, F. and Poch, O. (1999) ‘BaliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs’, Bioinformatics, Vol. 15, pp.87–88.
Wang, L. and Jiang, T. (1994) ‘On the complexity of multiple sequence alignment’, Journal of Computational Biology, Vol. 1, No. 4, pp.337–348.
Whitley, D. and Starkweather, T. (1990) ‘GENITOR II: a distributed genetic algorithm’, Journal of Experimental and Theoretical Artificial Intelligence, Vol. 2, pp.189–214.
Zhang, C. and Wong, A.K. (1997) ‘Genetic algorithm for multiple molecular sequence alignment’, Comput. Appl. Biosci., Vol. 13, No. 6, pp.565–581.

Websites
www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/
www.drive5.com/muscle/
ftp://ftp-igbmc.u-strasbg.fr/pub/BaliBASE/
bibiserv.techfak.uni-bielefeld.de/dca/
cbrg.inf.ethz.ch/Server/MultAlign.html
www.rna.icmb.utexas.edu/linxs/seq-info/alignments.html