September 22, 2006

21:28

MULTI-OBJECTIVE EVOLUTIONARY APPROACH TO AB INITIO PROTEIN TERTIARY STRUCTURE PREDICTION

T. W. DE LIMA∗, P. H. R. GABRIEL†, A. C. B. DELBEM
Institute of Mathematic and Computer Sciences, 400 Trabalhador Sao Carlense Avenue, Sao Carlos, Brazil
phone: 55 1633738167
E-mail: [email protected], [email protected], [email protected]

R. A. FACCIOLI AND I. N. DA SILVA
Sao Carlos Engineering School, 400 Trabalhador Sao Carlense Avenue, Sao Carlos, Brazil
phone: 55 1633739367
E-mail: [email protected], [email protected]

The Protein Structure Prediction (PSP) problem aims at determining the tertiary structure of a protein from its amino acid sequence. PSP is a computationally open problem, and several methodologies have been investigated to solve it. Two main strategies have been employed: homology and Ab initio prediction. This paper presents a Multi-Objective Evolutionary Algorithm (MOEA) for the PSP problem using an Ab initio approach. The proposed MOEA uses dihedral angles and the main angles of the lateral chains to model a protein structure. This article investigates the advantages of the multi-objective evolutionary approach and discusses methods and other approaches to the PSP problem.

Keywords: Computational Biology, Bioinformatics, Protein Design, Protein Structure Prediction, Evolutionary Algorithm

∗ T. W. de Lima will be presenting this full paper.
† Financially supported by FAPESP.

1. Introduction

The post-genomic era has been characterized by two different scenarios. On the one hand, the huge amount of biological data available all over the world requires suitable tools and methods both for modeling biological processes and for analyzing biological sequences. On the other hand, many new computational models and paradigms, inspired by and developed as metaphors of biological systems, are ready to be applied in the context of computer science [1].

The interest in discovering a methodology for solving the Protein Structure Prediction (PSP) problem extends into many fields of research, including biochemistry, medicine, biology, engineering and other scientific disciplines. Native protein structures have been determined using X-ray crystallography and nuclear magnetic resonance [2]. The latter is restricted to small proteins, while the former requires a great amount of costly laboratory processing. Computational approaches for PSP, in turn, range from empirical research to mathematical modeling of protein potential energy.

One algorithmic strategy for PSP uses information from protein homology to guide the search process. Despite the relevant results obtained with such strategies, algorithms based on protein homology depend heavily on the set of proteins whose native structure is known. This set is far smaller than the universe of proteins. Ab initio PSP, on the other hand, does not depend on previous knowledge of protein structures. This problem is one of the most important unsolved problems in molecular biophysics [3].

At first glance, the problem may not seem complex: given an exact formulation of the physical environment within a cell, where proteins fold, it should be possible to mimic the natural folding process by computing the molecular dynamics from our knowledge of the physical laws [4]. Nevertheless, we do not completely understand the driving forces involved in protein folding. Perturbations in the potential energy landscape may result in a different folding pathway, generating a different folded structure.
Because results on determining the tertiary structure of proteins from amino acid sequences remain insufficient, different computational algorithms have been investigated to solve this problem. Among these algorithms, Evolutionary Algorithms (EAs) have presented relevant results [5, 6, 7]. EAs are powerful optimization tools inspired by natural evolution, and they have been applied to many complex problems in the most diverse areas of human knowledge [8].

The development of effective computational tools is fundamental for dealing with the computational complexity of PSP. Moreover, the development of these tools may guide part of the future scientific effort in molecular biology. Nevertheless, there has been no computational breakthrough for accurately forecasting the final folded state of a protein. The growth of interest in "non-conventional computer algorithms" applied to biological domains relies on the so-called "data revolution". Among these algorithms, EAs have presented relevant results [9, 10].

2. Evolutionary computation and evolutionary algorithms

Evolutionary computation studies algorithms inspired by evolutionary theory [11]. Genetic algorithms (GAs) are the best-known EAs, and the approach proposed in this paper is based on them. GAs are search techniques based on the mechanism of natural selection [11]; they simulate the processes of natural evolution in order to solve optimization problems. GAs are robust and efficient for optimization problems with very large search spaces. Another important characteristic of GAs is that they provide a set of solutions, named a population, instead of a single solution. A GA is an optimization algorithm capable of solving single-objective or multi-objective problems. Multi-Objective Optimization Problems (MOOPs) have a solution set that represents a trade-off among the objectives [12]. The next section discusses multi-objective EAs.

2.1. Multiobjective EAs

Several aspects can be relevant when evaluating a solution to a problem. If no aspect can be ranked above the others, one solution dominates another only if it is no worse in every aspect and strictly better in at least one. The entire set of non-dominated solutions can be represented by a curve in the Cartesian space, named the Pareto Front [12]. The first implementation of a Multi-Objective Evolutionary Algorithm (MOEA) was proposed by Schaffer in 1985 [13]. It modified the conventional GA so that each objective was evaluated independently. However, this approach cannot obtain adequate diversity among the Pareto Front solutions. The main difference between MOEAs and traditional EAs is the selection operator, since solutions must be compared according to Pareto dominance. The most efficient algorithms are SPEA [14], PAES [14] and NSGA-II. The proposed approach for PSP employs NSGA-II (see Section 2.2).
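
The dominance relation described in this section can be illustrated with a minimal Python sketch. It assumes that all objectives are minimized and uses the usual definition of Pareto dominance (no worse in every objective, strictly better in at least one); the function names `dominates` and `non_dominated` are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a Pareto-dominates b (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Five candidate solutions evaluated on two objectives (f1, f2).
candidates = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
pareto_set = non_dominated(candidates)
```

In this small example, (3.0, 4.0) is dominated by (2.0, 3.0) and (5.0, 5.0) is dominated by (1.0, 5.0), so only the first three candidates remain in `pareto_set`.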

2.2. NSGA-II

The basic idea behind NSGA-II is the ranking process performed before the selection operation. This process identifies the non-dominated solutions in the population (Pi) and groups them into non-dominated fronts (Figure 1) that approximate the Pareto Front. Afterward, the usual GA operators (selection, crossover, mutation) are applied to generate new solutions (the offspring, Qi). Pi and Qi are merged into a set named Ri. Then, Ri is ranked by a non-dominated ranking procedure [14].

Figure 1. Illustration of the generation of Pareto Fronts (F1, F2 and F3) for the minimization of objectives f1 and f2.

In the ranking procedure, the non-dominated individuals in the current population are identified first. These individuals are assumed to constitute the first non-dominated front (F1 in Figure 1) and receive a small dummy fitness value [15]. Afterwards, the individuals of the first front are ignored and the remaining population is processed in the same way to identify the individuals of the second non-dominated front (F2 in Figure 1). This process continues until the whole population is classified into non-dominated fronts. A new population (Pi+1) is obtained from the first N non-dominated individuals of Ri; the remaining individuals are rejected. This process is represented in Figure 2 [14]. The individuals in the first front represent the Pareto Front solution.

Figure 2. NSGA-II design. [Diagram labels: Pi, Qi, Ri, non-dominated sorting, F1, F2, F3, crowding distance, Pi+1, rejected, crossover, mutation.]

NSGA was first proposed with a stochastic remainder proportional selection (SRS) procedure. However, any other selection technique, such as roulette wheel or tournament, can be used [11]. As individuals in the first front have a smaller fitness value, they always get more copies than the remaining individuals. This process emphasizes the exploration of non-dominated regions of the search space [15].
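
The ranking procedure described above (identify the non-dominated individuals, set them aside as a front, repeat on the rest) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the NSGA-II reference implementation: it assumes minimization, uses a straightforward repeated-peeling loop, and truncates the ranked list to N individuals without the crowding-distance tie-break that appears in Figure 2. The names `sort_into_fronts` and `next_population` are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance for minimization."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def sort_into_fronts(objs: List[Sequence[float]]) -> List[List[int]]:
    """Peel off successive non-dominated fronts; each front is a list of indices."""
    remaining = list(range(len(objs)))
    fronts: List[List[int]] = []
    while remaining:
        # Individuals not dominated by anyone still in the pool form the next front.
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


def next_population(objs: List[Sequence[float]], n: int) -> List[int]:
    """Keep the first n individuals of the merged set R_i, in front order."""
    ranked = [i for front in sort_into_fronts(objs) for i in front]
    return ranked[:n]


# Objective vectors (f1, f2) of a merged set R_i with six individuals.
r_i = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 2.0), (6.0, 6.0)]
fronts = sort_into_fronts(r_i)
survivors = next_population(r_i, 4)
```

Here the first front holds individuals 0, 1 and 2, the second holds 3 and 4, and the third holds 5; with N = 4 the survivors are the whole first front plus the first member of the second front.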

3. Protein Tertiary Structure Prediction

Proteins are macromolecules built from 20 basic units, named amino acids. All amino acids share the same generic chemical structure: a central carbon atom (Cα) attached to a hydrogen atom, an amino group (NH2), a carboxyl group (COOH) and a lateral chain or residue (R), which distinguishes one amino acid from the others. Every residue is assigned a 3-letter or a 1-letter code. During the DNA transcription and translation phases, proteins are assembled through peptide bonds, in which the carboxyl group of one amino acid is joined with the amino group of another, releasing water. In this way, a protein can be described as a polypeptide formed by a backbone (the sequence of peptide bonds) and a side chain (the sequence of residues) [9].

The structure of a protein is represented hierarchically by three levels of structural description. The primary structure is the sequence of amino acids in the polypeptide chain, which can be described as a string over a finite alphabet. The secondary structure refers to the set of local conformation motifs of the protein and outlines the path followed by the backbone in space. The tertiary structure is the most important description level and the main objective of experimental and prediction efforts. It describes the three-dimensional organization of the atoms of the polypeptide chain (both backbone and side-chain atoms). The process by which the tertiary structure forms is called folding. Several physical properties define this process [16] (see Section 3.1):

• Rigidity of the backbone of the sequence;
• Interactions between amino acids, including electrostatic interactions;
• van der Waals forces;
• Volume constraints;
• Hydrogen or disulfide bonds;
• Interactions of amino acids with water.

Because the folding process of a protein is difficult to understand, this problem has been modeled as an optimization process. Different computational strategies have been investigated to solve it: homology [2], threading [2], Ab initio [17] and Semi Ab initio [18]. We focus on the Ab initio approach (see Section 3.2). The dihedral angles φ and ψ determine the protein fold. Unfortunately, there is a large number of degrees of freedom, i.e. φ and ψ angles, which makes the Ab initio approach computationally very complex.

3.1. The folding problem

Protein chains are subject to a folding process. Starting from the primary structure, they are capable of organizing themselves into a unique, stable (native) three-dimensional conformation that is responsible for their biological functions [9]. The task of searching through all the possible conformations of a polypeptide chain to find those with low energy is very complex and requires enormous amounts of computing time. Moreover, the energy difference between a stable folded molecule and its unfolded state is very small [19].

The folding problem is one of the most challenging open problems in structural genomics. The number of proteins for which the sequences are known is about half a million [20]. On the other hand, the Protein Data Bank (PDB) has only 45 thousand structures [1]. Setting experimental difficulties aside, this impressive difference is largely due to our lack of a comprehensive theory of the folding process [9].

3.2. Ab initio

Ab initio structure prediction aims at predicting a protein structure from its amino acid sequence. It is generally assumed that a protein sequence folds to a native conformation, or an ensemble of conformations, that is near the global free-energy minimum [17].

In the Ab initio approach, no homology between proteins is employed. Ab initio prediction is more challenging than homology modeling or threading. Moreover, it is the only way to derive a prediction when no similar fold is known.

4. Proposed Approach

An important task when proposing a search procedure for PSP is to define a good representation of the conformations and a cost function for evaluating them. These aspects are discussed in the sequel.

4.1. Representation of the conformation

A few conformation representations are commonly used:

(1) all-atom three-dimensional coordinates;
(2) all-heavy-atom coordinates;
(3) backbone atom coordinates and side-chain centroid;
(4) Cα coordinates;
(5) backbone and side-chain torsion angles; and
(6) lattice models.

In our approach we chose the backbone and side-chain torsion angle representation, because each residue type requires a fixed number of torsion angles to determine the three-dimensional coordinates of all its atoms. The bond lengths and bond angles are kept at their ideal values, and the dihedral angle ω is fixed at 180°. Thus, to represent a solution (chromosome) we need the two backbone torsion angles (φ and ψ) and the side-chain torsion angles (χi), i = 0, ..., 4, depending on the residue type.

4.2. Cost Function

Evaluating a molecular structure requires a cost or energy function. Quantum mechanics produces the most adequate energy functions. However, they are too computationally complex to be employed in modeling larger systems. Thus, the proposed approach uses energy functions obtained from classical physics. Named potential energy functions or force fields, these functions return an energy value based on the molecule conformation. They indicate which molecule conformations are better or worse: the lower the energy value, the better the conformation.

The most typical potential energy functions have the form [29]:

\[
\mathrm{Energy}(R) = \sum_{\text{bonds}} B(R) + \sum_{\text{angles}} A(R) + \sum_{\text{torsions}} T(R) + \sum_{\text{non-bonded}} N(R) \tag{1}
\]

where R is the vector representing the molecule conformation, typically in Cartesian coordinates or torsion angles. The literature on cost functions is enormous [28, 29]. To evaluate a protein conformation, the proposed approach uses the TINKER (Software Tools for Molecular Design) energy functions with the CHARMM (Chemistry at HARvard Macromolecular Mechanics) parameters, version 27. The resulting energy is a composite sum of several molecular mechanics functions that can be grouped into two major types: bonded (stretching, bending, torsion, Urey-Bradley and impropers) and non-bonded (van der Waals and electrostatic). The Tinker energy function has the form:

\[
\begin{aligned}
E_{\mathrm{TINKER}} = {} & \sum_{\text{bonds}} k_r (r - r_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{UB}} k_{urey} (s - s_0)^2 \\
& + \sum_{\text{torsions}} \sum_{n} V_n \left(1 + \cos(n\phi - \gamma_n)\right) + \sum_{\text{impropers}} k_{improper} (\omega - \omega_0)^2 \\
& + \sum_{\text{non-bond}} \varepsilon_{i,j} \left[ \left(\frac{R_i + R_j}{r_{i,j}}\right)^{12} - 2 \left(\frac{R_i + R_j}{r_{i,j}}\right)^{6} \right] + \frac{q_i q_j}{D\, r_{i,j}}
\end{aligned}
\]

where (i) r is the bond length, r0 is the equilibrium bond length and kr is the bond energy constant; (ii) θ is the bond angle, θ0 is the equilibrium bond angle and kθ is the valence angle energy constant; (iii) s is the distance between two atoms separated by two covalent bonds (1-3 distance), s0 is the equilibrium distance and kurey is the Urey-Bradley energy constant; (iv) φ is the dihedral or torsion angle, Vn is the dihedral energy constant, n is the multiplicity and γn is the phase angle; (v) ω is the improper angle, ω0 is the equilibrium improper angle and kimproper is the improper energy constant; (vi) εi,j is the Lennard-Jones well depth, ri,j is the distance between atoms i and j, Ri and Rj are the van der Waals radii of atoms i and j, qi and qj are the partial atomic charges of atoms i and j, and D is the dielectric constant.
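
To make the individual terms of the energy function concrete, the following Python sketch evaluates each kind of term for a single interaction. It is a hedged illustration that follows the formula as written in the text, not the Tinker routines themselves: the function names are hypothetical, any unit conversion factor for the electrostatic term is omitted because the formula above does not show one, and the default dielectric constant of 4.0 is the value the paper reports for its experiments.

```python
import math


def harmonic(k: float, x: float, x0: float) -> float:
    """Harmonic term k (x - x0)^2, used for bonds, angles, Urey-Bradley and impropers."""
    return k * (x - x0) ** 2


def torsion(v_n: float, n: int, phi: float, gamma_n: float) -> float:
    """Dihedral term V_n (1 + cos(n phi - gamma_n)); angles in radians."""
    return v_n * (1.0 + math.cos(n * phi - gamma_n))


def lennard_jones(eps_ij: float, r_i: float, r_j: float, r_ij: float) -> float:
    """van der Waals term eps [ (R/r)^12 - 2 (R/r)^6 ] with R = R_i + R_j."""
    ratio = (r_i + r_j) / r_ij
    return eps_ij * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb(q_i: float, q_j: float, r_ij: float, dielectric: float = 4.0) -> float:
    """Electrostatic term q_i q_j / (D r_ij); unit conversion factor omitted."""
    return q_i * q_j / (dielectric * r_ij)


# At r_ij = R_i + R_j the van der Waals term reaches its minimum, -eps_ij.
e_vdw_min = lennard_jones(0.1, 1.5, 1.5, 3.0)
e_elec = coulomb(1.0, -1.0, 2.0)
```

A full energy evaluation would sum these terms over all bonds, angles, torsions, impropers and non-bonded atom pairs of the conformation.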

4.3. Multi-Objective Formulation

To reduce the size of the conformational space, the backbone torsion angles are constrained to regions derived from the CADB-2.0 database [22], which contains the most common torsion angles for each residue. The side-chain torsion angles are constrained to regions derived from the backbone-independent rotamer library of Tuffery [21]. Side-chain constraint regions are of the form [m − σ, m + σ], where m and σ are the mean and the standard deviation of each side-chain torsion angle computed from the rotamer library. Under these constraints, the conformation is still highly flexible and the structure can take on various shapes that are vastly different from the native shape.

A protein can be seen as a collection of atoms linked by chemical bonds. The Tinker energy functions are used to evaluate the protein conformation, and the atoms can be divided into bonded and non-bonded groups. The bonded group represents the local interactions and considers all atom chains of length at most four. The non-bonded group represents the non-local interactions and considers all atoms separated by at least three covalent bonds. This division reflects the decomposition of the energy function into two partial sums: bonded and non-bonded atom energies. This is the most commonly used decomposition of the cost function into two objectives [7]. The proposed approach uses a different decomposition:

\[ \mathrm{Energy}_1 = E_{angle} + E_{bond} + E_{dihe} + E_{impr} \tag{2} \]

\[ \mathrm{Energy}_2 = E_{vdw} \tag{3} \]

\[ \mathrm{Energy}_3 = E_{elec} \tag{4} \]

Equation (2) groups the bonded potential energies, which relate to bonded atoms. Equation (3) relates to the van der Waals interactions and Equation (4) to the electrostatic interactions between non-bonded atoms. These objectives correspond to different interactions among the atoms, so it is more interesting to minimize them separately. These three functions are our minimization objectives, the torsion angles of the protein are the decision variables of the multi-objective problem, and the constraint regions are the variable bounds.
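
The formulation above (torsion angles as decision variables, constraint regions as bounds, three energy sums as objectives) can be sketched in Python as follows. Everything named here is an assumption made for illustration: the `Residue` class, the helper functions, the dictionary keys for the energy terms and the uniform sampling inside each region are hypothetical, and in the paper the per-term energies come from the Tinker routines rather than from a dictionary.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

Bounds = Tuple[float, float]  # (lower, upper) limits of one torsion angle, in degrees


@dataclass
class Residue:
    """Torsion angles of one residue; omega is fixed at 180 degrees and not stored."""
    phi: float
    psi: float
    chi: List[float]  # side-chain angles; how many depends on the residue type


def side_chain_bounds(mean: float, sigma: float) -> Bounds:
    """Constraint region [m - sigma, m + sigma] for one side-chain angle."""
    return (mean - sigma, mean + sigma)


def random_residue(phi_b: Bounds, psi_b: Bounds, chi_b: List[Bounds],
                   rng: random.Random) -> Residue:
    """Draw every angle from its constraint region (uniform sampling is assumed)."""
    return Residue(
        phi=rng.uniform(*phi_b),
        psi=rng.uniform(*psi_b),
        chi=[rng.uniform(lo, hi) for lo, hi in chi_b],
    )


def objectives(terms: Dict[str, float]) -> Tuple[float, float, float]:
    """Map per-term energies to (Energy1, Energy2, Energy3) as in Eqs. (2)-(4)."""
    energy1 = terms["angle"] + terms["bond"] + terms["dihe"] + terms["impr"]
    return (energy1, terms["vdw"], terms["elec"])


# Illustrative numbers only; they are not taken from CADB-2.0 or the rotamer library.
rng = random.Random(0)
chi_regions = [side_chain_bounds(60.0, 15.0), side_chain_bounds(180.0, 10.0)]
residue = random_residue((-150.0, -50.0), (100.0, 170.0), chi_regions, rng)
objs = objectives({"angle": 1.0, "bond": 2.0, "dihe": 3.0, "impr": 4.0,
                   "vdw": -5.0, "elec": -6.0})
```

A chromosome is then a list of `Residue` objects, one per amino acid in the sequence, and the three-component tuple is what the dominance comparison of the MOEA operates on.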

4.4. MOEA Proposed

The proposed MOEA is based on NSGA-II. The algorithm starts by initializing a random conformation: the torsion angles (φ, ψ, χi) are generated at random from the constrained regions. Afterward, the energy of the conformation is evaluated. First, the protein structure in internal coordinates (backbone and side-chain torsion angles) is transformed into Cartesian coordinates. Then, the potential energy is calculated using the Tinker routines.

At this point, the main loop of the algorithm begins. New solutions are obtained from the current solutions using genetic operators. We propose three kinds of recombination operators. The first is the BLX-α operator, developed especially for floating-point representations [14]. The second is uniform crossover. The last is two-point crossover.

Three kinds of mutation operators are proposed. When the first mutation operator acts on a peptide chain, all the backbone and side-chain torsion angles of a residue chosen at random are re-selected from their corresponding constrained regions. The second and third mutation operators apply a uniform mutation: all the backbone and side-chain torsion angles of a selected residue are perturbed with a uniform distribution. The difference between the second and third operators lies in the uniform distribution: for the second operator, it is between 0 and 1; for the third, between 0 and 0.1. The remaining steps of the proposed approach are the same as in the standard NSGA-II.

5. Results

This section reports the results obtained using the proposed multi-objective approach for PSP. The algorithm was applied to four protein sequences from the Protein Data Bank (PDB): 1ZDD, 1ROP, 1CRN and 1UTG. The population size is 200 chromosomes and the maximum number of generations is 1000. The cost functions use a dielectric constant equal to 4.0.
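
As an aside on the genetic operators of Section 4.4, the BLX-α crossover and the two styles of mutation can be sketched in Python as below. Several details are assumptions, since the text does not specify them: α = 0.5 as the default, clipping of the results to the constraint regions, and an additive perturbation for the uniform mutation. The function names are hypothetical, and each angle list stands for the torsion angles of one residue (or of a whole chromosome in the case of crossover).

```python
import random
from typing import List, Optional, Sequence, Tuple

Bounds = Tuple[float, float]


def clip(value: float, bounds: Bounds) -> float:
    """Force a value back into its constraint region (an assumed repair step)."""
    return max(bounds[0], min(bounds[1], value))


def blx_alpha(p1: Sequence[float], p2: Sequence[float], bounds: Sequence[Bounds],
              alpha: float = 0.5, rng: Optional[random.Random] = None) -> List[float]:
    """BLX-alpha: sample each gene from the parents' interval widened by alpha * d."""
    rng = rng or random.Random()
    child = []
    for x, y, b in zip(p1, p2, bounds):
        lo, hi = min(x, y), max(x, y)
        d = hi - lo
        child.append(clip(rng.uniform(lo - alpha * d, hi + alpha * d), b))
    return child


def reselect_mutation(bounds: Sequence[Bounds], rng: random.Random) -> List[float]:
    """First mutation: redraw all angles of a residue from their regions."""
    return [rng.uniform(lo, hi) for lo, hi in bounds]


def perturb_mutation(angles: Sequence[float], bounds: Sequence[Bounds],
                     width: float, rng: random.Random) -> List[float]:
    """Second/third mutation: add a draw from U(0, width); width is 1 or 0.1."""
    return [clip(a + rng.uniform(0.0, width), b) for a, b in zip(angles, bounds)]


rng = random.Random(0)
regions = [(-180.0, 180.0), (-180.0, 180.0)]
same = blx_alpha([10.0, -60.0], [10.0, -60.0], regions, rng=rng)
child = blx_alpha([0.0, 100.0], [50.0, 170.0], regions, rng=rng)
nudged = perturb_mutation([10.0, 20.0], regions, 0.1, rng)
redrawn = reselect_mutation([(45.0, 75.0), (170.0, 190.0)], rng)
```

With identical parents the BLX-α interval collapses to a point, so `same` equals the parents; this is a quick sanity check that the operator only explores where the parents differ.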
To compute the van der Waals and electrostatic energies, a maximum and a minimum distance (d) between atom i and atom j were defined. Only the atom interactions in this interval (dmin