A Non-deterministic Grammar Inference Algorithm Applied to the Cleavage Site Prediction Problem in Bioinformatics

Gloria Inés Alvarez¹, Jorge Hernán Victoria¹, Enrique Bravo², and Pedro García³

¹ Pontificia Universidad Javeriana Cali, Calle 18 118-250, Cali, Colombia
  {galvarez,jhvictoria}@javerianacali.edu.co
² Universidad del Valle, Sede Melendez, Cali, Colombia
  [email protected]
³ Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, España
  [email protected]

Abstract. We report results on applying the OIL (Order Independent Language) grammar inference algorithm to predict cleavage sites in polyproteins translated from Potyvirus genomes. This non-deterministic algorithm is used to generate a group of models which vote to predict the occurrence of the pattern. We built nine models, one for each cleavage site in this kind of virus genome, and report sensitivity, specificity and accuracy for each model. Our results show that this technique is useful for predicting cleavage sites in the given task, with accuracy rates higher than 95%.

Introduction

Grammar inference is an inductive learning technique that belongs to the syntactic approach to machine learning. Here we propose an inference algorithm to predict cleavage sites in polyproteins translated from Potyvirus genomes. Our goal is to develop an application for the automatic segmentation of polyproteins available in large bioinformatics databases. These databases often hold unsegmented sequences, which makes it difficult to analyse and extract features from a particular segment or from segmented chains. Furthermore, this real problem allows us to evaluate the behaviour of the OIL algorithm under real-world conditions, which differ from tests on synthetic data.

The paper is organised as follows: in Section 1 we briefly review the OIL algorithm, and in Section 2 we describe the cleavage site prediction problem. The experimental design and results are presented in Section 3. Finally, Section 4 discusses some final remarks and future work.

This work was partially supported by the Spanish Ministry of Education and Science TIN2007-60769.

J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 267–270, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Algorithm OIL

First published in [3], the Order Independent Language (OIL) inference algorithm is a non-deterministic approach to grammar inference for regular languages. Algorithm 1 below presents the OIL strategy: positive and negative samples are sorted in lexicographical order (lines 1 and 2). Initially, the hypothesis M is empty (line 3). Positive samples are then considered one by one (line 4). If the current hypothesis M accepts a positive sample pS, M remains unchanged. If M rejects it (line 5), a new automaton M′ is built to accept only pS and is added to M (lines 6 and 7). The elements of M′ are defined as follows: Q′ = Pref(pS); δ′ = {(u, v) | u ∈ Pref(pS), v = ua, a ∈ Σ, ua ∈ Pref(pS)}; q0′ = ε; and Φ′ is defined by Φ′(w) = ? for all w ∈ Pref(pS) − {pS}, and Φ′(pS) = 1. In line 8, M is modified by merging as many states as possible. The states to be merged are selected randomly. Once a merge is completed, the negative samples are tested against the new model M; if any inconsistency is found, the merge is undone. When all the positive samples have been processed, the algorithm ends and the final value of M is the learned model.

OIL is a convergent algorithm; the proof is given in [1]. Because the algorithm is non-deterministic, every run of OIL may produce a different model. For this reason, we compute a group of models from a given training sample. At test time, several heuristics may be applied to obtain a final response. For example, we can use the smallest model (the one with the fewest states), or apply a voting method among the models to tag the test samples.

Algorithm 1. OIL(D+, D−)
 1: posSample = sort(D+)  (in lexicographical order)
 2: negSample = sort(D−)  (in lexicographical order)
 3: M = (Q, Σ, {0, 1, ?}, δ, q0, Φ)  (empty automaton)
 4: for pS in posSample do
 5:   if M doesn't accept pS then
 6:     M′ = (Q′, Σ, {0, 1, ?}, δ′, q0′, Φ′)  (M′ accepts only pS)
 7:     M = M ∪ M′
 8:     M = DoAllMergesPossible(M, negSample)
 9:   end if
10: end for
11: return M
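The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions, not the authors' implementation: the class `NFA` and the functions `do_all_merges_possible` and `oil` are hypothetical names, the three-valued output function Φ is simplified to a plain set of final states, Python's built-in string ordering stands in for the lexicographical sort, and a merge is kept only if the merged automaton still rejects every negative sample.

```python
import random
from itertools import combinations


class NFA:
    """Minimal NFA; several initial states are allowed so that unions are cheap."""

    def __init__(self):
        self.states = set()
        self.initial = set()
        self.final = set()
        self.delta = {}      # (state, symbol) -> set of successor states
        self.next_id = 0

    def _new_state(self):
        state = self.next_id
        self.next_id += 1
        self.states.add(state)
        return state

    def accepts(self, word):
        current = set(self.initial)
        for symbol in word:
            current = {t for s in current for t in self.delta.get((s, symbol), ())}
        return bool(current & self.final)

    def add_word(self, word):
        """M = M ∪ M': add a chain of states (one per prefix) accepting only `word`."""
        state = self._new_state()
        self.initial.add(state)
        for symbol in word:
            nxt = self._new_state()
            self.delta.setdefault((state, symbol), set()).add(nxt)
            state = nxt
        self.final.add(state)

    def merged(self, keep, drop):
        """Return a copy in which state `drop` has been folded into state `keep`."""
        def rename(s):
            return keep if s == drop else s

        out = NFA()
        out.next_id = self.next_id
        out.states = {rename(s) for s in self.states}
        out.initial = {rename(s) for s in self.initial}
        out.final = {rename(s) for s in self.final}
        for (s, a), targets in self.delta.items():
            out.delta.setdefault((rename(s), a), set()).update(rename(t) for t in targets)
        return out


def do_all_merges_possible(m, negatives, rng):
    """Try state pairs in random order; keep a merge only if no negative is accepted."""
    changed = True
    while changed:
        changed = False
        pairs = list(combinations(sorted(m.states), 2))
        rng.shuffle(pairs)
        for keep, drop in pairs:
            candidate = m.merged(keep, drop)
            if not any(candidate.accepts(w) for w in negatives):
                m = candidate        # merge is consistent, keep it and start over
                changed = True
                break
    return m


def oil(positives, negatives, seed=None):
    rng = random.Random(seed)
    pos_sample = sorted(positives)   # assumption: built-in string order
    neg_sample = sorted(negatives)
    m = NFA()                        # empty hypothesis
    for ps in pos_sample:
        if not m.accepts(ps):
            m.add_word(ps)
            m = do_all_merges_possible(m, neg_sample, rng)
    return m
```

Calling `oil` with different seeds may return different automata for the same training sample, which is what makes it possible to build a group of models that vote. Since merging states can only enlarge the accepted language, every positive sample already accepted stays accepted after a merge.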

2 The Cleavage Site Prediction Problem

Given a sequence of amino acids, the cleavage site prediction problem consists of predicting where a particular subsequence with a specific meaning or function begins and ends. Cleavage sites can be predicted for signal peptides, viral coding segments and other biological patterns, so the generic problem arises in any genome, from viruses to human beings. We are interested in cleavage site prediction for Potyviruses, since they are pathogenic to many important crop plants, such as beans, soybean, sugarcane and tobacco, which have a large economic and nutritional impact in South America. Predicting cleavage sites may facilitate the understanding of the molecular mechanisms underlying the diseases caused by these viruses. Researchers in the region have studied this family of viruses [2], and more than fifty viruses have been sequenced.

The potyviral genome is expressed through the translation of a polyprotein which is cut by virus-encoded proteinases at specific sites in the amino acid sequence, resulting in 10 functionally mature proteins responsible for infection and virus replication: P1, HCPro, P3, 6K1, CI, 6K2, VPg, NIa, NIb and CP. The functions of these virus-encoded proteins are only partially understood. Each cleavage site is identified by the names of the two segments it separates. Predicting cleavage sites is not trivial because, even though there are patterns of symbols that mark these places, the patterns can vary. Given the complexity of the cleavage site sequences, algorithms make it easier to detect the specific features of those points. Predicting cleavage sites makes it possible to isolate specific segments for study, and facilitates the analysis and annotation of experimentally obtained data and their comparison with data in databases such as GenBank.
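The naming convention for cleavage sites and the idea of segmentation can be illustrated with a small sketch. The helper names `cleavage_site_names` and `segment` are hypothetical, and the cut positions passed to `segment` are assumed to be already known (in the paper they are what the models must predict).

```python
# The ten mature proteins, in the order given in the text.
PROTEINS = ["P1", "HCPro", "P3", "6K1", "CI", "6K2", "VPg", "NIa", "NIb", "CP"]


def cleavage_site_names(proteins):
    """Name each site after the two adjacent segments it separates."""
    return [f"{left}-{right}" for left, right in zip(proteins, proteins[1:])]


def segment(polyprotein, cut_positions):
    """Split a sequence at the given cuts (index of the first residue after each cut)."""
    bounds = [0, *sorted(cut_positions), len(polyprotein)]
    return [polyprotein[a:b] for a, b in zip(bounds, bounds[1:])]
```

Ten proteins give nine cleavage sites, from P1-HCPro to NIb-CP, which is why nine models are trained in the experiments.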

3 Experimental Results

We apply the OIL algorithm to the problem of predicting cleavage sites in polyproteins translated from the genomes of viruses of the family Potyviridae. Our purpose is to learn a model that recognises each of the nine cleavage sites present in the polyprotein. Training samples are obtained from sequences published at www.dpvweb.net/potycleavage/index.html. Approximately 50 samples are used for training for each cleavage site. Since the algorithm needs negative samples, we use the positive samples of the other sites as negative samples for a given model; the ratio between positive and negative samples is 1/10.

The amino acid sequence is considered one window at a time. Three window lengths are explored. In the first case we suppose that the cleavage site is located between the fourth and fifth symbols, so we refer to this window as 4/1; in a similar way, we experiment with windows 14/1 and 10/10. We train 15 hypotheses from each training set with the OIL algorithm, and all of them vote to decide whether a test sample is accepted or rejected. We use a simple voting criterion in which each model adds 1 to a counter initialised to zero if it accepts the sample and subtracts 1 from the counter if it rejects it. We calculate several measures commonly used to evaluate algorithms in bioinformatics: sensitivity, specificity and accuracy.

Table 1 shows the average performance of the algorithm for the three window sizes 4/1, 14/1 and 10/10, for each cleavage site. Values highlighted show the best window for each cleavage site. We obtain models for each cleavage site with an accuracy higher than 0.95. From Table 1 we can decide which window size is best suited to each cleavage site: P1-HCPro, HCPro-P3, P3-6K1 and 6K2-VPg yield better results when learning from a 4/1 window, 6K1-CI, CI-6K2, NIa-NIb and NIb-CP from a 10/10 window, and VPg-NIa from a 14/1 window. This information gives hints about the size of the pattern to be learned and allows us to specialise the training process for each cleavage site.

Table 1. Average sensitivity, specificity and accuracy of the OIL algorithm when predicting cleavage sites in polyproteins from the genomes of viruses of the family Potyviridae, using a group of 15 models

                       4/1                 14/1                10/10
Cleavage site   Sens.  Spec.  Acc.   Sens.  Spec.  Acc.   Sens.  Spec.  Acc.
P1-HCPro        0.81   0.85   0.98   0.63   0.84   0.97   0.58   0.78   0.96
HCPro-P3        0.88   0.97   0.99   0.74   0.86   0.97   0.67   0.88   0.97
P3-6K1          0.65   0.80   0.96   0.61   0.74   0.96   0.65   0.76   0.96
6K1-CI          0.74   0.76   0.97   0.72   0.66   0.95   0.79   0.87   0.98
CI-6K2          0.7    0.65   0.95   0.72   0.74   0.96   0.79   0.92   0.98
6K2-VPg         0.74   0.73   0.96   0.63   0.69   0.95   0.60   0.74   0.96
VPg-NIa         0.81   0.81   0.97   0.81   0.92   0.98   0.72   0.91   0.98
NIa-NIb         0.73   0.77   0.96   0.71   0.73   0.96   0.76   0.77   0.97
NIb-CP          0.94   0.91   0.93   0.96   0.87   0.92   0.95   0.95   0.95
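The windowing, voting and evaluation steps described in this section can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, each model is represented as any callable that returns True when it accepts a window, the decision rule "accept when the counter is positive" is an assumption (the text only describes how the counter is updated), and the three measures use their standard confusion-matrix definitions, which the paper does not spell out.

```python
def window_at(sequence, cut, before, after):
    """Window with `before` residues left and `after` residues right of `cut`.

    `cut` is the index of the first residue after the candidate site, so a
    4/1 window is window_at(seq, cut, 4, 1). Returns None near the ends.
    """
    start, end = cut - before, cut + after
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def vote(models, sample):
    """Counter starts at zero: +1 per accepting model, -1 per rejecting model."""
    return sum(1 if model(sample) else -1 for model in models)


def predict(models, sample):
    """Assumed decision rule: accept when the counter is positive."""
    return vote(models, sample) > 0


def evaluate(predicted, actual):
    """Standard sensitivity, specificity and accuracy from boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

With an odd number of models, such as the 15 used here, the counter can never be zero, so the assumed rule never has to break a tie.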

4 Conclusions and Future Work

It is possible to learn patterns that predict cleavage sites in potyvirus polyproteins with grammar inference algorithms such as OIL. Our experimental rates suggest that an automatic segmentation tool can be built on such models. We are currently applying other methods in order to compare their performance. We will assemble the best models into a computational tool which receives a complete polyprotein and segments it. Finally, some biological considerations will be taken into account to improve the performance of the proposed tool.

References

1. Alvarez, G.: Estudio de la Mezcla de Estados Determinista y No Determinista en el Diseño de Algoritmos para Inferencia Gramatical de Lenguajes Regulares. PhD thesis, Universidad Politécnica de Valencia (2007)
2. Bravo, E., Calvert, L.A., Morales, F.J.: The complete nucleotide sequence of the genomic RNA of bean common mosaic virus strain nl4. Revista de la Academia Colombiana de Ciencias Exactas, Físicas y Naturales 32(122), 37–46 (2008)
3. García, P., de Parga, M.V., Alvarez, G.I., Ruiz, J.: Universal automata and NFA learning. Theoretical Computer Science 407, 192–202 (2008)