CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2014; 26:118–133 Published online 12 November 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.2953

Cloud-based parallel solution for estimating statistical significance of megabyte-scale DNA sequences‡

Ahmad M. Hosny*,†, Howida A. Shedeed, Ashraf S. Hussein and Mohamed F. Tolba

Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Abbassia, Cairo, 11566, Egypt

ABSTRACT
Confidence in a pairwise local sequence alignment is a fundamental problem in bioinformatics. For huge DNA sequences, this problem is highly compute-intensive because it involves evaluating hundreds of local alignments to construct an empirical score distribution. Recent parallel solutions support only kilobyte-scale sequence sizes and/or rely on sophisticated infrastructures that are not available to most research labs. This paper presents an efficient parallel solution for evaluating the statistical significance of a pair of huge DNA sequences using cloud infrastructures. The solution can receive requests from various researchers via a web portal and allocate resources according to their demand, realizing the benefits of cloud-based services. The fundamental innovation of this research work is an efficient solution that utilizes both shared- and distributed-memory architectures via cloud technology to enhance the performance of evaluating the statistical significance of a pair of DNA sequences. As a result, the restriction on sequence sizes is relaxed to the megabyte scale, which was not previously supported for the statistical significance problem. The performance of the proposed solution was evaluated on Microsoft's cloud and compared with existing parallel solutions. The results show that the processing speed outperforms recent cluster solutions that target the same problem. In addition, the performance metrics exhibit linear behavior for the addressed numbers of instances. Copyright © 2012 John Wiley & Sons, Ltd.

Received 10 March 2012; Revised 20 September 2012; Accepted 14 October 2012

KEY WORDS:

cloud computing; megabyte-scale DNA sequence; statistical significance estimation; sequence alignment; multi-core architectures

1. INTRODUCTION
Sequence alignment is a fundamental process in bioinformatics. It is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity or difference. From the biological point of view, matches may indicate similar functions, for example, homologous pairs or conserved regions, whereas mismatches may reveal functional differences, for example, single nucleotide polymorphisms. Local pairwise sequence alignment algorithms fall into two categories. The first comprises the exact methods based on dynamic programming, such as the Smith–Waterman (SW) algorithm [2]. The SW algorithm can find the optimal local alignment score according to a scoring function, but with quadratic space and computational complexities in terms of the two sequences' lengths. These quadratic complexities make it challenging to apply SW to large-scale sequences. The second category comprises the heuristic methods, such as the BLAST algorithm [3]. These heuristic methods generally reduce the search space and make the comparison of large genomic banks faster, but at the expense of a considerable reduction in accuracy.

*Correspondence to: Ahmad M. Hosny, Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Abbassia, Cairo, 11566, Egypt.
†E-mail: [email protected]
‡A preliminary conference version of this paper with preliminary results appeared in the Informatics and Systems Conference Proceedings of INFOS 2012 [1].


The development of methods that assess the significance of a local alignment has recently become one of the most important challenges in sequence analysis, answering the important question of whether a computed alignment is evolutionarily relevant or the similarity is just a coincidence. It is also necessary to know whether the optimal alignment score is high enough to indicate a biologically important alignment [4]. For sequences that are quite similar (such as two proteins that clearly belong to the same family), this analysis is not necessary. The importance of significance estimation arises when comparing two sequences that are not clearly similar. In such a case, the significance test can help biologists decide whether the calculated alignment score is a 'reasonable' one between two related sequences or may just be the outcome of aligning an unrelated pair [5]. The significance test is also needed to evaluate the results of different alignment parameters, such as SW gap penalties and scoring functions; the test is applied for each parameter setting so that the most significant scores are reported.

To estimate the significance of a pairwise optimal local alignment score, two methods have been proposed [5]. The first is the Extreme Value Distribution proposed by Karlin and Altschul [6]. This method is based on a statistical model that calculates the p-value for two sequences A = a1 a2 ... an and B = b1 b2 ... bm. Given the distribution of the individual residues of the real sequences and the scoring matrix, the p-value is the probability of finding an ungapped segment pair with a score greater than or equal to the alignment score s [7]. To be valid, this model is restricted by two conditions. First, the individual residue distributions of the two sequences should be quite dissimilar (the validity of the significance becomes limited when sequences come from the same family). Second, the sequence lengths m and n should be nearly equal and quite small [8].

The second method is the Z-score [9], which is often used to estimate the statistical significance of a pairwise alignment. It is based on a Monte-Carlo process, where one of the sequences is aligned with N randomly shuffled permutations of the other. The Z-score calculation is very computationally expensive, especially when comparing large sequences. Although the first method is computationally much faster, the second is not only more accurate but also valid for huge sequences. When the restricting conditions are not fulfilled, the computed p-values are strongly overestimated; therefore, the Extreme Value Distribution cannot be trusted for large-sequence pairwise alignments [8,10].

The Z-score estimation requires a large number of alignment computations [11], each of which has a quadratic computational complexity in terms of the lengths of the two sequences. In addition, a larger number of shuffling operations is required to obtain more accurate estimates [11]. These massive computational requirements inhibit applying the Z-score method to large-scale DNA sequences. The capacities of sequential computing resources are usually prohibitive for such compute-intensive problems, especially for huge input sequences. Fortunately, this type of problem lends itself to a high degree of parallelism, which motivates the use of parallel computing techniques. Computer clusters and computational grids are the most suitable architectures for this type of problem.
However, these architectures are usually too expensive to be available to the majority of 'intermediate' research labs. Graphics processing units (GPUs) are efficient computing resources, but their memory cannot accommodate huge input sequences; in addition, using a cluster of GPUs is relatively expensive and not reliable [12].

Cloud computing can be defined as the delivery of computing resources as services rather than products. Cloud architectures provide cost-effective, robust and fault-tolerant computing services. This offers researchers the ability to access large-scale computing resources matching their demand without investing in local computational resources. Clouds are commercially supported by several vendors, such as Amazon Web Services [13], Google App Engine [14] and Microsoft Windows Azure [15]. One of the cloud computing models is Platform as a Service (PaaS) [16], which can be employed to run scientific applications. In this way, cloud computing provides sustainable resources for rapid, large-scale bioinformatics problems such as the one under consideration [17]. However, the potential of cloud platforms for evaluating the statistical significance of local alignments remains unexplored.

In this paper, an effective algorithm is provided to evaluate the statistical significance of pairwise local megabyte-scale sequence alignment scores obtained by the SW algorithm. The proposed iterative solution follows an intermediate-grained parallel design that utilizes all the cloud resources, and its iterations are dynamically distributed among the allocated instances.


A carefully designed tiled wave-front parallelization [18] is adopted to calculate the SW score, employing the shared-memory cores available at each cloud instance. This design ensures better load balancing and resource utilization, and results in significant performance enhancement, as confirmed by the experimental results. In addition, a web portal is developed to receive user requests and delegate the computational work to the cloud resources.

The rest of the paper is organized as follows. Section 2 briefly presents the related work. Section 3 explains the background of statistical significance estimation and the details of the SW algorithm. Section 4 gives an overview of the Microsoft Azure cloud platform. Section 5 describes the proposed solution architecture and the tiled wave-front parallelization. Section 6 presents the experimental results, considering the system performance and scalability in comparison with existing solutions. Finally, the conclusions and future work are presented in Section 7.

2. RELATED WORK
Most recent studies have focused on investigating the Z-score algorithm's statistical model [10,19] and tuning its parameters to enhance its accuracy, especially for special cases such as sequences of equal length or normal letter distribution [5,8,20]. However, the solution of this compute-intensive problem needs to be optimized to support megabyte-scale pairwise sequence analysis without adding new heuristics that affect the accuracy. Recently, there have been interesting efforts to develop parallel accelerated solutions based on the optimal SW algorithm to support huge sequences. Despite that interest, to our knowledge, there is no solution that calculates the statistical significance for pairs of huge sequences, which implies aligning hundreds of huge sequences and consequently requires excessive computation. Applying the current solutions to the statistical significance problem is problematic, as most of them are originally designed to align only a single pair of huge sequences over all the assigned computational resources to obtain the results in a reasonable time frame. Therefore, the statistical significance problem needs an intermediate-grained parallel design that compromises between aligning a huge sequence pair and aligning many pair instances simultaneously. This compute-intensive problem also needs to be implemented on an easy-to-scale parallel architecture such as clouds.

The related work is divided into two categories. The first category covers the recent parallel optimizations pertaining to the problem of aligning huge sequences in general and/or estimating pairwise statistical significance. The second category surveys the recent utilization of cloud computing for sequence analysis problems, demonstrating the progress in this area and differentiating the proposed solution from existing ones.

For the first category, there are two common approaches to accelerate SW processing. The first is to introduce a heuristic that reduces the calculations, but at the cost of the guaranteed optimality of the solution. The second is to accelerate the algorithm using different parallel architectures, keeping the optimality of the results but supporting only a single huge alignment. In this way, Misra et al. [21] proposed a parallel solution using a GPU to accelerate the statistical significance evaluation with the exact SW algorithm. This solution adopted a tiled approach to work around the GPU's limited memory. The maximum supported sequence size was 1.6 K, which clearly reflects a weakness introduced by the memory limitation of GPUs; in addition, the solution utilized only 50% of the available computational power. CUDAlign [22] presented an efficient memory design to align megabyte-scale DNA sequences using a GPU. CUDAlign was designed only to align a single pair of sequences using the full (limited) memory of the GPU; it also cannot easily be scaled to solve the addressed problem because of the need for a large cluster of GPUs. The MPIPairwiseStatSig parallel solution was presented in [23] for pairwise statistical significance estimation. It is implemented with the Message Passing Interface for a computer cluster of 64 nodes. This solution adopted only coarse-grained parallelism by distributing different alignment tasks over separate nodes. The alignment score was computed using the exact SW algorithm, and the processing was performed sequentially, raising two main drawbacks.
The first is the low utilization of the computing resources, as each alignment works on a single core and ignores the others. The second is the


supported sequence size limitation, as increasing the size is restricted by the availability of a more powerful CPU. The maximum supported sequence size was 1.6 K, and the solution was clearly unscalable. A fine-grained parallel solution called Z-Align was presented in [24]. Z-Align is a parallel multi-core solution that was used to calculate the optimal score and alignment for the largest known sequence pair (23 M × 24 M) using a 64-node supercomputer. This solution cannot be scaled to solve the problem under consideration, as its implementation consumes all the processing power to evaluate a single huge alignment by dividing it among the available computing nodes. Therefore, it is impractical to use Z-Align to evaluate hundreds of alignments, either simultaneously or one by one. In addition, this solution cannot support such huge sizes on commodity architectures, which is one of its main drawbacks; however, its performance was relatively good. Other parallel solutions such as RPAlign [25] accelerated the SW algorithm on computer clusters by aligning a single pair of huge sequences. RPAlign first detects regions that are more frequently 'alignable' and then starts their actual alignment across multiple processors. Although RPAlign gives sensitive results relative to other heuristic solutions, there is no guarantee of optimality, and sacrificing optimality makes the SW algorithm lose its main advantage. A parallel framework based on a multi-core architecture was proposed in [26]. The maximum supported sequence size was 1.25 M × 0.2 M. This framework adopted intermediate-grained parallelism by dividing the query and database sequences among the cores. It calculates both the score and the alignment using a heuristic approach that limits the number of processed cells in the trace-back calculation; this limitation certainly affects the solution's optimality.

Recently, cloud computing has introduced a new trend for parallel genome computing [17]. To date, there are several projects and solutions in the context of bioinformatics that utilize cloud computing models. Many bioinformatics algorithms have exploited cloud platforms, such as accelerating BLAST with different approaches [27,28]. Other solutions addressed the genome sequencing problem, with recent implementations such as CloudBurst [29] and RSD-Cloud [30]. All cloud-based bioinformatics solutions primarily target specific problems and provide custom solutions for a given methodology, which underlines the novelty of the proposed solution. The present research work is concerned with the development of an efficient parallel solution for the statistical significance of the optimal local alignment between two megabyte-scale DNA sequences, utilizing the computational resources provided by cloud services.

3. SMITH–WATERMAN ALGORITHM AND SIGNIFICANCE ASSESSMENT
The proposed solution for estimating the statistical significance is based on the SW algorithm with Gotoh's improvements [31], which handle affine gap penalties when computing the alignment score.

3.1. Smith–Waterman algorithm
Consider two sequences Q and D of lengths m and n, with individual residues q1 q2 ... qm and d1 d2 ... dn, where 1 ≤ i ≤ m and 1 ≤ j ≤ n. A scoring matrix P(qi, dj) is defined for all residue pairs. A constant penalty may be assigned to gaps; however, keeping gaps together generates more significant results. For this reason, opening a gap must incur a greater penalty than extending it (the affine gap model) [32]. The penalties for opening and extending a gap are denoted Ginit and Gext, respectively.

The algorithm is divided into two phases: computing the dynamic programming matrices and finding the best local alignment. The first phase fills a matrix H according to Equations (1)–(3), where the values of Hi,j, Ei,j and Fi,j are defined as 0 for i < 1 or j < 1. The similarity score between Q and D is the highest value in H, and the position (i, j) of its occurrence represents the end of the alignment. To calculate the trace-back, only the arrows' directions need to be stored in the matrix cells. The second phase finds the best local alignment: the algorithm starts from the cell containing the highest score and follows the stored directions until a zero-valued cell is reached.


$$H_{i,j} = \max \begin{cases} 0 \\ E_{i,j} \\ F_{i,j} \\ H_{i-1,j-1} + P(q_i, d_j) \end{cases} \qquad (1)$$

$$E_{i,j} = \max \begin{cases} E_{i,j-1} - G_{init} \\ H_{i,j-1} - G_{init} \end{cases}^{\!\!} = \max \begin{cases} E_{i,j-1} - G_{ext} \\ H_{i,j-1} - G_{init} \end{cases} \qquad (2)$$

$$F_{i,j} = \max \begin{cases} F_{i-1,j} - G_{ext} \\ H_{i-1,j} - G_{init} \end{cases} \qquad (3)$$
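To make the recurrences concrete, the following minimal Python sketch (an illustration assuming sequences small enough to hold the full matrices in memory, not the authors' implementation) fills H, E and F and returns the best local score:

```python
def sw_affine_score(Q, D, match=1, mismatch=-1, g_init=2, g_ext=1):
    """Smith-Waterman local alignment score with affine gaps (Gotoh).

    Educational sketch of Equations (1)-(3); O(m*n) time and memory,
    so it is only practical for short sequences.
    """
    m, n = len(Q), len(D)
    # H, E, F matrices with an extra top row / left column of zeros.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if Q[i - 1] == D[j - 1] else mismatch
            E[i][j] = max(E[i][j - 1] - g_ext, H[i][j - 1] - g_init)  # Eq. (2)
            F[i][j] = max(F[i - 1][j] - g_ext, H[i - 1][j] - g_init)  # Eq. (3)
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + p)   # Eq. (1)
            best = max(best, H[i][j])
    return best
```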

Given these recurrences, any cell of the alignment matrix can be computed only after the Northern, Western and North-Western cells have been computed. The access pattern of the matrix computation is therefore non-uniform, and the parallelization strategy traditionally used for this kind of problem is the wave-front [33]: cells can be processed in parallel only if they lie on the same anti-diagonal, as depicted in Figure 1.

3.2. The significance assessment
The Z-score is often used to estimate the statistical significance of a pairwise alignment. The method is based on calculating N alignments, where one of the sequences is aligned with N randomly shuffled permutations of the other. The Z-score of two sequences A and B is defined as in Equation (4):

$$Z_{score}(A, B) = \frac{S(A, B) - r}{s} \qquad (4)$$

where S(A, B) is the pairwise alignment score obtained when comparing sequences A and B (using distance or similarity measures), r is the mean of the alignment scores of sequence A against all shuffles of sequence B, and s is the standard deviation of the distribution of these scores. As discussed by Comet [34], the Z-score depends on which sequence is shuffled: the score may differ noticeably if the second sequence is shuffled instead of the first. Thus, there are two different Z-score values, one for shuffling the first sequence and one for shuffling the second. To avoid an overestimated Z-score, the conservative Z-value [11,35] is defined as the minimum of the two Z-scores, as shown in Equation (5); the theoretical discussion in this paper considers the conservative Z-value, which leads to more accurate results.

$$Z_{value}(A, B) = \min(Z_{score}(A, B), Z_{score}(B, A)) \qquad (5)$$
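Building on the sketch above, the Monte-Carlo estimation of Equations (4) and (5) can be illustrated as follows (again a sequential reference sketch, not the parallel solution itself; sw_affine_score is the routine sketched in Section 3.1, and the shuffle count and seed are arbitrary choices):

```python
import random
import statistics

def z_score(A, B, n_shuffles=100, seed=42):
    """Z-score of S(A, B) against scores of A vs. shuffles of B (Eq. 4)."""
    rng = random.Random(seed)
    s_ab = sw_affine_score(A, B)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(B)
        rng.shuffle(shuffled)                     # random permutation of B
        scores.append(sw_affine_score(A, "".join(shuffled)))
    r = statistics.mean(scores)                   # mean of shuffled scores
    s = statistics.stdev(scores)                  # their standard deviation
    return (s_ab - r) / s

def z_value(A, B, n_shuffles=100):
    """Conservative Z-value: the minimum of the two Z-scores (Eq. 5)."""
    return min(z_score(A, B, n_shuffles), z_score(B, A, n_shuffles))
```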

Figure 1. Wave-front execution; three steps are shown where each step calculates the next anti-diagonal cells.


4. MICROSOFT WINDOWS AZURE
Microsoft Azure [15] is a set of on-demand, over-the-wire cloud computing services offered by Microsoft. Azure is a Platform as a Service (PaaS) that provides developers with on-demand compute and storage to host, scale and manage web applications on the Internet through Microsoft data centers. Azure enables users to lease virtual machine instances over the Internet, billed hourly to a credit card, allowing them to dynamically provision resizable virtual clusters in a matter of minutes. The architecture of a typical Azure application is shown in Figure 2. The web and worker roles represent the processing components of an Azure application: the web role is a web application accessible via HTTP endpoints, whereas the worker role is the main processing entity, which can be used to execute functions. Azure provides three types of storage services: Blob, Queue and Table [15]. Blobs provide persistent storage; users can store data (or partitioned data) in blobs and access them from web and worker role instances. Azure Queues are first-in, first-out persistent communication channels, which can be used to build complex communication patterns among a set of cooperating worker and web roles. Azure Tables provide structured storage for maintaining service state.

Figure 2. Azure architecture.
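As a small illustration of this queue-centred pattern, the sketch below uses the present-day azure-storage-queue Python SDK (an assumption for illustration only; the system described in this paper was built on the .NET Azure SDK of the time, and the connection string, queue name and handler are placeholders):

```python
from azure.storage.queue import QueueClient

def handle(payload: str) -> None:
    print("processing", payload)  # stand-in for the real alignment work

# Placeholder connection string and queue name.
conn = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
queue = QueueClient.from_connection_string(conn, "subtasks")

# Producer side (e.g., a task manager) enqueues a work item.
queue.send_message("task-42;shuffle-flag=1;seed=1337")

# Consumer side (e.g., a subtask processor) polls for work.
for msg in queue.receive_messages():
    handle(msg.content)
    queue.delete_message(msg)  # remove only after successful processing
```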

5. THE PROPOSED SOLUTION
The main goal of this research work is to compute the statistical significance of the local alignment between two megabyte-scale DNA sequences using the optimal SW algorithm. Aligning huge sequences with the optimal SW algorithm is a challenging task, as it needs extensive computation. For instance, computing the statistical significance for a pair of genome sequences, each of length 1 M, with 100 shuffles requires computing 100 similarity matrices, each with approximately 10^12 cells, where processing each matrix cell requires evaluating Equations (1)–(3). Recent studies revealed that clouds are more effective than traditional distributed systems for problems with low communication intensity and high compute intensity, such as the one addressed here [36,37]. In addition, clouds provide advanced bundles of services, allowing better communication than traditional message passing, reliability and robustness, and easy-to-scale architectures. Therefore, the compute-intensive problem under consideration motivated the research team to investigate the potential of cloud computing for a scalable solution with reasonable performance.

To conceive the solution while overcoming the mentioned challenges, consider estimating the statistical significance for a pair of input DNA sequences as a task. This task can be expressed as N independent alignment subtasks, each of which aligns a shuffled copy of one input sequence against the other.


The alignment scores obtained from all the subtasks are used to calculate the statistical significance according to Equation (4). Processing a task can be divided into three sequential procedures: (1) initializing the shuffled sequences, which generates the shuffled sequences and initializes the subtask inputs with them, with time complexity O(N), where N is the number of shuffles; (2) processing the independent subtasks, which runs the SW algorithm on each subtask's input sequences to produce an alignment score, with time complexity O(Nmn), where m and n are the lengths of the input sequences; and (3) evaluating the statistical significance, which collects the subtask scores and calculates the statistical significance, with time complexity O(N). The overwhelming, time-consuming part is the second procedure, which needs a careful parallel design to achieve good performance, speedup and load balancing. To parallelize the second procedure across multiple instances, each with multiple cores, there are two standard approaches:

• Coarse-grained parallelism: the tasks are distributed among the different instances and cores, and each task is processed sequentially by exactly one instance.
• Fine-grained parallelism: each task is distributed among all instances, which cooperate to perform it in parallel.

The proposed solution adopts an intermediate-grained parallelism that exploits both the shared- and distributed-memory resources assigned to the solution. It dynamically distributes the subtasks among the available distributed instances and exploits each instance's local shared-memory cores to parallelize the processing of each subtask. Tiled wave-front is the parallel strategy adopted to evaluate the SW score for each subtask's input sequences. Therefore, this solution highly exploits the assigned resources and presents an efficient parallel strategy that is also easy to scale. The next subsections explain in detail the proposed tiled wave-front parallelization adopted to accelerate the SW algorithm on the cloud instances' multiple cores, and then discuss the overall solution architecture, considering the improvements in performance, scalability and utilization of the assigned resources.

5.1. Tiled wave-front parallelization
The implementation of the SW algorithm with affine gaps requires the computation of the three matrices E, F and H. These three matrices are logically grouped into a single matrix M; each cell Mi,j contains the three values Hi,j, Ei,j and Fi,j, where each value is declared as an unsigned integer (4 bytes) to support huge sequence sizes. As illustrated in the literature, the wave-front parallel strategy processes the matrix M one anti-diagonal at a time. However, processing individual matrix cells in the wave-front strategy incurs a high communication overhead that unfavorably influences the overall performance. This overhead can be reduced by grouping matrix cells into large, computationally independent square blocks, each called a tile. Each anti-diagonal is composed of multiple tiles, and the tiles of any anti-diagonal can be processed in parallel. This strategy is called the tiled wave-front, and an anti-diagonal of tiles is referred to as a tiled anti-diagonal (TAD).
This optimization strategy is depicted via the example in Figure 3. For simplicity, the matrix in Figure 3 is divided into square tiles of side three, so that each group of nine cells forms a tile. The execution starts by processing tile t1, which represents the first TAD; then the two tiles on the next TAD, labeled t2, are ready to be processed in parallel on two different cores; next, the three tiles labeled t3 are ready, and so on. The cells of a tile are processed sequentially as per Equations (1)–(3). The number of independent tiles per TAD increases with every step until reaching the middle of the matrix, and then decreases again until it reaches a single tile. To maintain the maximum score over all cells of the matrix M while minimizing the memory bottleneck, every core keeps a local maximum score while processing its assigned tiles of the current TAD. These local scores are synchronized after the processing is completed to give the final score used in the statistical significance evaluation.


Figure 3. Tiled wave-front execution example. Tile size = 3; t1 labels the tile of the first tiled anti-diagonal (TAD), t2 labels the tiles of the second TAD, and so on; the numbers define the order in which tiles are processed.
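The TAD-by-TAD schedule just described can be sketched as follows (an illustrative reconstruction rather than the authors' code; process_tile stands for a routine that computes one tile from its cached boundary data and returns the tile's local maximum score):

```python
from concurrent.futures import ThreadPoolExecutor

def tiles_of_tad(d, n_tile_rows, n_tile_cols):
    """Tile coordinates (r, c) lying on tiled anti-diagonal d (r + c == d)."""
    return [(r, d - r)
            for r in range(max(0, d - n_tile_cols + 1),
                           min(d, n_tile_rows - 1) + 1)]

def tiled_wavefront(n_tile_rows, n_tile_cols, process_tile, n_cores=4):
    """Process all tiles TAD by TAD; tiles within a TAD are independent."""
    best = 0
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        for d in range(n_tile_rows + n_tile_cols - 1):
            # Each process_tile call reads boundary rows/columns cached from
            # the two previous TADs and returns its local maximum score.
            local_maxima = pool.map(process_tile,
                                    tiles_of_tad(d, n_tile_rows, n_tile_cols))
            best = max(best, max(local_maxima))   # synchronize local maxima
    return best
```

Python threads serve here only to illustrate the dynamic scheduling; an implementation aiming at true core parallelism would use native threads or processes.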

The next subsections discuss the choice of the tile dimension, the data dependence and communication between tiles lying on different TADs, and the scheduling scheme that balances the tile tasks over the available cores.

1) Tile dimension: It is interesting to ask what the best choice for the tile dimension is. Increasing the tile dimension reduces the number of tiles per anti-diagonal and hence the communication overhead between processing cores, but too few tiles per anti-diagonal degrades resource utilization; conversely, decreasing the tile dimension increases the available parallelism at the cost of more communication. Distributed tiled wave-front solutions usually adopt a relatively large tile dimension, equal to the matrix length divided by the number of processing instances, which reduces communication overheads. In the case of cloud computing, the number of available cores per instance is usually small. This limitation recommends a finer tile granularity, so that the available cores can be fully utilized through a dynamic load balancing strategy; hence, the tile dimension recommended here is a relatively small number.

2) Data dependence and communication model: This subsection describes the communication mechanism used to obtain the data necessary for every tile computation. As shown in Figure 4, processing any tile that lies on TAD x needs boundary data from TAD x−1 and TAD x−2; for x < 2, the default values from Equations (1)–(3) are used. Therefore, the last two processed TADs are cached in the instance's local memory to make the data available for the next TAD. Internal tile cells are processed sequentially and access the Northern, Western and North-Western data from the local memory allocated for the tile.

3) Tile scheduling scheme: The adopted scheduling scheme dynamically assigns the tiles of a TAD to the available cores, where different cores process their tiles in sequence. When the number of tiles per TAD is at least double the number of cores, tiles are dynamically allocated in adapted small chunks to minimize processing interruptions. This adapted dynamic scheduling policy achieves good load balancing among the cores and enables full utilization of the cloud instance.

5.2. Solution setup and architecture
The proposed solution architecture consists of four separate workers, as depicted in Figure 5. Every worker represents a loosely coupled component that can be deployed on one or more cloud instances, according to its workload and the user configurations.


Figure 4. Data dependence between neighboring tiles. Every tile needs boundary data from the last two tiled anti-diagonals.

Figure 5. The proposed solution architecture.

1) The user interface web worker: This is a web application that represents the user interface of the system. It allows clients to upload input files and submit the other job parameters, and it is hosted by a web role instance. The input parameters are the DNA sequences in FASTA format, the gap opening and extension penalties, and the number of shuffles (N). This web worker also displays the output, which is the statistical significance estimate for the input sequences. It performs few computations, so it can be hosted on a single cloud instance shared with other workers.

2) The task manager Windows worker: This is a Windows worker that manages task processing and creates the subtasks to be processed in parallel. It obtains the task inputs from the user interface worker and assigns an ID to the new task. Next, it creates a blob container for the task to host


the uploaded input sequence files. The task parameters and the blob data references are inserted as a record into a table called the 'Inputs/Outputs' table, which represents a shared data structure between the different workers. The task manager then creates a subtask for each alignment process. Each subtask is a work item that holds the task ID, a reference to the task data record in the 'Inputs/Outputs' table, a flag indicating which sequence is to be shuffled, and a distinct generated seed. This seed is used to shuffle one of the task input sequences, according to the shuffle flag; the shuffled sequence, together with the other (unshuffled) sequence, represents the subtask input. To handle the conservative statistical significance, this worker generates 2N subtasks, where N is the number of shuffles: the first N subtasks set the shuffle flag to the first input sequence, whereas the next N subtasks set it to the second sequence. The created subtasks are pushed into the shared subtasks queue. The pseudo-code of this procedure is shown in Algorithm 1. This worker performs little processing, so it can be deployed on a single cloud instance, possibly shared with another worker.

3) The subtasks processor Windows worker: This is a parallel Windows worker, instances of which should be initiated on all assigned cloud instances. It actively queries the shared subtasks queue for one or more subtasks to process. After getting a subtask, it reads the task data record from the 'Inputs/Outputs' table and accesses the task blob to get the sequences. It then generates the shuffled sequence and applies the tiled wave-front SW parallelization of Section 5.1 to compute the alignment score. The computed score is pushed, with the task ID and the shuffle flag, into the 'Outputs' queue to be used in calculating the statistical significance. Algorithm 2 summarizes the process of this worker. This worker represents the overwhelming majority of the solution's processing; therefore, it is highly recommended to host it on all of the assigned cloud instances.

4) The task assembler Windows worker: This is a Windows worker that collects the subtasks' results from the output queue to calculate the statistical significance using Equation (4). It is a timed process that checks whether all the subtasks of every 'in-progress' task have been processed. On finding a task with a complete set of subtask scores, it obtains the scores from the 'Outputs' queue and evaluates the statistical significance twice: first considering only the scores obtained by shuffling the first sequence, and then considering the scores obtained by shuffling the second sequence. To estimate the conservative statistical significance, the minimum of the two estimates is reported, as in Equation (5). Finally, it reports the statistical significance score to the user interface web worker through the 'Inputs/Outputs' table. Algorithm 3 outlines the pseudo-code of this worker. To increase the utilization of the assigned resources, this worker can share a cloud instance with other workers.
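To make the work-item generation of Algorithm 1 concrete, here is a minimal sketch (illustrative only; the Subtask fields and the in-process queue mirror the description above and stand in for the Azure queue and table services):

```python
import random
from dataclasses import dataclass
from queue import Queue

@dataclass
class Subtask:
    task_id: str
    data_ref: str       # reference to the record in the 'Inputs/Outputs' table
    shuffle_flag: int   # 1: shuffle the first sequence, 2: shuffle the second
    seed: int           # distinct seed used to shuffle the flagged sequence

def create_subtasks(task_id: str, data_ref: str, n_shuffles: int,
                    subtasks_queue: Queue) -> None:
    """Generate the 2N work items described above and push them to the queue."""
    seeder = random.Random()
    for flag in (1, 2):                      # N subtasks per shuffled sequence
        for _ in range(n_shuffles):
            subtasks_queue.put(Subtask(task_id, data_ref, flag,
                                       seeder.getrandbits(32)))
```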


6. EXPERIMENTAL RESULTS
The proposed solution was implemented on the Microsoft Windows Azure platform; it can also be implemented on any other cloud platform. Azure instances come in four sizes to accommodate a variety of application complexities and workloads, as listed in Table I. Our experiments were first carried out on the Azure emulator as a proof of concept to fulfill the initial testing phase. Next, experiments with different sequences and numbers of instances were run on medium-size Azure instances to test the solution's performance and scalability in comparison with existing solutions. As recommended in the solution's architecture, for all of our experiments the first and the third workers were deployed to only one instance, whereas the second worker, which performs the overwhelming share of the work, was deployed to all the assigned instances. The SW scoring parameters used in the tests are: +1 for a match, −1 for a mismatch, −2 for gap opening and −1 for gap extension, with 32 as the tile size. The number of shuffles N was set to 100 for all real-time Azure experiments. Windows Azure provides automatic load balancing based on a round robin that allocates the user requests among the web workers [38]. Dynamic load balancing is used in allocating the subtasks to the computing instances, which maximizes the efficiency of the application, especially when some instances host more than one worker.

Mega cell updates per second (MCUPS) is the most common performance metric used to compare sequence analysis solutions. One MCUPS represents the complete computation of one million entries of the SW matrix per second, including all comparisons, additions and maximum computations:

$$\mathrm{MCUPS} = \frac{m \cdot n}{t \times 10^6}$$

where m and n are the sequence sizes and t is the total execution time in seconds. The performance of the proposed solution is reported in MCUPS as a standard figure that can be compared with other relevant solutions regardless of platform, parallel architecture or configuration; reporting in MCUPS will also facilitate future comparisons. For MCUPS values that were not reported in the relevant original papers, the sizes of the largest supported sequences and the best reported execution times are used to estimate the values with the equation above.

The performance and scalability were computed by recording the run-times achieved for three large sequence sizes (512 K, 1 M and 1.5 M). The 512 K pair was tested with 2 to 32 instances, the 1 M pair with 4 to 32 instances, and the 1.5 M pair with 8 to 32 instances; the larger starting counts for the 1 M and 1.5 M experiments reduce the experimental time for such huge sequences. Figure 6 shows the speedup achieved for the different sequence sizes. The speedup curves indicate that the proposed solution scales linearly with the number of instances and remains highly scalable with more instances for much larger sequences. Generally, the real speedup is quite close to the theoretical one, which confirms the efficiency of the proposed intermediate-grained parallel solution over the scalable cloud architecture.
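As a worked example of the metric (with illustrative numbers, not measurements from this paper): aligning a 1 M × 1 M pair in 500 s would yield

$$\mathrm{MCUPS} = \frac{10^6 \times 10^6}{500 \times 10^6} = 2000.$$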
The performance (in MCUPS) of the proposed solution for the three different sequence sizes (512 K, 1 M and 1.5 M) was estimated on real Azure instances, as shown in Figure 7. Overall, longer sequences exhibit better performance. This is due to the higher utilization of the cloud resources and the growth of the parallel processing relative to the roughly constant sequential processing and communication. Next, the proposed algorithm is compared with MPIPairwiseStatSig [23]. As stated in the related work section, MPIPairwiseStatSig is one of the recent parallel solutions for the problem under consideration, using a cluster of 64 nodes. The MCUPS recorded for the proposed solution and for MPIPairwiseStatSig with different numbers of instances are depicted in Figure 8.

Table I. Azure instance sizes and configurations.

Instance size   CPU            Memory    Storage    I/O performance   Rate ($/h)
Small           1.6 GHz        1.75 GB   225 GB     Moderate          0.12
Medium          2 × 1.6 GHz    3.5 GB    490 GB     High              0.24
Large           4 × 1.6 GHz    7 GB      1000 GB    High              0.48
Extra large     8 × 1.6 GHz    14 GB     2040 GB    High              0.96


Figure 6. Solution speedup for different sequence lengths.

Figure 7. Solution performance for different sequence lengths.

Figure 8. Performance comparison between the proposed solution and the MPIPairwiseStatSig solution.


Figure 9. Performance comparison between the proposed solution and the Z-Align solution.

Experiments were carried out using a sequence size of 1.6 K, the largest size reported by MPIPairwiseStatSig, for the first 32 computing nodes. According to Figure 8, the proposed solution significantly outperforms MPIPairwiseStatSig. This is because MPIPairwiseStatSig is designed only for coarse-grained parallelism: alignments are merely distributed over the available instances, without parallelizing the alignment score calculation itself, which takes considerable time. In contrast, the proposed solution adopts intermediate-grained parallelism that parallelizes the alignment score calculation in addition to the efficient, load-balanced distribution of the alignment tasks.

For the sake of comparison with the recent state-of-the-art solutions that align huge sequences, we evaluated the proposed solution against Z-Align [24], whose results were computed using a 64-node supercomputer. To ensure a fair comparison, computations were carried out using a common sequence size (1 M). Figure 9 depicts the results, revealing the efficient performance of the proposed solution relative to Z-Align. It is interesting to point out that the proposed solution utilizes the computational resources more effectively using intermediate-grained parallelism, which dramatically reduces the communication overhead between different instances, whereas Z-Align uses fine-grained parallelism between separate instances, which consumes a considerable amount of time in communication. Also, the tiled wave-front parallel strategy is more efficient when the main target is computing only the alignment score (without further details). Z-Align can be more effective for evaluating the alignment trace-back of a huge pair of sequences; however, it is still mainly designed to fully utilize its resources for a single pair of sequences. In addition, Z-Align operates on a large number of shared-memory cores that can only be afforded by some supercomputers.

Therefore, the proposed solution can be considered a complete solution for the problem under consideration, as it implements the best statistical significance evaluation technique, keeps the solution optimality, and overcomes the sequence size, computing resource and performance constraints. In addition, the solution architecture is designed to utilize the advantages provided by cloud technology and models. With its nearly linear scalability and good performance, the proposed solution shows the potential to scale to much larger sequence sizes, which provides a good advance on the important problem of evaluating the statistical significance of optimal local alignments of huge sequences.

7. CONCLUSION AND FUTURE WORK
In this paper, an efficient parallel solution is proposed for evaluating the statistical significance for a pair of huge DNA sequences using cloud-based infrastructures. We carefully designed an intermediate-grained parallelism that exploits both the cloud's distributed- and shared-memory resources.


To accelerate the addressed compute-intensive problem, a tiled wave-front parallelization is proposed to parallelize the SW score calculation over each instance's shared-memory cores, whereas the alignment tasks are dynamically distributed over the available cloud instances. The performance and scalability of the proposed solution were computed by recording the run-times achieved for three large sequence sizes (512 K, 1 M and 1.5 M), using increasing numbers of instances from 2 to 32. Overall, the longer the sequences, the better the performance, owing to better computation-to-communication ratios and hence better utilization of the cloud resources. The performance of the proposed solution also compares favorably with two recent state-of-the-art solutions: one that aligns huge sequences and one that calculates the statistical significance for small sequences. Therefore, this work can be considered the first solution to the statistical significance problem using the promising cloud technology and models. Future work will include extending the proposed solution to other sequence analysis functions that need high performance, such as near-optimal sequence alignment and multiple sequence alignment.

REFERENCES
1. Hosny AM, Shedeed HA, Hussein AS, Tolba MF. Cloud statistical significance estimation for optimal local alignment of huge DNA sequences. Informatics and Systems (INFOS 2012): Cairo, Egypt, 2012; 48–55.
2. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147(1):195–197.
3. Pearson WR. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms. Genomics 1991; 11(3):635–650.
4. Doolittle RF. Similar amino acid sequences: chance or common ancestry? Science 1981; 214(4517):149–159.
5. Karlin S, Altschul SF. Statistical significance in biological sequence analysis. Briefings in Bioinformatics 2006; 7(1):2–24.
6. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences 1990; 87(6):2264–2268.
7. Mott R. Accurate formula for P-values of gapped local sequence and profile alignments. Journal of Molecular Biology 2000; 300(3):649–659.
8. Bellgard M, Nozaki Y. Statistical evaluation and comparison of a pairwise alignment. Bioinformatics 2004; 21(8):1421–1428.
9. Bacro JN, Comet JP. Sequence alignment: an approximation law for the Z-value with applications to databank scanning. Computers and Chemistry 2001; 25(4):401–410.
10. Bastien O, Aude JC, Roy S, Marechal E. Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics. Bioinformatics 2004; 20(4):534–537.
11. Aude JC, Louis A. An incremental algorithm for Z-value computations. Computers and Chemistry 2002; 26(5):402–410.
12. Akoglu A, Striemer GM. Sequence alignment with GPU: performance and design challenges. Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing 2009; 1–10.
13. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2 [1 September 2012].
14. Google App Engine. http://code.google.com/appengine [1 September 2012].
15. Microsoft Azure. http://www.microsoft.com/windowsazure [1 September 2012].
16. Chawla V, Sogani P. Cloud computing – the future. High Performance Architecture and Grid Computing (Communications in Computer and Information Science, vol. 169). Springer: Berlin, 2011; 113–118.
17. Stein LD. The case for cloud computing in genome informatics. Genome Biology 2010; 11(5):1–7.
18. Abdelrahman TS, Manjikian N. Scheduling of wavefront parallelism on scalable shared-memory multiprocessors. International Conference on Parallel Processing 1996; 122–131.
19. Wilbur WJ, Smith TF, Waterman MS, Lipman DJ. On the statistical significance of nucleic acid similarities. Nucleic Acids Research 1984; 12(1):215–226.
20. Webb-Robertson BJ, McCue LA, Lawrence CE. Measuring global credibility with application to local sequence alignment. PLoS Computational Biology 2008; 4(5).
21. Zhang Y, Misra S, Honbo D, Agrawal A, Liao W-k, Choudhary A. Efficient pairwise statistical significance estimation for local sequence alignment using GPU. IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences 2011; 226–231.
22. Sandes EFO, Melo AC. CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010. ACM SIGPLAN Notices: New York, NY, USA, 2010; 137–146.
23. Agrawal A, Misra S, Honbo D, Choudhary A. Parallel pairwise statistical significance estimation of local sequence alignment using message passing interface library. Concurrency and Computation: Practice and Experience 2011; 23(17):2269–2279.
24. Batista RB, Melo AC, Boukerche A. Exact pairwise alignment of megabase genome biological sequences using a novel Z-align parallel strategy. IEEE International Symposium on Parallel & Distributed Processing 2009; 1–8.


25. Mitra R, Bandyopadhyay S. A parallel pairwise local sequence alignment algorithm. IEEE Transactions on NanoBioscience 2009; 8(2):139–146.
26. Roma NF, Almeida TJ. A parallel programming framework for multi-core DNA sequence alignment. International Conference on Complex, Intelligent and Software Intensive Systems 2010; 907–912.
27. Tsugawa M, Fortes AJ. CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications. IEEE International Conference on eScience 2008; 222–229.
28. Jackson J, Barga R, Lu W. AzureBlast: a case study of developing science applications on the cloud. HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing 2010; 413–420.
29. Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009; 25(11):1363–1369.
30. Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ, Wall DP. Cloud computing for comparative genomics. BMC Bioinformatics 2010; 11(7):259.
31. Gotoh O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 1982; 162(3):705–708.
32. Gusfield D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press: New York, NY, USA, 1997; 505–523.
33. Pfister G. In Search of Clusters: The Coming Battle in Lowly Parallel Computing. Prentice-Hall: Upper Saddle River, NJ, 1995. http://dl.acm.org/citation.cfm?id=207418
34. Aude JC, Glemet E, Risler JL, Henaut A, Slonimski PP, Codani JJ, Comet JP. Significance of Z-value statistics of Smith–Waterman scores for protein alignments. Computers & Chemistry 1999; 23(3):317–331.
35. Comet JP, Aude JC, Glemet E, Wozniak A, Risler JL, Henaut A, Slonimski PP, Codani JJ. Automatic analysis of large-scale pairwise alignment of protein sequences. Methods in Microbiology 1999; 28:229–244.
36. Milojicic D, Gupta A. Evaluation of HPC applications on cloud. HP Laboratories, 2011.
37. Buyya R, Pandey S, Vecchiola C. High-performance cloud computing: a view of scientific applications. Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks 2009; 4–16.
38. Ranjan R, Zhao L, Wu X, Liu A, Quiroz A, Parashar M. Peer-to-peer cloud provisioning: service discovery and load-balancing. Cloud Computing (Computer Communications and Networks). Springer: London, 2010; 195–217.
