Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995
Genetic Epidemiology, Parallel Algorithms, and Workstation Networks

Mark A. Franklin*, Michael A. Province†, Roger D. Chamberlain*, Gregory D. Peterson*

*Computer and Communications Research Center
†Division of Biostatistics
Washington University, St. Louis, Missouri

This material is based upon work supported by the National Science Foundation under grants MIP-9309658 and CCR-9021041 and the National Institutes of Health under grant GM28719.

1060-3425/95 $4.00 © 1995 IEEE

Abstract

Many interesting problems in genetic epidemiology are formulated as nonlinear optimization problems using the Gemini/Almini library of routines. Because of the wide availability of networked workstations, we investigate cost-effectively improving the performance of the Gemini/Almini library by exploiting parallelism with a set of workstations connected via a local-area network. Instrumentation of the Gemini/Almini optimization routines reveals significant potential for improving performance via parallelism. Using these instrumentation results, we identify promising targets of parallelism and discuss two preliminary implementations that demonstrate the potential benefits of cost-effective parallel implementations. By applying parallelism to the Gemini/Almini routines, we hope to improve the performance of a large number of genetic epidemiological applications.

1 Introduction

The research discussed in this paper is the direct result of the convergence of four trends associated with the genetic epidemiology and computer science/technology communities. First, researchers in the genetics community continue to develop and evaluate ever more sophisticated statistical models. This has been in response to the availability of greater quantities of population data, increasing attempts to develop more realistic models, and the enormous increase in the performance (and decrease in the cost) of computers.

This latter reason constitutes the second trend. For at least the last twenty years computer performance, at a given cost level, has been roughly doubling every two to three years. Sophisticated workstations are now typically present on most researchers' desks. Note, however, that studies have shown that these workstations, in general, are utilized at very low levels (e.g., under 5%).

The third trend relates to the current communications revolution. The ubiquitous workstations within the laboratory and corporation are now routinely networked together, permitting the sharing of data between computers. While the initial networks developed were relatively slow, optical networks and ATM (Asynchronous Transfer Mode) technologies are rapidly increasing the bandwidth and (to a lesser degree) lowering the latencies associated with such networks.

The fourth trend concerns the development of parallel computers and parallel algorithms. With the advent of single-chip microprocessors, the cost of the processor chip has become a relatively minor component of overall system costs. The result is that there has been a steady increase in interest in and development of parallel processors. One result of this is that a fairly good understanding of the essential components of parallel algorithm development has been achieved.

With these four trends coming together, it is now reasonable to tackle the important and computationally bound problems arising in genetic epidemiology by using parallel processing techniques applied to networks of workstations. In this paper, the Gemini/Almini library of routines is considered for parallelization. These routines are used extensively in problems in genetic epidemiology which require complex nonlinear optimization. This is typical of a host
of problems formulated as maximum likelihood problems. One of the goals of this work is to develop parallel versions of the Gemini/Almini libraries that can then be used for further application development. In this way, applications developers will be able to benefit from the performance advantages of parallelism without having to personally develop any parallel code. This will free the applications developer from dealing with the complexities of general parallel programming.

The processing environment assumed in this paper is a network of workstations that are connected via a local-area network, as illustrated in Figure 1. The workstations are not dedicated resources; several users may be utilizing them while parallel computations are executing. In order to facilitate cooperative work with the workstations, the Parallel Virtual Machine (PVM) [15] system is used to provide message passing and process control primitives. Since the workstations are existing machines, exploiting their idle time for parallel processing enables better utilization of machines while at the same time providing a powerful parallel computational resource. Estimates that workstations are idle for 95% of their available CPU cycles indicate that this approach has great promise [14, 16]. In the examples discussed here, diskless SUN Sparcstation SLC workstations on an Ethernet network are used for the parallel computation.

Figure 1: Distributed computing environment

Section 2 describes the Gemini/Almini library of optimization routines and indicates their use in several applications, including skumix and segpath. Section 3 presents results of our initial studies, which focus on locating the computational bottlenecks associated with these applications. The result is an identification of the gradient calculation as the most time-consuming aspect of the computation, and thus potentially the most productive candidate for parallelization. A parallel version of the gradient calculation was implemented and the resulting performance monitored. This is described in Section 4, which indicates, for example, almost linear speedup for the segpath application. That is, execution time associated with the gradient calculation decreased by almost a factor of four when executing on a network of four workstations. However, since the gradient calculation is only a portion of the overall computation (between about 50 and 70%), to achieve good overall speedup, parallelization of other components of the computation must be undertaken. This is considered in the latter part of Section 4, where a performance model of the overall computation is presented and, based on this model, overall system speedup predictions are developed. One interesting experimental observation relates to the role of function evaluation in the overall computation. Our measurements have shown that function evaluation consumes up to 94% of the overall uniprocessor execution time. Since this can usually be parallelized efficiently, it appears that achieving performance improvements of an order of magnitude is possible. A summary and discussion of future work in this area conclude the paper.
2 Problem Description
Gemini/Almini is a FORTRAN library of gradient search optimization routines developed in the 1970s specifically for genetic epidemiological applications. Written by Jean-Marc Lalouel while at the Population Genetics Laboratory in Hawaii, the routines were designed to meet the needs of statistical geneticists in performing segregation, path, and linkage analysis using the computing resources of the day [4]. Although the genetic models have grown more complex since then, as both information on the human genome and computing power have increased, the basic characteristics of the models remain the same, and these lines of research still flourish as the mainstream of genetic epidemiology today. Gemini/Almini is the primary optimization package used in some of the most popular genetics software to do commingling analysis [7], path analysis [13], segregation analysis [5], linkage analysis [6], regressive model analyses [2], and combined genetic modeling [3, 12]. One reason for the continued use of Gemini/Almini in optimizing genetics problems is that key features of this class of problem were specifically addressed in its design. The general theory of maximum likelihood lies at the heart of each of these genetic models.
This says that if we correctly characterize the probability of obtaining observed data, X, given a set of model parameters, p, as $L(p) = L(X \mid p)$ (called the likelihood function), then the values of p which maximize L(p) are good estimates with desirable properties. Thus, each problem involves repeated optimizations of a likelihood function, subject to different sets of nonlinear constraints on the parameters, p. These likelihood functions are complex and nonlinear, involving a fairly large number of unknown parameters to be estimated (from 50 to several hundred). The calculations also involve the pedigree data X, which can be quite large, coming from up to hundreds of families and thousands of individuals. For the more complicated models, the nature of the likelihood functions is such that the data cannot in general be reduced to a smaller set of sufficient statistics. The derivatives of the likelihood functions cannot be written analytically. Indeed, even calculating the functions themselves numerically involves repeated passes through the data, piecewise numerical integration of multidimensional integrals, and/or repeated inverses of variable-dimensioned, often ill-conditioned matrices. Thus, one critical, common characteristic of these problems is that every evaluation of the function to be maximized takes a very long time to calculate. Therefore, the overriding concern in the design of Gemini/Almini is to find the maximum using as few function evaluations as possible. In a parallel computation environment this suggests that a way to improve performance is to allocate different function evaluations to different processors.

Another common feature of the genetics class of problems is the imposition of nonlinear constraints in the optimization. Not all combinations of parameters are valid or even meaningful. Parameters are often nearly collinear in some parts of the parameter space, or even unidentifiable in others. Yet the true optimal values of the likelihood functions are often near or at fixed or variably constrained boundary values. Gemini/Almini is designed to give the user flexibility in handling these constraints at execution time by allowing boundaries to be sticky or not and by selectively enabling, disabling, and scaling boundaries and constraints. It also adds penalty functions to the likelihood, based upon the degree of constraint violation, to force the gradient search back into valid space.

Yet another aspect of the genetics problems is that the likelihood surface to be maximized is complex and can often have multiple local maxima or even saddle points. It is then important to check the local curvature of the space through the matrix of second derivatives as well as the gradient to ensure that at least a true local maximum has been obtained. This matrix also yields estimates of the variance-covariance matrix of the parameter estimates at the function maximum. If there are k parameters, it takes $O(k)$ function evaluations to find the gradient at any given point using numerical differentiation. However, it takes $O(k^2)$ evaluations to calculate the matrix of second derivatives. If the number of parameters is large, evaluation of the curvature at the final step can take longer than finding the maximum itself. Thus it is important to allow the user to selectively calculate this matrix at the end of the convergence process. Again, however, there is potential here for evaluating the functions in parallel, thus improving performance.

Despite the efficiency of Gemini/Almini, genetic epidemiological problems using it can take hours, days, and sometimes weeks of CPU time to complete a single optimization run. Since each problem involves tests of hypotheses, and each test is a comparison of several maximum solutions with different constraints imposed, a single analysis of a large pedigree dataset using a complex model can take months of CPU time to complete. Interestingly, for many genetic epidemiologists, this time frame has remained relatively constant since the emergence of the field, despite the exponential increase in computing power that can be brought to bear on problems. Where one might have been satisfied to do path followed by segregation analyses on nuclear families in the early 1970s, a proper analysis today might involve doing both simultaneously in one model, or combined segregation and linkage, or entertaining two-locus latent gene models, or considering multivariate phenotypes, or multipoint linkage, or QTL interval mapping, etc. In general, analysts have used the increased computing power to increase the complexity (and hopefully realism) of their models, rather than shorten their workload by sticking to the old, hopelessly naive ones. This trend is likely to continue, since genetic epidemiologists still make compromises in their models not because the data cannot support their estimation, but because the CPU time needed to fit them is prohibitive to all but the most patient. Thus, a substantial increase in the efficiency of the optimization process could have a tremendous impact on the quality as well as the quantity of genetic epidemiological research.
To investigate the viability of a parallel Gemini/Almini library, a pair of applications were chosen for initial investigation and parallel implementation. These two applications, skumix and segpath, both use the Gemini/Almini library for optimization of a likelihood function.
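The evaluation counts quoted earlier in this section (O(k) function evaluations for a numerical gradient, O(k^2) for the matrix of second derivatives) can be checked with a small counting wrapper. The following Python sketch is illustrative only: the names `CountingFunction`, `forward_gradient`, `forward_hessian`, and the `quadratic` test function are hypothetical and are not part of the FORTRAN Gemini/Almini library, and the simple forward-difference formulas are an assumption about how such derivatives might be formed.

```python
class CountingFunction:
    """Wrap a function and count how many times it is evaluated."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.func(x)


def forward_gradient(f, x, h=1e-6):
    """Forward-difference gradient: k + 1 evaluations for k parameters."""
    f0 = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - f0) / h)
    return grad


def forward_hessian(f, x, h=1e-4):
    """Forward-difference second derivatives.

    Uses 1 + k + k(k+1)/2 evaluations, which grows as O(k^2).
    """
    k = len(x)
    f0 = f(x)
    f_single = []
    for i in range(k):
        shifted = list(x)
        shifted[i] += h
        f_single.append(f(shifted))
    hess = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):  # the matrix is symmetric
            shifted = list(x)
            shifted[i] += h
            shifted[j] += h
            value = (f(shifted) - f_single[i] - f_single[j] + f0) / (h * h)
            hess[i][j] = hess[j][i] = value
    return hess


def quadratic(x):
    """Hypothetical cheap stand-in for an expensive likelihood function."""
    return sum(v * v for v in x) + x[0] * x[1]


def evaluation_counts(k):
    """Return (gradient evaluations, Hessian evaluations) for k parameters."""
    x = [0.5] * k
    f = CountingFunction(quadratic)
    forward_gradient(f, x)
    gradient_calls = f.calls
    f.calls = 0
    forward_hessian(f, x)
    return gradient_calls, f.calls
```

For k = 10 this gives 11 evaluations for the gradient and 66 for the second derivatives, which illustrates why the final curvature step can dominate when the number of parameters is large.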
The skumix model is one of the earliest genetic epidemiology models [7, 8]. It was designed to provide a very crude, quick test of the hypothesis of whether an unknown, unmeasured single major gene was having any detectable impact on a measured phenotype. Even when first proposed, the model was meant to be used more as a screening device than as a definitive test, so that a vast pool of potentially interesting phenotypes could be quickly examined to see which ones to analyze further for evidence of major genes using more sophisticated, accurate, and time-consuming techniques. Thus the model was designed to be sensitive to but not necessarily specific to major gene evidence, since it can infer major genes when there really are none operating. It is also used today by many groups to decide on the best scale of analysis of phenotypes for further study, by choosing the scale which maximizes the major gene evidence. The model has always been so simple and quick to run, even using the hardware of the 1970s, that parallelization or acceleration techniques have never been particularly needed for it. However, it has the advantage of being very simple to deal with, it is similar in structure to its more complex cousins, and it does use the Gemini/Almini library. Indeed, many of the current, state-of-the-art genetic models have the skumix model embedded in them. Thus, it makes sense to use skumix as a test problem for demonstrating the feasibility and payoff potential of parallelization techniques in genetic epidemiology models.
The skumix model assumes that a measured random variable, P, is given by a simple linear regression on a single unmeasured major gene, g. Since g is unmeasured, we cannot simply perform the linear regression in the traditional way, but must infer the effect of the gene from the distribution of the phenotype in a random sample from the population, using maximum likelihood estimation. If the gene has two alleles, A and a, then there are three possible genotypes, AA, Aa, and aa, so that the random variable g has three possible values. If the frequency of the A allele in the population is q, and the population is in Hardy-Weinberg equilibrium with random mating (with respect to this gene), then we would expect the three genotypes AA, Aa, and aa to occur in a random sample with the proportions $q^2$, $2q(1-q)$, and $(1-q)^2$, respectively. Making no assumption about the scale of the genotype, there are three parameters in the regression, corresponding to the means of the phenotype in the three genotype groups, u(AA), u(Aa), and u(aa). Skumix reparameterizes these into the three equivalent but more genetically meaningful parameters: u, the grand mean of P; t, the displacement or difference between the two homozygous means, u(aa) - u(AA); and d, the dominance factor, or ratio between the difference u(aa) - u(Aa) and t. The values d = 0 and d = 1 correspond to recessive and dominant expression of the genotype, respectively. If we make the assumption that the error term in the regression on g is normally distributed, with zero mean and variance $e^2$, then there are five basic parameters to the skumix model: q, e, u, d, and t.

The model implies that the distribution of the phenotype will be a mixture of three normal populations, corresponding to the three genotypes. If the genotypic means are sufficiently separated compared to the error variance, $e^2$, then the distribution of P will actually appear multimodal. In practice, however, major genes are not that obvious in their effect. More likely, such a mixture distribution appears unimodal but skewed and/or kurtotic to some degree. Since distributions can easily be skewed for reasons other than genetics, skumix allows for a set of general power transformation functions to adjust for skewness, so that we accept evidence for a major gene only if there is significant admixture even after correcting for skewness in a generalized power transform. Also, some populations are inbred to a certain degree, which would cause deviations from Hardy-Weinberg equilibrium proportions. Skumix adds parameters for the degree of inbreeding to add some robustness to the model.

Segpath is a general-purpose model and a flexible computer program, which has been developed to assist in the creation and implementation of a variety of genetic epidemiological models [12]. It can be used to generate programs to implement linear models for pedigree data, based upon a flexible model-specification syntax. Segpath models can perform segregation analysis, path analysis, or combined segregation and path analysis using any user-specified path model. They can be structured to analyze any number of multivariate phenotypes, environmental indices, and/or measured covariate fixed effects (including measured genotypes). Population heterogeneity models, repeated-measures models, longitudinal models, autoregressive models, developmental models, and gene-by-environment interaction models can all be created under segpath. Pedigree structures can be defined to be arbitrarily complex, and the data analyzed with programs generated by segpath can have any missing value structure, with entire individuals missing, or missing on one or more measurements. Corrections for ascertainment can be done on a vector of phenotypes and/or other measures. Because the model specification syntax is general, segpath can also be used in non-genetic applications where there is a hierarchical structure, such as longitudinal, repeated-measures, time series, or nested models.

In this paper, we have parallelized a specific segpath-generated application, called famcorr, which estimates means, variances, and familial correlations for up to a four-variate vector of related phenotypes, including all cross-trait variance-covariances within and across family members. There are 154 total possible parameters in the most general version of this model, although in practice some are fixed at sample values so that only 132 are normally estimated. The model can handle any missing data value, from entire family members who are missing, to individual elements of a phenotypic vector within a family member.
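The three-component normal mixture that underlies the skumix model described earlier in this section can be sketched as a negative log likelihood. This Python sketch is a simplification under stated assumptions: it is parameterized directly by the three genotype means rather than by skumix's (u, d, t) reparameterization, and it omits the power transformation for skewness and the inbreeding parameters. The function names are hypothetical and do not come from the skumix program.

```python
import math


def genotype_proportions(q):
    """Hardy-Weinberg proportions of AA, Aa, and aa for allele frequency q."""
    return (q * q, 2.0 * q * (1.0 - q), (1.0 - q) ** 2)


def normal_pdf(p, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (p - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_neg_log_likelihood(phenotypes, q, means, e):
    """Negative log likelihood of phenotypes under a three-normal mixture.

    means is (mean_AA, mean_Aa, mean_aa); e is the common error
    standard deviation shared by the three genotype groups.
    """
    weights = genotype_proportions(q)
    total = 0.0
    for p in phenotypes:
        density = sum(w * normal_pdf(p, m, e) for w, m in zip(weights, means))
        total -= math.log(density)
    return total
```

An optimizer such as Gemini/Almini would minimize a function of this kind over the model parameters; each call makes a full pass through the data, which is why the number of function evaluations dominates the run time.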
3 Instrumentation

As the above discussion indicates, the Gemini/Almini library is extensively used in genetic epidemiological research. We are interested in the feasibility of developing parallel implementations of Gemini/Almini that can execute on networks of workstations. We start by describing the structure of the Gemini/Almini optimization code. We then instrument the serial implementation and present the instrumentation results. Based on these results, one piece of the code (gradient calculation) was initially parallelized. The performance improvement associated with this change is presented in the next section.

The applications that use Gemini/Almini supply the function F(x) to be optimized and call the Gemini/Almini routines as needed within the applications. At its core, the Gemini/Almini algorithm iteratively performs a gradient descent to the minimum value of the cost function F(x). The maximum likelihood cost functions described in the previous section are typically transformed into negative log likelihood functions to be minimized. Each iteration consists of calculating the direction of search, bracketing the minimum along the search direction and moving to that minimum, computing the new gradient, and updating the Cholesky factors that determine the subsequent search direction. Once convergence has been detected, the local curvature is evaluated to ensure the solution is at a true local minimum. See Figure 2 for a pseudocode description of the algorithm.

    while (optimization has not converged) do
        d ← direction of search
        new x ← bracket minimum F(x) along d
        g ← gradient at x using forward or central difference
        update Cholesky factors using x and g
    endwhile
    evaluate the local curvature around F(x)

Figure 2: Gemini/Almini algorithm

Understanding the timing requirements of an application is beneficial in determining the potential benefit of parallelization. Being able to state the relative timing requirements of various parts of the application also enables one to identify tasks that are suitable for parallelization and that are performance bottlenecks for the computation [9].

The Gemini/Almini library was instrumented and measurements were taken using both the skumix and segpath applications. The results of instrumentation are summarized in Table 1. The initialization, the four steps of an iteration, and the evaluation of the curvature after convergence were explicitly timed; the "other" time represents operations not included in the initialization or the steps of the Gemini/Almini optimization routine. These operations include checking that the nonlinear constraints are met, computing the standard error of the estimated parameters, and checking convergence to a solution. In these runs, skumix is used to estimate parameters for three distributions and segpath is estimating a number of covariance values for blood pressure and body fat measures within families. The run time for segpath is significantly longer than for skumix because the segpath optimization involves a function of over a hundred variables while skumix optimizes over fewer than ten variables. As expected, as the problem size increases (with segpath as opposed to skumix), the fraction of time spent on work outside of the Gemini/Almini optimization function decreases.

For Gemini/Almini, a substantial amount of time is spent finding the gradient for each iteration. This task represents 50-70% of the total run time of the serial skumix and segpath algorithms, and thus it was selected for initial parallelization. This effort also provides perspective on the potential for improving the overall performance of the Gemini/Almini libraries (and the large number of applications that use the libraries) through the use of parallel processing.
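The kind of per-phase timing described in this section can be sketched with a small accumulator. This is a minimal Python illustration under stated assumptions, not the instrumentation actually added to the FORTRAN library: the `PhaseTimer` class and the phase names are hypothetical, and the clock is injectable only so that the behavior can be checked deterministically.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate elapsed time per named phase of a computation."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # callable returning the current time in seconds
        self.totals = {}

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and add it to the total for this phase."""
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def fractions(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Wrapping each step of the optimization loop in `with timer.phase("gradient"):` style blocks and printing `timer.fractions()` at the end would yield a breakdown of the same form as Table 1.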
Table 1: Instrumentation Results

    Task                        skumix    segpath
    Calculate new gradient      54%       71%
    Update Cholesky factors     < 1%      < 1%
    Evaluate curvature          9%        24%
    Other computations          3%        1%

4 Parallel Performance

4.1 Initial Parallel Algorithm

Based on the instrumentation results in the previous section, the gradient calculation was identified for initial parallel implementation. To compute the gradient, either a forward or central difference is computed along each dimension of the function. The forward difference ($g^f$) requires one function evaluation along each dimension (since the first term needs to be computed only once), while the central difference ($g^c$) requires two along each dimension.

$$g^f_i = \frac{F(x_1, \ldots, x_i + \Delta, \ldots, x_k) - F(x_1, \ldots, x_i, \ldots, x_k)}{\Delta}$$

$$g^c_i = \frac{F(x_1, \ldots, x_i + \frac{\Delta}{2}, \ldots, x_k) - F(x_1, \ldots, x_i - \frac{\Delta}{2}, \ldots, x_k)}{\Delta}$$

Here, $g_i$ represents the derivative in the ith dimension and $\Delta$ is the distance used to approximate the derivative. Assuming that function evaluations do not modify the data or system state (which is true for the problems of interest), each function evaluation within the gradient calculation is independent of the others, and therefore the evaluations can be performed in any order without interaction.

For the parallel Gemini/Almini library, the gradient calculation is computed as shown in Figure 3. In the parallel code, a master process spawns slave processes, sends the slaves copies of the data, and performs all the computations other than gradient calculations. When a gradient calculation must be performed, the master process assigns each slave an equal number of the function evaluations. When the slaves have completed their computations and sent the results to the master, it proceeds to the next iteration until the optimization is complete.

Figure 3: Parallel Gemini/Almini gradient calculation

Since each of the workstations used as a slave processor is shared with other users, the background load on the workstations can perturb the performance results. In order to minimize these effects and better understand the potential performance improvement due to parallelism, performance statistics were collected when other users were not using the workstations. The performance implications of shared computational resources are considered explicitly in [10] and [11].
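The master/slave division of the forward-difference gradient described in this subsection can be sketched as follows. This Python sketch makes several assumptions that should be read as such: a thread pool stands in for the PVM slave processes (threads in CPython would not give real speedup for CPU-bound likelihoods), the helper names `partition`, `slave_task`, and `parallel_forward_gradient` are hypothetical, and the evaluations are split into nearly equal contiguous chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(indices, n_slaves):
    """Split indices into n_slaves contiguous chunks of nearly equal size."""
    base, extra = divmod(len(indices), n_slaves)
    chunks, start = [], 0
    for s in range(n_slaves):
        size = base + (1 if s < extra else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks


def slave_task(f, x, dims, delta):
    """Evaluate F at x shifted by delta along each assigned dimension."""
    results = []
    for i in dims:
        shifted = list(x)
        shifted[i] += delta
        results.append((i, f(shifted)))
    return results


def parallel_forward_gradient(f, x, n_slaves=4, delta=1e-6):
    """Forward-difference gradient with evaluations farmed out to workers."""
    f0 = f(x)  # the shared first term is computed once by the master
    chunks = partition(list(range(len(x))), n_slaves)
    grad = [0.0] * len(x)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        futures = [pool.submit(slave_task, f, x, dims, delta)
                   for dims in chunks if dims]
        for future in futures:  # the master gathers results from each slave
            for i, value in future.result():
                grad[i] = (value - f0) / delta
    return grad
```

Because each evaluation only reads x and the data, the chunks can be processed in any order, which is the independence property the text relies on.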
4.2 Performance Results
Figure 4 presents the speedup for the parallel gradient calculation alone (ignoring the remainder of the computation) when finding the best parameter estimates for three unskewed distributions with skumix and estimating 132 values from the covariance structure for segpath. Speedup is defined as the ratio of single-processor run time to parallel-processor run time. Despite the addition of overhead to support the remote computations (primarily due to communication delays), the performance is good, with the gradient calculation running twice as fast for skumix and four times faster for segpath using four slave processors when compared to the serial version.
Figure 5 shows the total algorithm execution time, divided into parallel gradient calculations and the rest of the computation, for the two Gemini/Almini applications. The difference in time between the serial version and the parallel version executed on one processor gives a good estimate of the additional overhead of the parallel implementation. With the use of parallel gradient calculations, the overall run time of skumix is reduced by approximately 25%, and that of segpath by over 50%, even when the additional overhead resulting from handling the coordination and data interchange between multiple processors is considered.

Although the parallel performance results are quite good when one focuses on the gradient calculations themselves, Amdahl's law [1] indicates that there is a limit to the performance gain which can be achieved when only a portion of an application is parallelized. In its simplest form, if 1/n of an application's execution is serial, the maximum performance gain is bounded by n. For example, the gradient calculation is roughly 1/2 of the execution time for skumix. Even if gradient calculations took zero time, a complete skumix run would only execute twice as fast. This implies that for a parallel Gemini/Almini implementation to have scalable performance (i.e., the execution time continues to improve as additional processors are made available), the parallelization effort must include more than just the gradient calculations.

Reexamining the instrumentation results of Table 1, the bracketing iteration step and evaluation of the curvature after convergence are seen to be the next largest components of the execution time. As is the case for the gradient calculation, these consist almost entirely of independent evaluations of the cost function F(x). As a result, the parallelization techniques applied to the gradient calculation are also applicable to bracketing and evaluation of curvature.
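The Amdahl's law argument above can be checked numerically. The Python sketch below is illustrative and ignores parallel overhead; the function names are hypothetical, and the inputs used in the note that follows (gradient fractions of 54% and 71%, gradient speedups of two and four) are the approximate values reported in this paper.

```python
def amdahl_limit(parallel_fraction):
    """Upper bound on overall speedup when only parallel_fraction is sped up."""
    return 1.0 / (1.0 - parallel_fraction)


def overall_speedup(parallel_fraction, part_speedup):
    """Overall speedup when the parallel part alone runs part_speedup times faster."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / part_speedup)


def runtime_reduction(parallel_fraction, part_speedup):
    """Fraction of the original run time that is saved (overhead ignored)."""
    return 1.0 - 1.0 / overall_speedup(parallel_fraction, part_speedup)
```

With half of the run time in the gradient, `amdahl_limit(0.5)` gives the bound of two stated above. `runtime_reduction(0.54, 2.0)` is about 0.27 and `runtime_reduction(0.71, 4.0)` is about 0.53, in line with the reported reductions of approximately 25% and over 50%.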
Figure 4: Speedup for parallel gradient calculation (panels: Speedup for Skumix, Speedup for Segpath; horizontal axis: Number of Processors, 1 to 4)

4.3 Performance Models and Potential

To explore the potential performance associated with this further parallelization, it is useful to develop a performance model. Such a model also helps to predict the behavior of the parallel program as parameters such as problem size and number of processors change. Because each slave process is independent of the others and performs the same computations, a simple model can be used to characterize the behavior of the parallel computation.

The execution time of the parallel Gemini/Almini is modeled in three parts: a serial component, a parallel component, and a parallel overhead component. The serial component, denoted $t_{serial}$, represents the time to complete calculations that are not or cannot be parallelized. The parallel component, denoted $t_{parallel}$, represents the work that can be completed in parallel. The parallel overhead component, $t_{paroverhead}$, consists of the communications overhead incurred in sending the computations to the slave processors and the overhead associated with control of the slaves. If we define $R_P$ as the run time of a Gemini/Almini application on P slave processors, then the model has the form [9]:

$$R_P = t_{serial} + \frac{\beta \, t_{parallel}}{P} + t_{paroverhead}$$
where the scale factor $\beta$ represents the load imbalance of the parallel computation (i.e., variation in the amount of work performed at each of the slaves). Note that $\beta = 1$ indicates perfect load balance. As $\beta$ increases, there is more load imbalance and the performance of the application is degraded. With this model, one can express the uniprocessor runtime by removing the parallel overhead and load imbalance factor and setting $P = 1$:

$$R_1 = t_{serial} + t_{parallel}$$
Speedup is defined as the ratio of the uniprocessor runtime to the parallel runtime:

$$S_P = \frac{R_1}{R_P}$$
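The three-part runtime model and the speedup definition can be sketched in a few lines of Python. This is only an illustration of the formulas as stated in the text: the class and function names are hypothetical, and the numeric values in the example are placeholders, not measurements from skumix or segpath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    """Parameters of the three-part runtime model (all times in seconds)."""
    t_serial: float       # work that is not or cannot be parallelized
    t_parallel: float     # work that can be divided among the slaves
    t_paroverhead: float  # communication and slave-control overhead
    beta: float = 1.0     # load imbalance factor; 1.0 means perfect balance


def parallel_runtime(m: ModelParams, p: int) -> float:
    """R_P = t_serial + beta * t_parallel / P + t_paroverhead."""
    if p < 1:
        raise ValueError("need at least one slave processor")
    return m.t_serial + m.beta * m.t_parallel / p + m.t_paroverhead


def uniprocessor_runtime(m: ModelParams) -> float:
    """R_1 = t_serial + t_parallel (no overhead, no load imbalance)."""
    return m.t_serial + m.t_parallel


def speedup(m: ModelParams, p: int) -> float:
    """S_P = R_1 / R_P."""
    return uniprocessor_runtime(m) / parallel_runtime(m, p)


# Placeholder values chosen for illustration only (assumption, not paper data).
example = ModelParams(t_serial=6.0, t_parallel=94.0, t_paroverhead=1.0, beta=1.1)
for p in (1, 2, 4, 8, 16):
    print(f"P={p:2d}  R_P={parallel_runtime(example, p):7.2f} s  "
          f"S_P={speedup(example, p):5.2f}")
```

Holding `beta` and `t_paroverhead` fixed while sweeping `p`, as the loop does, mirrors the kind of prediction the model is used for: the serial and overhead terms do not shrink with `p`, so they bound the achievable speedup.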
To determine the values of $t_{serial}$ and $t_{parallel}$, the instrumentation results described earlier are used. The load imbalance factor $\beta$ and parallel overhead $t_{paroverhead}$ are measured from the parallel implementations. Figure 6 plots predicted speedup for both skumix and segpath under the following assumptions. First, the gradient calculation, bracketing, and curvature evaluation are all parallelized. Second, the parallel efficiency is assumed to be equal to the measured parallel efficiency of the gradient calculations (i.e., the values for $t_{paroverhead}$ and $\beta$ do not change). Third, additional processors are used to execute the applications (up to eight for skumix and 16 for segpath).
Figure 5: Execution time (panels: Skumix, Segpath; bars: Serial, P=1, P=2, P=3)

The results show only modest performance for the skumix application, primarily because, for such a small problem size, the serial component of the algorithm is a significant fraction of the overall execution time. Also, the load imbalance is greater since there is a limited number of function evaluations to be divided among the slave processors. This is not likely to be the case for larger problems, as
we see with segpath. Nevertheless, the predicted skumix execution time decreases by a factor of three on eight processors, for an application with a total serial execution time of under one minute. There is a clear performance benefit for the segpath application, with a predicted speedup of just over 10 on 16 processors. The key to success here is that a large fraction of the overall execution time falls in the parallel component of the algorithm and therefore benefits from parallelism. Correspondingly, the serial and parallel overhead components are a small fraction of the execution time and contribute little to the overall performance results.

To explore the limits of parallel performance using the techniques described here, the uniprocessor code was reinstrumented to record the total elapsed time spent performing function evaluations throughout the application (i.e., over all entries in Table 1). For skumix the fraction is 88%, and for segpath the fraction is 94%. (This is slightly lower than the sum of the bracketing, gradient, and curvature code since these sections of code are not exclusively function evaluation. Also, the reporting of instrumentation results at 1% resolution leads to some roundoff error.) For segpath, Amdahl's law therefore limits the speedup to approximately 16, indicating that the proposed parallel implementation will perform near its theoretical performance limits. To further increase performance, additional parallelization techniques will have to be considered (e.g., executing individual function evaluations in parallel).
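The Amdahl limit can be related to the runtime model as follows. This is a sketch under two stated assumptions: the measured function-evaluation fraction $f$ is treated as the parallelizable fraction of the uniprocessor runtime, and the idealized case $\beta = 1$, $t_{paroverhead} = 0$ is taken.

$$f = \frac{t_{parallel}}{t_{serial} + t_{parallel}}, \qquad \lim_{P \to \infty} S_P = \frac{t_{serial} + t_{parallel}}{t_{serial}} = \frac{1}{1 - f}$$

With $f = 0.94$ (segpath) the bound is $1/0.06 \approx 16.7$, and with $f = 0.88$ (skumix) it is $1/0.12 \approx 8.3$. Because the fractions are reported at 1% resolution, these bounds are themselves only approximate.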
5 Conclusions
This paper has investigated the viability of using an existing set of networked workstations as a parallel execution platform for genetic epidemiology research. The Gemini/Almini library, used extensively in the field, was partially parallelized to explore the potential for performance gains via parallel processing. The results show that (for large enough problems) significant performance improvements are possible, without requiring users of the Gemini/Almini library to deal with the complexities of general parallel programming. By executing independent function evaluations on separate workstations, we predict an order of magnitude performance improvement using a set of 16 workstations. This is close to the theoretical speedup limit of 16 imposed by Amdahl's law. To achieve this performance, however, the bracketing and curvature evaluation routines must be parallelized in addition to the gradient calculation routine.
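The idea of farming out independent function evaluations can be illustrated with a minimal master/slave sketch. Several things here are assumptions made for illustration and are not taken from the Gemini/Almini source: the forward-difference formula, the step size `h`, the function names, and the use of a local thread pool as a stand-in for slave processes on separate workstations (a thread pool shows the structure only and will not by itself speed up CPU-bound Python code).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def parallel_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    h: float = 1e-6,
    workers: int = 4,
) -> List[float]:
    """Forward-difference gradient; every evaluation of f is an independent task."""

    def perturbed(i: int) -> List[float]:
        # Copy x and shift only coordinate i by the step h.
        y = list(x)
        y[i] += h
        return y

    # One base point plus one perturbed point per parameter.
    points = [list(x)] + [perturbed(i) for i in range(len(x))]

    # The "master" hands each point to a worker; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(f, points))

    f0 = values[0]
    return [(fi - f0) / h for fi in values[1:]]


def quadratic(v: Sequence[float]) -> float:
    """Hypothetical objective used only to exercise the sketch."""
    return sum((i + 1) * vi * vi for i, vi in enumerate(v))


print(parallel_gradient(quadratic, [1.0, 2.0, 3.0]))
```

Because the evaluations share no state, the caller's objective function needs no changes, which is the property that lets library users avoid general parallel programming.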
Figure 6: Predicted performance with further parallelization (speedup versus number of processors, for skumix and segpath)
To exceed a speedup of 16, it will be necessary to incorporate additional parallelization techniques. It is highly likely that individual function evaluations can be executed in parallel. This, however, requires the cooperation of the application programmer. These techniques would be required to exploit alternative execution platforms (e.g., MPP systems such as the Intel Paragon or Cray T3D).

The above results assume the available processors are idle and form a homogeneous set. We are interested in expanding this investigation into the use of heterogeneous processors and non-idle workstations. We have developed performance models that predict the performance of synchronous iterative algorithms (of which Gemini/Almini is an example) executing on a set of shared, heterogeneous processors [10, 11], and plan to use these models to guide the assignment of work to slave processors (e.g., send more work to faster, lightly loaded machines).

Finally, we plan to make the parallelized Gemini/Almini library available to the research community. This will help advance the overall goal of improving the productivity of researchers within the genetic epidemiology community.

References

[1] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. In AFIPS Conference Proceedings, pages 483-485, AFIPS Press, 1967.
[2] G. E. Bonney, G. M. Lathrop, and J.-M. Lalouel. Combined Linkage and Segregation Analysis using Regressive Models. American Journal of Human Genetics, 43:29-37, 1988.
[3] S. J. Hasstedt. A Mixed Model Likelihood Approximation for Large Pedigrees. Computational Biomedical Research, 15:295-307, 1982.
[4] J.-M. Lalouel. Gemini - A Computer Program for Optimization of General Nonlinear Functions. Tech. Rep. 14, Population Genetics Laboratory, University of Hawaii, December 1979.
[5] J.-M. Lalouel and N. E. Morton. Complex Segregation Analysis with Pointers. Human Heredity, 31:312-321, 1983.
[6] G. M. Lathrop and J.-M. Lalouel. Easy Calculations of Lod Scores and Genetic Risks on Small Computers. American Journal of Human Genetics, 36:460-465, 1984.
[7] C. J. MacLean, N. E. Morton, R. C. Elston, and S. Yee. Skewness in Commingled Distributions. Biometrics, 32:695-699, 1976.
[8] N. E. Morton, D. C. Rao, and J.-M. Lalouel. Methods in Genetic Epidemiology. Karger, 1983.
[9] G. D. Peterson and R. D. Chamberlain. Beyond Execution Time: Expanding the Use of Performance Models. IEEE Parallel & Distributed Technology, 2(2):37-49, 1994.
[10] G. D. Peterson and R. D. Chamberlain. Sharing Networked Workstations: A Performance Model. In Symp. on Parallel and Dist. Processing, October 1994.
[11] G. D. Peterson and R. D. Chamberlain. Stealing Cycles: Can We Get Along? In Hawaii Int'l Conf. on System Sciences, January 1995.
[12] M. A. Province and D. C. Rao. A General Purpose Model and a Computer Program for Combined Segregation and Path Analysis (SEGPATH): Automatically Creating Computer Programs from Symbolic Language Model Specifications. Genetic Epidemiology (in press), 1994.
[13] D. C. Rao, M. McGue, C. Glueck, and R. Wette. Path Analysis in Genetic Epidemiology. In Human Population Genetics: The Pittsburgh Symposium, Hutchinson Ross Publishing Co., 1983.
[14] Michael Schrage. Piranha Processing - Utilizing Your Down Time. HPCwire (Electronic Newsletter), August 1992.
[15] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, December 1990.
[16] Louis H. Turcotte. A Survey of Software Environments for Exploiting Networked Computing Resources. Technical Report MSU-EIRS-ERC-93-2, NSF Engineering Research Center for Computational Field Simulation, Mississippi State University, Starkville, MS, June 1993.