RESEARCH

Efficient Computation of Ridge-Regression Best Linear Unbiased Prediction in Genomic Selection in Plant Breeding

H. P. Piepho,* J. O. Ogutu, T. Schulz-Streeck, B. Estaghvirou, A. Gordillo, and F. Technow

ABSTRACT

Computational efficiency of procedures for genomic selection is an important issue when cross-validation is used for model selection and evaluation. Moreover, limited computational resources may be a bottleneck when processing large datasets. This paper reviews several options for computing ridge-regression best linear unbiased prediction (RR-BLUP) in genomic selection and compares their computational efficiencies when using a mixed model package. Attention is also given to the problem of a singular genetic variance-covariance matrix. Annotated code is provided for implementing and evaluating the methods using the MIXED procedure of SAS. It is concluded that a recently proposed method based on a spectral decomposition of the variance-covariance matrix of the data is preferable to established methods because of its superior computational efficiency and because it remains applicable when the genetic variance-covariance matrix is singular.

H.P. Piepho, J.O. Ogutu, T. Schulz-Streeck, and B. Estaghvirou, Bioinformatics Unit, Institute of Crop Science, University of Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany; A. Gordillo, AgReliant Genetics, LLC, 4640 East State Road 32, Lebanon, IN 46052; F. Technow, Institute of Plant Breeding, University of Hohenheim, Fruwirthstrasse 21, 70599 Stuttgart, Germany. Received 11 Nov. 2011 *Corresponding author ([email protected]). Abbreviations: BLUP, best linear unbiased prediction; GS, genomic selection or genome-wide selection; MET, multienvironment trials; MME, mixed model equations; REML, restricted maximum likelihood; RR-BLUP, ridge-regression best linear unbiased prediction; SNP, single nucleotide polymorphism.

Published in Crop Sci. 52:1093–1104 (2012). doi: 10.2135/cropsci2011.11.0592 © Crop Science Society of America | 5585 Guilford Rd., Madison, WI 53711 USA. All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

In genomic selection (GS), marker information is used to predict breeding values or genotypic values of both tested and untested genotypes (Meuwissen et al., 2001). In plant breeding, GS procedures are usually applied to phenotypic data from multienvironment trials (MET). Analysis can be done either in a single stage, using a mixed model for plot data, or in two stages, in which adjusted means per genotype across the MET are computed in the first stage and are then submitted to a procedure for GS, the standard procedure being ridge-regression best linear unbiased prediction (RR-BLUP) (Piepho, 2009; Iwata and Jannink, 2011; Heslot et al., 2012). We focus here on the two-stage approach because it is computationally much more efficient than a single-stage approach (Möhring and Piepho, 2009). Model selection and evaluation of the predictive performance for GS is often done by cross-validation procedures, meaning that the same mixed model analysis needs to be performed many times. In such applications, computational efficiency is an important consideration. This paper, therefore, compares different computational strategies to perform RR-BLUP.

Consider the following mixed model for adjusted means of tested genotypes (Piepho, 2009):

$$ y = 1_n \mu + Zu + e, \qquad [1] $$

in which $y$ is an $n$-vector of adjusted means per genotype, $1_n$ is an $n$-vector of ones, $\mu$ is a common intercept, $Z$ is an $n \times p$ covariate matrix of $p$ single-nucleotide polymorphism (SNP) markers for $n$ tested genotypes, and $u$ is a vector of $p$ random SNP effects with $u \sim N(0, I_p \sigma_u^2)$, in which $I_p$ is a $p$-dimensional identity matrix, $\sigma_u^2$ is the variance of SNP effects, and $e$ is a residual error associated with $y$, assumed to follow $e \sim N(0, R)$. Either an independent estimate of $R$ is available from the analysis that yielded adjusted means $y$ and can be plugged in here, regarding it as a fixed quantity (Möhring and Piepho, 2009), or we set $R = I_n \sigma^2$, in which $\sigma^2$ is a residual error variance that needs to be estimated. We prefer the former approach because it correctly represents the error structure of $y$ while setting $R = I_n \sigma^2$ constitutes an approximation, although this approximation may have the advantage of accounting for possible unexplained genetic variance. Under the assumed model the variance of the observed data is $V = \mathrm{var}(y) = \Gamma \sigma_u^2 + R$, in which $\Gamma = ZZ^T$ and $Z^T$ denotes the transpose of $Z$.

The objective of analysis for GS may be twofold:

(i) Estimation of the genotypic value of the tested genotypes,

$$ g = Zu, \qquad [2] $$

and

(ii) Estimation of the genotypic values of untested genotypes,

$$ g_0 = Z_0 u, \qquad [3] $$

in which $Z_0$ is the $n_0 \times p$ marker data matrix of $n_0$ untested genotypes. It is assumed here that both objectives are to be achieved either simultaneously or successively as new untested genotypes are identified. In the latter case, it is particularly convenient to obtain estimates of $u$ and store these for later use. We review computational options for these two tasks and discuss their merits and demerits. Particular attention is given to the problem of possible singularity of $\Gamma$ and to the implementation in SAS (SAS Institute, 2004).

It will be assumed that model [1] is to be fitted to the data directly using restricted maximum likelihood (REML) (Piepho, 2009), which we consider superior to alternative methods that estimate the SNP variance by dividing a separate estimate of the total genetic variance by the number of SNPs (Meuwissen et al., 2001; Habier et al., 2007; Bernardo and Yu, 2007; Crossa et al., 2010).

We identified eight methods, which are compared on the basis of four broad criteria: (i) whether they provide estimates of the random marker effects ($u$) and tested and untested genotype effects ($g$ and $g_0$), (ii) specific requirements, for example, that the number of markers ($p$) is larger than the number of genotypes ($n$) or that the variance-covariance matrix of genotypes ($\Gamma$) is positive-definite, (iii) when they are most useful, for example, when $p > n$ and when $\Gamma$ is singular, and (iv) relative timings on the same dataset. All eight methods produce identical estimates and are based on the same basic model [1]. They differ only in the way the model is formulated and implemented and therefore in their efficiencies in specific situations.

METHODS TO FIT MODEL [1]

Method 1

The $(p + 1)$ mixed model equations (MME) for $\mu$ and $u$ in model [1] can be solved to obtain the best linear unbiased prediction (BLUP) of $u$:

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} Z \\ Z^T R^{-1} 1_n & Z^T R^{-1} Z + I_p \hat\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat u \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ Z^T R^{-1} y \end{pmatrix}. $$

Use of these MME involving $u$ is preferable when $p$ is small relative to $n$. The genotypic value of tested and untested genotypes can then be estimated by evaluating Eqs. [2] and [3], respectively. In case $R = I_n \sigma^2$ the MME can be rewritten as

$$ (\tilde Z^T \tilde Z + \hat\lambda^2 D)\,\tilde u = \tilde Z^T y, $$

in which $\tilde Z = (1_n, Z)$, $D = (0 \oplus I_p)$ with $\oplus$ denoting the direct sum (Searle et al., 1992), $\hat\lambda^2 = \hat\sigma^2 / \hat\sigma_u^2$, and $\tilde u^T = (\mu, u^T)$. This has the solution

$$ \tilde u = (\tilde Z^T \tilde Z + \hat\lambda^2 D)^{-1} \tilde Z^T y, $$

which is known as the “ridge-regression” formulation of BLUP (Ruppert et al. [2003], p. 100). Therefore, the name ridge-regression BLUP (RR-BLUP) is used for the method.
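The ridge-regression formulation can be illustrated with a short numerical sketch. It is written in Python with NumPy rather than SAS, uses a small simulated dataset, and treats the variance components as known instead of estimating them by REML; all variable names and numerical settings are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 8                                   # hypothetical case with p < n
Z = rng.integers(0, 2, size=(n, p)).astype(float)   # 0/1 marker codes
u_true = rng.normal(0.0, 0.5, size=p)
y = 10.0 + Z @ u_true + rng.normal(0.0, 1.0, size=n)

# Variance components are treated as known here (assumption);
# in practice they would be REML estimates.
sigma2, sigma2_u = 1.0, 0.25
lam2 = sigma2 / sigma2_u                       # ridge parameter lambda^2

Z_tilde = np.column_stack([np.ones(n), Z])     # (1_n, Z)
D = np.diag(np.r_[0.0, np.ones(p)])            # 0 (+) I_p: intercept is not penalized

# Solve (Z~' Z~ + lambda^2 D) u~ = Z~' y
u_tilde = np.linalg.solve(Z_tilde.T @ Z_tilde + lam2 * D, Z_tilde.T @ y)
mu_hat, u_hat = u_tilde[0], u_tilde[1:]
g_hat = Z @ u_hat                              # Eq. [2] for the tested genotypes
```

With $R = I_n \sigma^2$, the solution should coincide with that of the $(p + 1)$ MME given above, because the two systems differ only by the factor $\sigma^{-2}$.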

Method 2

When $p > n$, the number of MME for $u$ can become prohibitively large. In maize (Zea mays L.), for example, it is now common to use on the order of $p = 50{,}000$ SNPs. In animal breeding, even larger values of $p$ are common. Solving the MME for $u$ with such large values of $p$ will be either impossible or prohibitively costly with most current mixed model packages. Therefore, for $p > n$, it is more efficient to rewrite model [1] as

$$ y = 1_n \mu + g + e, $$

in which $g = Zu$, $\mathrm{var}(g) = G = \Gamma \sigma_u^2$, and $e$ is defined as in model [1]. The $(n + 1)$ MME for $\mu$ and $g$ are:

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} \\ R^{-1} 1_n & R^{-1} + \hat G^{-1} \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat g \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ R^{-1} y \end{pmatrix}. $$

Then, provided that $\Gamma$ is positive-definite and therefore invertible, the BLUP of $u$ can be obtained as (Henderson, 1977)

$$ \hat u = Z^T \Gamma^{-1} \hat g, \qquad [4] $$

from which $g_0$ can be computed using Eq. [3]. Method 2 generally does not work when $p < n$, in which case $\Gamma$ is singular and therefore has no inverse. The case $p < n$ is expected to keep playing a role in plant breeding, particularly in crops such as maize, in which decay of linkage disequilibrium is slow, so that a relatively small number of markers may give reasonable predictions (Albrecht et al., 2011; Zhao et al., 2012).

Moreover, $\Gamma$ may become singular when $p \ge n$, for example when two genotypes happen to have identical marker profiles or when $Z$ is mean-centered by columns (VanRaden, 2008). When $p$ is very large, it is not likely that two genotypes will have identical marker profiles. But there is also the possibility of linear dependencies among several marker profiles that would render $\Gamma$ singular, and such cases do occur occasionally even when $p > n$ (Christian Riedelsheimer, Institute of Plant Breeding, University of Hohenheim, personal communication, 2011). In animals, identical twins or clones will give rise to singular $\Gamma$ (VanRaden, 2008). In some cases, data pruning techniques such as deletion of identical twins can be used to circumvent the problem of singularities. For routine use of a mixed model package, however, it is convenient if the computational approach can deal with any kind of singularity in $\Gamma$ without the need for data pruning.
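As a hedged illustration of method 2, the following Python/NumPy sketch solves the $(n + 1)$ MME for $\mu$ and $g$ and then back-transforms to marker effects with Eq. [4]. The data are simulated, $R = I_n$ is assumed, and $\sigma_u^2$ is treated as known; these choices and all variable names are assumptions for illustration, not part of the SAS implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n0, p = 10, 4, 40                           # hypothetical case with p > n
Z = rng.integers(0, 2, size=(n, p)).astype(float)
Z0 = rng.integers(0, 2, size=(n0, p)).astype(float)
y = 5.0 + Z @ rng.normal(0.0, 0.3, size=p) + rng.normal(0.0, 1.0, size=n)

sigma2_u = 0.09                                # treated as known (assumption)
R_inv = np.eye(n)                              # R = I_n (assumption)
Gamma = Z @ Z.T
G_inv = np.linalg.inv(Gamma * sigma2_u)        # requires positive-definite Gamma

one = np.ones((n, 1))
# (n + 1) mixed model equations for mu and g
C = np.block([[one.T @ R_inv @ one, one.T @ R_inv],
              [R_inv @ one,         R_inv + G_inv]])
rhs = np.concatenate([one.T @ R_inv @ y, R_inv @ y])
sol = np.linalg.solve(C, rhs)
mu_hat, g_hat = sol[0], sol[1:]

u_hat = Z.T @ np.linalg.solve(Gamma, g_hat)    # Eq. [4]
g0_hat = Z0 @ u_hat                            # Eq. [3]
```

If two rows of `Z` were identical, `Gamma` would be singular and both the inversion of $\hat G$ and Eq. [4] would fail, which is the limitation discussed above.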

Method 3a

In case of singular $\Gamma$, the linear mixed model and the MME may be modified in various ways as proposed by Harville (1976) and advocated by Henderson (1984, p. 48). For example, the MME can be modified as

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} \hat G \\ \hat G R^{-1} 1_n & \hat G R^{-1} \hat G + \hat G \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat d \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ \hat G R^{-1} y \end{pmatrix}. $$

From the solution, we can compute $\hat g = \hat G \hat d$. This is essentially the method that is automatically invoked by the MIXED procedure of SAS in case of singular $\hat G = \Gamma \hat\sigma_u^2$.

Method 3b

Alternatively, we may rewrite model [1] as

$$ y = 1_n \mu + Lh + e, $$

in which $g = Lh$, which uses a decomposition $\Gamma = LL^T$, and $h \sim N(0, I_r \sigma_u^2)$, in which $r = \mathrm{rank}(\Gamma)$ and $L$ is an $n \times r$ matrix. One way to obtain $L$ is via a spectral decomposition of $\Gamma$, that is, $L = U_\Gamma \,\mathrm{diag}(\lambda_1^{0.5}, \dots, \lambda_r^{0.5})$, in which $U_\Gamma$ is an $n \times r$ eigenvector matrix corresponding to the $r$ nonzero eigenvalues $\lambda_1, \dots, \lambda_r$ of $\Gamma$. The $(r + 1)$ MME for this model are

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} L \\ L^T R^{-1} 1_n & L^T R^{-1} L + I_r \hat\sigma_u^{-2} \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat h \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ L^T R^{-1} y \end{pmatrix}. $$

From the solution of these MME, we may compute $\hat g = L \hat h$. The advantage of this particular variant of Harville’s method is that it formulates the model so that the variance-covariance of the fitted random effects ($h$) is positive-definite. Therefore, the method can be applied also with mixed model packages that cannot otherwise deal with singular $\Gamma$. Of course, it also works for positive-definite $\Gamma$. These two variants of Harville’s method (methods 3a and 3b) are fine for computing $\hat g$. But we cannot obtain $\hat u$ or $\hat g_0$ from $\hat g$ as per Eq. [4] when $\Gamma$ is singular, so the method is of limited practical value when the prediction of breeding values for untested genotypes is needed.

Method 4a

Methods 3a and 3b are designed to work when $\Gamma$ is singular. While they provide estimates of $g$ directly, they will not produce estimates of $u$ and $g_0$ in this case. An extension of Harville’s ideas can be used to also obtain estimates of $g_0$ (but not of $u$ when $\Gamma$ is singular). We may include $g_0$ in the mixed model for tested genotypes, that is, we may re-express model [1] as

$$ y = 1_n \mu + W_a g_a + e, \qquad [5] $$

in which $W_a = (I_n, 0)$ and $g_a^T = (g^T, g_0^T)$ is the augmented genotypic effect vector that concatenates effects of tested and untested genotypes. The key idea of this approach is to exploit the covariance between $g_0$ and $g$, which permits estimating $g_0$ without direct observations on the untested genotypes and, in fact, without estimating marker effects $u$. We have

$$ \mathrm{var}(g_a) = G_a = \begin{pmatrix} ZZ^T & ZZ_0^T \\ Z_0 Z^T & Z_0 Z_0^T \end{pmatrix} \sigma_u^2 = \Gamma_a \sigma_u^2. $$

The augmented MME are

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} W_a \\ W_a^T R^{-1} 1_n & W_a^T R^{-1} W_a + \hat G_a^{-1} \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat g_a \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ W_a^T R^{-1} y \end{pmatrix}. $$

When $\Gamma_a$ is found to be singular, we may solve the modified MME as for method 3a (Harville, 1976; Henderson, 1984, p. 48)

$$ \begin{pmatrix} 1_n^T R^{-1} 1_n & 1_n^T R^{-1} W_a \hat G_a \\ \hat G_a W_a^T R^{-1} 1_n & \hat G_a W_a^T R^{-1} W_a \hat G_a + \hat G_a \end{pmatrix} \begin{pmatrix} \hat\mu \\ \hat d_a \end{pmatrix} = \begin{pmatrix} 1_n^T R^{-1} y \\ \hat G_a W_a^T R^{-1} y \end{pmatrix} $$

and then compute $\hat g_a = \hat G_a \hat d_a$. Again, this method is automatically invoked by the MIXED procedure of SAS in case $\hat G_a = \Gamma_a \hat\sigma_u^2$ is found to be singular.

Method 4b

Alternatively, we may proceed as for method 3b, that is, we can rewrite the genetic effect as $g_a = L_a h_a$ using a decomposition $\Gamma_a = L_a L_a^T$, in which $L_a$ is an $n_a \times r_a$ matrix, $r_a = \mathrm{rank}(\Gamma_a)$, $n_a = n + n_0$, and $h_a \sim N(0, I_{r_a} \sigma_u^2)$. Therefore, the model can be rewritten as

$$ y = 1_n \mu + M_a h_a + e, \qquad [6] $$

with $M_a = W_a L_a$. Solving the $(r_a + 1)$ MME yields the BLUP of $h_a$, from which we compute $\hat g_a = L_a \hat h_a$. To implement the method in a mixed model package, model [6] needs to be fitted explicitly, which requires the computation of $M_a$ before invoking the mixed model package to fit random coefficients $h_a$. Note that method 4b works with singular and nonsingular $\Gamma_a$. When $\Gamma$ is positive-definite, we may use Eq. [4] to obtain the BLUP of $u$. Otherwise, an estimate of $u$ is not available with methods 4a and 4b.
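The decomposition used in methods 3b and 4b can be sketched numerically as follows. This is a minimal Python/NumPy illustration, not the SAS implementation: the data are simulated, $R = I_n$ is assumed, $\sigma_u^2$ is treated as known, and the tolerance used to decide which eigenvalues are nonzero is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n0, p = 8, 5, 6                             # p < n + n0, so Gamma_a is singular
Z = rng.integers(0, 2, size=(n, p)).astype(float)
Z0 = rng.integers(0, 2, size=(n0, p)).astype(float)
y = 2.0 + Z @ rng.normal(0.0, 0.5, size=p) + rng.normal(0.0, 1.0, size=n)
sigma2_u = 0.25                                # treated as known; R = I_n (assumptions)

# Augmented Gamma_a for tested and untested genotypes
Z_a = np.vstack([Z, Z0])
Gamma_a = Z_a @ Z_a.T

# Spectral decomposition: keep the eigenvectors with nonzero eigenvalues
lam, U = np.linalg.eigh(Gamma_a)
keep = lam > 1e-10 * lam.max()                 # tolerance is an illustrative choice
L_a = U[:, keep] * np.sqrt(lam[keep])          # Gamma_a = L_a L_a^T
r_a = L_a.shape[1]

W_a = np.hstack([np.eye(n), np.zeros((n, n0))])
M_a = W_a @ L_a                                # n x r_a design matrix of model [6]

# (r_a + 1) MME for mu and h_a with R = I_n
X = np.column_stack([np.ones(n), M_a])
P = np.diag(np.r_[0.0, np.full(r_a, 1.0 / sigma2_u)])
sol = np.linalg.solve(X.T @ X + P, X.T @ y)
mu_hat, h_hat = sol[0], sol[1:]

g_a_hat = L_a @ h_hat
g_hat, g0_hat = g_a_hat[:n], g_a_hat[n:]       # tested and untested genotypes
```

Because the fitted random effects `h_hat` have a positive-definite variance-covariance matrix, the same code path applies whether or not `Gamma_a` has full rank.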

Method 5

This method is particularly useful when $\hat G = \Gamma \hat\sigma_u^2$ is not positive-definite but $p > n$ and the number of untested genotypes ($n_0$) is high compared with the number of tested genotypes ($n$). Because $\hat V$ is generally positive-definite even when $\hat G$ is not, we can always predict random effects $u$ by

$$ \hat u = \hat\sigma_u^2 Z^T \hat V^{-1} (y - 1_n \hat\mu), \qquad [7] $$

which can be derived in various ways, including using the conditional expectation of $u$, given $y$, derived from the joint distribution of $y$ and $u$ (Searle et al., 1992, section 7.4). The genotypic value of tested and untested genotypes ($g$ and $g_0$, respectively) can then be estimated by evaluating Eqs. [2] and [3], respectively.

We can use any method to obtain $\hat V$, including those used with methods 1 to 4. We recommend using the same procedure for estimating $V$ as used in method 2 when $\Gamma$ is positive-definite and in methods 3a and 3b when $\Gamma$ is not positive-definite. Note that the only difference between methods 2 and 5 is that method 2 uses Eq. [4] to predict $u$ whereas method 5 uses Eq. [7].
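Equation [7] can be sketched in a few lines of Python/NumPy. The example deliberately makes $\Gamma$ singular by giving two genotypes identical marker profiles; the data are simulated, $R = I_n$ is assumed, and $\sigma_u^2$ is treated as known rather than estimated, so this is an illustration of the algebra only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n0, p = 12, 3, 50
Z = rng.integers(0, 2, size=(n, p)).astype(float)
Z[1] = Z[0]                                    # identical profiles -> singular Gamma
Z0 = rng.integers(0, 2, size=(n0, p)).astype(float)
y = 4.0 + Z @ rng.normal(0.0, 0.2, size=p) + rng.normal(0.0, 1.0, size=n)

sigma2_u = 1.0 / p                             # treated as known (assumption)
R = np.eye(n)                                  # R = I_n (assumption)
Gamma = Z @ Z.T
V = sigma2_u * Gamma + R                       # positive-definite although Gamma is not

one = np.ones(n)
# Generalized least squares estimate of the intercept
mu_hat = (one @ np.linalg.solve(V, y)) / (one @ np.linalg.solve(V, one))

# Eq. [7]: BLUP of marker effects without inverting Gamma
u_hat = sigma2_u * Z.T @ np.linalg.solve(V, y - one * mu_hat)
g_hat = Z @ u_hat                              # Eq. [2]
g0_hat = Z0 @ u_hat                            # Eq. [3]
```

Only the $n \times n$ matrix `V` is factorized, which is why the approach remains feasible when $p$ is much larger than $n$.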

Method 6

This method also uses Eq. [7] to obtain $\hat u$ but a different approach than methods 1 to 4 to estimate $V$. The method is restricted to the case in which $R = I_n \sigma^2$. In case $R$ does not meet this assumption, we can always apply a linear transformation to ensure $R = I_n \sigma^2$ (Piepho et al., 2011) provided that $R$ is known. Therefore, we would replace $y$ with $L_R y$ and $Z$ with $L_R Z$, in which $R^{-1} = (L_R)^2$ such that $L_R$ is square and symmetric (Rao et al., 2008, p. 151). $L_R$ is easily obtained from a spectral decomposition of $R^{-1}$. With these replacements, analysis can proceed assuming that $R = I_n \sigma^2$ with $\sigma^2 = 1$.

We use a modification of the method proposed by Kang et al. (2008), accounting for the fact that in our case the residual variance $\sigma^2$ (or $R$) is estimated in the first stage and can be regarded as known in the second stage, which requires a different parameterization compared to Kang et al. (2008). The crux of the approach is to express the likelihood as a sum of independent terms, which avoids any computationally costly matrix operations in the numerical maximization of the likelihood and therefore speeds up computing time compared with other methods.

Let $S = I_n - X(X^T X)^{-} X^T$, in which $X$ is a design matrix of fixed effects pertaining to adjusted means. When $R = I_n \sigma^2$, then $X = 1_n$, while when linear transformation is needed we have $X = L_R 1_n$. Consider the spectral decomposition

$$ SVS = [U_R, W_R]\,\mathrm{diag}(\sigma_u^2 \lambda_1 + \sigma^2, \dots, \sigma_u^2 \lambda_{n-q} + \sigma^2, 0, \dots, 0)\,[U_R, W_R]^T = U_R\,\mathrm{diag}(\sigma_u^2 \lambda_1 + \sigma^2, \dots, \sigma_u^2 \lambda_{n-q} + \sigma^2)\,U_R^T, \qquad [8] $$

in which $U_R$ is an $n \times (n - q)$ eigenvector matrix corresponding to the $(n - q)$ nonzero eigenvalues of the matrix product $SVS$, which take the form $\sigma_u^2 \lambda_j + \sigma^2$ ($j = 1, \dots, n - q$), in which $\lambda_j$ are constants and $W_R$ is an $n \times q$ eigenvector matrix corresponding to the zero eigenvalues. An important property of this spectral decomposition is that the matrix of eigenvectors $U_R$ does not depend on the variance components $\sigma_u^2$ and $\sigma^2$ (Kang et al., 2008). It is noteworthy for the sake of completeness that there is an indeterminacy in $U_R$ when $\Gamma = ZZ^T$ is not full rank. In this case, the smallest nonzero eigenvalue will display a multiplicity, so the corresponding eigenvectors in $U_R$ are not unique, that is, they can be orthogonally rotated without altering $SVS$. However, this non-uniqueness does not affect the REML log-likelihood computations below and so does not pose any problem.

To compute the matrix $U_R$, we can temporarily fix the variance components $\sigma_u^2$ and $\sigma^2$ at some arbitrary values and then compute the decomposition, from which we obtain $U_R$. Plugging in the fixed values of the variance components, we can then find the corresponding $\lambda_j$ ($j = 1, \dots, n - q$). The REML log-likelihood of $y$ is given by the likelihood of $Ay$ in which $A$ is chosen such that $S = AA^T$ and $A^T A = I$ (Patterson and Thompson, 1971; Harville, 1976). It can be shown that $S = U_R U_R^T$ and $I = U_R^T U_R$ (Appendix A). Therefore, the REML log-likelihood is the log-likelihood of

$$ \eta = U_R^T y \sim N(0, D), \qquad [9] $$

in which $D = \mathrm{diag}(\sigma_u^2 \lambda_1 + \sigma^2, \dots, \sigma_u^2 \lambda_{n-q} + \sigma^2)$ (Appendix B), meaning that (apart from a constant not depending on the parameters) the log-likelihood ($f_R$) involves a sum of $(n - q)$ terms:

$$ f_R(\sigma_u^2, \sigma^2) = -\frac{1}{2} \left\{ \sum_{j=1}^{n-q} \log(\sigma_u^2 \lambda_j + \sigma^2) + \sum_{j=1}^{n-q} \left[ \eta_j^2 / (\sigma_u^2 \lambda_j + \sigma^2) \right] \right\}. \qquad [10] $$

This could be profiled for $\sigma^2$ if we want to estimate both variance components simultaneously, leading to a result similar to that in Kang et al. (2008) but using a different parameterization. Here, we fix $\sigma^2$ at a value known from stage one. The log-likelihood in Eq. [10] can be maximized by standard procedures, such as the Newton-Raphson method (Searle et al., 1992).

Once the genetic variance has been computed, BLUPs of $u$, $g$, and $g_0$ still have to be computed. Here we use the same procedure as with method 5, that is, we first determine $\hat V^{-1}$ and then use Eq. [7] to get estimates of $u$ from which we compute estimates of $g$ and $g_0$ using Eqs. [2] and [3], respectively.
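The spectral form of the REML log-likelihood can be illustrated with the following Python/NumPy sketch. It follows the steps described above (fix the variance components at arbitrary values, decompose $SVS$, recover $\lambda_j$ and $\eta$), but it replaces the Newton-Raphson maximization by a crude grid search to keep the example short. The simulated data, the grid limits, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 200
Z = rng.integers(0, 2, size=(n, p)).astype(float) / np.sqrt(p)  # rescaled markers
y = 3.0 + Z @ rng.normal(0.0, 1.0, size=p) + rng.normal(0.0, 1.0, size=n)
sigma2 = 1.0                                   # residual variance known from stage one

X = np.ones((n, 1))                            # intercept only, so q = 1
S = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
Gamma = Z @ Z.T

# Temporarily fix sigma2_u = sigma2 = 1 and decompose S V S (Eq. [8])
V0 = Gamma + np.eye(n)
d, U = np.linalg.eigh(S @ V0 @ S)
keep = d > 0.5                                 # nonzero eigenvalues equal lambda_j + 1 >= 1
U_R = U[:, keep]                               # n x (n - q), free of variance components
lam = d[keep] - 1.0                            # lambda_j recovered from the fixed values
eta = U_R.T @ y                                # Eq. [9]

def reml_loglik(s2u):
    """Eq. [10]: REML log-likelihood (up to a constant) as a sum of n - q terms."""
    v = s2u * lam + sigma2
    return -0.5 * (np.sum(np.log(v)) + np.sum(eta**2 / v))

# Grid search as a simple stand-in for Newton-Raphson (illustrative only)
grid = np.exp(np.linspace(np.log(1e-4), np.log(1e2), 601))
sigma2_u_hat = grid[np.argmax([reml_loglik(s) for s in grid])]
```

Each likelihood evaluation costs only $O(n - q)$ operations once `lam` and `eta` are available, which is the source of the speed advantage described in the text.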

Hints on Computational Issues

Good starting values for the variance components are crucial for speedy convergence of the REML algorithm. A simple way to obtain a good starting value for the SNP variance $\sigma_u^2$ is to run a mixed model analysis without marker information, in which independent genotypic effects $g_i$ are assumed, and then to divide the resulting estimate of the genetic variance by the number of markers ($p$). This will usually yield a starting value that has the same order of magnitude as the final REML estimate under an RR-BLUP analysis. The method of computing a starting value proposed here has been used by other authors to obtain the final estimate of $\sigma_u^2$. We do not think that this is a good method because the model with independent genotypic effects is not generally commensurate with the model underlying RR-BLUP, which assumes correlated genotypic effects (Piepho, 2009). We therefore think it is better to estimate $\sigma_u^2$ using the same model that is used to compute RR-BLUP.

When more than one variance component needs to be estimated by REML, convergence can be improved by ensuring that all variance parameters have the same order of magnitude. For example, when $p$ is very large, then $\sigma_u^2$ becomes very small compared with $\sigma^2$, which may slow convergence. One option is to multiply all the elements of $Z$ and $Z_0$ by $p^{-1/2}$, meaning that $\sigma_u^2$ is multiplied by $p$ while $\Gamma$ and $\Gamma_a$ are divided by $p$. This rescaling ensures that $\sigma_u^2$ and $\sigma^2$ have about the same order of magnitude. Savings in computing time from suitable rescaling may be nontrivial, particularly when many rounds of cross-validation need to be performed.

Matrix computations used before and after calling a REML procedure, such as computation of $\Gamma$ from $Z$ or evaluation of Eqs. [4] and [7] for methods 2 and 5, respectively, can be done using facilities for matrix algebra such as the IML procedure of SAS, as long as the matrices are not too large. When both $p$ and $n$ are very large (for example, $p = 50{,}000$), the currently available packages and computer memory may not always allow computations involving $Z$, an $n \times p$ matrix. In this case, matrix operations can be made feasible by processing $Z$ in parts. Upon partitioning $Z$ as

$$ Z = (Z_1, Z_2, \dots, Z_K), $$

we can express $\Gamma$ as

$$ \Gamma = \sum_{k=1}^{K} Z_k Z_k^T. \qquad [11] $$

This suggests reading $Z$ in parts as submatrices $Z_k$ ($k = 1, 2, \dots, K$) and successively augmenting $\Gamma$ according to Eq. [11]. Similarly, $u = (u_1^T, u_2^T, \dots, u_K^T)^T$ in Eq. [7] can be evaluated in parts $u_k$ corresponding to $Z_k$, and likewise the BLUP of $g$ can be evaluated in a componentwise fashion using

$$ g = \sum_{k=1}^{K} Z_k u_k. \qquad [12] $$

Equation [4] can be evaluated similarly. In each case, memory can be saved by releasing the space occupied by $Z_k$ once $\hat u$, $\hat g$, or $\Gamma$ have been updated. The number of submatrices ($K$) can be chosen such that each $Z_k$ is small enough to be processed by the package.
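The blockwise evaluation of Eqs. [11], [7], and [12] can be sketched as follows in Python/NumPy. The helper names `accumulate_gamma` and `blockwise_u_and_g` are hypothetical, and `np.array_split` merely stands in for reading column blocks of $Z$ from disk; the variance component is treated as known and $R = I_n$ is assumed.

```python
import numpy as np

def accumulate_gamma(blocks, n):
    """Eq. [11]: Gamma = sum_k Z_k Z_k^T, one column block at a time."""
    Gamma = np.zeros((n, n))
    for Zk in blocks:
        Gamma += Zk @ Zk.T                     # Zk can be released after this update
    return Gamma

def blockwise_u_and_g(blocks, w, sigma2_u):
    """Eqs. [7] and [12] by blocks, with w = V^{-1} (y - 1_n mu_hat)."""
    u_parts, g = [], np.zeros(w.shape[0])
    for Zk in blocks:
        uk = sigma2_u * Zk.T @ w               # block of marker effects u_k
        u_parts.append(uk)
        g += Zk @ uk                           # running sum for Eq. [12]
    return np.concatenate(u_parts), g

rng = np.random.default_rng(5)
n, p, K = 15, 60, 4
Z = rng.integers(0, 2, size=(n, p)).astype(float)
y = 1.0 + Z @ rng.normal(0.0, 0.2, size=p) + rng.normal(0.0, 1.0, size=n)
sigma2_u = 1.0 / p                             # treated as known (assumption)

def read_blocks():
    # Stand-in for reading Z in K parts from disk
    return iter(np.array_split(Z, K, axis=1))

Gamma = accumulate_gamma(read_blocks(), n)
V = sigma2_u * Gamma + np.eye(n)               # R = I_n (assumption)
one = np.ones(n)
mu_hat = (one @ np.linalg.solve(V, y)) / (one @ np.linalg.solve(V, one))
w = np.linalg.solve(V, y - mu_hat)
u_hat, g_hat = blockwise_u_and_g(read_blocks(), w, sigma2_u)
```

Only one block $Z_k$ and the $n \times n$ matrices need to be held in memory at any time, so $K$ can be increased until each block fits.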

Implementation in SAS

When estimating the residual variance $\sigma^2$ along with the SNP-marker variance $\sigma_u^2$, we frequently observed convergence problems when the default settings of the MIXED procedure were used. We found that the SIGITER option, which suppresses profiling of the likelihood for $\sigma^2$, solved the problem in most cases. Therefore, we recommend using the SIGITER option (or, equivalently, the newer NOPROFILE option) whenever $\sigma^2$ and $\sigma_u^2$ need to be estimated in the same analysis. In practice, however, the data analyzed for GS will mostly be adjusted genotype means from replicated trials, often repeated across locations, so an independent estimate of the variance-covariance $R$ of adjusted means will be available and this should be used instead of reestimating the residual variance. To input a fixed matrix $R$ into a mixed model analysis, one may use a LIN(1) structure in the REPEATED statement, providing $R$ as a dataset using the LDATA = option (Supplemental File S1).

Methods 1 to 4 involve MME, which will be solved by suitably specifying the mixed model in the MIXED procedure. Method 5 does not involve MME. Therefore, Eq. [7] needs to be computed separately using estimates of variance components and fixed effects obtained from the MIXED procedure. For method 6, the log-likelihood in Eq. [10] was maximized using the NLMIXED procedure, which implements a Newton-Raphson algorithm. Estimation of $g_a$ and $u$ requires extra computations in IML.

Simulation can be used to explore the performance (computing time) of methods 1 to 6 for different configurations of the data ($n$, $n_0$, and $p$). The SAS code provided as Supplemental File S2 illustrates componentwise processing of $Z$ for the computation of $\Gamma$ for all methods. The SNP data are generated by assigning alleles based on random draws from a binary distribution with a probability parameter equal to 0.5. Errors are simulated as independent draws from a standard normal distribution, and accordingly the residual variance is fixed at unity in calls of the MIXED procedure. This mimics the practice in two-stage analysis of computing adjusted means and their associated variance-covariance $R$ in the first stage and then submitting the adjusted means to RR-BLUP, with $R$ fixed at its estimate from stage one, or some approximation thereof (Möhring and Piepho, 2009). Note that we can always rotate the data so that $R = I_n$, as explained in the section on method 6 (Piepho et al., 2011). We set $\sigma_u^2 = \sigma^2 / p$ to yield a heritability of 0.5. In case $p > n$, a singular $\Gamma$ can be generated by setting SNP alleles of genotype 2 equal to those of genotype 1, say. Starting values for runs of the MIXED procedure were set to the true values, which is a best-case scenario as regards computing time.
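The simulation setup just described, together with the $p^{-1/2}$ rescaling from the computational hints, can be mimicked in a few lines. This Python/NumPy sketch is not the code of Supplemental File S2; the dimensions, the seed, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 100                                 # hypothetical dimensions with p > n
sigma2 = 1.0                                   # residual variance fixed at unity
sigma2_u = sigma2 / p                          # SNP variance as set in the text

# SNP alleles from a binary distribution with probability 0.5
Z = rng.binomial(1, 0.5, size=(n, p)).astype(float)
Z[1] = Z[0]                                    # genotype 2 copies genotype 1 -> singular Gamma

u = rng.normal(0.0, np.sqrt(sigma2_u), size=p)
e = rng.normal(0.0, np.sqrt(sigma2), size=n)   # independent standard normal errors
y = Z @ u + e

Gamma = Z @ Z.T
rank_Gamma = np.linalg.matrix_rank(Gamma)      # smaller than n because of the duplicate

# Rescaling: multiply Z by p^(-1/2); sigma2_u is multiplied by p, Gamma divided by p
Z_s = Z / np.sqrt(p)
Gamma_s = Z_s @ Z_s.T
sigma2_u_s = p * sigma2_u                      # now of the same order as sigma2
```

The genetic variance-covariance $\Gamma \sigma_u^2$ is unchanged by the rescaling, so predictions of $g$ and $g_0$ are unaffected while the two variance parameters become comparable in size.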

Comparison of Methods

Table 1 summarizes the situations in which each method is useful and its requirements and whether predictions for $u$ are provided. A very important question is whether $g_0$ can be estimated easily at a later stage for an arbitrary number of new genotypes without having to rerun the full mixed model analysis each time. The answer is “yes” for methods that can operate directly on $\hat u$ in computing $\hat g_0$, that is, methods 1, 2, 5, and 6. These methods should be most useful for routine application in breeding programs. If $p > n$, as will almost certainly be the rule in the foreseeable future, methods 2, 5, and 6 remain viable options, with methods 5 and 6 being the only ones able to deal with all cases, including singular $\Gamma$. Therefore, methods 5 and 6 appear to be the best suited for routine application.

Table 1. A comparison of the five methods for ridge-regression best linear unbiased prediction (RR-BLUP).

| Method | Provides estimate of u? | Requirements | Most useful when |
|---|---|---|---|
| 1 | Yes | None | p < n |
| 2 | Yes | p > n and Γ positive-definite | p > n and Γ positive-definite |
| 3a and 3b | Only when Γ is positive-definite | None | p > n and Γ singular and only want to estimate g |
| 4a and 4b | Only when Γ is positive-definite | None | p > n and Γ singular and only want to estimate g and g0 |
| 5 | Yes | None | p > n and Γ singular |
| 6 | Yes | None | p > n |

In cross-validation, we need to repeatedly analyze partitions of all tested genotypes into a training set (corresponding to effects $g$) and a validation set (corresponding to effects $g_0$). For each partition, a new mixed model analysis needs to be performed, with predictions needed only for the validation set, so estimates of $u$ are not strictly required. In this case, method 4 is a potential alternative, but a comparison of computing times shows that it can be much slower than methods 5 and 6 when using the MIXED procedure of SAS and when both $n$ and $n_0$ are relatively large (see below).

We compared the performance of methods 2, 4a, 4b, 5, and 6 in each of 16 scenarios (defined in terms of $n$, $n_0$, and $p$) that we think approximately match current practice in GS in plant breeding programs. The same parameters and effects were estimated by each method, that is, the variance component $\sigma_u^2$ and the effects $u$, $g_0$, and $g$. The only exceptions were methods 2, 4a, and 4b when $p < n$, in which case $u$ could not be estimated. All computations were done on a 64-bit Windows 7 workstation with 8 GB RAM and an Intel Core Quad 2.66 GHz processor. Computing times reported in Table 2 and Fig. 1 for each scenario are averages across 10 simulation runs (increasing the number of simulation runs to 50 did not change the results).

For each scenario we compared computing times of the five methods using t tests (α = 5%) adjusted for multiplicity using simulation adjustment. The t tests were based on a mixed model with fixed effects for method and scenario and their associated interaction and a random effect for replicate simulation runs nested within scenarios. The total computing times were log-transformed for normality and to enhance variance homogeneity across scenarios before performing the statistical comparisons.

The results show that all pairwise differences between methods are statistically significant in each of the 16 scenarios. The only exception was the comparison of methods 2 and 5, which was nonsignificant in all investigated cases. For all scenarios, method 6 is the fastest followed by methods 2 and 5, method 4b, and method 4a, in decreasing order of speed. Minor deviations from this pattern occurred only for method 4b and only in those few instances (e.g., scenarios 9 and 13) in which the rank of $\Gamma$ is smaller than the number of genotypes, rendering strict comparison of the total computing time for methods 4a and 4b with those for the other methods inappropriate because methods 4a and 4b cannot be used to estimate $u$ when $\Gamma$ is singular.

Using methods 4a and 5, the MIXED procedure of SAS automatically uses modified MME in case $\hat G_a = \Gamma_a \hat\sigma_u^2$ is found to be singular, which can adversely affect the computing time. The timings for methods 2 and 5 were similar because the two methods use the same solution to the

WWW.CROPS.ORG

CROP SCIENCE, VOL. 52, MAY– JUNE 2012

CROP SCIENCE, VOL. 52, MAY– JUNE 2012

WWW.CROPS.ORG

1099

250 250 250 250 500 500 500 500 750 750 750 750 1000 1000 1000 1000

n

250 250 250 250 500 500 500 500 750 750 750 750 1000 1000 1000 1000

No.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

250 250 250 250 500 500 500 500 750 750 750 750 1000 1000 1000 1000

n0

500 5000 10000 50000 500 5000 10000 50000 500 5000 10000 50000 500 5000 10000 50000

p

500 5000 10000 50000 500 5000 10000 50000 500 5000 10000 50000 500 5000 10000 50000

p

Method 2

0.039 0.255 0.500 2.441 0.116 0.926 1.808 8.909 0.269 2.032 3.988 19.649 0.891 8.011 18.023 78.420

0.708 0.629 0.637 0.746 5.329 5.034 5.363 5.731 45.815 29.874 29.796 31.499 129.646 82.968 99.921 90.255

0.110 0.331 0.618 2.985 0.278 1.175 2.203 10.362 . 5.305 9.351 42.252 . 9.877 16.654 73.117

0.043 0.326 0.669 3.213 0.135 1.082 2.129 10.362 0.465 4.075 8.062 40.059 0.812 7.248 14.797 70.938

Estimation of u, g, and g 0 in IML

Method 5

0.010 0.552 0.550 0.672 4.908 4.681 4.932 5.425 . 27.932 27.786 29.209 . 74.660 89.068 82.220

Computation MIXED of Γ in IML analysis§

0.030 0.254 0.497 2.444 0.116 0.925 1.838 9.055 0.271 2.106 4.139 20.202 0.902 8.090 15.746 79.804

MIXED analysis, Estimation Computation estimation of u and g 0 of Γ in IML† in IML of g‡

0.790 1.210 1.806 6.400 5.580 7.042 9.300 25.000 46.549 35.980 41.846 91.207 131.349 98.227 132.741 239.613

Total time

0.040 1.137 1.665 6.101 5.302 6.781 8.973 24.842 . 35.343 41.276 91.663 . 92.627 121.468 235.141

Total time 5.831 5.007 4.965 6.225 62.995 80.246 84.703 93.635 238.192 295.368 295.524 307.974 576.546 654.526 784.080 733.978

6.032 6.234 7.339 17.841 64.233 89.476 102.531 183.606 240.707 318.537 340.402 526.184 581.212 696.092 863.395 1161.243

Method 6

0.073 0.300 0.546 2.551 0.310 1.120 2.026 9.289 . 4.956 8.837 39.242 . 9.620 16.422 87.781

Total time#

0.035 0.253 0.492 2.449 0.116 0.927 1.818 9.049 0.262 2.072 4.084 19.939 0.880 7.865 16.082 79.270

0.096 0.098 0.098 0.097 0.653 0.703 0.711 0.705 2.403 3.090 3.087 3.055 5.688 7.712 7.830 7.748

0.145 0.197 0.201 0.328 0.074 0.158 0.174 0.107 0.070 0.080 0.076 0.067 0.064 0.056 0.069 0.059

0.039 0.036 0.037 0.041 0.190 0.195 0.186 0.193 1.116 1.119 1.101 1.074 2.744 2.758 2.820 2.771



Method 4b

0.758 1.600 2.515 9.678 5.044 15.380 23.042 87.382 17.331 43.011 60.993 200.615 40.538 89.905 120.807 351.775

0.040 0.346 0.688 3.423 0.135 1.114 2.175 10.855 0.479 4.259 8.497 41.596 0.804 7.251 14.729 72.105

0.355 0.930 1.516 6.338 1.168 3.097 5.064 20.909 4.330 10.620 16.845 65.731 10.180 25.642 41.530 161.953

Total time

1.802 1.607 1.629 1.914 2.301 36.358 37.532 41.154 6.600 122.621 123.069 126.700 16.501 266.402 309.724 301.827

0.070 0.298 0.552 2.557 0.327 1.124 2.025 9.290 . 4.931 8.769 39.119 . 9.615 16.381 77.672

MIXED Computation analysis, of Γa and Ma g and g 0 Estimation in IML in IML of u in IML¶

Estimation Computation Computation NLMIXED Computation of u, g, and of Γ in IML of η †† analysis of V–1 g 0 in IML

0.128 0.927 1.828 9.065 0.928 8.110 15.802 80.682 2.137 18.213 36.041 178.968 3.624 31.946 62.893 339.484

MIXED Computation analysis, Estimation of Γa in IML g and g 0§ of u in IML¶

Method 4a

† IML, MIXED, and NLMIXED: SAS Institute, 2004.
‡ When Γ is singular, method 2 cannot solve the mixed model equations and therefore cannot estimate u in scenarios 9 and 13.
§ Method 4a uses modified mixed model equations to handle the singularity of Γa in scenarios 5, 9, and 13, whereas method 5 does the same to handle singular Γ in scenarios 9 and 13.
¶ If Γ is singular, u cannot be computed by methods 4a and 4b; in this case, no computing time is reported for u.
# If Γ is singular, the total time is italicized for methods 4a and 4b.
†† Computation of η involves a spectral decomposition of SVS.



250 250 250 250 500 500 500 500 750 750 750 750 1000 1000 1000 1000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

n0

Scenarios

n

No.

Scenarios

2.630 3.505 4.696 14.149 7.672 52.862 62.599 137.826 24.301 170.563 192.831 366.434 58.074 365.922 446.912 731.274

Total time#

Table 2. The time (in seconds) taken by methods 2 to 6 for each of the 16 simulated scenarios. The total timings for all pairs of the five methods are significantly different at each of the 16 scenarios (α = 5%; pairwise t tests with simulation adjustment for multiplicity) except methods 2 and 5, which are not significantly different.

Figure 1. Changes in the absolute differences between the timings of methods 5 and 6 (methods 4a and 6) in seconds (left panels) and in the ratio of the timings of method 5 to method 6 (method 4a to method 6) (right panels) as functions of the number of genotypes at fixed numbers of markers (p = 500, 5000, 10,000, and 50,000) and as functions of the number of markers at fixed number of genotypes (n + n0 = 250 + 250, n + n0 = 500 + 500, n + n0 = 750 + 750, and n + n0 = 1000 + 1000). The vertical whiskers indicate the standard errors. The label Tmethodi on the vertical axes denotes the computing time of method i.


WWW.CROPS.ORG

CROP SCIENCE, VOL. 52, MAY– JUNE 2012

MME when Γ is positive-definite. The main source of difference between the timings of the two methods is that $\hat{V}^{-1}$ is computed and saved by the MIXED procedure for method 5 but is not required by method 2. Even so, method 5 is clearly advantageous relative to method 2 because it can be used regardless of whether Γ is singular, whereas method 2 breaks down when Γ is singular, as it did in scenarios 9 and 13.

Although the main focus of our comparisons is between methods within individual scenarios, we also examined the pattern of differences between methods across scenarios to explore how the timings varied with increasing number of genotypes or markers. Generally, differences in computing times among the methods increased with increasing number of genotypes for fixed numbers of markers or with increasing number of markers for fixed numbers of genotypes, resulting in two distinct patterns across scenarios (Table 2; Fig. 1). For scenarios 1 to 4, which have the smallest numbers of genotypes, the absolute differences between all pairs of total computing times were minor but statistically significant (except between methods 2 and 5). However, the differences grew larger with increasing numbers of genotypes and markers (Table 2; Fig. 1, left panels; method 2 was not considered in the graphical comparisons because its performance was very similar to that of method 5). A few unexpected differences in the rank ordering of timings for methods 5 and 4a across scenarios were noted (e.g., between scenarios 13 and 14 for method 5; Table 2; Fig. 1). These differences reflected differences in the time required for the optimization of the log-likelihood function by the MIXED procedure of SAS. Besides being the fastest method overall, method 6 also displayed relatively small variation in timings across replicate simulation runs within scenarios.
Despite its unrivalled speed, the relative advantage of method 6 over the other methods (e.g., timing of method 5/timing of method 6; Fig. 1, right panels) decreased with increasing number of markers, most noticeably for method 5, whose timings came closest to those of method 6 at 50,000 markers. This is because the times required to compute Γ and to estimate u by methods 5 and 6 increasingly dominate the total computing time for each method at large numbers of markers, whereas the times needed by both methods to estimate the variance components vary relatively little with increasing number of markers.

When using k-fold cross-validation, Γa would need to be computed only once, whereas Γ would need to be computed anew for each estimation set. An efficient method of doing this is to operate on Γa rather than on Z; that is, one can compute Γ by a suitable reduction of Γa, which would need to be computed only once and can be stored during cross-validations. The time needed to obtain Γ from Γa is shorter than that needed to compute Γa. Therefore, for comparing the performance of methods by cross-validation, the first step in Table 2 can be ignored and focus directed instead on the two subsequent steps (call of MIXED and subsequent BLUP computations). It emerges that method 6 is also the better choice for cross-validation.

We also implemented method 6 in R (R Development Core Team, 2011), extracting useful bits of code from the “rrBLUP” package (Endelman, 2011), which implements the method described in Kang et al. (2008). Our implementation is available as R package “rrBlupMethod6” on the Comprehensive R Archive Network (Schulz-Streeck et al., 2011). Computing times for the scenarios in Table 2 were similar to those for SAS (results not shown).
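The reduction of Γa to Γ for cross-validation can be sketched as follows. This is a minimal illustration in plain Python, not the SAS/IML or R code used for the benchmarks. It assumes that Γa is the matrix of the form ZZ^T over all genotypes and that Γ for an estimation set is the corresponding block of rows and columns. The function names `gram`, `submatrix`, and `k_folds`, the toy marker matrix, and the fold layout are hypothetical.

```python
def gram(Z):
    """Return Z Z^T for a marker matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]


def submatrix(G, idx):
    """Return the rows and columns of the square matrix G listed in idx."""
    return [[G[i][j] for j in idx] for i in idx]


def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds (hypothetical layout)."""
    return [list(range(f, n, k)) for f in range(k)]


# Toy marker matrix: 6 genotypes x 4 markers, coded -1/1 (assumed coding).
Z_all = [
    [1, -1, 1, 1],
    [-1, -1, 1, -1],
    [1, 1, -1, 1],
    [1, -1, -1, -1],
    [-1, 1, 1, 1],
    [1, 1, 1, -1],
]

gamma_a = gram(Z_all)  # computed once for all genotypes, then stored

gammas = []
for test_idx in k_folds(len(Z_all), k=3):
    train_idx = [i for i in range(len(Z_all)) if i not in test_idx]
    # Gamma for this estimation set is a reduction of gamma_a;
    # no new matrix product over the markers is needed.
    gammas.append((train_idx, submatrix(gamma_a, train_idx)))
```

Extracting a block touches only the selected entries of the stored matrix, whereas recomputing the product from Z involves every marker again for every fold, which is why the first step of Table 2 can be left out of a cross-validation comparison.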

Further Important Issues

We have reviewed here eight different methods for performing RR-BLUP using general-purpose mixed model packages. Evaluation of the methods based on theoretical considerations and simulations shows that method 6 is superior to other methods in terms of computing time across a range of realistic scenarios. Method 6 is an extension of the method proposed by Kang et al. (2008) that allows using an estimate of the variance-covariance matrix R of adjusted genotype means from two-stage analysis of MET, which is routinely used in plant breeding. In implementing any of these methods, provision of good starting values of variance parameters to be estimated and good scaling of the design matrix Z are crucial to optimize the computational efficiency of RR-BLUP.

The benchmarks in Table 2 were obtained on a standard computing platform, such as would be easily available to all likely users. However, with some extra effort and experience, platforms can be set up that facilitate tremendous gains in computing time. For example, on a Linux desktop, with the standard linear algebra libraries replaced with multithreaded, processor-optimized “GotoBLAS2” libraries (available from “http://www.tacc.utexas.edu/tacc-projects/gotoblas2/”), the computation time of Γ in the R implementation of method 6 may be reduced by a factor of about 26 (for large p and n and using four threads, results not shown). With the increasing availability of massive whole genome sequence data, the computation of Γ using standard methods will become increasingly challenging, so that using nonstandard platforms, such as the one described above, should be given serious consideration for large datasets.

We only considered the simplest model for RR-BLUP. When extending the model, for example to include heterogeneous error variances and polygenic effects (Schulz-Streeck and Piepho, 2010), computing times become longer, and computational gains are not so easy to achieve. In particular, method 6 cannot be extended to model cross-specific effects and variances and still achieve a gain in efficiency.


In defining the methods for RR-BLUP, we have implicitly assumed for simplicity that all genetic variance is captured by the markers. If this assumption does not seem reasonable, for example when the number of markers is small, it is better to add a polygenic effect that captures the unexplained genetic variance (Piepho, 2009). The simplest way to achieve this is to estimate the residual variance $\sigma^2$ along with the SNP variance so that any unexplained polygenic effects can be captured by the residual variance. If a known error covariance matrix R is used, an additional variance component needs to be fitted for polygenic effects (Piepho, 2009). In our experience, this can notably increase computing time.

Using results on the inverse of a partitioned matrix, explicit equations for $\hat{u}$ and $\hat{g}$ can be derived from the corresponding MME for methods 1 to 4 (VanRaden, 2008). In our case, the only fixed effect is the intercept μ, so the computational savings from using such explicit equations are marginal. Also, if a mixed model package is used, the full MME will be solved by the package.

Some packages require provision of a positive-definite Γ. These packages will give an error message when Γ is not positive-definite. Some methods have been proposed that could be used to convert Γ into a positive-definite matrix (the so-called “bending methods”; Maenhout et al., 2008). These modifications of Γ can be applied with methods 2, 3a, 3b, 4a, and 4b, and they yield approximately valid estimates of V that can be used with method 5. Four commonly used bending options are given below.

1. One may add a tiny constant to the diagonal of Γ; that is, the matrix is replaced with $\Gamma + \varepsilon I_n$, in which ε is a tiny positive number, for example, $\varepsilon = 10^{-8}$. This modification leaves the resulting variance-covariance matrix $\hat{V}$ virtually unaltered and therefore yields virtually identical BLUPs. A refinement of the method is to “reduce” R by the same amount, that is, to replace R with $R - \varepsilon\hat{\sigma}_u^2 I_n$, which must be positive-definite, in which $\hat{\sigma}_u^2$ is the estimate of $\sigma_u^2$ when R is used. After the replacement, the estimate will change marginally, so one may consider iterating this process, although this will not usually lead to relevant changes in model fit.
2. Maenhout et al. (2008) proposed a numerical method called bending that they used to make a singular numerator relationship matrix positive-definite. The same kind of method could be applied to Γ.
3. Higham (1988) proposed an algorithm that makes a symmetric matrix positive-definite. This is implemented in the function “make.positive.definite()” of the R package “corpcor” (R Development Core Team, 2011).
4. VanRaden (2008) proposed replacing a scaled form of Γ with a weighted combination of Γ and A, the numerator relationship matrix computed from the pedigree, in which it is assumed that A is nonsingular.

From a theoretical point of view, bending methods such as those mentioned here are not entirely satisfactory because they only yield approximations to the BLUP computations, although the approximation error is usually low. Because BLUP computations can be obtained by exact methods (methods 1 to 6), there seems to be no compelling reason to use these bending methods.

It is noted that RR-BLUP is very closely related to spatial methods and methods that employ genetic similarities to model genetic correlation (Piepho, 2009; Ober et al., 2011). For example, Bauer et al. (2006) and Albrecht et al. (2011) use the simple matching coefficient to model genetic correlation among pedigreed inbred lines. In this case, the matrix of pairwise similarities is given by

$$\Gamma_s = \frac{1}{2} J_n + \frac{1}{2p} Z Z^T, \qquad [13]$$

in which $J_n = 1_n 1_n^T$ (Piepho, 2009). This is just a shift-scale transformed version of the model employed for RR-BLUP, and it is, in fact, equivalent to RR-BLUP, as shown in Appendix C.
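A small numerical check may help to see why Eq. [13] gives the simple matching coefficient. The sketch below assumes that the markers of inbred lines are coded as -1 and 1, so that the inner product of two marker rows equals the number of matches minus the number of mismatches. The function names and the toy marker matrix are hypothetical, and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def simple_matching(zi, zj):
    """Share of markers at which two lines carry the same allele."""
    matches = sum(a == b for a, b in zip(zi, zj))
    return Fraction(matches, len(zi))


def gamma_s(Z):
    """Eq. [13]: (1/2) J_n + [1/(2p)] Z Z^T, computed element by element."""
    p = len(Z[0])
    return [
        [Fraction(1, 2) + Fraction(sum(a * b for a, b in zip(zi, zj)), 2 * p)
         for zj in Z]
        for zi in Z
    ]


# Toy data: 4 inbred lines x 5 markers, coded -1/1 (assumed coding).
Z = [
    [1, -1, 1, 1, -1],
    [-1, -1, 1, -1, -1],
    [1, 1, -1, 1, 1],
    [1, -1, -1, -1, 1],
]

similarity = [[simple_matching(zi, zj) for zj in Z] for zi in Z]

# Eq. [13] reproduces the matching coefficients for every pair of lines.
assert gamma_s(Z) == similarity
```

With m matches among p markers, the inner product is 2m - p, so the entry of Eq. [13] is 1/2 + (2m - p)/(2p) = m/p, which is the proportion of matching markers.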

Supplemental Information Available

Supplemental material is available at http://www.crops.org/publications/cs.
Supplemental File S1: SAS code to exemplify use of fixed R matrix.
Supplemental File S2: SAS codes for simulating phenotypic and marker data and applying the eight methods to the data.

Acknowledgments

The German Federal Ministry of Education and Research (BMBF) funded this research within the AgroClustEr “Synbreed – Synergistic plant and animal breeding” (Grant ID: 0315526).

APPENDICES

Appendix A

Let $P = I_n - X(X^T V^{-1} X)^{-1} X^T V^{-1}$. It can be shown that $P^T V^{-1} P = (SVS)^+$, in which $(\,)^+$ denotes the pseudoinverse of a matrix, and that $SP = S$ and $SP^T = P^T$. Therefore, $(SVS)(SVS)^+ = (SVS)(P^T V^{-1} P) = SVP^T V^{-1} P = SPP = SP = S$, in which the third equality uses $VP^T V^{-1} = P$. Moreover, $(SVS)(SVS)^+ = [U_R D U_R^T][U_R D^{-1} U_R^T] = U_R U_R^T$, which establishes that $S = U_R U_R^T$ (see Patterson and Thompson, 1971; Kang et al., 2008).
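The identities in Appendix A can be checked on a small example. The sketch below assumes $S = I_n - X(X^T X)^{-1} X^T$, an intercept-only X, and a toy matrix $V = 2ZZ^T + I_n$; all of these, as well as the helper names, are assumptions made for illustration. It verifies in exact rational arithmetic that $SP = S$, that $SP^T = P^T$, that $(SVS)(P^T V^{-1} P) = S$, and that $P^T V^{-1} P$ satisfies the Moore-Penrose conditions for $SVS$.

```python
from fractions import Fraction
from functools import reduce


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def chain(*mats):
    """Multiply several matrices from left to right."""
    return reduce(matmul, mats)


def transpose(A):
    return [list(col) for col in zip(*A)]


def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inverse(A):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + unit for row, unit in zip(A, identity(n))]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]


# Toy inputs (assumed): intercept-only X and V = 2 Z Z^T + I_n.
Z = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [1, -1, -1]]
n = len(Z)
X = [[1] for _ in range(n)]
I = identity(n)
ZZt = matmul(Z, transpose(Z))
V = [[2 * g + e for g, e in zip(rg, ri)] for rg, ri in zip(ZZt, I)]

Xt, Vinv = transpose(X), inverse(V)
S = subtract(I, chain(X, inverse(matmul(Xt, X)), Xt))
P = subtract(I, chain(X, inverse(chain(Xt, Vinv, X)), Xt, Vinv))
SVS = chain(S, V, S)
candidate = chain(transpose(P), Vinv, P)  # P^T V^-1 P

assert matmul(S, P) == S and matmul(S, transpose(P)) == transpose(P)
assert matmul(SVS, candidate) == S and matmul(candidate, SVS) == S
assert chain(SVS, candidate, SVS) == SVS
assert chain(candidate, SVS, candidate) == candidate
```

Because both products of SVS and the candidate equal the symmetric matrix S, the two symmetry conditions of the pseudoinverse hold as well. Exact fractions are used here only to make the equalities testable without a tolerance; a production implementation would use a numerical linear algebra library.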

Appendix B

We have $\eta = U_R^T y$, for which we find $\eta \sim N(0, U_R^T V U_R)$. But now $U_R^T SVS U_R = D = U_R^T V U_R$, in which the last equality follows if we plug in $S = U_R U_R^T$ on the right-hand side of the equation. Therefore, $\eta = U_R^T y \sim N(0, D)$.


Appendix C

Replacing $\Gamma = ZZ^T$ with $\Gamma_s = (1/2)J_n + [1/(2p)]ZZ^T$ means that we replace $G = \Gamma\sigma_u^2$ with $G_s = aG + bJ_n$, in which $a = 1/(2p)$ and $b = \sigma_u^2/2$. Denote the genotypic effect under this shift-scale transformed model as $g_s$. First consider the replacement $G_s = aG$. It is obvious that this implies that $\hat{g}_s = a^{-1}\hat{g}$, meaning that the replacement yields an equivalent model. Therefore, it remains to consider the replacement $G_s = G + bJ_n$ and therefore the replacement of $V$ with $V_s = V + bJ_n$ and of $P$ with $P_s = I_n - d_s^{-1} J_n V_s^{-1}$, in which $d_s = 1_n^T V_s^{-1} 1_n$. Then $\hat{g}_s = G_s V_s^{-1} P_s y = G V_s^{-1} P_s y$, because $J_n V_s^{-1} P_s = 0$. Now $V_s^{-1} = (V + bJ_n)^{-1} = V^{-1} - C$, in which $C = b(f + 1)^{-1} V^{-1} J_n V^{-1}$ and $f = \mathrm{tr}(bV^{-1}J_n) = bd$, where $d = \mathrm{tr}(V^{-1}J_n)$ (Miller, 1981; Henderson and Searle, 1981). With this result, we find $d_s = d(bd + 1)^{-1}$ and $J_n V_s^{-1} = (bd + 1)^{-1} J_n V^{-1}$, so that $P_s = I_n - d_s^{-1} J_n V_s^{-1} = I_n - d^{-1} J_n V^{-1} = P$. Moreover, $CP = 0$ because $J_n V^{-1} P = 0$, and therefore $V_s^{-1} P_s = V^{-1} P$, showing that $\hat{g}_s = \hat{g}$, regardless of the choice of $b$.

It follows that we may add $bJ_n$ to $G$ without altering the best linear unbiased prediction (BLUP) of $g$. It also emerges that $\hat{\mu}$ remains unaltered. Moreover, the restricted log-likelihood operates on contrasts that remove the fixed effect $1_n \mu$ from $y$ and therefore also remove $bJ_n$ from $V_s$, so restricted maximum likelihood (REML) estimates remain unchanged. In summary, the replacement $G_s = aG + bJ_n$ yields an equivalent model with $\hat{g}_s = a^{-1}\hat{g}$. A proof of the same result but using a different approach based on a Kriging system of equations was given recently by Ober et al. (2011). Note that the result presented here is not restricted to $\Gamma = ZZ^T$ but is valid for any form of Γ, including spatial models (Piepho, 2009).
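The invariance of the BLUP to adding $bJ_n$ can be illustrated numerically. The sketch below assumes the simple model $y = 1_n\mu + g + e$ with $\mathrm{var}(g) = G$ and $\mathrm{var}(e) = R$, treats the variance parameters as known, and uses a toy marker matrix, a diagonal R, and a data vector that are all made up for the example. The helper names `solve` and `blup` are hypothetical. Exact rational arithmetic is used so that the two sets of predictions can be compared for strict equality.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on rationals."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[n] for row in M]


def blup(G, R, y):
    """Return (mu_hat, g_hat) for y = 1*mu + g + e, var(g) = G, var(e) = R."""
    n = len(y)
    V = [[G[i][j] + R[i][j] for j in range(n)] for i in range(n)]
    Vinv_y = solve(V, y)
    Vinv_1 = solve(V, [1] * n)
    mu_hat = sum(Vinv_y) / sum(Vinv_1)  # (1'V^-1 1)^-1 1'V^-1 y
    # V^-1 P y = V^-1 (y - 1 mu_hat)
    Vinv_Py = [a - mu_hat * b for a, b in zip(Vinv_y, Vinv_1)]
    g_hat = [sum(G[i][j] * Vinv_Py[j] for j in range(n)) for i in range(n)]
    return mu_hat, g_hat


# Toy inputs (assumed): G = 2 Z Z^T (singular here), diagonal R, made-up y.
Z = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [1, -1, -1]]
n = len(Z)
G = [[2 * sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]
R = [[(1, 2, 1, 3)[i] if i == j else 0 for j in range(n)] for i in range(n)]
y = [5, 3, 6, 4]

reference = blup(G, R, y)
for b in (Fraction(1, 2), 3, 10):
    G_s = [[g + b for g in row] for row in G]  # G + b J_n
    assert blup(G_s, R, y) == reference  # same mu_hat and same g_hat
```

The check covers only the shift by $bJ_n$ with fixed variance parameters; the scaling by $a$ and the REML step discussed above are not exercised by this example.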

References

Albrecht, T., V. Wimmer, H.J. Auinger, M. Erbe, C. Knaak, M. Ouzunova, H. Simianer, and C.C. Schön. 2011. Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123:339–350. doi:10.1007/s00122-011-1587-7
Bauer, A.M., T.C. Reetz, and J. Léon. 2006. Estimation of breeding values of inbred lines using best linear unbiased prediction (BLUP) and genetic similarities. Crop Sci. 46:2685–2691. doi:10.2135/cropsci2006.01.0019
Bernardo, R., and J. Yu. 2007. Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47:1082–1090. doi:10.2135/cropsci2006.11.0690
Crossa, J., G. de los Campos, P. Perez, D. Gianola, J. Burgueno, J.L. Araus, D. Makumbi, R.P. Singh, S. Dreisigacker, J. Yan, V. Arief, M. Bänzinger, and H.J. Braun. 2010. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186:713–724. doi:10.1534/genetics.110.118521
Endelman, J. 2011. rrBLUP: Genomic selection and association analysis. R package version 1.1. H. Wickham and B. Mæland. http://crantastic.org/packages/rrBLUP/versions/12431 (accessed 11 Aug. 2011).
Habier, D., R.L. Fernando, and J.C.M. Dekkers. 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:2389–2397.
Harville, D.A. 1976. Extension of the Gauss-Markov theorem to include estimation of random effects. Ann. Stat. 4:384–395. doi:10.1214/aos/1176343414
Henderson, C.R. 1977. Best linear unbiased prediction of breeding values not in the model for records. J. Dairy Sci. 60:783–787. doi:10.3168/jds.S0022-0302(77)83935-0
Henderson, C.R. 1984. Application of linear models in animal breeding. University of Guelph, Guelph, Canada.
Henderson, H.V., and S.R. Searle. 1981. On deriving the inverse of a sum of matrices. SIAM Rev. 23:53–60. doi:10.1137/1023004
Heslot, N., H.P. Yang, M.E. Sorrels, and J.L. Jannink. 2012. Genomic selection in plant breeding: A comparison of models. Crop Sci. 52:146–160. doi:10.2135/cropsci2011.06.0297
Higham, N.J. 1988. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra Appl. 103:103–118. doi:10.1016/0024-3795(88)90223-6
Iwata, H., and J.L. Jannink. 2011. Accuracy of genomic selection prediction in barley breeding programs: A simulation study based on the real single nucleotide polymorphism data of barley breeding lines. Crop Sci. 51:1915–1927. doi:10.2135/cropsci2010.12.0732
Kang, H.M., N.A. Zaitlin, C.M. Wade, A. Kirby, D. Heckerman, M.J. Daly, and E. Eskin. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709–1725. doi:10.1534/genetics.107.08010
Maenhout, S., B. DeBaets, and G. Haensert. 2008. Marker-based estimation of the coefficient of coancestry in hybrid breeding programmes. Theor. Appl. Genet. 118:1181–1192. doi:10.1007/s00122-009-0972-y
Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829.
Miller, K.S. 1981. On the inverse of the sum of matrices. Math. Mag. 54:67–72. doi:10.2307/2690437
Möhring, J., and H.P. Piepho. 2009. Comparison of weighting in two-stage analyses of series of experiments. Crop Sci. 49:1977–1988. doi:10.2135/cropsci2009.02.0083
Ober, U., M. Erbe, N. Long, E. Porcu, M. Schlather, and H. Simianer. 2011. Predicting genetic values: Kernel-based best linear unbiased prediction with genomic data. Genetics 188:695–708. doi:10.1534/genetics.111.128694
Patterson, H.D., and R. Thompson. 1971. Recovery of interblock information when block sizes are unequal. Biometrika 58:545–554. doi:10.1093/biomet/58.3.545
Piepho, H.P. 2009. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 49:1165–1176. doi:10.2135/cropsci2008.10.0595
Piepho, H.P., T. Schulz-Streeck, and J.O. Ogutu. 2011. A stagewise approach for analysis of multi-environment trials. Biuletyn Oceny Odmian 33:7–20.
R Development Core Team. 2011. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/ (accessed 11 Aug. 2011).
Rao, C.R., H. Toutenburg, Shalabh, and C. Heumann. 2008. Linear models and generalizations. Least squares and alternatives (3rd extended ed.). Springer, Berlin.
Ruppert, D., M.P. Wand, and R.J. Carroll. 2003. Semiparametric regression. Cambridge Univ. Press, Cambridge.
SAS Institute. 2004. SAS 9.1.3 help and documentation. SAS Institute Inc., Cary, NC.
Schulz-Streeck, T., B. Estaghvirou, and F. Technow. 2011. rrBlupMethod6: Re-parametrization of RR-BLUP to allow for a fixed residual variance. R package version 1.0. R Foundation for Statistical Computing, Vienna, Austria. http://CRAN.R-project.org/package=rrBlupMethod6 (accessed 18 Nov. 2011).
Schulz-Streeck, T., and H.P. Piepho. 2010. Genome-wide selection by mixed model ridge regression and extensions based on geostatistical models. BMC Proceedings 4(Suppl. 1):S8. doi:10.1186/1753-6561-4-S1-S8
Searle, S.R., G. Casella, and C.E. McCulloch. 1992. Variance components. John Wiley & Sons, New York.
VanRaden, P.M. 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91:4414–4423. doi:10.3168/jds.2007-0980
Zhao, Y., M. Gowda, W. Liu, T. Würschum, H.P. Maurer, F.H. Longin, M. Ranc, and J.C. Reif. 2012. Accuracy of genomic selection in European maize elite breeding populations. Theor. Appl. Genet. 124(4):769–776. doi:10.1007/s00122-011-1745-y
