
Journal of Applied Ecology 2009, 46, 590–599

doi: 10.1111/j.1365-2664.2009.01642.x

Generalized estimating equations and generalized linear mixed-effects models for modelling resource selection

Blackwell Publishing Ltd

Nicola Koper1* and Micheline Manseau1,2

1Natural Resources Institute, University of Manitoba, 70 Dysart Road, Winnipeg, MB, Canada R3T 2N2; and 2Parks Canada, Western and Northern Service Centre, 145 McDermot Avenue, Winnipeg, MB, Canada R3B 0R9

Summary

1. Accurate resource selection functions (RSFs) are important for managing animal populations. Developing RSFs using data from GPS telemetry can be problematic due to serial autocorrelation, but modern analytical techniques can help to compensate for this correlation.

2. We used telemetry locations from 18 woodland caribou Rangifer tarandus caribou in Saskatchewan, Canada, to compare marginal (population-specific) generalized estimating equations (GEEs), and conditional (subject-specific) generalized linear mixed-effects models (GLMMs), for developing resource selection functions at two spatial scales. We evaluated the use of empirical standard errors, which are robust to misspecification of the correlation structure. We compared these approaches with destructive sampling.

3. Statistical significance was strongly influenced by the use of empirical vs. model-based standard errors, and marginal (GEE) and conditional (GLMM) results differed. Destructive sampling reduced apparent habitat selection. k-fold cross-validation results differed for GEE and GLMM, as it must be applied differently for each model.

4. Synthesis and applications. Due to their different interpretations, marginal models (e.g. generalized estimating equations, GEEs) may be better for landscape and population management, while conditional models (e.g. generalized linear mixed-effects models, GLMMs) may be better for management of endangered species and individuals. Destructive sampling may lead to inaccurate resource selection functions (RSFs), but GEEs and GLMMs can be used for developing RSFs when used with empirical standard errors.

Key-words: conditional, correlated data, GEE, GLMM, k-fold cross-validation, marginal, resource selection function, telemetry, woodland caribou

*Correspondence author. E-mail: [email protected]

© 2009 The Authors. Journal compilation © 2009 British Ecological Society

Introduction

Accurate modelling of habitat selection by animals is critical to developing effective management plans. Resource selection functions (RSFs) are used to compare used with available habitat (Manly et al. 2002). Recent progress in GPS technology has made enormous amounts of data available. However, sequentially surveyed locations may be correlated at intervals as long as 1 month apart (Cushman, Chase & Griffin 2005), and are obviously correlated at intervals measured in minutes or hours (e.g. Fortin et al. 2005). Such data violate the assumption of independence of observations, which may increase the frequency of type I errors (Clifford, Richardson & Hémon 1989).

One approach to dealing with this autocorrelation has been to adopt an analysis that assumes absence of correlation, and then to manipulate the data to meet this assumption. For example, telemetry locations may be recorded every few hours or days (e.g. Johnson, Seip & Boyce 2004) on the assumption that this time-lag results in independence. However, the increased time-lag may not be sufficient to produce independent observations, and the reduced amount of data may increase bias and reduce accuracy (Gustine et al. 2006). Destructive sampling, accomplished by dropping data until independence is reached (Way, Ortega & Strauss 2004), is similarly problematic, and may require dropping as many as 95% of data collected (e.g. Saher 2005).

Some approaches that have been proposed for controlling for temporal autocorrelation are problematic. For example, information-theoretic approaches do not sufficiently correct for autocorrelation (cf. Boyce 2006; Aarts et al. 2008) because calculation of standard errors (sensitive to independence) is a critical component of the model selection paradigm, and because the likelihood used to calculate information criteria assumes independence (Burnham & Anderson 1998). Conditional logistic regression (e.g. Johnson & Gillingham 2005) assumes independence among groups of points (observed paired with random), which is not met when telemetry points are recorded frequently. To address this problem, Fortin et al. (2005) incorporated robust standard errors and destructive sampling to obtain long time-lags between clusters of points. Although useful at the ‘step-scale’, their approach does not allow for evaluation of habitat selection at the home-range scale.

Gillies et al. (2006) recommended models that include fixed and random (clustering) effects, such as generalized linear mixed-effects models (GLMMs), to control for the correlation that arises from recording multiple locations from each animal. Mixed models have been applied to correlated ecological data (e.g. Bolker et al. 2009), but Gillies et al. (2006) are among the first to apply them to RSFs (see also Aarts et al. 2008). However, there are at least two potential problems with applying GLMM to RSFs. First, the models are analytically complex (Fitzmaurice, Laird & Ware 2004: 326), which may inhibit convergence, and secondly, hypothesis tests in GLMMs are highly sensitive to model and correlation structure misspecification (Overall & Tonidandel 2004) when model-based standard errors are used. Because telemetry locations have been sampled sequentially, they are autocorrelated. However, random points selected from an animal’s home range (e.g. Gillies et al. 2006) do not show autocorrelation, as they are not sampled sequentially over time. Because the correlation structure differs between telemetry and random points, it is impossible to correctly specify the within-cluster correlation structure. The data Gillies et al. provide suggest that grizzly bear Ursus arctos L. locations were determined approximately every 4 h. At this sampling frequency, it is unlikely that these locations are independent. Gillies et al.
(2006: 890) misspecified the correlation structure, as they assumed that all data within a cluster (animal) were equally correlated. Therefore, their approach does not meet the assumptions of GLMM. Nonetheless, we believe their approach is promising, and can be developed further. One possible modification is to use empirical (Huber–White sandwich) variance estimates within the GLMM to make the analysis robust to misspecification of the correlation structure (SAS Institute Inc. 2006), as Nielsen et al. (2002) did with a logistic regression model. Gillies et al. (2006) found that GLMMs were more effective for the development of RSFs than were logistic regression models with empirical standard errors, but did not evaluate GLMM combined with empirical standard errors. We suggest that GLMM with empirical standard errors may be robust to both among- and within-animal correlations, in contrast to GLMM without empirical standard errors.

Generalized linear models (GLMs) with generalized estimating equations (GEEs) may provide a useful alternative. GEEs include an additional variance component to accommodate correlated data, and to allow for differences among clusters. GEEs have several favourable properties for ecological analyses; for example, parameter estimates and empirical standard errors are robust to misspecification of the correlation structure (Overall & Tonidandel 2004), and they are usually less analytically complex than GLMMs (Agresti 2002: 365); hence, model convergence is more likely. GEEs have been used extensively in a variety of disciplines, such as epidemiology (Wu et al. 1999) and political science (Zorn 2001). In ecology, they have been used to control for lack of independence among nests clustered within sites (Driscoll et al. 2005) and among related species (Duncan 2004).

Generalized estimating equations have been used only occasionally in habitat-selection studies. Storch (2002) and Dorman et al. (2007) demonstrate their use for controlling for spatial autocorrelation. In a conditional logistic regression context, Fortin et al. (2005) developed RSFs using estimating equations with an independence working correlation structure and robust standard errors, which they implemented using Cox proportional hazards regression. Although GEEs with other correlation structures have not previously been used for building RSFs, robust standard errors have been applied to control for correlation among telemetry locations (Nielsen et al. 2002). However, pooling data across animals (e.g. Nielsen et al. 2002) biases results towards data-rich individuals (Gillies et al. 2006; Aarts et al. 2008), if data are not missing at random. Applying robust standard errors while using a working correlation structure other than ‘independence’ in the estimation procedure should help overcome this problem.

Nonetheless, as with GLMMs, the benefits of GEEs come with tradeoffs. Whereas GLMMs are sensitive to the choice of correlation structure, GEEs are sensitive to the link function (Pendergast et al. 1996: 101), which can affect model fit (Lele & Keim 2006). It is, therefore, important to compare these approaches according to both their performance and analytical paradigm, to evaluate the appropriateness of their tradeoffs under different management scenarios.

Another fundamental issue is the interpretation of parameter estimates.
Conditional (subject-specific) coefficient interpretation means that coefficients model how individual responses change with respect to independent variables. Marginal (population) parameter estimates describe the effects of independent variables on a population. This has a strong effect on parameter estimates, standard error estimates, and significance testing (Fitzmaurice et al. 2004: 365). Whereas GLMMs generate conditional parameter estimates, from which marginal estimates can be derived (Agresti 2002: 499), GEEs only produce marginal ones. However, marginal parameter estimates derived from GLMMs are biased, in that their absolute value is too small, and this bias increases as the variance of the random effect increases (Agresti 2002: 499). Although RSFs do not produce estimates of actual probabilities of use, they produce estimates that are proportional to probability of use (Manly et al. 2002), and thus, this bias could be problematic. Further, the relationships among covariates, and the parameter estimates themselves, are not easily interpreted for marginal estimates derived from conditional models, and models are more likely to be misspecified (Agresti 2002: 499; Fitzmaurice et al. 2004: 364). It is therefore preferable to use a marginal model, such as GEE, when marginal population estimates are of interest (Agresti 2002: 501).
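The distinction between conditional and marginal coefficients can be illustrated numerically. The following is a minimal sketch and not part of the original analysis: it assumes a random-intercept logistic model with hypothetical values (a conditional slope of 1·0 and a random-intercept standard deviation of 2·0), averages the subject-specific probabilities over the random intercept using Gauss–Hermite quadrature, and compares the implied marginal slope with the commonly used approximation β_marginal ≈ β_conditional / sqrt(1 + c²σ²), where c = 16√3/(15π). All names and values are illustrative assumptions.

```python
import numpy as np

def marginal_prob(eta, sigma, n_nodes=60):
    """P(Y = 1) averaged over a N(0, sigma^2) random intercept (Gauss-Hermite)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * nodes           # change of variables for N(0, sigma^2)
    p = 1.0 / (1.0 + np.exp(-(eta + u)))       # subject-specific probabilities
    return float(np.sum(weights * p) / np.sqrt(np.pi))

def logit(p):
    return np.log(p / (1.0 - p))

beta_c = 1.0   # hypothetical conditional (subject-specific) slope
sigma = 2.0    # hypothetical SD of the random intercept

# Marginal log-odds ratio for a one-unit change in the covariate (x = 0 to x = 1)
beta_m = logit(marginal_prob(beta_c, sigma)) - logit(marginal_prob(0.0, sigma))

# Closed-form approximation to the same quantity
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
beta_m_approx = beta_c / np.sqrt(1.0 + c**2 * sigma**2)

print(f"conditional slope:              {beta_c:.3f}")
print(f"marginal slope (quadrature):    {beta_m:.3f}")
print(f"marginal slope (approximation): {beta_m_approx:.3f}")
```

In this sketch the marginal slope is smaller in absolute value than the conditional slope, and the gap widens as the assumed random-intercept variance increases, which is why the two kinds of estimate should not be compared directly.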


Accurate resource selection functions make an important contribution to the conservation of rare or threatened species (Johnson, Seip & Boyce 2004). The boreal population of woodland caribou Rangifer tarandus caribou L. is threatened in Canada (COSEWIC 2002). It is sensitive to habitat composition and anthropogenic activities (Brown et al. 2007), and therefore, accidental misspecification of RSFs would have important conservation consequences. We compared RSFs developed using GLMMs and GEEs, at two spatial scales, using data on woodland caribou. We compared effects of empirical and model-based standard errors on statistical significance. Finally, we compared our results with an analysis done on a destructively sampled subset of the data. Because GEEs have rarely been applied to RSFs, we provide an overview of this approach (see also Dorman et al. 2007).

Materials and methods

BACKGROUND ON GENERALIZED ESTIMATING EQUATIONS

For a review of the application of random effects for RSFs, we recommend Gillies et al. (2006); readers should also review the statistical literature on the use of random effects, such as Agresti (2002) and Bolker et al. (2009). Because we introduce GEEs for the development of RSFs using telemetry data, we present a brief conceptual overview of GEEs; for further details, we recommend Hardin & Hilbe (2003). We use the term cluster to mean a unit of analysis within which there are multiple measurements. In our example, each cluster is a caribou.

Three components are important in the GEE (Fitzmaurice et al. 2004: 294–295). Generalized estimating equations require a model for the mean response (as a function of covariates), the variance (often specified as a function of the mean), and a working correlation assumption. They are semi-parametric because estimates rely on parametric assumptions regarding the mean and variance/covariance, but they are not fully parametric (i.e. they require no other distributional assumptions).

First, consistent with a GLM, the conditional expectation, $E(Y_{it} \mid X_{it}) = \mu_{it}$, depends on the independent variables through a link function (a nonlinear equation used to link the predicted values with the independent variables):

\[ g(\mu_{it}) = X_{it}\beta \qquad \text{(eqn 1)} \]

Secondly, the conditional variance of each $Y_{it}$, given the independent variables, varies as follows:

\[ \mathrm{Var}(Y_{it}) = \phi\, v(\mu_{it}), \qquad \text{(eqn 2)} \]

where $\phi$ is a known or estimated scale parameter (depending on which response distribution is used), and $v(\mu_{it})$ is a known variance function of the mean $\mu_{it}$.

Thirdly, the correlation among data points within clusters is assumed to be a function of one or more correlation parameters, $\alpha$. Essentially, the GEE is defined by substituting the variance term in the GLM with the following variance–covariance matrix (Hardin & Hilbe 2003: 58),

\[ V(\mu_i) = \left[ D\big(V(\mu_{it})\big)^{1/2} \, R(\alpha)_{(n_i \times n_i)} \, D\big(V(\mu_{it})\big)^{1/2} \right]_{n_i \times n_i}, \qquad \text{(eqn 3)} \]

where $V(\mu_{it})$ is the variance of the marginal mean $\mu_{it}$, and $D$ is a diagonal matrix. The correlation in the data is modelled using the working correlation matrix, $R(\alpha)_{(n_i \times n_i)}$, defined by the parameter vector $\boldsymbol{\alpha}$. This vector may contain a single value (i.e. $\boldsymbol{\alpha} = \alpha$) as in the compound-symmetric correlation structure, or it may contain several values. $R$ is a square matrix of dimensions $n_i \times n_i$, where $n_i$ is the number of samples (or measurements) within each cluster.

An iterative process is used to estimate model parameters. First, estimates of $V_i$ and $\beta$ are obtained using initial estimates of $\alpha$ and $\phi$ (i.e. $\beta$ is initially estimated from a generalized linear model, assuming $\phi = 1$ and independence of observations). Then $\alpha$ and $\phi$ are estimated using the estimates calculated for $V_i$ and $\beta$ in the first step (Fitzmaurice et al. 2004: 302). This iterative process continues until model convergence is achieved, that is, until there is little change in the parameter estimates from one iteration to the next. At convergence, the model-based variance estimate is (Fitzmaurice et al. 2004: 305),

\[ \mathrm{Cov}(\hat{\beta}) = B^{-1}, \qquad \text{(eqn 4)} \]

\[ \text{where } B = \sum_{i=1}^{N} D_i' V_i^{-1} D_i, \qquad \text{(eqn 5)} \]

where $D_i$ is the derivative matrix (the matrix of the derivative of $\mu_i$ relative to the components of $\beta$), and $V_i$ is the working covariance matrix. However, the model-based variance is often replaced by the empirical or ‘sandwich’ variance estimator, which is robust even when the working correlation structure does not correctly describe the correlation in the data (Fitzmaurice et al. 2004: 304). Although it does require a sufficiently large number of clusters to be unbiased (Fitzmaurice et al. 2004: 305), it has a potentially broad application for ecological analyses. The empirical variance matrix is (Fitzmaurice et al. 2004: 302),

\[ \mathrm{Cov}(\hat{\beta}) = B^{-1} M B^{-1}, \qquad \text{(eqn 6)} \]

\[ \text{where } M = \sum_{i=1}^{N} D_i' V_i^{-1}\, \mathrm{Cov}(Y_i)\, V_i^{-1} D_i. \qquad \text{(eqn 7)} \]

Because $\mathrm{Cov}(Y_i)$ is unknown, $M$ is a theoretical variance, rather than a variance estimate. $\mathrm{Cov}(Y_i)$ is estimated using,

\[ \widehat{\mathrm{Cov}}(Y_i) = (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)' \qquad \text{(eqn 8)} \]
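The model-based and empirical variance estimators in eqns 4–8 can be sketched in a few lines of code. This is a hedged illustration rather than the procedure used in the paper: it assumes a logistic GEE with an independence working correlation and φ fixed at 1 (so that $D_i' V_i^{-1} D_i$ reduces to $X_i' \mathrm{diag}(\mu(1-\mu)) X_i$), and it uses simulated data with a hypothetical animal-level random effect. The function name and all simulation settings are assumptions made for the example.

```python
import numpy as np

def fit_logistic_gee_independence(X, y, cluster, n_iter=25):
    """Logistic GEE with an independence working correlation (phi fixed at 1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):                       # Fisher scoring updates
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)                       # binomial variance function v(mu)
        B = X.T @ (X * w[:, None])                # sum_i D_i' V_i^{-1} D_i   (eqn 5)
        beta = beta + np.linalg.solve(B, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    B = X.T @ (X * w[:, None])
    M = np.zeros_like(B)
    for c in np.unique(cluster):                  # eqn 7 with eqn 8 plugged in
        idx = cluster == c
        s = X[idx].T @ (y[idx] - mu[idx])         # D_i' V_i^{-1} (Y_i - mu_i)
        M += np.outer(s, s)
    B_inv = np.linalg.inv(B)
    return beta, B_inv, B_inv @ M @ B_inv         # estimate, eqn 4, eqn 6

# Simulated example: 18 hypothetical animals with 200 observations each
rng = np.random.default_rng(0)
n_clusters, n_per = 18, 200
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0.0, 1.0, n_clusters)[cluster]     # animal-level effect inducing correlation
x = rng.normal(size=cluster.size)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x + u)))).astype(float)
X = np.column_stack([np.ones_like(x), x])

beta, cov_m, cov_e = fit_logistic_gee_independence(X, y, cluster)
se_model = np.sqrt(np.diag(cov_m))
se_empirical = np.sqrt(np.diag(cov_e))
ratio = se_empirical / se_model
print("estimates:      ", np.round(beta, 3))
print("model-based SE: ", np.round(se_model, 3))
print("empirical SE:   ", np.round(se_empirical, 3))
print("SE ratio:       ", np.round(ratio, 2))
```

Because the simulated observations within an animal share a common effect that the independence working correlation ignores, the empirical standard error of the intercept is expected to exceed the model-based one in this sketch. Other working correlation structures, such as compound symmetry, would additionally require estimating α within the iteration and are not shown here.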

HABITAT SELECTION BY WOODLAND CARIBOU

Eighteen adult female woodland caribou from the Smoothstone–Wapaweka caribou management area in central Saskatchewan, Canada, were collared in 2005 and 2006 using Lotek GPS collars (Lotek Wireless Inc., 115 Pony Drive, Newmarket, Ontario). Locations were recorded every 4 h and consisted of late winter locations (1 January–15 March 2006 and 2007), when resources are most scarce and habitat selection is strong (Brown et al. 2007). Data from 1 year per caribou were used. The number of locations per animal ranged from 188 to 610.

We modelled the influence of key habitat types on habitat selection by woodland caribou (Brown et al. 2007), by evaluating whether habitat types differed between telemetry locations and random locations. We followed Mayor et al. (2007) in applying a biologically relevant but simplified habitat selection model to illustrate our analyses; other landscape features may also influence habitat selection by woodland caribou. We compared the presence or absence of mature coniferous stands [treed muskegs (TM), mature spruce stands (MS) and mature jack pine dominated stands (MJPD)] within 50 m of telemetry and random points. We modelled the influence of distance to roads (DRD) and distance to hardwood/mixed-wood stands (DHMW), as the spatial structure of habitat patches may influence caribou population declines and habitat selection (Johnson & Gillingham 2005). Distance to HMW stands was correlated with presence/absence of HMW, and thus, presence/absence of HMW was not added to the model. Distance to cutblocks was correlated with DRD; therefore, the latter variable was used to capture effects of anthropogenic activities.

SPATIAL SCALES AND DATA SETS

We evaluated habitat selection at two spatial scales: herd home range (e.g. Linke et al. 2005), and the home range of individual animals (e.g. Gillies et al. 2006). In the first data set, we selected random points from the herd home range (100% minimum convex polygon; HHR data set). For each animal, we selected five times as many random points as locations collected from the GPS collars (e.g. Johnson & Gillingham 2005). In the second data set, we selected random points from the home ranges of individual animals (IHR data set). For each data set, this resulted in a total of 8985 telemetry locations and 44 925 random locations.

Because the correlation structure of the telemetry points differs from that of the random points, the correlation structure cannot be correctly specified. However, both GEE and GLMM may be used with an empirical rather than model-based variance estimator, which is robust to deviations from this assumption (Hardin & Hilbe 2003; Fitzmaurice et al. 2004). Although correct specification of the correlation structure is desirable because it allows for the calculation of more efficient (usually smaller) standard errors (Fitzmaurice et al. 2004), empirical standard errors may be used to determine statistical significance when the correlation structure cannot be correctly specified or when it is unknown.

For each data set, we also used destructive sampling to remove some of the temporal autocorrelation among relocations (e.g. Way, Ortega & Strauss 2004; Saher 2005). Because as many as 95% of data points may have to be dropped before this is achieved (e.g. Saher 2005), we dropped 95% of our data to create two data sets that were 5% of the size of the intact data sets (HHR 5%, and IHR 5%), to create a relatively extreme example. Retained telemetry points were 3·33 days apart (see also Fortin et al. 2005). GEE and GLMM were used to control for clustering of data within animals.
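The arithmetic behind the destructive sampling interval can be checked with a short sketch. It assumes, for illustration only, that thinning was done by keeping every 20th location of a time-ordered track; the text states only that 95% of the data were dropped and that retained points were 3·33 days apart, which is consistent with 4 h × 20 = 80 h. The track length, start date and function name are hypothetical.

```python
from datetime import datetime, timedelta

def destructive_sample(fixes, keep_every=20):
    """Keep every `keep_every`-th fix of one animal's time-ordered track."""
    return fixes[::keep_every]

start = datetime(2006, 1, 1)              # hypothetical start of a late-winter track
fix_interval = timedelta(hours=4)         # collars recorded a location every 4 h
track = [start + i * fix_interval for i in range(400)]   # hypothetical 400-fix track

thinned = destructive_sample(track, keep_every=20)
gap_days = (thinned[1] - thinned[0]).total_seconds() / 86400.0
retained = len(thinned) / len(track)
print(f"retained {retained:.0%} of fixes, {gap_days:.2f} days apart")
# prints "retained 5% of fixes, 3.33 days apart"
```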

ANALYSES

We used PROC GENMOD and PROC GLIMMIX in SAS 9·1 to develop GEEs and GLMMs, respectively (SAS Institute Inc. 2003). All statistical models included the same independent variables. Within GLMM, a random intercept variable was added to account for clustering of points within individuals (Gillies et al. 2006). We used two working correlation structures to analyse each data set using GEE. The independent structure assumes within-cluster observations are independent, but is also useful for data sets with relatively few clusters (Hardin & Hilbe 2003: 142). In the compound-symmetric working correlation structure, all observations within clusters are assumed to be equally correlated, while observations from different clusters are assumed to be independent (see also Gillies et al. 2006). The compound-symmetric correlation structure is heuristically equivalent to including a random intercept in a mixed model. Empirical standard errors were used to evaluate statistical significance. We compared these results with model-based standard errors, to determine the effect of erroneously using model-based rather than empirical standard errors.


We did not directly compare relative fit of GEEs vs. GLMMs for two reasons. First, GEEs use a quasi-likelihood, while GLMMs typically use a maximum-likelihood framework for model estimation. Comparative measures such as Akaike’s Information Criterion (Burnham & Anderson 1998) could be used for evaluating relative fit of models for GLMM (Bolker et al. 2009), whereas the quasi-likelihood under the independence model information criterion, or QIC (Pan 2001), could be used for evaluating relative fit of models for GEE, but there is no criterion that can be used for both. Further, our research indicates QIC rarely chooses the correct correlation structure (A. Barnett, N. Koper, A. Dobson & M. Manseau, unpublished data, 2008). Secondly, because parameter estimates from GLMM were conditional, while parameter estimates from GEE were marginal, parameter estimates and significance are expected to differ, and their comparison is not appropriate.

It is desirable to compare the fit of different correlation structures within a GEE analysis. Because our research demonstrates that QIC is strongly biased, this measure is not trustworthy. An informal alternative is to compare the relative size of the empirical (SEE) and model-based (SEM) standard errors. If SEE/SEM is close to 1, this suggests the correlation structure is correctly modelled (Bishop, Die & Wang 2000). We used the SEE/SEM ratio to evaluate whether the compound-symmetric correlation structure fit the data better than the independent correlation structure. There are no guidelines regarding the size of the ratio, but higher ratios reflect poorer model fit. This comparison is qualitative, but it is the best approach available at this time. Future developments and improvements to QIC are planned (J. Hilbe, 2008, personal communication).

Model validation is important for RSF analyses. To demonstrate model validation in the context of GEE and GLMM analyses, we applied k-fold cross-validation (Boyce et al.
2002) to the individual home range data set (with the compound-symmetric correlation structure for GEE), but emphasize that this method should not be used to compare the fit of GEE and GLMM unless marginal estimates are derived from both methods. Because GEE predicts habitat selection of a population, for each iteration in the k-fold analysis, we withheld three animals from the data set, used the remaining 15 animals (83% of the 18 animals) to develop each RSF, and tested its fit using the withheld animals. Because the GLMM is describing habitat selection of specific animals, we withheld 17% of the data from each animal, used the remaining 83% of the data to develop each RSF, and tested the model using the withheld data (Boyce et al. 2002). Spearman’s rank correlation analysis was performed on the area-adjusted frequencies across RSF bins. Ten RSF bins with equal number of observations were created for the analyses.
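The two ways of forming cross-validation folds described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, six folds are assumed for both schemes (18 animals in groups of three, and roughly one-sixth, about 17%, of each animal's rows), and the assignment of animals and rows to folds is assumed to be random. Model fitting, RSF binning and the Spearman rank correlation are not shown.

```python
import numpy as np

def animal_level_folds(animal_ids, n_withheld=3, seed=0):
    """Folds for the marginal (GEE) model: withhold whole animals."""
    rng = np.random.default_rng(seed)
    animals = rng.permutation(np.unique(animal_ids))
    groups = [animals[i:i + n_withheld] for i in range(0, len(animals), n_withheld)]
    return [np.isin(animal_ids, g) for g in groups]      # True marks a test row

def within_animal_folds(animal_ids, k=6, seed=0):
    """Folds for the conditional (GLMM) model: withhold ~1/k of each animal's rows."""
    rng = np.random.default_rng(seed)
    fold = np.empty(animal_ids.size, dtype=int)
    for a in np.unique(animal_ids):
        idx = np.flatnonzero(animal_ids == a)
        fold[rng.permutation(idx)] = np.arange(idx.size) % k   # spread rows over folds
    return [fold == j for j in range(k)]                 # True marks a test row

# Hypothetical data layout: 18 animals with 100 rows each
animal_ids = np.repeat(np.arange(18), 100)
gee_folds = animal_level_folds(animal_ids)
glmm_folds = within_animal_folds(animal_ids)

for test in gee_folds:
    print("GEE fold withholds animals:", np.unique(animal_ids[test]))
print("GLMM test fraction per fold:", [round(float(t.mean()), 3) for t in glmm_folds])
```

With animal-level folds the test animals never contribute to model fitting, which matches a population-level prediction; with within-animal folds every animal appears in both the training and the test data, which matches a subject-specific prediction.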

Results

MODEL CONVERGENCE

All GEE models converged. Initially, the GLMM analysis on the IHR data set did not converge, but after changing the optimization procedure to the Newton–Raphson method with ridging, the analysis converged.

EFFECT OF SPATIAL SCALES ON COMPARISON BETWEEN GLMM AND GEE

Scale had a strong effect on parameter estimates and statistical significance (Table 1). Avoidance of roads was only significant


0·047