Ecological Applications, 24(4), 2014, pp. 862–876
© 2014 by the Ecological Society of America

Predictive modeling of marine benthic macrofauna and its use to inform spatial monitoring design

MICHAEL DOWD,1,3 JON GRANT,2 AND LIN LU2

1 Department of Mathematics and Statistics, Dalhousie University, 6316 Coburg Road, PO Box 15000, Halifax, Nova Scotia B3H 4R2 Canada
2 Department of Oceanography, Dalhousie University, Halifax, Nova Scotia B3H 4R2 Canada

Abstract. This study undertakes ecological analysis focused on predictive modeling and design for spatial sampling. The approaches are applied to a set of coastal marine benthic macrofaunal observations, and associated environmental data, measured at 48 sites in St Anns Bay, Nova Scotia, Canada. A multivariate generalized least-squares regression was used to establish a predictive relationship between benthic fauna and the environment. Five ecological indices derived from faunal composition (abundance, richness, species number, diversity, AMBI) were treated as a multivariate response, and 10 environmental variables as candidate predictors. The multivariate regression also incorporated the effects of spatial autocorrelation. Predictive relationships were highly significant, and variable selection identified three key environmental predictors (median sediment grain size, porosity, and sulfide). Using these baseline data, we developed a procedure to identify a reduced sampling design for long-term monitoring of benthic faunal health. The procedure is based on a sequential (backward elimination) algorithm to identify the set of sites that contributed most to the overall information. This study provides a general and comprehensive statistical framework for treating environmental monitoring and sampling design. It can be extended beyond the statistical framework used, and applied to a range of ecological applications.

Key words: benthic ecology; ecosystem health; environmental monitoring; marine benthic macrofauna; multivariate regression; spatial autocorrelation; spatial design; St Anns Bay, Nova Scotia, Canada.

Manuscript received 28 November 2012; revised 4 September 2013; accepted 11 September 2013. Corresponding Editor: A. O. Finley.
3 E-mail: [email protected]

INTRODUCTION

Assessment of the ecological health of coastal marine ecosystems has become an important issue with increased coastal development and population pressures. The monitoring of the benthic environment and its fauna are key elements of such programs (Borja 2005). A variety of ecosystem indices for benthic ecosystem health have been developed based on macrofaunal assemblages. These include traditional measures of richness and diversity, as well as more targeted measures such as AMBI (the AZTI Marine Biotic Index; see Borja and Muxika 2005). AMBI differs in that it assigns each species an ecological grouping according to its sensitivity along a pollution stress gradient. Benthic environmental variables are also routinely collected as part of sampling programs, including sediment properties (e.g., size distribution, porosity, organics) and biogeochemical measures (e.g., redox). Some of these are cheap and straightforward to measure, and have been suggested as an alternative to the relatively costly macrofaunal monitoring (Wildish et al. 2001). The overall challenge is to find effective ways to use this information to assess, understand, and monitor ecosystem health and benthic–environment coupling. Recognizing the diversity of data types available, Borja and Dauer (2008) outlined three index-based approaches for quantifying health: univariate, multi-metric, and multivariate approaches. Here, we adopt a general multivariate approach using a suite of ecological indices and environmental variables to develop a predictive model for benthic–environment coupling.

Statistical approaches are needed to guide efficient and effective monitoring programs for coastal ecosystem health. This includes deciding which variables to measure, and where and how often to sample them, while taking into account monitoring objectives and cost considerations (Caughlan and Oakley 2001). In their review of spatial designs for monitoring aquatic systems, Dobbie et al. (2008) identify three general approaches: (1) geometric sampling; (2) probability-based design; and (3) model-based design. Geometric sampling focuses on space-filling or grid-based designs that are useful for initial surveys to establish scales of variability, and are usually informed by expert judgment. Probability-based design assumes knowledge of the distributional properties of the population of interest, e.g., stratified random sampling (e.g., Dauer and Llansó 2003, Cabral and Murta 2004). In contrast, a model-based design implies the use of a statistical model that explicitly quantifies the

June 2014

PREDICTIVE MODELING FOR SPATIAL DESIGN

863

FIG. 1. (a) Location map for the study area in eastern Canada, and (b) a detailed map of St Anns Bay, Nova Scotia. The numbered locations of the 48 monitoring stations are indicated in (b).

spatial scales of variability, such as the covariance functions used in geostatistics (e.g., Wackernagel 2003). Monitoring procedures can then be developed for model-based design that maximize information content (Müller 2007, Mateu and Müller 2012). In this study, we use a model-based design for multiple variables within a spatial framework to inform a sampling design that targets future monitoring needs.

The central aims of this study are to use coastal marine benthic monitoring data to quantify the extent of benthic–environment coupling, and to design a long-term monitoring strategy. This problem is anchored in a concrete application based on a spatially distributed benthic data set from St Anns Bay, Cape Breton, Nova Scotia, Canada. However, the approach taken is sufficiently general that it can be applied to other sites and different types of monitoring data. The predictive model that relates macrofaunal indices to the environment is based on multivariate generalized least-squares regression. The suite of faunal indices is therefore simultaneously considered (because these indices are a highly related set of response variables), and the general covariance structure can account for spatial autocorrelation (a feature that, if neglected, would invalidate statistical inference). Variable selection is used to identify important environmental predictor variables for monitoring (the faunal data may provide more direct information on ecosystem health and function, but often are more difficult to obtain). An objective procedure is then proposed to choose a subset of the monitoring stations from the initial comprehensive baseline survey. The sites are chosen so as to retain the maximum amount of information and optimally maintain the predictive relationship between the environmental variables and the multivariate faunal indices. The idea is that the extensive baseline monitoring array can and should be used to adaptively design a less extensive, but still informative, set of longer term monitoring sites for ongoing sampling.

MATERIALS AND METHODS

Study site and sampling

The study site is St Anns Bay, NS, Canada (Fig. 1a), a nearly enclosed meso-tidal embayment with freshwater input from two small rivers in the upper part of the bay (see Plate 1). The bay is ~10 km long by 4 km wide and has an average depth of 10 m. A narrow sand spit spans the mouth of the bay. Throughout the bay, there is longline aquaculture of the blue mussel Mytilus edulis. An environmental monitoring program has been undertaken since 2000 to detect potential long-term impacts of this shellfish aquaculture on the bay ecosystem. In most years, a few monitoring stations are sampled. However, a detailed spatial study was undertaken in June 2009. Benthic macrofaunal and environmental data from 48 sampling stations comprise the baseline data set (Fig. 1). The stations cover the bay on an approximately regular grid (i.e., geometric sampling), but the spacing between stations is variable due to logistical and other sampling considerations. There are two near-coincident stations (stations 8 and 10) located north of station 13, and we have retained both of these. At each sampling station, an Ekman grab (15.2 × 15.2 cm) was taken and the entire contents were sieved through a 500-µm mesh. Samples were preserved in formalin and identified to species. The following faunal indices for each station (or grab) were derived:

864

MICHAEL DOWD ET AL.

abundance (total number of individuals); species number; richness (species number weighted by individuals per species); diversity (Shannon-Wiener diversity); and AMBI (see Borja et al. 2000). These macrofaunal indices measure related, but distinct, aspects of faunal composition (Jorgensen et al. 2010). The following environmental variables were also recorded at each station: porosity (%), organic matter (%), chlorophyll a (milligrams per gram sediment), water depth (meters), redox (millivolts), sulfide (micromoles per gram sediment), and sediment grain size (in micrometers; characterized by its median, standard deviation, skewness, and kurtosis; see, e.g., Grant et al. 2002, Hargrave et al. 2008, Lu et al. 2008). These data have been divided into two groups: the five faunal indices are the response variables, and the 10 environmental variables are the explanatory variables. Exploratory data analysis was carried out to examine relationships among the variables and identify spatial patterns (see Results: Data).

Statistical methodology

The goals of this study are twofold: to quantify benthic–environment coupling, and to design a long-term monitoring strategy. The first goal is addressed by regression modeling of the spatially distributed faunal data in terms of the benthic environmental variables. Variable selection is undertaken because it may be more cost effective to measure certain easily obtained and highly informative environmental variables. To meet the second goal, the regression methodology is used to design a monitoring protocol consisting of a subset of the original sampling stations that are the most informative in terms of maintaining the predictive relation between the fauna and the environment. We next describe the details of the methodology used in our study.

Regression model framework.—The statistical analysis framework used here is based on multivariate generalized least-squares (MV GLS) regression.
Key aspects of the analysis methodology are given, and methodological details are found in Appendix A. Analysis was carried out using R statistical software (R Development Core Team 2011). The regression model is denoted by

$$Y = X\beta + \varepsilon \qquad (1)$$

where $Y = [y_{(1)}, \ldots, y_{(m)}]$ is an $n \times m$ multivariate response matrix comprising the $m = 5$ faunal indices recorded at the $n = 48$ monitoring sites. The $n \times p$ matrix $X$ has columns that comprise the $p = 10$ environmental variables recorded at all the stations. Note that all environmental variables have been standardized (i.e., by subtracting the mean and dividing by the standard deviation) so that there is no need for an intercept term in the model. The $p \times m$ matrix $\beta = [\beta_{(1)}, \ldots, \beta_{(m)}]$ contains the regression coefficients. The error term is given by the $n \times m$ matrix $\varepsilon = [\varepsilon_{(1)}, \ldots, \varepsilon_{(m)}]$. Standard output of such a multivariate regression includes: the

Ecological Applications Vol. 24, No. 4

estimated regression coefficients $\hat{\beta}$, the predicted response $\hat{Y}$, the residuals $\hat{\varepsilon} = Y - \hat{Y}$, as well as estimates of their variances and covariances (Johnson and Wichern 2001:354–425). For the application at hand, this framework allows for simultaneous consideration of all the faunal indices, as there is often no a priori reason for choosing one over the other (Borja et al. 2009). It also accounts for the inherent intercorrelation of the indices, which arises due to their derivation from the same faunal data. The explanatory environmental variables are also themselves correlated as they reflect related features of the local benthic environment. Hence, not all will be required in the resulting model. Finally, given that the monitoring sites are from a relatively dense sampling array, we anticipate a degree of spatial autocorrelation that must be accounted for by a general error covariance structure. Further aspects of the regression model are discussed below.

Specification of regression error term.—The statistical assumptions are that errors, $\varepsilon_{(i)}$, are multivariate normal with zero mean, and covariance structure is described as $\operatorname{cov}(\varepsilon_{(i)}, \varepsilon_{(k)}) = \Sigma_{\varepsilon(i,k)}$, for $i, k = 1, \ldots, m$, which specifies the error covariances within and between variables. This more general covariance structure renders the problem one of multivariate generalized least squares (MV GLS). The usual approach of multivariate ordinary least-squares (MV OLS) regression assumes that $\Sigma_{\varepsilon(i,k)} = \sigma_{\varepsilon(i,k)} I$, where $I$ is the identity matrix, and implies that the multivariate observations have independent errors (where subscript $\varepsilon$ refers to the error term $\varepsilon$ in Eq. 1). Solutions for MV OLS are readily available (Johnson and Wichern 2001:354–425), and MV GLS solutions can be obtained through straightforward transformation of variables. The transformation requires an estimate of $\Sigma_{\varepsilon(i,k)}$, which can be obtained via a multistage regression algorithm (see Appendix A).
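The transformation of variables mentioned above can be written out explicitly. The following is a sketch under one assumption: that the error covariance between responses $i$ and $k$ is separable, $\sigma_{\varepsilon(i,k)} V$, with $V$ a known $n \times n$ spatial correlation matrix (the form the paper adopts later in Eq. 5). The Cholesky factor $L$ and the starred variables are notation introduced here for illustration only, and the multistage algorithm of Appendix A may differ in detail.

```latex
\begin{align*}
V &= L L', \qquad Y^{*} = L^{-1} Y, \quad X^{*} = L^{-1} X, \quad \varepsilon^{*} = L^{-1} \varepsilon \\
\operatorname{cov}\!\left(\varepsilon^{*}_{(i)}, \varepsilon^{*}_{(k)}\right)
  &= \sigma_{\varepsilon(i,k)}\, L^{-1} V (L^{-1})' = \sigma_{\varepsilon(i,k)}\, I \\
\hat{\beta} &= (X^{*\prime} X^{*})^{-1} X^{*\prime} Y^{*}
  = (X' V^{-1} X)^{-1} X' V^{-1} Y
\end{align*}
```

The transformed model $Y^{*} = X^{*}\beta + \varepsilon^{*}$ therefore satisfies the MV OLS error assumption, so standard MV OLS solutions applied to $(X^{*}, Y^{*})$ give the MV GLS estimates.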
Variable selection.—Variable selection procedures were used to identify the most important environmental variables for predicting the faunal indices. This is important due to the strong intercorrelation among the environmental variables, which implies that there is substantial redundant information, or multicollinearity. A monitoring program would, for logistical reasons, want to identify these key variables, and also in practice would want to assess the performance of selected variables to optimize cost considerations. A forward selection procedure is used for variable selection, i.e., the regression model is built by adding environmental variables one at a time, starting from an empty model. The criteria used for variable selection are based on Wilks’ Lambda (Johnson and Wichern 2001:354–425), which is the multivariate analog of the ratio of residual to total sum of squares. At each selection step, the variable is added that yields the smallest Wilks’ Lambda. The variable addition procedure is stopped when the variable added is no longer signiﬁcant at level 0.05


PLATE 1. A view of St. Anns Bay, Cape Breton Island, Nova Scotia. Photo credit: J. Grant.

(using an approximate F test based on Wilks' test statistic).

Monitoring array design

The spatial sampling design is based on the regression model. The motivating scenario is one in which an extensive baseline data set has already been collected at a large number of stations, including both environmental variables and the faunal indices. The goal is to determine a subset of stations for use in long-term monitoring. The monitoring design procedure starts with the full set of stations, and carries out sequential removal of the sampling sites that contribute the least to the overall information. It is developed as follows. The MV GLS regression yields the covariance matrix of the estimated regression coefficients (see Appendix A), which can be expressed as an $mp \times mp$ matrix:

$$\operatorname{cov}\begin{bmatrix} \hat{\beta}_{(1)} \\ \vdots \\ \hat{\beta}_{(m)} \end{bmatrix} = \hat{\Sigma}_{\beta}. \qquad (2)$$

We use a basic definition of information, $F$, as the inverse of the total variance of the regression coefficients (where subscript $\beta$ is the variance–covariance for $\beta$):

$$F = 1/\operatorname{trace}(\hat{\Sigma}_{\beta}). \qquad (3)$$
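To make Eq. 3 concrete, the trace can be expanded for one special case. This is a sketch that assumes the separable error covariance $\sigma_{\varepsilon(i,k)} V$ (the form of Eq. 5), under which each block of the coefficient covariance is a scalar multiple of the same $p \times p$ matrix. The paper does not state this expansion, so it should be read as an illustration of how $F$ depends on the design, not as the authors' formula.

```latex
\begin{align*}
\operatorname{cov}\!\left(\hat{\beta}_{(i)}, \hat{\beta}_{(k)}\right)
  &= \hat{\sigma}_{\varepsilon(i,k)}\, (X' V^{-1} X)^{-1} \\
\operatorname{trace}(\hat{\Sigma}_{\beta})
  &= \left( \sum_{i=1}^{m} \hat{\sigma}_{\varepsilon(i,i)} \right)
     \operatorname{trace}\!\left[ (X' V^{-1} X)^{-1} \right] \\
F &= \left\{ \left( \sum_{i=1}^{m} \hat{\sigma}_{\varepsilon(i,i)} \right)
     \operatorname{trace}\!\left[ (X' V^{-1} X)^{-1} \right] \right\}^{-1}
\end{align*}
```

In this form, $F$ falls when the residual variances grow or when the retained sites constrain the coefficients less tightly through $X' V^{-1} X$.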

Note that other choices for information metrics are possible. Optimizing the monitoring design is based on backward elimination. That is, at each iteration, the

sampling site (or multivariate observation) is removed that contributes the least to the information $F$ (or, equivalently, inflates most the variance). With the regression model established, and starting with all $n$ sites included, the algorithm proceeds as follows:

Iteration 1.—Identify and delete the first site that contributes least to the information. The first elimination requires computing $F_{(j)}^{(1)}$, which is defined as the total information with the $j$th observation deleted at iteration 1. This step entails deleting the $j$th row of $X$ and $Y$ and carrying out the MV GLS regression. This is done for each of the observations, $j = 1, \ldots, n$. The observation, $j$, with the largest $F_{(j)}^{(1)}$ is eliminated from the spatial design, yielding the smallest loss of information. A new set of reduced data matrices $X^{(1)}$ and $Y^{(1)}$ is then computed, with the $j$th observation eliminated.

Iteration 2.—Of the remaining $n - 1$ sites, the site that contributes least to the information is identified and deleted. That is, the $X^{(1)}$, $Y^{(1)}$ from iteration 1 are used to compute $F_{(j)}^{(2)}$ (this is the information $F$ at iteration 2 with the $j$th observation deleted, for $j = 1, \ldots, n - 1$). The multivariate observation $j$ associated with the largest $F_{(j)}^{(2)}$ is then eliminated, and $X^{(2)}$ and $Y^{(2)}$ are recomputed (for use in the next iteration).

Iterations 3+.—The procedure of sequentially eliminating observations, or monitoring sites, is continued until a stopping criterion is met. The stopping criterion is likely to vary by application (for our study, we develop an objective criterion in Results: Monitoring array design that is based on identifying the iteration


FIG. 2. Plots of the faunal indices (response variables). The diagonal contains histograms and kernel smoothed density estimates. Pairwise scatter plots are found below the main diagonal and include the position of the centroid and the correlation ellipse. Correlation coefﬁcients (r) between the indices are found above the main diagonal. (Note that abundance and species number have been log-transformed here, and for the analysis). Abundance is the total number of individuals; species number is the number of species; richness is species number weighted by individuals per species; diversity is Shannon-Wiener diversity; AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja et al. 2000, Borja and Muxika 2005).

corresponding to a change point in the rate of information decline). This procedure provides an objective and general means of sequentially eliminating the least important monitoring sites from a comprehensive baseline data set. The sites retained are the ones that contribute most to the ability of the explanatory environmental variables to predict the multivariate faunal response.

RESULTS

Data

Fig. 2 shows plots of the benthic faunal indices, or the response variables. Pairwise scatter plots indicate that the faunal indices are all positively correlated, as expected. There is one outlier (Station 41), wherein only a single animal was found in the core sample. The highest correlations are between species number, diversity, and richness ($\rho \geq 0.88$). The weakest correlations are between AMBI and the other indices ($\rho < 0.3$). The data distribution of each faunal index is slightly right-skewed. Note that the AMBI values are generally in the "unbalanced range," i.e., classified as healthier than the transitional to pollution class (Borja et al. 2000). Fig. 3 shows the spatial distribution of the magnitude of the five faunal indices. There is a strong spatial coherency among all indices; near the mouth of the bay, large values for all indices are found, and they generally decrease in the landward direction (i.e., farthest from the mouth of the bay). This occurs due to landward fining of the sediment, with related


FIG. 3. Spatial maps of the faunal indices. The magnitude of all ﬁve indices is shown for each of 48 sampling stations, based on circle diameter. Abundance and species number are given in the original units, but plotted on a log scale.

variables such as porosity and organic content showing corresponding changes. However, in the most landward region, the indices become more variable in their magnitudes, and have less relationship to one another. The indices themselves are clearly spatially autocorrelated.

A selected set of benthic environmental explanatory variables is plotted in Figs. 4 and 5 (the selection anticipates the key variables selected in Results: Regression results). The pairwise scatter plots of Fig. 4 indicate generally weak intercorrelations ($|\rho| < 0.5$) among the environmental variables, implying that they provide information on different aspects of the benthic environment. The exceptions are the following: porosity is strongly negatively correlated with median sediment size ($\rho = -0.7$) and positively correlated with sediment chlorophyll a ($\rho = 0.6$); median sediment size and chlorophyll a are negatively correlated ($\rho = -0.52$). The majority of the data are clustered over a relatively small range of values. However, there are smaller numbers of sites that take on more extreme values and might be expected to be important in establishing predictive relations between the faunal indices and the environment.

Fig. 5 provides a spatially explicit representation of the environmental data. Chlorophyll a and, to a lesser extent, porosity show a general landward increase, whereas median sediment size shows the opposite pattern. For sulfide and redox, a large range and some extreme values are evident. These bay-wide patterns are probably a consequence of the hydrodynamic flushing regime, interacting with the seston dynamics and the presence of bivalve aquaculture (Dowd 2003). For the remaining explanatory variables (not shown), there were some high correlations (e.g., porosity and organic matter had $\rho = 0.96$), and spatial coherency (e.g., for the sediment grain size distribution parameters).

The relationships between the faunal indices and all of the benthic environmental variables are given in Table 1 in terms of their correlation. The similarity in the information content of the faunal indices is evident. The strongest relationships between the faunal indices and the environmental variables are with median sediment size, porosity, organic content, sediment skewness, and sediment kurtosis. However, there is multicollinearity in the explanatory variables and many of the correlations are driven by very large or very small values.


FIG. 4. Plots of selected benthic environmental (explanatory) variables, following the format of Fig. 2. Porosity, sulﬁde, grain size, and chlorophyll a were measured in sediment (sulﬁde and chl a concentrations are per gram of sediment); redox is a biogeochemical measure.

Specifying the regression errors

Using the complete set of monitoring data outlined in Materials and methods: Study site and sampling, the multistage regression procedure of Appendix A was carried out. The first stage is a MV OLS regression relating the multivariate faunal indices to all 10 of the environmental predictors. This assumes an error covariance structure of the form $\Sigma_{\varepsilon(i,k)} = \sigma_{\varepsilon(i,k)} I$ for $i, k = 1, \ldots, m$. The set of $\sigma_{\varepsilon(i,k)}$ are estimated from the residuals as $(n - p)^{-1} \hat{\varepsilon}_{(i)}' \hat{\varepsilon}_{(k)}$. The regression was highly significant. The studentized residuals were examined for each of the faunal responses and found to satisfy the assumptions of normality, as well as constant variance (the residual variance was independent of both the magnitude of the explanatory variables and the predicted response). There was also one notable outlier associated with the AMBI at station 41 (only one animal was found in the sample),

but an inﬂuence analysis indicated that this point had a minimal effect on the regression. However, the residuals from the MV OLS procedure, when examined from a spatial perspective, were not independent. This means that we have violated a key assumption about the error covariance. Fig. 6 shows the spatial distribution of the residuals associated with each of the predicted faunal indices. The presence of spatial autocorrelation is evident, with large and small values for the residuals tending to cluster together in the same region. The implication is that the MV OLS yields P values that are actually lower than they should be, because the assumption of spatial independence was violated, and the effective degrees of freedom are actually much smaller. In order to model the spatial autocovariance for use in the second stage of the MV GLS regression procedure,


FIG. 5. Spatial maps for selected benthic environmental variables. The magnitude of these ﬁve explanatory variables is indicated for each of the 48 sampling stations, based on circle diameter.

we computed sample variograms for each of the residuals, shown in Fig. 7. There is evidence for spatial correlation out to a scale of 1–2 km for all residuals. A nugget effect is also evident, and represents the part of the sampling variability due to the replication error in closely spaced stations. To model the spatial autocorrelation, an exponential variogram model was ﬁt to each of the residuals associated with the m ¼ 5 faunal responses. This used

a nonlinear weighted least-squares procedure, where the weights were set to be proportional to the number of points used in calculating the sample variogram values for each distance bin. The variogram model took the form

$$\frac{1}{s_i}\,\gamma_i(d) = a_i + (1 - a_i)\left(1 - e^{-d/r}\right), \qquad i = 1, \ldots, m \qquad (4)$$

where the $i$ index refers to the faunal responses. Here, $\gamma_i$

TABLE 1. Pearson's correlation coefficients (r) between the faunal indices vs. the environmental predictor variables for macrobenthos in St Anns Bay, Nova Scotia.

| Faunal index | Porosity | Organic content | Chl a | Depth | Redox | Sulfide | Grain size median | Grain size SD | Grain size skewness | Grain size kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| Abundance | 0.66 | 0.61 | 0.26 | 0.093 | 0.42 | 0.206 | 0.66 | 0.0425 | 0.61 | 0.62 |
| Species number | 0.66 | 0.61 | 0.31 | 0.182 | 0.42 | 0.127 | 0.68 | 0.0023 | 0.61 | 0.61 |
| Richness | 0.68 | 0.65 | 0.38 | 0.248 | 0.41 | 0.092 | 0.78 | 0.0010 | 0.68 | 0.71 |
| Diversity | 0.44 | 0.44 | 0.24 | 0.180 | 0.45 | 0.050 | 0.63 | 0.0838 | 0.59 | 0.62 |
| AMBI | 0.44 | 0.44 | 0.32 | 0.123 | 0.15 | 0.038 | 0.47 | 0.0353 | 0.39 | 0.41 |

Notes: Abundance is the total number of individuals; species number is the number of species; richness is species number weighted by individuals per species; diversity is Shannon-Wiener diversity. AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja et al. 2000, Borja and Muxika 2005). Porosity, organic content, chlorophyll a, and sulﬁde were also measured in sediment; redox is a biogeochemical measure.
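Three of the faunal indices named in the table notes follow directly from a per-species count vector, as the short sketch below shows. It assumes the natural logarithm for Shannon-Wiener diversity (the paper does not state the base), uses a made-up grab sample, and the function name `faunal_indices` is hypothetical. Richness and AMBI are omitted because their exact formulas are not given in the text.

```python
import math

def faunal_indices(counts):
    """Abundance, species number and Shannon-Wiener diversity for one grab.

    counts: number of individuals per species (zeros allowed).
    Diversity uses the natural log, which is an assumption of this sketch.
    """
    present = [c for c in counts if c > 0]
    abundance = sum(present)
    diversity = 0.0
    for c in present:
        p = c / abundance            # proportion of individuals in this species
        diversity -= p * math.log(p)
    return {
        "abundance": abundance,
        "species_number": len(present),
        "diversity": diversity,
    }

# Hypothetical grab: five species in the reference list, one of them absent.
grab = [12, 5, 0, 3, 1]
indices = faunal_indices(grab)
print(indices)
```

A grab containing a single animal, as reported for station 41, gives a diversity of zero under this definition.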


FIG. 6. Spatial plots of the studentized residuals from the MV OLS regression. The magnitude of the residuals is shown (by circle size) for each of the faunal indices at all 48 sampling stations.

is the semi-variance, $d$ represents the spatial separation, and $r$ is the range parameter. The sill, $s_i$, corresponds to the variance at large spatial separations. The parameter $a_i = n_i/s_i$ is the ratio of the nugget, $n_i$, to the sill $s_i$. Expressed in this way, the left-hand side of Eq. 4 is $1 - \rho(d)$, where $\rho$ is the spatial autocorrelation as a function of separation $d$. It was found that $r \approx 0.6$ km for all cases (this corresponds to a range of 1.8 km, or where $\gamma_i$ reaches 95% of the sill). The sill and nugget were also found to be related because $a_i \approx 1/3$ for residuals of all faunal indices, making the autocorrelation independent of the response variable considered. In contrast, the sill, or total variance, was index dependent for the residuals of each of the faunal response variables. The fitted variograms are also shown in Fig. 7. There was no evidence of anisotropy or any behavior that warranted a more complex form than the exponential variogram.

The following covariance structure for stage 2 of the regression procedure was specified based on the variogram results:

$$\Sigma_{\varepsilon(i,k)} = \sigma_{\varepsilon(i,k)} V, \qquad i, k = 1, \ldots, m. \qquad (5)$$

The correlation matrix $V$ is readily constructed from the fitted parametric form of the variogram, using its relationship to the spatial autocorrelation, $\rho(d)$, as previously stated. Elements of the $n \times n$ matrix $V$ were determined by computing the pairwise distances of all monitoring sites and using Eq. 4 expressed in terms of correlation. This correlation matrix was used for stage 2 of the regression procedure. The transformed residuals from the MV GLS satisfied the assumptions of normality, constant variance, and spatial independence. Further iterative refinement of $V$ (i.e., beyond our two-stage procedure) was deemed unnecessary.

Regression results

Results from the MV GLS regression using all 10 environmental variables are given in Table 2. The overall regression was highly significant, and porosity, redox, and sediment median grain size were identified as important variables. (The corresponding MV OLS regression, which does not account for the spatial autocorrelation, had an overall significance level four orders of magnitude lower, and also identified depth as an important predictor.) The regression coefficients (Table 2) obtained from the MV GLS procedure showed consistency between the response variables, reflecting their function of being similar but distinct measures of ecosystem health. The largest discrepancies between the regression coefficients are between AMBI and the other indices, which was expected due to the


FIG. 7. Sample (open circles) and ﬁtted (lines) variograms for the residuals of each of the response variables to assess spatial autocorrelation.
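The step from the fitted exponential variogram to the spatial correlation matrix $V$ can be illustrated with a short sketch. It uses the reported values $a \approx 1/3$ and $r \approx 0.6$ km and the relation $\rho(d) = 1 - \gamma(d)/s$, so that $\rho(d) = (1 - a)e^{-d/r}$ for $d > 0$ and $\rho(0) = 1$. The station coordinates and function names are hypothetical, and Python stands in for the R code the authors used.

```python
import math

def exp_semivariance(d, sill, nugget_ratio, r):
    """Exponential semivariance with a nugget (Eq. 4 multiplied by the sill)."""
    if d == 0:
        return 0.0
    return sill * (nugget_ratio + (1 - nugget_ratio) * (1 - math.exp(-d / r)))

def spatial_correlation(d, nugget_ratio, r):
    """rho(d) = 1 - gamma(d) / sill; equal to 1 at zero separation."""
    if d == 0:
        return 1.0
    return (1 - nugget_ratio) * math.exp(-d / r)

def correlation_matrix(coords_km, nugget_ratio=1 / 3, r=0.6):
    """n x n matrix V built from pairwise Euclidean distances (km)."""
    return [[spatial_correlation(math.dist(a, b), nugget_ratio, r)
             for b in coords_km]
            for a in coords_km]

# Hypothetical station positions in km, not the St Anns Bay coordinates.
stations_km = [(0.0, 0.0), (0.6, 0.0), (0.0, 1.8)]
V = correlation_matrix(stations_km)
for row in V:
    print([round(v, 4) for v in row])
```

Because $a$ and $r$ were found to be common to all five residual series, a single $V$ of this kind serves every faunal index, which is what makes the separable form of Eq. 5 possible.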

distinctiveness of AMBI as a faunal metric (Borja and Muxika 2005).

When univariate GLS regressions (i.e., treating each faunal index as a separate response) were carried out (not shown), results were significant for abundance, richness, and species number ($P < 0.01$), marginally significant for diversity ($P \sim 0.025$), and not significant for AMBI ($P > 0.1$). The associated $R^2$ for univariate

TABLE 2. Regression coefficients obtained from multivariate generalized least-squares regression between faunal indices (response variables) and environmental predictor variables.

| Environmental predictor | Abundance | Species number | Richness | Diversity | AMBI |
|---|---|---|---|---|---|
| Porosity**** | 0.667 | 0.772 | 0.643 | 0.159 | 0.175 |
| Chl a | 0.244 | 0.116 | 0.166 | 0.231 | 0.0661 |
| Organic content | 0.143 | 0.365 | 0.227 | 0.0733 | 0.0727 |
| Depth | 0.0197 | 0.0584 | 0.0804 | 0.00602 | 0.0529 |
| Redox**** | 0.0808 | 0.00147 | 0.0552 | 0.194 | 0.0905 |
| Sulfide | 0.114 | 0.120 | 0.0154 | 0.0533 | 0.106 |
| Grain size median**** | 0.636 | 1.37 | 1.19 | 0.688 | 1.03 |
| Grain size SD | 0.217 | 0.410 | 0.322 | 0.169 | 0.238 |
| Grain size skewness | 0.0841 | 0.219 | 0.162 | 0.00205 | 0.338 |
| Grain size kurtosis | 0.369 | 0.580 | 0.525 | 0.173 | 0.445 |

Notes: The regression coefficients are estimated for $\beta$ (Eq. 1). AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja and Muxika 2005). Grain size refers to a sediment measure.
**** $P < 0.00001$ for an individual regression coefficient (Wilks' test). The overall regression significance was $P = 6.7 \times 10^{-8}$.


FIG. 8. Change in the values of the MV GLS (multivariate generalized least squares) regression coefﬁcients as monitoring sites are successively eliminated from the analysis using the selection algorithm. Each panel shows the regression coefﬁcients for one of the multivariate response variables, as indicated. The numbers 1, 2, and 3 in each panel refer to the regression coefﬁcients for explanatory variables: porosity, median grain size, and sulﬁde, respectively.

regressions with each response variable were: abundance ($R^2 = 0.506$), species number ($R^2 = 0.817$), richness ($R^2 = 0.594$), diversity ($R^2 = 0.387$), and AMBI ($R^2 = 0.153$). The important environmental variables also differed for each univariate regression.

The important environmental variables identified by the variable selection procedure were: sediment median grain size, porosity, and sulfide. The overall regression for this reduced model is highly significant. (Interestingly, forward selection using MV OLS chooses the same variables, but entering in a different order: median grain size, sulfide, porosity.) Note that the structure of the spatial autocorrelation of the residuals remained nearly the same as for the full model.

Monitoring array design

The monitoring array design procedure uses backward elimination of the least informative sites. The reduced regression model was used. The criterion used to determine the number of sites to retain in the final design was based on the stability of the regression coefficients as site elimination proceeded. Fig. 8 shows the regression coefficients associated with the reduced MV GLS model as sites are eliminated

FIG. 9. Changes in the total information, F, as monitoring sites are successively eliminated from the analysis using the site selection algorithm. Results from both the multivariate ordinary least squares (MV OLS, no spatial correlation), and multivariate generalized least squares (MV GLS, with spatial correlation) are shown.

June 2014

PREDICTIVE MODELING FOR SPATIAL DESIGN


FIG. 10. Monitoring sites selected by the design procedure. The gray ﬁlled circles represent the 14 sites retained in the ﬁnal design. The number-ﬁlled circles are the 34 eliminated stations. The enclosed numbers indicate the order of their removal (1 ¼ ﬁrst, 34 ¼ last).

according to the algorithm outlined in Materials and methods: Monitoring array design. A maximum of 42 sites can be eliminated, after which the regression becomes ill-conditioned (the number of observations is close to the number of unknowns). Each panel shows the regression coefficients for porosity, median grain size, and sulfide for one of the five faunal indices. The magnitude of all three regression coefficients was consistent and relatively stable until about 32 observations were removed, indicating that the predictive relations remain robust up to this degree of site elimination. After this, the magnitude of the leading regression coefficient for porosity drops for all faunal indices (except richness) and the variance increases. The regression coefficients for median grain size and sulfide showed a slight trend after 35 sites were eliminated. Fig. 9 shows the total information (the inverse of the trace of $\hat{\Sigma}_\beta$) as a function of the number of sites eliminated from the analysis. Two cases are shown: one where spatial autocorrelation is incorporated using MV GLS, and another where it has been erroneously ignored (MV OLS). The pattern for MV GLS indicated that the information was fairly stable until 35 of the 48 sites were eliminated. In contrast, using MV OLS suggested that information decreases abruptly after 25 sites are eliminated. Note that the increase in information for the MV GLS at iteration 9 is due to the removal of the outlier at station 41. Based on the information change point from the MV GLS results, we chose to

retain 14 sites in the final monitoring array (and thus eliminate 34 of 48 monitoring sites). The overall MV GLS regression remained highly significant even when we eliminated this number of sites. The results of the monitoring design procedure are shown in Fig. 10. The 14 sites retained in the final design are identified, and the order of elimination for the sampling stations is indicated. The final design was one in which the sites were distributed spatially over most of the bay, but not uniformly or completely. There were regions where no sites were chosen (southwest part of the bay). Only one of the two spatially coincident sites was included, and the outlier (station 41) was eliminated. An influence analysis was also carried out to examine the effect of this outlier on the design. With station 41 removed from the data set, it was found that the first 25 stations were eliminated in the same order; after that, two substitutions were made: station 27 in and station 23 out (these are both inner stations), and station 16 in and station 42 out (these are both mid-bay stations). To gain further insight into the sampling stations that were chosen, Fig. 11 shows pairwise scatterplots of the retained and the omitted multivariate observations obtained from the spatial design procedure. For all faunal response variables, the sites chosen effectively subsample the data to simultaneously maintain the range and distributional properties of the original set of indices, while omitting redundancy. A similar subsampling was also evident for the environmental predictor


FIG. 11. Scatter plots of faunal and environmental observations selected by the design procedure. The 14 observations retained are shown in black circles, while the 34 observations omitted by the site selection procedure are shown by light gray circles.

variables, as well as for the relationships between these predictors and the multivariate faunal response.

DISCUSSION AND CONCLUSIONS

This study has the dual objectives of assessing coastal environmental health and informing future monitoring design. The approach ﬁrst established a predictive relationship between macrofaunal indices and the benthic environment using multivariate generalized least-squares regression (MV GLS). The predictive relations were then used to identify a set of sites contributing the most information about the faunal assemblages and their environment.
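The site-selection step summarized above (backward elimination on the total information F) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the spatial correlation matrix `R` is treated as known and fixed (an exponential form with a 2 km scale is assumed here), whereas the study estimates the error covariance through a multistage procedure. The function names, the synthetic data, and the planted outlier are all hypothetical.

```python
import numpy as np

def total_information(X, Y, R):
    """F = 1 / trace(cov(coefficients)) for a GLS fit with a known spatial correlation R."""
    n, p = X.shape
    L = np.linalg.cholesky(R)                          # R = L L^T
    Xw, Yw = np.linalg.solve(L, X), np.linalg.solve(L, Y)  # whitened data
    B, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)        # GLS coefficients
    resid = Yw - Xw @ B
    S = resid.T @ resid / (n - p)                      # error covariance between responses
    V = np.linalg.inv(Xw.T @ Xw)                       # (X^T R^-1 X)^-1
    return 1.0 / (np.trace(S) * np.trace(V))           # trace(S kron V) = trace(S) * trace(V)

def backward_eliminate(X, Y, R, n_keep):
    """Drop, one at a time, the site whose deletion leaves the largest F."""
    keep, removed, F_path = list(range(len(X))), [], []
    while len(keep) > n_keep:
        trials = []
        for j in keep:
            rest = [k for k in keep if k != j]
            trials.append(total_information(X[rest], Y[rest], R[np.ix_(rest, rest)]))
        best = int(np.argmax(trials))                  # smallest loss of information
        F_path.append(trials[best])
        removed.append(keep.pop(best))
    return removed, keep, F_path

# Synthetic demonstration (all values hypothetical): 20 sites, 2 predictors, 3 responses.
rng = np.random.default_rng(42)
n, p, m = 20, 2, 3
coords = rng.uniform(0, 10, size=(n, 2))               # site positions in km
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
R = np.exp(-d / 2.0)                                   # assumed exponential correlation
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, m)) + 0.1 * rng.standard_normal((n, m))
Y[7] += 50.0                                           # plant one gross outlier at site 7
removed, keep, F_path = backward_eliminate(X, Y, R, n_keep=8)
```

In this toy setting the planted outlier is the first site removed, because deleting it sharply reduces the residual covariance and therefore raises F. Each iteration refits the model once per remaining site, so the cost grows roughly quadratically with the number of sites.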

The MV GLS regression framework is a well-established and straightforward approach. Beguería and Pueyo (2009) suggest that GLS is a superior alternative to simultaneous autoregressive models for spatial regression problems. GLS allows response variables of various types to be transformed to meet model assumptions, and complex error processes to be incorporated. Dormann et al. (2007) review methods to account for spatial autocorrelation in the analysis of species distributional data and conclude that GLS compares well to autoregressive models, spatial generalized linear mixed models, and generalized estimating equations. They suggest that a nugget term be incorporated into GLS for increased stability, and their wish list


includes extensions for spatial models to include multivariate responses and variable selection; all of these are features of our MV GLS framework. Moreover, the central concepts (accounting for correlation of the response variables, collinearity in the explanatory variables, and spatial correlation) can also be incorporated into generalized linear models and mixed models. The MV GLS procedure proved efﬁcient and robust in terms of variable and site selection, readily treating both outliers and redundant stations. It also allowed for the necessary diagnostics, modiﬁcations, and transformations required for the iterative model building and sampling design exercises. For our study in St Anns Bay, using a MV GLS regression approach established a highly signiﬁcant predictive relationship between the ﬁve faunal indices and the benthic environment. The AMBI index was most dissimilar to other indices; this is not surprising, because AMBI contains qualitative weighting for species based on indicator value. Multiple ecological indicators are also often recorded simultaneously over different scales (Messer et al. 1991). In our approach, we use a multivariate response within a regression-based predictive model to treat the indices simultaneously and ensure that their intercorrelation is properly accounted for. Indeed, different results were obtained (in terms of key environmental variables, and monitoring design) when considering the indices separately, that is, as a univariate response. The MV GLS regression framework allows for a general error structure and, in our study, a straightforward incorporation of spatial autocorrelation. Indeed, the variogram analysis allowed us to use an error model based on a separable error covariance structure. Spatial autocorrelation is still an outstanding issue in ecology (Legendre 1993, Beale et al. 2010, Valcu and Kempenaers 2010, Cressie and Wikle 2011). 
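To make the idea of a separable error covariance concrete, the sketch below builds one as a Kronecker product of a between-response covariance and a between-site spatial correlation. The exponential correlation function, the nugget parameterization, and every name and number here are assumptions chosen for illustration; the text above does not specify the functional form fitted to the variogram.

```python
import numpy as np

def separable_covariance(coords, S, scale, nugget=0.0):
    """Return (R, S kron R): the spatial correlation and the full separable covariance.

    coords: (n, 2) site positions; S: (m, m) covariance between responses;
    scale: decorrelation length; nugget: share of variance that is spatially
    uncorrelated (0 <= nugget < 1).
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    R = (1.0 - nugget) * np.exp(-d / scale) + nugget * np.eye(len(coords))
    return R, np.kron(S, R)        # block (i, k) of the result equals S[i, k] * R

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])   # hypothetical sites (km)
S = np.array([[1.0, 0.4], [0.4, 2.0]])                     # hypothetical two-response covariance
R, C = separable_covariance(coords, S, scale=2.0, nugget=0.1)
```

Separability means every pair of responses shares the same spatial correlation, scaled by the corresponding entry of `S`, which is why a single decorrelation scale suffices in this simplified setting.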
Benthic ecologists have recognized the importance of scales of variation and patchiness (Dauer and Llansó 2003, Quintino et al. 2006), and their effect on abundance estimation (Cabral and Murta 2004). Ignoring spatial autocorrelation in statistical analysis tends to inflate Type I error and deflate P values (Lennon 2000), a result confirmed by our study. Hurlbert (1984) notes that ignoring it is a form of pseudoreplication: the real degrees of freedom are much smaller than the apparent degrees of freedom due to the dependence structure in the data. In addition to affecting the standard errors of a regression, there is also debate in the ecological literature about the effect on the values of the regression coefficients themselves (Beale et al. 2007, Hawkins et al. 2007). The consequence is clear: statistical inference, variable selection, model selection, and power analysis are not reliable unless spatial relationships are properly taken into account.

There is increasing emphasis being given to spatial sampling design in ecology and the environmental sciences (see, e.g., Müller 2007, Mateu and Müller 2012). Model-based sampling requires explicit models


for the spatial structure (Dobbie et al. 2008), and sophisticated statistical approaches have been developed for adaptive sampling based on spatiotemporal models (Wikle and Royle 2005, Cressie and Wikle 2011). Here, a regression-based statistical algorithm was developed to identify key sites at which the benthic environmental variables provide the most information about variation in the faunal indices. Our data set allowed us to assume a constant decorrelation scale and a simplified cross-covariance structure. However, many situations will require a much more sophisticated characterization of the spatial dependence structure. Computational Bayesian approaches are emerging (Finley et al. 2008), and flexible multivariate nonstationary processes for representing spatial dependence have been developed (Gelfand et al. 2004). Another approach is copulas, which are multivariate distributions that can readily model complex dependence structure (Briggs et al. 2013).

We used a sequential procedure (backward elimination) for sampling design. We chose, as a measure of information, the inverse of the total variance of the regression coefficients, which performed well in selecting sites for this application. However, other information metrics (e.g., the determinant, or generalized variance) should also be considered. Although the site selection algorithm is computationally straightforward, it is likely that the procedure could be made more efficient using approaches from the case-deletion diagnostic literature. The final design removes redundancies in the data and emphasizes where subsampling of the grid of stations could occur with minimal loss of information. The design seems fairly robust to one notable outlier, but an assessment of robustness would be recommended for any application.

In summary, the approach that we have developed for benthic environmental monitoring considered a number of important issues for environmental health assessment.
These include spatial autocorrelation, simultaneous treatment of multiple faunal indices, environmental variable selection, and improving sampling for future monitoring efforts. Such continued refinement of models for spatial ecology is important for properly understanding and characterizing ecosystems (Beale et al. 2010, Massol et al. 2011). This work should help to better inform the theory and practice of coastal environmental health assessment and monitoring.

ACKNOWLEDGMENTS

This study was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Strategic Grant. M. Dowd and J. Grant were also both supported by NSERC Discovery Grants. The authors also gratefully acknowledge the two reviewers for their insightful comments.

LITERATURE CITED

Beale, C. M., J. J. Lennon, D. A. Elston, M. J. Brewer, and J. M. Yearsley. 2007. Red herrings remain in geographical ecology: a reply to Hawkins et al. Ecography 30:845–847.
Beale, C. M., J. J. Lennon, J. M. Yearsley, M. J. Brewer, and D. A. Elston. 2010. Regression analysis of spatial data. Ecology Letters 13:246–264.
Beguería, S., and Y. Pueyo. 2009. A comparison of simultaneous autoregressive and generalized least squares models for dealing with spatial autocorrelation. Global Ecology and Biogeography 18:273–279.
Borja, A. 2005. The European water framework directive: A challenge for nearshore, coastal and continental shelf research. Continental Shelf Research 25(14):1768–1783.
Borja, A., and D. M. Dauer. 2008. Assessing the environmental quality status in estuarine and coastal systems: Comparing methodologies and indices. Ecological Indicators 8:331–337.
Borja, A., J. Franco, and V. Perez. 2000. A marine biotic index to establish the ecological quality of soft-bottom benthos within European estuarine and coastal environments. Marine Pollution Bulletin 40:1100–1114.
Borja, A., and I. Muxika. 2005. Guidelines for the use of AMBI (AZTI's Marine Biotic Index) in the assessment of the benthic ecological quality. Marine Pollution Bulletin 50(7):787–789.
Borja, A., et al. 2009. Assessing the suitability of a range of benthic indices in the evaluation of environmental impact of fin and shellfish aquaculture located in sites across Europe. Aquaculture 293(3–4):231–240.
Briggs, J., M. Dowd, and R. Meyer. 2013. Data assimilation for large scale spatio-temporal systems using a location particle smoother. Environmetrics 24(2):81–97.
Cabral, H. N., and A. G. Murta. 2004. Effect of sampling design on abundance estimates of benthic invertebrates in environmental monitoring studies. Marine Ecology Progress Series 276:19–24.
Caughlan, L., and K. L. Oakley. 2001. Cost considerations for long-term ecological monitoring. Ecological Indicators 1:123–134.
Cressie, N., and C. K. Wikle. 2011. Statistics for spatiotemporal data. Wiley, New York, New York, USA.
Dauer, D. M., and R. J. Llansó. 2003. Spatial scales and probability based sampling in determining levels of benthic community degradation in the Chesapeake Bay. Environmental Monitoring and Assessment 81:175–186.
Dobbie, M. J., B. L. Henderson, and D. L. Stevens. 2008. Sparse sampling: Spatial design for monitoring stream networks. Statistics Surveys 2:113–153.
Dormann, C. F., et al. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30:609–628.
Dowd, M. 2003. Seston dynamics in a tidal embayment with shellfish aquaculture: a model study using tracer equations. Estuarine, Coastal and Shelf Science 57(3):523–537.
Finley, A. O., S. Banerjee, A. R. Ek, and R. E. McRoberts. 2008. Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics 13(1):60–83.
Gelfand, A. E., A. Schmidt, S. Banerjee, and C. F. Sirmans. 2004. Nonstationary multivariate process modelling through spatially varying coregionalization. Test 13:263–312.
Grant, J., P. MacPherson, and B. T. Hargrave. 2002. Sediment properties and benthic–pelagic coupling in the North Water Polynya. Deep-Sea Research 49:5259–5275.
Hargrave, B. T., M. Holmer, and C. P. Newcombe. 2008. Towards a classification of organic enrichment in marine sediments based on biogeochemical indicators. Marine Pollution Bulletin 56(5):810–824.
Hawkins, B. A., J. A. F. Diniz-Filho, L. M. Bini, P. De Marco, and T. M. Blackburn. 2007. Red herrings revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography 30:375–384.
Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54:187–211.
Johnson, R. A., and D. W. Wichern. 2001. Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, New Jersey, USA.
Jorgensen, S. E., F.-L. Xu, and R. Costanza, editors. 2010. Ecological indicators for assessment of ecosystem health. CRC Press, Boca Raton, Florida, USA.
Legendre, P. 1993. Spatial autocorrelation: trouble or new paradigm. Ecology 74:1659–1673.
Lennon, J. J. 2000. Red-shifts and red herrings in geographical ecology. Ecography 23:101–113.
Lu, L., J. Grant, and J. Barrell. 2008. Macrofaunal spatial patterns in relationship to environmental variables in the Richibucto Estuary, New Brunswick, Canada. Estuaries and Coasts 31(5):994–1005.
Massol, F., D. Gravel, N. Mouquet, M. W. Cadotte, T. Fukami, and M. A. Leibold. 2011. Linking community and ecosystem dynamics through spatial ecology. Ecology Letters 14:313–323.
Mateu, J., and W. G. Müller. 2012. Spatio-temporal design: Advances in efficient data acquisition. Wiley, New York, New York, USA.
Messer, J. J., R. A. Linthurst, and W. S. Overton. 1991. An EPA program for monitoring ecological status and trends. Environmental Management 17:67–78.
Müller, W. G. 2007. Collecting spatial data. Springer, New York, New York, USA.
Quintino, V., M. Elliott, and A. M. Rodrigues. 2006. The derivation, performance and role of univariate and multivariate indicators of benthic change: Case studies at differing spatial scales. Journal of Experimental Marine Biology and Ecology 330:368–382.
Valcu, M., and B. Kempenaers. 2010. Spatial autocorrelation: an overlooked concept in behavioral ecology. Behavioral Ecology 21:902–905.
Wackernagel, H. 2003. Multivariate geostatistics: an introduction with applications. Springer, New York, New York, USA.
Wikle, C. K., and J. A. Royle. 2005. Dynamic design of ecological monitoring networks for non-Gaussian spatiotemporal data. Environmetrics 16(5):507–522.
Wildish, D. J., B. T. Hargrave, and G. Pohle. 2001. Cost-effective monitoring of organic enrichment resulting from salmon mariculture. ICES Journal of Marine Science 58:469–476.

SUPPLEMENTAL MATERIAL Appendix Details of multivariate regression procedures (Ecological Archives A024-050-A1).


Manuscript received 28 November 2012; revised 4 September 2013; accepted 11 September 2013. Corresponding Editor: A. O. Finley.
3 E-mail: [email protected]

INTRODUCTION

Assessment of the ecological health of coastal marine ecosystems has become an important issue with increased coastal development and population pressures. Monitoring of the benthic environment and its fauna is a key element of such programs (Borja 2005). A variety of ecosystem indices for benthic ecosystem health have been developed based on macrofaunal assemblages. These include traditional measures of richness and diversity, as well as more targeted measures such as AMBI (the AZTI Marine Biotic Index; see Borja and Muxika 2005). AMBI differs in that it assigns each species an ecological grouping according to its sensitivity along a pollution stress gradient. Benthic environmental variables are also routinely collected as part of sampling programs, including sediment properties (e.g., size distribution, porosity, organics) and biogeochemical measures (e.g., redox). Some of these are cheap and straightforward to measure, and have been suggested as an alternative to the relatively costly macrofaunal monitoring (Wildish et al. 2001). The overall challenge is to find effective ways to use this information to assess, understand, and monitor ecosystem health and benthic–environment coupling. Recognizing the diversity of data types available, Borja and Dauer (2008) outlined three index-based approaches for quantifying health: univariate, multi-metric, and multivariate approaches. Here, we adopt a general multivariate approach using a suite of ecological indices and environmental variables to develop a predictive model for benthic–environment coupling.

Statistical approaches are needed to guide efficient and effective monitoring programs for coastal ecosystem health. This includes deciding which variables to measure, and where and how often to sample them, while taking into account monitoring objectives and cost considerations (Caughlan and Oakley 2001). In their review of spatial designs for monitoring aquatic systems, Dobbie et al. (2008) identify three general approaches: (1) geometric sampling; (2) probability-based design; and (3) model-based design. Geometric sampling focuses on space-filling or grid-based designs that are useful for initial surveys to establish scales of variability, and are usually informed by expert judgment. Probability-based design assumes knowledge of the distributional properties of the population of interest, e.g., stratified random sampling (Dauer and Llansó 2003, Cabral and Murta 2004). In contrast, a model-based design implies the use of a statistical model that explicitly quantifies the


FIG. 1. (a) Location map for the study area in eastern Canada, and (b) a detailed map of St Anns Bay, Nova Scotia. The numbered locations of the 48 monitoring stations are indicated in (b).

spatial scales of variability, such as the covariance functions used in geostatistics (e.g., Wackernagel 2003). Monitoring procedures can then be developed for model-based design that maximize information content (Müller 2007, Mateu and Müller 2012). In this study, we use a model-based design for multiple variables within a spatial framework to inform a sampling design that targets future monitoring needs.

The central aims of this study are to use coastal marine benthic monitoring data to quantify the extent of benthic–environment coupling, and to design a long-term monitoring strategy. This problem is anchored in a concrete application based on a spatially distributed benthic data set from St Anns Bay, Cape Breton, Nova Scotia, Canada. However, the approach taken is sufficiently general that it can be applied to other sites and different types of monitoring data. The predictive model that relates macrofaunal indices to the environment is based on multivariate generalized least-squares regression. The suite of faunal indices is therefore simultaneously considered (because these indices are a highly related set of response variables), and the general covariance structure can account for spatial autocorrelation (a feature that, if neglected, would invalidate statistical inference). Variable selection is used to identify important environmental predictor variables for monitoring (the faunal data may provide more direct information on ecosystem health and function, but often are more difficult to obtain). An objective procedure is then proposed to choose a subset of the monitoring stations from the initial comprehensive baseline survey. The sites are chosen so as to retain the maximum amount of information and optimally maintain the predictive relationship between the environmental variables and the multivariate faunal indices. The idea is that the extensive baseline monitoring array can and should be used to adaptively design a less extensive, but still informative, set of longer term monitoring sites for ongoing sampling.

MATERIALS AND METHODS

Study site and sampling

The study site is St Anns Bay, NS, Canada (Fig. 1a), a nearly enclosed meso-tidal embayment with freshwater input from two small rivers in the upper part of the bay (see Plate 1). The bay is ~10 km long by 4 km wide and has an average depth of 10 m. A narrow sand spit spans the mouth of the bay. Throughout the bay, there is longline aquaculture of the blue mussel Mytilus edulis. An environmental monitoring program has been undertaken since 2000 to detect potential long-term impacts of this shellfish aquaculture on the bay ecosystem. In most years, a few monitoring stations are sampled. However, a detailed spatial study was undertaken in June 2009. Benthic macrofaunal and environmental data from 48 sampling stations comprise the baseline data set (Fig. 1). The stations cover the bay on an approximately regular grid (i.e., geometric sampling), but the spacing between stations is variable due to logistical and other sampling considerations. There are two near-coincident stations (stations 8 and 10) located north of station 13, and we have retained both of these. At each sampling station, an Ekman grab (15.2 × 15.2 cm) was taken and the entire contents were sieved through a 500-μm mesh. Samples were preserved in formalin and identified to species. The following faunal indices for each station (or grab) were derived:


abundance (total number of individuals); species number; richness (species number weighted by individuals per species); diversity (Shannon-Wiener diversity); and AMBI (see Borja et al. 2000). These macrofaunal indices measure related, but distinct, aspects of faunal composition (Jorgensen et al. 2010). The following environmental variables were also recorded at each station: porosity (%), organic matter (%), chlorophyll a (milligrams per gram sediment), water depth (meters), redox (millivolts), sulfide (micromoles per gram sediment), and sediment grain size (in micrometers; characterized by its median, standard deviation, skewness, and kurtosis; see, e.g., Grant et al. 2002, Hargrave et al. 2008, Lu et al. 2008). These data have been divided into two groups: the five faunal indices are the response variables, and the 10 environmental variables are the explanatory variables. Exploratory data analysis was carried out to examine relationships among the variables and identify spatial patterns (see Results: Data).

Statistical methodology

The goals of this study are twofold: to quantify benthic–environment coupling, and to design a long-term monitoring strategy. The first goal is addressed by regression modeling of the spatially distributed faunal data in terms of the benthic environmental variables. Variable selection is undertaken because it may be more cost effective to measure certain easily obtained and highly informative environmental variables. To meet the second goal, the regression methodology is used to design a monitoring protocol consisting of a subset of the original sampling stations that are the most informative in terms of maintaining the predictive relation between the fauna and the environment. We next describe the details of the methodology used in our study.

Regression model framework.—The statistical analysis framework used here is based on multivariate generalized least-squares (MV GLS) regression. Key aspects of the analysis methodology are given here, and methodological details are found in Appendix A. Analysis was carried out using R statistical software (R Development Core Team 2011). The regression model is denoted by

\[ Y = X\beta + \varepsilon \tag{1} \]

where $Y = [y^{(1)}, \ldots, y^{(m)}]$ is an n × m multivariate response matrix comprising the m = 5 faunal indices recorded at the n = 48 monitoring sites. The n × p matrix X has columns that comprise the p = 10 environmental variables recorded at all the stations. Note that all environmental variables have been standardized (i.e., by subtracting the mean and dividing by the standard deviation) so that there is no need for an intercept term in the model. The p × m matrix $\beta = [\beta^{(1)}, \ldots, \beta^{(m)}]$ contains the regression coefficients. The error term is given by the n × m matrix $\varepsilon = [\varepsilon^{(1)}, \ldots, \varepsilon^{(m)}]$. Standard output of such a multivariate regression includes: the


estimated regression coefficients $\hat{\beta}$, the predicted response $\hat{Y}$, the residuals $\hat{\varepsilon} = Y - \hat{Y}$, as well as estimates of their variances and covariances (Johnson and Wichern 2001:354–425). For the application at hand, this framework allows for simultaneous consideration of all the faunal indices, as there is often no a priori reason for choosing one over the other (Borja et al. 2009). It also accounts for the inherent intercorrelation of the indices, which arises due to their derivation from the same faunal data. The explanatory environmental variables are also themselves correlated as they reflect related features of the local benthic environment. Hence, not all will be required in the resulting model. Finally, given that the monitoring sites are from a relatively dense sampling array, we anticipate a degree of spatial autocorrelation that must be accounted for by a general error covariance structure. Further aspects of the regression model are described next.

Specification of regression error term.—The statistical assumptions are that the errors, $\varepsilon^{(i)}$, are multivariate normal with zero mean, with covariance structure $\operatorname{cov}(\varepsilon^{(i)}, \varepsilon^{(k)}) = \Sigma_\varepsilon^{(i,k)}$ for i, k = 1, ..., m, which specifies the error covariances within and between variables. This more general covariance structure renders the problem one of multivariate generalized least squares (MV GLS). The usual approach of multivariate ordinary least-squares (MV OLS) regression assumes that $\Sigma_\varepsilon^{(i,k)} = \sigma_\varepsilon^{(i,k)} I$, where I is the identity matrix, and implies that the multivariate observations have independent errors (where the subscript ε refers to the error term ε in Eq. 1). Solutions for MV OLS are readily available (Johnson and Wichern 2001:354–425), and MV GLS solutions can be obtained through straightforward transformation of variables. The transformation requires an estimate of $\Sigma_\varepsilon^{(i,k)}$, which can be obtained via a multistage regression algorithm (see Appendix A).
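The transformation of variables mentioned above can be illustrated with a short sketch. It assumes, for simplicity, that a single between-site correlation matrix `R` is already known and shared by all responses (the study instead estimates the error covariance in stages), and it uses a Cholesky factor to whiten the data before an ordinary least-squares fit. The function name `mv_gls`, the exponential correlation with a 2 km scale, and the synthetic data are hypothetical.

```python
import numpy as np

def mv_gls(X, Y, R):
    """Multivariate GLS by whitening: decorrelate the sites, then apply OLS.

    X: (n, p) predictors; Y: (n, m) responses; R: (n, n) between-site error
    correlation, assumed known and common to all m responses.
    """
    L = np.linalg.cholesky(R)                  # R = L L^T
    Xw = np.linalg.solve(L, X)                 # whitened predictors, L^-1 X
    Yw = np.linalg.solve(L, Y)                 # whitened responses,  L^-1 Y
    B, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)  # OLS on the whitened data
    return B                                   # (p, m) coefficient matrix

# Synthetic data with the same dimensions as a reduced model (values are made up).
rng = np.random.default_rng(0)
n, p, m = 48, 3, 5
coords = rng.uniform(0, 10, size=(n, 2))       # hypothetical site positions (km)
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
R = np.exp(-d / 2.0)                           # assumed exponential spatial correlation
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, m))
E = np.linalg.cholesky(R) @ rng.standard_normal((n, m))   # spatially correlated errors
Y = X @ B_true + 0.1 * E
B_hat = mv_gls(X, Y, R)
```

Whitening gives the same estimate as the closed form (XᵀR⁻¹X)⁻¹XᵀR⁻¹Y, and with `R` equal to the identity it reduces to MV OLS.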
Variable selection.—Variable selection procedures were used to identify the most important environmental variables for predicting the faunal indices. This is important due to the strong intercorrelation among the environmental variables, which implies that there is substantial redundant information, or multicollinearity. A monitoring program would, for logistical reasons, want to identify these key variables, and also in practice would want to assess the performance of selected variables to optimize cost considerations. A forward selection procedure is used for variable selection, i.e., the regression model is built by adding environmental variables one at a time, starting from an empty model. The criteria used for variable selection are based on Wilks’ Lambda (Johnson and Wichern 2001:354–425), which is the multivariate analog of the ratio of residual to total sum of squares. At each selection step, the variable is added that yields the smallest Wilks’ Lambda. The variable addition procedure is stopped when the variable added is no longer signiﬁcant at level 0.05


PLATE 1. A view of St. Anns Bay, Cape Breton Island, Nova Scotia. Photo credit: J. Grant.

(using an approximate F test based on Wilks’ test statistic). Monitoring array design The spatial sampling design is based on the regression model. The motivating scenario is one in which an extensive baseline data set has already been collected at a large number of stations, including both environmental variables and the faunal indices. The goal is to determine a subset of stations for use in long-term monitoring. The monitoring design procedure starts with the full set of stations, and carries out sequential removal of the sampling sites that contribute the least to the overall information. It is developed as follows. The MV GLS regression yields the covariance matrix of the estimated regression coefﬁcients (see Appendix A), which can be expressed as an mp 3 mp matrix: 3 2 bˆ ð1Þ 7 6 ð2Þ cov4 ... 5 ¼ Rˆ b : bˆ ðmÞ We use a basic deﬁnition of information, F, as the inverse of the total variance of the regression coefﬁcients (where subscript b is the variance–covariance for b): F ¼ 1=traceðRˆ b Þ:

ð3Þ
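The information measure of Eq. 3, and the sequential site removal that it drives, can be illustrated with a minimal Python sketch. It assumes a separable error covariance (a response covariance S combined with a known n × n spatial correlation matrix V), under which the trace of the coefficient covariance factors into trace(S) times trace((X'V⁻¹X)⁻¹). The function names, the Cholesky whitening step, and the synthetic data are illustrative assumptions and are not taken from the authors' implementation or Appendix A.

```python
import numpy as np

def information(X, Y, V):
    """F = 1 / trace(cov(beta_hat)) for a multivariate GLS fit.

    Assumption (for illustration): separable error covariance S (x) V, so that
    cov(vec(B_hat)) = kron(S, inv(X' V^-1 X)) and its trace factors.
    """
    n, p = X.shape
    L = np.linalg.cholesky(V)            # V = L L'
    Xw = np.linalg.solve(L, X)           # whitened predictors
    Yw = np.linalg.solve(L, Y)           # whitened responses
    XtX_inv = np.linalg.inv(Xw.T @ Xw)   # inv(X' V^-1 X)
    B = XtX_inv @ Xw.T @ Yw              # GLS coefficients (p x m)
    E = Yw - Xw @ B                      # whitened residuals
    S = E.T @ E / (n - p)                # m x m response covariance estimate
    return 1.0 / (np.trace(S) * np.trace(XtX_inv))

def backward_eliminate(X, Y, V, n_keep):
    """Greedy removal: at each iteration drop the site whose deletion
    leaves the largest information F (i.e., the smallest loss)."""
    keep = list(range(X.shape[0]))
    removed, history = [], []
    while len(keep) > n_keep:
        best_F, best_j = -np.inf, None
        for j in keep:
            idx = [k for k in keep if k != j]
            F_j = information(X[idx], Y[idx], V[np.ix_(idx, idx)])
            if F_j > best_F:
                best_F, best_j = F_j, j
        keep.remove(best_j)
        removed.append(best_j)
        history.append(best_F)
    return keep, removed, history

# Synthetic demonstration (hypothetical data, not the St Anns Bay set)
rng = np.random.default_rng(0)
n, p, m = 20, 3, 2
coords = rng.uniform(0, 5, size=(n, 2))
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
V = (2 / 3) * np.exp(-d / 0.6)           # assumed spatial correlation
np.fill_diagonal(V, 1.0)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ rng.normal(size=(p, m)) + rng.normal(size=(n, m))
keep, removed, history = backward_eliminate(X, Y, V, n_keep=12)
```

Here `history` records the information retained after each removal, so plotting it against the number of sites removed gives the kind of curve one would inspect for a change point. Each candidate deletion refits the whole regression, which is simple but costs one fit per remaining site per iteration.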

Note that other choices for information metrics are possible. Optimizing the monitoring design is based on backward elimination. That is, at each iteration, the sampling site (or multivariate observation) is removed that contributes the least to the information $F$ (or, equivalently, inflates the variance the most). With the regression model established, and starting with all $n$ sites included, the algorithm proceeds as follows:

Iteration 1.—Identify and delete the first site that contributes least to the information. The first elimination requires computing $F_{(j)}^{(1)}$, which is defined as the total information with the $j$th observation deleted at iteration 1. This step entails deleting the $j$th row of $X$ and $Y$ and carrying out the MV GLS regression. This is done for each of the observations, $j = 1, \ldots, n$. The observation $j$ with the largest $F_{(j)}^{(1)}$ is eliminated from the spatial design, yielding the smallest loss of information. A new set of reduced data matrices $X^{(1)}$ and $Y^{(1)}$ is then computed, with the $j$th observation eliminated.

Iteration 2.—Of the remaining $n - 1$ sites, the site that contributes least to the information is identified and deleted. That is, the $X^{(1)}$, $Y^{(1)}$ from iteration 1 are used to compute $F_{(j)}^{(2)}$ (this is the information $F$ at iteration 2 with the $j$th observation deleted, for $j = 1, \ldots, n - 1$). The multivariate observation $j$ associated with the largest $F_{(j)}^{(2)}$ is then eliminated, and $X^{(2)}$ and $Y^{(2)}$ are recomputed (for use in the next iteration).

Iterations 3+.—The procedure of sequentially eliminating observations, or monitoring sites, is continued until a stopping criterion is met. The stopping criterion is likely to vary by application (for our study, we develop an objective criterion in Results: Monitoring array design that is based on identifying the iteration


FIG. 2. Plots of the faunal indices (response variables). The diagonal contains histograms and kernel smoothed density estimates. Pairwise scatter plots are found below the main diagonal and include the position of the centroid and the correlation ellipse. Correlation coefﬁcients (r) between the indices are found above the main diagonal. (Note that abundance and species number have been log-transformed here, and for the analysis). Abundance is the total number of individuals; species number is the number of species; richness is species number weighted by individuals per species; diversity is Shannon-Wiener diversity; AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja et al. 2000, Borja and Muxika 2005).

corresponding to the change point in the rate of information decline). This procedure provides an objective and general means of sequentially eliminating the least important monitoring sites from a comprehensive baseline data set. The sites retained are the ones that contribute most to the ability of the explanatory environmental variables to predict the multivariate faunal response.

RESULTS

Data

Fig. 2 shows plots of the benthic faunal indices, or the response variables. Pairwise scatter plots indicate that the faunal indices are all positively correlated, as expected. There is one outlier (Station 41), wherein only a single animal was found in the core sample. The highest correlations are between species number, diversity, and richness ($\rho \geq 0.88$). The weakest correlations are between AMBI and the other indices ($\rho < 0.3$). The data distribution of each faunal index is slightly right-skewed. Note that the AMBI values are generally in the "unbalanced range," i.e., classified as healthier than the transitional to pollution class (Borja et al. 2000). Fig. 3 shows the spatial distribution of the magnitude of the five faunal indices. There is a strong spatial coherency between all indices; near the mouth of the bay, large values for all indices are found, and they generally decrease in the landward direction (i.e., farthest from the mouth of the bay). This occurs due to landward fining of the sediment, with related


FIG. 3. Spatial maps of the faunal indices. The magnitude of all ﬁve indices is shown for each of 48 sampling stations, based on circle diameter. Abundance and species number are given in the original units, but plotted on a log scale.

variables such as porosity and organic content showing corresponding changes. However, in the most landward region, the indices become more variable in their magnitudes and are less closely related to one another. The indices themselves are clearly spatially autocorrelated.

A selected set of benthic environmental explanatory variables is plotted in Figs. 4 and 5 (the selection anticipates the key variables selected in Results: Regression results). The pairwise scatter plots of Fig. 4 indicate generally weak intercorrelations ($|\rho| < 0.5$) among the environmental variables, implying that they provide information on different aspects of the benthic environment. The exceptions are the following: porosity is strongly negatively correlated with median sediment size ($\rho = -0.7$) and positively correlated with sediment chlorophyll a ($\rho = 0.6$); median sediment size and chlorophyll a are negatively correlated ($\rho = -0.52$). The majority of the data are clustered over a relatively small range of values. However, there are smaller numbers of sites that take on more extreme values and might be expected to be important in establishing predictive relations between the faunal indices and the environment.

Fig. 5 provides a spatially explicit representation of the environmental data. Chlorophyll a and, to a lesser extent, porosity show a general landward increase, whereas median sediment size shows the opposite pattern. For sulfide and redox, a large range and some extreme values are evident. These bay-wide patterns are probably a consequence of the hydrodynamic flushing regime, interacting with the seston dynamics and the presence of bivalve aquaculture (Dowd 2003). For the remaining explanatory variables (not shown), there were some high correlations (e.g., porosity and organic matter had $\rho = 0.96$), and spatial coherency (e.g., for the sediment grain size distribution parameters).

The relationships between the faunal indices and all of the benthic environmental variables are given in Table 1 in terms of their correlation. The similarity in the information content of the faunal indices is evident. The strongest relationships between the faunal indices and the environmental variables are with median sediment size, porosity, organic content, sediment skewness, and sediment kurtosis. However, there is multicollinearity in the explanatory variables and many of the correlations are driven by very large or very small values.
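Correlations of the kind summarized in Table 1 can be computed as the off-diagonal block of a joint correlation matrix. The following is only a sketch on synthetic data: the function name `cross_correlation`, the array shapes (48 sites, 5 indices, 10 predictors), and the simulated values are assumptions for illustration and do not reproduce the study's data.

```python
import numpy as np

def cross_correlation(Y, X):
    """Pearson r between each column of Y (indices) and each column of X (predictors)."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    m = Y.shape[1]
    # Joint correlation matrix of all m + q variables (columns are variables)
    R = np.corrcoef(np.hstack([Y, X]), rowvar=False)
    return R[:m, m:]                     # m x q block: indices vs. predictors

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(48, 10))
Y_demo = 0.5 * X_demo[:, :5] + rng.normal(size=(48, 5))
r_table = cross_correlation(Y_demo, X_demo)
```

Because Pearson correlations can be dominated by a few extreme sites, as noted above for these data, it may be worth recomputing the block with suspected outliers removed and comparing the two results.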


FIG. 4. Plots of selected benthic environmental (explanatory) variables, following the format of Fig. 2. Porosity, sulﬁde, grain size, and chlorophyll a were measured in sediment (sulﬁde and chl a concentrations are per gram of sediment); redox is a biogeochemical measure.

Specifying the regression errors

Using the complete set of monitoring data outlined in Materials and methods: Study site and sampling, the multistage regression procedure of Appendix A was carried out. The first stage is a MV OLS regression relating the multivariate faunal indices to all 10 of the environmental predictors. This assumes an error covariance structure of the form $\Sigma_{\varepsilon}^{(i,k)} = \sigma_{\varepsilon}^{(i,k)} I$ for $i, k = 1, \ldots, m$. The $\sigma_{\varepsilon}^{(i,k)}$ are estimated from the residuals as $(n - p)^{-1}\hat{\varepsilon}^{(i)\prime}\hat{\varepsilon}^{(k)}$. The regression was highly significant. The studentized residuals were examined for each of the faunal responses and found to satisfy the assumptions of normality, as well as constant variance (the residual variance was independent of both the magnitude of the explanatory variables and the predicted response). There was also one notable outlier associated with the AMBI at station 41 (only one animal was found in the sample), but an influence analysis indicated that this point had a minimal effect on the regression.

However, the residuals from the MV OLS procedure, when examined from a spatial perspective, were not independent. This means that we have violated a key assumption about the error covariance. Fig. 6 shows the spatial distribution of the residuals associated with each of the predicted faunal indices. The presence of spatial autocorrelation is evident, with large and small values for the residuals tending to cluster together in the same region. The implication is that the MV OLS yields P values that are actually lower than they should be, because the assumption of spatial independence was violated, and the effective degrees of freedom are actually much smaller. In order to model the spatial autocovariance for use in the second stage of the MV GLS regression procedure,


FIG. 5. Spatial maps for selected benthic environmental variables. The magnitude of these ﬁve explanatory variables is indicated for each of the 48 sampling stations, based on circle diameter.

we computed sample variograms for each of the residuals, shown in Fig. 7. There is evidence for spatial correlation out to a scale of 1–2 km for all residuals. A nugget effect is also evident, and represents the part of the sampling variability due to the replication error in closely spaced stations. To model the spatial autocorrelation, an exponential variogram model was fit to each of the residuals associated with the $m = 5$ faunal responses. This used a nonlinear weighted least-squares procedure, where the weights were set to be proportional to the number of points used in calculating the sample variogram values for each distance bin. The variogram model took the form:

$$\frac{1}{s_i}\,\gamma_i(d) = a_i + (1 - a_i)\left(1 - e^{-d/r}\right), \qquad i = 1, \ldots, m \tag{4}$$

where the $i$ index refers to the faunal responses. Here, $\gamma_i$

TABLE 1. Pearson’s correlation coefficients (r) between the faunal indices vs. the environmental predictor variables for macrobenthos in St Anns Bay, Nova Scotia.

                                                                         Sediment grain size
Faunal index    Porosity  Organic content  Chl a  Depth  Redox  Sulfide  Median  SD      Skewness  Kurtosis
Abundance       0.66      0.61             0.26   0.093  0.42   0.206    0.66    0.0425  0.61      0.62
Species number  0.66      0.61             0.31   0.182  0.42   0.127    0.68    0.0023  0.61      0.61
Richness        0.68      0.65             0.38   0.248  0.41   0.092    0.78    0.0010  0.68      0.71
Diversity       0.44      0.44             0.24   0.180  0.45   0.050    0.63    0.0838  0.59      0.62
AMBI            0.44      0.44             0.32   0.123  0.15   0.038    0.47    0.0353  0.39      0.41

Notes: Abundance is the total number of individuals; species number is the number of species; richness is species number weighted by individuals per species; diversity is Shannon-Wiener diversity. AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja et al. 2000, Borja and Muxika 2005). Porosity, organic content, chlorophyll a, and sulfide were also measured in sediment; redox is a biogeochemical measure.


FIG. 6. Spatial plots of the studentized residuals from the MV OLS regression. The magnitude of the residuals is shown (by circle size) for each of the faunal indices at all 48 sampling stations.

is the semi-variance, $d$ represents the spatial separation, and $r$ is the range parameter. The sill, $s_i$, corresponds to the variance at large spatial separations. The parameter $a_i = n_i/s_i$ is the ratio of the nugget, $n_i$, to the sill $s_i$. Expressed in this way, the left-hand side of Eq. 4 is $1 - \rho(d)$, where $\rho$ is the spatial autocorrelation as a function of separation $d$. It was found that $r \approx 0.6$ km for all cases (this corresponds to a range of 1.8 km, or where $\gamma_i$ reaches 95% of the sill). The sill and nugget were also found to be related because $a_i \approx 1/3$ for residuals of all faunal indices, making the autocorrelation independent of the response variable considered. In contrast, the sill, or total variance, was index dependent for the residuals of each of the faunal response variables. The fitted variograms are also shown in Fig. 7. There was no evidence of anisotropy or any behavior that warranted a more complex form than the exponential variogram. The following covariance structure for stage 2 of the regression procedure was specified based on the variogram results:

$$\Sigma_{\varepsilon}^{(i,k)} = \sigma_{\varepsilon}^{(i,k)} V, \qquad i, k = 1, \ldots, m. \tag{5}$$
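Because the left-hand side of Eq. 4 equals 1 − ρ(d), the fitted variogram implies ρ(d) = (1 − a) e^(−d/r) between distinct samples, with ones on the diagonal of V. A minimal Python sketch of assembling V from station coordinates is given below. The function names and the three example coordinates are hypothetical, and the defaults a = 1/3 and r = 0.6 km are the approximate fitted values reported above. Applying the nugget to every off-diagonal pair, including spatially coincident stations, is an assumption of this sketch.

```python
import numpy as np

def exp_correlation(d, a=1/3, r=0.6):
    """rho(d) = (1 - a) * exp(-d / r), i.e., 1 minus the right-hand side of Eq. 4."""
    return (1.0 - a) * np.exp(-np.asarray(d, dtype=float) / r)

def correlation_matrix(coords_km, a=1/3, r=0.6):
    """n x n correlation matrix V from pairwise station distances (km)."""
    coords_km = np.asarray(coords_km, dtype=float)
    diff = coords_km[:, None, :] - coords_km[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))    # pairwise Euclidean distances
    V = exp_correlation(D, a, r)
    np.fill_diagonal(V, 1.0)                 # a sample is perfectly correlated with itself
    return V

# Hypothetical station coordinates in km
coords = [(0.0, 0.0), (0.6, 0.0), (0.0, 1.8)]
V = correlation_matrix(coords)

# Fraction of the non-nugget variance reached at d = 3r = 1.8 km
frac_at_range = 1.0 - np.exp(-1.8 / 0.6)
```

The last line reflects the stated correspondence between r ≈ 0.6 km and a range of 1.8 km: at d = 3r the exponential term has reached about 95% of its asymptote. Keeping the off-diagonal entries strictly below one also keeps V positive definite when two stations share a location.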

The correlation matrix $V$ is readily constructed from the fitted parametric form of the variogram, using its relationship to the spatial autocorrelation, $\rho(d)$, as previously stated. Elements of the $n \times n$ matrix $V$ were determined by computing the pairwise distances of all monitoring sites and using Eq. 4 expressed in terms of correlation. This correlation matrix was used for stage 2 of the regression procedure. The transformed residuals from the MV GLS satisfied the assumptions of normality, constant variance, and spatial independence. Further iterative refinement of $V$ (i.e., beyond our two-stage procedure) was deemed unnecessary.

Regression results

Results from the MV GLS regression using all 10 environmental variables are given in Table 2. The overall regression was highly significant, and porosity, redox, and sediment median grain size were identified as important variables. (The corresponding MV OLS regression, which does not account for the spatial autocorrelation, had an overall P value four orders of magnitude lower, and also identified depth as an important predictor.) The regression coefficients (Table 2) obtained from the MV GLS procedure showed consistency between the response variables, reflecting their function of being similar but distinct measures of ecosystem health. The largest discrepancies between the regression coefficients are between AMBI and the other indices, which was expected due to the


FIG. 7. Sample (open circles) and ﬁtted (lines) variograms for the residuals of each of the response variables to assess spatial autocorrelation.

distinctiveness of AMBI as a faunal metric (Borja and Muxika 2005). When univariate GLS regressions (i.e., treating each faunal index as a separate response) were carried out (not shown), results were significant for abundance, richness, and species number (P < 0.01), marginally significant for diversity (P ≈ 0.025), and not significant for AMBI (P > 0.1). The associated $R^2$ for univariate

TABLE 2. Regression coefficients obtained from multivariate generalized least-squares regression between faunal indices (response variables) and environmental predictor variables.

                          Faunal indices
Environmental predictors  Abundance  Species number  Richness  Diversity  AMBI
Porosity****              0.667      0.772           0.643     0.159      0.175
Chl a                     0.244      0.116           0.166     0.231      0.0661
Organic content           0.143      0.365           0.227     0.0733     0.0727
Depth                     0.0197     0.0584          0.0804    0.00602    0.0529
Redox****                 0.0808     0.00147         0.0552    0.194      0.0905
Sulfide                   0.114      0.120           0.0154    0.0533     0.106
Grain size median****     0.636      1.37            1.19      0.688      1.03
Grain size SD             0.217      0.410           0.322     0.169      0.238
Grain size skewness       0.0841     0.219           0.162     0.00205    0.338
Grain size kurtosis       0.369      0.580           0.525     0.173      0.445

Notes: The regression coefficients are estimated for $\beta$ (Eq. 1). AMBI is the AZTI Marine Biotic Index developed in Spain (see Borja and Muxika 2005). Grain size refers to a sediment measure.
**** P < 0.00001 for an individual regression coefficient (Wilks’ test). The overall regression significance was $P = 6.7 \times 10^{-8}$.


FIG. 8. Change in the values of the MV GLS (multivariate generalized least squares) regression coefﬁcients as monitoring sites are successively eliminated from the analysis using the selection algorithm. Each panel shows the regression coefﬁcients for one of the multivariate response variables, as indicated. The numbers 1, 2, and 3 in each panel refer to the regression coefﬁcients for explanatory variables: porosity, median grain size, and sulﬁde, respectively.

regressions with each response variable were: abundance ($R^2 = 0.506$), species number ($R^2 = 0.817$), richness ($R^2 = 0.594$), diversity ($R^2 = 0.387$), and AMBI ($R^2 = 0.153$). The important environmental variables also differed for each univariate regression.

The important environmental variables identified by the variable selection procedure were: sediment median grain size, porosity, and sulfide. The overall regression for this reduced model is highly significant. (Interestingly, forward selection using MV OLS chooses the same variables, but enters them in a different order: median grain size, sulfide, porosity.) Note that the structure of the spatial autocorrelation of the residuals remained nearly the same as for the full model.

Monitoring array design

The monitoring array design procedure uses backward elimination of the least informative sites. The reduced regression model was used. The criterion used to determine the number of sites to retain in the final design was based on the stability of the regression coefficients as site elimination proceeded. Fig. 8 shows the regression coefficients associated with the reduced MV GLS model as sites are eliminated

FIG. 9. Changes in the total information, F, as monitoring sites are successively eliminated from the analysis using the site selection algorithm. Results from both the multivariate ordinary least squares (MV OLS, no spatial correlation), and multivariate generalized least squares (MV GLS, with spatial correlation) are shown.


FIG. 10. Monitoring sites selected by the design procedure. The gray filled circles represent the 14 sites retained in the final design. The number-filled circles are the 34 eliminated stations. The enclosed numbers indicate the order of their removal (1 = first, 34 = last).

according to the algorithm outlined in Materials and methods: Monitoring array design. A maximum of 42 sites can be eliminated, after which the regression becomes ill-conditioned (the number of observations is close to the number of unknowns). Each panel shows the regression coefficients for porosity, median grain size, and sulfide for one of the five faunal indices. The magnitude of all three regression coefficients was consistent and relatively stable until about 32 observations were removed, indicating that the predictive relations remain robust until this degree of site elimination. After this, the magnitude of the leading regression coefficient for porosity drops for all faunal indices (except richness) and the variance increases. The regression coefficients for median grain size and sulfide showed a slight trend after 35 sites were eliminated.

Fig. 9 shows the total information (the inverse of the trace of $\hat{\Sigma}_{\beta}$) as a function of the number of sites eliminated from the analysis. Two cases are shown: the first where spatial autocorrelation is incorporated using MV GLS, and another where it has been erroneously ignored (MV OLS). The pattern for MV GLS indicated that the information was fairly stable until 35 of the 48 sites were eliminated. In contrast, using MV OLS suggested that information decreases abruptly after 25 sites are eliminated. Note that the increase in information for the MV GLS at iteration 9 is due to the removal of the outlier of Station 41. Based on the information change point from the MV GLS results, we chose to retain 14 sites in the final monitoring array (and thus eliminate 34 of 48 monitoring sites). The overall MV GLS regression remained highly significant even when we eliminated this number of sites.

The results of the monitoring design procedure are shown in Fig. 10. The 14 sites retained in the final design are identified, and the order of elimination for the sampling stations is indicated. The final design was one in which the sites were distributed spatially over most of the bay, but not uniformly or completely. There were regions where no sites were chosen (southwest part of the bay). Only one of the two spatially coincident sites was included, and the outlier (station 41) was eliminated. An influence analysis was also carried out to examine the effect of this outlier on the design. With station 41 removed from the data set, it was found that the first 25 stations were eliminated in the same order; after that, two substitutions were made: station 27 in and station 23 out (these are both inner stations) and station 16 in and station 42 out (these are both mid-bay stations).

To gain further insight into the sampling stations that were chosen, Fig. 11 shows pairwise scatterplots of the retained and the omitted multivariate observations obtained from the spatial design procedure. For all faunal response variables, the sites chosen effectively subsample the data to simultaneously maintain the range and distributional properties of the original set of indices, while omitting redundancy. A similar subsampling was also evident for the environmental predictor


FIG. 11. Scatter plots of faunal and environmental observations selected by the design procedure. The 14 observations retained are shown in black circles, while the 34 observations omitted by the site selection procedure are shown by light gray circles.

variables, as well as for the relationships between these predictors and the multivariate faunal response.

DISCUSSION AND CONCLUSIONS

This study has the dual objectives of assessing coastal environmental health and informing future monitoring design. The approach ﬁrst established a predictive relationship between macrofaunal indices and the benthic environment using multivariate generalized least-squares regression (MV GLS). The predictive relations were then used to identify a set of sites contributing the most information about the faunal assemblages and their environment.

The MV GLS regression framework is a well-established and straightforward approach. Beguería and Pueyo (2009) suggest that GLS is a superior alternative to simultaneous autoregressive models for spatial regression problems. GLS allows response variables of various types to be transformed to meet model assumptions, and complex error processes to be incorporated. Dormann et al. (2007) review methods to account for spatial autocorrelation in the analysis of species distributional data and conclude that GLS compares well to autoregressive models, spatial generalized linear mixed models, and generalized estimating equations. They suggest that a nugget term be incorporated into GLS for increased stability, and their wish list includes extensions for spatial models to include multivariate responses and variable selection; all of these are features of our MV GLS framework. Moreover, the central concepts (accounting for correlation of the response variables, collinearity in the explanatory variables, and spatial correlation) can also be incorporated into generalized linear models and mixed models. The MV GLS procedure proved efficient and robust in terms of variable and site selection, readily treating both outliers and redundant stations. It also allowed for the necessary diagnostics, modifications, and transformations required for the iterative model building and sampling design exercises.

For our study in St Anns Bay, using a MV GLS regression approach established a highly significant predictive relationship between the five faunal indices and the benthic environment. AMBI was most dissimilar to the other indices; this is not surprising, because AMBI contains qualitative weighting for species based on indicator value. Multiple ecological indicators are also often recorded simultaneously over different scales (Messer et al. 1991). In our approach, we use a multivariate response within a regression-based predictive model to treat the indices simultaneously and ensure that their intercorrelation is properly accounted for. Indeed, different results were obtained (in terms of key environmental variables, and monitoring design) when considering the indices separately, that is, as a univariate response.

The MV GLS regression framework allows for a general error structure and, in our study, a straightforward incorporation of spatial autocorrelation. Indeed, the variogram analysis allowed us to use an error model based on a separable error covariance structure. Spatial autocorrelation is still an outstanding issue in ecology (Legendre 1993, Beale et al. 2010, Valcu and Kempenaers 2010, Cressie and Wikle 2011).
Benthic ecologists have recognized the importance of scales of variation and patchiness (Dauer and Llansó 2003, Quintino et al. 2006), and their effect on abundance estimation (Cabral and Murta 2004). Ignoring spatial autocorrelation in statistical analysis tends to inflate Type I error and deflate P values (Lennon 2000), a result confirmed by our study. Hurlbert (1984) notes that ignoring it is a form of pseudoreplication: the real degrees of freedom are much smaller than the apparent degrees of freedom due to the dependence structure in the data. In addition to affecting the standard errors of a regression, there is also debate in the ecological literature about the effect on the values of the regression coefficients themselves (Beale et al. 2007, Hawkins et al. 2007). The consequence is clear: statistical inference, variable selection, model selection, and power analysis are not reliable unless spatial relationships are properly taken into account.

There is increasing emphasis being given to spatial sampling design in ecology and the environmental sciences (see, e.g., Müller 2007, Mateu and Müller 2012). Model-based sampling requires explicit models for the spatial structure (Dobbie et al. 2008), and sophisticated statistical approaches have been developed for adaptive sampling based on spatiotemporal models (Wikle and Royle 2005, Cressie and Wikle 2011). Here, a regression-based statistical algorithm was developed to identify key sites at which the benthic environmental variables provide the most information about variation in the faunal indices. Our data set allowed us to assume a constant decorrelation scale and a simplified cross-covariance structure. However, many situations will require a much more sophisticated characterization of the spatial dependence structure. Computational Bayesian approaches are emerging (Finley et al. 2008), and flexible multivariate nonstationary processes for representing spatial dependence have been developed (Gelfand et al. 2004). Another approach uses copulas, which are multivariate distributions that can readily model complex dependence structure (Briggs et al. 2013).

We used a sequential procedure (backward elimination) for sampling design. We chose, as a measure of information, the inverse of the total variance of the regression coefficients, which performed well in selecting sites for this application. However, other information metrics (e.g., the determinant, or generalized variance) should also be considered. Although the site selection algorithm is computationally straightforward, it is likely that the procedure could be made more efficient using approaches from the case-deletion diagnostic literature. The final design removes redundancies in the data and emphasizes where subsampling of the grid of stations could occur with minimal loss of information. The design seems fairly robust to one notable outlier, but an assessment of robustness would be recommended for any application.

In summary, the approach that we have developed for benthic environmental monitoring considered a number of important issues for environmental health assessment.
These include spatial autocorrelation, simultaneous treatment of multiple faunal indices, environmental variable selection, and improving sampling for future monitoring efforts. Such continued refinement of models for spatial ecology is important for properly understanding and characterizing ecosystems (Beale et al. 2010, Massol et al. 2011). This work should help to better inform the theory and practice of coastal environmental health assessment and monitoring.

ACKNOWLEDGMENTS

This study was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Strategic Grant. M. Dowd and J. Grant were also both supported by NSERC Discovery Grants. The authors also gratefully acknowledge the two reviewers for their insightful comments.

LITERATURE CITED

Beale, C. M., J. J. Lennon, D. A. Elston, M. J. Brewer, and J. M. Yearsley. 2007. Red herrings remain in geographical ecology: a reply to Hawkins et al. Ecography 30:845–847.
Beale, C. M., J. J. Lennon, J. M. Yearsley, M. J. Brewer, and D. A. Elston. 2010. Regression analysis of spatial data. Ecology Letters 13:246–264.
Beguería, S., and Y. Pueyo. 2009. A comparison of simultaneous autoregressive and generalized least squares models for dealing with spatial autocorrelation. Global Ecology and Biogeography 18:273–279.
Borja, A. 2005. The European water framework directive: A challenge for nearshore, coastal and continental shelf research. Continental Shelf Research 25(14):1768–1783.
Borja, A., and D. M. Dauer. 2008. Assessing the environmental quality status in estuarine and coastal systems: Comparing methodologies and indices. Ecological Indicators 8:331–337.
Borja, A., J. Franco, and V. Perez. 2000. A marine biotic index to establish the ecological quality of soft-bottom benthos within European estuarine and coastal environments. Marine Pollution Bulletin 40:1100–1114.
Borja, A., and I. Muxika. 2005. Guidelines for the use of AMBI (AZTI's Marine Biotic Index) in the assessment of the benthic ecological quality. Marine Pollution Bulletin 50(7):787–789.
Borja, A., et al. 2009. Assessing the suitability of a range of benthic indices in the evaluation of environmental impact of fin and shellfish aquaculture located in sites across Europe. Aquaculture 293(3–4):231–240.
Briggs, J., M. Dowd, and R. Meyer. 2013. Data assimilation for large scale spatio-temporal systems using a location particle smoother. Environmetrics 24(2):81–97.
Cabral, H. N., and A. G. Murta. 2004. Effect of sampling design on abundance estimates of benthic invertebrates in environmental monitoring studies. Marine Ecology Progress Series 276:19–24.
Caughlan, L., and K. L. Oakley. 2001. Cost considerations for long-term ecological monitoring. Ecological Indicators 1:123–134.
Cressie, N., and C. K. Wikle. 2011. Statistics for spatiotemporal data. Wiley, New York, New York, USA.
Dauer, D. M., and R. J. Llansó. 2003. Spatial scales and probability based sampling in determining levels of benthic community degradation in the Chesapeake Bay. Environmental Monitoring and Assessment 81:175–186.
Dobbie, M. J., B. L. Henderson, and D. L. Stevens. 2008. Sparse sampling: Spatial design for monitoring stream networks. Statistics Surveys 2:113–153.
Dormann, C. F., et al. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30:609–628.
Dowd, M. 2003. Seston dynamics in a tidal embayment with shellfish aquaculture: a model study using tracer equations. Estuarine, Coastal and Shelf Science 57(3):523–537.
Finley, A. O., S. Banerjee, A. R. Ek, and R. E. McRoberts. 2008. Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics 13(1):60–83.
Gelfand, A. E., A. Schmidt, S. Banerjee, and C. F. Sirmans. 2004. Nonstationary multivariate process modelling through spatially varying coregionalization. Test 13:263–312.
Grant, J., P. MacPherson, and B. T. Hargrave. 2002. Sediment properties and benthic–pelagic coupling in the North Water Polynya. Deep-Sea Research 49:5259–5275.
Hargrave, B. T., M. Holmer, and C. P. Newcombe. 2008. Towards a classification of organic enrichment in marine sediments based on biogeochemical indicators. Marine Pollution Bulletin 56(5):810–824.
Hawkins, B. A., J. A. F. Diniz-Filho, L. M. Bini, P. De Marco, and T. M. Blackburn. 2007. Red herrings revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography 30:375–384.
Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54:187–211.
Johnson, R. A., and D. W. Wichern. 2001. Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, New Jersey, USA.
Jorgensen, S. E., F.-L. Xu, and R. Costanza, editors. 2010. Ecological indicators for assessment of ecosystem health. CRC Press, Boca Raton, Florida, USA.
Legendre, P. 1993. Spatial autocorrelation: trouble or new paradigm. Ecology 74:1659–1673.
Lennon, J. J. 2000. Red-shifts and red herrings in geographical ecology. Ecography 23:101–113.
Lu, L., J. Grant, and J. Barrell. 2008. Macrofaunal spatial patterns in relationship to environmental variables in the Richibucto Estuary, New Brunswick, Canada. Estuaries and Coasts 31(5):994–1005.
Massol, F., D. Gravel, N. Mouquet, M. W. Cadotte, T. Fukami, and M. A. Leibold. 2011. Linking community and ecosystem dynamics through spatial ecology. Ecology Letters 14:313–323.
Mateu, J., and Müller. 2012. Spatio-temporal design: Advances in efficient data acquisition. Wiley, New York, New York, USA.
Messer, J. J., R. A. Linthurst, and W. S. Overton. 1991. An EPA program for monitoring ecological status and trends. Environmental Management 17:67–78.
Müller, W. G. 2007. Collecting spatial data. Springer, New York, New York, USA.
Quintino, V., M. Elliott, and A. M. Rodrigues. 2006. The derivation, performance and role of univariate and multivariate indicators of benthic change: Case studies at differing spatial scales. Journal of Experimental Marine Biology and Ecology 330:368–382.
Valcu, M., and B. Kempenaers. 2010. Spatial autocorrelation: an overlooked concept in behavioral ecology. Behavioral Ecology 21:902–905.
Wackernagel, H. 2003. Multivariate geostatistics: an introduction with applications. Springer, New York, New York, USA.
Wikle, C. K., and J. A. Royle. 2005. Dynamic design of ecological monitoring networks for non-Gaussian spatiotemporal data. Environmetrics 16(5):507–522.
Wildish, D. J., B. T. Hargrave, and G. Pohle. 2001. Cost-effective monitoring of organic enrichment resulting from salmon mariculture. ICES Journal of Marine Science 58:469–476.

SUPPLEMENTAL MATERIAL

Appendix

Details of multivariate regression procedures (Ecological Archives A024-050-A1).