MODELLING BIOLOGICAL DATA WITH SEGMENTED LANDSCAPE OBJECTS AND IMAGE GREY VALUES

Eva Ivits a, Barbara Koch a, Lars Waser b, Dan Chamberlain c

a University of Freiburg, Germany, email: [email protected], [email protected]
b WSL, Switzerland, email: [email protected]
c The British Trust for Ornithology, Great Britain, email: [email protected]

ABSTRACT

Landscape structure was investigated on six test sites in Switzerland, selected along a land-use intensity gradient. The test areas were captured with remote sensing images at three spatial resolutions: 1) fused Landsat ETM-IRS and 2) Quickbird satellite images, as well as 3) CIR aerial photos. Segmentation and fuzzy classification were applied to the images to extract landscape patch indices in 96 sampling plots. In addition, original and enhanced grey values were derived in the plots. Abundance of seven breeding bird species, sampled in the plots, was analysed with CCA. The variance of the species data explained by patch indices and grey values was compared across the three spatial resolutions. CCA revealed little explanatory power of the remote sensing variables. The explained variance was comparable for grey values and patch indices. Increasing the spatial resolution of patch indices and grey values did not yield a stronger association with bird abundance. Presence and absence of E. rubecula was modelled by means of logistic regression. The logistic models were compared based on goodness-of-fit statistics and discrimination capacities. Logistic regression revealed very high discrimination capacity of both patch indices and grey values in predicting presence and absence of the species. Increasing spatial resolution did not affect the discrimination capacity of the remote sensing variables but resulted in increasing importance of image textural features.

Keywords: grey values, segmentation, spatial resolution, birds, CCA, logistic regression

1 INTRODUCTION

Landscape ecology is largely founded on the notion that environmental patterns strongly influence ecological processes [1]. Using landscape indices to quantify landscape pattern has been a widespread method in ecology for over a decade [2], [3]. Remotely sensed images offer an optimal basis for calculating landscape indices, since satellite sensors are able to cover large continuous areas. A variety of studies have examined the scale effect of changing pixel sizes on landscape indices [4], [5], [6], [7], [8]. Fewer studies have explored the potential of landscape indices for species diversity based on real-world field studies [9], [10].

Another method for quantifying landscapes based on remote sensing is the application of original and enhanced image grey values. Image enhancement techniques are generally used to bring out important features in raw remotely sensed data. Texture analysis performs filter operations including first order statistics as well as second order statistics derived from the grey level co-occurrence matrix. Focal analysis performs filter operations, including density, diversity, majority etc., computed in a moving window. The advantage of textural and filter enhancements is that they reduce contrast in high-frequency scenes and thereby emphasise homogeneous information. Furthermore, the use of textures avoids noise effects. Birds show a well-documented response to landscape composition and pattern [11], [12]. To date, however, only a few studies have analysed image grey values as indicators of bird assemblages.

The present study was conducted within the BioAssess project in the Fifth Framework Programme of the European Commission. The BioAssess project developed biodiversity assessment tools for inland terrestrial ecosystems. These tools comprised a set of biodiversity indicators for plants, birds, butterflies, lichens, carabids, soil macrofauna, and Collembola, while remote sensing was used to derive indices of landscape pattern. The purpose of the proposed set of indicators was to assess the impact of policies on changes in biodiversity in Europe. In order to achieve this goal, it was assumed that changes in biodiversity could be measured along a land-use intensity gradient.

The purpose of the present study is to demonstrate the power of remote sensing in predicting bird species diversity in the Swiss test site. The main questions to answer are 1) whether increasing spatial resolution plays a decisive role in modelling species data and 2) whether image grey values and their derivatives are comparable to classified pixels in their potential to model species data. These questions become especially important in large-area studies, since the price of remote sensing images increases exponentially with increasing spatial resolution. Furthermore, classification is time consuming and introduces uncertainties into the land use/land cover maps, and with increasing spatial resolution it becomes more difficult because of the high variation between pixels.

2 MATERIAL AND METHODS

2.1. STUDY AREA

The test area is located in the northern pre-Alps in the cantons of Luzern and Bern (figure 1). The region is characterised by a complex topography with impenetrable gorges, rocky slopes, karst areas and fluviatile deposits [13]. Six Land Use Units (LUUs) were selected representing a land use intensity gradient. Old-growth forest (LUU1) represented the lowest level of land use intensity, forest/woodland dominated landscapes represented mixed land use (LUU3 and 4) while intensively used grassland was chosen to depict intensive land use (LUU5 and 6, table 1 and figure 3). Each LUU was a 1 by 1 km square.

Table 1. Description of the six land use units (LUUs)

LUU    Description                                     Criteria (% of land use cover)
LUU1   Old-growth forest                               Old-growth forest >50%, Other forests-woodland-shrub land >10%, Other land-uses >20%
LUU2   Managed forest                                  Managed forest >50%, Other forests-woodland-shrub land >10%, Other land-uses >20%
LUU3   Mixed-use dominated by forest or woodland       Forest-woodland-shrub land >50%, Grassland >10%, Crops >10%
LUU4   Mixed-use not dominated by a single land-use    Forest-woodland-shrub land >25%, Grassland >25%, Crops >25%
LUU5   Mixed-use dominated by pasture                  Grassland >50%, Crops >10%, Forest-woodland-shrub land >10%
LUU6   Mixed-use dominated by arable crops             Crops >50%, Grassland >10%, Forest-woodland-shrub land >10%

2.2. REMOTE SENSING DATA

Three different remote sensing datasets were used: 1) a Landsat ETM image (1999) fused with an IRS-1D (1999) scene, 2) multispectral Quickbird data (2002) and 3) colour-infrared (CIR) orthoimages of the years 1999 and 2001. Landsat ETM was fused with the IRS-1D scene because of the good spectral resolution of the former and the good spatial resolution of the latter. This resulted in a multispectral image with a spatial resolution of 5 m. The Adaptive Image Fusion (AIF) [14] was applied for the fusion to preserve the original radiometric values of the Landsat image. The multispectral Quickbird satellite data provide a spatial resolution of 2.8 m, whereas the spatial resolution of the orthoimages equals 0.6 m.

2.3. BIRDS SAMPLING

16 sampling plots were selected inside each LUU. The plots were located 200 m apart and each comprised a circle of 100 m radius. With 16 plots assigned to each of the six LUUs, 96 sampling plots were investigated in total. The sampling plots served to produce remote sensing-derived indicators of land-use intensity and to sample breeding bird data. Breeding bird species were sampled in the 96 plots. The aim of the bird sampling was to obtain counts of birds during the breeding season that can be used to estimate both avian species richness and the relative abundance of individual species within the whole 1-km squares. Four point counts, each of 5 minutes duration, were undertaken at each of the 16 sample points. Point counts are one of the most widely used methods for estimating numbers of birds and relating them to features of habitat in extensive surveys [15], [16]. The four visits to each sample point were spaced more or less evenly through the breeding season. This is important because it ensured that both early nesting resident species (many of which sing mainly in the early weeks of spring) and long-distance migrant species were adequately detected and counted.

Figure 1: Overview of the study area located in the northern Pre-Alps of Switzerland (map labels: LUU1 to LUU6)

2.4. EXPLANATORY and RESPONSE VARIABLES

Six variables were derived from each of the three images and used as one type of explanatory variable in the statistical analysis. These are referred to as grey values and are:
• Focal diversity filter of the near infrared channel (focdiv)
• Near infrared channel (nir)
• NDVI (Normalised Difference Vegetation Index) (ndvi)
• First principal component (PCA) of all the channels (pca)
• Skewness filter of the near infrared channel (skew)
• Variance filter of the near infrared channel (var)

The near infrared domain of the electromagnetic spectrum differentiates between vegetation types and vegetation vigour. The NDVI image is useful for differentiating between surfaces with and without vegetation cover as well as between different types of vegetation (e.g. forest, grassland or agriculture). The first PCA axis of all the image channels compresses the main information available in the images. Thus this single image contains pixel values describing different vegetation types as well as areas without vegetation, such as soil and artificial surfaces. It was assumed that different vegetation types, and above all homogeneous versus heterogeneous vegetation surfaces, would yield different textural values. The variance filter makes use of the average pixel value within a so-called moving window: the middle pixel value is replaced by the variance of the window's pixel values about that average. The skewness filter measures how much the data within the moving window are skewed towards the highest or the lowest values. The focal diversity filter computes the number of different pixel values within the moving window. All the filters were calculated within a 3 × 3 pixel kernel. The calculated pixel values were averaged within the sampling plots, and this was done at all three spatial resolutions. All variables were treated as continuous.
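The grey value variables described above can be illustrated with a minimal NumPy sketch. It is not the software used in the study: the function names, the edge padding at the image border, the zero returned for flat windows and the pixel-centre test for plot membership are all assumptions made for illustration. NDVI is computed in its standard form, (NIR - red) / (NIR + red).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def focal_windows(band, size=3):
    """Return an (H, W, size*size) array holding the moving-window values of
    every pixel. Border pixels are handled by edge padding (an assumption)."""
    pad = size // 2
    padded = np.pad(np.asarray(band, dtype=float), pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return windows.reshape(band.shape[0], band.shape[1], size * size)


def focal_variance(band, size=3):
    """Variance of the window values about the window mean."""
    return focal_windows(band, size).var(axis=-1)


def focal_skewness(band, size=3):
    """Third standardised moment of the window values (0 for flat windows)."""
    w = focal_windows(band, size)
    centred = w - w.mean(axis=-1, keepdims=True)
    m2 = (centred ** 2).mean(axis=-1)
    m3 = (centred ** 3).mean(axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        skew = m3 / m2 ** 1.5
    return np.where(m2 > 0, skew, 0.0)


def focal_diversity(band, size=3):
    """Number of different pixel values inside the window."""
    w = np.sort(focal_windows(band, size), axis=-1)
    return 1 + (np.diff(w, axis=-1) != 0).sum(axis=-1)


def ndvi(nir, red):
    """Normalised Difference Vegetation Index; 0 where nir + red is 0."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    return np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)


def plot_mean(layer, centre_rc, radius_px):
    """Average a layer over a circular sampling plot given in pixel units."""
    rows, cols = np.ogrid[:layer.shape[0], :layer.shape[1]]
    mask = (rows - centre_rc[0]) ** 2 + (cols - centre_rc[1]) ** 2 <= radius_px ** 2
    return layer[mask].mean()
```

For a 100 m plot radius, `radius_px` would be 20 pixels on the 5 m Landsat-IRS image; the other resolutions scale accordingly.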
As an example, subsets of the six variables calculated from the orthophoto with the sampling plots overlaid are shown in Figure 2.

The three images were also segmented and classified with the eCognition software [17]. This segmentation technique uses spectral and textural properties of image pixels, together with their size and behaviour at different scale levels, to produce segmented objects. Segmentation can be done on hierarchical scales, where a semantic net is built between the different levels and their objects. This allows the development of a hierarchical classification scheme where the delineated objects can be further categorised into finer classes using fuzzy logic theory. This method is very appropriate for landscape ecological analysis because the user can define the scale, which influences the detail of segmentation. For further details see [18]. The following six measures were computed inside the sampling plots; throughout the text they are referred to as patch indices:
• Area of artificial surface (aarsu), of forest (afrst), of grassland (agrslnd), of open spaces (aopsp), and of water (awat)
• Number of patches inside the sampling plots (pano)
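As a rough illustration of how such patch indices could be derived, the sketch below works on a classified raster rather than on eCognition objects. That substitution is an assumption: the study used segmented objects, whereas here a patch is taken to be a 4-connected group of equal-class pixels inside the plot. The function names and class codes are hypothetical.

```python
from collections import deque

import numpy as np


def circular_mask(shape, centre_rc, radius_px):
    """Boolean mask of the pixels whose centres lie inside a circular plot."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return (rows - centre_rc[0]) ** 2 + (cols - centre_rc[1]) ** 2 <= radius_px ** 2


def class_areas(classified, mask, pixel_size_m):
    """Area in square metres of every class code found inside the mask."""
    codes, counts = np.unique(classified[mask], return_counts=True)
    return {int(c): float(n) * pixel_size_m ** 2 for c, n in zip(codes, counts)}


def count_patches(classified, mask):
    """Number of 4-connected patches of equal class inside the mask,
    found with a breadth-first flood fill."""
    n_rows, n_cols = classified.shape
    seen = np.zeros(classified.shape, dtype=bool)
    patches = 0
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        patches += 1
        seen[r, c] = True
        queue = deque([(r, c)])
        while queue:
            i, j = queue.popleft()
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if (0 <= a < n_rows and 0 <= b < n_cols
                        and mask[a, b] and not seen[a, b]
                        and classified[a, b] == classified[r, c]):
                    seen[a, b] = True
                    queue.append((a, b))
    return patches
```

With a class code assigned to forest, `class_areas` would give the equivalent of afrst for one plot and `count_patches` the equivalent of pano.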

Figure 2. Example of the six derived grey value variables (focdiv, nir, ndvi, skew, var, pca) based on the aerial photo; LUU6 is shown with a subset of the sampling plots.

Dependent variables were seven breeding bird species. In selecting the species from the 36 bird species sampled in the Swiss test site, both statistical and ecological criteria were applied. Fourteen species met the requirement of occurring in between 20 and 80 percent of the sampling plots. As the percentage of forest was the main criterion for the definition of the different LUUs, further selection focused mainly on woodland species and on some habitat generalists whose primary (ancestral) habitat is woodland. Moreover, the fact that forest is the most dominant vegetation cover that can be discriminated in remote sensing images also supported this criterion. Relatively widespread species were selected with a bias towards a landscape with broad-leaved woodland. Seven species met the above-mentioned requirements:
• Blackbird (Turdus merula)
• Blackcap (Sylvia atricapilla)
• Blue Tit (Parus caeruleus)
• Chiffchaff (Phylloscopus collybita)
• Robin (Erithacus rubecula)
• Song Thrush (Turdus philomelos)
• Wren (Troglodytes troglodytes)

Analysis of presence-absence data of only one species, Robin (Erithacus rubecula), is presented in this paper. E. rubecula was selected because it showed a strong positive correlation with the extent of forest habitats in the landscape.

2.5. STATISTICAL ANALYSIS: CCA and LOGISTIC REGRESSION

To relate patch indices and grey values to the sampled biological data, Canonical Correspondence Analysis (CCA) was carried out. Species data were log transformed and centred before the analysis. CCA scaling focused on inter-species distances with the biplot rule because, according to [19], this scaling type is more appropriate when the gradient is not very long. The statistical significance of the relationship between the species and the whole set of environmental variables was evaluated using the Monte Carlo permutation test with 999 permutations. This was calculated for all canonical axes. If CCA indicated variables with an inflation factor above 10, a new analysis was run without the variable concerned. In order to analyse the performance of grey values and patch indices at the three spatial resolutions, the cumulative variance of species data explained by the first four CCA axes was compared.
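The inflation-factor screening can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the CCA software: it uses the ordinary, unweighted variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one explanatory variable on all the others, and it removes the worst variable one at a time until none exceeds the limit of 10. The function names are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X: 1 / (1 - R^2), with R^2
    from an ordinary least-squares regression of that column on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def drop_inflated(X, names, limit=10.0):
    """Repeatedly remove the variable with the largest VIF until every
    remaining VIF is at or below the limit."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        factors = vif(X)
        worst = int(np.argmax(factors))
        if factors[worst] <= limit:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```

Removing one variable at a time mirrors the rerun-without-the-variable step described above; removing all offending variables at once would be a different, more aggressive rule.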

Logistic regression was performed to model presence-absence of E. rubecula with grey values and patch indices. Environmental variables were tested for multicollinearity before the analysis. Where variables showed a Pearson's correlation above 0.9, only one of them was kept in the analysis. Abundance values were converted into presence-absence, i.e. into zeros where the species was not recorded and ones where E. rubecula was found in the sampling plots. In order to find the most parsimonious model, best subset logistic regression was applied to the datasets. Best subset regression identifies a user-specified number of best models containing one, two, three variables and so on, up to the single model containing all p variables. For the selection of the best subset models the “Score Test Association to Mallow’s C” (STAM) [20] statistic was applied. The rescaled Nagelkerke’s R2 values and the Hosmer-Lemeshow tests were computed to assess the goodness of fit of the models. In order to compare discrimination capacities, the area under the ROC curve (c-statistic) and the percentage of correct classifications were calculated. To validate how well the models fit the data, Monte-Carlo cross-validations were applied. Finally, to rule out spatial influence of observations on the models, spatial semivariograms of the Pearson residuals were computed. Using the observed and predicted values from the logistic models, spatial kriging models were computed in GIS. A prediction surface was created for the observed and the estimated values. This gave an insight into how well the spatial models reflect the distribution of the sampled species along the land-use intensity gradient.
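Several of the steps above can be sketched in a few lines of NumPy: the conversion to presence-absence, the Pearson screening at 0.9, a plain logistic fit and the two discrimination measures. This is a hedged illustration, not the statistical package used in the study. The function names, the Newton-Raphson fit with a tiny ridge term for numerical stability, the rule of keeping the first of two correlated variables and the 0.5 classification cutoff are all assumptions. Best subset selection, the Hosmer-Lemeshow test and the cross-validation are not covered.

```python
import numpy as np


def to_presence(abundance):
    """Convert abundance counts to 0 (absent) or 1 (present)."""
    return (np.asarray(abundance) > 0).astype(int)


def drop_collinear(X, names, threshold=0.9):
    """Keep a variable only if its absolute Pearson correlation with every
    variable kept so far is at or below the threshold."""
    X = np.asarray(X, dtype=float)
    r = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Newton-Raphson fit of a logistic model; the intercept comes first.
    The small ridge term only stabilises the linear solve (an assumption)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (y - p) - ridge * beta
        hess = A.T @ (A * (p * (1.0 - p))[:, None]) + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def predict_proba(beta, X):
    """Predicted probability of presence for each row of X."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    return 1.0 / (1.0 + np.exp(-A @ beta))


def c_statistic(y, p):
    """Area under the ROC curve: the share of presence/absence pairs in which
    the presence plot receives the higher probability (ties count one half)."""
    y, p = np.asarray(y), np.asarray(p)
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def percent_correct(y, p, cutoff=0.5):
    """Percentage of plots whose thresholded prediction matches the observation."""
    return 100.0 * np.mean((np.asarray(p) >= cutoff).astype(int) == np.asarray(y))
```

A c-statistic of 0.5 corresponds to a model that ranks presence and absence plots no better than chance, and 1.0 to perfect ranking.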

3 RESULTS

CCA of the seven breeding bird species with Landsat-IRS and Quickbird grey values indicated an inflation factor above 20 for the variable nir, while among the orthophoto grey values the variable pca showed too high a value. Among the patch indices, agrslnd had the highest inflation factor at all spatial resolutions. These variables did not enter the final CCA. All analyses are significant at the p < 0.05 level (table 2). For all analyses the first CCA axis is the most important (highest eigenvalue), while the subsequent axes are of less importance (table 2). The variance in the species data explained by grey values and patch indices is not strongly affected by spatial resolution. Grey values and patch indices perform comparably in explaining variance in the breeding bird dataset. The first four CCA axes together explain in all cases more than 12% of the variance; for Quickbird patch indices this amounts to 16% (table 2). However, the variance in the species data that could be explained by the remote sensing variables is very low, considering that abundance data of only seven bird species were included in the analysis.

Table 2. CCA results with grey values and patch indices on the three spatial resolutions

Model / parameter                          CCA1    CCA2    CCA3    CCA4
Landsat-IRS grey values
  Eigenvalues                              0.193   0.046   0.011   0.004
  Cumulative % variance of species data    10.2    12.7    13.2    13.5
  Monte-Carlo permutation                  p < 0.002
Landsat-IRS patch indices
  Eigenvalues                              0.210   0.030   0.003   0.001
  Cumulative % variance of species data    11.2    12.8    12.9    13.0
  Monte-Carlo permutation                  p < 0.021
Quickbird grey values
  Eigenvalues                              0.194   0.024   0.009   0.005
  Cumulative % variance of species data    10.3    11.6    12.1    12.3
  Monte-Carlo permutation                  p < 0.018
Quickbird patch indices
  Eigenvalues                              0.210   0.055   0.030   0.007
  Cumulative % variance of species data    11.1    14.1    15.7    16.0
  Monte-Carlo permutation                  p < 0.007
Orthophoto grey values
  Eigenvalues                              0.152   0.046   0.022   0.014
  Cumulative % variance of species data    8.1     10.5    11.6    12.4
  Monte-Carlo permutation                  p < 0.005
Orthophoto patch indices
  Eigenvalues                              0.196   0.077   0.019   0.004
  Cumulative % variance of species data    10.4    14.5    15.5    15.7
  Monte-Carlo permutation                  p < 0.014

Both grey values and patch indices prove to be successful predictors of presence-absence of E. rubecula. All the semivariograms show a flat pattern (not shown), thus no spatial autocorrelation of the residuals occurs. Area of grassland shows a high negative Pearson’s correlation with area of forest (|r| > 0.9) and was therefore left out of the further analysis. Only the logistic model with patch indices of the orthophoto has a significant Hosmer-Lemeshow test, which indicates that this model does not fit the species data. All models are able to classify presence-absence data with above 80% accuracy, and the discrimination capacity of the models was above 90%. Model performance was tested with the Monte-Carlo cross-validation technique. This detects a difference between classified and cross-validated values in the case of Quickbird grey values. Cross-validation of the other models indicates a high percentage of correctly classified cases. Concerning discrimination capacities, grey values and patch indices are able to predict presence-absence of E. rubecula similarly well. The discrimination capacity of the models also indicates that the remote sensing variables are not significantly affected by increasing spatial resolution.

Table 3. Best subset logistic model parameters for segmented landscape objects on the three spatial resolutions

Columns, given for each model: R2, Hosmer-Lemeshow test, % of correct classification, c statistic, Monte-Carlo cross-validation

Landsat-IRS grey values 0.817

0.926

0.818

0.971

89.6%

90.7%

0.733

0.798

88.5%

0.777

0.156

Quickbird grey values

0.640

82.3%

MonteCarlo crossvalidation

88.5%

0.940

86.5%

Landsat-IRS patch indices

0.972

Orthophoto grey values 0.703

c staistic

Landsat-IRS patch indices

0.973

90.6%

% of correct classificatio n

90.6%

0.960

89.1%

Orthophoto patch indices

0.943

81.8%

0.773

0.028

90.6%

0.955

90.6%

Table 4 presents the patch indices and grey values selected by the logistic models predicting presence-absence of E. rubecula, together with their significance. Out of the five input patch indices, only area of forest was selected as an important predictor (p < 0.005). This is not surprising, as E. rubecula is a woodland species. Regarding grey value predictors, different variables are selected by the three logistic models. For the fused Landsat-IRS image, the ndvi and nir variables show high significance (p < 0.005). Out of the Quickbird input grey values, pca shows very high significance while var was moderately significant (p