
Environ Model Assess (2015) 20:355–365 DOI 10.1007/s10666-014-9433-3

Using Principle Component Regression, Artificial Neural Network, and Hybrid Models for Predicting Phytoplankton Abundance in Macau Storage Reservoir

Iek In Ieong, Inchio Lou, Wai Kin Ung, Kai Meng Mok

Received: 21 August 2013 / Accepted: 17 October 2014 / Published online: 28 October 2014
© Springer International Publishing Switzerland 2014

Abstract Principal component regression (PCR), artificial neural network (ANN), and their combinations were applied in this study as data-driven models to predict (based on the current-month variables) and forecast (based on the variables of the preceding 3 months) the phytoplankton dynamics in Macau Main Storage Reservoir (MSR), which has experienced algal blooms in recent years. The models used 8 years of monthly water quality data for training and the most recent 3 years of monthly data for testing. Twenty-four water quality variables, including physical, chemical, and biological parameters, were involved, and the models were compared to select the best one for MSR. Simulation results revealed that ANN has better accuracy and generalization performance than PCR for both the prediction and the forecast models. Pre-processing the input data with principal component analysis (PCA) did not improve the performance of the ANN, implying that eliminating the uncorrelated variables does not increase the prediction capability of the adopted model. Overall, in contrast with previous studies showing that a hybrid model can handle both the linear and nonlinear components of a problem well, the PCR-ANN in this study yielded no improvement.

Keywords Algal bloom · Phytoplankton abundance · Artificial neural network · Principal component analysis · Prediction model · Forecast model

I. In Ieong · I. Lou (*) · K. M. Mok
Department of Civil and Environmental Engineering, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, SAR, China
e-mail: [email protected]

W. K. Ung
Laboratory & Research Center, Macao Water Co., Ltd., 718, Avenida do Conselheiro Borja, Macau, SAR, China

1 Introduction

Eutrophication of surface water has become a significant environmental problem and has emerged as a worldwide concern because of its increasing occurrence and severity. Approximately 28–45 % of freshwater reservoirs worldwide have been reported to be eutrophic [1]. Eutrophication is caused by excessive nutrients discharged into water bodies, which results in the rapid growth of algae (also called phytoplankton) under favorable conditions. Algal blooms reduce water quality, and the cyanotoxins released by algae are carcinogens that affect public health. It is thus important to understand and predict the dynamics of phytoplankton in order to provide early warning of algal blooms in an aquatic system.

The difficulty in predicting phytoplankton abundance in a freshwater system is that physical, chemical, and biological processes, as well as the interactions among them, are involved. Multiple linear regression (MLR) is a commonly used technique for obtaining a linear input–output model for a given dataset. However, it faces serious difficulties when the independent variables are correlated with each other. One method of removing such multicollinearity is principal component analysis (PCA). PCA acts as a data filter: it uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables, thus reducing the complexity of a multidimensional system by maximizing the component loading variance and eliminating invalid components. Previous studies [2, 3] attempted to use principal component regression (PCR), i.e., PCA followed by MLR, to predict chlorophyll a levels, the fundamental index of phytoplankton. However, the disadvantage of PCR is that the relationship between the input variables and the phytoplankton dynamics may be highly nonlinear, so PCR alone is expected to be inadequate for prediction. Although some existing multivariate statistical methods can approximate nonlinear relationships, the required assumption of functional dependency is a drawback of such complicated procedures [4].

On the other hand, the artificial neural network (ANN) was adopted as an alternative approach for extracting information without any assumptions about the nature of the nonlinearity. ANN requires no a priori assumptions about the model in terms of mathematical relationships or distribution of data [5], and it is a well-suited method with self-adaptability, self-organization, and error tolerance [6, 7]. ANN was first used by French and Recknagel [8] to predict algal blooms from a water quality dataset, in which a feed-forward ANN was trained for prediction of species abundances. Interest in using ANN for modeling ecosystems has grown since then, and a variety of ANN-based models [9–11] for predicting algal blooms have been developed. It was reported that ANN provided better results than the MLR model [12]. However, ANN alone is not able to handle both linear and nonlinear characteristics equally well, although it can simulate both linear and nonlinear structures, and there are also some contradictory reports in the literature on ANN performance in forecasting time series data [13].

Both PCR and ANN models can achieve success in their own linear and nonlinear domains, respectively. However, neither of them is a universal model that can be applied to all situations. The PCR model may be inadequate for complex nonlinear problems, while using ANN to model linear problems yields mixed results [13, 14]: the performance of ANN for linear regression problems depended on the sample size and the noise level. Thus, it may be unwise to apply ANN directly to any type of data.
As it is difficult to understand completely the characteristics and the pattern of the data in a typical problem, a hybrid methodology that has both linear and nonlinear modeling capabilities could be an appropriate method for practical use. The idea of combining forecasts is to use each method's features to capture different patterns in the dataset, thereby improving the forecast accuracy over that of the individual forecasts. Considering the above-mentioned drawbacks of single-method forecasts, Bates and Granger [15] first proposed combining forecasts. It was observed that the combination of PCR and ANN can improve the prediction accuracy of the ozone concentration level in the lower atmosphere, with a higher R² and a reduced root mean square error (RMSE). However, this hybrid method is not widely used, and no literature was found that adopts the combined technique for predicting phytoplankton abundance in a freshwater system.

In addition to modeling the phytoplankton population using the current environmental variables, the effect of time series is becoming more important for forecasting, in which past variables are collected and analyzed to develop a model describing the underlying relationship. It is of great importance to extrapolate the time series data into the future. This approach is particularly useful when little knowledge is available on the underlying data-generating process or when there is no satisfactory explanatory model that relates the phytoplankton population to the explanatory variables. Previous ANN studies have found that consideration of time is important to the prediction of algal growth [16–18].

In spite of the widespread application of ANN to algal bloom prediction, most of the ANN models available in the literature model chlorophyll a or a single specific species rather than total phytoplankton abundance directly, resulting in an inaccurate estimation of the algae population. Besides, the causal factors driving phytoplankton growth are unclear, and the measurements of water quality data are limited, which also constrains the application of ANN models to algal blooms. For example, in the paper by Wilson and Recknagel [19], only the main water parameters (temperature, nitrogen, phosphorus, and underwater light) were included as inputs, on the assumption that algal growth depends only weakly on other factors, even though this is not true in practice.

To better understand the algal blooms in Macau Main Storage Reservoir (MSR) and to find a feasible approach for predicting and forecasting the phytoplankton in MSR, various models including PCR, ANN, PCA-ANN, and PCR-ANN, based on 24 monthly water parameters, were adopted in this study. The models were compared to select the one best able to estimate the phytoplankton population in MSR. We hypothesized that the complex ecosystem can be divided into linear and nonlinear components, which can be modeled by PCR and ANN, respectively. Considering that PCA may also help to remove the variables uncorrelated with phytoplankton, all of these combined techniques were considered in this study to improve the model performance for algal bloom prediction and forecast.

2 Materials and Methods

2.1 Study Area and Data Collection

Macau Main Storage Reservoir (MSR), located in the eastern part of the Macau peninsula at longitude 113° 33′ 12″ E and latitude 22° 12′ 12″ N, is the biggest reservoir in Macau, with a capacity of about 1.9 million m³ and a water surface area of 0.35 km². It is a pumped storage reservoir that receives raw water from the West River of the Pearl River network 20 km away and can supply water to the whole of Macau for about 1 week. MSR is particularly important as the temporary water source during the salty tide period, when seawater intrusion causes high salinity at the water intake location. In recent years, there were reports (Macao Water Co., Ltd., unpublished data) of algal bloom problems in the reservoir, especially in the summer, with high phytoplankton abundance in which Microcystis was detected as the dominant genus.

A location at the inlet of the reservoir was selected for sampling. Samples were collected in duplicate monthly from May 2001 to February 2011 at 0.5 m below the water surface. A total of 23 monthly water quality parameters (Table 1), including hydrological, physical, chemical, and biological parameters, were used in the model development.

Table 1 Water quality characteristics of MSR from 2001 to 2011

| Parameter | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|
| Turbidity (NTU) | 1.60 | 37.70 | 9.15 | 7.01 |
| Temperature (°C) | 13.00 | 33.00 | 24.86 | 5.49 |
| pH | 7.70 | 9.52 | 8.50 | 0.34 |
| Conductivity (μS/cm, 25 °C) | 171.00 | 1297.00 | 346.84 | 155.38 |
| Cl⁻ (mg/L) | 11.00 | 291.00 | 44.32 | 38.41 |
| SO₄²⁻ (mg/L) | 9.60 | 64.80 | 22.73 | 9.65 |
| SiO₂ (mg/L) | 0.08 | 9.50 | 4.22 | 2.26 |
| Alkalinity (mg/L CaCO₃) | 44.00 | 107.00 | 80.10 | 13.55 |
| HCO₃⁻ (mg/L) | 31.00 | 131.00 | 90.09 | 18.72 |
| DO (%) | 71.30 | 187.60 | 103.85 | 17.42 |
| NO₃⁻ (mg/L) | 0.08 | 5.52 | 1.20 | 1.04 |
| NO₂⁻ (mg/L) | 0.00 | 0.25 | 0.02 | 0.03 |
| NH₄⁺ (mg/L) | 0.01 | 0.60 | 0.06 | 0.07 |
| TN (mg/L N) | 0.14 | 0.90 | 0.34 | 0.14 |
| UV254 (1/m) | 1.94 | 4.28 | 3.06 | 0.50 |
| Fe (μg/L) | 19.00 | 1171.00 | 205.14 | 206.35 |
| PO₄³⁻ (μg/L) | 7.00 | 47.00 | 9.65 | 4.78 |
| TP (μg/L) | 9.00 | 135.00 | 40.46 | 23.09 |
| Suspended solid (mg/L) | 0.40 | 34.40 | 7.80 | 5.72 |
| TOC (mg/L) | 1.10 | 5.20 | 2.07 | 0.52 |
| HRT (day) | 26.13 | 233217.65 | 3270.04 | 21737.92 |
| Water level (m) | 2.29 | 5.68 | 4.50 | 0.81 |
| Precipitation (mm/month) | 0.00 | 1204.00 | 168.51 | 198.65 |
| Log10 (phytoplankton abundance (cells/L)) | 6.15 | 8.28 | 7.13 | 0.62 |

2.2 Data Partitioning and Selected Water Parameters

The collected data were divided into two sets. Data from May 2001 to December 2008 were used to build the models, i.e., as the training set for ANN. Data from January 2009 to February 2011 were used for testing. In practice, for ANN, it is more desirable to split the data into training, validation, and testing sets. However, because of the limited data in the present study, the data were segregated into only two parts [20].

The most important constraint affecting a generic input design is the selection of input variables, i.e., the availability of measurements across a range of water quality data that are related, according to domain knowledge, to the growth of phytoplankton. From a mechanistic point of view, it may be difficult to select the variables that drive the growth of phytoplankton from existing theories, as these causal factors may differ between reservoirs and periods. In this study, correlation analysis was conducted to identify the water parameters that were significantly correlated with phytoplankton abundance. Only the parameters with correlation coefficients greater than 0.3 in absolute value were selected as inputs to the PCR, ANN, and hybrid models. It was also noted that the parameters selected in the forecast models differ from those in the prediction models, as the time-lagged water parameters (past records) were also used in the correlation analysis.

To investigate the effects of time representation on the model, two types of ANN structure were devised based on the selected water variables. The structure of the current-month model (i.e., the prediction model) allows current-month prediction of phytoplankton abundance from the selected driving variables, while the structure of the model based on the preceding 3 months (i.e., the forecast model) performs 1-month-ahead predictions of phytoplankton abundance based on the selected variables of the last 3 months, including the biotic parameter of phytoplankton abundance. In the forecast models of PCR, ANN, PCA-ANN, and PCR-ANN, phytoplankton abundance at time t is a function of the water parameters at times t−1, t−2, and t−3, where t−1, t−2, and t−3 denote 1, 2, and 3 months prior to time t.
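The input construction described above can be illustrated with a minimal sketch. It builds the t−1, t−2, and t−3 columns from monthly series and keeps those whose correlation with the target exceeds 0.3 in absolute value. The function names (`pearson_r`, `lagged_inputs`, `select_inputs`) and the toy records are hypothetical and are not taken from the study; the study itself does not state how the screening was implemented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_inputs(series, lags=(1, 2, 3)):
    """Build time-lagged columns aligned with the target at times t >= max(lags)."""
    max_lag = max(lags)
    columns = {}
    for name, values in series.items():
        for k in lags:
            # values[t - k] for every t in max_lag .. len(values) - 1
            columns[f"{name}(t-{k})"] = values[max_lag - k: len(values) - k]
    return columns


def select_inputs(columns, target, threshold=0.3):
    """Keep the columns whose |r| with the target exceeds the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson_r(col, target)) > threshold]


# Hypothetical monthly records (not MSR data), oldest first.
records = {
    "log_phyto": [6.6, 6.9, 7.3, 7.6, 7.4, 7.0, 6.7, 6.8, 7.2, 7.5],
    "TN": [0.20, 0.28, 0.35, 0.42, 0.38, 0.30, 0.22, 0.25, 0.33, 0.40],
}
lags = (1, 2, 3)
target = records["log_phyto"][max(lags):]   # phytoplankton abundance at time t
candidates = lagged_inputs(records, lags)   # includes lagged phytoplankton itself
selected = select_inputs(candidates, target)
print(selected)
```

For the prediction (current-month) model the same screening would be applied with no lag, i.e., directly to the current-month columns.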

3 Modeling Methods

3.1 Principal Component Regression

PCR consists of two parts, principal component analysis (PCA) and MLR. PCA is an important data pre-processing procedure for regression analysis that removes multicollinearity between water variables. PCA and MLR were carried out using the PASW 19 software package (SPSS, Inc.). The PCA was performed on the water parameters to rank their relative significance and to describe their interrelation patterns as well as their relation to the phytoplankton population levels. Since phytoplankton abundance was not normally distributed, a logarithmic transformation was applied to the phytoplankton data used in PCA. The Kaiser–Meyer–Olkin (KMO) measure was used to assess sampling adequacy, and Bartlett's test of sphericity was applied to verify the applicability of PCA [21]. The stepwise option was used to choose the principal components, and the principal component scores of the selected parameters were used as independent variables in the MLR to check whether the occurrences of phytoplankton could be explained by environmental variables as well as to predict the phytoplankton abundance. In the stepwise method, nonsignificant score values were excluded from the model. The modeling of phytoplankton abundance by MLR can be written as

$$\text{Phytoplankton abundance} = c + b_1 s_1 + \cdots + b_n s_n + e,$$

where $c$ is a constant term, $b_n$ is the regression coefficient of the score value of the $n$th PC, $s_n$ is the score value of the $n$th PC, and $e$ is the error term of the model.

3.2 Artificial Neural Network

The feed-forward network is the most commonly used structure in artificial neural network modeling. Determining the size of the hidden layer is a significant task in ANN, and some general rules for selecting the number of hidden nodes $N_H$ need to be followed. One is that $N_H$ should lie between $N_I$ and $2N_I + 1$ [22], where $N_I$ is the number of input nodes. Moreover, in order to prevent overfitting of the training data, Rogers and Dowla [23] suggest that the condition $N_H \le N_{TR}/(N_I + 1)$ be satisfied, where $N_{TR}$ is the number of training samples. In this study, a trial-and-error approach was carried out to find the optimum number of hidden nodes in the models. In general, a network structure with fewer hidden nodes is preferable, as it usually gives better generalization capabilities and fewer overfitting problems.

3.3 Hybrid Model

The hybrid model was based on the previous findings that MLR succeeds in describing linear relationships in the data but has difficulties with multiple variables, multicollinearity, and outliers, whereas ANN is good at modeling nonlinear data. Thus, a hybrid method can increase the chance of capturing different patterns in the data and improve prediction power. The detailed procedure of the combination method was proposed by Zhang et al. [3]. The relationship between the linear and nonlinear components is written as $y_t = L_t + N_t$, where $L_t$ is the linear component and $N_t$ is the nonlinear component. PCR is used to estimate the linear component of the data, and the residuals from the linear model, which contain the nonlinear component, are then found as $e_t = y_t - \hat{L}_t$, where $e_t$ is the residual at time $t$ from the PCR model, $y_t$ is the observed value at time $t$, and $\hat{L}_t$ is the forecasted value at time $t$ from the PCR model. ANNs are then used to model the residuals from the PCR model using $e_t = f(e_{t-1}, e_{t-2}, \ldots, e_{t-n}) + \varepsilon_t$, where $f$ is a nonlinear function determined by the neural network and $\varepsilon_t$ is the random error. The prediction or forecast from the ANN model is denoted by $\hat{J}_t$. The combined forecast $\hat{y}_t$ is then expressed as $\hat{y}_t = \hat{L}_t + \hat{J}_t$.

In summary, the proposed hybrid methodology involves two steps. In the first step, PCR is used to model the linear component of the problem. In the second step, an artificial neural network is used to model the residuals from the PCR model. Since the PCR model can only capture the linear structure of the data, the residuals of the linear model contain the nonlinear information; thus, the results from the artificial neural network can be used as the prediction of the error terms of the PCR model.

Table 2 Correlation analysis of prediction and forecast models

| Parameters | Prediction model | Forecast model, t−1 | Forecast model, t−2 | Forecast model, t−3 |
|---|---|---|---|---|
| Turbidity | −0.03 | 0.00 | −0.01 | −0.06 |
| Temperature | 0.19 | 0.21 | 0.19 | 0.14 |
| pH | 0.49 | 0.42 | 0.38 | 0.33 |
| Conductivity | −0.08 | 0.01 | 0.14 | 0.21 |
| Cl⁻ | 0.01 | 0.10 | 0.22 | 0.28 |
| SO₄²⁻ | −0.03 | 0.03 | 0.14 | 0.22 |
| SiO₂ | 0.33 | 0.31 | 0.16 | 0.04 |
| Alkalinity | −0.34 | −0.30 | −0.21 | −0.12 |
| HCO₃⁻ | −0.46 | −0.40 | −0.32 | −0.24 |
| DO (Dissolved oxygen) | 0.39 | 0.35 | 0.34 | 0.31 |
| NO₃⁻ | −0.29 | −0.22 | −0.22 | −0.15 |
| NO₂⁻ | −0.10 | −0.08 | −0.02 | 0.03 |
| NH₄⁺ | 0.11 | 0.10 | 0.08 | 0.25 |
| TN (Total nitrogen) | 0.68 | 0.60 | 0.53 | 0.46 |
| UV254 | 0.56 | 0.55 | 0.48 | 0.47 |
| Fe | −0.14 | −0.06 | −0.04 | −0.08 |
| PO₄³⁻ | 0.02 | 0.06 | 0.06 | 0.03 |
| TP (Total phosphorus) | 0.08 | 0.05 | 0.02 | 0.00 |
| Suspended solid | 0.31 | 0.35 | 0.31 | 0.23 |
| TOC (Total organic carbon) | 0.38 | 0.33 | 0.29 | 0.35 |
| HRT (Hydraulic retention time) | −0.12 | −0.11 | −0.13 | −0.16 |
| Water level | 0.13 | 0.05 | 0.01 | −0.02 |
| Precipitation | −0.09 | 0.05 | 0.11 | 0.06 |
| Phytoplankton abundance | – | 0.82 | 0.71 | 0.62 |

Table 3 Eigenvalue and percentage variance of the nine principal components for the prediction model

| PC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Eigenvalue | 2.90 | 2.05 | 1.56 | 0.86 | 0.60 | 0.49 | 0.33 | 0.19 | 0.02 |
| % variance | 32.25 | 22.81 | 17.35 | 9.50 | 6.62 | 5.43 | 3.63 | 2.16 | 0.24 |

3.4 Training Algorithm and Stopping Criteria

The batch gradient descent backpropagation algorithm was adopted in the present study; batch implies that all inputs are applied to the network before the weights are updated. The backpropagation (BP) training algorithm for feed-forward ANN was first introduced in 1986 [24] and has since been widely implemented in ANN modeling [25, 26]. It updates the network weights and biases in the direction in which the performance function decreases most rapidly, following the slope of the error surface downward toward its minimum.

Stopping criteria determine when the training should stop. If the training stops too early, the network is undertrained and is not able to learn the pattern. On the other hand, if the training stops too late, the network is overtrained. Usually, an overtrained network has great training performance but poor prediction ability. In the present study, several stopping criteria were adopted: (i) maximum number of iterations, (ii) maximum training time, (iii) targeted total sum squared error, and (iv) minimum gradient. The training stops when it meets any one of these criteria. The values of the stopping criteria are problem dependent [26] and are set according to the available computational time or the need to train to an acceptable performance.

Scaling is often an important data pre-processing procedure in artificial neural networks; as the input data have different units, the ranges of different parameters may differ by several fold. In ANN, the contribution of an input to the model depends heavily on its variability relative to other inputs; that is, the more important inputs in prediction will have the largest weights, while those that are less important will have lower weights, so it is important to rescale the inputs so that their variability reflects their importance. For lack of prior information, it is common to standardize the inputs to the same standard deviation; the standard approach is to subtract the mean and divide by the standard deviation. This normalizes the input and target parameters so that they have zero mean and unit standard deviation.
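The standardization and the four stopping criteria described in this section can be sketched together as follows. This is a minimal illustration only, assuming a single hidden layer of tanh units with a linear output; the study does not report its network configuration in this section, and the function names, learning rate, thresholds, and toy data below are all hypothetical.

```python
import math
import random
import time


def standardize(column):
    """Return z-scores of a column plus the mean and standard deviation used."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column], mean, std


def train_batch_gd(X, y, n_hidden=3, lr=0.01, max_iter=2000,
                   max_seconds=10.0, target_sse=1e-3, min_grad=1e-6, seed=0):
    """Fit a one-hidden-layer tanh network by batch gradient descent.

    Returns (predict, sse, reason); reason names the criterion that stopped training.
    """
    rng = random.Random(seed)
    n_in = len(X[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = [0.0]  # one-element list so it can be updated in place

    def hidden(x):
        return [math.tanh(sum(w * v for w, v in zip(w1[j], x)) + b1[j])
                for j in range(n_hidden)]

    def predict(x):
        return sum(w * h for w, h in zip(w2, hidden(x))) + b2[0]

    def sse():
        return sum((predict(x) - t) ** 2 for x, t in zip(X, y))

    start = time.monotonic()
    reason = "max_iterations"                        # criterion (i)
    for _ in range(max_iter):
        if time.monotonic() - start > max_seconds:   # criterion (ii)
            reason = "max_time"
            break
        if sse() <= target_sse:                      # criterion (iii)
            reason = "target_sse"
            break
        # Batch mode: accumulate the gradient over all samples first.
        g_w1 = [[0.0] * n_in for _ in range(n_hidden)]
        g_b1 = [0.0] * n_hidden
        g_w2 = [0.0] * n_hidden
        g_b2 = 0.0
        for x, t in zip(X, y):
            h = hidden(x)
            err = sum(w * v for w, v in zip(w2, h)) + b2[0] - t
            g_b2 += err
            for j in range(n_hidden):
                g_w2[j] += err * h[j]
                back = err * w2[j] * (1.0 - h[j] ** 2)  # backpropagated error
                g_b1[j] += back
                for i in range(n_in):
                    g_w1[j][i] += back * x[i]
        flat = [g_b2] + g_w2 + g_b1 + [g for row in g_w1 for g in row]
        if math.sqrt(sum(g * g for g in flat)) <= min_grad:  # criterion (iv)
            reason = "min_gradient"
            break
        # Only now update every weight, using the mean gradient.
        step = lr / len(X)
        b2[0] -= step * g_b2
        for j in range(n_hidden):
            w2[j] -= step * g_w2[j]
            b1[j] -= step * g_b1[j]
            for i in range(n_in):
                w1[j][i] -= step * g_w1[j][i]
    return predict, sse(), reason


# Hypothetical toy data: one input with a curved response.
raw_x = [float(v) for v in range(20)]
raw_y = [0.05 * (v - 10.0) ** 2 for v in raw_x]
zx, _, _ = standardize(raw_x)
zy, _, _ = standardize(raw_y)
predict, final_sse, reason = train_batch_gd([[v] for v in zx], zy)
print(reason, round(final_sse, 3))
```

When a separate testing set is scored, a common choice (an assumption here, not stated in the text) is to reuse the mean and standard deviation returned for the training columns rather than recomputing them on the testing data.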

Table 4 Composition of the principal components for the prediction model

| Variables | PC1 | PC2 | PC3 |
|---|---|---|---|
| Alkalinity | .967 | .005 | −.024 |
| Bicarbonate | .950 | −.013 | −.238 |
| UV254 | .079 | .853 | .316 |
| Suspended solid | .125 | .818 | −.161 |
| Total nitrogen | −.197 | .602 | .453 |
| Silicon | −.396 | .593 | −.338 |
| Dissolved oxygen | −.253 | −.125 | .806 |
| pH | −.414 | .082 | .746 |
| Total organic carbon | .206 | .110 | .576 |

3.5 Performance Indicators

The performance of the models was evaluated using the following indicators: the square of the correlation coefficient (R²), which measures the variability of the data reproduced by the model, and the mean absolute error (MAE) and RMSE, which measure residual errors and give a global idea of the difference between observation and modeling. The indicators are defined as

$$R^2 = 1 - \frac{F}{F_o}, \qquad F = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2, \qquad F_o = \sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2,$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{Y}_i - Y_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2},$$

where $n$ is the number of data, $Y_i$ and $\bar{Y}$ are the observation data and the mean of the observation data, respectively, and $\hat{Y}_i$ is the modeling result.
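The three indicators of Section 3.5 translate directly into code. The sketch below follows the definitions term by term; the function names and the four-point toy series are illustrative assumptions, not values from the study.

```python
import math


def r_squared(obs, pred):
    """R^2 = 1 - F / F_o, with F the residual and F_o the total sum of squares."""
    mean_obs = sum(obs) / len(obs)
    f = sum((y - p) ** 2 for y, p in zip(obs, pred))
    f_o = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - f / f_o


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - y) for y, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(obs, pred)) / len(obs))


# Hypothetical observed and modeled log10 abundances.
observed = [6.5, 7.0, 7.5, 8.0]
modeled = [6.6, 6.9, 7.6, 7.8]
print(r_squared(observed, modeled), mae(observed, modeled), rmse(observed, modeled))
```

Note that, as defined, R² compares the model against the mean of the observations, so it can become negative for a model that fits worse than that mean.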

Table 5 Eigenvalue and percentage variance of the 23 principal components for the forecast model

| PC | Eigenvalue | % variance |
|---|---|---|
| 1 | 7.78 | 33.84 |
| 2 | 3.79 | 16.46 |
| 3 | 1.96 | 8.51 |
| 4 | 1.81 | 7.87 |
| 5 | 1.41 | 6.14 |
| 6 | 1.07 | 4.66 |
| 7 | 0.77 | 3.35 |
| 8 | 0.65 | 2.82 |
| 9 | 0.54 | 2.37 |
| 10 | 0.48 | 2.07 |
| 11 | 0.40 | 1.74 |
| 12 | 0.37 | 1.61 |
| 13 | 0.33 | 1.42 |
| 14 | 0.27 | 1.17 |
| 15 | 0.25 | 1.08 |
| 16 | 0.22 | 0.98 |
| 17 | 0.21 | 0.89 |
| 18 | 0.19 | 0.83 |
| 19 | 0.15 | 0.67 |
| 20 | 0.13 | 0.58 |
| 21 | 0.12 | 0.51 |
| 22 | 0.09 | 0.37 |
| 23 | 0.02 | 0.08 |

4 Results and Discussion

4.1 Correlation Analysis

The correlations between log10 phytoplankton abundance and the water parameters for the prediction and forecast models are shown in Table 2. Parameters with correlation coefficients greater than 0.3 in absolute value were retained in the models. Thus, only 9 parameters were used in the prediction models, and 23 time-lagged parameters were selected for the forecast models. It is interesting to note that the two types of models selected the same water parameters (pH, SiO₂, alkalinity, HCO₃⁻, DO, TN, UV254, suspended solids, and TOC) as inputs.

Table 6 Composition of the principal components for the forecast model

| Variables | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| UV254 (t−1) | .899 | .002 | .117 | .108 | .057 | .040 |
| UV254 (t−2) | .854 | −.129 | .068 | .162 | .104 | .079 |
| UV254 (t−3) | .756 | −.209 | .071 | .093 | .170 | .128 |
| Suspended solid (t−2) | .712 | −.043 | .245 | −.375 | −.067 | −.134 |
| Suspended solid (t−1) | .652 | −.040 | −.299 | −.026 | .053 | −.188 |
| Algae population (t−3) | .636 | .291 | .067 | .193 | .344 | .400 |
| Algae population (t−2) | .590 | .391 | .141 | .363 | .226 | .348 |
| Total nitrogen (t−2) | .566 | .204 | .036 | .419 | .164 | .328 |
| Total nitrogen (t−3) | .566 | .030 | .070 | .115 | .402 | .363 |
| Total nitrogen (t−1) | .539 | .299 | .368 | .167 | .150 | .326 |
| Algae population (t−1) | .512 | .416 | .362 | .257 | .200 | .291 |
| Alkalinity (t−1) | .293 | −.810 | −.340 | −.091 | −.130 | .140 |
| Bicarbonate (t−2) | .121 | −.734 | −.095 | −.513 | −.143 | −.009 |
| Silicon (t−1) | .400 | .728 | −.317 | −.121 | −.056 | −.094 |
| Bicarbonate (t−1) | .170 | −.727 | −.554 | −.144 | −.154 | .091 |
| Dissolved oxygen (t−1) | .092 | .057 | .921 | .028 | .084 | .106 |
| pH (t−1) | .202 | .151 | .808 | .204 | .041 | .155 |
| Dissolved oxygen (t−2) | .074 | .086 | .154 | .915 | .025 | .022 |
| pH (t−2) | .217 | .103 | .095 | .848 | .174 | .057 |
| Dissolved oxygen (t−3) | .106 | .111 | .141 | .124 | .898 | .007 |
| pH (t−3) | .268 | .111 | .010 | .080 | .855 | .101 |
| Total organic carbon (t−1) | .083 | −.041 | .200 | −.025 | −.129 | .824 |
| Total organic carbon (t−3) | .025 | −.205 | .011 | .103 | .319 | .648 |

4.2 Principal Component Analysis

The KMO values for both the prediction and forecast models were above the criterion value of 0.6, indicating that PCA was applicable [21]. The PCA for the prediction model was performed using the nine parameters selected by the correlation analysis. Table 3 shows that the first three principal components explain 72.42 % of the variation in the data. The scree test suggested retaining only the three components with eigenvalues greater than 1, in which all nine parameters were included. The composition of the three principal components is shown in Table 4.
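The retention rule used above (keep the components whose eigenvalue exceeds 1) and the percentage of variance explained can be checked with a short worked example on the eigenvalues printed in Table 3. The helper names are hypothetical; only the eigenvalues come from the paper.

```python
# Eigenvalues as printed in Table 3 (prediction model, nine components).
eigenvalues = [2.90, 2.05, 1.56, 0.86, 0.60, 0.49, 0.33, 0.19, 0.02]


def percent_variance(eigs):
    """Share of total variance carried by each component, in percent."""
    total = sum(eigs)
    return [100.0 * e / total for e in eigs]


def retained_components(eigs, cutoff=1.0):
    """1-based indices of the components whose eigenvalue exceeds the cutoff."""
    return [i + 1 for i, e in enumerate(eigs) if e > cutoff]


retained = retained_components(eigenvalues)
cumulative = sum(percent_variance(eigenvalues)[:len(retained)])
print(retained, round(cumulative, 2))
```

With these rounded eigenvalues the rule retains components 1 to 3, and their cumulative share comes out at about 72.3 %, close to the 72.42 % reported in the text; the small difference presumably reflects rounding of the printed eigenvalues.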


Table 7 Performance indices of the four prediction models

| Performance index | PCR | ANN | PCA-ANN | PCR-ANN |
|---|---|---|---|---|
| Accuracy performance (training set) | | | | |
| R² | 0.637 | 0.758 (0.752) | 0.750 (0.728) | 0.720 (0.708) |
| RMSE | 0.372 | 0.304 (0.307) | 0.308 (0.321) | 0.326 (0.333) |
| MAE | 0.302 | 0.233 (0.238) | 0.237 (0.239) | 0.244 (0.252) |
| Generalization performance (testing set) | | | | |
| R² | 0.747 | 0.792 (0.749) | 0.736 (0.717) | 0.711 (0.703) |
| RMSE | 0.318 | 0.288 (0.316) | 0.325 (0.332) | 0.340 (0.342) |
| MAE | 0.263 | 0.226 (0.243) | 0.228 (0.251) | 0.293 (0.261) |

Entries in parentheses represent the average 50-run performance for ANN, PCA-ANN, and PCR-ANN models

For the forecast model, the PCA was performed using the 23 selected parameters from the result of correlation analysis. Table 5 shows that the first six principle components can explain 77.47 % variation of the data variation. The scree test suggested only six components with the eigenvalues greater than 1 to be retained, in which all the 23 parameters were included in the models. The composition of the six principle components is shown in Table 6. 4.3 Multiple Linear Regression The MLR results for the prediction and forecast model of phytoplankton abundance can be written as Log10(phytoplankton abundance) = 7.134–0.072(PC1)+ 0.175(PC2) + 0.135(PC3) and Log 1 0 (phytoplankton abundance)=7.142+0.042(PC1)+0.059(PC2) +0.069(PC3)+ 0.064(PC4) +0.11(PC6), respectively. 4.4 Input Variables of ANN Models As shown in the part of correlation analysis, nine parameters were selected for the ANN prediction model, while 23 time-lagged parameters were selected for the ANN forecast model. After further PCA pre-processing, the numbers of inputs of the PCA-ANN and PCR-ANN prediction models were reduced from nine to three and three, while those of the PCA-ANN and PCR-ANN forecast models were reduced from 23 to 6 and 6. Thus, PCA reduced the dimension of the input variables, which simplifies the complexity of the modeling system. 4.5 Modeling Result Comparison Visual assessment of predictions or forecasts determines whether the models are able to predict or forecast the magnitude and timing of algal blooms. Correlation coefficients can provide a normalized numerical representation of model fit

that can compare the prediction power and performance of different models. Testing of the models invoked two parts, the accuracy performance and the generalization performance. Accuracy performance is to test the capability of the model to predict the output for the given input set that originally used to train the model, while generalization performance is to test the capability of the model to predict the output for the given input sets that were not in the training set. In order to prevent the model that is memorizing the inputs instead of generalized learning, both performance checks need to be considered. Due to the random initializing conditions for the ANN models, different runs may simulate slightly different results. In the present study, the 50-run-averaged performance indices for ANN’s models were used. Visual assessment showed that the models were able to predict and forecast seasonal patterns in phytoplankton abundance on selected variables to some degree. The performance of prediction models is shown in Table 7. The ANN model has the best performance, with the highest R2 (0.758 and 0.792) and the lowest RMSE (0.304 and 0.288) and MAE (0.233 and 0.226) for both accuracy and generalization performance. The PCA-ANN

Table 8 Performance indices of the four forecast models Performance index Accuracy performance (training set) PCR ANN PCA-ANN PCR-ANN R2 0.516 0.784 (0.758) 0.763 (0.730) 0.745 (0.732) RMSE 0.423 0.283 (0.299) 0.296 (0.315) 0.307 (0.313) MAE 0.349 0.215 (0.229) 0.215 (0.241) 0.236 (0.238) Performance index Generalization performance (testing set) R2 RMSE MAE

PCR 0.754 0.313 0.250

ANN 0.826 (0.760) 0.264 (0.306) 0.213 (0.247)

PCA-ANN 0.782 (0.734) 0.295 (0.324) 0.220 (0.254)

PCR-ANN 0.739 (0.732) 0.323 (0.313) 0.244 (0.238)

Entries in parentheses represent the average 50-run performance for ANN, PCA-ANN, and PCR-ANN models

362

I. In Ieong et al.

algae population (log10)

9 Observed algae population ANN PCA-ANN PCR PCR-ANN

8.5 8 7.5 7 6.5 6 2001

2002

2003

2004

2005

2006

2007

2008

2009

Year

Fig. 1 Observed and predicted phytoplankton for the training dataset of the prediction models

and the PCR-ANN hybrid models with less input variables did not show improvement than the ANN model, and the PCR has the worst accuracy performance confirming that PCR cannot handle well the nonlinear relationship between water parameters and phytoplankton abundance. The 50-run-averaged performance indices showed the same trend as the best performance indices. The performance indices for the forecast models are shown in Table 8. The ANN model was successful in both training and testing prediction, with the highest R2 (0.784 and 0.826) and the lowest RMSE (0.283 and 0.264) and MAE (0.215 and 0.213) in both accuracy and generalization performance. When comparing the PCA-ANN model with the ANN model, no improvement has been shown; this suggest that preprocessing the data by PCA to the ANN model is ineffective in this study. The PCR model shows a comparable generalization performance with the ANN model in the testing set, but the accuracy performance for the training set (R2 =0.516, RMSE=0.423, MAE=0.349) was the worst amount than all other models; this may imply that the MLR model cannot handle the nonlinear complex relationship between the input parameters and the phytoplankton abundance. In the PCRANN hybrid model, no improvement was observed when compared with the ANN model. A previous study [27] suggested that PCA can reduce the complexity and collinearity of the input variables, which can

improve convergence rate and performances of the models. However, in the present study, the ANN models without PCA pre-processing (with more input variables) outcompete the PCA-ANN and PCR-ANN model. This is because, although the PCA-ANN and PCR-ANN are models combining advantages of PCA with the ANN, some information on nonlinear dependence may be lost through this integration. The most significant components, explanatory variables, are retained as the outcome of PCA, while the highly nonlinear dependence between those variables has been removed by PCA and the consequent variables. This could be a reason for why the obtained results were different from others. Therefore, the issue could be problem sensitive, depending on the dependency structure of the data and the processes involved and to what extent they can be represented linearly. Properly designed experiments will be further used to evaluate this hypothesis in future research. In the simulation results of the training dataset (Fig. 1) and the testing dataset (Fig. 2) for the prediction models, the timing of algae bloom in different models matched with the observed historical data, except that only one event of blooming in year 2001 was missed, which showed that all of the four models have a certain extent in predicting the occurrence of algae bloom. When checking the magnitude of blooming in the simulation, it was found that the PCR model is more fluctuational and likely to make an underestimation in the

Fig. 2 Observed and predicted phytoplankton for the testing dataset of the prediction models (algae population, log10, versus year, 2009–2011; series: observed algae population, ANN, PCA-ANN, PCR, PCR-ANN)

Fig. 3 Observed and predicted phytoplankton for the training dataset of the 1-month-ahead forecast models (algae population, log10, versus year, 2001–2008; series: observed algae population, ANN, PCA-ANN, PCR, PCR-ANN)

peak events. The PCR model also did not learn as well as the other three models, as reflected in its poor accuracy performance (Table 7). Moreover, in the testing dataset simulation, all four models overestimated low phytoplankton levels. Comparison plots of the forecast model simulations are shown in Figs. 3 and 4. In the training dataset simulation, the timing of algal blooms in the PCR model was delayed, while the other three models still matched the observed historical data. As for the magnitude of blooming, PCR again tends to underestimate the peak events (Fig. 4), as it did in the prediction model. As expected from its poor performance indices (Table 8), the PCR model was less successful in forecasting than the other three models. A comparison of the prediction and forecast models in terms of correlation coefficients (Tables 7 and 8) showed a scenario-specific response to time representation. The forecast model performed slightly better than the prediction model, with R2 increasing from 0.758 to 0.784 for the training set and from 0.792 to 0.826 for the testing set. These results contradicted those obtained by Wilson and Recknagel

[19], who reported that the prediction model performed substantially better than the forecast model for eutrophic reservoirs. It was hypothesized that water bodies characterized by low nutrient concentrations, and thus reduced dynamics in the trophic state, were more compatible with the ANN forecast model because their phytoplankton community is generally supposed to have slow physiological and metabolic rates. However, this may not hold for the phytoplankton community in MSR, and further study is needed to confirm our hypothesis. In general, ANN predicted the algae population with a reasonable degree of accuracy in both the prediction and the forecast models. Considering the importance of the forecast model for a raw water monitoring program, three possible solutions could be considered for improving model performance. The first is to model the change in monthly algae population rather than the total phytoplankton abundance, so that the growth difference between two consecutive months can be simulated with respect to the water parameters. The second is to shorten the sampling interval from monthly to biweekly or even weekly, which would provide more accurate model calibration and implementation.

Fig. 4 Observed and predicted phytoplankton for the testing dataset of the 1-month-ahead forecast models (algae population, log10, versus year, 2009–2011; series: observed algae population, ANN, PCA-ANN, PCR, PCR-ANN)

However, this would increase the workload and operational costs of the additional sampling and measurements. The third solution is to include inputs from the same time in previous years, e.g., the same month of the previous year, assuming that the phytoplankton community and structure follow periodic patterns.
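The first and third solutions can be illustrated with a minimal sketch, which is not the implementation used in this study. It assumes a monthly series of log10 abundance and builds 1-month-ahead training rows whose target is the month-to-month change and whose inputs include the value from the same calendar month one year before the forecast month. The function names (`build_forecast_rows`, `level_from_change`), the two-feature layout, and the synthetic series are hypothetical; in practice the water quality variables would be added to the feature tuple.

```python
import math


def build_forecast_rows(log_abundance, seasonal_lag=12):
    """Turn a monthly log10 abundance series into 1-month-ahead training rows.

    Hypothetical feature layout (not the paper's input set):
      features = (value in month t, value in the same calendar month one year
                  before the forecast month)
      target   = change from month t to month t + 1
    """
    rows = []
    # Start at the first month t whose forecast month t + 1 has a value
    # seasonal_lag months earlier; stop one month before the series ends.
    for t in range(seasonal_lag - 1, len(log_abundance) - 1):
        current = log_abundance[t]
        same_month_last_year = log_abundance[t + 1 - seasonal_lag]
        change = log_abundance[t + 1] - current
        rows.append(((current, same_month_last_year), change))
    return rows


def level_from_change(current, predicted_change):
    """Convert a predicted monthly change back to an abundance level."""
    return current + predicted_change


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between two equal-length sequences."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    # Synthetic, strictly periodic series (illustrative values, not MSR data).
    series = [7.0 + 0.1 * (month % 12) for month in range(36)]
    rows = build_forecast_rows(series)
    # Naive seasonal baseline: forecast next month as last year's same month.
    observed = [level_from_change(f[0], change) for f, change in rows]
    baseline = [f[1] for f, _ in rows]
    print(len(rows), rmse(observed, baseline), mae(observed, baseline))
```

A model trained on such rows predicts the change, and `level_from_change` recovers the abundance level so that RMSE and MAE remain comparable with those of a model trained on the level directly. The seasonal input only helps to the extent that the periodicity assumption stated above holds for the reservoir.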

5 Conclusions

Forecast and prediction models based on PCR, ANN, PCA-ANN, and the hybrid model (PCR-ANN) were compared for simulating the phytoplankton abundance of MSR. PCR showed the worst prediction performance of the four methods, indicating that the complex nonlinear relationships among the variables in aquatic systems cannot be simulated using a linear model alone. In addition, ANN had the best prediction power, with R2 of 0.792–0.826 on the testing sets, implying that ANN copes well with the collinearity problem in the data. However, contrary to previous research reporting that hybrid models obtain better prediction results, the PCR-ANN in this study did not perform better than ANN alone. Hybrid ANN architectures incorporating other models, such as PCR, have been developed and evaluated in the literature. However, given the different characteristics of the data, it is difficult to draw any conclusion as to which model architecture should be used in a particular circumstance to improve forecasting accuracy. This can be a focus of future research. The prediction and forecast models proved successful for MSR in our study. However, more case studies comparing eutrophic and mesotrophic reservoirs are needed to confirm whether the ecological processes in eutrophic reservoirs proceed faster than those in mesotrophic reservoirs, resulting in phytoplankton community dynamics that may not be well compatible with the forecast model proposed in a previous study [19].

Acknowledgments We thank Macao Water Co., Ltd., for providing 10 years’ historical data of water quality parameters and phytoplankton abundances. The financial support from the Fundo para o Desenvolvimento das Ciências e da Tecnologia (FDCT) (grant no. FDCT/069/2014/A2) and the Research Committee at the University of Macau is gratefully acknowledged.

References

1. Selman, Z., Greenhalgh, S., & Diaz, R. (2008). Eutrophication and hypoxia in coastal areas: a global assessment of the state of knowledge. Washington, DC: World Resources Institute.
2. Camdevyren, H., Demyr, N., Kanik, A., & Keskyn, S. (2005). Use of principal component scores in multiple linear regression models for prediction of chlorophyll-a in reservoirs. Ecological Modelling, 181, 581–589.
3. Zhang, W., Lou, I., Ung, W. K., Kong, Y., & Mok, K. M. (2013). Eutrophication analysis and principle component regression for two subtropical storage reservoirs in Macau. Desalination and Water Treatment, 51(37–39), 7331–7340.
4. Utojo, U., & Bakshi, B. R. (1995). Connection between neural networks and multivariate statistical methods: an overview. In D. R. Baughman & Y. A. Liu (Eds.), Neural networks in bioprocessing and chemical engineering. New York: Academic Press.
5. Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., & Aulagnier, S. (1996). Application of neural networks to modelling nonlinear relationships in ecology. Ecological Modelling, 90, 39–52.
6. Wei, B., Sugiura, N., & Maekawa, T. (2001). Use of artificial neural network in the prediction of algal blooms. Water Research, 35(8), 2022–2028.
7. Kuo, J. T., Hsieh, M. H., Lung, W. S., & She, N. (2007). Using artificial neural network for reservoir eutrophication prediction. Ecological Modelling, 200, 171–177.
8. French, M., & Recknagel, F. (1994). Modelling of algal blooms in freshwaters using artificial neural networks. In P. Zanetti (Ed.), Computer techniques in environmental studies V, environmental systems, vol. II (pp. 87–94). Boston: Computational Mechanics Publications.
9. Oh, H. M., Ahn, C. Y., Lee, J. W., Chon, T. S., Choi, K. H., & Park, Y. S. (2007). Community patterning and identification of predominant factors in algal bloom in Daechung Reservoir (Korea) using artificial neural network. Ecological Modelling, 203, 109–118.
10. Singh, K. P., Basant, A., Malik, A., & Jain, G. (2009). Artificial neural network modeling of the river water quality—a case study. Ecological Modelling, 220, 888–895.
11. Jung, N. C., Popescu, L., Kelderman, P., Solomatine, D. P., & Price, R. K. (2010). Application of model trees and other machine learning techniques for algal growth prediction in Yongdam reservoir, Republic of Korea. Journal of Hydroinformatics, 12(3), 262–274.
12. Maier, H. R., Jain, A., Dandy, G. C., & Sudheer, K. P. (2010). Methods used for the development of neural networks for the prediction of water resource variables in river systems: current status and future directions. Environmental Modelling and Software, 25, 801–909.
13. Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159–175.
14. Markham, I. S., & Rakes, T. R. (1998). The effect of sample size and variability of data on the comparative performance of artificial neural networks and regression. Computers and Operations Research, 25, 251–263.
15. Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operations Research Quarterly, 20, 448–451.
16. Maier, H., Dandy, G., & Burch, M. (1998). Use of artificial neural networks for modelling cyanobacteria Anabaena spp. in the River Murray, South Australia. Ecological Modelling, 105, 257–272.
17. Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., & Wilson, H. (1998). Modelling and prediction of phyto- and zooplankton dynamics in Lake Kasumigaura by artificial neural networks. Lakes & Reservoirs: Research and Management, 3, 123–133.
18. Ieong, I. I., Lou, I., Ung, W. K., & Mok, K. M. (2012). Freshwater phytoplankton prediction model by artificial neural network. In Proceedings of IWA HIC (2012), 10th International Conference on Hydroinformatics, Hamburg, Germany.
19. Wilson, H., & Recknagel, F. (2001). Towards a generic artificial neural network model for dynamic predictions of algal abundance in freshwater lakes. Ecological Modelling, 146, 69–84.
20. Tyagi, P., Chandramouli, V., Lingireddy, S., & Buddhi, D. (2008). Relative performance of artificial neural networks and regression models in predicting missing water quality data. Environmental Engineering Science, 25(5), 657–668.
21. Pallant, J., Chorus, I., & Bartram, J. (2007). Toxic cyanobacteria in water, SPSS Survival Manual. New York: McGraw Hill.
22. Hecht-Nielsen, R. (1987). Kolmogorov’s mapping neural network existence theorem. In Proceedings of the IEEE International Conference on Neural Networks, pp. 11–13. New Jersey: IEEE Press.
23. Rogers, L. L., & Dowla, F. U. (1994). Optimization of groundwater remediation using artificial neural networks with parallel solute transport modeling. Water Resources Research, 30(2), 457–481.
24. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel distributed processing. Cambridge: MIT Press.
25. Melesse, A. M., Krishnaswamy, J., & Zhang, K. (2008). Modeling coastal eutrophication at Florida Bay using neural networks. Journal of Coastal Research, 24(2), 190–196.
26. Palani, S., Liong, S. Y., & Tkalich, P. (2008). An ANN application for water quality forecasting. Marine Pollution Bulletin, 56, 1586–1597.
27. Sousa, S. I. V., Martins, F. G., Alvim-Ferraz, M. C. M., & Pereira, M. C. (2007). Multiple linear regression and artificial neural networks based on principal components to predict ozone concentrations. Environmental Modelling and Software, 22, 97–103.