Determination of Input for Artificial Neural Networks for Flood Forecasting Using the Copula Entropy Method

Downloaded from ascelibrary.org by TEXAS A&M UNIVERSITY on 01/15/15. Copyright ASCE. For personal use only; all rights reserved.

Lu Chen, Ph.D. 1; Lei Ye 2; Vijay Singh, F.ASCE 3; Jianzhong Zhou 4; and Shenglian Guo 5

Abstract: Artificial neural networks (ANNs) have proved to be an efficient alternative to traditional methods for hydrological modeling. One of the most important steps in ANN development is the determination of significant input variables. This study proposes a new method based on the copula-entropy (CE) theory to identify the inputs of an ANN model. The CE theory permits the calculation of mutual information (MI) and partial mutual information (PMI), which characterize the dependence between potential model input and output variables, directly instead of through the marginal and joint probability distributions. Two tests were carried out to verify the accuracy and performance of the CE method. The CE theory-based input determination methodology was applied to identify suitable inputs for a flood forecasting model in a real-world case study involving the Three Gorges Reservoir (TGR) in China. Test results from applying the flood forecasting model to the upper Yangtze River indicate that the proposed method appropriately identifies inputs for the ANN, with the smallest root-mean-square error (RMSE) for training, testing, and validation data. DOI: 10.1061/(ASCE)HE.1943-5584.0000932. © 2014 American Society of Civil Engineers. Author keywords: Flood forecasting; Artificial neural networks; Input variables selection; Copula entropy; Partial mutual information (PMI).

1 College of Hydropower and Information Engineering, Huazhong Univ. of Science and Technology, Wuhan 430074, China. E-mail: chl8505@126.com
2 Ph.D. Candidate, College of Hydropower and Information Engineering, Huazhong Univ. of Science and Technology, Wuhan 430074, China.
3 Professor, Distinguished Professor and Caroline and William N. Lehrer Distinguished Chair in Water Engineering, Dept. of Biological and Agricultural Engineering and Dept. of Civil and Environmental Engineering, Texas A&M Univ., TAMU, College Station, TX 77843-2117.
4 Professor, College of Hydropower and Information Engineering, Huazhong Univ. of Science and Technology, Wuhan 430074, China.
5 Professor, State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan Univ., Wuhan 430072, China (corresponding author). E-mail: [email protected]
Note. This manuscript was submitted on February 19, 2013; approved on October 24, 2013; published online on October 26, 2013. Discussion period open until December 8, 2014; separate discussions must be submitted for individual papers. This paper is part of the Journal of Hydrologic Engineering, © ASCE, ISSN 1084-0699/04014021(14)/$25.00.

Introduction

Artificial neurons (AN), first introduced in 1943 (McCulloch and Pitts 1943), mimic the functioning of a human brain by acquiring knowledge through a learning process that involves finding an optimal set of weights for the connections and threshold values for the nodes. Artificial neural networks (ANNs) have become extremely popular for prediction and forecasting in a number of areas, including finance, power generation, medicine, water resources, and environmental science (Maier and Dandy 2000). At present, ANNs have proved to be an efficient alternative to traditional methods for hydrological modeling, such as streamflow forecasting (Hu et al. 2005; Corzo and Solomatine 2007; Solomatine et al. 2007; Shrestha and Solomatine 2008; Elshorbagy et al. 2010a, b) and groundwater modeling (Lallahem and Mania 2003; Chang et al. 2007). One of the most important steps in ANN development is the determination of significant input variables (Bowden et al. 2005a;

Fernando et al. 2009). In most water resources applications of ANNs, little attention has been given to the task of selecting appropriate model inputs (Maier and Dandy 2000). In general, not all of the potential input variables will be equally informative, since some may be correlated, noisy, or may have no significant relationship with the output variable being modeled (Bowden et al. 2005a). Including a large number of inputs in ANN models and relying on the network to determine the critical model inputs usually increases the network size (Maier and Dandy 2000). This also brings a number of disadvantages, such as decreasing processing speed and increasing the amount of data required to efficiently estimate the connection weights (Lachtermacher and Fuller 1994). Fernando et al. (2009) indicated that the task of an input selection algorithm is to determine the strength of the relationship between potential model inputs and outputs. However, real systems are generally complex and mostly associated with nonlinear processes. Therefore, the dependencies between output and input variables are difficult to measure. There are several traditional methods to describe the structure of dependence between variables. One of them is the linear relation, which mainly appears in regression models, is measured by the covariance and correlation coefficient, and is based on the multivariate normal distribution (Xu 2005; Zhao and Lin 2011). The drawbacks of the linear correlation method are that (1) it only applies to a linear correlation; and (2) it tends to focus on the degree of dependence and ignores the structure of dependence (Zhao and Lin 2011). Two important measures of dependence (concordance), known as Kendall's tau and Spearman's rho, provide perhaps the best alternatives to the linear correlation coefficient as a measure of dependence for non-Gaussian distributions, for which the linear correlation coefficient is inappropriate and often misleading.
The disadvantage of rank-based correlation coefficients is that there is a loss of information when the data are converted to ranks; if the data are normally distributed, they are less powerful than the Pearson correlation coefficient (Gauthier 2001). An alternative method is based on entropy theory. In the entropy theory, mutual information (MI) has been successfully employed as

a nonlinear measure of inference among variables by many researchers (e.g., Khan et al. 2006; Molini et al. 2006; Ng et al. 2007; Hejazi et al. 2008; Alfonso et al. 2010, 2012). Mutual information, defined as the difference between marginal and conditional entropy, is a measure of the amount of information that one random variable contains or explains about another random variable. It can be used to indicate the dependence or independence between variables. If the two variables are independent, the mutual information between them is zero. If the two are strongly dependent, for example, when one is a function of the other, the mutual information between them is large (Li 1990). The use of mutual information has become popular in several fields of science to measure the dependence between variables (Alfonso et al. 2010). For example, using the MI method, Harmancioglu and Yevjevich (1987) analyzed three types of information transfer among river points. Mutual information has also been used for network design (Alfonso et al. 2010). Some advantages of this method have been reported widely (e.g., Li 1990; Singh 2000; Steuer et al. 2002). The advantages of MI are that (1) it is a nonlinear measure of statistical dependence based on information theory (Steuer 2006); and (2) it is a nonparametric method and makes no assumptions about the functional form (Gaussian or non-Gaussian) of the statistical distribution that produced the data. In this paper, the entropy-based method was used for identification of inputs. However, there is a disadvantage when using MI to select the inputs of an ANN. Although a candidate model input might have a strong relationship with the model output, this information might be redundant if the same information is already provided by another input (Fernando et al. 2009). Bowden et al. (2005a) and Fernando et al. (2009) pointed out that input selection algorithms should cater to the redundancy in candidate model inputs.
In order to achieve this, Sharma (2000) proposed a partial mutual information (PMI) criterion as the basis for identifying more than one predictor in a stepwise manner. In a review of approaches used to select inputs for ANN models, Bowden et al. (2005a) concluded that the PMI algorithm of Sharma (2000) was superior to methods commonly used to determine inputs for ANN models, as it is model-free and uses a nonlinear measure of dependence (mutual information). In the PMI algorithm of Sharma (2000), nonparametric kernel methods were used to characterize the joint probability distribution of the variables involved. Fernando et al. (2009) modified the PMI input selection algorithm in order to increase its computational efficiency while maintaining its accuracy. They introduced the average shifted histograms (ASHs) as an alternative to kernel-based methods for the estimation of mutual information (MI). However, there are several disadvantages of the PMI algorithm. First, hydrological events, such as rainfall and runoff, are continuous, but the PMI methods use the discrete version to calculate PMI. Therefore, a method for continuous variables is needed. Second, these methods need estimates of both marginal and joint probability distributions. For a d-dimensional multivariate distribution, it is difficult to obtain the joint probability distributions. In order to overcome these problems, this paper proposes a new method, which calculates PMI values using the copula-entropy theory instead of calculating the marginal and joint probability distributions, respectively. The objective of this study, therefore, was to develop a method for input selection for an ANN model based on the copula-entropy theory.
The paper is organized as follows. Section “Background” presents the theoretical background, including a discussion and presentation of concepts and formulas of ANN, entropy, and copulas. Section “Copula-Entropy Theory” presents the copula-entropy theory and its properties. Section “Determination of Inputs of ANN Using the CE Theory” proposes a new method based on the copula-entropy theory to determine the inputs of an ANN model. In section “Evaluation of the Proposed Method”, two tests are carried out to assess the accuracy and performance of the proposed method. Section “Case Study” discusses a case study of the upper Yangtze River along with an analysis and discussion of results. Section “Conclusions” summarizes the conclusions of this study.

Background

Artificial Neural Networks

In statistical modeling, nonlinear dynamic processes are approximated by a regression model of the general form (May et al. 2008a)

$$y(t) = F[y(t-1), \ldots, y(t-p), Q(t-1), \ldots, Q(t-q)] \tag{1}$$

where $F$ is a function; $y$ is the model output predicted at time $t$; and $p$ and $q$ are the parameters denoting the model order or number of lags. The model inputs comprise past observations (or lags) of $y$ and $Q$. Function $F$ is unknown, and the ANN is used to determine the form of $F$ based on a set of representative data. The ANN architecture adopted in this case study was the general regression neural network (GRNN), which is a class of ANN that was first introduced by Specht (1991) as a neural network paradigm for kernel regression. Bowden et al. (2005a) summarized the advantages of the GRNN as: capability of nonlinear modeling between inputs and outputs, fixed network architecture, and quicker training than other ANNs. The GRNN paradigm is briefly outlined below, and details can be found in Specht (1991).

Assume that $f(\mathbf{x}, y)$ represents the joint continuous probability density function of a vector random variable, $\mathbf{X}$, and a scalar random variable, $Y$. Let $\mathbf{x}$ be a particular measured value of the vector random variable $\mathbf{X}$. The conditional mean of $y$ given $\mathbf{x}$, also called the regression of $y$ on $\mathbf{x}$, is given by (Specht 1991)

$$E[y \mid \mathbf{x}] = \frac{\int_{-\infty}^{\infty} y\, f(\mathbf{x}, y)\, dy}{\int_{-\infty}^{\infty} f(\mathbf{x}, y)\, dy} \tag{2}$$

When $f(\mathbf{x}, y)$ is not known, a sample of observations of $\mathbf{x}$ and $y$ is used to obtain an estimate $\hat{f}(\mathbf{x}, y)$. The GRNN provides an estimate of $E[y \mid \mathbf{x}]$, which is the conditional expectation of $y$ given $\mathbf{x}$.

Entropy Theory

The Shannon entropy (Shannon 1948) quantitatively measures the mean uncertainty associated with a probability distribution of a random variable and in turn with the random variable itself in concert with several consistency requirements (Kapur and Kesavan 1992). The entropy of a random variable (r.v.) $X$ can be expressed as

$$H(X) = -\int_0^{\infty} f(x) \log f(x)\, dx \tag{3}$$

where $f(x)$ is the probability density function of variable $X$. This study focuses on flood flow, so the range of the variable is from 0 to infinity; the domain can be extended to any real number. Eq. (3) defines the univariate continuous entropy or marginal entropy of $X$. The units of entropy are given by the base of the logarithm, being nats for base e and bits for base 2. The natural logarithm is used hereafter. For two random variables (r.v.'s) $X_1$ and $X_2$, the joint entropy can be expressed as

$$H(X_1, X_2) = -\int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log f(x_1, x_2)\, dx_1\, dx_2 \tag{4}$$

Let $X_1, X_2, \ldots, X_d$ denote the r.v.'s. The multidimensional joint entropy can be expressed as

$$H(X_1, X_2, \ldots, X_d) = -\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, x_2, \ldots, x_d) \log[f(x_1, x_2, \ldots, x_d)]\, dx_1\, dx_2 \cdots dx_d \tag{5}$$

The mutual information can be expressed as

$$T(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2) \tag{6}$$

Using Eqs. (3) and (4), Eq. (6) can be written as

$$
\begin{aligned}
T(X_1, X_2) &= -\int_0^{\infty} f(x_1) \log f(x_1)\, dx_1 - \int_0^{\infty} f(x_2) \log f(x_2)\, dx_2 + \int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log f(x_1, x_2)\, dx_1\, dx_2 \\
&= -\int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log f(x_1)\, dx_1\, dx_2 - \int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log f(x_2)\, dx_1\, dx_2 + \int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log f(x_1, x_2)\, dx_1\, dx_2 \\
&= \int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \left[-\log f(x_1) - \log f(x_2) + \log f(x_1, x_2)\right] dx_1\, dx_2 \\
&= \int_0^{\infty}\!\int_0^{\infty} f(x_1, x_2) \log \frac{f(x_1, x_2)}{f(x_1) f(x_2)}\, dx_1\, dx_2
\end{aligned} \tag{7}
$$

Copula Function

In application of univariate, bivariate, and multivariate entropy formulas, it is necessary to use univariate, bivariate, and multivariate distributions. The problem of specifying a probability model for dependent multivariate observations can be simplified by expressing the corresponding d-dimensional joint cumulative distribution using a copula function (Salvadori and De Michele 2010). Following Sklar (1959) and Nelsen (2006), if $F_{1,2,\ldots,d}(x_1, x_2, \ldots, x_d)$ is a multivariate distribution function of $d$ correlated random variables $X_1, X_2, \ldots, X_d$ with respective marginal distributions (or margins) $F_1(x_1), F_2(x_2), \ldots, F_d(x_d)$, then it is possible to write a d-dimensional cumulative distribution function (CDF) with univariate margins, $F_1(x_1), F_2(x_2), \ldots, F_d(x_d)$, as follows:

$$F(x_1, x_2, \ldots, x_d) = C[F_1(x_1), F_2(x_2), \ldots, F_d(x_d)] = C(u_1, \ldots, u_d) \tag{8}$$

where $F_k(x_k) = u_k$ for $k = 1, \ldots, d$, with $U_k \sim U(0, 1)$, and $C$ is a function called a copula. The copula function is capable of exhibiting the structure of dependence between two or more random variables, and has recently emerged as a practical and efficient method for modeling the general dependence in multivariate data (e.g., Joe 1997; Nelsen 2006). The advantages of using copulas to model joint distributions are manifold: (1) flexibility in choosing arbitrary marginals and structure of dependence; (2) extension to more than two variables; and (3) separate analysis of marginal distributions and dependence structure (Salvadori et al. 2007; Serinaldi et al. 2009). Hydrological applications of copulas have surged in recent years (e.g., Wang et al. 2010). For example, they have been used for rainfall frequency analysis (De Michele and Salvadori 2003; Grimaldi and Serinaldi 2006; Kao and Govindaraju 2007; Zhang and Singh 2007; Kuhn et al. 2007), flood frequency analysis (Favre et al. 2004; Shiau et al. 2006; Zhang and Singh 2006; Renard and Lang 2007; Chen et al. 2010), drought frequency analysis (Shiau 2006; Kao and Govindaraju 2010; Song and Singh 2010), rainfall and flood events analysis (Singh and Zhang 2007; Xiao et al. 2009; Wang et al. 2010; Chen et al. 2012), sea storm analysis (De Michele et al. 2007), and some other theoretical analyses of multivariate extreme problems (Salvadori et al. 2007; Salvadori and De Michele 2010; Chebana and Ouarda 2011). Detailed theoretical background and description for the use of copulas can be found in Nelsen (2006) and Salvadori et al. (2007).

Different families of copulas have been proposed and described by Nelsen (2006) and Salvadori et al. (2007). Of all the copula families, the Archimedean family is more desirable for hydrological analyses, because it can be more easily constructed and can be applied whether the correlation among the hydrological variables is positive or negative (Zhang and Singh 2006). The bivariate Archimedean copula has the simple algebraic form

$$C(u_1, u_2) = \phi^{[-1]}[\phi(u_1) + \phi(u_2)], \quad u_1, u_2 \in I \tag{9}$$

where $\phi$ is a specific function known as a generator of $C$. A large variety of copulas belong to this family. Three one-parameter Archimedean copulas, including the Gumbel, Frank, and Clayton copulas, have been widely applied in frequency analysis (Favre et al. 2004; Zhang and Singh 2006). Therefore, these copulas were used in this study, the forms of which are listed in Table 1.

Table 1. Archimedean Copulas
Family | Equation ($i = 1, 2, \ldots, d$) | Domain
Gumbel | $\exp\{-[\sum_{i=1}^{d} (-\ln u_i)^{\theta}]^{1/\theta}\}$ | $\theta \in [1, \infty)$
Clayton | $(\sum_{i=1}^{d} u_i^{-\theta} - 1)^{-1/\theta}$ | $\theta \in (0, \infty)$
Frank | $-\frac{1}{\theta} \log\left[1 - \frac{\prod_{i=1}^{d} (1 - e^{-\theta u_i})}{1 - e^{-\theta}}\right]$ | $\theta \in R$

Copula-Entropy Theory

Based on the concepts of entropy and copula functions, the copula-entropy (CE) theory is introduced in this section. The CE is the entropy of the copula function.

Definition of Copula Entropy

Let $x \in R^d$ be random variables with marginal functions $F_i(x)$; $U_i = F_i(x)$, $i = 1, 2, \ldots, d$. Then, $U_i$ are uniformly distributed random variables, and $u_i$ denotes a specific value of $U_i$. The entropy of the copula function is defined as the CE, which can be expressed as

$$H_C(U_1, U_2, \ldots, U_d) = -\int_0^1\!\cdots\!\int_0^1 c(u_1, u_2, \ldots, u_d) \log[c(u_1, u_2, \ldots, u_d)]\, du_1 \cdots du_d \tag{10}$$

where $c(u_1, u_2, \ldots, u_d)$ is the probability density function of the copula, expressed as $\partial^d C(u_1, u_2, \ldots, u_d) / (\partial u_1\, \partial u_2 \cdots \partial u_d)$.
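The generator construction of Eq. (9) and the density definition used in Eq. (10) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the bivariate case is used, the generator pairs for the Gumbel and Clayton copulas are the standard textbook ones (the paper lists only the resulting copula forms in Table 1), the density is approximated by a finite difference rather than derived analytically, and all function names are hypothetical.

```python
import math


def archimedean_copula(u, v, phi, phi_inv):
    """Bivariate Archimedean copula C(u, v) = phi^{-1}[phi(u) + phi(v)], as in Eq. (9)."""
    return phi_inv(phi(u) + phi(v))


def gumbel_generator(theta):
    """Assumed generator pair for the Gumbel copula (theta >= 1)."""
    return (lambda t: (-math.log(t)) ** theta,
            lambda s: math.exp(-(s ** (1.0 / theta))))


def clayton_generator(theta):
    """Assumed generator pair for the Clayton copula (theta > 0)."""
    return (lambda t: (t ** (-theta) - 1.0) / theta,
            lambda s: (1.0 + theta * s) ** (-1.0 / theta))


def gumbel_closed_form(u, v, theta):
    """Bivariate Gumbel copula written directly in the Table 1 form."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def density_by_finite_difference(copula, u, v, h=1e-4):
    """Approximate c(u, v) = d^2 C / (du dv) with a central mixed difference."""
    return (copula(u + h, v + h) - copula(u + h, v - h)
            - copula(u - h, v + h) + copula(u - h, v - h)) / (4.0 * h * h)


phi, phi_inv = gumbel_generator(1.5)
gumbel = lambda u, v: archimedean_copula(u, v, phi, phi_inv)

# The generator route and the closed form should agree.
print(gumbel(0.3, 0.7), gumbel_closed_form(0.3, 0.7, 1.5))
# Numerical copula density at one point of the unit square.
print(density_by_finite_difference(gumbel, 0.3, 0.7))
```

With theta = 1 the Gumbel copula reduces to independence, so the numerical density should be close to 1 everywhere, which gives a quick sanity check on the construction.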

Relationship between CE and MI

The purpose of this section is to find a relationship between CE and MI. The joint probability density function of the vector random variable $X$ can be defined as (Grimaldi and Serinaldi 2006)

$$f(x_1, x_2, \ldots, x_d) = c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \tag{11}$$

Based on Eqs. (5) and (11), the joint entropy can be expressed as

$$
\begin{aligned}
H(X_1, X_2, \ldots, X_d) &= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d) \log[f(x_1, \ldots, x_d)]\, dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i)\, \log\left[c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i)\right] dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \left\{\log[c(u_1, \ldots, u_d)] + \sum_{i=1}^{d} \log[f(x_i)]\right\} dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \cdot \log[c(u_1, \ldots, u_d)]\, dx_1 \cdots dx_d \\
&\quad - \int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \cdot \sum_{i=1}^{d} \log[f(x_i)]\, dx_1 \cdots dx_d = A + B
\end{aligned} \tag{12}
$$

where $A$ denotes the term containing the marginal densities and $B$ the term containing the copula density:

$$
\begin{aligned}
A &= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \cdot \sum_{i=1}^{d} \log[f(x_i)]\, dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d) \cdot \sum_{i=1}^{d} \log[f(x_i)]\, dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d) \cdot \{\log[f(x_1)] + \cdots + \log[f(x_d)]\}\, dx_1 \cdots dx_d \\
&= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d) \log[f(x_1)]\, dx_1 \cdots dx_d - \cdots - \int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d) \log[f(x_d)]\, dx_1 \cdots dx_d \\
&= -\int_0^{\infty} \log[f(x_1)] \left[\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d)\, dx_2 \cdots dx_d\right] dx_1 - \cdots - \int_0^{\infty} \log[f(x_d)] \left[\int_0^{\infty}\!\cdots\!\int_0^{\infty} f(x_1, \ldots, x_d)\, dx_1 \cdots dx_{d-1}\right] dx_d \\
&= -\sum_{i=1}^{d} \int_0^{\infty} f(x_i) \log[f(x_i)]\, dx_i = \sum_{i=1}^{d} H(X_i)
\end{aligned} \tag{13}
$$

Noting that $du_i = f(x_i)\, dx_i$,

$$
\begin{aligned}
B &= -\int_0^{\infty}\!\cdots\!\int_0^{\infty} c(u_1, \ldots, u_d) \prod_{i=1}^{d} f(x_i) \cdot \log[c(u_1, \ldots, u_d)]\, dx_1 \cdots dx_d \\
&= -\int_0^{1}\!\cdots\!\int_0^{1} c(u_1, \ldots, u_d) \cdot \log[c(u_1, \ldots, u_d)]\, du_1 \cdots du_d = H_C(U_1, U_2, \ldots, U_d)
\end{aligned} \tag{14}
$$

Therefore, the joint entropy can be expressed as the sum of the $d$ univariate marginal entropies and the CE as follows:

$$H(X_1, X_2, \ldots, X_d) = \sum_{i=1}^{d} H(X_i) + H_C(U_1, U_2, \ldots, U_d) \tag{15}$$

Eq. (15) indicates that the joint entropy $H(X_1, X_2, \ldots, X_d)$ is divided into two parts: the sum of the $d$ marginal entropies $H(X_i)$ and the CE $H_C(U_1, U_2, \ldots, U_d)$. For $d = 2$,

$$H(X_1, X_2) = H(X_1) + H(X_2) + H_C(U_1, U_2) \tag{16}$$

From Eq. (6),

$$T(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2) = -H_C(U_1, U_2) \tag{17}$$

From Eq. (17), it can be seen that the mutual information is actually the negative CE. It is well known that the mutual information can measure both linear and nonlinear dependencies. Therefore, the CE can be used to estimate linear and nonlinear dependencies.
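The decomposition in Eqs. (16) and (17) can be checked on a case where every entropy has a closed form. The sketch below assumes a bivariate Gaussian pair defined on the whole real line (the text notes that the domain can be extended beyond flood flows); the closed-form Gaussian entropies are standard results and are not taken from the paper, and the function names are hypothetical.

```python
import math


def gaussian_entropy(sigma):
    """Marginal entropy (nats) of a normal variable with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def bivariate_gaussian_entropy(sigma1, sigma2, rho):
    """Joint entropy (nats) of a bivariate normal with correlation rho."""
    det = (sigma1 * sigma2) ** 2 * (1.0 - rho ** 2)
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det)


def mutual_information(sigma1, sigma2, rho):
    """T(X1, X2) = H(X1) + H(X2) - H(X1, X2), following Eq. (6)."""
    return (gaussian_entropy(sigma1) + gaussian_entropy(sigma2)
            - bivariate_gaussian_entropy(sigma1, sigma2, rho))


def copula_entropy(sigma1, sigma2, rho):
    """H_C from Eq. (16): joint entropy minus the two marginal entropies."""
    return (bivariate_gaussian_entropy(sigma1, sigma2, rho)
            - gaussian_entropy(sigma1) - gaussian_entropy(sigma2))


# The marginal scales cancel: MI and CE depend only on the dependence (rho).
for s1, s2 in [(1.0, 1.0), (2.0, 0.5)]:
    print(mutual_information(s1, s2, 0.6), -copula_entropy(s1, s2, 0.6))
```

Both rows print the same pair of numbers, which illustrates that the CE carries the dependence information alone, independent of the margins.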

Calculation of Copula Entropy

Two methods are proposed to calculate the CE: the multiple integration method and the Monte Carlo method. The FORTRAN programming language was used for the calculations.

Multiple Integration Method

From Eq. (10), the CE can be derived using the multiple integration method. First, the parameters of the copula function need to be estimated, and then the copula probability density function can be determined. The multiple integration method proposed by Berntsen et al. (1991) was applied to calculate the multiple integrals. In order to test this multiple integration method, the copula probability density function was used as an integrand; the result of the integration should be 1.

Monte Carlo Method

For more variables, it may be difficult to calculate multiple integrals, and the Monte Carlo method can be used to calculate the CE instead. For a multivariate vector with support in $[0, 1]^d$, the CE can be obtained by

$$H_C(u_1, u_2, \ldots, u_d) = -\int_{[0,1]^d} c(\mathbf{u}) \ln c(\mathbf{u})\, d\mathbf{u} = -E[\ln c(\mathbf{U})] \tag{18}$$

The CE equals the negative of the expected value of $\ln[c(\mathbf{U})]$, which can be estimated by the Monte Carlo method. As with the multiple integration method, the dependence structure and parameters of the copula function need to be determined first. Then $M$ samples of $\mathbf{u}$ are generated from the determined copula function, and the average value of $\ln[c(\mathbf{u})]$ is calculated.

An example of calculating the CE is given as follows, in which the CE of two variables $X$ and $Y$ is calculated. The Gumbel copula, with a parameter of 1.5, was used to establish the joint distribution of $X$ and $Y$:

$$H(x, y) = C[u, v] = \exp\{-[(-\ln u)^{1.5} + (-\ln v)^{1.5}]^{1/1.5}\} \tag{19}$$

First, the multiple integration method was used. The integrand was formed from the density of the copula in Eq. (19), following Eq. (10), and the method of Berntsen et al. (1991) was applied to calculate the multiple integral. The resulting CE of variables $X$ and $Y$ is -0.166. Second, the Monte Carlo method was used. According to Eq. (18), the CE equals the negative of the expected value of $\ln[c(\mathbf{U})]$. With the copula function established as in Eq. (19), 10,000 pairs of $u$ and $v$ were generated from it, and the average value of $\ln[c(u, v)]$ was calculated. The resulting CE is -0.153.
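A minimal Python sketch of the Monte Carlo estimate in Eq. (18) for the Gumbel copula of Eq. (19) is given here. It is not the authors' FORTRAN code. The sampler is an assumption: it uses the Marshall-Olkin frailty construction with a positive stable variate generated by Kanter's formula, since the paper does not say how samples were drawn. The closed-form bivariate Gumbel log-density is the standard one, and all function names are hypothetical.

```python
import math
import random


def gumbel_log_density(u, v, theta):
    """ln c(u, v) for the bivariate Gumbel copula (standard closed form)."""
    x, y = -math.log(u), -math.log(v)
    s = x ** theta + y ** theta
    w = s ** (1.0 / theta)
    return (-w + x + y
            + (theta - 1.0) * (math.log(x) + math.log(y))
            + (1.0 / theta - 2.0) * math.log(s)
            + math.log(w + theta - 1.0))


def positive_stable(alpha, rng):
    """Positive stable variate with Laplace transform exp(-s**alpha), 0 < alpha <= 1."""
    t = math.pi * (1.0 - rng.random())  # uniform on (0, pi]
    w = rng.expovariate(1.0)
    return (math.sin(alpha * t) / math.sin(t) ** (1.0 / alpha)
            * (math.sin((1.0 - alpha) * t) / w) ** ((1.0 - alpha) / alpha))


def sample_gumbel(theta, rng):
    """One (u, v) pair from the Gumbel copula via the frailty construction."""
    frailty = positive_stable(1.0 / theta, rng)
    e1, e2 = rng.expovariate(1.0), rng.expovariate(1.0)
    return (math.exp(-((e1 / frailty) ** (1.0 / theta))),
            math.exp(-((e2 / frailty) ** (1.0 / theta))))


def copula_entropy_mc(theta, n, rng):
    """Monte Carlo estimate of H_C = -E[ln c(U)], Eq. (18)."""
    total = 0.0
    for _ in range(n):
        u, v = sample_gumbel(theta, rng)
        total += gumbel_log_density(u, v, theta)
    return -total / n


estimate = copula_entropy_mc(1.5, 10000, random.Random(0))
print(estimate)  # negative; the paper reports about -0.15 to -0.17 for theta = 1.5
```

The estimate varies with the seed and the sample size, so it should be read alongside the two values reported in the text rather than as a replacement for them.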

Determination of Inputs of ANN Using the CE Theory

In this section, the new method for input identification is introduced. First, the relation between PMI and CE is discussed, and the theory of CE is applied for input identification. Second, according to the calculated value of CE, a reliable and efficient criterion to decide when to stop the addition of new inputs to the list of selected inputs is developed. Third, a detailed procedure of the proposed method is given.

Application of CE to Input Identification

In the following, the definition of partial mutual information (PMI) is introduced first, and then the relation between CE and PMI is discussed. MI, which is equal to the negative CE, can be used to identify the nonlinear dependence between candidate input and output variables (Fernando et al. 2009). However, the MI method is not directly able to deal with the issue of redundant inputs (Bowden et al. 2005a). To overcome this problem, Sharma (2000) introduced the concept of PMI, which represents the information between two observations that is not contained in a third one and provides a measure of the partial or additional dependence that the new input can add to the existing prediction model (Bowden et al. 2005a). The PMI between the output (dependent variable) $y$ and the input (independent variable) $x$, for a set of preexisting inputs $z$, can be given by (Bowden et al. 2005a)

$$\mathrm{PMI} = \int\!\!\int f_{X', Y'}(x', y') \ln \frac{f_{X', Y'}(x', y')}{f_{X'}(x') f_{Y'}(y')}\, dx'\, dy' \tag{20}$$

where $x' = x - E[x \mid z]$; and $y' = y - E[y \mid z]$, where $E$ denotes the expectation operation. Variables $x'$ and $y'$ only contain the residual information in variables $x$ and $y$ after considering the effect of the already selected input $z$ (Fernando et al. 2009), and can be calculated based on the GRNN model. MATLAB was used to implement the GRNN modeling. From Eqs. (15) and (17), PMI can also be described using the CE:

$$\mathrm{PMI} = \int\!\!\int f_{X', Y'}(x', y') \ln \frac{f_{X', Y'}(x', y')}{f_{X'}(x') f_{Y'}(y')}\, dx'\, dy' = -H_C(X', Y') \tag{21}$$
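The residuals $x'$ and $y'$ entering Eqs. (20) and (21) might be obtained as in the following sketch. It assumes a single conditioning input z and the Gaussian-kernel weighted average that Specht (1991) gives for the GRNN estimate of Eq. (2). The smoothing parameter `sigma`, the function names, and the synthetic data are illustrative assumptions and do not reproduce the paper's MATLAB implementation.

```python
import math
import random


def grnn_predict(z_train, t_train, z_query, sigma):
    """Kernel-weighted average of t_train: GRNN estimate of E[t | z = z_query]."""
    num = 0.0
    den = 0.0
    for z, t in zip(z_train, t_train):
        weight = math.exp(-((z_query - z) ** 2) / (2.0 * sigma ** 2))
        num += weight * t
        den += weight
    return num / den


def residuals(z, t, sigma):
    """t' = t - E[t | z], evaluated at every sample point."""
    return [ti - grnn_predict(z, t, zi, sigma) for zi, ti in zip(z, t)]


# Hypothetical data: the candidate x depends on the already selected input z.
rng = random.Random(1)
z = [rng.uniform(-2.0, 2.0) for _ in range(200)]
x = [math.sin(zi) + 0.1 * rng.gauss(0.0, 1.0) for zi in z]

x_res = residuals(z, x, sigma=0.3)
mean_abs = lambda values: sum(abs(v) for v in values) / len(values)
# Most of the variation explained by z is removed from the residual series.
print(mean_abs(x), mean_abs(x_res))
```

In the full procedure the same step would be applied to the output series, and the two residual series would then be passed to the copula-entropy calculation; that part is not shown here.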

Eq. (21) shows that PMI is equal to the negative CE of variables of X 0 and Y 0 . Therefore, the CE can be used to determine the inputs instead of the MI and PMI method. Termination Criterion The CE algorithm requires a reliable and efficient criterion to decide when to stop the addition of new inputs to the list of selected inputs (Fernando et al. 2009). There are several methods that can be used to achieve this. Sharma (2000) and Bowden et al. (2005a, b) used the 95th percentile confidence limit for the sample PMI to decide whether the PMI of a candidate input is significantly different from zero and should therefore be added to the set of already selected inputs. May et al. (2008b) and Fernando et al. (2009) indicated that this method, which used the bootstrap with 100 bootstraps to estimate the 95th percentile confidence limit, places a significant computational burden on the algorithm. And a bootstrap so small might not provide reliable estimation of the confidence bound, which could lead to unreliable and/or suboptimal input variable selection (May et al. 2008b; Fernando et al. 2009). Fernando et al. (2009) introduced the Hampel identifier that is proposed by Davies and Gather (1993) as a termination criterion. Using this method, May et al. (2008b) and Fernando et al. (2009) recommended a termination algorithm. The details of the Hampel test criterion are described in the following. The Hampel identifier is an outlier detection method for determining whether a given value x is significantly different from others within a set of values X. Assume that a set of candidates will initially contain some proportion of redundant variables, and the significant variable will be detected. The Hampel distance begins by calculating the absolute deviation from the median negative CE for all candidates and defined as (Fernando et al. 2009; May et al. 2008b) dj ¼ jNH − NH ð50Þ j

ð22Þ

where dj is the absolute deviation; NH represents the negative CE values; and NHð50Þ denotes the median NH for the candidate set. Then the Hampel distance is calculated by (Fernando et al. 2009; May et al. 2008b)

04014021-5


J. Hydrol. Eng.

Zj ¼

dj

the exact value of MI and the MI estimates obtained using the copula-entropy method are therefore given by

ð23Þ

ð50Þ 1.4826dj

E ¼ T Gauss ðX; YÞ–TðX; YÞ

ð25aÞ

R ¼ ½I Gauss ðX; YÞ − IðX; YÞ=I Gauss ðX; YÞ

ð25bÞ

ð50Þ

denotes the where dj denotes the Hampel distance; and dj median absolute deviation dj . If the Hampel distance, Zj , is greater than 3, namely Zj > 3, then the candidates are added to the selected input set.

where E represents the absolute error, and R the relative error. Assuming that the Pearson correlation coefficient ρ ranges from –0.9 to 0.9 with a step size of 0.1, the exact and estimated MI values were calculated by Eq. (24) and the proposed methods, respectively. Multiple integration and Monte Carlo methods were, respectively, used for calculating the CE. For the first method, the multiple integration method proposed by Berntsen et al. (1991) was applied. For the second method, 10,000 pairs of u were generated, and average values of ln½cðuÞ were calculated. The absolute and relative errors were calculated and are listed in Table 2, and results of calculation are shown in Fig. 1. It is indicated that the MI values calculated by the three methods are very close, and the values calculated by the multiple integration method are more accurate than that by the Monte Carlo method. Therefore, the proposed method is satisfactory, and the multiple integration method was used for calculations hereafter.


Procedures of Input Variable Selection A stepwise input selection algorithm is now formulated for determining the inputs of an ANN using the CE method described above. First, determine the set of variables that can be taken as potentially inputs of the ANN. This variable set was defined as the vector I in. Denote the vector that will store the final identified inputs as I. The algorithm is as follows: 1. Based on Eq. (21), use the copula-entropy method to calculate the PMI between the output and each of the potential new inputs in I in , conditional on the preexisting input set I. The conditional expectations are computed using the GRNN method (Bowden et al. 2005a). 2. Calculate the Hampel distance Zj corresponding to the PMI obtained in step (1). 3. If the Hampel distance for the highest PMI is greater than (3), then move the candidate to the selected input set I. 4. Repeat steps (1)–(3) until all significant inputs have been selected.

Evaluation of the Proposed Method

In order to assess the accuracy and performance of the proposed method, two tests were carried out. One test was based on Gaussian variables whose MI values were known beforehand. The other was based on a range of synthetically generated datasets whose dependence attributes were known beforehand.

Accuracy Test

The copula-entropy method was used to calculate the PMI values. Two calculation methods, namely the multiple integration method and the Monte Carlo method, were employed to obtain the PMI values. In order to test the accuracy of these two methods, the estimated MI values were compared with the theoretical ones. The theoretical MI value for the normal (Gaussian) copula is given as follows (Calsaverini and Vicente 2009):

$$T_{\mathrm{Gauss}}(X, Y) = -\frac{1}{2}\log(1 - \rho^{2}) \tag{24}$$

where ρ = Pearson linear correlation coefficient between Gaussian variables X and Y. Assuming that the Pearson correlation coefficient ρ ranges from −0.9 to 0.9 with a step of 0.1, the MI values were calculated according to Eq. (24) and the two proposed methods, respectively. The errors between the estimates and the theoretical MI values are listed in Table 2.

Function Test

Before applying the proposed method to a real-world case study, it was necessary to carry out a statistical test based on generated synthetic data. Bowden et al. (2005a), May et al. (2008b), and Fernando et al. (2009) used several models for testing, four of which were applied in this study. These included three time series models and one nonlinear system. The four models are given as follows:

1. AR1

$$x_t = 0.9x_{t-1} + 0.866e_t \tag{26}$$

where e_t is Gaussian random noise with zero mean and unit standard deviation (for both the AR1 and AR9 models); x_t is the time series; and 1 denotes the number of lags.

2. AR9

$$x_t = 0.3x_{t-1} - 0.6x_{t-4} - 0.5x_{t-9} + e_t \tag{27}$$

where 1, 4, and 9 represent the number of lags.

3. TAR2 (threshold autoregressive, order 2)

$$x_t = \begin{cases} -0.5x_{t-6} + 0.5x_{t-10} + 0.1e_t & \text{if } x_{t-6} \le 0 \\ 0.8x_{t-10} + 0.1e_t & \text{if } x_{t-6} > 0 \end{cases} \tag{28}$$

4. ADD15

$$f(x_1, \ldots, x_{15}) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon \tag{29}$$
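The four test models can be simulated as in the following sketch, which may help when reproducing the function test. It uses the 1,020 generated points with the first 20 discarded, as stated in the text; the random seeds, the zero initial values, the function names, and the choice of the interval [0, 1] for the uniform inputs of ADD15 are assumptions made only for illustration.

```python
import numpy as np


def simulate_ar(coeffs, noise_scale, n=1020, burn_in=20, seed=0):
    """Simulate x_t = sum_k a_k * x_{t-k} + noise_scale * e_t; coeffs maps lag -> a_k."""
    rng = np.random.default_rng(seed)
    max_lag = max(coeffs)
    x = np.zeros(n + max_lag)                 # zero initial values (assumption)
    e = rng.standard_normal(n + max_lag)
    for t in range(max_lag, n + max_lag):
        x[t] = sum(a * x[t - lag] for lag, a in coeffs.items()) + noise_scale * e[t]
    return x[max_lag + burn_in:]              # drop start-up values and burn-in


def simulate_tar2(n=1020, burn_in=20, seed=0):
    """Threshold autoregressive model of Eq. (28)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n + 10)
    e = rng.standard_normal(n + 10)
    for t in range(10, n + 10):
        if x[t - 6] <= 0:
            x[t] = -0.5 * x[t - 6] + 0.5 * x[t - 10] + 0.1 * e[t]
        else:
            x[t] = 0.8 * x[t - 10] + 0.1 * e[t]
    return x[10 + burn_in:]


def simulate_add15(n=1000, seed=0):
    """ADD15 of Eq. (29): only x1..x5 drive the output; x6..x15 are irrelevant."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 15))   # interval [0, 1] is an assumption
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.standard_normal(n))
    return X, y


ar1 = simulate_ar({1: 0.9}, noise_scale=0.866)             # Eq. (26)
ar9 = simulate_ar({1: 0.3, 4: -0.6, 9: -0.5}, 1.0)         # Eq. (27)
tar2 = simulate_tar2()
X_add15, y_add15 = simulate_add15()
```

Because the true inputs of each model are known, the lags (or variables) returned by an input selection method can be checked directly against the model equations.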

Table 2. Absolute and Relative Errors between the Estimates and Theoretical MI Values

ρ       EI       RI (%)    EM        RM (%)
−0.9    0.000    −0.01     −0.009    −1.10
−0.8    0.008     1.59     −0.004    −0.88
−0.7    0.009     2.70     −0.001    −0.18
−0.6    0.000     0.00      0.002     0.67
−0.5    0.000     0.00      0.013     9.18
−0.4    0.000     0.00      0.002     1.83
−0.3    0.000     0.00      0.000    −0.42
−0.2    0.000     0.00      0.002    10.29
−0.1    0.000     0.00     −0.001   −10.00
 0      0.000     -         0.000     -
 0.1    0.000     0.00      0.000    −6.00
 0.2    0.000     0.00      0.002     7.35
 0.3    0.000     0.00      0.002     5.08
 0.4    0.000     0.00      0.002     2.18
 0.5    0.000     0.00     −0.003    −2.43
 0.6    0.000     0.00     −0.002    −0.94
 0.7    0.009     2.70      0.009     2.70
 0.8    0.008     1.59      0.008     1.59
 0.9    0.000    −0.01      0.000    −0.01

Note: EI and RI are the absolute and relative errors obtained using the integration method; EM and RM are those obtained using the Monte Carlo method.
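The idea behind the accuracy test can be illustrated with a short Monte Carlo check. The sketch below is not the estimator used in the paper: it assumes the closed-form Gaussian copula density is known and simply averages its logarithm over simulated pairs, which equals the negative copula entropy and should therefore approach the theoretical value of Eq. (24). The sample size, seed, and function names are illustrative assumptions.

```python
import numpy as np


def gaussian_mi_theory(rho):
    """Theoretical MI of a bivariate Gaussian, Eq. (24)."""
    return -0.5 * np.log(1.0 - rho ** 2)


def gaussian_copula_log_density(a, b, rho):
    """log c(u, v) of the Gaussian copula, with a = Phi^-1(u) and b = Phi^-1(v)."""
    return (-0.5 * np.log(1.0 - rho ** 2)
            - (rho ** 2 * (a ** 2 + b ** 2) - 2.0 * rho * a * b) / (2.0 * (1.0 - rho ** 2)))


def gaussian_mi_monte_carlo(rho, n=200_000, seed=0):
    """MI as the sample mean of log c(u, v), i.e., the negative copula entropy."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n)
    b = rho * a + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)  # corr(a, b) = rho
    return float(np.mean(gaussian_copula_log_density(a, b, rho)))


for rho in np.linspace(-0.9, 0.9, 19):
    theory = gaussian_mi_theory(rho)
    estimate = gaussian_mi_monte_carlo(rho)
    print(f"rho={rho:+.1f}  theory={theory:.3f}  estimate={estimate:.3f}  "
          f"abs_error={estimate - theory:+.3f}")
```

The absolute error printed in the last column corresponds to the EM-type quantity in Table 2, although the numbers will differ because the sampling scheme and sample size here are not those of the study.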


Fig. 1. Comparison of estimated and theoretical MI values, calculated by the copula entropy method and Eq. (24), respectively

where ε is Gaussian noise with zero mean and unit variance; and x_1, x_2, x_3, x_4, and x_5 can be generated from a uniform distribution.

A total of 1,020 data points were generated from each of the above synthetic models, with the first 20 points being discarded to reduce the effect of an arbitrary initialization (Bowden et al. 2005a). For these models, the first 15 lags were chosen as potential model inputs. The GRNN model with two hidden layers was used to calculate E[x|z] and E[y|z] in Eq. (20). A trial-and-error method was employed to determine a suitable number of hidden layer nodes each time. The final input subset was obtained based on the improved PMI method. The test results for each of these models are shown in Tables 3–6.

Take the AR9 model as an example. For the first iteration, the MI value was calculated. The highest MI value occurs at lag 4, which is 0.239 with a Z-value of 4.43. Because the Z-value is greater than 3, lag 4 was selected. The highest PMI values for the second, third, and fourth iterations occur at lags 9, 1, and 8, with Z-values of 7.00, 5.86, and 1.99, respectively. Therefore, lags 9 and 1 were selected, and lag 8 was discarded. The final input sets for these test functions are given in Table 7. It is seen from Tables 3–7 that the proposed method is rational and can be applied to both time series and nonlinear models. The proposed method is capable of choosing inputs in their correct order of significance.

Table 3. Test Results Based on Generated Data for AR1 Model

First iteration          Second iteration
Lags  MI     Z-value     Lags  PMI    Z-value
1     0.935  4.47        2     0.001  0.64
2     0.594  2.43        3     0.003  0.49
3     0.430  1.45        4     0.004  0.71
4     0.329  0.84        5     0.006  1.76
5     0.277  0.53        6     0.004  1.00
6     0.248  0.36        7     0.002  0.42
7     0.213  0.15        8     0.003  0.44
8     0.188  0.00        9     0.000  1.00
9     0.155  0.20        10    0.000  1.18
10    0.128  0.36        11    0.000  1.21
11    0.110  0.47        12    0.002  0.22
12    0.075  0.67        13    0.003  0.20
13    0.064  0.74        14    0.002  0.20
14    0.048  0.83        15    0.006  1.74
15    0.031  0.94        -     -      -

Case Study

Data

To further test the utility of the proposed method, it was applied to the case study of flood forecasting in the Yangtze River, China, which is the longest river in Asia and the third longest river in the world. It is about 6,300 km long and flows from its source in Qinghai Province eastward to the East China Sea at Shanghai. This study mainly focused on the upper Yangtze River, which is 4,529 km long, about 3/4 of the whole length of the Yangtze River, with a drainage area of 1,006,000 km², accounting for 55.6% of the watershed area.

Floods of the Yangtze River (Chang Jiang) in central and eastern China have occurred periodically and have often caused considerable destruction of property and loss of life. Among the major flood events are those of 1870, 1931, 1954, 1998, and 2010. For example, in 1998, the entire Yangtze River basin suffered from tremendous flooding, the largest flood since 1954, which led to an economic loss of 166 billion Chinese Yuan (Yin and Li 2001). Hence, flood forecasting of the upper Yangtze River is very important for flood prevention and disaster relief.

For the Yangtze River, floods are caused by unusually high precipitation between June and August. Summer is the main flooding season due to the heavy monsoon rainfall. The mean annual precipitation of the Jinsha, Min and Tuo, Jialing, and Wu rivers is 736, 1,083, 965, and 1,163 mm, respectively. Temporal and spatial distributions of rainfall are closely related to monsoon activities and the seasonal motion of subtropical highs. Floods in the middle and lower reaches of the Yangtze River mainly stem from the region upstream of the Yichang Station, which is also the control station for the Three Gorges Reservoir. The Yangtze River from the Zhicheng to the Chenglingji gauging station is named the Jingjiang River reach, which is regarded as the key area for flood prevention. Usually the flood volume from upstream of Yichang Station accounts for about 90% of that of the Jingjiang River reach and about 50% of the total flow volume of the Yangtze River.

The upper Yangtze River comprises a complex of tributaries, principally the Yalong, Min, Tuo, and Jialing Rivers on the left bank and the Wu River on the right bank. A schematic of the regional main tributary rivers and gauging stations is shown in Fig. 2. The Yalong River joins the Jinsha River, which is also recognized as part of the Yangtze River. Therefore, the Jinsha River, instead of the Yalong River, was considered in this study. A total of six gauging stations were taken into account. From upstream to downstream, they are Pingshan, Gaochang, Lijiawan, Beibei, Wulong, and Yichang, each with concurrent mean daily flow data for 1998–2007. The flow of each gauging station was taken as a variable. The past values of the Pingshan, Gaochang, Lijiawan, Beibei, Wulong, and Yichang Stations were taken as potential input candidate variables, and the runoff of Yichang Station at time t was taken as the output variable. The data used at the Yichang gauging station are naturalized; that is, the storage effects of the Three Gorges Reservoir (TGR) were removed. These data are used to represent the inflow to the Three Gorges Reservoir, which cannot be measured directly. Therefore, the proposed flood forecasting model aims to predict the inflow to the Three Gorges Reservoir. The CE algorithm with the Hampel distance outlier detection approach as the termination criterion was used to identify the significant inputs.

Selection of Input Variables for ANN Model

Bowden et al. (2005b) proposed a two-stage procedure for input selection. The same method was used in this study. The first step


Table 4. Test Results Based on Generated Data for AR9 Model

First iteration        Second iteration       Third iteration        Fourth iteration
Lags  MI     Z-value   Lags  PMI    Z-value   Lags  PMI    Z-value   Lags  PMI    Z-value
1     0.081  1.03      1     0.051  1.37      1     0.142  5.86      2     0.000  0.72
2     0.002  0.67      2     0.006  0.50      2     0.019  0.15      3     0.000  0.66
3     0.037  0.09      3     0.000  0.73      3     0.003  0.60      5     0.001  0.15
4     0.239  4.43      5     0.013  0.20      5     0.070  2.49      6     0.001  0.02
5     0.011  0.47      6     0.006  0.50      6     0.002  0.67      7     0.002  0.82
6     0.002  0.67      7     0.003  0.62      7     0.004  0.55      8     0.004  1.99
7     0.007  0.55      8     0.058  1.67      8     0.016  0.00      10    0.000  0.69
8     0.030  0.06      9     0.188  7.00      10    0.039  1.05      11    0.001  0.02
9     0.090  1.22      10    0.093  3.10      11    0.010  0.27      12    0.004  1.74
10    0.068  0.75      11    0.014  0.15      12    0.001  0.72      13    0.002  0.46
11    0.018  0.32      12    0.010  0.32      13    0.002  0.68      14    0.002  0.45
12    0.001  0.69      13    0.117  4.07      14    0.059  1.97      15    0.000  0.74
13    0.200  3.60      14    0.078  2.49      15    0.031  0.67      -     -      -
14    0.126  2.00      15    0.021  0.15      -     -      -         -     -      -
15    0.033  0.00      -     -      -         -     -      -         -     -      -

Table 5. Test Results Based on Generated Data for TAR2 Model

First iteration        Second iteration       Third iteration
Lags  PMI    Z-value   Lags  PMI    Z-value   Lags  PMI    Z-value
1     0.006  0.72      1     0.000  0.72      1     0.000  0.53
2     0.014  0.70      2     0.003  0.70      2     0.000  0.69
3     0.006  0.63      3     0.003  0.63      3     0.003  0.87
4     0.021  0.37      4     0.001  0.37      4     0.003  1.34
5     0.004  0.67      5     0.000  0.67      5     0.001  0.00
6     0.041  15.94     6     0.029  15.94     7     0.000  0.67
7     0.004  0.79      7     0.000  0.79      8     0.003  1.17
8     0.012  0.31      8     0.001  0.31      9     0.001  0.16
9     0.009  0.11      9     0.001  0.11      11    0.001  0.00
10    0.359  0.11      11    0.002  0.11      12    0.000  0.60
11    0.009  0.69      12    0.003  0.69      13    0.003  1.03
12    0.018  0.68      13    0.003  0.68      14    0.004  1.90
13    0.011  0.92      14    0.003  0.92      15    0.001  0.23
14    0.012  0.34      15    0.001  0.34      -     -      -
15    0.005  0.72      -     -      -         -     -      -

is called the bivariate stage, which aims to determine the significant lags of each variable. The second step is called the multivariate stage, in which the significant lags selected in the previous step are combined to form a subset of candidates. Then, the final set of significant inputs can be obtained using the same PMI method as in step 1.

Details of the method are described as follows. If the number of candidate variables is d (i.e., x_1, x_2, ..., x_d) and the output variable is y_t, then their own past values (x_{i,t−1}, x_{i,t−2}, ..., x_{i,t−k}) and (y_{t−1}, y_{t−2}, ..., y_{t−k}) are potential inputs, where k refers to the maximum lag that has been included as a potential input. Bowden et al. (2005b) indicated that if prior knowledge about the relationship between the input and output time series is available, then k can be chosen such that the lags of the input variable that exceed k are not likely to have any significant effect on the output time series. Because 3-day and 7-day flood volumes are usually employed in flood analysis and a flood event lasts less than two weeks, a period of two weeks, which is double the period of 7 days, was considered in this study. Except for time t, the first 13 lags of each variable were used as candidate inputs.

First, the MI values between each of (x_{i,t−1}, x_{i,t−2}, ..., x_{i,t−k}) and y_t and between each of (y_{t−1}, y_{t−2}, ..., y_{t−k}) and y_t were calculated. The significant lags, which have the highest MI values, were selected. Then the PMI value and its Z-value were calculated in each iteration, and the candidate with the maximum PMI was selected if its Z-value was greater than 3. The final selected results of all the stations in step one are summarized in Table 8. During this stage, the original 78 inputs (6 stations with 13 lags each) were reduced to 12 inputs. The past runoff (y_{t−1}, y_{t−2}, ..., y_{t−k}) at Yichang Station has a great impact on y_t. Therefore, several values of past runoff at Yichang Station were selected. Only one input was
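The candidate set described above (13 lags for each of the six stations, 78 columns in total) can be assembled as in the following sketch. The dictionary layout, the column naming scheme, and the function name are assumptions made for illustration; the source does not specify how the data were organized.

```python
import numpy as np


def build_lagged_candidates(flows, target="Yichang", max_lag=13):
    """Build the lagged candidate matrix X and the aligned output vector y.

    flows: dict mapping station name -> 1-D array of concurrent daily flows.
    Row i of X holds the values at times (max_lag + i) - lag for every station
    and lag; y[i] is the target flow at time max_lag + i.
    """
    n = len(flows[target])
    names, columns = [], []
    for station, series in flows.items():
        series = np.asarray(series, dtype=float)
        for lag in range(1, max_lag + 1):
            names.append(f"{station}_t-{lag}")            # hypothetical naming scheme
            columns.append(series[max_lag - lag : n - lag])
    X = np.column_stack(columns)
    y = np.asarray(flows[target], dtype=float)[max_lag:]
    return names, X, y
```

In the bivariate stage each station's 13 columns would be screened separately against y; in the multivariate stage the surviving columns from all stations would be pooled and screened again.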

Table 6. Test Results Based on Generated Data for ADD15 Model

First iteration        Second iteration       Third iteration
Lags  PMI    Zj        Lags  PMI    Zj        Lags  PMI    Zj
1     0.001  0.00      1     0.001  0.01      1     0.002  0.23
2     0.001  0.15      2     0.001  0.26      2     0.007  2.67
3     0.190  172.88    4     0.516  493.78    5     0.255  135.51
4     0.132  119.90    5     0.144  137.10    6     0.004  1.28
5     0.051  45.66     6     0.001  0.28      7     0.002  0.00
6     0.001  0.05      7     0.001  0.20      8     0.002  0.23
7     0.001  0.22      8     0.001  0.01      9     0.000  0.82
8     0.000  0.86      9     0.002  0.95      10    0.000  0.79
9     0.000  0.67      10    0.000  0.94      11    0.000  0.67
10    0.000  0.83      11    0.002  0.72      12    0.003  0.51
11    0.002  0.54      12    0.000  0.81      13    0.000  0.86
12    0.000  0.42      13    0.001  0.40      14    0.001  0.43
13    0.000  0.71      14    0.006  4.48      15    0.001  0.48
14    0.005  3.37      15    0.000  0.63      -     -      -
15    0.001  0.29      -     -      -         -     -      -

Fourth iteration       Fifth iteration        Sixth iteration
Lags  PMI    Zj        Lags  PMI    Zj        Lags  PMI    Zj
1     0.007  8.81      2     0.002  3.56      6     0.000  0.03
2     0.000  0.53      6     0.000  0.00      7     0.000  0.69
6     0.001  0.23      7     0.000  0.67      8     0.000  1.01
7     0.000  0.91      8     0.000  0.21      9     0.000  0.03
8     0.000  0.54      9     0.000  0.56      10    0.001  0.80
9     0.000  0.79      10    0.000  1.11      11    0.001  0.09
10    0.000  0.23      11    0.001  2.50      12    0.000  0.61
11    0.002  2.25      12    0.000  0.57      13    0.000  0.66
12    0.001  0.76      13    0.000  0.44      14    0.002  2.37
13    0.002  1.51      14    0.001  1.76      15    0.001  1.58
14    0.000  0.58      15    0.001  0.74      -     -      -
15    0.001  0.59      -     -      -         -     -      -



Table 7. Final Selected Input Sets for the Four Models

Model   Final selected input sets
AR1     x_{t−1}
AR9     x_{t−4}, x_{t−9}, x_{t−1}
TAR2    x_{t−10}, x_{t−6}
ADD15   x_3, x_4, x_5, x_1, x_2

selected for each of Gaochang, Beibei, Lijiawan, and Wulong, and the selected lag times of these stations matched the flood travel times. The travel times from Pingshan, Gaochang, Lijiawan, Beibei, and Wulong to Yichang Station are about 3, 3, 3, 2, and 2 days, respectively. From this point of view, the method is adequate.

Second, the significant lags selected in step one were combined to form a subset of candidates. The PMI values were calculated based on the CE method. During this stage, the 12 inputs were reduced to 7. The final input set for the ANN consisted of X_{yc,t−1}, X_{yc,t−2}, X_{ps,t−1}, X_{gc,t−3}, X_{bb,t−2}, X_{wl,t−2}, and X_{ljw,t−3}, in which the subscripts yc, ps, gc, bb, wl, and ljw denote the Yichang, Pingshan, Gaochang, Beibei, Wulong, and Lijiawan gauging stations, respectively.

Flood Forecasting Based on the Selected Inputs

The selected variables based on the copula-entropy method were used as inputs of the ANN model. As mentioned above, 10 years (1998–2007) of data were used, 8 years of which were used for training the ANN model and 2 of which were used for model validation. Cross validation was conducted to evaluate the performance of the proposed model, which avoids the problem of arbitrarily dividing the data into calibration and validation sets. The GRNN method with two hidden layers was used to establish the ANN model. Because cross validation was used in the case study, it was not possible to use the same number of hidden nodes for each data set. A trial-and-error method was employed to determine a suitable number of hidden layer nodes each time.
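The forecasting setup can be sketched as follows. The GRNN of Specht (1991) predicts with a Gaussian-kernel weighted average of the training outputs, and the cross validation holds out consecutive two-year blocks, which matches the 8-year/2-year split described above. The smoothing parameter `sigma`, the function names, and the block construction are assumptions for illustration; they are not taken from the study, which tuned its network by trial and error.

```python
import numpy as np


def grnn_predict(X_train, y_train, X_query, sigma=1.0):
    """GRNN prediction: kernel-weighted mean of y_train (sigma is a tuning assumption)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Squared Euclidean distances between every query row and every training row.
    d2 = ((X_query ** 2).sum(axis=1)[:, None]
          + (X_train ** 2).sum(axis=1)[None, :]
          - 2.0 * X_query @ X_train.T)
    weights = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))   # pattern layer
    return (weights @ y_train) / np.maximum(weights.sum(axis=1), 1e-300)  # summation layer


def two_year_folds(years, block=2):
    """Yield (held_out_years, train_mask, valid_mask) for consecutive year blocks."""
    years = np.asarray(years)
    unique_years = np.unique(years)
    for start in range(0, len(unique_years), block):
        held_out = unique_years[start:start + block]
        valid_mask = np.isin(years, held_out)
        yield tuple(held_out.tolist()), ~valid_mask, valid_mask
```

In practice the inputs would usually be standardized before computing distances, and `sigma` would be chosen separately for each fold, in the same spirit as the per-fold trial-and-error tuning mentioned in the text.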

The performance of the hydrological forecasting models was assessed in accordance with the criteria specified by the Ministry of Water Resources of China (MWR 2006). These are the coefficient of efficiency (i.e., Nash–Sutcliffe efficiency, R²), which is a measure of the goodness-of-fit between the recorded and predicted discharge time series data, and the qualified rate (α) of predicted individual flood event peak discharge and volume (Li et al. 2010). A forecast peak discharge or flood volume is termed qualified when the difference between the predicted and the recorded values is within 20% of the recorded value. The root-mean-square error (RMSE) between the observed and predicted flood values was also used as a performance criterion in this study. The formulas of these criteria can be found in Li et al. (2010). Results of flood forecasting are shown in Table 9. Time series plots of observed and predicted flood values obtained with the seven inputs selected using the copula-entropy method are shown in Fig. 3. It can be seen from Table 9 and Fig. 3 that the model performs quite well.

Comparison with Other Methods

Comparison with Inputs Obtained by the Linear Correlation Coefficients

Bowden et al. (2005a) pointed out that the linear correlation method is the most popular analytical technique for selecting appropriate inputs. Therefore, this study compared the inputs selected by the proposed method with those selected by the linear correlation coefficient. Two assumptions need to be satisfied for the Pearson correlation coefficient. One is that the variables must follow a multivariate normal distribution, and the other is that the pairwise dependence is linear. The normal distribution for the marginal variables is presumed. The marginal probability density functions estimated by the normal assumption and the principle of maximum entropy (POME) method for the five rivers are shown in Fig. 4. It is seen that the distribution estimated by POME fitted the empirical distribution better than the normal distribution, especially for the data of Min,

Fig. 2. Locations of mainstream and tributaries (Jinsha, Min, Tuo, Jialing, and Wu rivers) and their corresponding gauging stations in the upper Yangtze River

Table 8. Final Selected Inputs in Step One

Stations   Selected inputs
Pingshan   Lag t−1, t−4, t−2
Gaochang   Lag t−3
Beibei     Lag t−2
Lijiawan   Lag t−3
Wulong     Lag t−2
Yichang    Lag t−1, t−2, t−3, t−4, t−5

Table 9. Cross-Validation Results Based on the Inputs Selected by the CE Method

Validation period  Type        R2     RMSE   Qualified rate
1998–1999          Training    0.961  2,026  0.954
                   Validation  0.954  3,064  0.948
2000–2001          Training    0.962  2,242  0.963
                   Validation  0.960  2,182  0.932
2002–2003          Training    0.977  2,159  0.964
                   Validation  0.945  2,390  0.938
2004–2005          Training    0.977  1,999  0.970
                   Validation  0.933  2,802  0.906
2006–2007          Training    0.973  2,017  0.966
                   Validation  0.937  2,423  0.960
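The three performance criteria can be computed as in the sketch below. The Nash–Sutcliffe efficiency and RMSE follow their standard definitions. The qualified rate is written as the fraction of forecasts whose error is within 20% of the recorded value, following the verbal definition in the text; applying it element-wise to an array of event peaks or volumes is an assumption, since the exact formulas are given in Li et al. (2010) and not reproduced here.

```python
import numpy as np


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((observed - predicted) ** 2)
    return float(1.0 - sse / np.sum((observed - observed.mean()) ** 2))


def qualified_rate(observed, predicted, tolerance=0.20):
    """Fraction of forecasts within `tolerance` of the recorded value."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - observed) <= tolerance * np.abs(observed)))
```

A higher R² and qualified rate together with a lower RMSE indicate a better forecast, which is how the rows of Table 9 are read.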

Tuo, and Wu Rivers, which showed high kurtosis and skewness. The assumption of normality was therefore found to be inappropriate in this case. To test the validity of the assumption that the pairwise dependence is linear, the time series of flow data was divided into two segments, and Pearson's correlations were calculated for each segment. The calculated results for the Gaochang and Wulong gauging stations are listed in Table 10, which indicates that Pearson's correlation changed over time; therefore, linear correlation coefficients are not valid for these stations.

The following examines which inputs would finally be selected if the linear correlation coefficient were used. Pearson's correlation coefficients were computed, and the results are given in Table 11. For each station, the lag with the largest correlation coefficient was selected as an input of the ANN. The partial correlation coefficients (PCC) were used to remove the effect of the selected input variable and measure the true correlation between potential inputs and output. The theory of PMI was employed to calculate the PCC between each potential input and the output given the selected input lag. The calculated results are shown in Fig. 5, where the value for the selected variable is the linear correlation coefficient, and the values for the others are the PCC values given the selected input lag. It can be seen that, compared with the correlation coefficients, which are the largest values in Fig. 5, the PCC values are not large. Take the Beibei gauging station as an example. The largest linear correlation coefficient occurred at lag 2, which is the largest value in the figure and equals 0.634. Therefore, the variable at lag 2 was taken as an input of the ANN model. The PCC, which removes the effect of the selected variable at lag t − 2, was then calculated and is shown in Fig. 5. The second highest value for Beibei was 0.26, less than half of the largest one. Therefore, only one variable was selected for each gauging station. The selected inputs, based on the linear correlation method, are listed in Table 12, in which the input set selected by the proposed method is also shown. The results show that the input sets selected by the proposed method and by the correlation coefficient are somewhat different. For example, the proposed method selected Gaochang lag 3 as an input, whereas the correlation method selected Gaochang lag 4. In fact, the travel time between Gaochang and Yichang is 3 days.

Both input sets, selected by the proposed method and by the Pearson linear correlation coefficients, were employed to forecast the flood at the Yichang Station. The same data sets were used to predict the inflow to the Three Gorges Reservoir. The performance criteria were calculated, and the results, given in Table 13, indicate that the network trained with the inputs selected by the PMI method had higher R² and α values and a smaller RMSE. In addition, the ANN model results based on the inputs selected by the linear correlation coefficient (LCC) method are also shown in Fig. 3, which indicates that the predicted results based on the inputs selected by the CE method are superior to those based on the inputs selected by the LCC method. Therefore, the flood forecasting model with the input set selected based on PMI is better.

Comparison with the Current Flood Forecasting Model of TGR

The current flood forecasting method (Liang et al. 1992) was used to predict the flow of TGR for 2006 and 2007. The RMSE, R², and α values calculated using the current flood forecasting model are 2,425, 0.9340, and 0.95, and those of the proposed model are 2,423, 0.937, and 0.96, respectively. The current regression method performed well, and the results of the proposed ANN-based method are comparable with those of the current method.
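For reference, the partial correlation coefficient used in the comparison with the linear correlation method can be illustrated with the standard first-order formula. This is a sketch of the general concept, in which the effect of an already selected variable z is removed from the correlation between a candidate x and the output y; it is not necessarily the computation performed in the study, and the function name is an assumption.

```python
import numpy as np


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y given z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return float((r_xy - r_xz * r_yz) / np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2)))
```

When x and y are related only through z, the ordinary correlation can be large while the partial correlation is close to zero, which is why a second lag that looks strongly correlated in Table 11 may add little once the first lag has been selected.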

Fig. 3. Comparison of observed daily flow series with flood forecasting results based on the proposed input identification method and the traditional linear correlation method


Fig. 4. Fitting frequency histograms of flood magnitude by the POME method and normal distribution

Table 10. Correlation Coefficients of the Gaochang and Wulong Gauging Stations

             Gaochang              Wulong
             Period 1   Period 2   Period 1   Period 2
Yichang t    1.000      1.000      1.000      1.000
Lag t−1      0.289      0.445      0.363      0.514
Lag t−2      0.325      0.505      0.421      0.521
Lag t−3      0.385      0.565      0.456      0.489
Lag t−4      0.463      0.586      0.444      0.442
Lag t−5      0.488      0.564      0.417      0.400
Lag t−6      0.457      0.531      0.407      0.358
Lag t−7      0.403      0.507      0.414      0.319
Lag t−8      0.365      0.502      0.428      0.288
Lag t−9      0.341      0.496      0.432      0.267
Lag t−10     0.324      0.472      0.418      0.246
Lag t−11     0.313      0.444      0.395      0.218
Lag t−12     0.308      0.413      0.368      0.177
Lag t−13     0.305      0.381      0.335      0.132

Note: The time series of flow data was divided into two segments. Period 1 gives the correlation coefficients of the first segment, and Period 2 gives the correlation coefficients of the second segment.
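The split-sample check summarized in Table 10 can be sketched as follows. The series is divided into two halves and the lagged Pearson correlation with the output is computed in each half; splitting at the midpoint and the function names are assumptions, because the text does not state where the series was divided.

```python
import numpy as np


def lagged_correlation(x, y, lag):
    """Pearson correlation between x at time t - lag and y at time t (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x[:-lag], y[lag:])[0, 1])


def split_sample_correlations(x, y, max_lag=13):
    """Return (lag, r_first_half, r_second_half) for lags 1..max_lag."""
    half = len(y) // 2                      # midpoint split is an assumption
    rows = []
    for lag in range(1, max_lag + 1):
        first = lagged_correlation(x[:half], y[:half], lag)
        second = lagged_correlation(x[half:], y[half:], lag)
        rows.append((lag, first, second))
    return rows
```

Large differences between the two halves at the same lag, as seen for Gaochang and Wulong, suggest that a single linear correlation coefficient does not describe the dependence over the whole record.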



Table 11. Correlation Coefficients between Potential Inputs and Output of ANN Model

Lag        Yichang  Pingshan  Lijiawan  Gaochang  Beibei  Wulong
Lag t−1    0.968    0.698     0.417     0.435     0.603   0.498
Lag t−2    0.901    0.695     0.487     0.492     0.636   0.504
Lag t−3    0.829    0.684     0.523     0.538     0.604   0.478
Lag t−4    0.764    0.661     0.503     0.540     0.526   0.446
Lag t−5    0.714    0.629     0.459     0.507     0.441   0.424
Lag t−6    0.673    0.593     0.414     0.470     0.381   0.413
Lag t−7    0.640    0.557     0.385     0.443     0.350   0.407
Lag t−8    0.612    0.523     0.372     0.424     0.333   0.401
Lag t−9    0.585    0.492     0.367     0.405     0.320   0.385
Lag t−10   0.558    0.462     0.362     0.386     0.309   0.361
Lag t−11   0.529    0.434     0.358     0.369     0.302   0.331
Lag t−12   0.499    0.407     0.355     0.352     0.298   0.295
Lag t−13   0.468    0.380     0.345     0.339     0.292   0.264

Fig. 5. Correlation coefficients between inputs and output of the ANN model (for the selected input, the value is the linear correlation coefficient, and for the other potential inputs, it is the PCC given the selected variable)

Table 12. Comparison between the Input Sets Selected Based on the Pearson Linear Correlation Coefficients and the Proposed PMI Method

Rivers    Stations   Pearson correlation coefficient   PMI
Jinsha    Pingshan   Lag t−1                           Lag t−1
Min       Gaochang   Lag t−4                           Lag t−3
Tuo       Lijiawan   Lag t−3                           Lag t−3
Jialing   Beibei     Lag t−2                           Lag t−2
Wu        Wulong     Lag t−2                           Lag t−2
Yangtze   Yichang    Lag t−1                           Lag t−1, t−2

Table 13. Comparison of Results Obtained with Different Input Variables

                     Nash–Sutcliffe efficiency R2   RMSE                    Qualified rate
Methods              Training   Validation         Training   Validation   Training   Validation
Linear correlation   0.9231     0.9036             1,476      2,932        0.9857     0.8566
CE                   0.9402     0.9341             1,281      2,423        0.9898     0.9590


Conclusions

This study develops a new method, based on CE, for selecting the inputs of an ANN model. The accuracy and performance of the proposed method are investigated and analyzed using statistical experimentation and real-world data. The main conclusions are:
1. The CE can measure both linear and nonlinear dependencies based on information theory and the copula function. It is a nonparametric method and makes no assumptions about the statistical distribution. Furthermore, the proposed method only needs to calculate the CE instead of the marginal or joint probability distributions; it therefore estimates MI or PMI more directly and avoids the accumulation of systematic bias. In addition, the proposed method calculates PMI based on the theory for continuous variables instead of the discrete version.
2. Multiple integration and Monte Carlo methods are used to obtain the value of CE. The accuracy test shows that the two methods lead to similar results and that the result calculated by the multiple integration method is more accurate than that calculated by the Monte Carlo method.
3. The function test with known attributes shows that the proposed method can be applied to both time series and nonlinear models and is capable of choosing inputs in their correct order of significance.
4. Application of the proposed method to the case study of flood forecasting in the upper Yangtze River indicates that the proposed method can identify the appropriate inputs for the ANN. Compared with the input set obtained based on the Pearson linear correlation method, the network trained with the inputs selected by the PMI method has a smaller RMSE for training, testing, and validation data. Therefore, this study has practical significance for flood forecasting in the upper Yangtze River basin.

Acknowledgments

The project was financially supported by the National Natural Science Foundation of China (NSFC Grants 51309104, 51239004, and 51190094), the Fundamental Research Funds for the Central Universities (2013QN113), and the Natural Science Foundation of Hubei Province (No. 2013CFB184).

References

Alfonso, L., He, L., Lobbrecht, A., and Price, R. (2012). “Information theory applied to evaluate the discharge monitoring network of the Magdalena river.” J. Hydroinf., in press. Alfonso, L., Lobbrecht, A., and Price, R. (2010). “Optimization of water level monitoring network in polder systems using information theory.” Water Resour. Res., 46(12), W12553. Berntsen, J., Espelid, T. O., and Genz, A. (1991). “An adaptive algorithm for the approximate calculation of multiple integrals.” ACM Trans. Math. Softw., 17(4), 437–451. Bowden, G. J., Dandy, G. C., and Maier, H. R. (2005a). “Input determination for neural network models in water resources applications. Part 1. Background and methodology.” J. Hydrol., 301(1–4), 75–92. Bowden, G. J., Maier, H. R., and Dandy, G. C. (2005b). “Input determination for neural network models in water resources applications. Part 2. Case study: Forecasting salinity in a river.” J. Hydrol., 301(1–4), 93–107. Calsaverini, R. S., and Vicente, R. (2009). “An information-theoretic approach to statistical dependence: Copula information.” Eur. Phys. Lett., 88(6), 3–12.

Chang, F. J., Chiang, Y. M., and Chang, L. C. (2007). “Multi-step-ahead neural networks for flood forecasting.” Hydrol. Sci. J., 52(1), 114–130. Chebana, F., and Ouarda, T. (2011). “Multivariate extreme value identification using depth functions.” Environmetrics, 22(3), 441–455. Chen, L., Guo, S. L., Yan, B. W., Liu, P., and Fang, B. (2010). “A new seasonal design flood method based on bivariate joint distribution of flood magnitude and date of occurrence.” Hydrol. Sci. J., 55(8), 1264–1280. Chen, L., Singh, V., Shenglian, G., Hao, Z., and Li, T. (2012). “Flood coincidence risk analysis using multivariate copula functions.” J. Hydrol. Eng., 10.1061/(ASCE)HE.1943-5584.0000504, 742–755. Corzo, G., and Solomatine, D. P. (2007). “Knowledge-based modularization and global optimization of artificial neural network models in hydrological forecasting.” Neural Networks, 20(4), 528–536. Davies, L., and Gather, U. (1993). “The identification of multiple outliers.” J. Am. Stat. Assoc., 88(423), 782–792. De Michele, C., and Salvadori, G. (2003). “A generalized Pareto intensity duration model of storm rainfall exploiting 2-copulas.” J. Geophys. Res., 108(D2), 1–11. De Michele, C., Salvadori, G., Passoni, G., and Vezzoli, R. (2007). “A multivariate model of sea storms using copulas.” Coastal Eng., 54(10), 734–751. Elshorbagy, A., Corzo, G., Srinivasulu, S., and Solomatine, D. P. (2010a). “Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology—Part 1: Concepts and methodology.” Hydrol. Earth Syst. Sci., 14(10), 1931–1941. Elshorbagy, A., Corzo, G., Srinivasulu, S., and Solomatine, D. P. (2010b). “Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology—Part 2: Application.” Hydrol. Earth Syst. Sci., 14(10), 1943–1961. Favre, A.-C., Adlouni, S., Perreault, L., Thiémonge, N., and Bobée, B. (2004). “Multivariate hydrological frequency analysis using copulas.” Water Resour. 
Res., 40(1), W01101. Fernando, T. M. K. G., Maier, H. R., and Dandy, G. C. (2009). “Selection of input variables for data driven models: An average shifted histogram partial mutual information estimator approach.” J. Hydrol., 367(3–4), 165–176. Gauthier, T. D. (2001). “Detecting trends using Spearman’s rank correlation coefficient.” Environ. Forensics, 2(4), 359–362. Grimaldi, S., and Serinaldi, F. (2006). “Design hyetographs analysis with 3-copula function.” Hydrol. Sci. J., 51(2), 223–238. Harmancioglu, N. B., and Yevjevich, V. (1987). “Transfer of hydrologic information among river points.” J. Hydrol., 91(1–2), 103–111. Hejazi, M. I., Cai, X., and Ruddel, B. (2008). “The role of hydrologic information to reservoir operations—Learning from past releases.” Adv. Water Resour., 31(12), 1636–1650. Hu, T. S., Lam, K. C., and Ng, S. T. (2005). “A modified neural network for improving river flow prediction.” Hydrol. Sci. J., 50(2), 299–318. Joe, H. (1997). Multivariate models and dependence concepts, Chapman and Hall, London. Kao, S. C., and Govindaraju, R. S. (2007). “A bivariate frequency analysis of extreme rainfall with implications for design.” J. Geophys. Res., 112(D13). Kao, S. C., and Govindaraju, R. S. (2010). “A copula-based joint deficit index for droughts.” J. Hydrol., 380(1–2), 121–134. Kapur, J. N., and Kesavan, H. K. (1992). Entropy optimization principles and their application, Academic Press, San Diego. Khan, S., et al. (2006). “Nonlinear statistics reveals stronger ties between ENSO and the tropical hydrological cycle.” Geophys. Res. Lett., 33(24), L24402. Kuhn, G., Khan, S., Ganguly, A. R., and Branstetter, M. L. (2007). “Geospatial–temporal dependence among weekly precipitation extremes with applications to observations and climate model simulations in South America.” Adv. Water Resour., 30(12), 2401–2423. Lachtermacher, G., and Fuller, J. D. (1994). 
“Backpropagation in hydrological time series forecasting.” Stochastic and statistical methods in hydrology and environmental engineering, K. W. Hipel, A. I. McLeod, U. S. Panu, and V. P. Singh, eds., Kluwer Academic, Dordrecht.



Lallahem, S., and Mania, J. (2003). “Evaluation and forecasting of daily groundwater outflow in a small chalky watershed.” Hydrol. Process., 17(8), 1561–1577. Li, W. (1990). “Mutual information functions versus correlation functions.” J. Stat. Phys., 60(5–6), 823–837. Li, X., Guo, S. L., Liu, P., and Chen, G. Y. (2010). “Dynamic control of flood limited water level for reservoir operation by considering inflow uncertainty.” J. Hydrol., 391(1–2), 124–132. Liang, G. C., Kachroo, R. K., Kang, W., and Yu, X. Z. (1992). “Applications of linear modeling techniques for flow routing on large basins.” J. Hydrol., 133(1–2), 99–140. Maier, H. R., and Dandy, G. C. (2000). “Neural networks for the prediction and forecasting of water resources variables: A review of modeling issues and applications.” Environ. Modell. Software, 15(1), 101–124. May, R. J., Dandy, G. C., Maier, H. R., and Nixon, J. B. (2008a). “Application of partial mutual information variable selection to ANN forecasting of water quality in water distribution systems.” Environ. Modell. Software, 23(10–11), 1289–1299. May, R. J., Maier, H. R., Dandy, G. C., and Gayani Fernando, T. M. K. (2008b). “Nonlinear variable selection for artificial neural networks using partial mutual information.” Environ. Modell. Software, 23(10–11), 1312–1326. McCulloch, W. S., and Pitts, W. (1943). “A logical calculus of the ideas immanent in nervous activity.” Bull. Math. Biol., 5, 115–133. Ministry of Water Resources (MWR). (2006). Regulation for calculating design flood of water resources and hydropower projects, Chinese Shuili Shuidian Press, Beijing (in Chinese). Molini, A., La Barbera, P., and Lanza, L. G. (2006). “Correlation patterns and information flows in rainfall fields.” J. Hydrol., 322(1–4), 89–104. Nelsen, R. B. (2006). An introduction to copulas, 2nd Ed., Springer, New York. Ng, W. W., Panu, U. S., and Lennox, W. C. (2007). “Chaos based analytical techniques for daily extreme hydrological observations.” J. 
Hydrol., 342(1–2), 17–41. Renard, B., and Lang, M. (2007). “Use of a Gaussian copula for multivariate extreme value analysis: Some case studies in hydrology.” Adv. Water Resour., 30(4), 897–912. Salvadori, G., and De Michele, C. (2010). “Multivariate multiparameter extreme value models and return periods: A copula approach.” Water Resour. Res., 46(10), W10501. Salvadori, G., De Michele, C., Kottegoda, N. T., and Rosso, R. (2007). Extremes in nature: An approach using copulas, Springer, New York. Serinaldi, F., Bonaccorso, B., Cancelliere, A., and Grimaldi, S. (2009). “Probabilistic characterization of drought properties through copulas.” Phys. Chem. Earth , 34(10–12), 596–605. Shannon, C. E. (1948). “Mathematical theory of communication.” Bell Syst. Tech. J., 27(3), 379–423. Sharma, A. (2000). “Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1 a strategy for system predictor identification.” J. Hydrol., 239(1–4), 232–239.

Shiau, J. T. (2006). “Fitting drought duration and severity with two-dimensional copulas.” Water Resour. Manage., 20(5), 795–815.

Shiau, J. T., Wang, H. Y., and Chang, T. T. (2006). “Bivariate frequency analysis of floods using copulas.” J. Am. Water Resour. Assoc., 42(6), 1549–1564.

Shrestha, D. L., and Solomatine, D. P. (2008). “Data-driven approaches for estimating uncertainty in rainfall-runoff modelling.” Int. J. River Basin Manage., 6(2), 109–122.

Singh, V. P. (2000). “The entropy theory as a tool for modeling and decision making in environmental and water resources.” J. Water Soc. Am., 26(1), 1–11.

Singh, V. P., and Zhang, L. (2007). “IDF curves using the Frank Archimedean copula.” J. Hydrol. Eng., 10.1061/(ASCE)1084-0699(2007)12:6(651), 651–662.

Sklar, A. (1959). “Fonctions de répartition à n dimensions et leurs marges.” Publ. Inst. Stat. Univ. Paris, 8, 229–231.

Solomatine, D. P., Maskey, M., and Shrestha, D. L. (2007). “Instance-based learning compared to other data-driven methods in hydrologic forecasting.” Hydrol. Process., 21(2), 275–287.

Song, S., and Singh, V. P. (2010). “Meta-elliptical copulas for drought frequency analysis of periodic hydrologic data.” Stochastic Environ. Res. Risk A, 24(3), 425–444.

Specht, D. F. (1991). “A general regression neural network.” IEEE Trans. Neural Networks, 2(6), 568–576.

Steuer, R. (2006). “On the analysis and interpretation of correlations in metabolomic data.” Brief Bioinf., 7(2), 151–158.

Steuer, R., Kurths, J., Daub, C. O., Weise, J., and Selbig, J. (2002). “The mutual information: Detecting and evaluating dependencies between variables.” Bioinformatics, 18(2), 231–240.

Wang, X., Gebremichael, M., and Yan, J. (2010). “Weighted likelihood copula modeling of extreme rainfall events in Connecticut.” J. Hydrol., 390(1–2), 108–115.

Xiao, Y., Guo, S. L., Liu, P., Yan, B. W., and Chen, L. (2009). “Design flood hydrograph based on multicharacteristic synthesis index method.” J. Hydrol. Eng., 10.1061/(ASCE)1084-0699(2009)14:12(1359), 1359–1364.

Xu, Y. (2005). “Applications of copula-based models in portfolio optimization.” Ph.D. dissertation, Univ. of Miami, Coral Gables, FL.

Yin, H. F., and Li, C. A. (2001). “Human impact on floods and flood disasters on the Yangtze River.” Geomorphology, 41(2–3), 105–109.

Zhang, L., and Singh, V. P. (2006). “Bivariate flood frequency analysis using the copula method.” J. Hydrol. Eng., 10.1061/(ASCE)1084-0699(2006)11:2(150), 150–164.

Zhang, L., and Singh, V. P. (2007). “Gumbel–Hougaard copula for trivariate rainfall frequency analysis.” J. Hydrol. Eng., 10.1061/(ASCE)1084-0699(2007)12:4(409), 409–419.

Zhao, N., and Lin, W. T. (2011). “A copula entropy approach to correlation measurement at the country level.” Appl. Math. Comput., 218(2), 628–642.
