Stochastic Environmental Research and Risk Assessment 15 (2001) 310-324 © Springer-Verlag 2001

Selection of a kernel bandwidth for measuring dependence in hydrologic time series using the mutual information criterion

T. I. Harrold, A. Sharma, S. Sheather

Abstract. Mutual information is a generalised measure of dependence between any two variables. It can quantify non-linear as well as linear dependence, which makes it an attractive alternative to the correlation coefficient, which can only quantify linear dependence patterns. Mutual information is especially suited to hydrological problems, because the dependence between any two hydrologic variables is seldom linear in nature. Calculation of the mutual information score involves estimation of the marginal and joint probability density functions of the two variables. This paper uses nonparametric kernel density estimation methods to estimate these probability density functions. Accurate estimation of the mutual information score using kernel methods requires selection of appropriate smoothing parameters (bandwidths) for use with the kernels. The aim of this paper is to obtain a practical method of bandwidth selection for calculation of the mutual information score. The lag-one dependence structures of several autocorrelated time series are analysed using mutual information (this produces the lag-one auto-MI score, the analogue of the lag-one autocorrelation). Empirical trials are used to select appropriate bandwidths for a range of underlying autoregressive and autoregressive-moving average models with normal or near-normal parent distributions. Expressions for reasonable bandwidth choices under these conditions are proposed.

T. I. Harrold, A. Sharma (corresponding author)
School of Civil and Environmental Engineering, The University of New South Wales, Sydney NSW 2052, Australia

S. Sheather
Australian Graduate School of Management, The University of New South Wales, Sydney NSW 2052, Australia

The authors gratefully acknowledge the New South Wales Department of Land and Water Conservation and the Australian Research Council for funding this research.

1 Introduction
Identification, analysis, and modelling of dependence are essential parts of the discipline of hydrology. Statistical dependence is a mathematical description of the strength of the relationship between a dependent variable and one or more

explanatory (predictor) variables. Examples where the analysis and modelling of dependence is required include filling gaps in rainfall, evaporation and streamflow data, forecasting future rainfall and streamflows, and generating long sequences of synthetic data. Hirsch et al. (1993) provide a good introduction to the analysis of relationships between hydrologic variables.

The dependence structure present in hydrologic time series has traditionally been modelled using autoregressive or autoregressive-moving average models. Such models characterise the time series by an assumed probability distribution based on a small number of sample statistics, such as the mean, variance, correlation, and skewness, calculated from the historical record. The fitted models assume a linear dependence relationship, and fitting of the models is often based on the calculation of the correlation or partial correlation between various lags of the variable being modelled. Measures such as correlation offer a limited representation of the nature of dependence that may be present, as they can only represent the quality of a linear relationship between the variables. Nonlinear relationships between variables (which are common in hydrology) cannot be adequately detected and quantified by the correlation coefficient. As a result, the corresponding fitted models may not be fully representative of the system that they are attempting to model.

There is a need to measure dependence and fit hydrologic models in a more general manner. In particular, a more general measure of dependence that can detect and quantify nonlinear relationships is required. While such a measure has obvious uses in any application where the dependence between two variables needs to be quantified, it has a unique importance in hydrologic time series modelling applications such as those described in Sharma et al. (1997).
These applications are designed to reproduce a broad class of underlying probability density functions, and they therefore require a generalised measure of dependence for selecting the predictor variables used in the modelling. Using a linear measure of dependence (such as the correlation coefficient) for this task can result in the selection of predictors that cannot adequately reproduce the nonlinear dependence and non-Gaussian probability density functions.

Several approaches have been used for finding the order of dependence of a time series model. In a linear context, measures such as the Akaike information criterion (AIC) (Brockwell and Davis, 1996, p 171) are used to choose the model order for parametric models. However, traditional measures such as the AIC are not applicable to the problem of model order selection for nonparametric time series models, such as the NP(p) model proposed by Sharma et al. (1997). Estimation of the model order for a general linear/nonlinear time series model such as the NP(p) model can be accomplished using the partial mutual information criterion. The partial mutual information (PMI) criterion (Sharma, 2000) is a useful alternative for identifying a combination of predictors for a model (the number of predictors being the model order) without making major assumptions about the underlying model structure. Accurate estimation of the mutual information criterion is an important part of the calculation of the PMI.

The mutual information (MI) criterion (Sharma, 2000; Fraser and Swinney, 1986) is a measure of dependence that can detect and quantify both linear and nonlinear relationships. Mutual information is related to entropy, and has also been referred to as transinformation (see Chapman, 1986; Singh, 1997). Sharma (2000) shows that MI performs better than correlation in detecting and quantifying a range of nonlinear dependence structures, and that it also performs well in quantifying linear dependence.
We believe that the mutual information criterion can quantify a broader range of underlying dependence structures than any other available method.

We use a nonparametric implementation of the mutual information in this study; the nonparametric method used is kernel density estimation (Silverman, 1986). Nonparametric methods avoid the need to assume a probability distribution, using the entire historical sample to estimate the probability densities needed for simulation. Nonparametric models are constructed with minimal assumptions regarding the underlying dependence structure and the form of the probability density function, and are therefore more generally applicable than traditional parametric models. The implementation of the mutual information criterion studied here has been used in Moon et al. (1995) and Sharma (2000), and is sensitive to the choice of a set of smoothing parameters known as the kernel bandwidths. For example, in a typical result from this study, a 13% decrease in bandwidth from the best obtained resulted in a 150% increase in the mean square error of the estimated mutual information.

This work presents the results of empirical trials and sensitivity studies. These results suggest appropriate choices of smoothing parameter (bandwidth) for calculation of the MI score using kernel density estimation methods. These choices are based on the Gaussian reference bandwidth (Silverman, 1986, p 86; Scott, 1992, p 152), multiplied by a scaling factor.

Mutual information is described in the next section of this paper, followed by a discussion of kernel density estimation of the MI and of the selection of the kernel bandwidths. The methodology for the empirical trials to find practical bandwidth choices for calculating the MI score is discussed next, followed by a presentation of the results. A practical rule for selecting appropriate bandwidths, based on these results, is presented in the conclusion.

2 Background

2.1 The mutual information criterion as a measure of dependence
For two variables X and Y, the MI criterion is defined in bits¹ as:

$$\mathrm{MI} = \iint f_{X,Y}(x,y)\,\log_2\!\left[\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\right]\mathrm{d}x\,\mathrm{d}y \qquad (1)$$

where f_X(x) and f_Y(y) are the marginal probability density functions (PDFs) of X and Y respectively, and f_{X,Y}(x,y) is the joint (bivariate) PDF of X and Y. If the two variables X and Y are not related then, by the definition of independence, the joint PDF is equal to the product of the marginal PDFs. The ratio in Eq. (1) would then equal one and its logarithm would equal zero. Thus the MI criterion for independent data is expected to be zero. The possible values that the MI criterion can take range from zero (if no dependence exists between the variables) to a number approaching positive infinity (if perfect dependence exists between the variables). Chapman (1986) showed that, for a bivariate normal distribution, MI is directly related to the correlation coefficient (ρ) as shown in Eq. (2).

¹ The units of MI are defined by the base of the logarithm in Eq. (1). If base 2 is used the units are called bits. This work uses base 2 (after Fraser and Swinney, 1986).

$$\mathrm{MI} = -\tfrac{1}{2}\,\log_2\!\left(1-\rho^2\right) \qquad (2)$$

It can be seen that MI does not distinguish between a positively and a negatively sloped dependence relationship. However, in situations where the underlying dependence structure is not linear (and Eq. (2) is therefore invalid), MI is a more reliable indicator of the presence of dependence.
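For the bivariate normal case, Eq. (2) can be checked with a few lines of Python. This is an illustrative sketch only (the function name is ours, not from the paper):

```python
import math

def mi_bivariate_normal(rho):
    """MI in bits for a bivariate normal with correlation rho (Eq. (2))."""
    return -0.5 * math.log2(1.0 - rho * rho)
```

Note that the score depends only on rho squared, so positively and negatively sloped relationships of equal strength give the same MI, and the score grows without bound as |rho| approaches 1.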

2.2 Kernel density estimation of the mutual information criterion
For any given bivariate sample, the MI score in Eq. (1) can be estimated as:

$$\widehat{\mathrm{MI}} = \frac{1}{n}\sum_{i=1}^{n} \log_2\!\left[\frac{\hat f_{X,Y}(x_i,y_i)}{\hat f_X(x_i)\,\hat f_Y(y_i)}\right] \qquad (3)$$

where (x_i, y_i) is the i'th bivariate sample data pair in a sample of size n, and \hat f_X(x_i), \hat f_Y(y_i), and \hat f_{X,Y}(x_i, y_i) are the respective marginal and joint probability densities estimated at the sample data points.

This paper adopts nonparametric methods for producing estimates of the joint and marginal densities in Eq. (3). A nonparametric method is defined as one which can reproduce a broad class of underlying density functions (Scott, 1992, p 44). These methods seek to approximate the underlying density locally, using data from a small neighbourhood near the point of estimate (Lall, 1995), and avoid making assumptions about the form of the underlying PDF. Some versions of the MI function (Fraser and Swinney, 1986; Osaka et al., 1997, 1998) use histograms to estimate the joint and marginal probability densities in Eq. (3). As an alternative, Darbellay (1999) and Darbellay and Vajda (1999) used an adaptive histogram to estimate the probability density functions. However, histograms (which count the number of data points falling into evenly spaced bins) are a crude measure of probability density, because the histogram is not smooth at the bin edges, and because the location of the bin edges can dramatically affect the estimates.

Kernel density estimation (Silverman, 1986; Scott, 1992; Sharma et al., 1997) is a nonparametric method that eliminates the bin-edge problems associated with histograms. The probability density is estimated by the summation of kernels (smooth functions) centred at each observed data point. This produces a weighted moving average of the empirical frequency distribution of the data. The implementation of the MI criterion using kernel density estimation techniques was first proposed by Moon et al. (1995).

The kernel density estimate of the underlying univariate PDF at coordinate location x can be written as follows. A normal kernel² (Silverman, 1986) is used.

$$\hat f_X(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,k_1\hat\sigma}\,\exp\!\left[-\frac{(x-x_i)^2}{2(k_1\hat\sigma)^2}\right] \qquad (4)$$

² The form of the kernel (be it normal, Epanechnikov, or the like) has little impact on the final kernel density estimate, compared to the impact of the choice of smoothing parameterisation. We choose a normal kernel because the properties of this kernel are well understood, especially for bivariate data.

where x_i is the i'th data point in X for a sample of size n, k_1 \hat\sigma is the bandwidth (a smoothing parameter) of the univariate kernel used in estimating the PDF, \hat\sigma is the estimated standard deviation of X, and k_1 is the univariate bandwidth factor, which represents the bandwidth for a sample having unit variance. The kernel density estimate of the underlying bivariate PDF at (x, y) is given in Eq. (5). Again, a normal kernel is used.

$$\hat f_{X,Y}(x,y) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2\pi\sqrt{\det\!\left(k_2^2 S\right)}}\,\exp\!\left[-\frac{1}{2}\begin{pmatrix}x-x_i\\ y-y_i\end{pmatrix}^{\!T}\left(k_2^2 S\right)^{-1}\begin{pmatrix}x-x_i\\ y-y_i\end{pmatrix}\right] \qquad (5)$$

where (x_i, y_i) is the i'th data pair in a sample of size n, k_2^2 S is the bivariate bandwidth, and S is the sample covariance matrix of the variable set X, Y.



$$S = \begin{pmatrix}\hat\sigma_{xx} & \hat\sigma_{xy}\\ \hat\sigma_{xy} & \hat\sigma_{yy}\end{pmatrix}$$

This method of representing the bivariate bandwidth, referred to as sphering (Fukunaga, 1972), is equivalent to using a bandwidth k_2 on a sample transformed such that the resulting covariance matrix is the identity (I). The choice of the bandwidth (represented by k_1 \hat\sigma in Eq. (4) and by k_2^2 S in Eq. (5)) is the key to an accurate estimate of the probability density. A large bandwidth results in an oversmoothed probability density, with subdued modes and over-enhanced tails. A small bandwidth, on the other hand, can lead to density estimates overly influenced by individual data points, with noticeable bumps in the tails of the probability density.

The way that the bandwidth is represented is important. After investigating the performance of several smoothing parameterisations in a kernel estimator of the bivariate probability density, Wand and Jones (1993) state that smoothing strategies based on the covariance matrix are inappropriate in general. However, we feel that this conclusion does not apply to hydrologic data, since the distributions on which Wand and Jones base this recommendation are seldom seen in hydrology. For near-normal, skewed, and strongly autocorrelated data, which are common in hydrology, sphering (i.e. representing the bandwidth by k_2^2 S) is a simple, practical and efficient way of specifying appropriate smoothing parameters. It is done as an intermediate step to improve the density estimation before the mutual information is calculated. Following Moon et al. (1995) and Sharma (2000), we adopt the sphering choice for use in calculating mutual information.

A relatively simple bandwidth choice, known as the Gaussian reference bandwidth (Silverman, 1986, p 86; Scott, 1992, p 152), is estimated as:

$$k_{\mathrm{ref}} = \left(\frac{4}{d+2}\right)^{1/(d+4)} n^{-1/(d+4)} \qquad (6)$$

where n and d refer to the sample size and the dimension of the multivariate variable set respectively. For a univariate PDF (d = 1) this reduces to

$$k_{\mathrm{ref1}} = 1.06\,n^{-1/5} \qquad (7)$$

For a bivariate PDF (d = 2) this reduces to

$$k_{\mathrm{ref2}} = 1.0\,n^{-1/6} \qquad (8)$$
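Equations (6)-(8) are straightforward to encode. The sketch below (function name ours) also confirms the reductions: the univariate constant (4/3)^{1/5} is approximately 1.06, and the bivariate constant is exactly 1:

```python
import math

def k_ref(n, d):
    """Gaussian reference bandwidth factor for sample size n, dimension d (Eq. (6))."""
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))
```

For d = 1 this gives 1.0593 n^{-1/5}, matching the rounded constant in Eq. (7); for d = 2 the constant is exactly 1, as in Eq. (8). The factor shrinks slowly with n, so large samples still receive a non-negligible amount of smoothing.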

Use of the Gaussian reference bandwidth is simple and computationally efficient compared to other bandwidth selection rules for estimating a probability density. The Gaussian reference bandwidth has been derived by minimising the mean square error under the assumption that the underlying PDF is normal, and it is thus the optimal bandwidth for estimating a PDF if the kernel is normal and the underlying (population) PDF is normal. It is a reasonably robust choice even if the underlying PDF is only approximately normal, for example if it is slightly skewed or slightly multimodal. For example, Sharma et al. (1998) showed that the Gaussian reference bandwidth produced integrated squared errors (ISEs) comparable to those of the Maximum Likelihood Cross Validation and Least Squares Cross Validation methods when these were applied to a weakly bimodal bivariate probability density. Equation (6) is therefore appropriate for choosing the bandwidth factor for estimating the PDF of a broad range of datasets.

However, calculation of the MI score involves a ratio of probability densities. When the sample estimate of the MI score is calculated using Eq. (3), two bandwidths are used: one for the calculation of the marginal (univariate) PDFs (Eq. (4))³, and one for the calculation of the joint (bivariate) PDF (Eq. (5)). The Gaussian reference bandwidth cannot be expected to be optimal or near-optimal when calculating the MI score, because of the function of PDFs involved. Previous papers that used k_ref1 and k_ref2 when calculating the MI score (Moon et al., 1995; Sharma, 2000) did not consider this issue. A bandwidth selection rule for calculation of the MI score is required.

The approach taken in this paper is as follows. For simplicity, it is assumed that the bandwidth choice will be a multiple of the Gaussian reference bandwidth. This reduces the bandwidth selection problem to the selection of a scaling factor for the bandwidth, which we call α. The bandwidth factors used to calculate the MI score (i.e. k_1 in Eq. (4) and k_2 in Eq. (5)) are calculated as follows:

$$k_1 = \alpha\,k_{\mathrm{ref1}} \qquad (9)$$

$$k_2 = \alpha\,k_{\mathrm{ref2}} \qquad (10)$$

where α was selected in a range varying from 0.5 to 2.5, and k_ref1 and k_ref2 are as specified in Eqs. (7) and (8). When α is well chosen, the calculation of the ratio in Eq. (3) will not be overly affected by oversmoothing or undersmoothing in any of the estimated PDFs, and the estimated value of the MI score obtained from the data sample will be close to the "true" (population) value of the MI score. However, if α is poorly chosen, the estimated MI score may be very different from the true value, even if the sample size being used for the estimate is large.

³ We adopt the same bandwidth choice for both f_X(x) and f_Y(y). This simple approach is warranted if the distributions of X and Y are similar. For calculation of the lag-one auto-MI score (as discussed in this paper) this assumption is valid, because Y is simply a lagged version of X.
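A minimal pure-Python sketch of the estimator in Eqs. (3)-(5), using the scaled reference bandwidths of Eqs. (7)-(10), might look as follows. This is our own illustration rather than the authors' code; it assumes similar marginal distributions for X and Y and inlines the reference bandwidth constants:

```python
import math

def mi_kernel(x, y, alpha=1.5):
    """Kernel estimate of the MI score in bits (Eq. (3)).

    Gaussian kernels (Eqs. (4)-(5)) with the sphered bivariate bandwidth
    k2^2*S, and bandwidth factors k1 = alpha*1.06*n**-0.2 (Eqs. (7), (9))
    and k2 = alpha*n**(-1/6) (Eqs. (8), (10)).
    """
    n = len(x)
    k1 = alpha * 1.06 * n ** -0.2
    k2 = alpha * n ** (-1.0 / 6.0)

    # Sample means and covariance matrix S.
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x) / (n - 1)
    syy = sum((v - my) ** 2 for v in y) / (n - 1)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (n - 1)
    det_s = sxx * syy - sxy * sxy

    def f_uni(t, data, var):
        # Univariate Gaussian kernel density estimate, Eq. (4).
        h = k1 * math.sqrt(var)
        s = sum(math.exp(-(t - d) ** 2 / (2.0 * h * h)) for d in data)
        return s / (n * math.sqrt(2.0 * math.pi) * h)

    # Entries of (k2^2 * S)^-1 for the quadratic form in Eq. (5).
    c = 1.0 / (k2 * k2 * det_s)
    i11, i12, i22 = c * syy, -c * sxy, c * sxx
    norm = 1.0 / (n * 2.0 * math.pi * k2 * k2 * math.sqrt(det_s))

    def f_biv(u, v):
        # Bivariate Gaussian kernel density estimate, Eq. (5).
        s = 0.0
        for xi, yi in zip(x, y):
            du, dv = u - xi, v - yi
            s += math.exp(-0.5 * (i11 * du * du + 2.0 * i12 * du * dv
                                  + i22 * dv * dv))
        return norm * s

    total = 0.0
    for xi, yi in zip(x, y):
        total += math.log2(f_biv(xi, yi) / (f_uni(xi, x, sxx) * f_uni(yi, y, syy)))
    return total / n
```

For a time series x, the lag-one auto-MI score analysed in this paper would be `mi_kernel(x[:-1], x[1:])`. The double loop makes this O(n²), which is adequate for the sample sizes (30-1500) considered in the paper.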


This paper aims to empirically derive appropriate values of α for a range of underlying dependence structures for near-normal data. A known lag-one dependence structure of an autocorrelated time series is analysed using mutual information (note that this calculates the lag-one auto-MI score, the equivalent of the lag-one autocorrelation). Empirical trials are used to select appropriate values of α for calculating the lag-one auto-MI score. This is done by calculating a mean square error based on 500 trials, in which the differences between the small-sample estimate of the MI score and the "true" (population) MI score are recorded. For a given sample size, the α value with the smallest mean square error is selected.

3 Empirical trials to find a practical choice of α
As stated in the previous section, a good choice of the scaling factor for the bandwidth (α) will result in a sample estimate of the lag-one auto-MI score that is close to the "true" (population) value of the MI score. We aim to empirically derive practical choices of α for a range of underlying dependence structures and sample sizes.

The underlying lag-one dependence structures considered in this study were provided by autoregressive (AR(1)) models and autoregressive-moving average (ARMA(1,1)) models. These models were chosen because Eq. (2) can then be used to provide an exact expression for the "true" lag-one auto-MI score (provided the random deviate is normal). Additionally, an AR(1) model reproduces mainly short-term dependence, while an ARMA(1,1) model can reproduce some longer-term dependence. The ARMA(1,1) models were included in the study to see whether longer-term dependence has any influence on the choice of α for calculation of the MI score. The AR(1) models used were of the form

$$X_t = \phi X_{t-1} + \sqrt{1-\phi^2}\;\varepsilon_t \qquad (11)$$

where φ is the autoregressive parameter (the lag-one autocorrelation), and ε_t is the random deviate (either normal, with mean = 0 and variance = 1; or skewed, from a gamma distribution with mean = 0, variance = 1, skew = 0.5). The ARMA(1,1) models were of the form

$$X_t = \phi X_{t-1} + \sqrt{\frac{1-\phi^2}{1+\theta^2-2\phi\theta}}\;\left(\varepsilon_t - \theta\,\varepsilon_{t-1}\right) \qquad (12)$$

where φ is the autoregressive parameter, θ is the moving average parameter, and ε_t is the random deviate (either normal, with mean = 0 and variance = 1; or skewed, from a gamma distribution with mean = 0, variance = 1, skew = 0.5). The combinations of random deviate, model type, and model parameters led to 24 models being tested in this study, as shown in Tables 1 and 2. Table 2 also shows the lag-one autocorrelations (ρ) for the ARMA(1,1) models, which helps to compare these models with the AR(1) models. The models tested all produced a zero-mean, unit-standard-deviation time series.

For each of the 24 underlying models, the "true" (population) MI score was calculated. For the models which used a normally distributed random deviate (i.e. models 1-6 and 13-19), the "true" (population) lag-one auto-MI score was calculated exactly from the lag-one autocorrelation using Eq. (2). However, it was necessary to estimate the "true" lag-one auto-MI scores for the models which

Table 1. The autoregressive (AR(1)) models

Model no.   φ      Random deviate
1           0.0    Normal
2           0.17   Normal
3           0.3    Normal
4           0.5    Normal
5           0.7    Normal
6           0.9    Normal
7           0.0    Skewed
8           0.17   Skewed
9           0.3    Skewed
10          0.5    Skewed
11          0.7    Skewed
12          0.9    Skewed

Table 2. The autoregressive-moving average (ARMA(1,1)) models

Model no.   φ     θ     Random deviate   ρ
13          0.3   0.1   Normal           0.20
14          0.5   0.2   Normal           0.32
15          0.5   0.3   Normal           0.22
16          0.7   0.3   Normal           0.47
17          0.7   0.5   Normal           0.24
18          0.9   0.3   Normal           0.80
19          0.9   0.6   Normal           0.49
20          0.3   0.1   Skewed           0.20
21          0.5   0.2   Skewed           0.32
22          0.5   0.3   Skewed           0.22
23          0.7   0.3   Skewed           0.47
24          0.7   0.5   Skewed           0.24

used a gamma-distributed random deviate. These estimates were calculated using an averaged shifted histogram (ASH) probability density estimator (Scott, 1992), for a sample size of 10^7. This work was done using the statistical software package S-Plus (MathSoft, 1997) and the ASH library add-in (Scott, 1993). ASH requires specification of the grid (number of bins) to be used in the calculation of the histograms; the number of bins directly affects the bin size, which is analogous to a bandwidth. The shift parameter in the algorithm was set to its default value, and a 400 × 400 grid was used.

An example of the lag-one dependence structure for an ARMA(1,1) model with φ = 0.7 and θ = 0.3, a skewed random deviate (coefficient of skewness = 0.5) and a sample size of 200 is shown graphically in Fig. 1. For this example (model no. 23; see Table 2) the "true" (population) lag-one autocorrelation is 0.4716 and the estimated "true" lag-one auto-MI score is 0.194, while the sample estimates of these parameters were calculated to be 0.474 and 0.209 respectively. Fig. 1 also shows contours of the estimated bivariate PDF for this sample, while Fig. 2 shows the estimated univariate PDF for this sample, as calculated using the kernel density estimator in Eq. (4). A time series plot for this example is shown in Fig. 3.

Fig. 1. Example lag-one dependence structure. Contours of the estimated bivariate PDF are superimposed on the plot

Fig. 2. Estimated univariate PDF for the example in Fig. 1

Fig. 3. Time series plot for the example in Fig. 1

For each combination of underlying model and sample size, the procedure adopted for finding the best value of α was:
1. Generate 500 samples from the underlying model.
2. Calculate the sample lag-one auto-MI score for a range of possible α for each of the 500 samples.
3. Calculate the mean square error (MSE) for each possible α.
4. The lowest MSE identifies the best choice of α for that combination of underlying model and sample size.
The method for calculating the MSE is shown in Eq. (13):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\widehat{\mathrm{MI}}_i - \mathrm{MI}_{\mathrm{true}}\right)^2 \qquad (13)$$

where n is the number of trials (= 500), \widehat{MI}_i is the small-sample estimate of the MI score for the i'th trial, and MI_true is the "true" MI score. α was selected in a range varying from 0.5 to 2.5. The sample sizes used were 30, 50, 200, 800, and 1500.
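Under stated assumptions (normal deviates only; the gamma-deviate case and the kernel MI estimator itself are omitted), the simulation models of Eqs. (11)-(12) and the selection procedure of steps 1-4 and Eq. (13) can be sketched as follows. The function names, and the idea of passing the MI estimator in as a callable, are ours:

```python
import math
import random

def arma11(n, phi, theta=0.0, seed=None):
    """Simulate a zero-mean, unit-variance ARMA(1,1) series (Eq. (12)).

    theta = 0 recovers the AR(1) model of Eq. (11).  Normal deviates only;
    the paper's skewed (gamma) deviate case is not reproduced here.
    """
    rng = random.Random(seed)
    scale = math.sqrt((1.0 - phi * phi) / (1.0 + theta * theta - 2.0 * phi * theta))
    xt, e_prev, out = 0.0, rng.gauss(0.0, 1.0), []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        xt = phi * xt + scale * (e - theta * e_prev)
        e_prev = e
        out.append(xt)
    return out

def best_alpha(simulate, mi_estimate, mi_true, alphas, trials=500):
    """Steps 1-4 above: pick the alpha with the smallest MSE (Eq. (13)).

    simulate() returns one synthetic series; mi_estimate(series, alpha)
    returns the sample lag-one auto-MI score (e.g. a kernel estimator
    implementing Eq. (3)).  Returns (best alpha, MSE per alpha).
    """
    mse = {}
    for a in alphas:
        errs = [(mi_estimate(simulate(), a) - mi_true) ** 2 for _ in range(trials)]
        mse[a] = sum(errs) / trials
    return min(mse, key=mse.get), mse
```

The scaling in `arma11` divides out the stationary variance of the ARMA(1,1) recursion, giving the unit-variance series the paper requires; with `theta=0` the scale reduces to sqrt(1 - phi^2), as in Eq. (11).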

4 Results

4.1 Results of small-sample trials
Lag-one auto-MI scores from the 500 trials for each combination of model (1-24), sample size, and α were calculated. These results were used to calculate MSEs based on the deviations between the trial values of the MI score and the "true" MI score. As expected, the calculated MSE values decreased as sample size increased; however, this aspect of the results is not of interest here. The aim is to identify the value of α that gives the smallest MSE for a given sample size: this is the best choice of α for that combination of model and sample size. The values of interest are the smallest MSE obtained for a given combination, the corresponding α, and the percentage changes in MSE for the other values of α considered for that combination of underlying model and sample size.

4.2 MSE values and selection of the best α choices
The smallest MSEs obtained for given combinations of underlying model and sample size were calculated, but are not shown here for space reasons. The corresponding α values, which are the best choice of α for each combination of model and sample size, are shown in Tables 3 and 4. It can be seen from the tables that the selected α values are quite stable over the range of sample sizes and underlying models tested, especially for lag-one autocorrelations of 0.7 or less (i.e. all models except 6, 12 and 18). For these cases, if the sample size is 200 or greater then α = 1.5 is generally selected. For sample sizes of 30 or 50, α is more variable, but the value obtained for any given case is approximately (1.8 − ρ), where ρ is the underlying lag-one autocorrelation for that model. This is shown in Fig. 4 (excluding results from models 6, 12 and 18). We therefore propose a practical method for choosing α when the lag-one autocorrelation is 0.7 or less: if the sample size is 200 or greater, choose α = 1.5; otherwise choose α = (1.8 − r), where r is the sample estimate of the lag-one autocorrelation.


Table 3. Best α values obtained - AR(1) models

            Sample size
Model no.   30    50    200   800   1500
1           1.9   1.7   1.6   1.5   1.5
2           1.7   1.6   1.5   1.5   1.5
3           1.6   1.6   1.5   1.5   1.5
4           1.4   1.5   1.5   1.5   1.5
5           1.1   1.1   1.3   1.4   1.4
6           0.6   0.6   0.8   1.1   1.1
7           1.9   1.7   1.5   1.5   1.4
8           1.7   1.6   1.5   1.5   1.4
9           1.6   1.5   1.5   1.5   1.4
10          1.4   1.4   1.4   1.4   1.4
11          1.1   1.1   1.3   1.3   1.5
12          0.6   0.6   0.8   1.2   1.5

Table 4. Best α values obtained - ARMA(1,1) models

            Sample size
Model no.   30    50    200   800   1500
13          1.7   1.7   1.5   1.5   1.5
14          1.6   1.6   1.5   1.5   1.5
15          1.7   1.7   1.6   1.5   1.5
16          1.4   1.4   1.4   1.5   1.5
17          1.7   1.6   1.5   1.5   1.5
18          0.7   0.8   1.0   1.2   1.3
19          1.1   1.2   1.3   1.4   1.5
20          1.7   1.6   1.5   1.4   1.4
21          1.6   1.6   1.5   1.4   1.4
22          1.7   1.6   1.5   1.5   1.4
23          1.3   1.4   1.4   1.4   1.4
24          1.7   1.6   1.5   1.5   1.4

Fig. 4. Best α choice vs. ρ for small sample sizes (n = 30, n = 50)


Fig. 5. Selection of α for model 2

The longer-term dependence present in the ARMA(1,1) models had little influence on the choice of α. The α values selected for the ARMA(1,1) models were similar to those selected for AR(1) models with similar lag-one autocorrelations. The only exception was model 19, which had the highest moving-average component (θ = 0.6) and thus the strongest long memory. Even so, this effect appeared only for small sample sizes (30 and 50), and α = (1.8 − ρ) is still a reasonable selection rule for these cases.

Figure 5 plots the percentage change in MSE against α for one of the models considered in this study. The percentage changes in MSE are plotted relative to the MSE for the best choice of α. This figure contains more information on the choice of α than is shown in Tables 3 and 4. For example, the figure shows that for a sample size of 200, a 13% decrease in α from the best obtained resulted in a 150% increase in the MSE of the estimated MI.

Tables 5 and 6 show the efficiency loss (in terms of percentage increase in MSE, compared to the smallest MSE obtained for that combination of model and sample size) for adopting α = 1.5 (for n ≥ 200) or α = (1.8 − ρ) (for n < 200). It can be seen that our method for choosing α works well, with little efficiency loss, for all the cases considered. (The highest efficiency loss is 21%; in most cases it is below 5%.) Note that the method does not give a value of α if ρ > 0.7, so no results are recorded for models 6, 12, and 18.

The mean square error consists of a bias component and a variance component. The bias and variance were analysed for the results from model 2. There was a small negative bias in the sample estimates of the lag-one auto-MI scores, and for all of the results from model 2 the variance represented more than 98% of the MSE.

It is interesting to compare our results to those presented in Darbellay and Vajda (1999), who used an adaptive histogram to estimate the probability density functions and calculated MI scores for samples from a normal distribution, for sample sizes of 250 and larger. The results reported in Darbellay and Vajda (1999) show a significantly higher bias, which decreases with increasing sample size. This is to be expected, given that a kernel density estimate is a superior measure of the probability density compared to histogram-based methods.

Table 5. Efficiency loss (%) from selecting α by the suggested method - AR(1) models

            Sample size
Model no.   30    50    200   800   1500
1           3     9     21    0     0
2           10    0     0     0     0
3           7     2     0     0     0
4           2     7     0     0     0
5           0     0     4     2     1
6           -     -     -     -     -
7           2     15    0     0     16
8           10    0     0     0     14
9           3     0     0     0     7
10          1     2     4     4     4
11          0     0     7     6     0
12          -     -     -     -     -

Table 6. Efficiency loss (%) from selecting α by the suggested method - ARMA(1,1) models

            Sample size
Model no.   30    50    200   800   1500
13          10    1     0     0     0
14          5     3     0     0     0
15          5     0     1     0     0
16          4     3     1     0     0
17          4     0     0     0     0
18          -     -     -     -     -
19          14    3     5     2     0
20          4     0     0     10    2
21          2     0     0     5     1
22          0     0     0     0     12
23          0     4     4     2     3
24          4     0     0     0     5

5 Conclusions
As a practical method for smoothing parameter selection in calculating the MI score, this study found that α = 1.5 (applied to Eqs. (9) and (10)) is a good scaling factor for the bandwidth for sample sizes of 200 or more. For sample sizes of less than 200, a value of α = (1.8 − r) should be used, where r is the sample estimate of the lag-one autocorrelation. This result applies for a near-normal parent PDF and a range of underlying autoregressive dependence structures. The scaling factor selection rule is very stable, with an efficiency loss of less than 5% for most of the cases considered in this study. However, the rule does have limitations. It was obtained for a normal or near-normal parent distribution, and may not be appropriate if this assumption is badly violated. The rule is also not appropriate for lag-one autocorrelations greater than about 0.7.
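The proposed rule can be stated as a small helper (function name ours):

```python
def choose_alpha(n, r):
    """Bandwidth scaling factor alpha from the rule above.

    n -- sample size; r -- sample lag-one autocorrelation.
    Valid for near-normal data with r <= 0.7; the result multiplies the
    Gaussian reference bandwidth factors via Eqs. (9)-(10).
    """
    if r > 0.7:
        raise ValueError("rule not calibrated for lag-one autocorrelation > 0.7")
    return 1.5 if n >= 200 else 1.8 - r
```

Note that the two branches agree at r = 0.3 for small samples (1.8 − 0.3 = 1.5), so the rule changes smoothly with n for moderately autocorrelated data.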

The longer-term dependence present in the ARMA(1,1) models had little in¯uence on the choice of a. The a selected for the ARMA(1,1) models were similar to those selected for the AR(1) models with similar lag-one autocorrelations. The only exception to this was model 19, which had the highest moving-average component (h ˆ 0:6) and thus had strong long memory. Even so, this effect was only for small sample sizes (30 and 50) and a ˆ …1:8 r† is still a reasonable selection rule for these small sample sizes. When using the approach proposed here, data that are strongly skewed should be transformed to near-normal before the dependence structure is analysed. An alternative (but less established) approach would be to use adaptive kernel methods, where the bandwidth is larger in the tails than in the peak of the PDF. The assumption that the selected bandwidth would be a multiple of the Gaussian reference bandwidth provided a reasonably consistent performance across a wide variety of dependence structures that are commonly encountered in hydrologic practice. The results presented here are speci®c to time series analysis, where the data is collected at regular time intervals. Results may vary when the two variables are time-independent. The bandwidth selection strategy presented in this paper is a major improvement on the strategies used previously. To our knowledge, this is the ®rst study of appropriate kernel bandwidth choice for use in calculating the MI criterion. We are con®dent that the bandwidth selection rules presented here provide a sound basis for using MI to quantify a broad range of underlying dependence structures, especially the type of dependence that is typically found in hydrologic data. While the empirical results presented here were limited to linear dependence structures for the sake of simplicity, it must be emphasised that the strong point of MI is that it quanti®es both linear and nonlinear dependence into a single and easy to use number. 
We expect these rules to provide conservative estimates of the MI for both linear and nonlinear data. It must be pointed out that, if sufficient time and resources are available to the modeller, data-based bandwidth choices would have greater efficiency than the simple rule we have proposed here. We must, however, point out that no such rules for selecting data-based bandwidths for use in estimating the MI have yet been developed. The bandwidth selection rule presented here has implications for any future applications of the MI or PMI criteria for identification of predictors of hydrological and other variables. It is likely that the proposed bandwidth choices would lead to more stable and useful predictors, and hence to better predictions or forecasts, depending on the use to which the model is put. Work on the application of the proposed rule to predictor identification of daily rainfall, for use in a stochastic rainfall simulation model, is underway, and results will be presented at a later stage.
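To make the intended use of the selected bandwidth concrete, the following is a minimal sketch of a kernel-based estimator of the lag-one auto-MI score: the sample average of the log ratio of the joint density of (x_t, x_{t+1}) to the product of its marginals, all estimated with Gaussian kernels. Using a single scalar bandwidth for the joint and marginal estimates, and including each sample point in its own density estimate, are simplifying assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def kde_gauss(points, data, h):
    """Gaussian (product-kernel) density estimate at `points`.
    `points` is (m, d), `data` is (n, d), `h` a scalar bandwidth."""
    n, d = data.shape
    diff = (points[:, None, :] - data[None, :, :]) / h
    k = np.exp(-0.5 * np.sum(diff**2, axis=2)) / ((2 * np.pi) ** (d / 2) * h**d)
    return k.mean(axis=1)

def lag_one_auto_mi(x, h):
    """Sample estimate of the lag-one auto-MI score:
    mean over i of log[ f(x_i, x_{i+1}) / (f(x_i) f(x_{i+1})) ]."""
    x = np.asarray(x, dtype=float)
    pairs = np.column_stack([x[:-1], x[1:]])         # lag-one pairs
    f_joint = kde_gauss(pairs, pairs, h)             # 2-d joint KDE
    f_x = kde_gauss(pairs[:, :1], pairs[:, :1], h)   # marginal of x_t
    f_y = kde_gauss(pairs[:, 1:], pairs[:, 1:], h)   # marginal of x_{t+1}
    return np.mean(np.log(f_joint / (f_x * f_y)))
```

For an AR(1) series the estimate should be clearly larger than for an independent series of the same length, mirroring the lag-one autocorrelation's role as a linear dependence measure.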
