Nonparametric density estimation with adaptive varying window size

Vladimir Katkovnik and Ilya Shmulevich
Signal Processing Laboratory, Tampere University of Technology

ABSTRACT We propose a new method of kernel density estimation with a varying adaptive window width. This method is different from traditional ones in two aspects. First, we use symmetric as well as nonsymmetric left and right kernels with discontinuities and show that the fusion of these estimates results in accuracy improvement. Second, we develop estimates with adaptive varying window widths based on the so-called intersection of confidence intervals (ICI) rule. Several examples of the proposed method are given for different types of densities and the quality of the adaptive density estimate is assessed by means of numerical simulations.

1 INTRODUCTION

In signal processing applications, such as filtering, optimal algorithms often require knowledge of the underlying densities of the signals and/or noise. As these densities are usually unknown, unrealistic assumptions are frequently made, compromising the performance of the algorithms in question. A common approach to this problem is to estimate the density from the data. If a particular form of the density is assumed or known, parametric estimation is used. If nothing is assumed about the density shape, nonparametric estimation is employed. Besides being widely used in pattern recognition and classification [1], nonparametric probability density estimation has been applied in image processing [2,3], communications [4], and many other fields. One of the best-known and most popular techniques of nonparametric density estimation is the kernel or Parzen density estimate [1,5,6]. Given N samples X1, ..., XN drawn from a population with density function f(x), the Parzen density estimate at x is given by

    f̂_h(x) = (1/N) Σ_{i=1}^N (1/h) κ((x − X_i)/h),    (1)

where κ(·) is a window or kernel function and h is the window width, smoothing parameter, or simply the kernel size. Traditionally, it is assumed that ∫ κ(u) du = 1 and that κ(·) is symmetric, that is, κ(u) = κ(−u). One popular choice is the Gaussian kernel

    κ(u) = (1/√(2π)) exp(−u²/2).    (2)

The kernel size h is the most important characteristic of the Parzen density estimate [8,9]. One can compute the ideal or optimal value of h by minimizing the mean-square error

    MSE{f̂_h(x)} = E{ [f̂_h(x) − f(x)]² }    (3)

between the true and estimated densities, with respect to h. The MSE is a function of x, and so the optimal kernel size h is also a function of x. In order to minimize the MSE, the best compromise between variance and bias must be selected. Using Taylor series approximations of the moments of f̂_h(x) and noting that

    MSE{f̂_h(x)} = [E{f̂_h(x)} − f(x)]² + Var{f̂_h(x)},    (4)

where

    E{f̂_h(x)} = ∫ κ(u) f(x + hu) du    (5)

and

    Var{f̂_h(x)} = (1/N) [ (1/h) ∫ κ²(u) f(x + hu) du − (E{f̂_h(x)})² ],    (6)

the optimal value of the kernel size can be shown [5] to be

    h_{0,sym}(x) = [ f(x) ∫ κ²(u) du / ( N [f''(x) ∫ u² κ(u) du]² ) ]^{1/5},    (7)

provided that h → 0, N → ∞, and Nh → ∞. The notation h_{0,sym} indicates that the kernel is symmetric. As can be seen from equation (7), the optimal kernel size depends on the value of the density and on its second derivative. It is also possible to obtain an optimal constant kernel size, independent of x, by minimizing either the integrated mean-square error ∫ MSE{f̂_h(x)} dx or the expected mean-square error ∫ MSE{f̂_h(x)} f(x) dx.
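To make the preceding definitions concrete, the Parzen estimate (1) with the Gaussian kernel (2) can be sketched in a few lines of Python. This is our own illustrative sketch, not part of the paper's method; the function names and test data are ours:

```python
import numpy as np

def gaussian_kernel(u):
    """Symmetric Gaussian kernel of equation (2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def parzen_estimate(x, samples, h, kernel=gaussian_kernel):
    """Fixed-width Parzen estimate of equation (1) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # (len(x), N) array of scaled distances (x - X_i) / h
    u = (x[:, None] - samples[None, :]) / h
    return kernel(u).mean(axis=1) / h

# Illustration: estimate a standard normal density at x = 0 from N = 1000 draws;
# the true value there is 1/sqrt(2*pi) ≈ 0.399.
rng = np.random.default_rng(0)
X = rng.standard_normal(1000)
print(parzen_estimate(0.0, X, h=0.3))
```

Choosing h by hand in this sketch illustrates exactly the difficulty discussed above: too small an h inflates the variance of the estimate, while too large an h inflates its bias.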

Clearly, in practice, one does not have access to the true density function f(x) that is to be estimated. Thus, a number of heuristic approaches can be taken for finding the window width. For instance, the optimal constant h can be computed for, say, the normal distribution (as a function of N) and then used for computing the estimate f̂_h(x) [1]. Since density estimates are often used for classification purposes, another approach is to determine h on the basis of the expected probability of misclassification [8]. Although h is usually taken to be a constant, several important approaches have been proposed to vary it. One is the well-known k-th nearest neighbor estimate [10]. Another is the adaptive kernel estimate proposed in [11]. A number of other papers consider the problem of kernel size selection (e.g., [12,13,14,15,16,17]).

In this paper, we propose a novel data-driven method that incorporates two ideas. First, we introduce and use nonsymmetric "left" and "right" kernels, while the estimate itself is produced by fusing the corresponding left and right estimates. For example, the right and left kernels can be given by

    κ_r(u) = (2/√(2π)) exp(−u²/2) for u ≥ 0,  κ_r(u) = 0 for u < 0,
    κ_l(u) = κ_r(−u),    (8)

respectively. For the fused estimate, the actual kernel is a weighted sum of the nonsymmetric kernels. In this way, we obtain a nonsymmetric kernel with independently controlled sizes of its left and right parts. Second, we propose and develop a special rule (statistic) for choosing a data-driven kernel size, which is selected in a point-wise manner for every argument value of the density. Our main motivation for this more complex estimator is to make it adaptive to unknown and varying smoothness of the density to be estimated. In particular, we wish to enable it to estimate nonsymmetric density functions with varying curvature and possible discontinuities. This method is described in Section 2.
In Section 3, we assess the quality of the density estimate by comparing it to the known density being estimated, under the mean-square error criterion. It will be shown that estimates based on variable-sized kernels are superior to estimates based on optimal constant-sized kernels. Moreover, we will show that the fusion of nonsymmetric kernels results in accuracy improvement.
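The left and right kernels of equation (8) are straightforward to implement. The following small sketch (ours, for illustration) also checks numerically that each one-sided kernel still integrates to one:

```python
import numpy as np

def kappa_r(u):
    """Right kernel of equation (8): a Gaussian restricted to u >= 0,
    scaled by 2 so that it still integrates to one."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0.0, 2.0 * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi), 0.0)

def kappa_l(u):
    """Left kernel: the mirror image of the right kernel."""
    return kappa_r(-np.asarray(u, dtype=float))

# Numerical check of the normalization on a fine grid.
u = np.linspace(-6.0, 6.0, 120001)
du = u[1] - u[0]
print((kappa_r(u) * du).sum(), (kappa_l(u) * du).sum())  # both ≈ 1.0
```

Note that, unlike the symmetric kernel, κ_r has a nonzero first moment ∫ u κ_r(u) du; this is precisely why the first-order bias term in the analysis of Section 2 no longer vanishes for the one-sided kernels.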

2 ADAPTIVE KERNEL SIZE SELECTION

2.1 Description of the method

The method of adaptive kernel size selection is based on the ICI rule, proposed in [18] and later developed for signal filtering in [19]. One of several attractive properties of the ICI rule is that it is spatially adaptive over a wide range of signal classes, in the sense that its quality is close to that which one could achieve if the smoothness of the original signal were known in advance [18]. We briefly reintroduce this method here in the context of density estimation.

Consider the bias E{f̂_h(x)} − f(x) and variance Var{f̂_h(x)} of the density estimate given in (1). A Taylor series approximation of the bias results in [1]

    Bias{f̂_h(x)} ≈ (h²/2) f''(x) ∫ u² κ(u) du,   if κ(u) = κ(−u),
    Bias{f̂_h(x)} ≈ h f'(x) ∫ u κ(u) du,          if κ(u) ≠ κ(−u).    (9)

In the case of a symmetric kernel, the first-order approximation term is zero. Thus, in order to include the effect of the bias in equation (4) when computing the MSE, the second-order approximation must be used. For the nonsymmetric kernel, the first-order approximation is sufficient. For the variance, however, the same term can be used in equation (4) for both symmetric and nonsymmetric kernels: the first-order approximation

    Var{f̂_h(x)} ≈ (1/(Nh)) f(x) ∫ κ²(u) du    (10)

suffices, and the second-order approximation results only in unnecessary complexity [1].

First, suppose the kernel is symmetric, and consider the ratio of the standard deviation Std{f̂_h(x)} of the estimate to the absolute value of its bias |Bias{f̂_h(x)}|, evaluated at the optimal value h_{0,sym}(x) given in equation (7). Substituting (7) into the first line of (9) and into (10), we get

    Std{f̂_{h_{0,sym}}(x)} / |Bias{f̂_{h_{0,sym}}(x)}|
      = √( f(x) ∫ κ²(u) du / (N h_{0,sym}) ) / ( (h²_{0,sym}/2) |f''(x) ∫ u² κ(u) du| )
      = 2.

Now consider the nonsymmetric kernel. The asymptotically optimal bandwidth is then different from (7) and has the form

    h_{0,nsym}(x) = [ f(x) ∫ κ²(u) du / ( 2N [f'(x) ∫ u κ(u) du]² ) ]^{1/3}.

In that case, the ratio of the standard deviation to the absolute value of the bias turns out to be

    Std{f̂_{h_{0,nsym}}(x)} / |Bias{f̂_{h_{0,nsym}}(x)}| = √2.

It is useful to note that, since the standard deviation and the bias are monotonically increasing and decreasing, respectively, as h → 0, we have

    |Bias{f̂_h(x)}| ≤ (1/2) Std{f̂_h(x)},   if κ(u) = κ(−u) and h ≤ h_{0,sym},
    |Bias{f̂_h(x)}| ≤ (1/√2) Std{f̂_h(x)},  if κ(u) ≠ κ(−u) and h ≤ h_{0,nsym}.    (11)

For a given kernel size h, the estimation error can be represented as

    |f̂_h(x) − f(x)| = |Bias{f̂_h(x)} + ξ_h(x)| ≤ |Bias{f̂_h(x)}| + |ξ_h(x)|,

where ξ_h(x) is a random variable with zero mean and standard deviation equal to Std{f̂_h(x)}. Thus,

    |f̂_h(x) − f(x)| ≤ |Bias{f̂_h(x)}| + χ_p · Std{f̂_h(x)}    (12)

holds with an arbitrary probability p, for a suitably chosen χ_p. Using the relationships in (11) together with (12), we get that for h below the optimal size,

    |f̂_h(x) − f(x)| ≤ (1/2 + χ_p) · Std{f̂_h(x)},   if κ(u) = κ(−u) and h ≤ h_{0,sym},
    |f̂_h(x) − f(x)| ≤ (1/√2 + χ_p) · Std{f̂_h(x)},  if κ(u) ≠ κ(−u) and h ≤ h_{0,nsym},

that is,

    |f̂_h(x) − f(x)| ≤ Γ · Std{f̂_h(x)},    (13)

where Γ = 1/2 + χ_p for the symmetric kernel and Γ = 1/√2 + χ_p for the nonsymmetric kernel. Larger values of Γ correspond to larger values of p. Even though it may seem that larger values of Γ are to be preferred, as we shall see in Section 3, smaller values of Γ, and thus increased risk, can yield better estimation accuracy. The ICI rule essentially tests the hypothesis h ≤ h_{0,sym} (or h ≤ h_{0,nsym}) for various values of h and in this way selects an h close to the optimal size, as follows.

Suppose H = {h1 < h2 < · · · < hJ} is a finite collection of kernel sizes, starting with a small h1. Using inequality (13), we determine a sequence of confidence intervals

    D_j = [L_j, U_j],  j = 1, ..., J,
    L_j = f̂_{h_j}(x) − Γ · Std{f̂_{h_j}(x)},
    U_j = f̂_{h_j}(x) + Γ · Std{f̂_{h_j}(x)},    (14)

each corresponding to a kernel size in H. The ICI rule can then be stated as follows [19].

ICI Rule: Consider the intersection of the intervals D_j, 1 ≤ j ≤ i, with increasing i, and let i+ be the largest of those i for which the intervals D_j, 1 ≤ j ≤ i, have a point in common. This i+ defines the adaptive kernel size h+(x) = h_{i+} and, consequently, the density estimate f̂_{h+(x)}(x).

This procedure can be implemented by Algorithm 1. It is important to note that the kernel size selection procedure based on the ICI rule requires only knowledge of the density estimate and its variance, for which equation (10) can be used with f̂_{h+(x)}(x) in place of f(x). Γ is a design parameter of the algorithm, and the selection of its value is discussed in Section 3.

Algorithm 1 Adaptive Window Width Selection

    L ⇐ −∞, U ⇐ +∞, i ⇐ 1
    while (L ≤ U) and (i ≤ J) do
        L̄ ⇐ f̂_{h_i}(x) − Γ · Std{f̂_{h_i}(x)}
        Ū ⇐ f̂_{h_i}(x) + Γ · Std{f̂_{h_i}(x)}
        L ⇐ max(L, L̄),  U ⇐ min(U, Ū)
        i ⇐ i + 1
    end while
    h+(x) ⇐ h_{i−1}
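Algorithm 1 translates directly into code. The sketch below is our own illustration (the helper interface is hypothetical): for a single point x, it takes the sequence of estimates f̂_{h_j}(x) and standard deviations Std{f̂_{h_j}(x)} over the increasing grid H and returns the adaptive size h+(x):

```python
import numpy as np

def ici_select(estimates, stds, h_grid, gamma):
    """ICI rule (Algorithm 1): intersect the confidence intervals (14)
    for increasing kernel sizes and return the largest size for which
    the running intersection is still non-empty."""
    lo, hi = -np.inf, np.inf
    h_plus = h_grid[0]
    for f_j, s_j, h_j in zip(estimates, stds, h_grid):
        lo = max(lo, f_j - gamma * s_j)   # L_j of equation (14)
        hi = min(hi, f_j + gamma * s_j)   # U_j of equation (14)
        if lo > hi:                       # intersection became empty: stop
            break
        h_plus = h_j                      # h_j is still admissible
    return h_plus

# Toy illustration: the fourth interval lies far from the first three,
# so the rule stops at the third kernel size.
print(ici_select([1.0, 1.0, 1.0, 5.0], [1.0, 1.0, 1.0, 0.1],
                 [0.1, 0.2, 0.3, 0.4], gamma=1.0))  # 0.3
```

In practice, the standard deviations are obtained from (10) with the estimate substituted for the unknown f(x), and the selection is repeated independently at every point x.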

2.2 Density estimation examples

In this section, we will illustrate the use of the kernel size selection procedure based on the ICI rule. As a first example, consider estimating the density function shown in Figure 1a. This example is intended to qualitatively demonstrate the behavior of the adaptive kernel size selection procedure, using the symmetric as well as nonsymmetric kernels, given in equations (2) and (8), respectively. The allowable kernel sizes (H) start with h1 = 0.01 and increase until h300 = 3.0 with a step of 0.01. Figures 1b,c,d show the kernel sizes chosen by the ICI rule, corresponding to the three kinds of kernels (κ (·), κr (·), and κl (·)) used. The number of observations N is equal to 10000. Especially worthy of notice is the behavior of non-symmetric kernels in the presence of discontinuities in the density function. For instance, the kernel size of the right kernel κr (·) is rather high at the point of the first discontinuity (x = −1) and becomes smaller as it approaches the second discontinuity (x = 0), after which the situation is similar (Figure 1c). This behavior corresponds to the common sense idea that a large window size for the right kernel should be chosen at x = −1+ while for x = 0− , the data available for the right kernel estimator is very small and hence, the kernel size is accordingly small. Here, the notation x± means limε→0,ε>0 (x ± ε). For the left kernel κl (·), the behavior is the opposite (Figure 1d). Immediately after the first discontinuity, the left kernel still contains very few observations and consequently has small size which increases up until x = 0− . Similarly, just after this point, the kernel size again becomes quite small, since even small sizes encompass two different density regions. Finally, as shown in Figure 1b, the size of the symmetric kernel increases towards the middle between discontinuities, that is, at x = ±0.5. Also, very large kernel sizes, in the form of spikes, can be seen exactly at the points of discontinuities. 
The reason for this phenomenon is the following. It can be shown, using equation (5), that at a point of discontinuity of the density function, the expectation of the estimate satisfies

    lim_{h→0} E{f̂_h(x)} = lim_{h→0} ∫ κ(u) f(x + hu) du = ( f(x+) + f(x−) ) / 2.

Therefore, the ICI rule behaves correctly and in accordance with this fact. The reason that large kernel sizes are chosen is that the density function on either side of the discontinuity is constant, and larger kernel sizes decrease the variance and, consequently, the MSE. Again, it is worth noting that the estimate uses no information about the underlying density f(x).

Another way to evaluate the method of adaptive kernel size selection is to compare it to an optimal varying kernel size. Rather than using equation (7), which is an asymptotic result, we shall compare our method against a more stringent criterion, namely the empirically obtained varying kernel size h*(x) that minimizes the MSE between the known density f(x) and the estimated density f̂_{h*(x)}(x). In other words,

    h*(x) = argmin_{h(x)} E[ ( f(x) − f̂_{h(x)}(x) )² ].    (15)
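Because f(x) is known in such an experiment, the minimization in (15) can be approximated by brute force. The sketch below is our own illustration (the function names and the normal test density are ours): it estimates the MSE at each x by Monte Carlo over repeated samples and picks, for every x separately, the h in the grid with the smallest average squared error:

```python
import numpy as np

def empirical_h_star(x_grid, h_grid, true_f, sampler, N=500, trials=20, seed=0):
    """Brute-force approximation of h*(x) in equation (15) for a known
    density true_f, using a fixed-width Gaussian Parzen estimate."""
    rng = np.random.default_rng(seed)
    mse = np.zeros((len(h_grid), len(x_grid)))
    for _ in range(trials):
        X = sampler(rng, N)                        # one fresh sample of size N
        for j, h in enumerate(h_grid):
            u = (x_grid[:, None] - X[None, :]) / h
            f_hat = np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
            mse[j] += (f_hat - true_f(x_grid)) ** 2
    return h_grid[np.argmin(mse, axis=0)]          # pointwise argmin over h

# Illustration with a standard normal density.
x_grid = np.linspace(-2.0, 2.0, 9)
h_grid = np.linspace(0.05, 1.0, 20)
true_f = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
sampler = lambda rng, n: rng.standard_normal(n)
print(empirical_h_star(x_grid, h_grid, true_f, sampler))
```

This is exactly the "stringent" benchmark described above: it requires the true density and is therefore available only in simulation.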

Figure 1: Density f(x) to be estimated (a) and adaptive window widths corresponding to a symmetric kernel κ(·) (b), right kernel κ_r(·) (c), and left kernel κ_l(·) (d) (Γ = 4).

In this example, we consider estimating the density function shown in Figure 2a. The density is zero outside the interval [0, 1]. The allowable kernel sizes H start with h1 = 0.01 and increase to h100 = 1.0 in steps of 0.01. The number of observations N is equal to 10000. An optimal kernel size, from the set of allowable kernel sizes, was found for every x using equation (15); it is shown in Figure 2b as a solid line. The dashed line shows the variable kernel size h+(x) obtained using the ICI rule. As can be seen, their behavior is very similar. As expected, the kernel size is larger in the flat region of the density than in the regions of the peaks, where the kernel size becomes smaller.

3 ACCURACY ASSESSMENT

In this section, we consider the performance of the proposed method of adaptive kernel size selection. In addition, we employ a fusion of two estimates based on nonsymmetric kernels. Our goal is to compare this performance to that of a Parzen estimate with a constant kernel size. The comparison is done for the density given in Figure 1a. Figure 3 shows the root mean-square error (RMSE) between the known density f(x) and the density estimated with a variety of constant symmetric kernel sizes, ranging from 0.001 to 3.0. It can be seen from this figure that there exists an optimal constant h, equal to 0.0158, that attains a minimum RMSE of 0.0426. Our next step is to compare the optimal constant-kernel-size estimate to the estimate based on fusing two ICI-rule estimates. The nonsymmetric kernels given in (8) are used for the estimates

    f̂_l(x) = (1/N) Σ_{i=1}^N (1/h_l^+(x)) κ_l( (x − X_i)/h_l^+(x) ),
    f̂_r(x) = (1/N) Σ_{i=1}^N (1/h_r^+(x)) κ_r( (x − X_i)/h_r^+(x) ),    (16)

where h_l^+(x) and h_r^+(x) are the nonsymmetric kernel sizes selected by the ICI rule.

Figure 2: (a) Density f(x) to be estimated; (b) optimal varying window width h*(x) (solid line) minimizing the MSE between the known density and the estimate, and the varying window width h+(x) based on the ICI rule (dashed line) (Γ = 5).

Figure 3: RMSE between the known density and estimates with constant kernel size, for various h (h is shown on a logarithmic scale).

Figure 4: RMSE values for the nonsymmetric estimates, as a function of Γ ranging from 0.01 to 3. The solid line corresponds to f̂_l(x) and the dashed line to f̂_r(x).

The fused estimate can then be defined as a weighted combination of the left and right estimates,

    f̃(x) = [ f̂_l(x)/Var{f̂_l(x)} + f̂_r(x)/Var{f̂_r(x)} ] / [ 1/Var{f̂_l(x)} + 1/Var{f̂_r(x)} ],    (17)

where the weights are inversely proportional to the variances of the respective estimates. Thus, the smaller the variance of an estimate, the larger its contribution to the overall estimate. Unfortunately, equation (10) cannot be used directly for the variances, since in practice we do not know the true density f(x). One approach is to substitute the estimate itself for f(x) in (10). In that case, equation (17) reduces to

    f̃(x) = ( h_l^+(x) + h_r^+(x) ) / ( h_l^+(x)/f̂_l(x) + h_r^+(x)/f̂_r(x) ).    (18)
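Equation (18) itself is a one-liner. The sketch below (ours, for illustration) fuses pointwise left and right estimates given the adaptive sizes selected by the ICI rule:

```python
import numpy as np

def fuse_estimates(f_left, f_right, h_left, h_right):
    """Fused estimate of equation (18): the variance-weighted combination
    (17) after substituting the estimates for f(x) in (10). All four
    arguments are values at the same point(s) x; the one-sided estimates
    are assumed strictly positive where the fusion is evaluated."""
    f_left = np.asarray(f_left, dtype=float)
    f_right = np.asarray(f_right, dtype=float)
    return (h_left + h_right) / (h_left / f_left + h_right / f_right)

# When both one-sided estimates agree, the fusion reproduces that value.
print(fuse_estimates(2.0, 2.0, 0.1, 0.3))  # ≈ 2.0
# Otherwise the estimate with the larger kernel size (hence smaller
# variance) dominates the weighted combination.
print(fuse_estimates(1.0, 3.0, 0.9, 0.1))
```

Note that a one-sided estimate equal to zero would make the denominator blow up; in practice the fusion is only evaluated where both estimates are positive.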

Prior to showing the results of this estimate, let us focus on the selection of the parameter Γ, which we shall refer to as the threshold parameter. In general, larger values of Γ tend to result in kernel sizes that are larger than the optimal ones and hence oversmooth the density, while smaller values of Γ give smaller-than-optimal kernel sizes and undersmooth it. Our experience indicates that, fortunately, the accuracy of estimation is not overly sensitive to the selection of Γ, and there exists a range of values (usually between 0.1 and 1.2) that result in similar RMSE values. Figure 4 shows the RMSE values as a function of Γ for the nonsymmetric estimates. It is interesting to note that the same value of Γ, approximately equal to 1, minimizes the RMSE for both the left and right estimates. Figure 5 shows the adaptive varying kernel sizes selected by the ICI rule for Γ = 1.06. Finally, by fusing the estimates according to equation (18), the RMSE between f̃(x) and f(x) is equal to 0.0198, while the RMSE for the optimal estimate with constant symmetric kernel size (see Figure 3) is equal to 0.0426. It should be mentioned that the optimal constant-kernel-size estimate, to which we compared our method, can never be achieved in practice, as the true density is unknown. At the same time, the proposed method, making no assumptions about the underlying density, results in significantly better estimation accuracy.

Figure 5: The kernel sizes chosen by the ICI rule (Γ = 1.06) for h_l^+(x) (top figure) and h_r^+(x) (bottom figure).

As can be seen by comparing Figures 1 and 5, the kernel sizes selected by the ICI rule exhibit greater variability for smaller values of Γ; however, the accuracy of estimation is improved.

4 CONCLUSIONS

We have proposed a new method for varying the bandwidth in kernel density estimation. The method is based on the ICI rule and requires only knowledge of the variance of the estimate. In our case, as the true density is unknown, the variance of the estimator is approximated by replacing the true density with the estimate. It is also possible to implement an iterative technique in which successive estimates are used to compute the variance by formula (6), which is then used in the ICI rule to form new estimates. Although we have considered this method for one-dimensional densities, there is no conceptual difficulty in extending it to multi-dimensional densities. In that case, as with other techniques, not only the size but also the shape of the kernel is an important parameter. We have shown, by means of numerical simulations, that the proposed method can perform significantly better than any constant-bandwidth method.

5 REFERENCES

[1] K. Fukunaga, Statistical Pattern Recognition, 2nd edition, Academic Press, 1990.
[2] D. Sindoukas, N. Laskaris, and S. Fotopoulos, "Algorithms for color image edge enhancement using potential functions," IEEE Signal Processing Letters, Vol. 4, No. 9, pp. 269-272, 1997.
[3] D. Wright, J. Stander, and K. Nicolaides, "Nonparametric density estimation and discrimination from images of shapes," Journal of the Royal Statistical Society, Series C: Applied Statistics, Vol. 46, No. 3, pp. 365-380, 1997.
[4] S. Zabin and G. Wright, "Nonparametric density estimation and detection in impulsive interference channels," IEEE Transactions on Communications, Vol. 42, No. 2-4, pp. 1684-1711.
[5] E. Parzen, "On the estimation of a probability density function and the mode," Ann. Math. Stat., Vol. 33, pp. 1065-1076, 1962.
[6] T. Cacoullos, "Estimation of a multivariate density," Ann. Inst. Stat. Math., Vol. 18, pp. 179-189, 1966.
[7] J.-N. Hwang, S.-R. Lay, and A. Lippman, "Nonparametric multivariate density estimation: a comparative study," IEEE Transactions on Signal Processing, Vol. 42, No. 10, pp. 2795-2810, 1994.
[8] S. Raudys, "On the effectiveness of Parzen window classifier," Informatica, Vol. 2, No. 3, pp. 434-454, 1991.
[9] B. W. Silverman, "Choosing the window width when estimating a density," Biometrika, Vol. 65, pp. 1-11, 1978.
[10] D. Loftsgaarden and C. Quesenberry, "A nonparametric estimate of a multivariate density function," Ann. Math. Statist., Vol. 36, pp. 1049-1051, 1965.
[11] L. Breiman, W. Meisel, and E. Purcell, "Variable kernel estimates of multivariate densities," Technometrics, Vol. 19, pp. 135-144, 1977.
[12] I. Abramson, "On bandwidth variation in kernel estimates - a square root law," Ann. Statist., Vol. 10, pp. 1217-1223, 1982.
[13] G. Terrell and D. Scott, "Variable kernel density estimation," Ann. Statist., Vol. 20, No. 3, pp. 1236-1265, 1992.
[14] S.-T. Chiu, "An automatic bandwidth selector for kernel density estimation," Biometrika, Vol. 79, No. 4, pp. 771-782, 1992.
[15] S. J. Sheather and M. C. Jones, "A reliable data-based bandwidth selection method for kernel density estimation," J. R. Statist. Soc. B, Vol. 53, No. 3, pp. 683-690, 1991.
[16] P. Hall, S. J. Sheather, M. C. Jones, and J. S. Marron, "On optimal data-based bandwidth selection in kernel density estimation," Biometrika, Vol. 78, No. 2, pp. 263-269, 1991.
[17] C. C. Taylor, "Bootstrap choice of the smoothing parameter in kernel density estimation," Biometrika, Vol. 76, No. 4, pp. 705-712, 1989.
[18] A. Goldenshluger and A. Nemirovsky, "On spatial adaptive estimation of nonparametric regression," Mathematical Methods of Statistics, Vol. 6, No. 2, pp. 135-170, 1997.
[19] V. Katkovnik, "A new method for varying adaptive bandwidth selection," IEEE Transactions on Signal Processing, Vol. 47, No. 9, pp. 2567-2571, 1999.