A Bayesian Regularized Neural Network Approach to ... - IEEE Xplore

A Bayesian Regularized Neural Network Approach to Short-Term Traffic Speed Prediction

Chenye Qiu, Chunlu Wang, Xingquan Zuo, Binxing Fang

School of Computer, Beijing University of Posts and Telecommunications, Beijing, China

Abstract—Short-term traffic speed prediction is very important in intelligent transportation systems. Neural networks have been widely used for traffic speed prediction. However, the classical neural network often lacks satisfactory generalization ability, which results in imprecise predictions of traffic speed. Regularization is an essential technique for improving the generalization ability of a neural network. It is realized by adding a weight decay function to the energy function of the neural network. One of the key problems of the regularization technique is how to choose the parameters of the weight decay function. In this paper, the Bayesian technique is used to optimize these regularization parameters, and a Bayesian regularized neural network (BRNN) for traffic speed prediction is proposed. The speed prediction model was validated on real-world traffic speeds of Hangzhou city collected from the floating car system. The experimental results show that the proposed method improves the generalization ability of neural networks and achieves better prediction results than several traditional prediction models.

Keywords-intelligent transportation systems; traffic speed prediction; neural network; Bayesian regularization

I. INTRODUCTION

With the rapid development of intelligent transportation systems (ITS), large amounts of traffic data are collected by various types of traffic equipment, such as loop detectors and cameras. The extensive traffic data collected from this equipment make it possible to develop traffic condition prediction models. Predicted traffic information for the near future can assist traffic authorities in traffic control and give travelers route guidance. Therefore, short-term traffic prediction models have become a hotspot in ITS in the past decade.

This work was supported by Chinese Universities Scientific Fund (2009RC0208) and National Key Technology R&D Program of China (2009BAG13A01).

978-1-4577-0653-0/11/$26.00 ©2011 IEEE

Short-term prediction aims to predict traffic condition variables, such as flow, speed and travel time, 5 to 30 minutes into the future. A wide variety of techniques have been used to develop traffic condition prediction models, including time series models [1]–[3], K nearest neighbors nonparametric models [4], [5], Bayesian network theory [6], the Kalman filtering algorithm [7], and neural network based models [8]–[10]. Although these models solve the short-term traffic prediction problem to some extent, several aspects still need further consideration.
1) Existing research mainly focuses on traffic flow or travel time prediction, for which the literature is vast. Research on traffic speed prediction is comparatively scarce, especially traffic speed prediction based on floating car data.
2) A floating car system based on traces of GPS positions gathers real-time traffic information effectively [10]. Previous research mainly used traffic data collected by loop detectors to predict traffic status. Research based on data collected from floating car systems is relatively scarce.
3) These prediction models often neglect how traffic patterns differ between the days of a week and between the hours of a day. Reference [11] divides one week into four groups, and the days in the same group are considered to have similar traffic patterns, but it does not address the hour-to-hour variation within a day. It is hard to build a single model that fits a whole day, because traffic patterns in rush hours are quite different from those at other times of the day. Hence, in this paper, we consider both the day-to-day variation and the hour-to-hour variation of the traffic pattern in our prediction model.
Neural networks are widely used in various fields because of their special characteristics, such as self-learning, nonlinear mapping and parallel distributed processing. In the traffic speed prediction problem, the traffic speed data used to train the neural network are not completely clean, especially the speed data collected from the floating car system. Noise is inevitable when GPS devices are used to collect traffic speed. A trained neural network that fits the noisy training data well may fail to generalize to new data. This is “overfitting”, which often happens when neural networks are used for prediction. The traditional neural network often shows poor generalization ability. Many researchers have focused on how to improve the generalization of neural networks [12], [13]. Regularization is an efficient technique to improve the generalization capability of neural networks [14]–[16], [20]. It is conducted by adding a weight decay function to the energy function of the neural network. With the weight decay function, the weights are smaller and the output of the neural network is smoother. When using these regularization techniques, the main problem is to decide how much the weight decay term influences the learning algorithm. Reference [12] analyzes different weight decay functions and different parameters of the weight decay function, and tests their performance with several examples. Reference [16] deals with this problem from a multi-objective optimization point of view. The Bayesian technique is an effective and simple way to solve this problem [15], [20]. Reference [19] introduces this technique to deal with the problem of travel time prediction. In this paper, a Bayesian regularized neural network model (BRNN) is employed to solve the traffic speed prediction problem. The traffic speed data of Hangzhou city collected by the floating car system are used to validate our method. The rest of the paper is organized as follows. In Section II, the source data and the traffic speed prediction problem are described formally. In Section III, we introduce the BRNN model.
In Section IV, several models for comparison are presented. We explain our experimental procedure and present the experimental results in Section V. Section VI concludes the paper.

II. SOURCE DATA AND PROBLEM DESCRIPTION

In this paper, we focus on the problem of predicting the traffic speed of an urban road section. The data are collected from the floating car system of Hangzhou city. A floating car system based on traces of GPS positions gathers real-time traffic information effectively. It relies on cars with in-vehicle GPS devices (floating cars) traveling on a road network [10]. These cars send the traffic data they collect during their travel to the data center of the floating car system. The data are then sent to different users. The traffic information obtained by the floating car system can cover a wide range of the road network and reflect the actual state of the road traffic. Accurate prediction of traffic speed can be very useful for both traffic authorities and road users. There are two kinds of raw data in the system: 1-min interval data and 15-min interval data. There are three kinds of traffic speed data: historical, current, and predictive. We use historical data and current data to obtain the predictive data. Suppose the current time is $t$, and the current traffic speed of this road section is $v(t)$. The historical traffic speeds of this road section at times $t-1, t-2, \ldots, t-h$ are $v(t-1), v(t-2), \ldots, v(t-h)$, and $v(t+1)$ is the future traffic speed to be predicted. We can build a prediction model, as in (1), by analyzing the historical data set.

$$\hat{v}(t+1) = f\big(v(t), v(t-1), \ldots, v(t-h)\big) \tag{1}$$

where $f(\cdot)$ is a nonlinear mapping function.

III. BAYESIAN REGULARIZED NEURAL NETWORK

A three-layer neural network model has the ability to approximate any nonlinear function, and it is used in diverse fields. Nowadays, there are more than 40 types of neural networks. Among all these models, the Back Propagation (BP) neural network is the most widely used. The BP neural network is a kind of multi-layer feed-forward neural network. As shown in Figure 1, it has three or more layers, including an input layer, a middle layer (hidden layer), and an output layer. The neurons in adjacent layers are completely connected, but there is no connection between neurons in the same layer. It is proved that a BP neural network can approximate any nonlinear function given enough neurons in each layer [15]. In the aforementioned problem, the inputs of the neural network are the current speed and historical speeds of the target road section, $v(t), v(t-1), \ldots, v(t-h)$. The output is the traffic speed in the future, $\hat{v}(t+1)$.

Fig. 1. The structure of the Bayesian regularized neural network used for the traffic speed prediction (inputs $v(t), v(t-1), \ldots, v(t-h)$; output $v(t+1)$).
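As a minimal sketch of how the network inputs and target in (1) might be assembled from a speed series, the Python snippet below builds one row per time step. The function name `make_lagged_samples` and the sample speeds are illustrative assumptions and are not taken from the original study.

```python
import numpy as np

def make_lagged_samples(speeds, h):
    """Build (inputs, targets) pairs for v_hat(t+1) = f(v(t), ..., v(t-h))."""
    speeds = np.asarray(speeds, dtype=float)
    if speeds.ndim != 1 or len(speeds) < h + 2:
        raise ValueError("need a 1-D series with at least h + 2 values")
    rows, targets = [], []
    for t in range(h, len(speeds) - 1):
        # Inputs are ordered as v(t), v(t-1), ..., v(t-h).
        rows.append(speeds[t - h:t + 1][::-1])
        # The target is the speed one interval ahead, v(t+1).
        targets.append(speeds[t + 1])
    return np.array(rows), np.array(targets)

# Hypothetical speeds in km/h, used only to show the shapes involved.
X, y = make_lagged_samples([30.0, 32.0, 35.0, 31.0, 28.0], h=2)
print(X.shape, y.shape)
```

Each row of `X` holds the $h + 1$ most recent speeds and the matching entry of `y` is the speed to be predicted, so a series of length $L$ yields $L - h - 1$ samples.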

In this paper, we use a simple and effective way to suppress the effects of noise and improve the generalization ability of the neural network. Typically, the objective of training a neural network is to obtain a set of network weights and biases that minimizes the error between the real speed and the predicted speed. The usual objective function is the mean squared error (mse):

$$E_D = \frac{1}{n}\sum_{t=1}^{n}\left(d_t - x_{out,t}\right)^2 \tag{2}$$

where $n$ is the number of training samples, $d_t$ represents the real speed and $x_{out,t}$ is the network output. However, minimizing this classical objective function alone often overfits the training data. In the traffic speed prediction problem, the training data are not clean. If the network overfits the training data, it performs poorly on the testing data. To improve the generalization ability of the neural network and make it perform well on new cases, we add a weight decay term to the objective function. With this weight decay function, the network weights become smaller, which improves generalization. The new objective function is:

$$E = \alpha E_w + \beta E_D \tag{3}$$

where $E_w$ is the weight decay function and $\alpha$, $\beta$ are the regularization parameters, which represent the relative importance of the weight decay term. If $\alpha > \beta$, the training emphasizes weight reduction, which leads to a smooth network output. There are many types of weight decay term. In [14], Xu et al. presented four kinds of weight decay term and analyzed their effectiveness in different fields. In our research, the most popular quadratic weight decay function is used:

$$E_w = \frac{1}{2}\sum_{j=1}^{N}[\mathbf{w}]_j^2 \tag{4}$$

where $[\cdot]_j$ means the $j$th component of the vector and $N$ is the number of all the weights of the network. The optimal regularization parameters $\alpha$, $\beta$ can be determined by the Bayesian technique. In Bayesian theory, the weights of the network are considered to be random variables. Given the data set, $\alpha$ and $\beta$, the posterior probability is:

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{P(D \mid \mathbf{w}, \beta, M)\, P(\mathbf{w} \mid \alpha, M)}{P(D \mid \alpha, \beta, M)} \tag{5}$$

where $D$ represents the data set, $M$ is the particular neural network model used, $P(D \mid \mathbf{w}, \beta, M)$ is the likelihood function, $P(\mathbf{w} \mid \alpha, M)$ is the prior density, and $P(D \mid \alpha, \beta, M)$ is the normalization factor which ensures that the total probability is 1. If the noise in the data and the prior on the weights are both Gaussian, then the probability densities can be written as:

$$P(D \mid \mathbf{w}, \beta, M) = (\pi/\beta)^{-n/2}\exp(-\beta E_D) \tag{6}$$

$$P(\mathbf{w} \mid \alpha, M) = (\pi/\alpha)^{-N/2}\exp(-\alpha E_w) \tag{7}$$

Substituting (6) and (7) into (5), we get

$$P(\mathbf{w} \mid D, \alpha, \beta, M) = \frac{(\pi/\beta)^{-n/2}(\pi/\alpha)^{-N/2}\exp\big(-(\alpha E_w + \beta E_D)\big)}{P(D \mid \alpha, \beta, M)} = \frac{1}{Z_F(\alpha, \beta)}\exp\big(-E(\mathbf{w})\big) \tag{8}$$

In Bayesian theory, maximizing the posterior probability is equivalent to minimizing the regularized objective function $E(\mathbf{w})$. The regularization parameters can then be optimized by using Bayes' rule:

$$P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\, P(\alpha, \beta \mid M)}{P(D \mid M)} \tag{9}$$

If the prior $P(\alpha, \beta \mid M)$ is assumed to be uniform, maximizing the posterior can be achieved by maximizing $P(D \mid \alpha, \beta, M)$, which is the normalization factor in (5):

$$P(D \mid \alpha, \beta, M) = \frac{P(D \mid \mathbf{w}, \beta, M)\, P(\mathbf{w} \mid \alpha, M)}{P(\mathbf{w} \mid D, \alpha, \beta, M)} = \frac{(\pi/\beta)^{-n/2}\exp(-\beta E_D)\,(\pi/\alpha)^{-N/2}\exp(-\alpha E_w)}{\frac{1}{Z_F(\alpha, \beta)}\exp\big(-E(\mathbf{w})\big)} = \frac{Z_F(\alpha, \beta)}{(\pi/\beta)^{n/2}(\pi/\alpha)^{N/2}} \tag{10}$$

$Z_F(\alpha, \beta)$ can be estimated by a Taylor series expansion around the minimum point $\mathbf{w}^{MP}$:

$$Z_F \approx (2\pi)^{N/2}\Big(\det\big((\mathbf{H}^{MP})^{-1}\big)\Big)^{1/2}\exp\big(-E(\mathbf{w}^{MP})\big) \tag{11}$$

where $\mathbf{H} = \alpha\nabla^2 E_w + \beta\nabla^2 E_D$ is the Hessian matrix of the objective function. Substituting (11) into (10), taking the logarithm, and setting the derivatives with respect to $\alpha$ and $\beta$ to zero gives the optimal values of $\alpha$ and $\beta$ at the minimum point. During this process, the Hessian matrix needs to be computed. It can be computed by the Gauss-Newton approximation, in the same way as in the Levenberg-Marquardt (LM) algorithm. Then we obtain:

$$\alpha^{MP} = \frac{\gamma}{2E_w(\mathbf{w}^{MP})}, \qquad \beta^{MP} = \frac{n - \gamma}{2E_D(\mathbf{w}^{MP})} \tag{12}$$

where $\gamma = N - 2\alpha\,\mathrm{tr}\big((\mathbf{H}^{MP})^{-1}\big)$ is called the effective number of parameters. It lies between 0 and $N$. For more details about Bayesian regularization, see [15]. The steps of the BRNN learning algorithm are as follows:
1) Initialize $\alpha$, $\beta$, and the weights randomly.
2) Update the weights as follows:

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \big[2\beta\,\mathbf{J}_k^T\mathbf{J}_k + 2\alpha\,\mathbf{I}\big]^{-1}\mathbf{g}_k \tag{13}$$

where $\mathbf{I}$ is an $N \times N$ identity matrix, $\mathbf{J}$ is the Jacobian matrix [18], and $\mathbf{g} = \left[\frac{\partial E(\mathbf{w})}{\partial w_1}, \frac{\partial E(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial E(\mathbf{w})}{\partial w_N}\right]^T$ is the gradient of the objective function.
3) Compute $\gamma$, $\alpha$, and $\beta$.
4) Iterate steps 2) and 3) until the training goal is reached.

IV. COMPARATIVE MODELS

To evaluate the usefulness of the BRNN model, several common prediction methods are employed for comparison. Four models are described in the following.

A. Traditional neural network (NN) model
In order to show the effectiveness of the Bayesian regularized neural network, a common neural network is used for the traffic speed prediction problem. It is trained with the Levenberg-Marquardt learning algorithm.

B. K nearest neighbors nonparametric regression (KNNNR) model
The KNNNR model is widely used in the traffic prediction field. Denote a state vector of traffic speed as:

$$\mathbf{x}(t) = [v(t), v(t-1), \ldots, v(t-h)] \tag{14}$$

Compute the Euclidean distance between the new state vector and the historical state vectors to find the $k$ nearest neighbors. Then the predicted speed can be computed as follows:

$$\hat{v}(t+1) = \frac{\sum_{i=1}^{k} v_i(t+1)/Dist_i}{\sum_{i=1}^{k} 1/Dist_i} \tag{15}$$

where $Dist_i$ means the Euclidean distance between the new state vector and the $i$th nearest neighbor, and $v_i(t+1)$ is the speed that followed the $i$th nearest neighbor. For more information about KNNNR, see [4], [5].

C. Weighted Moving average (WMA) model
A moving average model over the $h+1$ most recent time intervals is computed as:

$$\hat{v}(t+1) = \frac{v(t) + v(t-1) + \cdots + v(t-h)}{h+1} \tag{16}$$

Considering that the speed in a time interval nearer to time interval $t+1$ is more correlated with the traffic speed at time interval $t+1$, a weighted moving average was employed here:

$$\hat{v}(t+1) = \frac{\alpha_0 v(t) + \alpha_1 v(t-1) + \cdots + \alpha_h v(t-h)}{\alpha_0 + \alpha_1 + \cdots + \alpha_h} \tag{17}$$

where $\alpha_0, \alpha_1, \ldots, \alpha_h$ are parameters to be decided.

D. Auto Regression (AR) model
The AR model gives the predictive value as follows:

$$\hat{v}(t+1) = \alpha_0 v(t) + \alpha_1 v(t-1) + \cdots + \alpha_h v(t-h) + \varepsilon \tag{18}$$

where $\varepsilon$ is a zero-mean random variable and $\alpha_0, \alpha_1, \ldots, \alpha_h$ are the parameters to be estimated by the least squares method.

V. EXPERIMENT PROCEDURE AND RESULTS

A. Data Preparation
The floating car system of Hangzhou city covers most of the urban area. We use the label in the system to denote a specific road section. The northbound speed of Road No.540 (see Figure 2) was chosen. A data set from May 20, 2007 to June 20, 2005 was collected. Reference [11] mentions day-to-day variation in one week. Besides, hour-to-hour variation in one day was also considered in our research. According to [11], the days of one week can be divided into four groups: G1 (Monday), G2 (Tuesday, Wednesday and Thursday), G3 (Friday), and G4 (Saturday and Sunday). In order to eliminate hour-to-hour variation, data from the morning peak (7 o'clock to 9 o'clock) were chosen. Hence, in our research, data from the morning peak on days of the G3 group were chosen to validate our model. Some data in the floating car system were lost, so we finally had 10 days' data to carry out the experiment. 7 days' data were used to build the model and 3 days' data were used to test the accuracy. As mentioned in Section II, there are two types of data in the floating car system, i.e. 1-min interval data and 15-min interval data. The 1-min interval data were chosen because we focused on short-term traffic speed prediction. The 1-min interval data were aggregated and averaged into 5-min periods as follows:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \tag{19}$$

where $m$ is 5 in this paper. The data were normalized to values between 0 and 1.

Fig. 2. The Geographical Information of the Target Road

B. Error Measurements
Mean absolute error (MAE) and mean squared error (MSE) are applied to assess the accuracy of prediction.

$$mae = \frac{1}{num}\sum_{t=1}^{num}\left|X_{Real}(t) - X_F(t)\right| \tag{20}$$

$$mse = \frac{1}{num}\sum_{t=1}^{num}\left[X_{Real}(t) - X_F(t)\right]^2 \tag{21}$$

where $X_{Real}$ represents the real speed, $X_F$ is the predicted output, and $num$ denotes the number of test samples.
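The aggregation in (19) and the error measures in (20) and (21) are simple enough to state in a few lines of Python. The sketch below is illustrative only: the function names and sample values are assumptions, and dropping an incomplete trailing group in `aggregate_mean` is a choice made here, since the text does not say how such a remainder was handled.

```python
import numpy as np

def aggregate_mean(x, m=5):
    """Average consecutive groups of m values, as in (19).

    Assumption: a trailing group with fewer than m values is dropped.
    """
    x = np.asarray(x, dtype=float)
    usable = (len(x) // m) * m
    return x[:usable].reshape(-1, m).mean(axis=1)

def mae(real, predicted):
    """Mean absolute error, as in (20)."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(real - predicted)))

def mse(real, predicted):
    """Mean squared error, as in (21)."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((real - predicted) ** 2))

# Hypothetical 1-min speeds (km/h) aggregated into two 5-min values.
print(aggregate_mean([30, 32, 34, 36, 38, 40, 40, 40, 40, 40], m=5))
# Hypothetical real and predicted speeds.
print(mae([40.0, 50.0, 30.0], [42.0, 47.0, 30.0]))
print(mse([40.0, 50.0, 30.0], [42.0, 47.0, 30.0]))
```

Because MSE squares each deviation, it penalizes occasional large errors more heavily than MAE does, which is worth keeping in mind when the two measures rank models differently.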

C. Traffic Speed Prediction Model Based on BRNN
In this paper, the $h$ in (1) was chosen as 2. That is, we used the traffic speeds at 3 successive time intervals, $t$, $t-1$ and $t-2$, to predict the traffic speed at interval $t+1$. A three-layer neural network was built, with 3 neurons in the input layer, 12 neurons in the hidden layer, and 1 neuron in the output layer. The number of neurons in the hidden layer is an important feature of the network and was chosen through repeated experiments. The transfer function in the hidden layer was the hyperbolic tangent sigmoid (tansig) function. The transfer function in the output layer was the log-sigmoid (logsig) function. The inputs were $v(t)$, $v(t-1)$, $v(t-2)$ of the target road section, and the output was $v(t+1)$. The training goal was set to 0.01. Different initial weights and biases of the neural network lead to different results. The initial weights and biases were random in our study. The BRNN learning algorithm was used to train the neural network 10 times with different initial weights and biases, and the best of the resulting networks was chosen.
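To make the training loop concrete, the following Python sketch alternates the weight update (13) with the re-estimation of $\gamma$, $\alpha$ and $\beta$ in (12) for a 3-12-1 network with a tanh hidden layer and a logistic output. It is a sketch under stated assumptions, not the authors' implementation: it uses the sum-of-squares forms of $E_D$ and $E_w$ from [15] (which differ from (2) and (4) only by constant factors) so that (12) and (13) apply directly; the damping term `mu`, the starting values of `alpha` and `beta`, the fixed iteration count and the synthetic data are all additions made here; and the training goal of 0.01 is not modeled.

```python
import numpy as np

def forward_and_jacobian(w, X, n_hid):
    """Outputs of a tanh-hidden, logistic-output network and J = d(output)/d(w)."""
    n, n_in = X.shape
    i = n_in * n_hid
    W1 = w[:i].reshape(n_hid, n_in)           # input-to-hidden weights
    b1 = w[i:i + n_hid]                       # hidden biases
    W2 = w[i + n_hid:i + 2 * n_hid]           # hidden-to-output weights
    b2 = w[-1]                                # output bias
    a = np.tanh(X @ W1.T + b1)                # hidden activations, shape (n, n_hid)
    out = 1.0 / (1.0 + np.exp(-(a @ W2 + b2)))
    d_out = out * (1.0 - out)                 # slope of the logistic output
    d_hid = d_out[:, None] * W2 * (1.0 - a ** 2)
    J_W1 = (d_hid[:, :, None] * X[:, None, :]).reshape(n, i)
    J = np.hstack([J_W1, d_hid, d_out[:, None] * a, d_out[:, None]])
    return out, J

def train_brnn(X, d, n_hid=12, iters=50, seed=0):
    """Gauss-Newton Bayesian regularization; returns (w, alpha, beta, gamma)."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    N = n_in * n_hid + 2 * n_hid + 1          # total number of weights and biases
    w = rng.normal(scale=0.5, size=N)         # step 1: random initial weights
    alpha, beta = 0.01, 1.0                   # assumed starting values
    mu, gamma = 0.01, float("nan")            # mu is a damping term, not part of (13)
    for _ in range(iters):
        out, J = forward_and_jacobian(w, X, n_hid)
        e = out - d
        F_old = beta * (e @ e) + alpha * (w @ w)
        g = 2 * beta * (J.T @ e) + 2 * alpha * w
        H = 2 * beta * (J.T @ J) + 2 * alpha * np.eye(N)
        # Step 2: update (13), raising mu until the objective decreases.
        while mu < 1e10:
            w_try = w - np.linalg.solve(H + mu * np.eye(N), g)
            e_try = forward_and_jacobian(w_try, X, n_hid)[0] - d
            if beta * (e_try @ e_try) + alpha * (w_try @ w_try) < F_old:
                w, mu = w_try, max(mu / 10.0, 1e-10)
                break
            mu *= 10.0
        else:
            break                             # no improving step found: stop training
        # Step 3: re-estimate gamma, alpha and beta as in (12).
        out, J = forward_and_jacobian(w, X, n_hid)
        e = out - d
        H = 2 * beta * (J.T @ J) + 2 * alpha * np.eye(N)
        gamma = N - 2 * alpha * np.trace(np.linalg.inv(H))
        alpha = gamma / (2 * (w @ w))
        beta = (n - gamma) / (2 * (e @ e))
    return w, alpha, beta, gamma

# Synthetic normalized data, standing in for the floating car speeds.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0.0, 1.0, size=(120, 3))
d_demo = 0.2 + 0.6 * X_demo.mean(axis=1) + rng.normal(scale=0.02, size=120)
w, alpha, beta, gamma = train_brnn(X_demo, d_demo)
print("effective number of parameters:", gamma, "of", w.size)
```

Repeating the call with different `seed` values and keeping the network with the lowest validation error would mirror the ten restarts described above. The reported `gamma` indicates how many of the 61 weights and biases the data actually constrain.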

Fig. 4. The Results Obtained by NN model (predicted and observed speed in km/h versus sample ID).

D. Traffic Speed Prediction Models Based on Comparative Models
For the NN model, the parameters were the same as those of the BRNN model. Hence, the only difference between them was the training algorithm. The LM learning algorithm was likewise used to train the neural network 10 times, and the best run was chosen. In the KNNNR model, the only parameter that needs to be decided is the number of neighbors, $k$. If there are too few neighbors, the output fluctuates strongly. As the number of neighbors increases, the output gets smoother. However, the accuracy declines when the number of neighbors exceeds a certain value, and the output approaches a fixed value. Repeated experiments showed that 10 is the most suitable value for this case. In the WMA model, we need to design the weights. Because $v(t)$ is more correlated with $v(t+1)$ than $v(t-1)$ and $v(t-2)$ are, and $v(t-1)$ is more correlated with $v(t+1)$ than $v(t-2)$ is, the weights $\alpha_0$, $\alpha_1$, $\alpha_2$ were assigned 2, 0.75, and 0.25, respectively. This WMA model outperforms the MA model. In the AR model, the parameters were estimated by the least squares method. In our experiment, $\alpha_0$, $\alpha_1$, $\alpha_2$ were 0.6145, 0.1847, and 0.1291, respectively.
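The three non-neural baselines can be sketched compactly in Python, following (15), (17) and (18) with the settings reported in this section ($k = 10$ and WMA weights 2, 0.75, 0.25). The function names, the small `eps` guard against a zero distance, the omission of an intercept in the AR fit, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_hist, y_hist, x_new, k=10, eps=1e-12):
    """Inverse-distance weighted average of the k nearest neighbors, as in (15)."""
    X_hist = np.asarray(X_hist, dtype=float)
    y_hist = np.asarray(y_hist, dtype=float)
    dist = np.linalg.norm(X_hist - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    # eps avoids division by zero for an exact match (an addition to (15)).
    weights = 1.0 / (dist[nearest] + eps)
    return float(np.sum(weights * y_hist[nearest]) / np.sum(weights))

def wma_predict(x, weights=(2.0, 0.75, 0.25)):
    """Weighted moving average of [v(t), v(t-1), v(t-2)], as in (17)."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, x) / np.sum(weights))

def fit_ar(X, y):
    """Least squares estimate of the AR coefficients in (18), without intercept."""
    coef, *_ = np.linalg.lstsq(np.asarray(X, float), np.asarray(y, float), rcond=None)
    return coef

def ar_predict(coef, x):
    """Point prediction of (18); the zero-mean noise term is dropped."""
    return float(np.dot(coef, np.asarray(x, dtype=float)))

# Toy history: rows are [v(t), v(t-1), v(t-2)], targets are v(t+1).
X_hist = np.array([[40.0, 38.0, 36.0], [42.0, 40.0, 38.0],
                   [30.0, 31.0, 33.0], [28.0, 30.0, 31.0]])
y_hist = np.array([41.0, 43.0, 29.0, 27.0])
x_new = [41.0, 39.0, 37.0]
print(knn_predict(X_hist, y_hist, x_new, k=2))
print(wma_predict(x_new))
print(ar_predict(fit_ar(X_hist, y_hist), x_new))
```

Both WMA and AR form a linear combination of the most recent speeds. The difference is that the WMA weights are chosen by hand and normalized to sum to one, whereas the AR coefficients are fitted to the training data.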

Fig. 5. The Results Obtained by KNNNR model (predicted and observed speed in km/h versus sample ID).

E. Experimental Results and Analysis
Among the 10 days' data, 7 days' data were used to build the models and the remaining 3 days' data were used to test their accuracy. Figures 3 to 7 show the predicted and observed speeds obtained by the five aforementioned models.

Fig. 3. The Results Obtained by BRNN model (predicted and observed speed in km/h versus sample ID).

Fig. 6. The Results Obtained by WMA model (predicted and observed speed in km/h versus sample ID).

Fig. 7. The Results Obtained by AR model (predicted and observed speed in km/h versus sample ID).

The mae and mse of all five models are shown in TABLE I.

TABLE I
MAE AND MSE OF DIFFERENT MODELS

       BRNN       NN         KNNNR      WMA        AR
MAE    10.1446    13.8975    10.7519    11.3534    10.9504
MSE    145.0001   293.7279   169.7859   200.7054   190.5897

As shown in Figure 3, the speeds predicted by the BRNN model are close to the real speeds, and this method yields the smallest mae and mse according to TABLE I. In Figure 5, it is clear that the KNNNR model's results are smooth, without very large or very small values. This is related to the number of neighbors we chose. The KNNNR model yields the second smallest mae and mse. From Figure 6 and Figure 7, we can see that the speeds predicted by the WMA model and the AR model follow trends similar to the real speed, but with an obvious time delay. The AR model's results are slightly better than those of the WMA model. In Figure 4, we can see that the output of the NN model fluctuates strongly, and it gives the worst prediction accuracy. It clearly overfits the training data and performs badly on the test samples. In comparison, the BRNN model performs well on the testing data. From this, we can see that Bayesian regularization improves the generalization ability of the neural network significantly, and that it is very effective for the traffic speed prediction problem.

VI. CONCLUSION

Traffic condition prediction has become a hotspot in ITS. Accurate traffic condition prediction is very useful for traffic authorities and common road users. Many existing studies address traffic flow and travel time prediction, but studies on traffic speed prediction based on data gathered by GPS devices are relatively few. In this paper, a BRNN model is proposed to solve the traffic speed prediction problem. The neural network approach is applied because of its special characteristics. However, neural networks often suffer from the “overfitting” phenomenon, especially when the traffic speed data are not completely clean. In order to reduce the effects of noise and improve the generalization ability of the neural network, a weight decay term is added to the energy function of the neural network, and the Bayesian technique is used to determine the optimal regularization parameters. The floating car data of Hangzhou city are used to validate our model. The experimental results show that the BRNN model outperforms the common neural network model and several other common prediction models. From the analysis of the experimental results, we can see that different models have different characteristics. In future research, we will investigate whether a hybrid model would help improve the prediction accuracy.

REFERENCES

[1] D. Billings and J. S. Yang, “Application of the ARIMA Models to Urban Roadway Travel Time Prediction - A Case Study,” in Conf. Rec. 2006 IEEE Int. Conf. Systems, Man, and Cybernetics, pp. 2529-2534.
[2] A. Guin, “Travel Time Prediction Using a Seasonal Autoregressive Integrated Moving Average Time Series Model,” in Conf. Rec. 2006 IEEE Int. Conf. Intelligent Transportation Systems, pp. 493-496.
[3] Xinyu Min, Jianming Hu, Qi Chen, Tongshuai Zhang, and Yi Zhang, “Short-Term Traffic Flow Forecasting of Urban Network Based on Dynamic STARIMA Model,” in Conf. Rec. 2009 IEEE Int. Conf. Intelligent Transportation Systems, pp. 1-6.
[4] B. Smith, B. Williams, and R. K. Oswald, “Comparison of parametric and nonparametric models for traffic flow forecasting,” Transportation Research Part C: Emerging Technologies, vol. 10, no. 4, pp. 303–321, Aug. 2002.
[5] J. C. Weng, Z. W. Hu, Q. Yu, and F. T. Ren, “Floating Car Data Based Nonparametric Regression Model for Short-Term Travel Speed Prediction,” Journal of Southwest Jiaotong University (English Edition), vol. 15, no. 3, pp. 223–230, Jul. 2007.
[6] S. L. Sun, C. S. Zhang, and G. Q. Yu, “A Bayesian network approach to traffic flow forecasting,” IEEE Trans. Intelligent Transportation Systems, vol. 7, no. 1, pp. 124–132, Mar. 2006.
[7] C. Antoniou, H. N. Koutsopoulos, and G. Yannis, “An efficient nonlinear Kalman filtering algorithm using simultaneous perturbation and applications in traffic estimation and prediction,” in Conf. Rec. 2007 IEEE Int. Conf. Intelligent Transportation Systems, pp. 217-222.
[8] J. Yu, G. L. Chang, H. W. Ho, and Y. Liu, “Variation Based Online Travel Time Prediction Using Clustered Neural Networks,” in Conf. Rec. 2008 IEEE Int. Conf. Intelligent Transportation Systems, pp. 85-90.
[9] K. Cao and M. Zhao, “A Dynamic Traffic Forecast Using Hybrid Wavelet Network with an Adaptive Genetic Local Search,” in Conf. Rec. 2007 IEEE Int. Conf. Intelligent Transportation Systems, pp. 7-11.
[10] C. D. Fabritiis, R. Ragona, and G. Valenti, “Traffic Estimation And Prediction Based On Real Time Floating Car Data,” in Conf. Rec. 2008 IEEE Int. Conf. Intelligent Transportation Systems, pp. 197-203.
[11] L. Huang and M. Barth, “A Novel Loglinear Model for Freeway Travel Time Prediction,” in Conf. Rec. 2008 IEEE Int. Conf. Intelligent Transportation Systems, pp. 197-203.
[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533-536, Oct. 1986.
[13] K. Matsuoka, “Noise Injection into Inputs in Back-Propagation Learning,” IEEE Trans. Systems, Man, and Cybernetics, vol. 22, no. 3, pp. 436–440, May 1992.
[14] Y. Xu, K. W. Wong, and C. S. Leung, “Generalized RLS Approach to the Training of Neural Networks,” IEEE Trans. Neural Networks, vol. 17, no. 1, pp. 19–34, Jan. 2006.
[15] F. D. Foresee and T. T. Hagan, “Gauss-Newton Approximation to Bayesian Learning,” in Conf. Rec. 1997 IEEE Int. Conf. Neural Networks, pp. 1930-1935.
[16] Y. C. Jin, T. Okabe, and B. Sendhoff, “Neural Network Regularization and Ensembling Using Multi-objective Evolutionary Algorithms,” in Conf. Rec. 2004 IEEE Int. Conf. Evolutionary Computation, pp. 1-8.
[17] D. L. Donoho, “De-Noising by Soft-Thresholding,” IEEE Trans. Information Theory, vol. 41, no. 3, pp. 613–627, May 1992.
[18] F. M. Ham and I. Kostanic, Principles of Neuro-computing for Science & Engineering, China Machine Press, 2000, pp. 95-100.
[19] C.P.IJ. van Hinsbergen, J.W.C. van Lint, and H.J. van Zuylen, “Bayesian committee of neural networks to predict travel times with confidence intervals,” Transportation Research Part C, vol. 17, pp. 498–509, Oct. 2009.
[20] David J C Mackay, “Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks,” Network: Computation in Neural Systems, vol. 6, pp. 469–505, 1995.