Feedforward neural networks for nonparametric regression

David Rios Insua and Peter Müller

Abstract

Feedforward neural networks (FFNNs) with an unconstrained, random number of hidden neurons define flexible nonparametric regression models. In Müller and Rios Insua (1998) we argued that variable-architecture models with a random-size hidden layer significantly reduce the posterior multimodality typical of posterior distributions in neural network models. In this chapter we review the model proposed in Müller and Rios Insua (1998) and extend it to a nonparametric model by allowing an unconstrained size of the hidden layer. This is made possible by introducing a Markov chain Monte Carlo posterior simulation scheme that uses reversible jump (Green, 1995) steps to move between architectures of different size.

1 Introduction

Outside the mainstream statistical literature there has been enormous interest in neural networks (NNs). After a revival due to Rumelhart and McClelland's work (1986), physicists, biologists, philosophers and computer scientists have introduced, developed and expanded various NN models, applying them to traditional statistical problems like regression or classification. The basic building blocks of NN models are neurons, which may be viewed as information processors transforming nonlinearly the inputs they receive into one or more outputs. Various ways of linking and relating neurons lead to different NN models, including, among others, feedforward neural networks (FFNNs), Boltzmann machines and Hopfield nets. Lippman (1987) provides an interesting account from a computational point of view.

In this chapter, we shall focus on FFNNs. In their simplest version, their architecture is defined by three types of nodes (or neurons): input, hidden and output nodes. Input nodes are associated with explanatory variables and feed hidden nodes, which nonlinearly transform weighted combinations of the inputs to feed output nodes; these are associated with response variables and, in turn, transform weighted combinations of their inputs. Interest in these models arises from results by Cybenko (1989), Hornik et al. (1989) and others, which describe them as universal approximators: when the nonlinear transformations are of sigmoidal type, as the number of hidden nodes goes to infinity, we may approximate any continuous function on a compact range. If we then introduce an error model, we obtain a flexible class of models on random functions with an infinite-dimensional parameter vector, establishing a link with nonparametric inference.

Realising their potential, there have been several recent reviews of NNs from a statistical point of view, see e.g. Cheng and Titterington (1994), Ripley (1993), Stern (1996), Warner and Misra (1996), Neal (1996) or De Veaux and Ungar (1997). The general message they convey is that, although there has been a certain hype in the field, NNs may constitute a useful addition to the statistician's toolkit when dealing with complex nonlinear features in regression, classification or time series analysis. As an indication of their potential, note that some of the most popular statistical packages and libraries have recently included NN modules. Yet their impact is still to be fully perceived in statistical research practice.

Given that, it comes as no surprise that relatively little work on Bayesian analysis of NNs has taken place. Buntine and Weigend (1991) and MacKay, in various papers, see e.g. MacKay (1992), developed the first approaches, based mainly on Gaussian approximations to the posteriors. Bishop (1996) reviews many of these earlier attempts. Of special relevance is the work of Neal, culminating in the above mentioned Neal (1996). From a theoretical point of view, he relates infinite hidden node FFNNs with Gaussian processes, when priors are appropriately chosen, hence sidestepping the issue of overfitting the data. Other prior choices show that infinite FFNNs are actually richer than Gaussian processes, confirming their relevance as modeling tools. From a practical point of view, assuming a large, but fixed, number of hidden nodes, Neal introduces a hybrid Monte Carlo method, merging the Metropolis algorithm with sampling techniques based on dynamical simulation, to perform inference and prediction with NNs. In previous work, Müller and Rios Insua (1998) introduced the idea of (parametric) variable architecture FFNNs, together with a very efficient Markov chain Monte Carlo method. In this chapter, we build on this work, providing analyses of nonparametric FFNN models based on a novel reversible jump algorithm.

2 Feed Forward Neural Networks as Nonparametric Regression Models

A feedforward neural network model with $p$ input nodes, one hidden layer with $M$ hidden nodes, one output node and activation function $\psi$ is a model relating $p$ explanatory variables $x = (x_1, \ldots, x_p)$ and a response variable $y$ of the form

\[ \hat y(x) \;=\; \sum_{j=1}^{M} \beta_j \, \psi(x'\gamma_j + \delta_j) \tag{2.1} \]

with $\beta_j \in \mathbb{R}$, $\gamma_j \in \mathbb{R}^p$. The terms $\delta_j$ are designated biases and may be assimilated into the rest of the $\gamma_j$ vector if we consider an additional input with constant value one, say $x_0 = 1$. The typical setup for FFNNs is: given data $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ and fixed $M$, choose $\beta = (\beta_1, \ldots, \beta_M)$ and $\gamma = (\gamma_1, \ldots, \gamma_M)$ according to a least squares criterion, $\min_{\beta,\gamma} \sum_{i=1}^N (y_i - \hat y(x_i))^2$, either via backpropagation (Rumelhart and McClelland, 1986), an implementation of steepest descent, or another optimisation method such as quasi-Newton or simulated annealing. Hence, at least implicitly, we are assuming a normal error model and viewing a nonlinear parametric regression problem. It is also sometimes suggested to include a regularisation term in the objective function to avoid overfitting the data. Cybenko (1989) and others show that when $\psi$ is a sigmoidal function, finite sums of the form (2.1) are dense in $C(I_p)$, the set of real continuous functions on the $p$-dimensional unit cube. For their proof, they let $M \to \infty$ as the approximation gets better. Hence, we may view FFNNs as nonparametric regression models.
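For concreteness, the regression function (2.1) and the least squares fit described above can be sketched numerically as follows. This is a minimal illustration, not the chapter's implementation; the synthetic data, starting values and choice of optimiser are arbitrary placeholders, and the biases $\delta_j$ are absorbed by the constant input $x_0 = 1$.

    import numpy as np
    from scipy.optimize import minimize

    def psi(t):                                   # logistic activation
        return 1.0 / (1.0 + np.exp(-t))

    def ffnn(X, beta, Gamma):
        # X: (N, p+1) design matrix with a leading column of ones (x0 = 1)
        # beta: (M,) output weights; Gamma: (M, p+1) input weights (biases absorbed)
        return psi(X @ Gamma.T) @ beta

    def sse(theta, X, y, M):                      # least squares criterion
        p1 = X.shape[1]
        beta, Gamma = theta[:M], theta[M:].reshape(M, p1)
        return np.sum((y - ffnn(X, beta, Gamma)) ** 2)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(50), rng.uniform(-2, 2, 50)])    # N = 50, p = 1
    y = np.sin(2 * X[:, 1]) + 0.1 * rng.standard_normal(50)       # synthetic data
    M = 3
    theta0 = rng.standard_normal(M + M * X.shape[1])
    fit = minimize(sse, theta0, args=(X, y, M), method="BFGS")    # quasi-Newton fit
    print("residual sum of squares:", fit.fun)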

3 Variable Architecture FFNNs

An important issue which has received comparatively little attention in the literature is the choice of architecture, which, for the model we are considering, consists mainly of choosing the number $M$ of hidden nodes. Typically, a choice is made by trial and error, though there are several heuristics helping in that task. In Müller and Rios Insua (1998), we argue that the architecture should be treated as an additional parameter $M$ in the model: apart from simulated examples, there will always be uncertainty about the number of hidden nodes to include, and a basic tenet of Bayesian inference suggests that uncertainty about architecture choice should be formalized in a prior probability model on $M$. We also argue that considering $M$ as an unknown parameter, rather than fixing it, will address some of the likelihood (and posterior) multimodality which makes inference in NN models notoriously difficult, as recognised by various authors.

Besides arguing for a variable architecture FFNN model, characterised as a mixture of $M$ sigmoidal activation functions $\psi(\cdot)$, with $M$ unknown and ranging through the whole set of positive integers, we shall include a linear regression term $x'\delta$. This corresponds to a standard model building strategy based on blocks, see West and Harrison (1996), where the linear term would model linear effects, and the FFNN term would take care of other effects. Hence, we shall deal with the following model

\[ y_i \;=\; x_i'\delta \,+\, \sum_{j=1}^{M} \beta_j \, \psi(x_i'\gamma_j) \,+\, \epsilon_i, \qquad i = 1, \ldots, N, \tag{3.2} \]

where $\epsilon_i \sim N(0, \sigma^2)$. For the moment, we fix $\psi(\cdot) = \exp(\cdot)/(1 + \exp(\cdot))$, i.e., we use a logistic activation function. However, everything we shall say is valid for other sigmoidal functions; for contrast, our second example below will use a different type of activation function. The parameters in our model are the linear weights $\delta = (\delta_0, \delta_1, \ldots, \delta_p)$ and $\beta = (\beta_1, \beta_2, \ldots)$, the logistic parameters $\gamma = (\gamma_1, \gamma_2, \ldots)$, the number $M$ of terms and the error variance $\sigma^2$. The prior over network parameters is

\[ \beta_j \sim N(\mu_\beta, \sigma_\beta^2), \quad \delta_k \sim N(\mu_\delta, \sigma_\delta^2), \quad \gamma_j \sim N(\mu_\gamma, S_\gamma), \quad \sigma^{-2} \sim \mathrm{Gamma}(s/2,\, sS/2). \tag{3.3} \]

The meaning and interpretation of the parameters allow us to use informative priors. For example, the $\beta$'s should reflect the order of magnitude of the data $y_i$. After standardisation, positive and negative values for $\beta_j$ would be equally likely, calling for a prior symmetric around 0, with a standard deviation reflecting the range of plausible values for $y_i$. Similarly, a range of reasonable values for the logistic coefficients $\gamma_j$ will be determined by the meaning of the data being modeled, to address smoothness issues.

We also need to provide a prior over the number $M$ of hidden nodes. We shall keep it as a generic distribution, i.e., $p(M)$. Poisson distributions provide a convenient way of modelling various shapes for such a distribution. Alternatively, we may choose a probability model which rewards parsimony by putting geometrically decreasing prior probability on larger networks, hence assuming a geometric prior with parameter $\lambda$: $M \sim \mathrm{Geom}(\lambda)$. When there is non-negligible uncertainty about the prior hyperparameters, we may complete the prior model with a hyperprior over them. We shall use the following standard choices in hierarchical models:

\[ \mu_\beta \sim N(a_\beta, A_\beta), \quad \mu_\gamma \sim N(a_\gamma, A_\gamma), \quad \sigma_\beta^{-2} \sim \mathrm{Gamma}(c_b/2,\, c_b C_b/2), \quad S_\gamma^{-1} \sim \mathrm{Wish}(c_\gamma,\, (c_\gamma C_\gamma)^{-1}). \tag{3.4} \]

Finally, as hyperprior for the geometric probability, we use $\lambda \sim \mathrm{Beta}(a_\lambda, b_\lambda)$. Since the likelihood is invariant to relabelings, and given our prior choice, we recommend including an order constraint to avoid trivial posterior multimodality due to permutations of indices. For example, we could use

\[ \gamma_{1p} \leq \gamma_{2p} \leq \cdots \leq \gamma_{Mp}. \]
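To make the variable-architecture prior concrete, the following sketch draws one regression function from (3.2)-(3.3) with a geometric prior on $M$. All hyperparameter values are illustrative placeholders, not the chapter's choices, and a shape/rate parametrisation of the gamma distribution is assumed.

    import numpy as np

    rng = np.random.default_rng(1)
    p = 2                                        # number of inputs

    mu_beta, sig_beta = 0.0, 1.0                 # illustrative hyperparameter values
    mu_delta, sig_delta = 0.0, 1.0
    mu_gamma, S_gamma = np.zeros(p + 1), np.eye(p + 1)
    lam = 0.5                                    # geometric prior parameter

    M = rng.geometric(1.0 - lam)                 # geometric prior on M, support {1, 2, ...}
    beta = rng.normal(mu_beta, sig_beta, size=M)
    delta = rng.normal(mu_delta, sig_delta, size=p + 1)
    Gamma = rng.multivariate_normal(mu_gamma, S_gamma, size=M)
    Gamma = Gamma[np.argsort(Gamma[:, -1])]      # order constraint on the gamma_jp
    sigma2 = 1.0 / rng.gamma(shape=1.0, scale=0.5)   # sigma^-2 ~ Gamma(s/2, sS/2), illustrative s = S = 2

    def f(x):                                    # one prior draw of the regression function
        x = np.concatenate(([1.0], x))           # x0 = 1
        return x @ delta + beta @ (1.0 / (1.0 + np.exp(-Gamma @ x)))

    print(M, f(np.array([0.5, -0.3])))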

Note that the main difference from our previous model in Müller and Rios Insua (1998) lies in having removed the upper bound on the number of terms in the mixture of logistics, to allow for a nonparametric model. Since our hyperpriors allow for variances tending to infinity, we encompass the more general models described by Neal (1996).

4 Posterior inference with the FFNN model

Posterior inference in FFNNs is plagued by multimodality issues. Besides trivial multimodality due to relabeling, which we mitigate with a constraint on the logistic slopes, there is inherent multimodality due to nonlinearity and to the nesting of models by inclusion of (approximately) duplicate or (approximately) irrelevant hidden nodes. As a consequence, there is little hope for normal approximations with these models, and we need to turn to MCMC methods. However, straightforward implementation of commonly used MCMC schemes is hindered by a variety of issues. First, the variable architecture model defines a model with a parameter vector of changing dimensionality: posterior inference has to mix over models with hidden layers of different size. We introduce below an algorithm which addresses the changing dimensions by using an implementation of reversible jump (Green, 1995) with moves corresponding to "birth", "death", "thinning" and "seeding" of hidden nodes. Second, the high dimensional parameter vector is typically highly correlated a posteriori, necessitating MCMC strategies which use blocking, to jointly update as many parameters as possible, and marginalization, to partly avoid the random walk nature of Metropolis algorithms.

In Müller and Rios Insua (1998) we introduce a scheme which marginalizes over the weights $\beta_j$ when updating the input weights $\gamma_j$. This is made possible by Lemma 1, which gives the marginal likelihood of $\gamma$, marginalizing over $(\delta, \beta)$. With this likelihood we use a Metropolis step to update values of the logistic regression parameters. The other key observation, which is actually used in the proof of the lemma, is that, once we have fixed $M$ and $\gamma$, we have a standard linear model, hence facilitating, through Gibbs steps, the sampling of new linear weights and hyperparameters. For simplicity, we drop dependence on hyperparameters.

Lemma 1: Let $z_{ij} = z_{ij}(\gamma) = \psi(x_i'\gamma_j)$, $Z = (z_{ij})_{i=1,\ldots,N;\, j=1,\ldots,M}$, $X = (x_{ik})_{i=1,\ldots,N;\, k=0,\ldots,p}$ and $R = (X, Z)$. Define $A = R'R/\sigma^2$, $b = R'y/\sigma^2$, $C = \sigma_\beta^{-2} I$, $\eta = (\mu_\beta/\sigma_\beta^2)\,(1, 1, \ldots, 1)'$, $m_{l,b}(\gamma) = (A + C)^{-1}(b + \eta)$ and $S_{l,b}(\gamma) = (A + C)^{-1}$.

Then,

\[
p(D \mid \gamma, M) \;=\; \frac{p[(\delta,\beta) = m_{l,b}(\gamma)]}{p[(\delta,\beta) = m_{l,b}(\gamma) \mid y, \gamma]} \, \prod_{i=1}^{N} p[y_i \mid (\delta,\beta) = m_{l,b}(\gamma), \gamma]
\;=\; (2\pi)^{(M+p+1)/2}\, |S_{l,b}(\gamma)|^{1/2}\, p[(\delta,\beta) = m_{l,b}(\gamma)] \, \prod_{i=1}^{N} p[y_i \mid (\delta,\beta) = m_{l,b}(\gamma), \gamma].
\]
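A minimal numerical sketch of this identity follows. It assumes, for illustration, that the combined coefficient vector $(\delta, \beta)$ has prior $N(\mu_0 \mathbf{1}, \tau^2 I)$, matching the definitions of $C$ and $\eta$ above; all function and variable names are ours, not the chapter's.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_marginal_lik(Gamma, X, y, sigma2, mu0, tau2):
        # log p(D | gamma, M) via Lemma 1, marginalizing over (delta, beta).
        # Gamma: (M, p+1) logistic weights; X: (N, p+1) design matrix (first column ones).
        Z = 1.0 / (1.0 + np.exp(-X @ Gamma.T))       # z_ij = psi(x_i' gamma_j)
        R = np.hstack([X, Z])                        # R = (X, Z)
        k = R.shape[1]                               # k = M + p + 1
        A = R.T @ R / sigma2
        b = R.T @ y / sigma2
        C = np.eye(k) / tau2
        eta = np.full(k, mu0) / tau2
        S = np.linalg.inv(A + C)                     # S_{l,b}(gamma)
        m = S @ (b + eta)                            # m_{l,b}(gamma)
        log_prior = multivariate_normal.logpdf(m, mean=np.full(k, mu0), cov=tau2 * np.eye(k))
        log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y - R @ m) ** 2 / sigma2)
        log_post_at_mean = multivariate_normal.logpdf(m, mean=m, cov=S)
        return log_prior + log_lik - log_post_at_mean

In the sampler, only ratios of these marginal likelihoods are needed, for the current and the proposed value of $\gamma$.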

4.1 MCMC posterior simulation for FFNN models

We describe posterior simulation in the proposed model by stating the algorithm for one sweep, i.e., the transition from the imputed parameter values at iteration $t$ to those at iteration $t+1$. The list below outlines the steps in one sweep; details of each step are discussed afterwards. An item of the form $[a \mid b]$ indicates that parameter $a$ is being updated using current values of parameters $b$. The absence of some parameter $c$ from the conditioning set (to the right of the bar) indicates either that $c$ is being marginalized over, or that $a$ and $c$ are conditionally independent given $b$. Let $\gamma_j = (\gamma_{j0}, \ldots, \gamma_{jp})'$, $\gamma = (\gamma_1, \ldots, \gamma_M)'$, $\beta = (\beta_1, \ldots, \beta_M)'$, $\delta = (\delta_0, \ldots, \delta_p)$, and let $\phi$ denote the vector of hyperparameters. Let $\gamma_{-jk}$ indicate the list of all weights $\gamma_{j'k'}$ without $\gamma_{jk}$. The following steps define one sweep of the Markov chain:

(i) $[\gamma_{jk} \mid \gamma_{-jk}, M, \phi, D]$, $j = 1, \ldots, M$, $k = 0, \ldots, p$;
(ii) $[M \mid \gamma, \phi, D]$: "birth/death";
(iii) $[M \mid \gamma, \phi, D]$: "seed/thin";
(iv) $[\delta, \beta \mid \gamma, M, \phi, D]$;
(v) $[\phi \mid \beta, \gamma, \delta, D]$.

Step (i). We update the logistic weights $\gamma_{jk}$, marginalizing over $(\delta, \beta)$. We use a random walk Metropolis proposal for a new value of $\gamma_{jk}$, and evaluate the Metropolis-Hastings acceptance probability using Lemma 1 to compute the required marginal likelihoods, $p(D \mid \tilde\gamma, M, \phi)$ and $p(D \mid \gamma, M, \phi)$. Denote by $\kappa_k = C_{\gamma,kk}$ the $k$-th element of the diagonal of $C_\gamma$. Let $\tilde\gamma$ denote the vector $\gamma$ with $\gamma_{jk}$ replaced by $\tilde\gamma_{jk}$ and, if necessary, the indices rearranged to satisfy the order constraint on the $\gamma_{jp}$. Let $x \wedge y$ denote the minimum of $x$ and $y$. We then do:

\[ \tilde\gamma_{jk} \sim N(\gamma_{jk}, \kappa_k), \qquad \alpha = 1 \wedge \frac{p(D \mid \tilde\gamma, M, \phi)\, p(\tilde\gamma \mid M, \phi)}{p(D \mid \gamma, M, \phi)\, p(\gamma \mid M, \phi)}, \]

and

\[ \gamma := \begin{cases} \tilde\gamma & \text{with probability } \alpha, \\ \gamma & \text{with probability } 1 - \alpha. \end{cases} \]

The computation of $p(\gamma \mid M, \phi)$ is straightforward (see the note after (4.5) about the order constraint in $p(\gamma \mid M, \phi)$); Lemma 1 is used to evaluate $p(D \mid \gamma, M, \phi)$.
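A sketch of this update for a single $\gamma_{jk}$ is given below, reusing the log_marginal_lik function sketched after Lemma 1. The prior evaluation log_prior_gamma is passed in as a function; everything here is illustrative rather than the chapter's implementation.

    import numpy as np

    def update_gamma_jk(Gamma, j, k, kappa_k, X, y, sigma2, mu0, tau2,
                        log_prior_gamma, rng):
        # Random walk Metropolis update of gamma_jk, marginalizing over (delta, beta).
        # log_prior_gamma(Gamma) should return log p(gamma | M, phi) under (3.3).
        Gamma_new = Gamma.copy()
        Gamma_new[j, k] += rng.normal(0.0, np.sqrt(kappa_k))   # gamma~_jk ~ N(gamma_jk, kappa_k)
        Gamma_new = Gamma_new[np.argsort(Gamma_new[:, -1])]    # rearrange to keep the ordering
        log_alpha = (log_marginal_lik(Gamma_new, X, y, sigma2, mu0, tau2)
                     + log_prior_gamma(Gamma_new)
                     - log_marginal_lik(Gamma, X, y, sigma2, mu0, tau2)
                     - log_prior_gamma(Gamma))
        if np.log(rng.uniform()) < log_alpha:                  # accept with probability 1 ^ A
            return Gamma_new
        return Gamma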

Step (ii). A change in $M$ by a reversible jump (Green, 1995) type move is proposed. The algorithm follows the setup in Richardson and Green (1997). With probability $d_M = M/(M+1)$ for $M \neq 1$, and $d_1 = 0$, we propose to delete one hidden node ("death"); with probability $b_M = 1 - d_M$ we propose to add one hidden node ("birth"). If we decide to add a node, we first generate a vector $\gamma_{M+1} \sim q(\gamma_{M+1}) = N(\mu_\gamma, S_\gamma)$. The random vector $\gamma_{M+1}$ plays the role of the continuous random variable $u$ in Richardson and Green's discussion of a birth move in the normal mixtures model. If we decide to delete a node, we pick with equal probabilities $1/M$ one of the $M$ hidden nodes to be proposed for deletion. Assume we choose index $j$. Then remove $\gamma_j$ from the list of nodes; relabel the remaining indices to close the gap; decrement $M$ to $\tilde M := M - 1$; and record $\gamma_{\tilde M + 1} = \gamma_j$. To derive the appropriate acceptance probabilities, consider first the case of proposing a birth. Denote by $\tilde\gamma$ the vector $\gamma$ augmented with the additional $\gamma_{M+1}$, relabeled if necessary to maintain the order constraint, and let $\tilde M = M + 1$. With probability $\alpha = 1 \wedge A$, where

\[
A \;=\; \text{prior ratio} \times \text{likelihood ratio} \times \text{proposal ratio}
\;=\; \frac{p(\tilde\gamma \mid \tilde M, \phi)\, p(\tilde M)}{p(\gamma \mid M, \phi)\, p(M)} \;
      \frac{p(D \mid \tilde\gamma, \tilde M, \phi)}{p(D \mid \gamma, M, \phi)} \;
      \frac{d_{\tilde M}\, \frac{1}{\tilde M}}{b_M\, q(\gamma_{M+1})}
\;=\; \lambda\, \tilde M \;
      \frac{p(D \mid \tilde\gamma, \tilde M, \phi)}{p(D \mid \gamma, M, \phi)} \;
      \frac{1/(\tilde M + 1)}{1/(M+1)}, \tag{4.5}
\]

accept the proposal and set $\gamma := \tilde\gamma$; otherwise keep $\gamma$ unchanged. In verifying the expression for $A$, note that $p(\gamma \mid M, \phi) = M!\, \prod_{j=1}^{M} N(\gamma_j;\, \mu_\gamma, S_\gamma)$, with the factor $M!$ appearing because of the order constraint. The factor $\frac{1}{\tilde M}$ in the numerator of the proposal ratio corresponds to uniformly sampling one of the $\tilde M$ indices to decide which node should die. The $\lambda$ in the final expression derives from $p(\tilde M)/p(M)$, if we use the geometric prior $M \sim \mathrm{Geom}(\lambda)$. Note that the Jacobian appearing in the general expression for reversible jump moves equals one. The calculations for a death move can be derived by an analogous argument.
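The following sketch shows only the birth half of this move, again reusing the log_marginal_lik function from the Lemma 1 sketch; the choice between birth and death with probabilities $b_M$ and $d_M$, and the matching death move, are omitted. Names and arguments are illustrative.

    import numpy as np

    def birth_move(Gamma, lam, mu_gamma, S_gamma, X, y, sigma2, mu0, tau2, rng):
        # Propose adding one hidden node and accept with probability 1 ^ A, eq. (4.5).
        M = Gamma.shape[0]
        gamma_new = rng.multivariate_normal(mu_gamma, S_gamma)    # gamma_{M+1} ~ q = N(mu_gamma, S_gamma)
        Gamma_tilde = np.vstack([Gamma, gamma_new])
        Gamma_tilde = Gamma_tilde[np.argsort(Gamma_tilde[:, -1])] # maintain the order constraint
        M_tilde = M + 1
        log_A = (np.log(lam) + np.log(M_tilde)                    # lambda * M~ terms of (4.5)
                 + log_marginal_lik(Gamma_tilde, X, y, sigma2, mu0, tau2)
                 - log_marginal_lik(Gamma, X, y, sigma2, mu0, tau2)
                 + np.log(M + 1.0) - np.log(M_tilde + 1.0))       # (1/(M~+1)) / (1/(M+1))
        if np.log(rng.uniform()) < log_A:
            return Gamma_tilde
        return Gamma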

Step (iii). As in the previous step, a change in the number of hidden nodes is proposed. Instead of generating an entirely new candidate node as in Step (ii), we propose a new node in relation to the current ones. We shall call this move "seeding" a node; the matching move proposes to reduce $M$ by "thinning" one current node. With probability $t_M = 0.5$ for $M > 1$, and $t_1 = 0$, we consider thinning, and with probability $s_M = 1 - t_M$ we consider seeding.

Assume we decide to seed. With equal probabilities we choose an index $j \in \{1, \ldots, M\}$ and generate a new node $\tilde\gamma_{j+1}$ by first generating $u \sim N(0, S_\gamma)$ and then setting $\tilde\gamma_{j+1} = \gamma_j + \rho u$ for some factor $\rho$; in the current implementation we use $\rho = 1.0$. Denote by $\tilde\gamma$ the augmented vector after relabeling the nodes $\gamma_{j+1}, \ldots, \gamma_M$ as $\tilde\gamma_{j+2}, \ldots, \tilde\gamma_{\tilde M}$, where $\tilde M = M + 1$. If $\tilde\gamma$ violates the order constraint imposed on the last coordinates $\gamma_{jp}$, $j = 1, \ldots, M$, then we immediately reject the proposal (if the order constraint is violated, the factor $p(\tilde\gamma \mid \tilde M, \phi)$ in the acceptance probability given below evaluates to zero). For the matching thinning step, we choose with equal probabilities $1/(M-1)$ a pair of indices $(j, j+1)$, $j = 1, \ldots, M-1$; record $u = \rho^{-1}(\gamma_{j+1} - \gamma_j)$; reduce $M$ by one, $\tilde M = M - 1$; remove $\gamma_{j+1}$ from the list of nodes; and relabel the remaining indices to close the gap. Denote the proposal $\tilde\gamma$.

Again, we only derive the acceptance probability for a seed move; the expression for the thinning move is easily derived by a symmetric argument. Assume we decided to seed with respect to term $j$. With probability $\alpha = 1 \wedge A$, where

\[
A \;=\; \text{prior ratio} \times \text{likelihood ratio} \times \text{proposal ratio}
\;=\; \frac{p(\tilde\gamma \mid \tilde M, \phi)\, p(\tilde M)}{p(\gamma \mid M, \phi)\, p(M)} \;
      \frac{p(D \mid \tilde\gamma, \tilde M, \phi)}{p(D \mid \gamma, M, \phi)} \;
      \frac{t_{\tilde M}\, 1/(\tilde M - 1)}{s_M\, (1/M)\, q(u)}\, |J|
\;=\; \frac{N(\tilde\gamma_{j+1};\, \mu_\gamma, S_\gamma)}{N(u;\, 0, S_\gamma)\, \rho^{-p}} \;
      \lambda\, \tilde M \;
      \frac{p(D \mid \tilde\gamma, \tilde M, \phi)}{p(D \mid \gamma, M, \phi)} \;
      \frac{t_{\tilde M}}{s_M},
\]

we accept the proposal. The factor $1/M$ in the denominator and the factor $1/(\tilde M - 1)$ in the numerator of the proposal ratio correspond to sampling an index $j$ from which we seed and sampling a pair of indices $(j, j+1)$ which we thin, respectively. The factor $\rho^{p}$ is the Jacobian $|J|$ appearing in the general formula for the reversible jump acceptance probability. The additional factor $\tilde M$ in the numerator stems from the order constraint on the $\gamma_{jp}$, $j = 1, \ldots, M$ (see the comment after the acceptance probability in Step (ii)).

Step (iv). Conditional on $(\gamma, \phi, M)$ the model becomes a normal linear regression problem with coefficients $(\delta, \beta)$. Resampling $(\delta, \beta)$ from $p(\delta, \beta \mid \gamma, \phi, M, D)$ is a straightforward multivariate normal generation; see, for example, Bernardo and Smith (1994) for an explicit statement of the relevant moments.

Step (v). We appeal to the same normal linear regression argument as before. It is then straightforward to resample the hyperparameters from their respective complete conditional posterior distributions.
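Step (iv) is simple enough to state in full. Under the same illustrative prior used in the Lemma 1 sketch, the draw of $(\delta, \beta)$ from its multivariate normal full conditional can be written as follows; the function name and arguments are ours.

    import numpy as np

    def draw_linear_weights(Gamma, X, y, sigma2, mu0, tau2, rng):
        # Step (iv): draw (delta, beta) from its normal full conditional, given
        # gamma, sigma2 and the (illustrative) prior (delta, beta) ~ N(mu0*1, tau2*I).
        Z = 1.0 / (1.0 + np.exp(-X @ Gamma.T))
        R = np.hstack([X, Z])                                    # R = (X, Z)
        k = R.shape[1]
        S = np.linalg.inv(R.T @ R / sigma2 + np.eye(k) / tau2)   # S_{l,b}(gamma)
        m = S @ (R.T @ y / sigma2 + np.full(k, mu0) / tau2)      # m_{l,b}(gamma)
        theta = rng.multivariate_normal(m, S)
        p1 = X.shape[1]
        return theta[:p1], theta[p1:]                            # delta, beta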

Convergence of the algorithm to the desired stationary distribution comes from arguments in Tierney (1994) and Green (1995).

5 Examples

Example 1. Stochastic optimisation. An application of interest is stochastic optimisation. Suppose we are interested in finding decisions of maximum expected utility, yet we do not have an explicit expression for this objective function. Suppose, however, that we may approximate it quite accurately at several points, say via Monte Carlo integration. We could then fit a nonlinear surface and optimise its predictive expected value. As an example, consider a reservoir management problem adapted from Rios Insua et al. (1997). We have to decide how much water to release through turbines ($u_1$) and spillgates ($u_2$) from a reservoir. We evaluate decisions through their expected utility $\int g(u_1, u_2, i)\, h(i)\, di$, where $g$ is the utility function, $i$ is the inflow and $h$ is the predictive density for inflows. At $N$ design points, we approximate the expected utility by standard Monte Carlo. We then fit a neural network model to the data $\{(u_{i1}, u_{i2}, \hat g(u_{i1}, u_{i2}));\; i = 1, \ldots, N\}$, where $\hat g$ designates the Monte Carlo approximation of the expected utility. Figure 1 shows the data and the fitted FFNN model. Note that other smoothing methods would have problems in this example, due to the sharp edge on the front, which would typically be oversmoothed. Figure 2 shows the marginal posterior on $M$, suggesting bigger support for architectures with three or four hidden nodes.

FIGURE 1. Example 1 (stochastic optimisation). Data (panel a) and fitted surface (panel b) using the NN model, plotted against (X1, X2). The solid triangles indicate the data points.

FIGURE 2. Example 1 (stochastic optimisation). Posterior $p(M \mid D)$ on the size of the hidden layer. Panel (a) plots the estimated posterior distribution $p(M \mid D)$. Panel (b) shows the trajectory of $M$ over the iterations of the Markov chain.

In addition to the normal prior (3.3), we constrained the $\gamma_{jk}$ by $|\gamma_{jk}| < 10.0$ to avoid numerical problems. Otherwise, proposals for the $\gamma$ vector could lead to degenerate design matrices in the regression problems required for the evaluation of $p(D \mid \gamma)$ (Lemma 1).

Example 2. Robot arm. In this example the response is bivariate and the activation functions are tanh instead of logistic. This is a standard example in the neural network literature called the robot arm problem, see e.g. MacKay (1992) and Neal (1996). We try to predict the robot arm position $(y_1, y_2)$ from two joint angles $(x_1, x_2)$, when they are actually related by the model

\[ y_{i1} = 2.0\, \cos(x_{i1}) + 1.3\, \cos(x_{i1} + x_{i2}) + \epsilon_{i1}, \qquad i = 1, \ldots, N, \]
\[ y_{i2} = 2.0\, \sin(x_{i1}) + 1.3\, \sin(x_{i1} + x_{i2}) + \epsilon_{i2}, \qquad \epsilon_{ik} \stackrel{iid}{\sim} N(0, 0.05^2). \]

The bivariate FFNN with $\tanh(\cdot)$ activation functions we use is as follows:

\[ y_{ik} \;=\; x_i'\delta_k \,+\, \sum_{j=1}^{M} \beta_{jk}\, \psi(x_i'\gamma_j) \,+\, \epsilon_{ik}, \qquad i = 1, \ldots, N, \; k = 1, 2, \qquad \epsilon_{ik} \stackrel{iid}{\sim} N(0, \sigma^2), \quad \psi(\cdot) = \tanh(\cdot). \tag{5.6} \]

Note that we could change $(\beta_j, \gamma_j)$ to $(-\beta_j, -\gamma_j)$ without changing the likelihood. Hence, with a prior symmetric around 0, the posterior distribution would remain invariant under such transformations. To avoid this identifiability problem we add the constraint $\gamma_{j1} > 0$.
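For readers who wish to reproduce a data set of this form, the true model above can be simulated as follows. The input ranges are illustrative placeholders rather than MacKay's exact design; only the trigonometric mean functions and the noise level are taken from the text.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 400
    x1 = rng.uniform(-2.0, 2.0, N)               # joint angles (illustrative ranges)
    x2 = rng.uniform(0.5, 3.0, N)
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + rng.normal(0.0, 0.05, N)
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + rng.normal(0.0, 0.05, N)
    train, test = slice(0, 200), slice(200, 400) # first 200 for training, last 200 for testing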


In NN applications it is common to split the data set into two subsets, one for estimation and the other for validation. We use MacKay's (1992) data set, splitting it into a training data set (first 200 observations) and a test data set (last 200 observations). Figure 3 shows the estimated surface, which is practically indistinguishable from the data. After around 300 iterations the mean squared error is 0.0064, already close to the asymptotic value 0.0062 (note the theoretical minimum $2\sigma^2 = 0.0050$), indicating that short run lengths of several hundred iterations are sufficient for predictive purposes. However, to monitor convergence diagnostics on some selected parameters, we needed 20,000 iterations to achieve practical convergence. The estimated marginal posterior probabilities $p(M \mid D)$ for the number of hidden nodes are 0.30, 0.53, 0.15 and 0.02 for $M = 6, 7, 8$ and $9$, respectively.

FIGURE 3. Example 2 (robot arm). Estimated surfaces $E(y_{i1} \mid x_{i1}, x_{i2}, D)$ and $E(y_{i2} \mid x_{i1}, x_{i2}, D)$, $i = n+1$ (dotted contours), together with the true contours (thin solid contours), plotted against (X1, X2). The estimated and the true surfaces are almost indistinguishable. The average predictive mean squared error (MSE) over all 200 data points in the test data is 0.0062; the theoretical optimum is $2\sigma^2 = 0.005$.

6 Discussion

Several recent reviews suggest that neural networks constitute a useful enhancement to the statistician's toolkit. Neal (1996), in particular, shows that FFNNs with infinitely many hidden nodes embrace a class of flexible models richer than Gaussian processes. We have shown here how a reversible jump algorithm permits routine analysis of FFNNs viewed as nonparametric regression models. By combining them, as we did, with other conventional models, such as linear regression, we have at hand a powerful modelling strategy for complex problems. Important applications include stochastic optimisation (Example 1), approximation (Example 2) and regression metamodels for simulation experiments, see Rios Insua et al. (1997).

Many issues remain to be explored. For example, we should try other standard NN applications like classification, density estimation or time series analysis, e.g. via nonlinear autoregression models. Reversible moves other than birth, death, seed and thin could be devised and should be compared. Also, we have confined ourselves to FFNNs, but many other NN models, analysed from a Bayesian point of view, might prove useful.

Acknowledgements

Research supported by grants from CICYT, the Government of Madrid, the Iberdrola Foundation and the National Science Foundation (NSF/DMS9704934).

References

Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory, Wiley, New York.

Bishop, C.M. (1996). Neural Networks for Pattern Recognition, Oxford University Press, Oxford.

Buntine, W.L. and Weigend, A.S. (1991). Bayesian back-propagation. Complex Systems, 5, 603-643.

Cheng, B. and Titterington, D.M. (1994). Neural networks: a review from a statistical perspective (with discussion). Statistical Science, 9, 2-54.

Cybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2, 303-314.

De Veaux, R. and Ungar, L. (1997). A brief introduction to neural networks. Technical Report, Williams College, Williamstown, MA.

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.

Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Lippman, R.P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, 4-22.

MacKay, D.J.C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448-472.

Müller, P. and Rios Insua, D. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 571-592.

Neal, R.M. (1996). Bayesian Learning for Neural Networks, Springer-Verlag, New York.

Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B, 59, 731-792.

Rios Insua, D., Rios Insua, S. and Martin, J. (1997). Simulación, RA-MA, Madrid.

Rios Insua, D., Salewicz, K.A., Müller, P. and Bielza, C. (1997). Bayesian methods in reservoir operations. In French and Smith (eds), The Practice of Bayesian Analysis, Arnold, London.

Ripley, B.D. (1993). Statistical aspects of neural networks. In Barndorff-Nielsen, Jensen and Kendall (eds), Networks and Chaos, Chapman and Hall, London.

Rumelhart, D.E. and McClelland, J.L. (eds) (1986). Parallel Distributed Processing, MIT Press, Cambridge, MA.

Stern, H.S. (1996). Neural networks in applied statistics. Technometrics, 38, 205-220.

Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701-1762.

Warner, B. and Misra, M. (1996). Understanding neural networks as statistical tools. The American Statistician, 50, 284-293.

West, M. and Harrison, J. (1996). Bayesian Forecasting and Dynamic Linear Models, 2nd edn, Springer, New York.