Experimental Design for the Heteroscedastic Model - MUCM

Experimental Design for the Heteroscedastic Model

Alexis Boukouvalas, Dan Cornford Neural Computing Research Group, Aston University

July 9th, 2009


Overview

Refresher on the heteroscedastic Gaussian Process emulator; experimental results on the Yuhba function and the Rabies model.

Experimental design using the Fisher Information Matrix: motivation; derivation; experimental results (monotonicity, submodularity, optimization space using exhaustive search, design criterion test).

Open Questions and Conclusions.


Random Output Simulators

Stochastic simulator: a mapping that produces random output given a fixed set of inputs.

Observational model:
\[ y_i(x_i) = t_i(x_i) + \varepsilon(x_i) \quad (1) \]


Algorithm

Main idea: use a coupled system of GPs to evaluate the mean and variance (a minimal code sketch of this loop follows the steps below).

1. Estimate a standard homoscedastic GP G_H by maximum likelihood on the two sets of observations t = {t_r, t_s}.
2. Train a GP G_S on the log variance. For set r we use the corrected sample variance; for set s we sample from G_H to estimate the variance at that point.
3. Estimate the heteroscedastic GP G_M to jointly predict the mean and variance (next slide).
4. If s is non-empty, set G_H = G_M and repeat from step 2 until convergence.
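A minimal sketch of this coupled loop, assuming a hypothetical helper fit_gp(X, y, noise=None) that returns a GP object with predict and sample methods; it illustrates the control flow only and is not the authors' implementation.

```python
import numpy as np

def fit_heteroscedastic(X_r, T_r, X_s, T_s, fit_gp, n_iter=10, n_samples=100):
    """Sketch of the coupled-GP loop. T_r holds replicated outputs (one row per
    design point in set r); T_s holds single outputs for set s. fit_gp is a
    hypothetical helper: fit_gp(X, y, noise=None) -> GP with .predict / .sample."""
    X = np.vstack([X_r, X_s])
    t = np.concatenate([T_r.mean(axis=1), T_s.ravel()])

    # Step 1: homoscedastic GP G_H on all observations by maximum likelihood.
    G_H = fit_gp(X, t)

    for _ in range(n_iter):
        # Step 2: build log-variance targets for G_S.
        # Set r: corrected sample variance of the replicates.
        log_var_r = np.log(T_r.var(axis=1, ddof=1))
        # Set s: variance estimated by sampling from G_H at those points.
        log_var_s = np.log(G_H.sample(X_s, n_samples).var(axis=1, ddof=1))
        G_S = fit_gp(X, np.concatenate([log_var_r, log_var_s]))

        # Step 3: heteroscedastic GP G_M using the predicted noise levels.
        noise = np.exp(G_S.predict(X))
        G_M = fit_gp(X, t, noise=noise)

        # Step 4: if set s is non-empty, set G_H = G_M and repeat from step 2.
        if len(X_s) == 0:
            break
        G_H = G_M
    return G_M, G_S
```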


G_M estimation

For set r the target values are the sample means, not individual random samples of the underlying process. Since the sample mean is distributed as N(m, σ²/n), the noise variance must be divided by the number of realizations when predicting at the training points. The predictive distribution equations are (we omit the mean function, although its inclusion is straightforward):

\[ \mu_* = K_*^T (K + R N^{-1})^{-1} t \]
\[ \Sigma_* = K_{**} + R_* - K_*^T (K + R N^{-1})^{-1} K_* \]

where
K = c(·, ·), the training-point covariance;
R = diag[r(x_1), ..., r(x_N)], the variance estimate from G_S (diagonal since we assume independent noise);
N = diag(n_1, ..., n_N), the number of samples at each training point;
K_*, K_{**} and R_*, the corresponding test-point matrices.

We use the most likely value of the variance from G_S. Another option would be Monte Carlo.
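A short numpy sketch of these predictive equations; the squared exponential kernel and the helper names are illustrative assumptions, not the code behind the slides.

```python
import numpy as np

def sq_exp(A, B, ls=0.3, sf2=1.0):
    """Squared exponential covariance between 1-D input arrays A and B."""
    d = A[:, None] - B[None, :]
    return sf2 * np.exp(-0.5 * (d / ls) ** 2)

def predict_gm(X, t_bar, r, n, X_star, r_star):
    """Predictive mean/variance of G_M.
    X, t_bar : training inputs and sample means; r : variance estimates from G_S;
    n : replicates per training point; X_star, r_star : test inputs and variances."""
    K = sq_exp(X, X)
    K_star = sq_exp(X, X_star)            # K*  (train x test)
    K_ss = sq_exp(X_star, X_star)         # K**
    R = np.diag(r)
    N_inv = np.diag(1.0 / np.asarray(n, dtype=float))
    C = K + R @ N_inv                     # K + R N^{-1}
    A = np.linalg.solve(C, np.column_stack([t_bar, K_star]))
    mu_star = K_star.T @ A[:, 0]                                  # K*^T C^{-1} t
    Sigma_star = K_ss + np.diag(r_star) - K_star.T @ A[:, 1:]     # K** + R* - K*^T C^{-1} K*
    return mu_star, Sigma_star
```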


Heteroscedastic Model: Joint Process Derivation

Our observation equation at a given design point x_i is:
\[ t_i(x_i) = y_i(x_i) + \varepsilon(x_i) \]

Likelihood:
\[ p(\bar{t}_i \mid \bar{y}_i) = p(\bar{t}_i \mid y_i) = \mathcal{N}\!\left(\bar{t}_i \,\Big|\, y_i, \frac{\sigma^2(x_i)}{n_i}\right), \]
where n_i is the number of replicate observations and σ²(x_i) the true variance at location x_i. We estimate the true variance by the predictive mean of G_S.

Due to independence of the noise we can write the likelihood in matrix form for all observations 1, ..., N:
\[ p(\bar{\mathbf{t}} \mid \bar{\mathbf{y}}) = p(\bar{\mathbf{t}} \mid \mathbf{y}) = \mathcal{N}(\bar{\mathbf{t}} \mid \mathbf{y}, R P^{-1}), \]
where R = diag(σ²(x_i))_{i=1}^N and P = diag(n_i)_{i=1}^N.

Our zero-mean GP prior is p(y) = N(y | 0, K).


Heteroscedastic Model: Joint Process Derivation, Continued

The marginal observation density can then be calculated:
\[ p(\bar{\mathbf{t}}) = \int p(\bar{\mathbf{t}} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \int \mathcal{N}(\bar{\mathbf{t}} \mid \mathbf{y}, R P^{-1})\, \mathcal{N}(\mathbf{y} \mid 0, K)\, d\mathbf{y} = \mathcal{N}(\bar{\mathbf{t}} \mid 0,\, C_\mu = K + R P^{-1}). \]

Conditioning on the known sites gives the predictive distribution:
\[ p(\bar{t}_* \mid \bar{\mathbf{t}}) = \mathcal{N}\Big( K(x_*, x)^T \big(K(x, x) + R(x) P(x)^{-1}\big)^{-1} \bar{\mathbf{t}},\;\; K(x_*, x_*) + R(x_*) P(x_*)^{-1} - K(x_*, x)^T \big(K(x, x) + R(x) P(x)^{-1}\big)^{-1} K(x_*, x) \Big). \]


Heteroscedastic Simulated Example: the 'Yuhba' function

\[ y = 2\left(e^{-30(x - 0.25)^2} + \sin(\pi x^2)\right) - 2 + \exp(\sin(2\pi x))\, \mathcal{N}(0, 1), \]
where N(0, 1) is the standard normal distribution.
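For reference, a small Python sketch that draws replicated observations from the Yuhba function above; the design size and replicate count are arbitrary illustrations.

```python
import numpy as np

def yuhba(x, rng):
    """One stochastic draw of the 'Yuhba' test function at inputs x."""
    mean = 2.0 * (np.exp(-30.0 * (x - 0.25) ** 2) + np.sin(np.pi * x ** 2)) - 2.0
    std = np.exp(np.sin(2.0 * np.pi * x))       # input-dependent noise level
    return mean + std * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
design = np.linspace(0.0, 1.0, 30)              # e.g. a 30-point design
# 3 replicates per design point (the "30T3" layout of the next slide).
replicates = np.stack([yuhba(design, rng) for _ in range(3)], axis=1)
```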


Yuhba function: Total number of evaluations fixed

Panels: (b) Root Mean Squared Error; (c) Mahalanobis.

Comparison of emulator fit where the total number of simulator evaluations is fixed at different levels. Notation: 30T3 = 30 design points, each with 3 replicates. Results are shown for totals of 90, 300, 400, 600 and 1600 simulator evaluations.


CSL Simulator Overview

Rabies disease propagation simulator with two vector species: raccoon dogs and foxes.
Two types of output: time series and summary statistics for each run.
Stochastic simulator: output is random but not normally distributed.
14 inputs, 1 output (Time To Disease Extinction).


Performance

Panels: (a) Mahalanobis; (b) Elapsed Time; (c) RMSE; (d) MSE Variance.


Comparing sparse approximation methods to replicate design

Mahalanobis error of the projected process ('Kersting', 4000 design points using 1000 support points) vs. the replicated design (1000 design points × 4 replicates).


Screening

Interpretation of the Gaussian Process: all input factors have been sphered, so the length scales can be used for importance ranking. With a mean function, the length scales apply to the residual process only.

Interpreting the variance emulator (G_S) by looking at the regression coefficients (Coeff) and correlation length scales (Scale):

Factor        Coeff       Factor        Scale
RacDensity    0.1608      RacRabid      1.4281
RacDeath      0.0633      FoxInf        1.4594
RacBirth      0.0200      FoxRabid      1.5047


Fisher Information Matrix

The FIM is a p × p symmetric matrix, where p is the number of unknown parameters:
\[ \mathcal{F} = -\int \frac{\partial^2}{\partial\theta_i \,\partial\theta_j} \left[\ln f(X; \theta)\right] f(X; \theta)\, dX. \]

Given X distributed as N(µ(θ), Σ(θ)), the (i, j) element of the FIM is:
\[ \mathcal{F}_{ij} = \frac{\partial\mu^T}{\partial\theta_i} \Sigma^{-1} \frac{\partial\mu}{\partial\theta_j} + \frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1} \frac{\partial\Sigma}{\partial\theta_i}\, \Sigma^{-1} \frac{\partial\Sigma}{\partial\theta_j}\right). \quad (2) \]
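A sketch of equation (2) for the zero-mean case relevant here, so only the trace term remains; the covariance builder cov_fn is a user-supplied assumption and the parameter derivatives are taken by central differences purely for illustration. A design criterion could then be, for example, log det F maximised over candidate designs.

```python
import numpy as np

def fisher_information(X, cov_fn, theta, eps=1e-6):
    """F_ij = 0.5 * tr(S^{-1} dS/dtheta_i S^{-1} dS/dtheta_j) for a zero-mean
    Gaussian with covariance S = cov_fn(X, theta); derivatives by central differences."""
    p = len(theta)
    S = cov_fn(X, theta)
    S_inv = np.linalg.inv(S)
    dS = []
    for i in range(p):
        th_plus, th_minus = np.array(theta, float), np.array(theta, float)
        th_plus[i] += eps
        th_minus[i] -= eps
        dS.append((cov_fn(X, th_plus) - cov_fn(X, th_minus)) / (2.0 * eps))
    F = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            F[i, j] = 0.5 * np.trace(S_inv @ dS[i] @ S_inv @ dS[j])
    return F
```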

Derivatives of the joint mean function

The derivatives of the joint predictive mean with respect to the covariance parameters of G_M and G_S are:
\[ \frac{\partial \mu_{G_M *}}{\partial \theta_\mu} = \frac{\partial K_{\mu *}^T}{\partial \theta_\mu} C_\mu^{-1} \bar{\mathbf{t}} - K_{\mu *}^T C_\mu^{-1} \frac{\partial C_\mu}{\partial \theta_\mu} C_\mu^{-1} \bar{\mathbf{t}}, \quad (3) \]
\[ \frac{\partial \mu_{G_M *}}{\partial \theta_\Sigma} = -K_{\mu *}^T C_\mu^{-1} \frac{\partial R}{\partial \theta_\Sigma} P^{-1} C_\mu^{-1} \bar{\mathbf{t}}. \quad (4) \]

The matrix R is diagonal with elements R_ii = exp(r(x_i)), hence its derivative is ∂R_ii/∂θ_Σ = exp(r(x_i)) ∂r(x_i)/∂θ_Σ, where r(x_i) is the most likely prediction from G_S (which models the log variance) at the point x_i. The remaining derivative is
\[ \frac{\partial r(x_i)}{\partial \theta_\Sigma} = \frac{\partial K_{\Sigma *}^T}{\partial \theta_\Sigma} C_\Sigma^{-1} \lambda^2 - K_{\Sigma *}^T C_\Sigma^{-1} \frac{\partial C_\Sigma}{\partial \theta_\Sigma} C_\Sigma^{-1} \lambda^2. \quad (5) \]

Derivatives of the joint variance

The derivatives of the joint predictive variance with respect to the covariance parameters of G_M and G_S are:
\[ \frac{\partial \Sigma_{G_M *}}{\partial \theta_\mu} = \frac{\partial K_{\mu **}}{\partial \theta_\mu} - \Xi - \Xi^T + K_{\mu *}^T C_\mu^{-1} \frac{\partial K_\mu}{\partial \theta_\mu} C_\mu^{-1} K_{\mu *}, \quad (6) \]
\[ \frac{\partial \Sigma_{G_M *}}{\partial \theta_\Sigma} = \frac{\partial R_*}{\partial \theta_\Sigma} P_*^{-1} + K_{\mu *}^T C_\mu^{-1} \frac{\partial R}{\partial \theta_\Sigma} P^{-1} C_\mu^{-1} K_{\mu *}, \quad (7) \]
where \( \Xi = \frac{\partial K_{\mu *}^T}{\partial \theta_\mu} C_\mu^{-1} K_{\mu *} \).
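Analytic derivatives such as (3)-(7) are easy to get wrong, so a central-difference check, applied component-wise to the predictive mean or variance, is a useful sanity test. The generic utility below is an illustration, not part of the slides.

```python
import numpy as np

def check_gradient(f, grad_f, theta, eps=1e-5, tol=1e-4):
    """Compare an analytic gradient grad_f(theta) of a scalar function f(theta)
    against central finite differences, parameter by parameter."""
    theta = np.asarray(theta, dtype=float)
    analytic = np.asarray(grad_f(theta))
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        numeric[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    max_rel_err = np.max(np.abs(analytic - numeric) / (np.abs(numeric) + 1e-12))
    return max_rel_err < tol, analytic, numeric
```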


Biased Estimation: Empirical Parameter Covariance

We compute the empirical covariance of the squared exponential kernel parameters using 2000 samples of a heteroscedastic GP.


Monotonicity


Submodularity

Definition
A function F is submodular iff, for A ⊆ B and an element e ∉ B,
\[ F(A \cup \{e\}) - F(A) \geq F(B \cup \{e\}) - F(B). \]

Theorem (Nemhauser et al., 1978)
If F is a monotone submodular function over a finite ground set with F(∅) = 0, then the greedy optimization algorithm is within a constant factor \( 1 - \left(\tfrac{k-1}{k}\right)^k \) of the optimal strategy for k design points.
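If the design criterion were monotone submodular, the classical greedy construction sketched below would inherit the Nemhauser et al. guarantee; the criterion function score (e.g., a scalarisation of the FIM such as log det) is left abstract and is an assumption for illustration.

```python
import numpy as np

def greedy_design(candidates, k, score):
    """Greedily grow a design of size k from a candidate set.
    score(design) -> float is the design criterion (e.g. log det of the FIM)."""
    design = []
    remaining = list(range(len(candidates)))
    for _ in range(k):
        # Pick the candidate whose addition gives the largest criterion value.
        gains = [score([candidates[j] for j in design + [i]]) for i in remaining]
        best = remaining[int(np.argmax(gains))]
        design.append(best)
        remaining.remove(best)
    return [candidates[i] for i in design]
```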


Is the FIM a submodular function?

X-axis: design size; Y-axis: number of violations over 100 realizations.


Optimization space using Exhaustive Search

Using the FIM, pick 6 points from a 30-point candidate set (1,623,160).
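At this scale an exhaustive search over all size-k subsets is still feasible; a sketch using itertools, with the design criterion again left abstract as an assumption.

```python
from itertools import combinations

def exhaustive_design(candidates, k, score):
    """Evaluate the design criterion on every k-subset and keep the best."""
    best_design, best_score = None, float("-inf")
    for subset in combinations(candidates, k):
        s = score(list(subset))
        if s > best_score:
            best_design, best_score = list(subset), s
    return best_design, best_score
```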


Design Criterion Experiment: Designs Used

Designs considered.


Design Criterion Test

Empirical parameter variance vs. FIM.


Summary

Heteroscedastic Framework
The approach improves upon existing methods in both accuracy and computational efficiency, in terms of inference and prediction time. In combination with a discrepancy model and real-world observations, this method could facilitate the efficient statistical calibration of stochastic simulators.

Open Questions
Is the FIM a good design criterion for correlated non-linear models? Zhu and Stein show a monotone relationship between the Fisher Information matrix and the empirical parameter variance, but this is empirical evidence only.
When doing maximum likelihood for the GP parameters, are they identifiable and consistently estimable? In some cases (see Stehlik) the Matern parameters are not. Under what conditions are they consistently estimable for the squared exponential?


References

K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. "Most Likely Heteroscedastic Gaussian Process Regression". In Proc. 24th International Conference on Machine Learning, 2007.

Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. "Regression with Input-dependent Noise: A Gaussian Process Treatment". Advances in Neural Information Processing Systems. The MIT Press, 1998.

Werner Muller and Milan Stehlik. "Issues in the Optimal Design of Computer Experiments". IFAS Research Paper Series, July 2007.

Zhengyuan Zhu and Michael L. Stein. "Spatial Sampling Design for Parameter Estimation of the Covariance Function". Journal of Statistical Planning and Inference, 2005.
