Sequential Data Assimilation with Sigma-point Kalman Filter on Low-dimensional Manifold

June 23, 2007

DRAFT - DO NOT CIRCULATE

Zhengdong Lu, Todd K. Leen and Rudolph van der Merwe
Department of Computer Science & Electrical Engineering, OGI School of Science & Engineering, Oregon Health & Science University, Portland, OR 97006, USA

Sergey Frolov and Antonio M. Baptista
Department of Environmental & Biomolecular Systems, OGI School of Science & Engineering, Oregon Health & Science University, Portland, OR 97006, USA

Abstract

In order to address the highly nonlinear dynamics in estuary flow, we propose a novel data assimilation system based on components designed to accurately capture nonlinear dynamics. The core of the system is a sigma-point Kalman filter coupled to a fast, neural network emulator of the flow dynamics. In order to be computationally feasible, the entire system operates on a low-dimensional subspace obtained by principal component projection. Our probabilistic latent state space analysis properly accounts for errors introduced by the dimensionality reduction and by the emulator of the flow dynamics. We introduce the use of cross-validation to set parameters in the data assimilation, and report on its efficacy. Experiments on a benchmark estuary problem show that our data assimilation method can significantly reduce prediction errors.

1 Introduction

The strongly non-linear dynamics encountered in estuarine flow presents a significant challenge to data assimilation systems. The nonlinear flow dynamics, together with the desire to build portable data assimilation technology that can be applied to different problems without substantial model-building overhead, led us to construct the novel data assimilation system presented here.


We present an efficient data assimilation system with a rigorous probabilistic foundation that deals naturally and accurately with nonlinear flow. Our research is part of the CORIE (Baptista, 2006) environmental observation and forecasting system (EOFS) for the Columbia River estuary and near ocean. The CORIE system is summarized in Figure 1. CORIE integrates a real-time observation network and a data management system with the advanced numerical models ELCIRC (Zhang and Baptista, 2005) and SELFE (Zhang et al., 2004). Through this integration, we seek to characterize and predict complex circulation and mixing processes in a system encompassing the lower river, the estuary and the near-ocean.

[Figure 1 diagram: external forcings (short waves; ocean tides and circulation; river discharges; atmospheric forcings such as wind, pressure, and heat exchange), sensor data (remote sensing satellite data; in-situ data from CORIE sensor platforms), bathymetry, numerical codes (ELCIRC, SELFE), data assimilation / probabilistic sensor fusion, and data products (daily forecasts; historic hindcasts; bio-constituents).]

Figure 1: CORIE is a pilot environmental observation and forecasting system (EOFS) for the Columbia River. It integrates a real-time observation network, a data management system, advanced numerical models such as ELCIRC or SELFE, and an advanced data assimilation framework.

Although variational assimilation is an icon of formal rigor with widespread successful application (Bennett, 1998), two features of estuary dynamics suggest a different approach. First, the strong non-linearity in estuarine dynamics requires an approach that naturally incorporates nonlinear dynamics. Second, development of an adjoint numerical code for CORIE represents a significant development overhead that is not portable to unrelated assimilation problems. In contrast, the algorithm we propose in this paper is based on an easily-trained neural network dynamics emulator and (nonlinear) Kalman filtering algorithms, and thus can be customized to other data assimilation tasks with comparatively little effort.

Another difficulty posed by the CORIE system is the intimidating dimensionality of the state space. For the whole CORIE domain, the 3D grid used by ELCIRC includes over 10^6 vertices. The variables of interest include the elevation, salinity, temperature and velocities at each vertex in the grid, leaving on the order of 10^7 dynamical degrees of freedom. The huge state dimension makes direct use of the Kalman filter intractable, since the associated covariance matrix is 10^7 × 10^7. To make the computation tractable, we need to significantly reduce the model size. In our approach, we use principal component analysis (PCA) (Jolliffe, 1986; Sirovich, 1987; Berkooz et al., 1993) to project the high-dimensional state space to a low-dimensional subspace of dimension ~20-60.

To deal with the strong non-linearity of the dynamics, we use the Sigma-point Kalman filter (SPKF) (van der Merwe and Wan, 2003, 2004), a deterministic sampling approach that has demonstrated success in highly nonlinear control and navigation problems. The Sigma-point filter requires that we propagate an ensemble of 2N + 1 states, where N is the dimension of the state vector that the Kalman filter operates on. This state vector is a time-delay embedding of the PCA subspace projection. For the problem considered here, N is several hundred. Efficient propagation of such large state ensembles demands an ultra-fast evaluation of the flow. Towards this end, we developed a neural network emulator, which we also refer to as a surrogate, of the dynamics of the system. For the CORIE domain, the emulator evaluates the forward dynamics over 1000 times faster than the ELCIRC circulation code (van der Merwe et al., 2007).

The combination of dimension reduction, neural network surrogate and Sigma-point Kalman filter (SPKF) is well suited to very large, non-linear models like CORIE, and is unique in data assimilation practice. It is straightforward and comparatively quick to develop data assimilation for new systems using our framework. One need only fit the dimensionality reduction, train the neural network emulator (van der Merwe et al., 2007) from circulation simulations of the system, and develop the noise models for the Kalman filter. In our experience, the development for new problems takes roughly one to three months.

Previous work has been done on overcoming the burden of a high-dimensional state space in data assimilation. Much of it is concentrated on reducing the covariance matrix, realized either by an explicit low-rank approximation of the error covariance matrix (Pham et al., 1998; Hoteit et al., 2002, 2001; Verlaan and Heemink, 2001), or by representing error statistics with a very small ensemble of states (which results in a low-rank covariance) (Evensen, 2003, 2002; Heemink et al., 2001). All these models use the full state space and typically require propagating multiple states through the dynamics during each Kalman filter iteration. This makes these techniques prohibitively expensive for problems of the scale of the CORIE domain. The idea of directly working on a reduced model has been pursued by several authors (Cane et al., 1996; Hoteit and Pham, 2003) prior to our work, using the full model to propagate the average forecast, and a reduced model to propagate the error covariance. (Hoteit et al. (2002) use a linear autoregressive (AR) model to propagate the error covariance, while Cane et al. (1996) use a matrix to project linear dynamics to a lower-dimensional subspace.) Neither of these techniques is adequate to capture the dynamics of a large, highly-nonlinear estuary model. Instead, we train a neural net to mimic the non-linear dynamics in the dimension-reduced system. Although neural nets have long been used as fast surrogates for complex physical systems (Principe et al., 1992; Grzeszczuk et al., 1998; Bishop et al., 1995; Krasnopolsky and Schiller, 2003; Krasnopolsky and Chevallier, 2007), they have not been integrated into data assimilation systems. In previous data assimilation using the Kalman filter, non-linear dynamics is usually handled either by local linearization as in the extended Kalman filter (EKF) (Chui and Chen, 1999; Hoteit and Pham, 2003), or by random sampling as in the Ensemble Kalman filter (EnKF) (Evensen, 2003, 2002).
It is known (van der Merwe, 2004; van der Merwe and Wan, 2003, 2004) that the SPKF can more accurately propagate the statistics of the distribution through the nonlinear dynamics than the EKF, yet it has the same computational complexity. In contrast to the EnKF, the SPKF employs a principled deterministic sampling strategy designed to accurately capture the evolution of the first two moments of the probability distribution. The sampling design enables the SPKF

to achieve good performance with a relatively small number of ensemble states. Although the SPKF is widely known for its success in control and navigation, this is its first application to data assimilation.

Another contribution of this paper is to give a rigorous probabilistic framework for data assimilation in a low-dimensional subspace. This probabilistic interpretation allows us to rigorously analyze the different sources of error. For example, we can effectively estimate the additional observation error induced by the dimensionality reduction, which is consistent with the one given by Cane et al. (1996).

The paper is organized as follows. In section 2, we introduce the probabilistic foundation for model reduction. In section 3, we give the framework of data assimilation with the Sigma-point Kalman filter. In section 4, we apply our approach to an estuary benchmark problem. Section 5 summarizes the work and points out directions for future research.

2 Model Dimensionality Reduction

In this section, we discuss the model dimensionality reduction for the dynamic system used in data assimilation. In sections 2.1 and 2.2, we introduce the dimension reduction technique based on principal component analysis (PCA) and its probabilistic interpretation. In section 2.3, we analyze the dynamics of the reduced model. In section 2.4, we give the observation form for the reduced model.

2.1 Principal Component Analysis

Principal component analysis (PCA) is a classical dimensionality reduction technique (Jolliffe, 1986; Sirovich, 1987; Berkooz et al., 1993). Let X = R^m denote the state space of the full numerical model, e.g., for CORIE m ≈ 10^7. The basic idea is to take a set of snapshots X = {x^1, x^2, ..., x^N}, x^i ∈ X, randomly sampled from the trajectory of the numerical model, and find the first d principal components of X. (In this paper, a number used as a superscript is an index of elements in a finite set; a number used as a subscript is an index of time step, as will be defined in this section.) We are assuming that the full set of dynamical degrees of freedom in the original numerical simulation has significant redundancy. PCA identifies linear redundancy (correlations) and allows us to form a significantly smaller representation of the state. For the experiments reported here, we retain enough principal components to capture over 98% of the total variance in the full numerical simulation.

For simplicity, we assume the origin has been translated so that samples in X have zero mean: x̄ = (1/N) Σ_{i=1}^{N} x^i = 0. Let Σ be the covariance matrix estimated from the sample X. The d covariance matrix eigenvectors corresponding to the largest eigenvalues are {φ^1, φ^2, ..., φ^d}. (The vectors are labeled in order of decreasing eigenvalues. In practice, we implement PCA using singular value decomposition (SVD) of the matrix X of state snapshots; this yields the same set of eigenvectors but is numerically better-conditioned than direct diagonalization of the covariance matrix.) Let X_S denote the d-dimensional PCA subspace spanned by the d vectors: X_S = span{φ^1, φ^2, ..., φ^d}.


The dimensionality reduction is achieved by the mapping Π : R^m → R^d,

    Πx = [ (φ^1)^T (x − x̄) ;  (φ^2)^T (x − x̄) ;  ⋯ ;  (φ^d)^T (x − x̄) ],    (1)

where Π = [φ^1 φ^2 ⋯ φ^d]^T is a d × m matrix.
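To make the projection concrete, the short sketch below computes a PCA basis from a matrix of snapshots using the SVD route described above and applies the mapping Π of equation (1). It is an illustrative sketch only (our own function names, a toy snapshot matrix), not the CORIE implementation.

```python
import numpy as np

def fit_pca(X, d):
    """Fit a d-dimensional PCA basis from snapshots X (N x m) via SVD.

    Returns the sample mean x_bar (m,) and the projection matrix Pi (d x m)
    whose rows are the leading eigenvectors of the snapshot covariance.
    """
    x_bar = X.mean(axis=0)
    # Rows of Vt are eigenvectors of the covariance, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    return x_bar, Vt[:d]

def project(Pi, x_bar, x):
    """Equation (1): map a full state x (m,) to subspace coordinates (d,)."""
    return Pi @ (x - x_bar)

def reconstruct(Pi, x_bar, x_pi):
    """Re-embed subspace coordinates into the full m-dimensional space."""
    return Pi.T @ x_pi + x_bar

# Toy usage with random snapshots standing in for circulation-model output.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))          # N = 200 snapshots, m = 500 (tiny demo)
x_bar, Pi = fit_pca(X, d=20)
x_rec = reconstruct(Pi, x_bar, project(Pi, x_bar, X[0]))
```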

2.2 Latent Space Formulation

The probabilistic interpretation of PCA proposed by Tipping and Bishop (1997) provides a generative model for the data, and a rigorous interpretation of the subspace X_S and the principal component projection. We start with a d-dimensional latent space S (with zero-mean distribution) which is mapped to X_S through the linear transform

    x^s = W s + µ,    (2)

where W is an m × d matrix (of rank d) and µ ∈ R^m is the mean of the distribution in X_S. The actual observations in X are obtained by adding zero-mean, spherical Gaussian noise ε ∼ N(0, σ² I_{m×m}) to x^s:

    x = W s + µ + ε,   s ∈ S.    (3)

We can represent x^s using the coordinates of X_S, implemented using the mapping Π defined from the PCA:

    x^π ≐ Π x^s = ΠW s + Πµ.    (4)

Note in equations (2) and (4), x^s (∈ R^m) and x^π (∈ R^d) are two different representations of the same point. The d-dimensional vector x^π is re-embedded into the full m-dimensional space by the transpose of Π: Π^T x^π ∈ R^m. The projection operator P_Π = Π^T Π maps any vector x ∈ X orthogonally to X_S. In the remainder of the paper, the d-dimensional vector x^π will be referred to as the subspace state, as opposed to full-space states, which we will use to denote the m-dimensional states in X, such as x or x^s.

Figure 2 shows the relation between x, x^s and Πx in a three-dimensional illustration. We note that the addition of noise ε in equation (3) moves x^s both parallel to X_S and orthogonal to it. The orthogonal component accounts for the reconstruction error in approximating the state x by its projection P_Π x on X_S. The parallel component generates a difference between x^π and Πx (assuming x and x^π are associated with the same s in equations (3) and (4)). Indeed, from equations (2), (3), and (4),

    Πx − x^π = Π(x − x^s) = Πε.    (5)

In practice, the state x^π is unknown; all that is available is Πx, which we use as an approximation. The quality of this approximation can be measured by its expected squared error

    E_{x^π,x}(||Πx − x^π||²) = E_ε(||Πε||²) = σ² d.    (6)

In sections 2.4 and 3.1, we will see that this approximation introduces some extra error. However, as will be shown in section 2.4, this error is negligible due to the extremely small value of σ².

Figure 2: Geometric illustration of the variables x, x^s, x^π and Πx. (a) Latent space S and a latent variable s. (b) Full-state space X and images of s after the mappings in equations (2) and (3). (c) PCA subspace X_S. Note the variables in (c) are subspace states.

2.3 Dynamics in Subspace

Our central assumption is that the dynamics can be captured in a much lower-dimensional space than required by the full numerical model. A priori we do not have sufficient information to suggest an appropriate dimension. In practice, the dimensionality of the PCA subspace is chosen somewhat arbitrarily to mediate the tradeoff between dimensionality and the amount of total variance retained in the representation. The dynamics may require a larger embedding space, and we accommodate this by constructing a time-lag embedding of the PCA subspace. This sequence of time-lag copies of the subspace is the domain in which our neural network surrogate and Kalman filter operate.

We consider the evolution of the system state at discrete times x_k ≐ x(t_0 + kτ), where t_0 is the origin of the time axis and τ is the time interval used in data assimilation. (Note τ can be different from the “internal” time interval used by the full numeric code.) In the remainder of the paper, we will use subscripts to index the time step. As noted above, we construct a sequence of time-lagged subspace states for the dynamics, so that the state s_k is determined by the state s_{k−1} and the driving force u_{k−1}, and their historical values up to T^x time steps in the past:

    s_k = f_L(s_{k−1}, ⋯, s_{k−T^x}, u_{k−1}, ⋯, u_{k−T^x}),    (7)

where s_k and u_k are the latent space state and driving force at discrete time index k, and f_L is a non-linear function modeling the dynamics. The isomorphism between S and X_S is given in equation (4), which can be inverted (it is easy to prove that ΠW is a non-singular d × d matrix):

    s = (ΠW)^{−1} (x^π − Πµ).    (8)

Using equations (7) and (8), we get the corresponding dynamics in X_S as follows:

    x^π_k = ΠW f_L((ΠW)^{−1}(x^π_{k−1} − Πµ), ⋯, (ΠW)^{−1}(x^π_{k−T^x} − Πµ), u_{k−1}, ⋯, u_{k−T^x}) + Πµ,    (9)

which can be rewritten in the following concise form:

    x^π_k = f(x^π_{k−1}, ⋯, x^π_{k−T^x}, u_{k−1}, ⋯, u_{k−T^x}).    (10)

Equation (10) gives the dynamics in the subspace XS . As mentioned earlier, we will use a neural network surrogate to mimic the dynamic system in equation (10), which will be discussed in section 3.1.
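As a small illustration of the time-lag embedding behind equation (10), the sketch below stacks the lagged subspace states and forcings into the input that the reduced dynamics map f (and, later, the surrogate) consumes. The stand-in map `f_reduced` and all dimensions are hypothetical placeholders.

```python
import numpy as np

def lagged_input(x_pi_hist, u_hist, k, T_x):
    """Stack x^pi_{k-1},...,x^pi_{k-T_x} and u_{k-1},...,u_{k-T_x} (newest first).

    x_pi_hist: (K, d) array of subspace states; u_hist: (K, p) array of forcings.
    """
    xs = [x_pi_hist[k - j] for j in range(1, T_x + 1)]
    us = [u_hist[k - j] for j in range(1, T_x + 1)]
    return np.concatenate(xs + us)

# A placeholder linear map standing in for the reduced dynamics f of equation (10).
d, p, T_x = 20, 2, 12
rng = np.random.default_rng(1)
A = 0.01 * rng.normal(size=(d, T_x * (d + p)))
f_reduced = lambda z: A @ z

x_pi_hist = rng.normal(size=(100, d))
u_hist = rng.normal(size=(100, p))
x_pi_next = f_reduced(lagged_input(x_pi_hist, u_hist, k=50, T_x=T_x))
```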

Figure 3: The dynamics in latent space S is isomorphically mapped to XS .

2.4 Observation Noise for the Reduced Model

2.4.1 Observation Form

Our observations usually come from sensors at stations or from cruise data, both of which can be formulated as

    y_k = H x_k + w^m_k,    (11)

where H is a linear operator and w^m_k is the observation noise at time step k. Equation (11) is not directly applicable in Kalman filtering, since x_k ∈ X is the full-space state while the dynamics expressed in equation (10) is in the subspace coordinates. We need to find the observation operator that relates subspace states to observations. We first notice that

    x_k = x^s_k + ε_k    (12)
        = Π^T x^π_k + ε_k.    (13)

Combining equation (11) and equation (13), it is not hard to see

    y_k = H(Π^T x^π_k + ε_k) + w^m_k    (14)
        = HΠ^T x^π_k + (Hε_k + w^m_k)    (15)
        = HΠ^T x^π_k + w^o_k,    (16)

where w^o_k = Hε_k + w^m_k. Therefore, for the dynamics in X_S expressed in equation (10), the observation operator is HΠ^T, and the observation noise w^o_k is the sum of the measurement noise w^m_k and Hε_k. We assume w^m_k and ε_k are both stationary white Gaussian noise, independent of each other. We further assume the variance of w^m is known to be R.

2.4.2 Analysis and Estimation of Observation Noise

The observation noise w^o_k can be further decomposed into three parts:

    w^o_k = Hε_k + w^m_k    (17)
          = H(x_k − P_Π x_k) + H(P_Π x_k − x^s_k) + w^m_k,    (18)

where ε_k = x_k − x^s_k (see Figure 2) and P_Π ≡ Π^T Π. The first term H(x_k − P_Π x_k) comes from the reconstruction error x_k − P_Π x_k, which is the part of the noise ε_k orthogonal to X_S; the second term comes from the part of the noise ε_k that is within X_S. These two terms are independent of each other. Therefore the covariance of w^o consists of the following three parts:

    cov(w^o_k) = cov(H(x_k − P_Π x_k)) + cov(H(P_Π x_k − x^s_k)) + cov(w^m_k).    (19)

The first term of equation (19) is the covariance of the reconstruction error, which can be estimated from the sample used to generate the PCA, X = {x^1, x^2, ⋯, x^N}:

    cov(H(x_k − P_Π x_k)) ≈ H [ (1/N) Σ_{i=1}^{N} (P_Π x^i − x^i)(P_Π x^i − x^i)^T ] H^T.    (20)

The second term of equation (19) is

    cov(H(P_Π x_k − x^s_k)) = σ² H U [ I_{d×d}  0 ;  0  0 ] U^T H^T,    (21)

where U is an m × m orthogonal matrix with each column an eigenvector of Σ. (See Appendix for the derivation.) The maximum likelihood estimate of σ² from X is given by (Tipping and Bishop, 1997):

    σ̃²_{ML} = 1/(N(m − d)) Σ_{i=1}^{N} (P_Π x^i − x^i)^T (P_Π x^i − x^i).    (22)

Note that (1/N) Σ_{i=1}^{N} (P_Π x^i − x^i)^T (P_Π x^i − x^i) is the estimated variance lost in dimension reduction. For our problem the lost variance is usually less than 2% of the total variance. Considering that m is usually greater than 10^5, we have σ̃²_{ML} less than 0.0001% of the total variance. In practice, we thus neglect the second term in equation (19), and thus estimate the covariance of the observation noise w^o_k as

    cov(w^o_k) ≈ H [ (1/N) Σ_{i=1}^{N} (P_Π x^i − x^i)(P_Π x^i − x^i)^T ] H^T + R.    (23)

Correspondingly,

    w^o_k ≈ H(x_k − P_Π x_k) + w^m_k.    (24)

The square norm of the vector (x_k − P_Π x_k) is the reconstruction error for x_k. It follows from equation (24) that when the reconstruction error is high there is a higher contribution to the observation noise.
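The estimate in equation (23) is straightforward to compute from the PCA snapshots; a minimal sketch follows. It assumes a generic linear observation operator H and measurement covariance R, and (as in the text) neglects the in-subspace noise term of equation (21).

```python
import numpy as np

def observation_noise_cov(X, x_bar, Pi, H, R):
    """Equation (23): cov(w^o) from the PCA reconstruction residuals plus R.

    X: (N, m) snapshots used for the PCA; Pi: (d, m); H: (p, m); R: (p, p).
    """
    Xc = X - x_bar
    residual = Xc - Xc @ Pi.T @ Pi        # x - P_Pi x for each (centered) snapshot
    Hr = residual @ H.T                   # H (x - P_Pi x), shape (N, p)
    return Hr.T @ Hr / X.shape[0] + R     # H [ (1/N) sum r r^T ] H^T + R
```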

3 Kalman Filtering for Data Assimilation

Based on the probabilistic dimension reduction model we proposed in section 2, we obtain a reduced dynamic model with manageable size and a proper observation noise form. In this section we build the framework of sequential data assimilation with the Kalman filter. First, in section 3.1, we discuss using neural networks as ultra-fast surrogates for the dynamics in the time-lag embedding of X_S. Then in section 3.2, we combine the results from the previous sections and give the complete Kalman filter equations. In section 3.3, we briefly introduce the Sigma-point Kalman filter technique that will be used in our data assimilation system. Finally in section 3.4, we discuss how to obtain data-assimilated state estimates from the Kalman filtering result.

3.1 Neural Network Surrogate

As mentioned in the introduction, our Sigma-point Kalman filter requires evaluating 2N + 1 ensemble states at each time step, with N the dimension of the Kalman filter state vector. For our systems, N is the dimension of the time-lag embedding of the dimension-reduced state, and is usually several hundred to over one thousand. It is prohibitively expensive to evaluate the full numerical model over this ensemble at each step of the data assimilation. Instead, we developed a neural network surrogate for the dynamics of the reduced system, more specifically, an approximator of the non-linear function f in equation (10). The trained neural network surrogate runs on the order of 1000 times faster than the numerical model (ELCIRC) (van der Merwe et al., 2007).

Our neural network surrogates are non-linear feed-forward multi-layer perceptrons (Bishop, 1995). Such networks are very well-equipped for modeling non-linear relations among high-dimensional inputs and outputs where large datasets are available for model fitting (or training). Where there is significant non-linearity, their performance exceeds that of traditional linear models such as the ARMA. In fact, for the prediction problem at hand, we found that linear predictors, fit using standard robust least-squares regression techniques, were inherently unstable, with poles lying outside the unit circle (van der Merwe et al., 2007). It is possible that those unstable components are sufficiently suppressed by the observations in the data assimilation process. If that is the case, as will be shown in our experiments in section 4, linear models could still be candidate surrogates and perform reasonably well; otherwise the instability will actually render the linear models useless (van der Merwe et al., 2007).

The neural network is a standard multi-layer perceptron (MLP) with a single hidden layer with hyperbolic tangent activation functions and a linear output layer. This standard structure is often used for general non-linear regression problems. The sizes of the network input and output layers are dictated by the dimension and embedding length of the subspace state variables and forcings. The size of the hidden layer can be set in order to control total model complexity (number of free parameters). Typically this ‘hyperparameter’ is set using some form of cross-validation. We chose not to constrain the size of the hidden layer severely, but rather control model complexity with weight-decay regularization (Bishop, 1995).

The training of the neural network is based on the full-space states {x_k} sampled from the trajectory of the numerical model. Since we do not know the corresponding true subspace states {x^π_k}, we use the dimension-reduced states {Πx_k} instead. As mentioned in section 2.2 (see Figure 2), Πx_k deviates from x^π_k by Πε_k.


The expected magnitude of this deviation can be measured as follows:

    E_{x_k, x^π_k}(||Πx_k − x^π_k||²) = E_{ε_k}(||Πε_k||²) = σ² d.

Since σ² is extremely small, as established in section 2.4, this deviation can be safely neglected. The neural network prediction is formulated as

    x^π_{k+1} = f_NN(x^π_k, ⋯, x^π_{k−T^x+1}, u_k, ⋯, u_{k−T^x+1}) + w^x_k,    (25)

where f_NN is the neural network predictor and w^x_k is the prediction error (process noise) at the kth step. Again we assume the process noise is white and w^x_k ∼ N(0, Q). The covariance matrix Q can be estimated from the residual of the surrogate predictions:

    Q̂ = (1/N_t) Σ_{k=1}^{N_t} (x̂^π_k − Πx_k)(x̂^π_k − Πx_k)^T,    (26)

where N_t is the number of time steps used for covariance matrix estimation, and x̂^π_k is the surrogate prediction of x^π_k:

    x̂^π_k = f_NN(Πx_{k−1}, ⋯, Πx_{k−T^x}, u_{k−1}, ⋯, u_{k−T^x}).    (27)

The prediction error may originate from different sources. First, there is the inherent error of the neural network in function approximation (Bishop, 1995). Second, our latent space assumption or time-lag embedding may be deficient: information important for state prediction may be lost. Finally, even if our assumption of the low-dimensional subspace is correct, the eigenvectors defining the true subspace will be somewhat different from the ones we estimate from the finite set of trajectory snapshots.
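Given any trained surrogate, the process-noise covariance of equations (26)-(27) can be estimated from one-step residuals as in the sketch below; `f_surrogate` here is a placeholder with an assumed calling convention (lagged states and forcings, newest first).

```python
import numpy as np

def estimate_Q(f_surrogate, x_pi_seq, u_seq, T_x):
    """Equations (26)-(27): process-noise covariance from one-step surrogate residuals."""
    residuals = []
    for k in range(T_x, len(x_pi_seq)):
        xs = [x_pi_seq[k - j] for j in range(1, T_x + 1)]   # Pi x_{k-1}, ..., Pi x_{k-T_x}
        us = [u_seq[k - j] for j in range(1, T_x + 1)]
        residuals.append(f_surrogate(xs, us) - x_pi_seq[k])
    E = np.asarray(residuals)
    return E.T @ E / len(E)                                  # (1/N_t) sum r r^T
```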

3.2 Kalman Filtering Equations

3.2.1 Errors in the Forcings

In modeling the flow dynamics, in addition to the process noise w^x in equation (25), we need to consider another important source of uncertainty: the error in measuring or modeling the driving force u. In this paper, we model this inaccuracy with a perturbed driving force ũ_k, formulated as follows:

    ũ_k ≐ u_k(v^1_k, v^2_k, ⋯, v^q_k),    (28)

where {v^1_k, v^2_k, ⋯, v^q_k} are q different sources of noise at time k. Note equation (28) is a fairly general form and it subsumes the noise in our benchmark problem (introduced in section 4.1) as a particular case. We generally assume each v^i is colored noise modeled as

    v^i_k = g_i(v^i_{k−1}, v^i_{k−2}, ⋯, v^i_{k−T^i}) + w^i_k,   i = 1, ⋯, q,    (29)

where w^i_k is the driving white noise for the ith colored noise at time step k. Without loss of generality, we assume T^i ≥ T^x, i = 1, 2, ⋯, q.

With the uncertainty in the driving force incorporated, the prediction and observation equations are

    x^π_k = f_NN(x^π_{k−1}, ⋯, x^π_{k−T^x}, ũ_{k−1}, ⋯, ũ_{k−T^x}) + w^x_{k−1}    (30)
    y_k = HΠ^T x^π_k + w^o_k.    (31)

The Kalman filter for data assimilation will be based on these two equations. To deal with the colored noise formulated in equation (29) and the time-delayed states in equation (30), we employ the Kalman filter formulation proposed by Gibson et al. (1991). We take an extended state vector:

    x_k = [(x^π_k)^T, ⋯, (x^π_{k−T^x+1})^T, v^1_k, ⋯, v^1_{k−T^1+1}, v^2_k, ⋯, v^2_{k−T^2+1}, ⋯, v^q_k, ⋯, v^q_{k−T^q+1}]^T.    (32)

The extended state vector consists of two parts: the current and past (up to T^x − 1 time steps earlier) subspace states x^π, and the current and past (up to T^i − 1 time steps earlier) noise from source i, for i = 1, ⋯, q. The length of the extended vector is d_E = T^x d + T^1 + ⋯ + T^q. Using the extended vector allows us to write a dynamic state-space model of the following form:

    x_k = f(x_{k−1}) + w^p_{k−1}    (33)
    y_k = H x_k + w^o_k.    (34)

The extended process noise vector w^p_k consists of the process noise for the subspace state prediction and the driving white noise for the colored noise models:

    w^p_k = [(w^x_k)^T, 0, ⋯, 0, w^1_k, 0, ⋯, 0, ⋯, w^q_k, 0, ⋯, 0]^T,    (35)

and the extended observation noise vector w^o_k is

    w^o_k = [(w^o_k)^T, 0, ⋯, 0]^T.    (36)

Note both the extended process noise w^p_k and the extended observation noise w^o_k are white Gaussian. Equation (33) is the dynamic equation for the Kalman filter. Written out in block form, the extended state evolves as

    [x^π_k; x^π_{k−1}; ⋯; x^π_{k−T^x+1}] = [f_u(x^π_{k−1}, ⋯, x^π_{k−T^x}, v^1_{k−1}, ⋯, v^1_{k−T^1}, ⋯, v^q_{k−1}, ⋯, v^q_{k−T^q}); x^π_{k−1}; ⋯; x^π_{k−T^x+1}] + [w^x_{k−1}; 0; ⋯; 0],
    [v^i_k; v^i_{k−1}; ⋯; v^i_{k−T^i+1}] = A^i [v^i_{k−1}; v^i_{k−2}; ⋯; v^i_{k−T^i}] + [w^i_{k−1}; 0; ⋯; 0],   i = 1, ⋯, q,    (37)

where

    f_u(x^π_{k−1}, ⋯, x^π_{k−T^x}, v^1_{k−1}, ⋯, v^1_{k−T^1}, ⋯, v^q_{k−1}, ⋯, v^q_{k−T^q}) = f_NN(x^π_{k−1}, ⋯, x^π_{k−T^x}, ũ_{k−1}, ⋯, ũ_{k−T^x}),    (38)

and each operator A^i is the model of the colored noise v^i,

    A^i = [ g_i(·) ; 1 0 ⋯ 0 ; 0 1 ⋯ 0 ; ⋯ ; 0 ⋯ 1 0 ],   i = 1, 2, ⋯, q,    (39)

whose first row applies g_i(·) to the lag vector and whose remaining rows simply shift the lagged values down by one step. Equation (34) is the observation equation, which expands into

    y_k = [HΠ^T  0  ⋯  0] [x^π_k; ⋯; x^π_{k−T^x+1}; v^1_k; ⋯; v^1_{k−T^1+1}; ⋯; v^q_k; ⋯; v^q_{k−T^q+1}] + w^o_k.    (40)
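For concreteness, the sketch below advances the extended state of equation (37) by one step for the simplest case of a single scalar colored-noise source with a linear g(·); it is a schematic of the companion-form bookkeeping, not the paper's implementation, and the additive process noise of equation (35) is left to the filter.

```python
import numpy as np

def propagate_extended(x_ext, f_u, ar_coeffs, d, T_x, T_v):
    """One step of the companion-form dynamics (37) for a single (q = 1) scalar noise source.

    x_ext is the extended state at time k-1, packing
    [x^pi_{k-1} ... x^pi_{k-T_x}, v_{k-1} ... v_{k-T_v}] (newest first).
    f_u maps (list of lagged subspace states, lagged noise values) to the new
    subspace state; ar_coeffs give a linear g(.).
    """
    states = [x_ext[j * d:(j + 1) * d] for j in range(T_x)]   # newest first
    v = x_ext[T_x * d:]                                       # newest first
    x_new = f_u(states, v)                                    # x^pi_k
    v_new = float(np.dot(ar_coeffs, v[:len(ar_coeffs)]))      # v_k
    # Shift histories: drop the oldest lag, prepend the new values.
    return np.concatenate([x_new] + states[:-1] + [np.array([v_new]), v[:-1]])
```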

3.3 Implementing the Sigma-point Kalman Filter

The data assimilation task is to find the minimum mean-square error (MMSE) estimate of the state x_k given the noisy observations {y_k, y_{k−1}, ⋯}, the dynamic model f(·) and the observation operator H. The Kalman equations provide this estimate:

    x̂_{k|k} = x̂_{k|k−1} + K_k (y_k − ŷ_{k|k−1})    (41)
    P^x_{k|k} = P^x_{k|k−1} − K_k P^y_{k|k−1} K_k^T,    (42)

where x̂_{k|k−1} is the optimal prediction of the state at time k conditioned on all of the observed information up to and including time k − 1, and ŷ_{k|k−1} is the optimal prediction of the observation at time k. The term P^x_{k|k−1} is the covariance of x̂_{k|k−1} and P^y_{k|k−1} is the covariance of ỹ_k = y_k − ŷ_{k|k−1}, termed the innovation. The optimal terms in this recursion are given by

    x̂_{k|k−1} = E[f(x̂_{k−1|k−1}) + w^p_{k−1}]    (43)
    ŷ_{k|k−1} = E[H x̂_{k|k−1} + w^o_k]    (44)
    K_k = P^{xy}_{k|k−1} (P^y_{k|k−1})^{−1}    (45)
        = E[(x_k − x̂_{k|k−1})(y_k − ŷ_{k|k−1})^T] E[(y_k − ŷ_{k|k−1})(y_k − ŷ_{k|k−1})^T]^{−1},    (46)

where the optimal prediction x̂_{k|k−1} corresponds to the expectation of a non-linear function of the random variables x̂_{k−1|k−1} and w^p_{k−1} (see equation (43)). A similar interpretation holds for the optimal prediction of the observation ŷ_{k|k−1} in equation (44). The optimal gain term K_k is expressed as a function of posterior covariance matrices in equations (45) and (46). Note these terms require taking expectations of a non-linear function of the prior state estimate random variables.

This recursion provides the optimal minimum mean-square error (MMSE) linear estimator of x_k assuming all the relevant random variables in the system can be efficiently and consistently modeled by maintaining their first and second order moments, i.e., they can be accurately modeled by Gaussian random variables (GRVs). We do not assume linearity of the system model f(·).

The Sigma-point Kalman filter (SPKF) (van der Merwe and Wan, 2003, 2004) addresses the estimation of the expectations in equations (43)-(46). In the SPKF, the state distribution is approximated with a Gaussian distribution, and is represented using a minimal set of carefully-chosen weighted sample points, called Sigma-points. The algorithm requires 2d_E + 1 such points, where d_E is the dimension of x_k as defined in Section 3.2. The Sigma-points completely capture the true mean and covariance of the Gaussian distribution, and when propagated through the non-linear dynamics, capture the posterior mean and covariance accurately to third order. The extended Kalman filter (EKF), in contrast, only achieves first-order accuracy. Also, the computational complexity of the SPKF is of the same order as that of the EKF. The program coding in this paper makes use of R. van der Merwe's ReBEL toolbox, which is available at http://choosh.csee.ogi.edu/rebel/.

The implementation of the SPKF for our system is summarized as follows. Consider propagating the random variable x_{k−1} through the nonlinear function f(·). According to our previous estimation step, the variable x_{k−1} has mean x̂_{k−1|k−1} and covariance P^x_{k−1|k−1}. To calculate the statistics of x_k, we form a set of 2d_E + 1 sigma-points {X^i_{k−1} : i = 0, ..., 2d_E}, where X^i_{k−1} ∈ R^{d_E}. The sigma-points are calculated by

    X^0_{k−1} = x̂_{k−1|k−1}    (47)
    X^i_{k−1} = x̂_{k−1|k−1} + λ (√(P^x_{k−1|k−1}))_i,   i = 1, ..., d_E    (48)
    X^i_{k−1} = x̂_{k−1|k−1} − λ (√(P^x_{k−1|k−1}))_i,   i = d_E + 1, ..., 2d_E    (49)

where λ is a scalar scaling factor that determines the spread of the sigma-points around x̂_{k−1|k−1}, and (√(P^x_{k−1|k−1}))_i indicates the ith column of the matrix square-root of the covariance matrix P^x_{k−1|k−1}. (For information on determining appropriate values of λ, see van der Merwe and Wan, 2003, 2004.) Once the sigma-points are calculated from the prior statistics, they are propagated through the non-linear function,

    X^i_{k|k−1} = f(X^i_{k−1}),   i = 0, ..., 2d_E.    (50)

The mean of x_k (before observation) is approximated using a weighted sample mean

    x̂_{k|k−1} ≈ Σ_{i=0}^{2d_E} w^m_i X^i_{k|k−1},    (51)

and the covariance of x_k (before observation) is approximated as

    P^x_{k|k−1} ≈ Σ_{i=0}^{2d_E} Σ_{j=0}^{2d_E} w^c_{ij} X^i_{k|k−1} (X^j_{k|k−1})^T + P^p_{k−1},    (52)

where P^p_{k−1} is the known covariance matrix of the additive process noise w^p_{k−1}, and the coefficients w^m_i and w^c_{ij} are non-negative scalar weights. The sigma points for the observation are

    Y^i_{k|k−1} = H X^i_{k|k−1},   i = 0, ..., 2d_E,    (53)

where H is the observation operator (as in equation (34)). The terms ŷ_{k|k−1}, P^y_{k|k−1}, and P^{xy}_{k|k−1} (as they appear in equations (42)-(46)) can be approximated as:

    ŷ_{k|k−1} ≈ Σ_{i=0}^{2d_E} w^m_i Y^i_{k|k−1}    (54)
    P^y_{k|k−1} ≈ Σ_{i=0}^{2d_E} Σ_{j=0}^{2d_E} w^c_{ij} Y^i_{k|k−1} (Y^j_{k|k−1})^T + P^o_k    (55)
    P^{xy}_{k|k−1} ≈ Σ_{i=0}^{2d_E} Σ_{j=0}^{2d_E} w^c_{ij} X^i_{k|k−1} (Y^j_{k|k−1})^T,    (56)

where P^o_k is the known covariance matrix of the additive observation noise w^o_k. Van der Merwe and Wan (van der Merwe and Wan, 2003, 2004) discuss non-additive observation noise. The values for the weights w^c_{ij} and the scaling factor λ are determined by the specific SPKF adopted (van der Merwe, 2004). In this paper, we use the square-root central-difference filter.
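A minimal sketch of the sigma-point construction and the weighted-moment approximations in equations (47)-(56) is given below. For simplicity it uses a symmetric set with equal weights 1/(2 d_E) on the non-central points (which reproduces the prior mean and covariance exactly) and the deviation form of the covariance sums; the square-root central-difference filter actually used in the paper has its own tuned weights and square-root updates (van der Merwe, 2004).

```python
import numpy as np

def sigma_points(mean, cov, lam=None):
    """Symmetric sigma-point set around `mean`, as in equations (47)-(49)."""
    dE = mean.size
    lam = np.sqrt(dE) if lam is None else lam
    S = np.linalg.cholesky(cov)                 # one choice of matrix square root
    pts = [mean] \
        + [mean + lam * S[:, i] for i in range(dE)] \
        + [mean - lam * S[:, i] for i in range(dE)]
    w = np.full(2 * dE + 1, 1.0 / (2 * dE))
    w[0] = 0.0                                  # simple weighting; not the SR-CDKF weights
    return np.array(pts), w

def unscented_moments(points, weights, g, noise_cov):
    """Propagate sigma points through g and form the weighted mean and covariance
    (the pattern of equations (50)-(52) and (53)-(55))."""
    Y = np.array([g(p) for p in points])
    mean = weights @ Y
    dev = Y - mean
    cov = dev.T @ (weights[:, None] * dev) + noise_cov
    return Y, mean, cov

# Usage pattern: a time update with the dynamics f, then the same machinery with
# the observation operator H to obtain y_hat and P^y for the measurement update.
```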

3.4 Estimation of Full State

We can obtain two estimates of x^π_k from the KF analysis states: the “0-lag” estimate

    x̂^π_{k|k} = [I_{d×d}, 0_{d×(d_E−d)}] x̂_{k|k},    (57)

where d_E is the dimension of the extended state, and the “full-lag” estimate

    x̂^π_{k|k+T^x−1} = [0_{d×(T^x−1)d}, I_{d×d}, 0_{d×(T^1+⋯+T^q)}] x̂_{k+T^x−1|k+T^x−1}.    (58)

From equations (57) and (58), we see that the 0-lag estimate x̂^π_{k|k} has observations up to time step k incorporated, while the full-lag estimate x̂^π_{k|k+T^x−1} is the optimal estimate after the future observations {y_{k+T^x−1}, y_{k+T^x−2}, ⋯, y_{k+1}} are available. Hence the full-lag estimate x̂^π_{k|k+T^x−1} is the result of fixed-lag Kalman smoothing with lag equal to T^x − 1.

To evaluate our data assimilation, we calculate the difference between the 0-lag full state estimate x̂_{k|k} and the true full-space state x_k. The residual can be decomposed into two orthogonal parts:

    x_k − x̂_{k|k} = (x_k − Π^T Π x_k) ⊕ (Π^T Π x_k − Π^T x̂^π_{k|k}).    (59)

The first part (x_k − Π^T Π x_k) is orthogonal to X_S and therefore independent of the Kalman filtering result. The second part (Π^T Π x_k − Π^T x̂^π_{k|k}) is the difference between the estimated subspace state x̂^π_{k|k} and the projection of x_k on X_S. Accordingly, the square error between x_k and x̂_{k|k}, termed the DA error, can be viewed as the sum of two terms:

    ||x_k − x̂_{k|k}||² = ||x_k − Π^T Π x_k||² + ||Π x_k − x̂^π_{k|k}||².    (60)

The first term on the right side of equation (60) is the reconstruction error and the second term will be referred to as the subspace DA error. Since the reconstruction error does not depend on the Kalman filtering, it provides a natural lower bound on the DA error.
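The error decomposition of equation (60) is easy to compute when the true state is available (as it is for the benchmark below); a small sketch, with our own function name:

```python
import numpy as np

def da_error_decomposition(x_true, x_pi_hat, Pi):
    """Equation (60): DA error = reconstruction error + subspace DA error.

    x_true: full state (m,); x_pi_hat: analysis subspace state (d,); Pi: (d, m).
    Zero-mean (translated) coordinates are assumed, as in section 2.1.
    """
    recon_err = np.sum((x_true - Pi.T @ (Pi @ x_true)) ** 2)
    subspace_err = np.sum((Pi @ x_true - x_pi_hat) ** 2)
    return recon_err + subspace_err, recon_err, subspace_err
```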

4 Experiment on Benchmark Problem

4.1 Estuary Benchmark Description

In this section, we apply the data assimilation approach described in section 3 to a simulated estuary benchmark. Figure 4 shows the physical layout of the estuary benchmark simulation. The domain is 8 km long with an 8 km wide ocean mouth and a 2 km wide river inlet. The depth varies from 10 m at the ocean side to 5 m at the river inlet. The ELCIRC code uses a 3D grid with 84214 vertices. The variables considered in this model include the elevation, salinity and velocity on each vertex, so the total number of degrees of freedom is 174979.

Figure 4: Simple tidally-influenced estuary benchmark simulation with a constant river flux feeding the system. The upper panel is a bird's-eye view of the estuary with the gray scale denoting the depth. The lower panel is a sectional snapshot of the estuary, with color denoting the salinity.

In the simplified model, the tidal forcing is applied uniformly across the boundary on the ocean side. It has a simple periodic form:

    u^t(t_0 + kτ′) = A cos((2π/T)(t_0 + kτ′) + φ),    (61)

where t_0 is the origin of the time axis and τ′ = 90 seconds is the time interval used in ELCIRC. The river flux is a constant: u^r = B. To simulate uncertainty in the tidal amplitude A and phase φ, we perturb them by independent colored noise:

    Ã(t_0 + kτ′) = A + v^A(t_0 + kτ′)    (62)
    φ̃(t_0 + kτ′) = φ + v^φ(t_0 + kτ′).    (63)

The colored noise v^A and v^φ is generated using the following moving average (MA) model:

    v^A(t_0 + kτ′) = Σ_{i=0}^{n_A} α^A_i w^A(t_0 + (k − i)τ′)    (64)
    v^φ(t_0 + kτ′) = Σ_{i=0}^{n_φ} α^φ_i w^φ(t_0 + (k − i)τ′),    (65)

where w^A and w^φ are independent white Gaussian noise. The perturbed form of the tidal forcing is

    ũ^t(t_0 + kτ′) = Ã(t_0 + kτ′) cos((2π/T)(t_0 + kτ′) + φ̃(t_0 + kτ′)).    (66)
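A sketch of how such a perturbed forcing series can be generated from equations (61)-(66) is shown below. The MA coefficients, noise scales, tidal amplitude and period are illustrative placeholders, not the values used in the benchmark.

```python
import numpy as np

def perturbed_tide(A, phi_deg, T_period, n_steps, tau=90.0,
                   alpha_A=None, alpha_phi=None, seed=0):
    """Perturbed tidal forcing of equation (66), with MA colored noise (64)-(65)
    added to the amplitude (62) and phase (63)."""
    rng = np.random.default_rng(seed)
    alpha_A = np.full(40, 0.005) if alpha_A is None else np.asarray(alpha_A)
    alpha_phi = np.full(40, 0.5) if alpha_phi is None else np.asarray(alpha_phi)
    # MA filtering: v(k) = sum_i alpha_i * w(k - i), with w white Gaussian.
    vA = np.convolve(rng.normal(size=n_steps), alpha_A)[:n_steps]
    vphi = np.convolve(rng.normal(size=n_steps), alpha_phi)[:n_steps]
    t = tau * np.arange(n_steps)
    phase = np.deg2rad(phi_deg + vphi)
    return (A + vA) * np.cos(2.0 * np.pi * t / T_period + phase)

# Example: an M2-like tide (12.42 h period) sampled every 90 s for one day.
u_tilde = perturbed_tide(A=6.0, phi_deg=0.0, T_period=12.42 * 3600, n_steps=960)
```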

Figure 5 gives an instance of the perturbed amplitude and phase. Since the river flux is held constant through the simulation, the tidal forcing is the sole source of perturbation. In data assimilation, we consider the discretization in time with an interval τ = 40τ′ = 1 hour. The notation rule is the same as described in section 2.3, for example ũ^t_k = ũ^t(t_0 + kτ).

It is desirable to have samples of the system trajectory under a broad variety of driving forces, based on which we can have a better coverage of both the manifold and the dynamics. Intuitively this will directly benefit the model reduction and the training of the neural network surrogates. To increase the variety of driving forces, we can either collect system trajectories of longer duration, or create virtual examples of the system trajectory with artificially designed forcings. In the benchmark problem, we implement this idea by having multiple runs of the system with different perturbations of the tidal forcing. We have one ELCIRC run with noise-free tidal forcing (Run 0) and three runs with different instantiations of the colored noise on both amplitude and phase (Run 1, Run 2, and Run 3), as formulated through equations (62)-(66), for two weeks. (Actually, we have Run 0 and Run 3 for three weeks; the data from the third week do not enter the training phase, but the data assimilation runs from day 8 to day 20.) The noise is added starting from the first day of the second week (day 8), so the first-week simulations of all four runs are the same. We use Run 0 as the reference run, with the forcing in equation (61), as our best guess of the forcing prior to any observation of the state vector. We created simulations with different forcing perturbations (Run 1 and Run 2) and these, in combination with Run 0, are used to define the PCA and to train the neural network. Run 3 will be considered as the unknown ground truth with noisy observations available. The goal of data assimilation is thus to recover the states in Run 3 by incorporating the observations into the dynamics we learned from Run 0, Run 1 and Run 2.

Figure 5: An example of the amplitude and phase perturbed with colored noise.

Observations used in data assimilation are given by three stations located as shown in Figure 6. Each station provides the measurement of the elevation at its horizontal location, and the salinity and velocity of each vertex along the vertical. The number of valid observations is not constant, since the water level may drop below an observing node. The average number of valid observations over the whole simulation process is 50.

The PCA is computed on data collected from Runs 0, 1 and 2. We retain the first 20 principal components, which account for over 98% of the total variance. We train a linear (forced autoregressive) model (LM) surrogate with a 12-hour history (both for the state x^π and the force u). For the non-linear surrogate (MLP), we use a 24-hour history.

4.2 Kalman Filter Implementation

In data assimilation, we estimate not only the state x^π, but also the noise added to the amplitude and phase: v^A and v^φ. For the evolution of the subspace state, we have

    x^π_k = f_NN(x^π_{k−1}, ⋯, x^π_{k−T^x}, ũ^t_{k−1}, ⋯, ũ^t_{k−T^x}) + w^x_{k−1}.    (67)

Figure 6: Data assimilation stations and validation stations. The white crosses are the locations of the data assimilation stations; the white circles are the locations of the validation stations (see Section 4.3).

The generation of the colored noise on the amplitude and phase is approximated with an autoregressive (AR) model (note the noise is generated with an MA model and with a different time interval), as a special case of equation (29). The noise model is

    v^A_k = Σ_{i=1}^{T^A} a^A_i v^A_{k−i} + w^A_{k−1}    (68)
    v^φ_k = Σ_{i=1}^{T^φ} a^φ_i v^φ_{k−i} + w^φ_{k−1}.    (69)

The AR coefficients a^A_i and a^φ_i are estimated from the colored noise samples from Runs 1 and 2. As shown in Section 3.2, we can write the dynamic state-space model in the following concise form:

    x_k = f_k(x_{k−1}) + w^p_{k−1}    (70)
    y_k = H x_k + w^o_k,    (71)

with extended state vector

    x_k = [(x^π_k)^T, ⋯, (x^π_{k−T^x+1})^T, v^A_k, ⋯, v^A_{k−T^A+1}, v^φ_k, ⋯, v^φ_{k−T^φ+1}]^T,

and extended process noise vector

    w^p_k = [(w^x_k)^T, 0, ⋯, 0, w^A_k, 0, ⋯, 0, w^φ_k, 0, ⋯, 0]^T.

For the linear surrogate, the dimension of the Kalman filter state x is 264, so the SPKF needs 2N + 1 = 529 sigma-points; for the non-linear case, due to the longer history considered in the neural network predictor, we have a bigger state vector (528-dimensional) and thus more sigma-points (1057). (The number of time lags used in the surrogates was optimized for prediction accuracy.)

4.3 Tuning the Model Noise Covariance with Validation Stations

The covariance of the process noise w^x is estimated with equation (26). In reality, this estimate is often not the most appropriate description of the process noise. So instead of using the original Q̂, we use αQ̂ as the covariance matrix and try to tune the scaling factor α by cross-validation. For our benchmark problem, we add two observation stations, shown in Figure 6, to use for cross-validation. Data from these stations are not used directly in the data assimilation, but rather compared with the data assimilation field estimate; the scaling factor α is then adjusted to minimize the difference. Using H^V to denote the observation operator associated with those validation stations, the validation error at time k is defined as

    ||y^V_k − H^V Π^T x̂^π_{k|k}||²,

where y^V_k are the observations from the validation stations at time k. The value of α can then be tuned to achieve a smaller average validation error. While cross-validation is a valuable technique, it does not identify the value of α that leads to the lowest DA error over the entire domain. Hence, in discussing our results we include the DA error both with α determined from the validation stations, and with α determined by the DA error over the entire domain.
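The tuning loop itself is a one-dimensional search; a sketch is given below, where `run_filter` stands in for a full assimilation run with process-noise covariance αQ̂, and all names and shapes are hypothetical placeholders.

```python
import numpy as np

def tune_alpha(run_filter, alphas, H_V, Pi, y_V):
    """Pick the scaling factor alpha that minimizes the average validation error
    ||y^V_k - H^V Pi^T x^pi_{k|k}||^2 over the assimilation window."""
    obs_op = H_V @ Pi.T                               # validation-station operator in the subspace
    best_alpha, best_err = None, np.inf
    for alpha in alphas:
        x_pi_analysis = run_filter(alpha)             # (K, d) analysis subspace states
        resid = y_V - x_pi_analysis @ obs_op.T        # (K, p_V) validation residuals
        err = np.mean(np.sum(resid ** 2, axis=1))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err
```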

4.4 Data Assimilation Experiments

As discussed in section 3.1, we can choose either a linear or a non-linear (neural network) surrogate in data assimilation. The linear predictor has the marginal advantage of faster training and slightly faster propagation, but usually suffers from instability as a dynamic operator. The trained linear models often have multiple poles outside the unit circle. The corresponding unstable components, if not sufficiently suppressed by the observations during the analysis step, can lead to quickly diverging state estimates. In our experiments on the CORIE domain, we found that linear surrogates were unstable enough to render them useless for data assimilation (van der Merwe et al., 2007). In this benchmark problem, although the trained linear model is also inherently unstable, it does not lead to diverging state estimates and performs reasonably well. We therefore suspect that the observations have successfully suppressed the unstable components in the data assimilation process.

Another choice we need to make is the model of the noise added to A and φ. We may consider the correlation of the noise in the time domain and model it as colored noise. When the temporal correlation is not well justified or is hard to estimate, we can assume it to be white noise.

In Section 3.2, we provide the Kalman filter equations under the two different noise assumptions. The colored noise model is closer to the physical reality we simulated in the benchmark problem and, if modeled properly, should yield better data assimilation results. Next we compare four data assimilation settings with different types of surrogate and noise model.

As suggested in section 3.4, the data assimilation performance at time k can be evaluated by the DA error, which is defined as the square error between the true state x_k and the data-assimilated estimate x̂_{k|k} = Π^T x̂^π_{k|k}. As shown in equations (59) and (60), the DA error is the sum of two parts: the reconstruction error ||x_k − Π^T Π x_k||² and the subspace DA error ||Π x_k − x̂^π_{k|k}||². Since all the data assimilation models in this paper share the same PCA subspace X_S, they yield the same reconstruction error. Therefore it is enough to compare their subspace DA errors. We consider the subspace DA error averaged over day 8 to day 20 (the period over which the noise is applied),

    (1/(k_1 − k_0 + 1)) Σ_{k=k_0}^{k_1} ||Π x_k − x̂^π_{k|k}||²,

where k_0 and k_1 are, respectively, the times data assimilation starts and ends. As described in Section 4.3, for each data assimilation setting, we estimate two scaling factors: one based on the DA error over the entire domain, and one based on cross-validation.

Table 1 summarizes the performance of the different data assimilation settings. The non-linear surrogate gives a substantially smaller subspace DA error than the linear surrogate for both noise models. Moreover, the best scaling factor α according to the validation stations (= 1) and the true optimal (minimizing error over the entire domain) α (= 0.3) suggest a less radical adjustment of the original process noise covariance. The comparison of the noise model settings is more complicated. For the linear surrogate, the colored noise model is apparently better than the white noise model, while for non-linear models this superiority is marginal. We speculate that when using the non-linear surrogate, the Kalman filter relies less on the estimation of the colored noise than it does with the linear surrogate. To get a better understanding of this, we examine the noise estimation with the linear and non-linear surrogates. Figure 7 shows the estimated noise on both the amplitude and the phase with the two different types of surrogates with optimal α. (Like the subspace state x^π, we can get estimates of the noise with different fixed-lag smoothing; here we show the estimates with a 12-hour lag.) Apparently, the estimation given by the non-linear surrogate is significantly worse than that given by the linear surrogate.

4.5 Data Assimilation Results

In this section, we give a detailed analysis of the data assimilation result given by the non-linear surrogate and the colored noise model. First of all, we need to show that this data assimilation is actually better than our best guess without observations, the reference run (Run 0). To show this, we measure the square error between the reference run state and the true state at each time step k, ||x_k − x̃_k||², called the reference error, where x and x̃ stand respectively for the true state (from Run 3) and the reference state (from Run 0). The reference error thus provides a baseline for comparison with the data assimilation results.


method         α (x-validation)   mse (x-validation)   α (optimal)   mse (optimal)
LM (white)     ≈ 0                637.97 (0.0066)      ≈ 0           637.97 (0.0066)
LM (colored)   8 × 10^−3          335.35 (0.0035)      5 × 10^−4     281.85 (0.0029)
MLP (white)    1                  256.39 (0.0027)      0.3           229.86 (0.0024)
MLP (colored)  1                  256.29 (0.0027)      0.3           229.39 (0.0024)

Table 1: Comparison of different settings. The column labeled “α (x-validation)” is the scaling factor α minimizing the cross-validation error; the column “mse (x-validation)” is the corresponding subspace DA error. The column “α (optimal)” is the scaling factor α that minimizes the DA error over the entire domain; the column “mse (optimal)” is the corresponding subspace DA error. The numbers in parentheses are the normalized subspace error, i.e., the subspace DA error divided by the state variance.


Figure 7: Colored noise estimation with the linear and non-linear surrogates.

For evaluating the data assimilation performance at time k, we considered the DA error ||x_k − x̂_{k|k}||², the reconstruction error ||x_k − Π^T Π x_k||² and the subspace DA error ||Π x_k − x̂^π_{k|k}||². Figure 8 shows the time series of the reference error, the DA error, the subspace DA error and the reconstruction error. It is clear from Figure 8 that data assimilation achieves a significantly reduced error compared to the reference run. Indeed, the average reference error over the two-week run is 1665.2 while the DA error is 465.06. Another important observation is that the reconstruction error is a substantial part of the DA error. In fact, the average reconstruction error is 243.6, over half of the average DA error. Also, we notice that there is an apparent correlation between the reconstruction error and the DA error in the subspace. This is understandable since, as established in section 2.4, a higher reconstruction error at time k usually means larger observation noise w^o_k, and thus poorer recovery of the state x_k.

Figure 8: Time series of the data assimilation result. We show the square root of the errors. The upper panel shows the reference error (dotted line) and the DA error (solid line); the lower panel shows the subspace DA error (dotted line) and the reconstruction error (solid line).

Recalling that each full-space state consists of measurements of elevation, salinity and velocity, we find it meaningful to discuss the data assimilation results for the three variables separately. Taking salinity for example, we can simply compare the entries corresponding to salinity in x_k with those entries in x̂_{k|k}. The square difference will be called the DA error on salinity. In the same way, we can also calculate the reference error for the three variables. We find that data assimilation achieves a significant reduction of error on elevation and velocity, while the salinity measurement does not benefit much from the data assimilation. In fact, the averaged DA error on salinity (156.11) is about the same as the averaged reference error (158.96). The reason for this inutility may partially lie in the fact that we lost too much information on salinity in the dimension reduction. Around 1.7% of the variance of salinity is lost in dimension reduction, while for elevation and velocity the figure is 0.1% and 0.2% respectively. This high reconstruction error on salinity has two consequences. First, we may not be able to model the dynamics of salinity accurately in the subspace, compared to elevation and velocity. Second, the observations may not provide enough information on salinity due to the high observation noise on the salinity variables.

For comparison to the SPKF, we also tried the ensemble Kalman filter (EnKF) (Evensen, 2003), which is widely used in data assimilation for oceanography, and the particle filter (Arulampalam et al., 2002). Unlike previous work (Lisaeter et al., 2003; Evensen, 2002), in which the EnKF is used in the full space, we apply the EnKF in the subspace due to the extremely high dimension of the system state.


Figure 9: Time series of the DA error for different variables. We show the square root of the errors. On elevation, the averaged reference error is 91.14 and the averaged DA error is 6.03; on salinity, the averaged reference error is 108.01 and the averaged DA error is 110.56; on velocity, the averaged reference error is 1446.1 and the averaged DA error is 348.47.

We set the number of ensemble states to be the same as the number of states in the SPKF, so the two types of Kalman filter yield roughly the same computational burden. The data assimilation result given by the EnKF is usually slightly worse than that given by the SPKF. It is reasonable to assume this difference is due to the SPKF's better sampling strategy. Particle filters do not work well for this problem. The reason, we speculate, is the high dimensionality of the Kalman filter state (264 for the linear surrogate and 528 for the non-linear surrogate). Since each particle follows a very high-dimensional Gaussian distribution, the mass of the posterior probability can easily concentrate on one or two particles, which is what we observed in practice. This situation can be ameliorated by using a better sampling strategy, which is worth further exploration.


5 Conclusion and Future Research

We built and tested a framework for sequential data assimilation for nonlinear oceanographic systems based on the Kalman filter. Our approach starts with a greatly reduced model obtained via principal component analysis. The probabilistic latent state space interpretation of PCA leads us to a modified observation noise. We use a fast neural network surrogate trained to mimic the reduced model to propagate an ensemble of states for the Kalman filter. We employ the Sigma-point Kalman filter, a state-of-the-art technique, to handle the non-linearity in the system dynamics. The experiments on a benchmark estuary problem show that our data assimilation method can significantly reduce the error of the model prediction by incorporating observations from sparsely located stations.

We have also applied our technique to a realistic simulation of the Columbia River estuary (Frolov et al., 2007) and plume (Frolov et al., 2006). These are strongly nonlinear systems with a very large (10^7) state space and very many forcing degrees of freedom (10^6 − 10^7). Our results in the Columbia River estuary show that our data assimilation is able to reduce errors in simulated tides, salinity, and temperature. In the Columbia River plume, our data assimilation is able to correct both the size and the orientation of the plume relative to the baseline forecast.

Several important components of the current method can be further improved. First, in the model reduction phase we may want to use non-linear dimension reduction approaches to achieve a smaller reconstruction error with a fixed subspace dimension. The available techniques include the alignment of local PCA models (Teh and Roweis, 2003) and regularized principal manifolds (Smola et al., 2001), both of which can provide a coordinate system on the low-dimensional manifold. Second, there is no effort to preserve the dynamics of the original system. Along the same lines, one may be able to do better than PCA or its nonlinear extensions by designing a dimensionality reduction technique specifically aimed at reducing the surrogate prediction error, rather than maximizing the variance captured. Third, there has been no effort to ensure that the dimensionality reduction, the neural network surrogate, or the Kalman filter satisfy basic conservation laws. Incorporating mass conservation in the surrogates could be achieved through regularization. However, we see evidence for local salinity non-conservation in the Columbia River plume induced by the PCA projection. The Kalman filter disrupts salinity conservation in the CORIE estuary (Frolov et al., 2007). Ensuring mass conservation in the Kalman filter and in the dimensionality reduction are challenges for future development.

Acknowledgements

This work was funded by NSF grant OCI-0121475. We thank Joseph Zhang for helpful discussions. Some of the conceptual framework for our approach originated in SF's PhD thesis proposal; ZL and RvdM developed the theoretical details and the implementation of the subspace SPKF, and developed the linear and nonlinear surrogates.


References

Arulampalam, S., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2).
Baptista, A. (2006). The first decade of a coastal margin collaborative observatory. In Oceans 2006, Boston. MTS/IEEE.
Bennett, A. (1998). Inverse Modeling of the Ocean and Atmosphere. Cambridge University Press.
Berkooz, G., Holmes, P., and Lumley, J. (1993). The proper orthogonal decomposition in the analysis of turbulent flows. Annual Review of Fluid Mechanics, pages 539–575.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Bishop, C., Haynes, P., Smith, M., Todd, T., and Trotman, D. (1995). Real-time control of a tokamak plasma using neural networks. Neural Computation, 7(1):206–217.
Cane, M., Kaplan, A., Miller, R., Tang, B., Hackert, E., and Busalacchi, A. (1996). Mapping tropical Pacific sea level: Data assimilation via a reduced state space Kalman filter. Journal of Geophysical Research, 101(C10):22,599–22,617.
Chui, C. and Chen, G. (1999). Kalman Filtering with Real-Time Applications. Springer-Verlag.
Evensen, G. (2002). Sequential Data Assimilation for Nonlinear Dynamics: The Ensemble Kalman Filter. Springer-Verlag, Berlin Heidelberg.
Evensen, G. (2003). The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53:343–367.
Frolov, S., Baptista, A., Leen, T., Lu, Z., and van der Merwe, R. (2006). Assimilating in-situ measurements into a reduced-dimensionality model of an estuary-plume system. In EOS Transactions of the AGU, 87 (52), Fall Meeting Supplement, Abstract A31A–0846, San Francisco, CA.
Frolov, S., Baptista, A., Lu, Z., van der Merwe, R., and Leen, T. (2007). Fast data assimilation using a nonlinear Kalman filter and a model surrogate: An application to the Columbia River estuary. Ocean Modeling. In review.
Gibson, J., Koo, B., and Gray, S. (1991). Filtering of colored noise for speech enhancement and coding. IEEE Transactions on Signal Processing, 39(8).
Grzeszczuk, R., Terzopoulos, D., and Hinton, G. (1998). Fast neural network emulation of dynamical systems for computer animation. Advances in Neural Information Processing Systems, 11:882–888.


Heemink, A., Verlaan, M., and Segers, A. (2001). Variance reduced ensemble Kalman filter. Monthly Weather Review, 129:1718–1728.
Hoteit, I. and Pham, D. (2003). Evolution of reduced state space and data assimilation schemes based on the Kalman filter. Journal of the Meteorological Society of Japan, 81:21–39.
Hoteit, I., Pham, D., and Blum, J. (2001). A semi-evolutive partially local filter for data assimilation. Marine Pollution Bulletin, 43:164–174.
Hoteit, I., Pham, D., and Blum, J. (2002). A simplified reduced order Kalman filtering and application to altimetric data assimilation. Journal of Marine Systems, 36:101–127.
Jolliffe, I. (1986). Principal Component Analysis. Springer-Verlag.
Krasnopolsky, V. M. and Chevallier, F. (2007). Neural network emulations for complex multidimensional geophysical mappings: Applications of neural network techniques to atmospheric and oceanic satellite retrievals and numerical modeling. Reviews of Geophysics. In press.
Krasnopolsky, V. M. and Schiller, H. (2003). Some neural network applications in environmental sciences. Part I: Forward and inverse problems in geophysical remote measurements. Neural Networks, 16:321–334.
Lisaeter, K., Rosanova, J., and Evensen, G. (2003). Assimilation of ice concentration in a coupled ice-ocean model, using the ensemble Kalman filter. Ocean Dynamics, 53:368–388.
Pham, D., Verron, J., and Roubaud, M. (1998). A singular evolutive extended Kalman filter for data assimilation in oceanography. Journal of Marine Systems, 16:323–340.
Principe, C., Rathie, A., and Kuo, J. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic programming. International Journal of Bifurcation and Chaos, 2(4):989–996.
Sirovich, L. (1987). Turbulence and the dynamics of coherent structures. Part I: Coherent structures. Quarterly of Applied Mathematics, 45(3):561–571.
Smola, A., Mika, S., Schölkopf, B., and Williamson, R. (2001). Regularized principal manifolds. Journal of Machine Learning Research, 1:179–209.
Teh, Y. and Roweis, S. (2003). Automatic alignment of local representations. In Advances in Neural Information Processing Systems, volume 15.
Tipping, M. and Bishop, C. (1997). Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University.
van der Merwe, R. (2004). Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. PhD thesis, OGI School of Science & Engineering, OHSU.
van der Merwe, R., Leen, T. K., Lu, Z., Frolov, S., and Baptista, A. M. (2007). Fast neural network surrogates for very high dimensional physics-based models in computational oceanography. Neural Networks. In press.

van der Merwe, R. and Wan, E. (2003). Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In Proceedings of the Workshop on Advances in Machine Learning.
van der Merwe, R. and Wan, E. (2004). Sigma-point Kalman filters for integrated navigation. In Proceedings of the 60th Annual Meeting of The Institute of Navigation.
Verlaan, M. and Heemink, A. (2001). Tidal flow forecasting using reduced rank square root filters. Stochastic Hydrology and Hydraulics, 11:346–368.
Zhang, Y. and Baptista, A. (2005). A semi-implicit finite element ocean circulation model. Part I: Formulations and benchmarks. International Journal for Numerical Methods in Fluids.
Zhang, Y., Baptista, A., and Myers, E. (2004). A cross-scale model for 3D baroclinic circulation in estuary-plume-shelf systems: I. Formulations and skill assessment. Continental Shelf Research, 24:2187–2214.

Appendix: Estimation of $\mathrm{cov}(H\epsilon)$

Here we give a detailed derivation of equations (18)-(22) for estimating the covariance of the observation noise. From equation (3),
\[
\mathrm{cov}(\epsilon) = E_{x,x^s}\!\left[(x - x^s)(x - x^s)^T\right].
\]
Using the projection operator $P_\Pi$, we can rewrite
\begin{align*}
E_{x,x^s}\!\left[(x - x^s)(x - x^s)^T\right]
&= E_{x,x^s}\!\left[\big((x^s - P_\Pi x) + (P_\Pi x - x)\big)\big((x^s - P_\Pi x) + (P_\Pi x - x)\big)^T\right] \\
&= E_{x,x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right]
 + E_{x,x^s}\!\left[(x^s - P_\Pi x)(P_\Pi x - x)^T\right] \\
&\quad + E_{x,x^s}\!\left[(P_\Pi x - x)(x^s - P_\Pi x)^T\right]
 + E_{x,x^s}\!\left[(P_\Pi x - x)(P_\Pi x - x)^T\right].
\end{align*}

Proof that $E_{x,x^s}\!\left[(P_\Pi x - x)(x^s - P_\Pi x)^T\right] = 0$.

It is easy to see that
\begin{align*}
E_{x,x^s}\!\left[(P_\Pi x - x)(x^s - P_\Pi x)^T\right]
&= E_{x^s} E_{x|x^s}\!\left[(P_\Pi x - x)(x^s - P_\Pi x)^T\right] \\
&= E_{x^s}\!\int_X P(x|x^s)\,(P_\Pi x - x)(x^s - P_\Pi x)^T\,dx .
\end{align*}
Since $x^s - P_\Pi x \in X^s$ and $P_\Pi x - x \perp X^s$, there is an orthogonal matrix $U$ ($UU^T = I$) such that
\[
x^s - P_\Pi x = U\,[z_1,\ldots,z_d,0,\ldots,0]^T
\qquad\text{and}\qquad
P_\Pi x - x = U\,[0,\ldots,0,z_{d+1},\ldots,z_m]^T .
\]
Let $z = [z_1, z_2, \ldots, z_m]^T = U^T(x^s - x)$. Then
\[
\int_X P(x|x^s)\,(P_\Pi x - x)(x^s - P_\Pi x)^T\,dx
= U\!\left[\int_{\mathbb{R}^m} P(z)
\begin{pmatrix} 0 & 0 \\ B(z) & 0 \end{pmatrix} dz\right]\!U^T,
\qquad B(z)_{ij} = z_{d+i}\,z_j ,
\]
where
\[
P(x|x^s) = P(z) \propto \exp\!\left(-\frac{z_1^2 + z_2^2 + \cdots + z_m^2}{2\sigma^2}\right).
\]
It is easy to prove that
\[
\int_{\mathbb{R}^m} P(z)\,z_i z_j\,dz = 0 \quad\text{if } i \neq j .
\]
Since every entry of $B(z)$ is of this form, it follows that
\[
\int_X P(x|x^s)\,(P_\Pi x - x)(x^s - P_\Pi x)^T\,dx = 0 .
\]
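The vanishing of the cross term can also be checked numerically. The following Monte Carlo sketch is only a sanity check under assumed toy dimensions; the subspace, noise level, and sample count are arbitrary choices, not CORIE values.

    import numpy as np

    # Monte Carlo check that E[(P_Pi x - x)(x^s - P_Pi x)^T] = 0.
    m, d, n_samples, sigma = 30, 4, 200000, 0.5
    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.normal(size=(m, d)))      # orthonormal basis of X^s
    P = Q @ Q.T                                       # projection operator P_Pi

    xs = rng.normal(size=(n_samples, m)) @ P.T        # reference states x^s in X^s
    x = xs + sigma * rng.normal(size=(n_samples, m))  # x | x^s, isotropic Gaussian

    a = x @ P.T - x                                   # P_Pi x - x (orthogonal to X^s)
    b = xs - x @ P.T                                  # x^s - P_Pi x (inside X^s)
    cross = a.T @ b / n_samples
    print(np.abs(cross).max())                        # small, shrinking as n_samples grows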

So we have
\[
E_{x,x^s}\!\left[(x^s - x)(x^s - x)^T\right]
= E_{x,x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right]
+ E_{x,x^s}\!\left[(P_\Pi x - x)(P_\Pi x - x)^T\right].
\]

First term: $E_{x,x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right]$.

\begin{align*}
E_{x,x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right]
&= E_{x^s} E_{x|x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right] \\
&= E_{x^s}\!\int_X P(x|x^s)\,(x^s - P_\Pi x)(x^s - P_\Pi x)^T\,dx .
\end{align*}
Using the same orthogonal transformation introduced previously, $x^s - P_\Pi x = U\,[z_1,\ldots,z_d,0,\ldots,0]^T$, we have
\[
\int_X P(x|x^s)\,(x^s - P_\Pi x)(x^s - P_\Pi x)^T\,dx
= U\!\left[\int_{\mathbb{R}^m} P(z)
\begin{pmatrix} \tilde{z}\tilde{z}^T & 0 \\ 0 & 0 \end{pmatrix} dz\right]\!U^T
= U
\begin{pmatrix} \sigma^2 I_{d\times d} & 0 \\ 0 & 0 \end{pmatrix}
U^T,
\]
where $\tilde{z} = [z_1,\ldots,z_d]^T$. We then have
\[
E_{x,x^s}\!\left[(x^s - P_\Pi x)(x^s - P_\Pi x)^T\right]
= E_{x^s}\!\left[\,U
\begin{pmatrix} \sigma^2 I_{d\times d} & 0 \\ 0 & 0 \end{pmatrix}
U^T\right]
= U
\begin{pmatrix} \sigma^2 I_{d\times d} & 0 \\ 0 & 0 \end{pmatrix}
U^T .
\]

Second term: $E_{x,x^s}\!\left[(P_\Pi x - x)(P_\Pi x - x)^T\right]$.

It is obvious that
\[
E_{x,x^s}\!\left[(P_\Pi x - x)(P_\Pi x - x)^T\right]
= E_x\!\left[(P_\Pi x - x)(P_\Pi x - x)^T\right]
\approx \frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T .
\]
We then have
\begin{equation}
\mathrm{cov}(\epsilon) \approx
U
\begin{pmatrix} \sigma^2 I_{d\times d} & 0 \\ 0 & 0 \end{pmatrix}
U^T
+ \frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T .
\tag{72}
\end{equation}
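For clarity, the block structure of equation (72) can be assembled explicitly. The sketch below uses toy sizes and random stand-ins for $U$, $\sigma^2$, and the projection residuals; it only illustrates how the two terms combine, not how they are obtained.

    import numpy as np

    m, d, N = 30, 4, 500
    rng = np.random.default_rng(2)
    U, _ = np.linalg.qr(rng.normal(size=(m, m)))      # stand-in orthogonal matrix
    residuals = rng.normal(scale=0.1, size=(N, m))    # stand-ins for P_Pi x^i - x^i
    sigma2 = 0.05                                     # stand-in for sigma^2

    # First term: U diag(sigma^2 I_d, 0) U^T.
    D = np.zeros((m, m))
    D[:d, :d] = sigma2 * np.eye(d)
    term1 = U @ D @ U.T

    # Second term: empirical covariance of the projection residuals.
    term2 = residuals.T @ residuals / N

    cov_eps = term1 + term2                           # equation (72)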

From equations (13) and (14), the covariance of the observation noise is
\begin{equation}
\mathrm{cov}(w^o_t) \approx
R + H\,U
\begin{pmatrix} \sigma^2 I_{d\times d} & 0 \\ 0 & 0 \end{pmatrix}
U^T H^T
+ H\!\left[\frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T\right]\!H^T ,
\tag{73}
\end{equation}
where
\[
H\!\left[\frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T\right]\!H^T
= \frac{1}{N}\sum_{i=1}^{N}(H P_\Pi x^i - H x^i)(H P_\Pi x^i - H x^i)^T
\]
is very easy to calculate.

Estimating $\sigma$

We know
\begin{align*}
E_{x,x^s}\!\left[(P_\Pi x - x)^T (P_\Pi x - x)\right]
&= E_{x^s} E_{x|x^s}\!\left[(P_\Pi x - x)^T (P_\Pi x - x)\right] \\
&= E_{x^s}\!\int_X P(x|x^s)\,(x - P_\Pi x)^T (x - P_\Pi x)\,dx .
\end{align*}
Using the same orthogonal transformation introduced previously, $P_\Pi x - x = U\,[0,\ldots,0,z_{d+1},\ldots,z_m]^T$, we have
\begin{align*}
\int_X P(x|x^s)\,(P_\Pi x - x)^T (P_\Pi x - x)\,dx
&= \int_{\mathbb{R}^m} P(z)\,[0,\ldots,0,z_{d+1},\ldots,z_m]\,U^T U\,[0,\ldots,0,z_{d+1},\ldots,z_m]^T\,dz \\
&= \int_{\mathbb{R}^m} P(z)\,(z_{d+1}^2 + z_{d+2}^2 + \cdots + z_m^2)\,dz \\
&= (m - d)\,\sigma^2 .
\end{align*}
We then have
\[
E_{x,x^s}\!\left[(P_\Pi x - x)^T (P_\Pi x - x)\right]
= E_{x^s}\!\int_X P(x|x^s)\,(x - P_\Pi x)^T (x - P_\Pi x)\,dx
= E_{x^s}\!\left[(m - d)\,\sigma^2\right] = (m - d)\,\sigma^2 .
\]
On the other hand,
\[
E_{x,x^s}\!\left[(P_\Pi x - x)^T (P_\Pi x - x)\right]
\approx \frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)^T (P_\Pi x^i - x^i) .
\]
We then obtain the estimate of $\sigma$:
\[
\tilde{\sigma}^2 = \frac{1}{N(m - d)}\sum_{i=1}^{N}(P_\Pi x^i - x^i)^T (P_\Pi x^i - x^i) .
\]
In our case, $m - d$ is a huge number ($\approx 2 \times 10^5$). As a result,
\[
\tilde{\sigma}^2 = \frac{1}{N(m - d)}\sum_{i=1}^{N}(P_\Pi x^i - x^i)^T (P_\Pi x^i - x^i) \approx 0 .
\]
In practice, we use the approximation
\begin{equation}
\mathrm{cov}(w^o_t) \approx R + H\!\left[\frac{1}{N}\sum_{i=1}^{N}(P_\Pi x^i - x^i)(P_\Pi x^i - x^i)^T\right]\!H^T .
\tag{74}
\end{equation}
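A minimal numerical sketch of the practical approximation (74) is given below, assuming toy dimensions, a selection-matrix form for $H$, and a diagonal instrument-noise covariance $R$; none of these choices are the CORIE configuration.

    import numpy as np

    m, d, N, k = 50, 5, 200, 4                       # toy sizes (not CORIE values)
    rng = np.random.default_rng(3)
    X = rng.normal(size=(N, m))                      # snapshots x^i as rows

    # PCA projection P_Pi onto the leading d principal directions of the snapshots.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Ud = Vt[:d].T
    proj = mean + (X - mean) @ (Ud @ Ud.T)           # rows are P_Pi x^i
    residuals = proj - X                             # rows are P_Pi x^i - x^i

    # H selects k observed state components; R is the instrument-noise covariance.
    obs_idx = rng.choice(m, size=k, replace=False)
    H = np.zeros((k, m))
    H[np.arange(k), obs_idx] = 1.0
    R = 0.01 * np.eye(k)

    # Equation (74): cov(w_t^o) ~ R + (1/N) sum_i (H P_Pi x^i - H x^i)(H P_Pi x^i - H x^i)^T
    Hr = residuals @ H.T
    cov_w_obs = R + Hr.T @ Hr / N                    # k x k matrix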