11-755 Machine Learning for Signal Processing

Prediction and Estimation, Part II

Class 18. 1 Nov 2012

Recap: An automotive example

Determine automatically, by only listening to a running automobile, if it is:
  Idling; or
  Travelling at constant velocity; or
  Accelerating; or
  Decelerating

Assume (for illustration) that we only record the energy level (SPL) of the sound. The SPL is measured once per second.

The Model!

[Figure: the state-space model. Four states (Idling, Cruising, Accelerating, Decelerating), each with its own SPL output distribution (peaks in the figure near 45, 60, 65 and 70 for the different states) and with transition probabilities marked on the arcs (values 0.25, 0.33, 0.5). Assuming all transitions from a state are equally probable, each state's outgoing transitions receive equal probability (rows of the transition matrix over I, C, A, D contain 0 and 1/3 entries), and the initial state distribution is uniform: (0.25, 0.25, 0.25, 0.25).]

Overall procedure

T = T+1

PREDICT:  P(S_T | x_{0:T-1}) = Σ_{S_{T-1}} P(S_{T-1} | x_{0:T-1}) P(S_T | S_{T-1})
          Predict the distribution of the state at T.

UPDATE:   P(S_T | x_{0:T}) = C · P(S_T | x_{0:T-1}) P(x_T | S_T)
          Update the distribution of the state at T after observing x_T.

At T=0 the predicted state distribution is the initial state probability. At each time T, the current estimate of the distribution over states considers all observations x_0 ... x_T.

Prediction with HMMs

The prediction + update recursion is a natural outcome of the Markov nature of the model: it is identical to the forward computation for HMMs, to within a normalizing constant.

[Figure: HMM trellis over observations O_{t-3}, O_{t-2}, O_{t-1}, O_t, O_{t+1}.]

At t, you have some beliefs about the states. You predict the beliefs for the states at t+1, and update these after observing O_{t+1}.

Estimating the state

T = T+1

PREDICT:   P(S_T | x_{0:T-1}) = Σ_{S_{T-1}} P(S_{T-1} | x_{0:T-1}) P(S_T | S_{T-1})
           Predict the distribution of the state at T.

UPDATE:    P(S_T | x_{0:T}) = C · P(S_T | x_{0:T-1}) P(x_T | S_T)
           Update the distribution of the state at T after observing x_T.

ESTIMATE:  Estimate(S_T) = argmax_{S_T} P(S_T | x_{0:T})

The state is estimated from the updated distribution. It is the updated distribution, not the state estimate, that is propagated forward in time.
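To make the discrete predict/update/estimate loop concrete, here is a minimal Python/numpy sketch. The transition matrix, the Gaussian SPL emission parameters and the observation values below are illustrative assumptions, not numbers from the slides.

    import numpy as np

    def predict(belief, A):
        # P(S_T | x_{0:T-1}) = sum_{S_{T-1}} P(S_{T-1} | x_{0:T-1}) P(S_T | S_{T-1})
        return belief @ A

    def update(predicted, obs_likelihood):
        # P(S_T | x_{0:T}) = C * P(S_T | x_{0:T-1}) P(x_T | S_T)
        posterior = predicted * obs_likelihood
        return posterior / posterior.sum()

    # Illustrative setup: 4 states (Idle, Cruise, Accel, Decel), uniform initial
    # distribution, equally probable transitions, Gaussian SPL emissions
    # (means and variance are made-up numbers for the example).
    A = np.full((4, 4), 0.25)
    belief = np.full(4, 0.25)
    means, var = np.array([45.0, 60.0, 65.0, 70.0]), 25.0

    for x in [46.0, 52.0, 66.0]:                 # a short assumed SPL sequence
        predicted = predict(belief, A)
        lik = np.exp(-0.5 * (x - means) ** 2 / var) / np.sqrt(2 * np.pi * var)
        belief = update(predicted, lik)
        print(belief.argmax(), belief.round(3))  # Estimate(S_T) = argmax of posterior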

Predicting the next observation

T = T+1

PREDICT:  P(S_T | x_{0:T-1}) = Σ_{S_{T-1}} P(S_{T-1} | x_{0:T-1}) P(S_T | S_{T-1})
UPDATE:   P(S_T | x_{0:T}) = C · P(S_T | x_{0:T-1}) P(x_T | S_T)

We can also predict the next observation, P(x_T | x_{0:T-1}), and from it predict x_T itself:

P(x_T | x_{0:T-1}) = Σ_{S_T} P(x_T | S_T) P(S_T | x_{0:T-1})

The probability distribution for the observation at the next time is a mixture; the actual observation can be predicted from P(x_T | x_{0:T-1}).

Continuous state system

s_t = f(s_{t-1}, ε_t)
o_t = g(s_t, γ_t)

The state is a continuous-valued parameter that is not directly seen: for example the position of the Navlab vehicle, or of the star being tracked by the telescope. The observations (sensor readings for Navlab, the recorded image for the telescope) are dependent on the state and are the only way of knowing about the state.

Discrete vs. Continuous State Systems

Prediction at time t:
  Discrete state:    P(s_t | O_{0:t-1}) = Σ_{s_{t-1}} P(s_{t-1} | O_{0:t-1}) P(s_t | s_{t-1})
  Continuous state:  P(s_t | O_{0:t-1}) = ∫ P(s_{t-1} | O_{0:t-1}) P(s_t | s_{t-1}) ds_{t-1}

Update after O_t (both cases):
  P(s_t | O_{0:t}) = C · P(s_t | O_{0:t-1}) P(O_t | s_t)

Special case: Linear Gaussian model

A linear state dynamics equation:  s_t = A_t s_{t-1} + ε_t
  The probability of the state driving term ε is Gaussian; it is sometimes viewed as a driving term m plus additive zero-mean noise.
A linear observation equation:     o_t = B_t s_t + γ_t
  The probability of the observation noise γ is Gaussian.

P(ε) = (2π)^{-d/2} |Θ|^{-1/2} exp(-½ (ε - m)^T Θ^{-1} (ε - m)),  and similarly for P(γ) with covariance Φ.

A_t, B_t and the Gaussian parameters are assumed known; they may vary with time.

The Linear Gaussian model (KF)

s_t = A_t s_{t-1} + ε_t
o_t = B_t s_t + γ_t

P_0(s) = Gaussian(s; s̄_0, R_0)
P(s_t | s_{t-1}) = Gaussian(s_t; m + A_t s_{t-1}, Θ)
P(o_t | s_t) = Gaussian(o_t; B_t s_t, Φ)
P(s_t | o_{0:t-1}) = Gaussian(s; s̄_t, R_t)
P(s_t | o_{0:t}) = Gaussian(s; ŝ_t, R̂_t)

The Kalman filter

Prediction:
  s̄_t = m + A_t ŝ_{t-1}
  R_t = Θ + A_t R̂_{t-1} A_t^T

Update:
  K_t = R_t B_t^T (B_t R_t B_t^T + Φ)^{-1}
  ŝ_t = s̄_t + K_t (o_t - B_t s̄_t)
  R̂_t = (I - K_t B_t) R_t
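The Kalman recursion above translates almost line-for-line into code. The following Python/numpy sketch implements one predict/update step; the example matrices and observations are assumptions chosen only to make it runnable.

    import numpy as np

    def kf_step(s_hat, R_hat, o, A, B, m, Theta, Phi):
        """One prediction + update step of the linear Gaussian model."""
        # Prediction: s_bar_t = m + A s_hat_{t-1},  R_t = Theta + A R_hat_{t-1} A^T
        s_bar = m + A @ s_hat
        R = Theta + A @ R_hat @ A.T
        # Update: K_t = R B^T (B R B^T + Phi)^{-1}
        K = R @ B.T @ np.linalg.inv(B @ R @ B.T + Phi)
        s_hat_new = s_bar + K @ (o - B @ s_bar)
        R_hat_new = (np.eye(len(s_bar)) - K @ B) @ R
        return s_hat_new, R_hat_new

    # Illustrative 2-D state, 1-D observation (all values assumed for the example)
    A = np.array([[1.0, 1.0], [0.0, 1.0]]); B = np.array([[1.0, 0.0]])
    m = np.zeros(2); Theta = 0.1 * np.eye(2); Phi = np.array([[1.0]])
    s_hat, R_hat = np.zeros(2), np.eye(2)
    for o in [np.array([1.2]), np.array([2.1]), np.array([2.9])]:
        s_hat, R_hat = kf_step(s_hat, R_hat, o, A, B, m, Theta, Phi)
        print(s_hat)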

Iterative prediction and update (the Linear Gaussian Model)

P(s_0) ≡ P(s)   (a priori)
P(s_0 | O_0) = C · P(s_0) P(O_0 | s_0)
P(s_1 | O_0) = ∫ P(s_0 | O_0) P(s_1 | s_0) ds_0
P(s_1 | O_{0:1}) = C · P(s_1 | O_0) P(O_1 | s_1)
P(s_2 | O_{0:1}) = ∫ P(s_1 | O_{0:1}) P(s_2 | s_1) ds_1
P(s_2 | O_{0:2}) = C · P(s_2 | O_{0:1}) P(O_2 | s_2)
...
All distributions remain Gaussian.

Problems

s_t = f(s_{t-1}, ε_t)
o_t = g(s_t, γ_t)

f() and/or g() may not be nice linear functions: the conventional Kalman update rules are no longer valid.
ε and/or γ may not be Gaussian: the Gaussian-based update rules are no longer valid.

The problem with non-linear functions

s_t = f(s_{t-1}, ε_t)
o_t = g(s_t, γ_t)

P(s_t | o_{0:t-1}) = ∫ P(s_{t-1} | o_{0:t-1}) P(s_t | s_{t-1}) ds_{t-1}
P(s_t | o_{0:t}) = C · P(s_t | o_{0:t-1}) P(o_t | s_t)

Estimation requires knowledge of P(o|s): this is difficult to estimate for nonlinear g(), and even if it can be estimated, it may not be tractable within the update loop.
Estimation also requires knowledge of P(s_t | s_{t-1}): this is difficult for nonlinear f(), and may not be amenable to closed-form integration.

The problem with nonlinearity

o_t = g(s_t, γ_t)

The PDF may not have a closed form:

P(o_t | s_t) = Σ_{γ : g(s_t, γ) = o_t} P(γ) / |J_{g(s_t, ·)}(o_t)|

where J_{g(s_t, ·)}(o_t) is the Jacobian matrix with entries ∂o_t(i)/∂γ(j). Even if a closed form exists initially, it will typically become intractable very quickly.

Example: a simple nonlinearity

o = γ + log(1 + exp(s))
P(o|s) = ?

Example: At T=0

o = γ + log(1 + exp(s)),   P(o|s) = ?

Assume γ is Gaussian: P(γ) = Gaussian(γ; m, Φ). Then

P(o | s) = Gaussian(o; m + log(1 + exp(s)), Φ)

Assume the initial probability P(s) is Gaussian: P(s_0) = P_0(s) = Gaussian(s; s̄, R).

UPDATE at T=0:

P(s_0 | o_0) = C · P(o_0 | s_0) P(s_0)
             = C · Gaussian(o_0; m + log(1 + exp(s_0)), Φ) · Gaussian(s_0; s̄, R)
             = C exp( -½ (m + log(1+exp(s_0)) - o_0)^T Φ^{-1} (m + log(1+exp(s_0)) - o_0) - ½ (s_0 - s̄)^T R^{-1} (s_0 - s̄) )

= not Gaussian.

Prediction for T=1

The state transition equation is trivial and linear: s_t = s_{t-1} + ε, with P(ε) = Gaussian(ε; 0, Θ), so

P(s_t | s_{t-1}) = Gaussian(s_t; s_{t-1}, Θ)

Prediction:

P(s_1 | o_0) = ∫ P(s_0 | o_0) P(s_1 | s_0) ds_0
             = ∫ C exp( -½ (m + log(1+exp(s_0)) - o_0)^T Φ^{-1} (m + log(1+exp(s_0)) - o_0) - ½ (s_0 - s̄)^T R^{-1} (s_0 - s̄) ) exp( -½ (s_1 - s_0)^T Θ^{-1} (s_1 - s_0) ) ds_0

= intractable.

(Example parameters used in the figures: m = 0; s̄ = 0; R = 1; unit noise variances.)
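As a quick numerical illustration of why the T=0 update is not Gaussian, the unnormalized posterior can be evaluated on a grid. This is an illustrative Python/numpy sketch: the observed value o0 and the grid are assumptions, while the other parameters follow the slide's example.

    import numpy as np

    # Unnormalized T=0 posterior for o = gamma + log(1+exp(s)), gamma ~ N(m, Phi),
    # prior s0 ~ N(s_bar, R). Example parameters: m=0, s_bar=0, unit variances.
    m, s_bar, Phi, R = 0.0, 0.0, 1.0, 1.0
    o0 = 2.0                                        # an assumed observed value
    s = np.linspace(-6.0, 10.0, 4001)
    log_post = (-0.5 * (m + np.log1p(np.exp(s)) - o0) ** 2 / Phi
                - 0.5 * (s - s_bar) ** 2 / R)
    post = np.exp(log_post - log_post.max())
    post /= np.trapz(post, s)                       # normalize numerically
    mean = np.trapz(s * post, s)
    skew = np.trapz((s - mean) ** 3 * post, s)      # nonzero: the posterior is skewed,
    print(mean, skew)                               # i.e. not Gaussian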

Update at T=1 and later

Update at T=1:       P(s_t | o_{0:t}) = C · P(s_t | o_{0:t-1}) P(o_t | s_t)   (intractable)
Prediction for T=2:  P(s_t | o_{0:t-1}) = ∫ P(s_{t-1} | o_{0:t-1}) P(s_t | s_{t-1}) ds_{t-1}   (intractable)

The state prediction equation

s_t = f(s_{t-1}, ε_t)

Similar problems arise for the state prediction equation: P(s_t | s_{t-1}) may not have a closed form, and even if it does, it may become intractable within the prediction and update equations, particularly the prediction equation, which includes an integration operation.

Simplifying the problem: Linearize

o = γ + log(1 + exp(s))

[Figure: the function log(1 + exp(s)) plotted against s, with its tangent drawn at several different expansion points.]

The tangent at any point is a good local approximation if the function is sufficiently smooth.

Linearizing the observation function

o = γ + g(s),   P(s) = Gaussian(s; s̄, R)
o ≈ γ + g(s̄) + J_g(s̄)(s - s̄)

This is a simple first-order Taylor series expansion around the a priori (or predicted) mean s̄ of the state. J_g() is the Jacobian matrix (just a scalar derivative when the state is scalar). Where P(s) is small the approximation error is large, but that matters little: most of the probability mass of s is in the low-error region near s̄.

UPDATE with the linearized observation function:

P(γ) = Gaussian(γ; 0, Φ)
P(o | s) = Gaussian(o; g(s̄) + J_g(s̄)(s - s̄), Φ)      (the observation PDF is Gaussian)
P(s) = Gaussian(s; s̄, R)
P(s | o) = C · P(o | s) P(s)
         = C · Gaussian(o; g(s̄) + J_g(s̄)(s - s̄), Φ) · Gaussian(s; s̄, R)
         = Gaussian( s;  s̄ + R J_g(s̄)^T (J_g(s̄) R J_g(s̄)^T + Φ)^{-1} (o - g(s̄)),
                         (I - R J_g(s̄)^T (J_g(s̄) R J_g(s̄)^T + Φ)^{-1} J_g(s̄)) R )

Gaussian!! (Note: this is actually only an approximation.)

Prediction?   s_t = f(s_{t-1}) + ε

Prediction

s_t = f(s_{t-1}) + ε,   P(ε) = Gaussian(ε; 0, Θ)

Again, direct use of f() can be disastrous. Solution: linearize, this time around the mean ŝ_{t-1} of the updated distribution of s at t-1, which should be Gaussian:

P(s_{t-1} | o_{0:t-1}) = Gaussian(s_{t-1}; ŝ_{t-1}, R̂_{t-1})
s_t ≈ ε + f(ŝ_{t-1}) + J_f(ŝ_{t-1})(s_{t-1} - ŝ_{t-1})

The state transition probability is now:

P(s_t | s_{t-1}) = Gaussian(s_t; f(ŝ_{t-1}) + J_f(ŝ_{t-1})(s_{t-1} - ŝ_{t-1}), Θ)

The predicted state probability is:

P(s_t | o_{0:t-1}) = ∫ P(s_{t-1} | o_{0:t-1}) P(s_t | s_{t-1}) ds_{t-1}

Substituting the Gaussians into the prediction integral:

P(s_t | o_{0:t-1}) = ∫ Gaussian(s_{t-1}; ŝ_{t-1}, R̂_{t-1}) · Gaussian(s_t; f(ŝ_{t-1}) + J_f(ŝ_{t-1})(s_{t-1} - ŝ_{t-1}), Θ) ds_{t-1}

so the predicted state probability is:

P(s_t | o_{0:t-1}) = Gaussian(s_t; f(ŝ_{t-1}), J_f(ŝ_{t-1}) R̂_{t-1} J_f(ŝ_{t-1})^T + Θ)

Gaussian!! (Again, this is actually only an approximation.)

The linearized prediction/update

s_t = f(s_{t-1}) + ε
o_t = g(s_t) + γ

Given: two non-linear functions, for state update and for observation generation. Note that these equations are deterministic non-linear functions of the state variable, but linear functions of the noise. (Non-linear functions of stochastic noise are slightly more complicated to handle.)

Linearized Prediction and Update

Prediction for time t:
  s̄_t = f(ŝ_{t-1})
  R_t = J_f(ŝ_{t-1}) R̂_{t-1} J_f(ŝ_{t-1})^T + Θ
  P(s_t | o_{0:t-1}) = Gaussian(s_t; s̄_t, R_t)

Update at time t:
  ŝ_t = s̄_t + R_t J_g(s̄_t)^T (J_g(s̄_t) R_t J_g(s̄_t)^T + Φ)^{-1} (o_t - g(s̄_t))
  R̂_t = (I - R_t J_g(s̄_t)^T (J_g(s̄_t) R_t J_g(s̄_t)^T + Φ)^{-1} J_g(s̄_t)) R_t
  P(s_t | o_{0:t}) = Gaussian(s_t; ŝ_t, R̂_t)

Writing A_t = J_f(ŝ_{t-1}) and B_t = J_g(s̄_t), this is the Extended Kalman filter, with the same structure as the Kalman filter:

The Kalman filter
  Prediction:  s̄_t = A_t ŝ_{t-1} + m;   R_t = Θ + A_t R̂_{t-1} A_t^T
  Update:      K_t = R_t B_t^T (B_t R_t B_t^T + Φ)^{-1};   ŝ_t = s̄_t + K_t (o_t - B_t s̄_t);   R̂_t = (I - K_t B_t) R_t

The Extended Kalman filter
  Prediction:  A_t = J_f(ŝ_{t-1});   s̄_t = f(ŝ_{t-1});   R_t = Θ + A_t R̂_{t-1} A_t^T
  Update:      B_t = J_g(s̄_t);   K_t = R_t B_t^T (B_t R_t B_t^T + Φ)^{-1};   ŝ_t = s̄_t + K_t (o_t - g(s̄_t));   R̂_t = (I - K_t B_t) R_t
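A compact Python/numpy sketch of one EKF step, following the linearized prediction/update above. The scalar example model (a random-walk state with a softplus observation function) and its parameters are assumptions for illustration, not part of the lecture.

    import numpy as np

    def ekf_step(s_hat, R_hat, o, f, g, Jf, Jg, Theta, Phi):
        """One extended Kalman filter prediction + update step."""
        A = Jf(s_hat)                          # A_t = J_f(s_hat_{t-1})
        s_bar = f(s_hat)                       # s_bar_t = f(s_hat_{t-1})
        R = A @ R_hat @ A.T + Theta            # R_t
        B = Jg(s_bar)                          # B_t = J_g(s_bar_t)
        K = R @ B.T @ np.linalg.inv(B @ R @ B.T + Phi)
        s_hat_new = s_bar + K @ (o - g(s_bar))
        R_hat_new = (np.eye(len(s_bar)) - K @ B) @ R
        return s_hat_new, R_hat_new

    # Assumed 1-D example: random-walk state, observation o = log(1+exp(s)) + noise
    f  = lambda s: s
    Jf = lambda s: np.eye(1)
    g  = lambda s: np.log1p(np.exp(s))
    Jg = lambda s: np.array([[1.0 / (1.0 + np.exp(-s[0]))]])   # derivative of softplus
    Theta, Phi = 0.1 * np.eye(1), 1.0 * np.eye(1)
    s_hat, R_hat = np.zeros(1), np.eye(1)
    for o in [np.array([0.9]), np.array([1.4]), np.array([2.0])]:
        s_hat, R_hat = ekf_step(s_hat, R_hat, o, f, g, Jf, Jg, Theta, Phi)
        print(s_hat)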

EKFs

EKFs are probably the most commonly used algorithm for tracking and prediction. Most systems are non-linear; specifically, the relationship between state and observation is usually nonlinear. The approach can be extended to include non-linear functions of noise as well. In most contexts, the term "Kalman filter" simply refers to an extended Kalman filter.

But EKFs have limitations. If the non-linearity changes too quickly with s, the linear approximation is invalid: the true function may lie entirely on one side of the approximation. The filter can then be unstable, and the estimate is often biased. Various extensions have been proposed, e.g. invariant extended Kalman filters (IEKF) and unscented Kalman filters (UKF).

A different problem: Non-Gaussian PDFs

s_t = f(s_{t-1}) + ε
o_t = g(s_t) + γ

We have assumed so far that:
  P_0(s) is Gaussian, or can be approximated as Gaussian;
  P(ε) is Gaussian;
  P(γ) is Gaussian.

This has a happy consequence: in the iterative prediction/update recursion (P(s_0) ≡ P(s); P(s_0 | O_0) = C · P(s_0) P(O_0 | s_0); P(s_1 | O_0) = ∫ P(s_0 | O_0) P(s_1 | s_0) ds_0; P(s_1 | O_{0:1}) = C · P(s_1 | O_0) P(O_1 | s_1); ...), all distributions remain Gaussian. But when any of these are not Gaussian, the results are not so happy.

A simple case

o_t = B s_t + γ,   P(γ) = Σ_{i=0}^{1} w_i Gaussian(γ; m_i, Φ_i)

P(γ) is a mixture of only two Gaussians, and o is a linear function of s (non-linear functions would be linearized anyway). Then P(o|s) is also a Gaussian mixture:

P(o_t | s_t) = P(γ = o_t - B s_t) = Σ_{i=0}^{1} w_i Gaussian(o_t; m_i + B s_t, Φ_i)

When distributions are not Gaussian

[Figure sequence: the iterative prediction/update recursion, now with a mixture-Gaussian state output probability P(O_t | s_t):
  P(s_0) ≡ P(s)   (a priori)
  P(s_0 | O_0) = C · P(s_0) P(O_0 | s_0)
  P(s_1 | O_0) = ∫ P(s_0 | O_0) P(s_1 | s_0) ds_0
  P(s_1 | O_{0:1}) = C · P(s_1 | O_0) P(O_1 | s_1)
  P(s_2 | O_{0:1}) = ∫ P(s_1 | O_{0:1}) P(s_2 | s_1) ds_1
  P(s_2 | O_{0:2}) = C · P(s_2 | O_{0:1}) P(O_2 | s_2)
At each update the number of Gaussian components in the posterior multiplies.]

When P(O_t | s_t) has more than one Gaussian, after only a few time steps P(s_t | O_{0:t}) has far too many Gaussians for comfort.

Related Topic: How to sample from a Distribution?

"Sampling from a distribution P(x; G) with parameters G" means generating random numbers such that the distribution of a large number of generated numbers is P(x; G), with parameters G.

Many algorithms exist to generate random variables from a variety of distributions. Generation from a uniform distribution is well studied; uniform RVs are used to sample from multinomial distributions; for other distributions, the most common approach is to transform a uniform RV to the desired distribution.

Sampling a multinomial

Given a multinomial over N symbols, with probability of the ith symbol = P(i), we want to randomly generate symbols from this distribution. This can be done by sampling from a uniform distribution:

  Segment the range (0,1) according to the probabilities P(i); the P(i) terms sum to 1.0, so the segment boundaries are P(1), P(1)+P(2), P(1)+P(2)+P(3), ..., 1.0.
  Randomly generate a number from a uniform distribution (Matlab: "rand" generates a number between 0 and 1 with uniform probability).
  If the number falls in the ith segment, select the ith symbol.
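The segment-and-draw procedure above is a one-liner with cumulative sums. A small Python/numpy sketch, with an assumed example distribution:

    import numpy as np

    def sample_multinomial(p, n_samples, rng=np.random.default_rng()):
        """Draw symbols 0..N-1 with probabilities p by segmenting (0,1)."""
        edges = np.cumsum(p)               # boundaries P(1), P(1)+P(2), ..., 1.0
        u = rng.random(n_samples)          # uniform draws in (0,1), like Matlab's rand
        return np.searchsorted(edges, u)   # index of the segment each draw falls in

    p = np.array([0.5, 0.2, 0.2, 0.1])     # an assumed 4-symbol multinomial
    draws = sample_multinomial(p, 10000)
    print(np.bincount(draws, minlength=len(p)) / len(draws))   # approximately p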

Related Topic: Sampling from a Gaussian

Many algorithms exist. The simplest: add many samples from a uniform RV. The sum of 12 uniform RVs (each uniform in (0,1)) is approximately Gaussian with mean 6 and variance 1. So for a scalar Gaussian with mean m and standard deviation s:

  x = m + s (Σ_{i=1}^{12} r_i - 6)

Matlab: x = m + randn*s, where "randn" draws from a Gaussian with mean 0 and variance 1.

Sampling from a multivariate Gaussian

For a multivariate (d-dimensional) Gaussian with mean m and covariance Θ:
  Compute the eigenvalue matrix L and eigenvector matrix E of Θ, so that Θ = E L E^T.
  Generate d zero-mean, unit-variance numbers x_1 .. x_d and arrange them in a vector X = [x_1 .. x_d]^T.
  Multiply X by the square root of L and by E, and add m:  Y = m + E sqrt(L) X.

Sampling from a Gaussian Mixture

P(X) = Σ_i w_i Gaussian(X; m_i, Θ_i)
  Select a Gaussian by sampling the multinomial distribution of weights: j ~ multinomial(w_1, w_2, ...).
  Sample from the selected Gaussian: Gaussian(X; m_j, Θ_j).
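Both recipes are easy to sketch in Python/numpy: sampling a multivariate Gaussian through the eigendecomposition Θ = E L E^T, and sampling a mixture by first picking a component from the weights. The means, covariances and weights below are assumed example values.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_gaussian(m, Theta, n):
        """Y = m + E sqrt(L) X, with Theta = E L E^T and X standard normal."""
        L, E = np.linalg.eigh(Theta)             # eigenvalues L, eigenvectors E
        X = rng.standard_normal((len(m), n))     # d zero-mean unit-variance numbers per sample
        return (m[:, None] + E @ (np.sqrt(L)[:, None] * X)).T

    def sample_gmm(weights, means, covs, n):
        comp = rng.choice(len(weights), size=n, p=weights)   # j ~ multinomial(w1, w2, ...)
        return np.vstack([sample_gaussian(means[j], covs[j], 1) for j in comp])

    # Assumed 2-D example
    means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
    covs  = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
    samples = sample_gmm([0.3, 0.7], means, covs, 1000)
    print(samples.mean(axis=0))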

The problem of the exploding distribution

Returning to our problem: the complexity of the distribution increases exponentially with time. This is a consequence of having a continuous state space; only Gaussian PDFs propagate without an increase in complexity.

Discrete-state systems do not have this problem: the number of states in an HMM stays fixed. However, discrete state spaces are too coarse.

Solution: combine the two concepts. Discretize the state space dynamically.

Discrete approximation: Random sampling

A large-enough collection of randomly-drawn samples from a distribution will approximately quantize the space of the random variable into equi-probable regions: we get more random samples from high-probability regions and fewer samples from low-probability regions.

Discrete approximation to a distribution

A PDF can therefore be approximated as a uniform probability distribution over randomly drawn samples, since each sample represents approximately the same probability mass (1/M if there are M samples):

P(x) ≈ (1/M) Σ_{i=0}^{M-1} δ(x - x_i)

Note: Properties of a discrete distribution

For the uniform discrete approximation P(x) = (1/M) Σ_{i=0}^{M-1} δ(x - x_i):

The product of a discrete distribution with another distribution is simply a weighted discrete probability:
  P(x) P(y|x) = (1/M) Σ_{i=0}^{M-1} P(y|x_i) δ(x - x_i)

More generally, for a weighted discrete distribution P(x) = Σ_{i=0}^{M-1} w_i δ(x - x_i), the integral of the product is a mixture distribution:
  ∫ P(x) P(y|x) dx = Σ_{i=0}^{M-1} w_i P(y|x_i)

Discretizing the state space

At each time, discretize the predicted state space:
  P(s_t | o_{0:t-1}) ≈ (1/M) Σ_{i=0}^{M-1} δ(s_t - s_i)
where the s_i are randomly drawn samples from P(s_t | o_{0:t-1}). Then propagate the discretized distribution.
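A quick numerical check of the "integral of the product" property: approximate P(x) by samples and compare the sample average of P(y|x_i) with the exact marginal. The two Gaussians used here are assumed purely for the check.

    import numpy as np

    rng = np.random.default_rng(0)
    M = 5000
    x_i = rng.normal(0.0, 1.0, size=M)          # samples approximating P(x) = N(0, 1)
    p_y_given_x = lambda x: np.exp(-0.5 * (2.0 - x) ** 2) / np.sqrt(2 * np.pi)   # P(y=2 | x) = N(2; x, 1)

    # Integral of the product approximated as the mixture sum (1/M) * sum_i P(y | x_i)
    approx = np.mean(p_y_given_x(x_i))
    exact = np.exp(-0.5 * 2.0 ** 2 / 2.0) / np.sqrt(2 * np.pi * 2.0)   # N(2; 0, 2)
    print(approx, exact)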

Particle Filtering

[Figure sequence: the prediction/update recursion with an a priori probability P(s), transition probability P(s_t | s_{t-1}) and state output probability P(O_t | s_t), where the continuous predicted distribution is sampled at each step. For illustration, only FOUR samples are generated from each predicted distribution.]

Discretize the state space at the prediction step, by sampling the continuous predicted distribution. If appropriately sampled, all generated samples may be considered equally probable: sampling results in a discrete uniform distribution.

The update step updates the distribution over the quantized state space, and results in a discrete non-uniform distribution.

The predicted state distribution for the next time instant will again be continuous, and must be discretized again by sampling.

At any step, the current state distribution will not have more components than the number of samples generated at the previous sampling step: the complexity of the distributions remains constant.

Particle Filtering: the algorithm

s_t = f(s_{t-1}) + ε,   o_t = g(s_t) + γ

Prediction at time t:   P(s_t | O_{0:t-1}) = ∫ P(s_{t-1} | O_{0:t-1}) P(s_t | s_{t-1}) ds_{t-1}
Update at time t:       P(s_t | O_{0:t}) = C · P(s_t | O_{0:t-1}) P(O_t | s_t)

The number of mixture components in the predicted distribution is governed by the number of samples in the discrete distribution. By drawing a small number of samples (100-1000) at each time instant, all distributions are kept manageable.

At t = 0, sample the initial state distribution:
  P(s_0) ≈ (1/M) Σ_{i=0}^{M-1} δ(s_0 - s_i^0),   where s_i^0 ~ P_0(s)

Update the state distribution with the observation:
  P(s_t | o_{0:t}) = C Σ_{i=0}^{M-1} P_γ(o_t - g(s_i^t)) δ(s_t - s_i^t),   with C = 1 / Σ_{i=0}^{M-1} P_γ(o_t - g(s_i^t))

Particle Filtering P ( )

P ( )

st  f (st 1 )  

ot  g (st )  

st  f (st 1 )   P ( )

ot  g (st )   P ( )

Predict the state distribution at t



M 1



P( st | o0:t 1 )  C  P (ot 1  g ( sit 1 )) P ( st  f ( sit 1 ))

Predict the state distribution at the next time

i 0

Sample the predicted state distribution at t



M 1

P( st | o0:t 1 )  C  P (ot 1  g ( sit 1 )) P ( st  f ( sit 1 ))

P( st | o0:t 1 ) 

i 0



Sample the predicted state distribution P( st | o0:t 1 ) 

1 M

M 1

 (s i 0

t

1 Nov 2012

1 M

M 1

 (s

t

i 0

 sit ) where sit  P( st | o0:t 1 )

Update the state distribution at t



 sit ) where sit  P( st | o0:t 1 )

M 1

P( st | o0:t )  C  P (ot  g ( sit )) ( st  sit ) 81

11-755/18797

Estimating a state 

i 0

1 Nov 2012

C

11-755/18797

1 M 1

 P (o  g (s )) i 0

t i

t

82
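Putting the steps together, here is a minimal Python/numpy particle filter sketch. It also returns the posterior-mean estimate discussed in the next slide. Resampling by the update weights and then applying the noisy transition is one way of sampling the predicted mixture; the demo model mirrors the upcoming simulation, but its specific numbers are assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    def particle_filter(observations, sample_prior, sample_transition, obs_likelihood, M=500):
        """Particle filter: returns the posterior-mean state estimate at each step."""
        particles = sample_prior(M)                    # s_i^0 ~ P_0(s)
        estimates = []
        for o in observations:
            w = obs_likelihood(o, particles)           # P_gamma(o_t - g(s_i^t)) per particle
            w = w / w.sum()                            # the normalizer C
            estimates.append(np.sum(w * particles))    # s_hat_t = C * sum_i s_i^t P_gamma(...)
            # Sampling the predicted mixture C * sum_i w_i P_eps(s - f(s_i)) is equivalent
            # to resampling particles by their weights, then applying the noisy transition:
            resampled = rng.choice(particles, size=M, p=w)
            particles = sample_transition(resampled)
        return np.array(estimates)

    # Assumed demo mirroring the simulation: s_t = s_{t-1} + eps (var 10),
    # o_t = s_t + x_t with x_t an 8-component Gaussian mixture (all variances 10).
    mix_means = np.array([-4.0, 0.0, 4.0, 8.0, 12.0, 16.0, 18.0, 20.0])
    sample_prior = lambda M: rng.normal(0.0, 10.0, size=M)
    sample_transition = lambda s: s + rng.normal(0.0, np.sqrt(10.0), size=s.shape)

    def obs_likelihood(o, s):
        d = o - s[:, None] - mix_means[None, :]        # o - g(s_i) - m_k for every component k
        return np.mean(np.exp(-0.5 * d ** 2 / 10.0) / np.sqrt(2 * np.pi * 10.0), axis=1)

    obs = np.array([2.0, 5.5, 9.0, 12.5])              # a short, assumed observation sequence
    print(particle_filter(obs, sample_prior, sample_transition, obs_likelihood))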

Estimating a state

The algorithm gives us a discrete updated distribution over states:
  P(s_t | o_{0:t}) = C Σ_{i=0}^{M-1} P_γ(o_t - g(s_i^t)) δ(s_t - s_i^t)

The actual state can be estimated as the mean of this distribution:
  ŝ_t = C Σ_{i=0}^{M-1} s_i^t P_γ(o_t - g(s_i^t))

Alternately, it can be the most likely sample:
  ŝ_t = s_j^t,   where j = argmax_i P_γ(o_t - g(s_i^t))

Simulations with a Linear Model

s_t = s_{t-1} + ε_t
o_t = s_t + x_t

ε_t has a Gaussian distribution with zero mean and known variance; x_t has a mixture-Gaussian distribution with known parameters.

Simulation:
  Generate a state sequence s_t from the model.
  Generate a sequence of x_t from the model, with one x_t term for every s_t term.
  Generate the observation sequence o_t from s_t and x_t.
  Attempt to estimate s_t from o_t.

Simulation: Synthesizing data

Generate the state sequence according to s_t = s_{t-1} + ε_t, where ε_t is Gaussian with mean 0 and variance 10.

Generate the observation sequence from the state sequence according to o_t = s_t + x_t, where x_t is mixture Gaussian with parameters:
  Means = [-4, 0, 4, 8, 12, 16, 18, 20]
  Variances = [10, 10, 10, 10, 10, 10, 10, 10]
  Mixture weights = [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]
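A short Python/numpy sketch that synthesizes data as described (random-walk state with variance-10 Gaussian driving noise, observations corrupted by the 8-component mixture). The sequence length and random seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 200                                           # sequence length (arbitrary)
    means = np.array([-4.0, 0.0, 4.0, 8.0, 12.0, 16.0, 18.0, 20.0])
    weights = np.full(8, 0.125)

    # State sequence: s_t = s_{t-1} + eps_t, with eps_t ~ N(0, 10), starting from 0
    s = np.cumsum(rng.normal(0.0, np.sqrt(10.0), size=T))

    # Observation noise x_t: pick a mixture component, then draw from it (all variances 10)
    comp = rng.choice(8, size=T, p=weights)
    x = rng.normal(means[comp], np.sqrt(10.0))

    o = s + x                                         # observation sequence o_t = s_t + x_t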

SIMULATION: TIME = 1

[Figure frames, combined for a more compact representation: the predicted state distribution at time = 1 (predict); the sampled version of the predicted state distribution at time = 1 (sample); and the updated distribution at time = 1 (update).]