CS340 Machine learning Bayesian statistics 2
Binomial distribution (count data)
• X ~ Binom(θ, N), X ∈ {0,1,…,N}

P(X = x | θ, N) = (N choose x) θ^x (1−θ)^{N−x}

[Figure: binomial pmf for N = 10 at θ = 0.25, 0.50, 0.75, 0.90]
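• The panels above are easy to reproduce numerically. A minimal Python sketch, assuming scipy is available (the setup is illustrative, not part of the original slides):

    # Evaluate the binomial pmf for the four panels above.
    import numpy as np
    from scipy.stats import binom

    N = 10
    xs = np.arange(N + 1)
    for theta in [0.25, 0.50, 0.75, 0.90]:
        pmf = binom.pmf(xs, N, theta)          # P(X = x | theta, N)
        assert np.isclose(pmf.sum(), 1.0)      # a pmf sums to one
        print(theta, np.round(pmf, 3))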
Bernoulli distribution (binary data)
• The binomial distribution with N = 1 is called the Bernoulli distribution.
• We write X ~ Ber(θ), X ∈ {0,1}

p(X) = θ^X (1−θ)^{1−X}

• So p(X=1) = θ and p(X=0) = 1−θ.
Bernoulli likelihood function
• The likelihood is

L(θ) = p(D|θ) = ∏_{n=1}^N p(x_n|θ)
     = ∏_n θ^{I(x_n=1)} (1−θ)^{I(x_n=0)}
     = θ^{∑_n I(x_n=1)} (1−θ)^{∑_n I(x_n=0)}
     = θ^{N_1} (1−θ)^{N_0}

• We say that N_1 and N_0 are sufficient statistics of D for θ.
• This is the same as the binomial likelihood function, up to constant factors.
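• To see the sufficient statistics in action, a small Python sketch (the data values are hypothetical):

    # The Bernoulli log-likelihood depends on D only through N1 and N0.
    import numpy as np

    D = np.array([1, 0, 1, 1, 0, 1])            # hypothetical coin flips
    N1 = D.sum()
    N0 = len(D) - N1                            # sufficient statistics

    def loglik(theta):
        return N1 * np.log(theta) + N0 * np.log(1 - theta)

    theta_mle = N1 / (N1 + N0)                  # maximizes the likelihood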
Summary of beta-Bernoulli model
• Prior: p(θ) = Beta(θ|α_1, α_0) = (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1}
• Likelihood: p(D|θ) = θ^{N_1} (1−θ)^{N_0}
• Posterior: p(θ|D) = Beta(θ|α_1+N_1, α_0+N_0)
• Posterior predictive: p(X=1|D) = (α_1+N_1) / (α_1+α_0+N)
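• The conjugate update is just addition of counts; a minimal Python sketch (prior and counts are hypothetical):

    a1, a0 = 1.0, 1.0                # hypothetical uniform prior Beta(1, 1)
    N1, N0 = 7, 3                    # hypothetical heads/tails counts
    a1_post, a0_post = a1 + N1, a0 + N0          # posterior is Beta(a1', a0')
    p_heads = a1_post / (a1_post + a0_post)      # posterior predictive p(X=1|D)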
Marginal likelihood
• When performing Bayesian model selection and empirical Bayes estimation, we will need

p(D) = ∫ p(D|θ) p(θ) dθ

• This is given by a ratio of the posterior and prior normalizing constants:

p(θ|D) = p(θ) p(D|θ) / p(D)
       = (1/p(D)) (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1} θ^{N_1} (1−θ)^{N_0}
       = (1/B(α'_1, α'_0)) θ^{α'_1−1} (1−θ)^{α'_0−1}

• Matching the normalizing constants gives

p(D) = B(α'_1, α'_0) / B(α_1, α_0),  where α'_1 = α_1 + N_1, α'_0 = α_0 + N_0.
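• The closed form can be sanity-checked against the integral; a Python sketch assuming scipy, with hypothetical values:

    # p(D) = B(a1+N1, a0+N0) / B(a1, a0), checked by numerical integration.
    import numpy as np
    from scipy.special import betaln
    from scipy.integrate import quad
    from scipy.stats import beta

    a1, a0, N1, N0 = 1.0, 1.0, 7, 3              # hypothetical values
    logpD = betaln(a1 + N1, a0 + N0) - betaln(a1, a0)

    integrand = lambda t: t**N1 * (1 - t)**N0 * beta.pdf(t, a1, a0)
    pD_numeric, _ = quad(integrand, 0, 1)
    assert np.isclose(np.exp(logpD), pD_numeric)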
Summary of beta-Bernoulli model
• Prior: p(θ) = Beta(θ|α_1, α_0) = (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1}
• Likelihood: p(D|θ) = θ^{N_1} (1−θ)^{N_0}
• Posterior: p(θ|D) = Beta(θ|α_1+N_1, α_0+N_0)
• Posterior predictive: p(X=1|D) = (α_1+N_1) / (α_1+α_0+N)
• Marginal likelihood:

p(D) = B(α_1+N_1, α_0+N_0) / B(α_1, α_0)
     = [Γ(α_1+N_1) Γ(α_0+N_0) / Γ(α_1+N_1+α_0+N_0)] · [Γ(α_1+α_0) / (Γ(α_1) Γ(α_0))]
From coins to dice
• Let (X_1,…,X_d) | N ~ Multinomial(θ, N), where X_i ∈ {0,…,N} is the number of times face i occurs.

P(x_1,…,x_d | θ, N) = (N choose x_1 … x_d) ∏_{i=1}^d θ_i^{x_i}
                    = (N! / (x_1! x_2! ⋯ x_d!)) ∏_{i=1}^d θ_i^{x_i},  where N = ∑_i x_i

• The X_i are no longer conditionally independent, since ∑_i x_i = N.
• We also require ∑_i θ_i = 1.
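• scipy provides this pmf directly; a sketch with a hypothetical fair die:

    # Multinomial pmf for dice rolls, using scipy's implementation.
    import numpy as np
    from scipy.stats import multinomial

    theta = np.array([1/6] * 6)                  # fair die (hypothetical)
    counts = np.array([3, 1, 2, 0, 2, 2])        # x_i = times face i occurred
    N = counts.sum()                             # here N = 10
    p = multinomial.pmf(counts, n=N, p=theta)    # P(x_1,...,x_d | theta, N)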
Multinomial(θ, 1)
• Let (X_1,…,X_d) ~ Multinomial(θ, 1)

P(x_1,…,x_d | θ) = ∏_{i=1}^d θ_i^{x_i}

• Since ∑_i X_i = 1, only one "bit" can be on, e.g. (0,1,0) means face 2 occurred.
• Let X ∈ {1,…,d} represent the event that occurred. Then

P(X = k | θ) = ∏_{i=1}^d θ_i^{I(i=k)} = θ_k
Likelihood function for the multinomial
• Let D = (x_1, …, x_N), where x_n ∈ {1,…,K}, and let x_nk = I(x_n = k).

P(D|θ) ∝ ∏_{n=1}^N ∏_{k=1}^K θ_k^{x_nk} = ∏_k θ_k^{∑_n x_nk} = ∏_k θ_k^{N_k}

• The counts N_k = ∑_n I(x_n = k) are the sufficient statistics.
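• Computing the count vector N_k in Python (the data are hypothetical; np.bincount does the tallying):

    # The multinomial likelihood depends on D only through the counts N_k.
    import numpy as np

    K = 3
    D = np.array([1, 3, 2, 1, 1, 3])             # hypothetical rolls, x_n in {1..K}
    Nk = np.bincount(D, minlength=K + 1)[1:]     # N_k = #{n : x_n = k}

    def loglik(theta):                           # depends on D only through Nk
        return np.sum(Nk * np.log(theta))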
Dirichlet distribution
• Generalization of the Beta to K dimensions:

p(x|α) = D(x|α) = (1/Z(α)) x_1^{α_1−1} x_2^{α_2−1} ⋯ x_K^{α_K−1} I(∑_{k=1}^K x_k = 1)

• Normalization constant:

Z(α) = ∫⋯∫ x_1^{α_1−1} ⋯ x_K^{α_K−1} dx_1 ⋯ dx_K = ∏_{j=1}^K Γ(α_j) / Γ(∑_{j=1}^K α_j)

• Mean: E[x_k] = α_k / ∑_j α_j

[Figure: Dirichlet densities for α = (2,2,2), (20,2,2), (20,20,20)]
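• A Python sketch using scipy's Dirichlet (α is taken from one of the panels above; the evaluation point x is hypothetical):

    import numpy as np
    from scipy.stats import dirichlet

    alpha = np.array([20.0, 2.0, 2.0])           # one of the panels above
    x = np.array([0.7, 0.2, 0.1])                # a point on the simplex
    pdf = dirichlet.pdf(x, alpha)                # p(x | alpha)
    mean = dirichlet.mean(alpha)                 # equals alpha / alpha.sum()
    samples = dirichlet.rvs(alpha, size=5)       # draws from the simplex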
Summary of Dirichlet-multinomial model
• Model: X_n ~ Mult(θ, 1), p(X_n = k) = θ_k
• Prior: p(θ) = Dir(θ|α_1, …, α_K) = (1/Z(α_1, …, α_K)) ∏_{k=1}^K θ_k^{α_k−1}
• Likelihood: p(D|θ) = ∏_{k=1}^K θ_k^{N_k}
• Posterior: p(θ|D) = Dir(θ|α_1+N_1, …, α_K+N_K)
• Posterior predictive: p(X=k|D) = (α_k+N_k) / ∑_{k'} (α_{k'}+N_{k'})
• Marginal likelihood:

p(D) = Z(N+α) / Z(α) = [Γ(∑_k α_k) / Γ(N + ∑_k α_k)] ∏_k [Γ(N_k+α_k) / Γ(α_k)]
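• All of these quantities are a few lines of Python; a sketch with hypothetical prior and counts (gammaln keeps the Γ ratios stable in log space):

    import numpy as np
    from scipy.special import gammaln

    alpha = np.array([1.0, 1.0, 1.0])            # hypothetical uniform prior
    Nk = np.array([4, 1, 5])                     # observed counts, N = 10
    alpha_post = alpha + Nk                      # posterior Dirichlet parameters
    pred = alpha_post / alpha_post.sum()         # posterior predictive p(X=k|D)

    def logZ(a):                                 # log of the Dirichlet constant
        return gammaln(a).sum() - gammaln(a.sum())

    log_pD = logZ(alpha + Nk) - logZ(alpha)      # log p(D) = log Z(N+α)/Z(α)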
Normal-Normal model
• Consider estimating the mean of a Gaussian whose variance is known.
• The natural conjugate prior is Gaussian.
• So the posterior is also Gaussian ("Gaussian times Gaussian gives Gaussian").
• The algebra is rather messy (see handout), so we will just state and interpret the results.
Normal-normal model
• Likelihood:

p(D|µ) = ∏_{n=1}^N N(x_n|µ, σ²) ∝ exp(−(N/2σ²)(x̄ − µ)²) ∝ N(x̄|µ, σ²/N)

• Natural conjugate prior:

p(µ) ∝ exp(−(1/2σ_0²)(µ − µ_0)²) ∝ N(µ|µ_0, σ_0²)

• Posterior:

p(µ|D) = N(µ|µ_N, σ_N²)

µ_N = (σ² / (Nσ_0² + σ²)) µ_0 + (Nσ_0² / (Nσ_0² + σ²)) x̄ = σ_N² (µ_0/σ_0² + N x̄/σ²)

σ_N² = σ² σ_0² / (Nσ_0² + σ²) = 1 / (N/σ² + 1/σ_0²)
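• The update is mechanical once N and x̄ are known; a Python sketch with hypothetical values:

    # Posterior update for a Gaussian mean with known observation variance.
    import numpy as np

    mu0, s0sq = 0.0, 1.0                         # hypothetical prior N(mu0, s0sq)
    ssq = 0.1                                    # known observation variance
    x = np.array([0.9, 0.7, 0.8])                # hypothetical data
    N, xbar = len(x), x.mean()

    sNsq = 1.0 / (N / ssq + 1.0 / s0sq)          # posterior variance
    muN = sNsq * (mu0 / s0sq + N * xbar / ssq)   # posterior mean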
Posterior mean
• Consider N = 1. Then

µ_1 = (σ² / (σ² + σ_0²)) µ_0 + (σ_0² / (σ² + σ_0²)) x    (convex combination of prior mean and MLE)
    = µ_0 + (σ_0² / (σ² + σ_0²)) (x − µ_0)               (prior plus data correction term)
    = x − (σ² / (σ² + σ_0²)) (x − µ_0)                   (data shrunk towards the prior)
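• A quick numerical check that the three forms agree (values hypothetical):

    import numpy as np

    mu0, s0sq, ssq, x = 0.0, 1.0, 0.1, 0.8       # hypothetical values, N = 1
    form1 = ssq/(ssq + s0sq)*mu0 + s0sq/(ssq + s0sq)*x   # convex combination
    form2 = mu0 + s0sq/(ssq + s0sq)*(x - mu0)            # prior plus correction
    form3 = x - ssq/(ssq + s0sq)*(x - mu0)               # data shrunk to prior
    assert np.allclose([form1, form2], form3)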
Posterior precision
• Precision = 1/variance, λ = 1/σ².
• Precisions add; means are precision-weighted averages.

p(µ|D, λ) = N(µ|µ_N, λ_N^{−1})

λ_N = λ_0 + Nλ

µ_N = (Nλ x̄ + λ_0 µ_0) / λ_N = w x̄ + (1 − w) µ_0,   w = Nλ / λ_N
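• The same update in precision form (hypothetical values, matching the sketch above):

    lam0, lam = 1.0, 10.0                        # prior and observation precisions
    N, xbar, mu0 = 3, 0.8, 0.0                   # hypothetical sufficient stats
    lamN = lam0 + N * lam                        # precisions add
    w = N * lam / lamN
    muN = w * xbar + (1 - w) * mu0               # precision-weighted average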
Sequential updating
• Suppose the true mean is 0.8 and the true variance is 0.1.
• p(µ|D) rapidly approaches a delta function centered on the true mean.

[Figure: Bishop 2.12]
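• A sketch of the sequential update; processing one x_n at a time gives the same posterior as the batch formula (setup hypothetical apart from the stated true mean and variance):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.8, np.sqrt(0.1), size=100)  # true mean 0.8, variance 0.1
    lam = 1 / 0.1                                # known observation precision
    lamN, muN = 1.0, 0.0                         # hypothetical prior N(0, 1)
    for xn in x:                                 # one observation at a time
        muN = (lam * xn + lamN * muN) / (lamN + lam)
        lamN = lamN + lam
    # lamN grows linearly with N, so the posterior collapses onto the true mean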
Posterior predictive distribution
• The predictive variance is the observation noise σ² plus the uncertainty about µ, σ_N²:

p(x|D) = ∫ p(x|µ) p(µ|D) dµ
       = ∫ N(x|µ, σ²) N(µ|µ_N, σ_N²) dµ
       = N(x|µ_N, σ_N² + σ²)

• Equivalently, a future X is the uncertain mean µ plus noise ε:

X = µ + ε,   µ ~ N(µ_N, σ_N²),   ε ~ N(0, σ²)

E[X] = E[µ] + E[ε] = µ_N + 0
Var[X] = Var[µ] + Var[ε] = σ_N² + σ²
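• A Monte Carlo sketch of the X = µ + ε decomposition (posterior parameters hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    muN, sNsq, ssq = 0.75, 0.02, 0.1             # hypothetical posterior + noise
    mu = rng.normal(muN, np.sqrt(sNsq), 100_000)
    x = mu + rng.normal(0.0, np.sqrt(ssq), 100_000)
    print(x.var(), sNsq + ssq)                   # close for large samples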
Summary of Normal-Normal model
• Prior: p(µ) = N(µ|µ_0, λ_0^{−1})
• Likelihood: p(D|µ) = ∏_{n=1}^N N(x_n|µ, λ^{−1})
• Posterior: p(µ|D) = N(µ|µ_N, λ_N^{−1}), with λ_N = λ_0 + Nλ and µ_N = (Nλ x̄ + λ_0 µ_0) / λ_N
• Posterior predictive: p(x|D) = N(x|µ_N, σ_N² + σ²)
• Marginal likelihood: too messy to print here (see handout)
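• Although the closed form is messy, p(D) = ∫ p(D|µ) p(µ) dµ is easy to evaluate numerically; a quadrature sketch with hypothetical values (not the handout's formula):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    x = np.array([0.9, 0.7, 0.8])                # hypothetical data
    mu0, lam0, lam = 0.0, 1.0, 10.0              # hypothetical prior, known precision
    lik = lambda mu: np.prod(norm.pdf(x, mu, np.sqrt(1/lam)))
    pD, _ = quad(lambda mu: lik(mu) * norm.pdf(mu, mu0, np.sqrt(1/lam0)), -10, 10)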