CS340 Machine learning Bayesian statistics 2
Binomial distribution (count data)
• X ~ Binom(θ, N), X ∈ {0,1,…,N}

P(X = x | θ, N) = (N choose x) θ^x (1−θ)^{N−x}

[Figure: binomial pmf for N = 10 at θ = 0.25, 0.50, 0.75, 0.90]
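• The panels above are easy to reproduce numerically. A minimal Python sketch, assuming scipy is available (the setup is illustrative, not part of the original slides):

    # Evaluate the binomial pmf for the four panels above.
    import numpy as np
    from scipy.stats import binom

    N = 10
    xs = np.arange(N + 1)
    for theta in [0.25, 0.50, 0.75, 0.90]:
        pmf = binom.pmf(xs, N, theta)          # P(X = x | theta, N)
        assert np.isclose(pmf.sum(), 1.0)      # a pmf sums to one
        print(theta, np.round(pmf, 3))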
Bernoulli distribution (binary data)
• The binomial distribution with N = 1 is called the Bernoulli distribution.
• We write X ~ Ber(θ), X ∈ {0,1}

p(X) = θ^X (1−θ)^{1−X}

• So p(X=1) = θ and p(X=0) = 1−θ.
Bernoulli likelihood function
• The likelihood is

L(θ) = p(D|θ) = ∏_{n=1}^N p(x_n|θ)
     = ∏_n θ^{I(x_n=1)} (1−θ)^{I(x_n=0)}
     = θ^{∑_n I(x_n=1)} (1−θ)^{∑_n I(x_n=0)}
     = θ^{N_1} (1−θ)^{N_0}

• We say that N_1 and N_0 are sufficient statistics of D for θ.
• This is the same as the binomial likelihood function, up to constant factors.
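• To see the sufficient statistics in action, a small Python sketch (the data values are hypothetical):

    # The Bernoulli log-likelihood depends on D only through N1 and N0.
    import numpy as np

    D = np.array([1, 0, 1, 1, 0, 1])            # hypothetical coin flips
    N1 = D.sum()
    N0 = len(D) - N1                            # sufficient statistics

    def loglik(theta):
        return N1 * np.log(theta) + N0 * np.log(1 - theta)

    theta_mle = N1 / (N1 + N0)                  # maximizes the likelihood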
Summary of beta-Bernoulli model
• Prior: p(θ) = Beta(θ|α_1, α_0) = (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1}
• Likelihood: p(D|θ) = θ^{N_1} (1−θ)^{N_0}
• Posterior: p(θ|D) = Beta(θ|α_1+N_1, α_0+N_0)
• Posterior predictive: p(X=1|D) = (α_1+N_1) / (α_1+α_0+N)
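• The conjugate update is just addition of counts; a minimal Python sketch (prior and counts are hypothetical):

    a1, a0 = 1.0, 1.0                # hypothetical uniform prior Beta(1, 1)
    N1, N0 = 7, 3                    # hypothetical heads/tails counts
    a1_post, a0_post = a1 + N1, a0 + N0          # posterior is Beta(a1', a0')
    p_heads = a1_post / (a1_post + a0_post)      # posterior predictive p(X=1|D)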
Marginal likelihood
• When performing Bayesian model selection and empirical Bayes estimation, we will need

p(D) = ∫ p(D|θ) p(θ) dθ

• This is given by a ratio of the posterior and prior normalizing constants:

p(θ|D) = p(θ) p(D|θ) / p(D)
       = (1/p(D)) (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1} θ^{N_1} (1−θ)^{N_0}
       = (1/B(α'_1, α'_0)) θ^{α'_1−1} (1−θ)^{α'_0−1}

• Matching the normalizing constants gives

p(D) = B(α'_1, α'_0) / B(α_1, α_0),  where α'_1 = α_1 + N_1, α'_0 = α_0 + N_0.
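• The closed form can be sanity-checked against the integral; a Python sketch assuming scipy, with hypothetical values:

    # p(D) = B(a1+N1, a0+N0) / B(a1, a0), checked by numerical integration.
    import numpy as np
    from scipy.special import betaln
    from scipy.integrate import quad
    from scipy.stats import beta

    a1, a0, N1, N0 = 1.0, 1.0, 7, 3              # hypothetical values
    logpD = betaln(a1 + N1, a0 + N0) - betaln(a1, a0)

    integrand = lambda t: t**N1 * (1 - t)**N0 * beta.pdf(t, a1, a0)
    pD_numeric, _ = quad(integrand, 0, 1)
    assert np.isclose(np.exp(logpD), pD_numeric)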
Summary of beta-Bernoulli model
• Prior: p(θ) = Beta(θ|α_1, α_0) = (1/B(α_1, α_0)) θ^{α_1−1} (1−θ)^{α_0−1}
• Likelihood: p(D|θ) = θ^{N_1} (1−θ)^{N_0}
• Posterior: p(θ|D) = Beta(θ|α_1+N_1, α_0+N_0)
• Posterior predictive: p(X=1|D) = (α_1+N_1) / (α_1+α_0+N)
• Marginal likelihood:

p(D) = B(α_1+N_1, α_0+N_0) / B(α_1, α_0)
     = [Γ(α_1+N_1) Γ(α_0+N_0) / Γ(α_1+N_1+α_0+N_0)] · [Γ(α_1+α_0) / (Γ(α_1) Γ(α_0))]
From coins to dice
• Let (X_1,…,X_d) | N ~ Multinomial(θ, N), where X_i ∈ {0,…,N} is the number of times face i occurs.

P(x_1,…,x_d | θ, N) = (N choose x_1 … x_d) ∏_{i=1}^d θ_i^{x_i}
                    = (N! / (x_1! x_2! ⋯ x_d!)) ∏_{i=1}^d θ_i^{x_i},  where N = ∑_i x_i

• The X_i are no longer conditionally independent, since ∑_i x_i = N.
• We also require ∑_i θ_i = 1.
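• scipy provides this pmf directly; a sketch with a hypothetical fair die:

    # Multinomial pmf for dice rolls, using scipy's implementation.
    import numpy as np
    from scipy.stats import multinomial

    theta = np.array([1/6] * 6)                  # fair die (hypothetical)
    counts = np.array([3, 1, 2, 0, 2, 2])        # x_i = times face i occurred
    N = counts.sum()                             # here N = 10
    p = multinomial.pmf(counts, n=N, p=theta)    # P(x_1,...,x_d | theta, N)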
Multinomial(θ, 1)
• Let (X_1,…,X_d) ~ Multinomial(θ, 1)

P(x_1,…,x_d | θ) = ∏_{i=1}^d θ_i^{x_i}

• Since ∑_i X_i = 1, only one "bit" can be on, e.g. (0,1,0) means face 2 occurred.
• Let X ∈ {1,…,d} represent the event that occurred. Then

P(X = k | θ) = ∏_{i=1}^d θ_i^{I(i=k)} = θ_k
Likelihood function for the multinomial
• Let D = (x_1, …, x_N), where x_n ∈ {1,…,K}, and let x_nk = I(x_n = k).

P(D|θ) ∝ ∏_{n=1}^N ∏_{k=1}^K θ_k^{x_nk} = ∏_k θ_k^{∑_n x_nk} = ∏_k θ_k^{N_k}

• The counts N_k = ∑_n I(x_n = k) are the sufficient statistics.
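• Computing the count vector N_k in Python (the data are hypothetical; np.bincount does the tallying):

    # The multinomial likelihood depends on D only through the counts N_k.
    import numpy as np

    K = 3
    D = np.array([1, 3, 2, 1, 1, 3])             # hypothetical rolls, x_n in {1..K}
    Nk = np.bincount(D, minlength=K + 1)[1:]     # N_k = #{n : x_n = k}

    def loglik(theta):                           # depends on D only through Nk
        return np.sum(Nk * np.log(theta))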
Dirichlet distribution
• Generalization of the Beta to K dimensions:

p(x|α) = D(x|α) = (1/Z(α)) x_1^{α_1−1} x_2^{α_2−1} ⋯ x_K^{α_K−1} I(∑_{k=1}^K x_k = 1)

• Normalization constant:

Z(α) = ∫⋯∫ x_1^{α_1−1} ⋯ x_K^{α_K−1} dx_1 ⋯ dx_K = ∏_{j=1}^K Γ(α_j) / Γ(∑_{j=1}^K α_j)

• Mean: E[x_k] = α_k / ∑_j α_j

[Figure: Dirichlet densities for α = (2,2,2), (20,2,2), (20,20,20)]
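• A Python sketch using scipy's Dirichlet (α is taken from one of the panels above; the evaluation point x is hypothetical):

    import numpy as np
    from scipy.stats import dirichlet

    alpha = np.array([20.0, 2.0, 2.0])           # one of the panels above
    x = np.array([0.7, 0.2, 0.1])                # a point on the simplex
    pdf = dirichlet.pdf(x, alpha)                # p(x | alpha)
    mean = dirichlet.mean(alpha)                 # equals alpha / alpha.sum()
    samples = dirichlet.rvs(alpha, size=5)       # draws from the simplex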
Summary of Dirichlet-multinomial model
• Model: X_n ~ Mult(θ, 1), p(X_n = k) = θ_k
• Prior: p(θ) = Dir(θ|α_1, …, α_K) = (1/Z(α_1, …, α_K)) ∏_{k=1}^K θ_k^{α_k−1}
• Likelihood: p(D|θ) = ∏_{k=1}^K θ_k^{N_k}
• Posterior: p(θ|D) = Dir(θ|α_1+N_1, …, α_K+N_K)
• Posterior predictive: p(X=k|D) = (α_k+N_k) / ∑_{k'} (α_{k'}+N_{k'})
• Marginal likelihood:

p(D) = Z(N+α) / Z(α) = [Γ(∑_k α_k) / Γ(N + ∑_k α_k)] ∏_k [Γ(N_k+α_k) / Γ(α_k)]
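• All of these quantities are a few lines of Python; a sketch with hypothetical prior and counts (gammaln keeps the Γ ratios stable in log space):

    import numpy as np
    from scipy.special import gammaln

    alpha = np.array([1.0, 1.0, 1.0])            # hypothetical uniform prior
    Nk = np.array([4, 1, 5])                     # observed counts, N = 10
    alpha_post = alpha + Nk                      # posterior Dirichlet parameters
    pred = alpha_post / alpha_post.sum()         # posterior predictive p(X=k|D)

    def logZ(a):                                 # log of the Dirichlet constant
        return gammaln(a).sum() - gammaln(a.sum())

    log_pD = logZ(alpha + Nk) - logZ(alpha)      # log p(D) = log Z(N+α)/Z(α)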
Normal-Normal model
• Consider estimating the mean of a Gaussian whose variance is known.
• The natural conjugate prior is Gaussian.
• So the posterior is also Gaussian ("Gaussian times Gaussian gives Gaussian").
• The algebra is rather messy (see handout), so we will just state and interpret the results.
Normal-normal model
• Likelihood:

p(D|µ) = ∏_{n=1}^N N(x_n|µ, σ²) ∝ exp(−(N/2σ²)(x̄ − µ)²) ∝ N(x̄|µ, σ²/N)

• Natural conjugate prior:

p(µ) ∝ exp(−(1/2σ_0²)(µ − µ_0)²) ∝ N(µ|µ_0, σ_0²)

• Posterior:

p(µ|D) = N(µ|µ_N, σ_N²)

µ_N = (σ² / (Nσ_0² + σ²)) µ_0 + (Nσ_0² / (Nσ_0² + σ²)) x̄ = σ_N² (µ_0/σ_0² + N x̄/σ²)

σ_N² = σ² σ_0² / (Nσ_0² + σ²) = 1 / (N/σ² + 1/σ_0²)
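• The update is mechanical once N and x̄ are known; a Python sketch with hypothetical values:

    # Posterior update for a Gaussian mean with known observation variance.
    import numpy as np

    mu0, s0sq = 0.0, 1.0                         # hypothetical prior N(mu0, s0sq)
    ssq = 0.1                                    # known observation variance
    x = np.array([0.9, 0.7, 0.8])                # hypothetical data
    N, xbar = len(x), x.mean()

    sNsq = 1.0 / (N / ssq + 1.0 / s0sq)          # posterior variance
    muN = sNsq * (mu0 / s0sq + N * xbar / ssq)   # posterior mean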
Posterior mean
• Consider N = 1. Then

µ_1 = (σ² / (σ² + σ_0²)) µ_0 + (σ_0² / (σ² + σ_0²)) x    (convex combination of prior mean and MLE)
    = µ_0 + (σ_0² / (σ² + σ_0²)) (x − µ_0)               (prior plus data correction term)
    = x − (σ² / (σ² + σ_0²)) (x − µ_0)                   (data shrunk towards the prior)
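• A quick numerical check that the three forms agree (values hypothetical):

    import numpy as np

    mu0, s0sq, ssq, x = 0.0, 1.0, 0.1, 0.8       # hypothetical values, N = 1
    form1 = ssq/(ssq + s0sq)*mu0 + s0sq/(ssq + s0sq)*x   # convex combination
    form2 = mu0 + s0sq/(ssq + s0sq)*(x - mu0)            # prior plus correction
    form3 = x - ssq/(ssq + s0sq)*(x - mu0)               # data shrunk to prior
    assert np.allclose([form1, form2], form3)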
Posterior precision
• Precision = 1/variance, λ = 1/σ².
• Precisions add; means are precision-weighted averages.

p(µ|D, λ) = N(µ|µ_N, λ_N^{−1})

λ_N = λ_0 + Nλ

µ_N = (Nλ x̄ + λ_0 µ_0) / λ_N = w x̄ + (1 − w) µ_0,   w = Nλ / λ_N
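• The same update in precision form (hypothetical values, matching the sketch above):

    lam0, lam = 1.0, 10.0                        # prior and observation precisions
    N, xbar, mu0 = 3, 0.8, 0.0                   # hypothetical sufficient stats
    lamN = lam0 + N * lam                        # precisions add
    w = N * lam / lamN
    muN = w * xbar + (1 - w) * mu0               # precision-weighted average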
Sequential updating
• Suppose the true mean is 0.8 and the true variance is 0.1.
• p(µ|D) rapidly approaches a delta function centered on the true mean.

[Figure: Bishop 2.12]
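• A sketch of the sequential update; processing one x_n at a time gives the same posterior as the batch formula (setup hypothetical apart from the stated true mean and variance):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.8, np.sqrt(0.1), size=100)  # true mean 0.8, variance 0.1
    lam = 1 / 0.1                                # known observation precision
    lamN, muN = 1.0, 0.0                         # hypothetical prior N(0, 1)
    for xn in x:                                 # one observation at a time
        muN = (lam * xn + lamN * muN) / (lamN + lam)
        lamN = lamN + lam
    # lamN grows linearly with N, so the posterior collapses onto the true mean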
Posterior predictive distribution
• The predictive variance is the observation noise σ² plus the uncertainty about µ, σ_N²:

p(x|D) = ∫ p(x|µ) p(µ|D) dµ
       = ∫ N(x|µ, σ²) N(µ|µ_N, σ_N²) dµ
       = N(x|µ_N, σ_N² + σ²)

• Equivalently, a future X is the uncertain mean µ plus noise ε:

X = µ + ε,   µ ~ N(µ_N, σ_N²),   ε ~ N(0, σ²)

E[X] = E[µ] + E[ε] = µ_N + 0
Var[X] = Var[µ] + Var[ε] = σ_N² + σ²
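• A Monte Carlo sketch of the X = µ + ε decomposition (posterior parameters hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    muN, sNsq, ssq = 0.75, 0.02, 0.1             # hypothetical posterior + noise
    mu = rng.normal(muN, np.sqrt(sNsq), 100_000)
    x = mu + rng.normal(0.0, np.sqrt(ssq), 100_000)
    print(x.var(), sNsq + ssq)                   # close for large samples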
Summary of Normal-Normal model
• Prior: p(µ) = N(µ|µ_0, λ_0^{−1})
• Likelihood: p(D|µ) = ∏_{n=1}^N N(x_n|µ, λ^{−1})
• Posterior: p(µ|D) = N(µ|µ_N, λ_N^{−1}), with λ_N = λ_0 + Nλ and µ_N = (Nλ x̄ + λ_0 µ_0) / λ_N
• Posterior predictive: p(x|D) = N(x|µ_N, σ_N² + σ²)
• Marginal likelihood: too messy to print here (see handout)
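• Although the closed form is messy, p(D) = ∫ p(D|µ) p(µ) dµ is easy to evaluate numerically; a quadrature sketch with hypothetical values (not the handout's formula):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    x = np.array([0.9, 0.7, 0.8])                # hypothetical data
    mu0, lam0, lam = 0.0, 1.0, 10.0              # hypothetical prior, known precision
    lik = lambda mu: np.prod(norm.pdf(x, mu, np.sqrt(1/lam)))
    pD, _ = quad(lambda mu: lik(mu) * norm.pdf(mu, mu0, np.sqrt(1/lam0)), -10, 10)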