Pattern Recognition Prof. Christian Bauckhage
outline lecture 06

- recap
- Bayesians vs. Frequentists
- warmup: computing with probabilities
- pathologies
- fun with Gaussians
- summary
recap
probability ⇔ degree of belief in the truth of a proposition X

range: $0 \leq p(X) \leq 1$, where $0 = p(\text{false})$ and $1 = p(\text{true})$

sum rule: $p(X) + p(\neg X) = 1$

product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
recap

Bayes' theorem

$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$

$p(X \mid Y)$ ≡ posterior probability
$p(Y \mid X)$ ≡ likelihood
$p(X)$ ≡ prior probability

mnemonic: X ≡ hypothesis, Y ≡ data, so that posterior ∝ likelihood × prior

Thomas Bayes (∗1701, †1761)
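as a quick numeric illustration, here is a minimal Python sketch of the theorem; the prevalence and test accuracies are made-up values for illustration, not from the lecture:

```python
# Bayes' theorem on a made-up diagnostic test:
# X = "patient has the disease", Y = "test is positive"
p_X = 0.01              # prior p(X): assumed prevalence
p_Y_given_X = 0.95      # likelihood p(Y | X): assumed sensitivity
p_Y_given_notX = 0.10   # assumed false-positive rate p(Y | not X)

# evidence p(Y) via marginalization and the product rule
p_Y = p_Y_given_X * p_X + p_Y_given_notX * (1 - p_X)

# posterior p(X | Y) = p(Y | X) p(X) / p(Y)
p_X_given_Y = p_Y_given_X * p_X / p_Y
print(f"p(X | Y) = {p_X_given_Y:.3f}")   # -> 0.088
```

note how a strong likelihood is tempered by a small prior: a positive test still leaves the posterior below 9%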
recap

marginalization

$$\int p(X, Y)\, dY = p(X) \qquad \int p(X \mid Y)\, dX = 1 \qquad \int p(X, Z \mid Y)\, dZ = p(X \mid Y)$$
recap
if X and Y are independent, then $p(X, Y) = p(X)\, p(Y)$, because $p(X, Y) = p(X \mid Y)\, p(Y)$ and $p(X \mid Y) = p(X)$
recap

expectation

$$E\big[f(x)\big] = \int f(x)\, p(x)\, dx$$

special case

$$E[x] = \int x\, p(x)\, dx \equiv \mu$$

[figures: example densities p(x) with their expected values E[x] marked on the x axis]
recap
variance

$$\operatorname{var}\big[f(x)\big] = E\Big[\big(f(x) - E[f(x)]\big)^2\Big] = E\big[f^2(x)\big] - E^2\big[f(x)\big]$$

special case

$$\operatorname{var}[x] = E\big[x^2\big] - E^2[x] = E\big[(x - \mu)^2\big] \equiv \sigma^2$$
recap
covariance

$$\operatorname{cov}[x, y] = E_{x,y}\big[(x - E[x])\,(y - E[y])\big] = E_{x,y}[x\,y] - E[x]\, E[y]$$

covariance matrix

$$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = E_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - E[\mathbf{x}])\,(\mathbf{y} - E[\mathbf{y}])^T\big] = E_{\mathbf{x},\mathbf{y}}\big[\mathbf{x}\,\mathbf{y}^T\big] - E[\mathbf{x}]\, E[\mathbf{y}]^T$$

important special case

$$\operatorname{cov}[\mathbf{x}, \mathbf{x}] = E\big[\mathbf{x}\,\mathbf{x}^T\big] - \boldsymbol{\mu}\,\boldsymbol{\mu}^T \quad \text{where} \quad \boldsymbol{\mu} = E[\mathbf{x}]$$
Bayesians vs. Frequentists
two views on probability
Bayesian view: degree of belief or plausibility

Frequentist view: relative frequencies or percentages of counts, e.g., coin tosses H, T, T, T, H, T, H, H, H, H ⇔ $p(H) = \frac{6}{10}$
problems
Bayesian view: where does the prior come from?

Frequentist view: what is the probability of a yet unobserved event?
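the frequentist estimate from the coin-toss example above is just a relative frequency; a one-line sketch:

```python
tosses = list("HTTTHTHHHH")            # the ten tosses from the slide
p_H = tosses.count("H") / len(tosses)  # relative frequency of heads
print(p_H)                             # 0.6
```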
computing with probabilities
Bernoulli distribution

probability distribution of a binary r.v. X ∈ {0, 1} where $p(X = 1) = 1 - p(X = 0) = q$

we write

$$f_{\text{Ber}}(x \mid q) = q^x\, (1 - q)^{1-x}$$

because $f_{\text{Ber}}(1 \mid q) = q^1 (1 - q)^0 = q$ and $f_{\text{Ber}}(0 \mid q) = q^0 (1 - q)^1 = 1 - q$

Jakob Bernoulli (∗1654, †1705)
expectation of a Bernoulli r.v.

assume $X \sim f_{\text{Ber}}(x \mid q)$; we have

$$E[x] = \sum_{x=0}^{1} x \cdot p(x) = 0 \cdot (1 - q) + 1 \cdot q = q$$
variance of a Bernoulli r.v.

assume $X \sim f_{\text{Ber}}(x \mid q)$; we have

$$\operatorname{var}[x] = E\big[(x - E[x])^2\big] = \sum_{x=0}^{1} (x - q)^2 \cdot p(x) = (0 - q)^2 (1 - q) + (1 - q)^2\, q = q^2 (1 - q) + (1 - q)^2 q = q\,(1 - q)\,\big(q + 1 - q\big) = q\,(1 - q)$$
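a quick empirical sanity check of both results (a sketch using NumPy; the choice q = 0.3 is arbitrary):

```python
import numpy as np

q = 0.3
rng = np.random.default_rng(0)
x = rng.random(100_000) < q     # Bernoulli(q) samples as 0/1 booleans

print(x.mean())                 # ~ q         = 0.3
print(x.var())                  # ~ q (1 - q) = 0.21
```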
binomial distribution

probability of observing k occurrences of x = 1 in n Bernoulli trials

$$f_{\text{Bin}}(k \mid n, q) = \binom{n}{k} q^k (1 - q)^{n-k} \quad \text{where} \quad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
expectation of a binomial r.v.

assume $X \sim f_{\text{Bin}}(k \mid n, q)$; we have

$$\begin{aligned}
E[k] &= \sum_{k=0}^{n} k \cdot p(k) = \sum_{k=0}^{n} k \binom{n}{k} q^k (1-q)^{n-k} \\
&= \sum_{k=1}^{n} k \binom{n}{k} q^k (1-q)^{n-k} \\
&= \sum_{k=1}^{n} k\, \frac{n\,(n-1)!}{k\,(k-1)!\,(n-k)!}\; q\; q^{k-1} (1-q)^{n-k} \\
&= n q \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\; q^{k-1} (1-q)^{n-k} \\
&= n q \sum_{m=0}^{n-1} \frac{(n-1)!}{m!\,(n-1-m)!}\; q^{m} (1-q)^{n-1-m} \\
&= n q
\end{aligned}$$

where the last step holds because the remaining sum runs over the entire binomial distribution for n − 1 trials and therefore equals 1
variance of a binomial r.v.

assume $X \sim f_{\text{Bin}}(k \mid n, q)$; we have

$$\operatorname{var}[k] = E\big[(k - E[k])^2\big] = \sum_{k=0}^{n} (k - n q)^2 \binom{n}{k} q^k (1-q)^{n-k} = \dots = n\,q\,(1-q)$$
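both moments can also be checked exactly by summing over the pmf (a sketch; n = 20 and q = 0.3 are arbitrary):

```python
import numpy as np
from math import comb

n, q = 20, 0.3
k = np.arange(n + 1)
pmf = np.array([comb(n, i) for i in k]) * q**k * (1 - q)**(n - k)

E_k = (k * pmf).sum()               # expectation by direct summation
var_k = ((k - E_k)**2 * pmf).sum()  # variance by direct summation
print(E_k, n * q)                   # 6.0  6.0
print(var_k, n * q * (1 - q))       # 4.2  4.2
```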
Gaussian / normal distribution

often applies to physical quantities (as opposed to man-made ones)

$$N\big(x \mid \mu, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

Carl Friedrich Gauß (∗1777, †1855)
standard normal distribution

for µ = 0 and σ = 1, we have

$$N(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} x^2}$$

[figure: the standard normal bell curve]
question

where does the 2π come from? what does 2π have to do with probabilities?

answer

let's see . . .
note

many functions have anti-derivatives that cannot be expressed in terms of elementary functions; common examples include

$$\int e^{-x^2}\, dx \qquad \int \frac{\sin x}{x}\, dx \qquad \int \frac{1}{\ln x}\, dx \qquad \int x^x\, dx$$

nevertheless, their definite integrals can be computed
Theorem

$$\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2\sigma^2}(z-\mu)^2}\, dz = 1$$
Proof.

to simplify notation, we first substitute $x = z - \mu$ and $a = \frac{1}{2\sigma^2}$, and then consider

$$I = \int_{-\infty}^{\infty} e^{-a x^2}\, dx$$

then

$$I^2 = \int_{-\infty}^{\infty} e^{-a x^2}\, dx \cdot \int_{-\infty}^{\infty} e^{-a y^2}\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-a (x^2 + y^2)}\, dx\, dy$$
Proof (cont.)

next, we change variables from Cartesian coordinates (x, y) to polar coordinates (r, ϕ)

$$x = r \cos\varphi \qquad y = r \sin\varphi$$

so that $x^2 + y^2 = r^2$
Proof (cont.)

recall that $dx\, dy = \det J\; dr\, d\varphi$

[figure: the area element dx dy compared to r dr dϕ]

for the Jacobian, we have

$$J = \frac{\partial(x, y)}{\partial(r, \varphi)} = \begin{bmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \varphi} \\[4pt] \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \varphi} \end{bmatrix} = \begin{bmatrix} \cos\varphi & -r \sin\varphi \\ \sin\varphi & r \cos\varphi \end{bmatrix}$$

so that $\det J = r \cos^2\varphi + r \sin^2\varphi = r$ and $dx\, dy = \det J\; dr\, d\varphi = r\, dr\, d\varphi$
Proof (cont.)

accordingly, we have

$$I^2 = \int_{0}^{\infty} \int_{0}^{2\pi} r\, e^{-a r^2}\, d\varphi\, dr = 2\pi \int_{0}^{\infty} r\, e^{-a r^2}\, dr$$

next, we substitute $u = r^2 \;\Leftrightarrow\; du = 2r\, dr$
Proof (cont.)

then

$$I^2 = 2\pi\, \frac{1}{2} \int_{0}^{\infty} e^{-a u}\, du = \pi \left[ -\frac{1}{a}\, e^{-a u} \right]_{0}^{\infty} = \pi \left( 0 - \Big( -\frac{1}{a} \Big) \right) = \frac{\pi}{a}$$

so that

$$I = \sqrt{\frac{\pi}{a}} = \sqrt{2\pi\sigma^2} \qquad \Leftrightarrow \qquad \frac{1}{\sqrt{2\pi\sigma^2}} \cdot I = 1$$
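the theorem is also easy to confirm numerically; a sketch that approximates the integral by a Riemann sum (µ and σ² are arbitrary):

```python
import numpy as np

mu, sigma2 = 1.5, 4.0
x = np.linspace(mu - 50, mu + 50, 1_000_001)   # wide, fine grid
f = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

dx = x[1] - x[0]
print(f.sum() * dx)                            # ~ 1.0
```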
expectation of a normally distributed r.v.

assume $X \sim N(x \mid \mu, \sigma^2) \equiv f(x)$; we have

$$E[x] = \int_{-\infty}^{\infty} x\, N(x \mid \mu, \sigma^2)\, dx = \int_{-\infty}^{\infty} x\, f(x)\, dx = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} x\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\, dx$$

we observe

$$\frac{d}{dx}\, \gamma\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2} = -\frac{x - \mu}{\sigma^2}\, \gamma\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$
in other words

$$\frac{d}{dx} f(x) = \frac{\mu}{\sigma^2}\, f(x) - \frac{x}{\sigma^2}\, f(x)$$

therefore

$$x\, f(x) = \mu\, f(x) - \sigma^2\, \frac{d}{dx} f(x)$$

and hence

$$E[x] = \int_{-\infty}^{\infty} x\, f(x)\, dx = \mu \int_{-\infty}^{\infty} f(x)\, dx - \sigma^2 \int_{-\infty}^{\infty} \frac{d}{dx} f(x)\, dx = \mu - \sigma^2 \Big[ f(x) \Big]_{-\infty}^{\infty} = \mu$$
variance of a normally distributed r.v.

assume $X \sim N(x \mid \mu, \sigma^2)$; we have

$$\operatorname{var}[x] = E\big[(x - \mu)^2\big] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\, dx = \dots = \sigma^2$$
mean and variance of the normal distribution

[figures: N(x | µ, σ²) with the intervals [µ − σ, µ + σ] and [µ − 3σ, µ + 3σ] marked on the x axis]

$$\int_{\mu-\sigma}^{\mu+\sigma} f(x)\, dx \approx 0.683 \qquad \int_{\mu-3\sigma}^{\mu+3\sigma} f(x)\, dx \approx 0.997$$

the farther away from the mean we look, the less likely we are to find something
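both fractions are easy to reproduce by sampling (a sketch; µ and σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, 1_000_000)

print(np.mean(np.abs(x - mu) <= sigma))       # ~ 0.683
print(np.mean(np.abs(x - mu) <= 3 * sigma))   # ~ 0.997
```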
final observations

the Bernoulli distribution is a special case of the binomial distribution

$$f_{\text{Ber}}(x \mid q) = f_{\text{Bin}}(x \mid n = 1, q)$$

if n, nq, and n(1 − q) are all large, the binomial distribution is well approximated by a normal distribution

$$f_{\text{Bin}}(x \mid n, q) \approx N\big(x \mid n q,\; n q (1 - q)\big)$$
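the normal approximation can be eyeballed numerically; a sketch comparing the binomial pmf against N(x | nq, nq(1 − q)) on a grid (n = 100 and q = 0.4 are arbitrary):

```python
import numpy as np
from math import comb

n, q = 100, 0.4                  # n, nq, n(1-q) all reasonably large
k = np.arange(n + 1)
pmf = np.array([comb(n, i) for i in k]) * q**k * (1 - q)**(n - k)

mu, var = n * q, n * q * (1 - q)
pdf = np.exp(-(k - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(np.abs(pmf - pdf).max())   # small relative to pmf.max() ~ 0.08
```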
pathologies
Pareto distribution

a power law distribution important in sociology / economics; for all $x \geq x_0$

$$f(x \mid x_0, \alpha) = \frac{\alpha\, x_0^{\alpha}}{x^{\alpha+1}}$$

Vilfredo F. D. Pareto (∗1848, †1932)
expectation / variance of a Pareto r.v.

assume $X \sim f(x \mid 1, 2)$; we have

$$E[x] = 2 \int_{1}^{\infty} \frac{x}{x^3}\, dx = 2 \Big[ -\frac{1}{x} \Big]_{1}^{\infty} = 2$$

$$\operatorname{var}[x] = 2 \int_{1}^{\infty} \frac{(x - 2)^2}{x^3}\, dx = 2 \left[ \frac{4x - 2}{x^2} + \log(x) \right]_{1}^{\infty} \longrightarrow \infty$$

⇒ var[x] does not exist
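the divergence shows up clearly in simulation: the sample mean settles near 2, but the sample variance keeps growing with the sample size (a sketch; note that NumPy's pareto() samples the shifted Pareto II, so we add 1 to obtain the classical Pareto with x₀ = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.pareto(2.0, 10_000_000) + 1.0    # classical Pareto(x0=1, alpha=2)

for n in (10**3, 10**5, 10**7):
    print(n, x[:n].mean(), x[:n].var())  # mean ~ 2, variance keeps growing
```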
Cauchy / Lorentz distribution

a rather pathological distribution

$$f(x \mid x_0, \gamma) = \frac{1}{\pi\gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right]}$$

Augustin L. Cauchy (∗1789, †1857)
standard Cauchy distribution

distribution of the ratio of two standard normal variables; for $x_0 = 0$ and $\gamma = 1$, we have

$$f(x) = \frac{1}{\pi\, (1 + x^2)}$$

observe

$$\int_{-\infty}^{\infty} \frac{1}{1 + x^2}\, dx = \Big[ \arctan(x) \Big]_{-\infty}^{\infty} = \pi$$

[figure: the standard Cauchy density]
expectation of a Cauchy r.v.

assume $X \sim f(x)$; we have

$$E[x] = \int_{-\infty}^{\infty} \frac{x}{\pi\, (1 + x^2)}\, dx$$

we observe

$$\int \frac{x}{\pi\, (1 + x^2)}\, dx = \frac{1}{2\pi} \log\big(x^2 + 1\big)$$

which grows without bound at both integration limits

⇒ E[x] does not exist
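the non-existent expectation is visible in simulation: running means of Cauchy samples never settle down, because occasional huge samples keep knocking the average around (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_cauchy(1_000_000)

for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, x[:n].mean())   # does not converge as n grows
```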
fun with Gaussians
recall

uni-variate Gaussian distribution, for $x, \mu \in \mathbb{R}$ and $0 < \sigma^2 \in \mathbb{R}$

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

multi-variate Gaussian distribution, for $\mathbf{x}, \boldsymbol{\mu} \in \mathbb{R}^m$ and non-singular $\Sigma \in \mathbb{R}^{m \times m}$

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^m \det\Sigma}}\; e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
examples

[figures: a uni-variate Gaussian N(x), x ∈ ℝ, and a bi-variate Gaussian N(x), x ∈ ℝ²]
affine transformations of multi-variate Gaussian r.v.s

suppose $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$ and let $\mathbf{z} = A\mathbf{x} + \mathbf{b}$; then

$$\mathbf{z} \sim N\big(A\boldsymbol{\mu} + \mathbf{b},\; A \Sigma A^T\big)$$
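a quick Monte Carlo check of this property (a sketch; the values of µ, Σ, A, and b are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])
b = np.array([5.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
z = x @ A.T + b                    # apply z = A x + b to every sample

print(z.mean(axis=0), A @ mu + b)  # sample mean vs. A mu + b
print(np.cov(z.T))                 # sample covariance, vs. ...
print(A @ Sigma @ A.T)             # ... A Sigma A^T
```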
first important corollary

if $\mathbf{x} = (x_1, x_2, x_3, \dots, x_m)^T \sim N(\boldsymbol{\mu}, \Sigma)$, then any subset of the $x_i$ has a marginal distribution that is also multivariate Gaussian, because subsets may be selected using projection matrices such as

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 \end{bmatrix}$$

which selects $x_1$ and $x_3$
second important corollary

if $\mathbf{x} = (x_1, x_2, x_3, \dots, x_m)^T \sim N(\boldsymbol{\mu}, \Sigma)$, then any projection of $\mathbf{x}$ onto a one-dimensional subspace produces a uni-variate Gaussian, since such projections can be achieved using $A = (w_1, w_2, w_3, \dots, w_m) = \mathbf{w}^T$ so that

$$z = \mathbf{w}^T \mathbf{x} \sim N\big(\mathbf{w}^T \boldsymbol{\mu},\; \mathbf{w}^T \Sigma\, \mathbf{w}\big)$$
convolution of two uni-variate Gaussians

$$N_a\big(x \mid \mu_a, \sigma_a^2\big) * N_b\big(x \mid \mu_b, \sigma_b^2\big) = \int_{-\infty}^{\infty} N_a(x - t)\, N_b(t)\, dt = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

where

$$\mu = \mu_a + \mu_b \qquad \sigma^2 = \sigma_a^2 + \sigma_b^2$$
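since convolving densities corresponds to adding independent random variables, the claim is easy to verify by sampling (a sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_a, var_a = 1.0, 0.5
mu_b, var_b = -2.0, 2.0

# the sum of independent draws has the convolved density
s = (rng.normal(mu_a, np.sqrt(var_a), 1_000_000)
     + rng.normal(mu_b, np.sqrt(var_b), 1_000_000))

print(s.mean(), mu_a + mu_b)    # both ~ -1.0
print(s.var(), var_a + var_b)   # both ~  2.5
```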
convolution of two multi-variate Gaussians

$$N_a\big(\mathbf{x} \mid \boldsymbol{\mu}_a, \Sigma_a\big) * N_b\big(\mathbf{x} \mid \boldsymbol{\mu}_b, \Sigma_b\big) = \frac{1}{\sqrt{(2\pi)^m \det\Sigma}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

where

$$\boldsymbol{\mu} = \boldsymbol{\mu}_a + \boldsymbol{\mu}_b \qquad \Sigma = \Sigma_a + \Sigma_b$$
uni-variate Gaussian product theorem

$$N_a\big(x \mid \mu_a, \sigma_a^2\big) \cdot N_b\big(x \mid \mu_b, \sigma_b^2\big) = \frac{1}{2\pi\sigma_a\sigma_b} \exp\left( -\frac{(x - \mu_a)^2}{2\sigma_a^2} - \frac{(x - \mu_b)^2}{2\sigma_b^2} \right) = \frac{\gamma}{\sqrt{2\pi s^2}} \exp\left( -\frac{(x - m)^2}{2 s^2} \right)$$

where

$$m = \frac{\mu_a \sigma_b^2 + \mu_b \sigma_a^2}{\sigma_a^2 + \sigma_b^2} \qquad s^2 = \frac{\sigma_a^2 \sigma_b^2}{\sigma_a^2 + \sigma_b^2} \qquad \gamma = \frac{\exp\left( -\frac{(\mu_a - \mu_b)^2}{2 (\sigma_a^2 + \sigma_b^2)} \right)}{\sqrt{2\pi \big(\sigma_a^2 + \sigma_b^2\big)}}$$
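the theorem can be verified pointwise on a grid (a sketch; it exploits the observation that γ itself equals $N(\mu_a \mid \mu_b, \sigma_a^2 + \sigma_b^2)$, and the parameters are arbitrary):

```python
import numpy as np

def gauss(x, mu, var):
    """uni-variate Gaussian density N(x | mu, var)"""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_a, var_a = 0.0, 1.0
mu_b, var_b = 2.0, 0.5
x = np.linspace(-5.0, 7.0, 10_001)

lhs = gauss(x, mu_a, var_a) * gauss(x, mu_b, var_b)

m = (mu_a * var_b + mu_b * var_a) / (var_a + var_b)
s2 = var_a * var_b / (var_a + var_b)
gamma = gauss(mu_a, mu_b, var_a + var_b)   # the scaling factor gamma
rhs = gamma * gauss(x, m, s2)

print(np.abs(lhs - rhs).max())             # ~ 0 (floating point noise)
```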
multi-variate Gaussian product theorem

$$N_a\big(\mathbf{x} \mid \boldsymbol{\mu}_a, \Sigma_a\big) \cdot N_b\big(\mathbf{x} \mid \boldsymbol{\mu}_b, \Sigma_b\big) = \frac{\sqrt{\det S\, \det M}}{\sqrt{\det \Sigma_a\, \det \Sigma_b}} \cdot \gamma \cdot \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{m})^T S^{-1} (\mathbf{x} - \mathbf{m}) \right)$$

where

$$S = \big( \Sigma_a^{-1} + \Sigma_b^{-1} \big)^{-1} \qquad M = \Sigma_a + \Sigma_b \qquad \mathbf{m} = S \big( \Sigma_a^{-1} \boldsymbol{\mu}_a + \Sigma_b^{-1} \boldsymbol{\mu}_b \big)$$

$$\gamma = \frac{\exp\left( -\frac{1}{2} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_b)^T M^{-1} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_b) \right)}{\sqrt{(2\pi)^m \det M}}$$
summary
we now know about

- several common probability distributions
- how to compute their means and variances
- the fact that means and variances may be undefined
- several convenient properties of the Gaussian distribution