Pattern Recognition

Prof. Christian Bauckhage

outline lecture 06
- recap
- Bayesians vs. Frequentists
- warmup: computing with probabilities
- pathologies
- fun with Gaussians
- summary

recap

probability ⇔ degree of belief in the truth of a proposition X

$0 \leq p(X) \leq 1$, where $0 = p(\text{false})$ and $1 = p(\text{true})$

closure / sum rule: $p(X) + p(\neg X) = 1$

conditional probability / product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$

recap

Bayes' theorem

$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$

$p(X \mid Y)$ ≡ posterior probability
$p(Y \mid X)$ ≡ likelihood
$p(X)$ ≡ prior probability

mnemonic: $X$ ≡ hypothesis, $Y$ ≡ data, posterior ∝ likelihood × prior

Thomas Bayes (∗1701, †1761)
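
To make the mnemonic concrete, here is a minimal numeric sketch (not part of the original slides) applying Bayes' theorem to a hypothetical binary test; the prior and the likelihoods are made-up numbers chosen only for illustration.

```python
# hypothetical numbers: p(H) = prior, p(D | H) and p(D | ¬H) = likelihoods
p_h = 0.01              # prior probability of the hypothesis
p_d_given_h = 0.95      # likelihood of the data if the hypothesis is true
p_d_given_not_h = 0.05  # likelihood of the data otherwise

# evidence p(D) via the sum rule / marginalization
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# posterior = likelihood * prior / evidence
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)      # ≈ 0.16: posterior ∝ likelihood × prior
```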

recap

marginalization

$$\int p(X, Y)\, dY = p(X) \qquad \int p(X \mid Y)\, dX = 1 \qquad \int p(X, Z \mid Y)\, dZ = p(X \mid Y)$$

recap

if X and Y are independent, then $p(X, Y) = p(X)\, p(Y)$, because $p(X, Y) = p(X \mid Y)\, p(Y)$ and $p(X \mid Y) = p(X)$

recap

expectation

$$E\big[f(x)\big] = \int f(x)\, p(x)\, dx$$

(figures: densities $p(x)$ with the expectation $E[x]$ marked on the x-axis)

special case

$$E[x] = \int x\, p(x)\, dx \equiv \mu$$

recap

variance

$$\mathrm{var}\big[f(x)\big] = E\Big[\big(f(x) - E[f(x)]\big)^2\Big] = E\big[f^2(x)\big] - E^2\big[f(x)\big]$$

special case

$$\mathrm{var}[x] = E\Big[\big(x - E[x]\big)^2\Big] = E\big[(x - \mu)^2\big] \equiv \sigma^2$$
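
As a quick sanity check (not part of the slides; numpy is assumed), the identity $\mathrm{var}[f(x)] = E[f^2(x)] - E^2[f(x)]$ can be verified empirically on random samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000_000)
f = x ** 2                                          # some function of the r.v., here f(x) = x^2

var_direct = np.mean((f - f.mean()) ** 2)           # E[(f - E[f])^2]
var_moments = np.mean(f ** 2) - np.mean(f) ** 2     # E[f^2] - E^2[f]

print(var_direct, var_moments)                      # the two estimates agree up to floating-point error
```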

recap

covariance

$$\mathrm{cov}[x, y] = E_{x,y}\Big[\big(x - E[x]\big)\big(y - E[y]\big)\Big] = E_{x,y}[x\,y] - E[x]\, E[y]$$

covariance matrix

$$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = E_{\mathbf{x},\mathbf{y}}\Big[\big(\mathbf{x} - E[\mathbf{x}]\big)\big(\mathbf{y} - E[\mathbf{y}]\big)^T\Big] = E_{\mathbf{x},\mathbf{y}}\big[\mathbf{x}\,\mathbf{y}^T\big] - E[\mathbf{x}]\, E[\mathbf{y}]^T$$

important special case

$$\mathrm{cov}[\mathbf{x}, \mathbf{x}] = E\big[\mathbf{x}\,\mathbf{x}^T\big] - \boldsymbol{\mu}\boldsymbol{\mu}^T \qquad \text{where} \quad \boldsymbol{\mu} = E[\mathbf{x}]$$

Bayesians vs. Frequentists

two views on probability

Bayesian view: degree of belief or plausibility

Frequentist view: relative frequencies or percentages of counts, e.g., coin tosses H, T, T, T, H, T, H, H, H, H ⇔ $p(H) = \frac{6}{10}$

problems

Bayesian view: where does the prior come from?

Frequentist view: what is the probability of a yet unobserved event?

computing with probabilities


Bernoulli distribution

probability distribution of a binary r.v. $X \in \{0, 1\}$ where $p(X = 1) = 1 - p(X = 0) = q$

we write
$$f_{\mathrm{Ber}}(x \mid q) = q^x (1 - q)^{1-x}$$
because
$$f_{\mathrm{Ber}}(1 \mid q) = q^1 (1 - q)^0 = q \qquad f_{\mathrm{Ber}}(0 \mid q) = q^0 (1 - q)^1 = 1 - q$$

Jakob Bernoulli (∗1654, †1705)

expectation of a Bernoulli r.v.

assume $X \sim f_{\mathrm{Ber}}(x \mid q)$; we have
$$E[x] = \sum_{x=0}^{1} x \cdot p(x) = 0 \cdot (1 - q) + 1 \cdot q = q$$

variance of a Bernoulli r.v.

assume $X \sim f_{\mathrm{Ber}}(x \mid q)$; we have
$$\mathrm{var}[x] = E\Big[\big(x - E[x]\big)^2\Big] = \sum_{x=0}^{1} (x - q)^2 \cdot p(x) = (0 - q)^2 (1 - q) + (1 - q)^2\, q = q\,(1 - q)\,\big(q + 1 - q\big) = q\,(1 - q)$$
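
A small simulation sketch (not from the slides; numpy assumed) that checks $E[x] = q$ and $\mathrm{var}[x] = q(1-q)$ on Bernoulli samples:

```python
import numpy as np

q = 0.3
rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=q, size=100_000)   # Bernoulli = binomial with n = 1

print(x.mean())   # ≈ q = 0.3
print(x.var())    # ≈ q * (1 - q) = 0.21
```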

binomial distribution

probability of observing $k$ occurrences of $x = 1$ in $n$ Bernoulli trials
$$f_{\mathrm{Bin}}(k \mid n, q) = \binom{n}{k} q^k (1 - q)^{n-k} \qquad \text{where} \quad \binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

expectation of a binomial r.v.

assume $X \sim f_{\mathrm{Bin}}(k \mid n, q)$; we have
$$E[k] = \sum_{k=0}^{n} k \cdot p(k) = \sum_{k=0}^{n} k \binom{n}{k} q^k (1 - q)^{n-k} = \sum_{k=1}^{n} k \binom{n}{k} q^k (1 - q)^{n-k}$$
$$= \sum_{k=1}^{n} k\, \frac{n\,(n-1)!}{k\,(k-1)!\,(n-k)!}\; q\, q^{k-1} (1 - q)^{n-k} = n q \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\; q^{k-1} (1 - q)^{n-k}$$
$$= n q \sum_{m=0}^{n-1} \frac{(n-1)!}{m!\,(n-1-m)!}\; q^{m} (1 - q)^{n-1-m} = n q$$

variance of a binomial r.v.

assume $X \sim f_{\mathrm{Bin}}(k \mid n, q)$; we have
$$\mathrm{var}[k] = E\Big[\big(k - E[k]\big)^2\Big] = \sum_{k=0}^{n} (k - n q)^2 \binom{n}{k} q^k (1 - q)^{n-k} = \;\dots\; = n\, q\, (1 - q)$$
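
A quick empirical check (not from the slides; numpy assumed) that the binomial mean and variance are $nq$ and $nq(1-q)$:

```python
import numpy as np

n, q = 20, 0.3
rng = np.random.default_rng(2)
k = rng.binomial(n=n, p=q, size=200_000)

print(k.mean())   # ≈ n * q = 6.0
print(k.var())    # ≈ n * q * (1 - q) = 4.2
```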

Gaussian / normal distribution

often applies to physical quantities (as opposed to man-made ones)
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}\; e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$

Carl Friedrich Gauß (∗1777, †1855)

standard normal distribution

for $\mu = 0$ and $\sigma = 1$, we have
$$N(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} x^2}$$

(figure: standard normal density)

question: where does the $2\pi$ come from? what does $2\pi$ have to do with probabilities?

answer: let's see . . .

note

many functions have anti-derivatives that cannot be expressed in terms of elementary functions; common examples include
$$\int e^{-x^2}\, dx \qquad \int \frac{\sin x}{x}\, dx \qquad \int \frac{1}{\ln x}\, dx \qquad \int x^x\, dx$$

nevertheless, their definite integrals can be computed

Theorem

$$\frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2\sigma^2}(z - \mu)^2}\, dz = 1$$
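
Before the proof, a quick numerical sanity check of this claim (not part of the slides; plain numpy and a simple Riemann sum are assumed):

```python
import numpy as np

mu, sigma = 1.7, 2.3
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = x[1] - x[0]

# Gaussian density evaluated on a fine grid
integrand = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(np.sum(integrand) * dx)   # ≈ 1.0
```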

Proof. to simplify notation, we first substitute $x = z - \mu$ and $a = \frac{1}{2\sigma^2}$ and then consider
$$I = \int_{-\infty}^{\infty} e^{-a x^2}\, dx$$

then
$$I^2 = \int_{-\infty}^{\infty} e^{-a x^2}\, dx \cdot \int_{-\infty}^{\infty} e^{-a y^2}\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-a (x^2 + y^2)}\, dx\, dy$$

Proof (cont.)

next, we change variables from Cartesian coordinates $(x, y)$ to polar coordinates $(r, \varphi)$
$$x = r \cos\varphi \qquad y = r \sin\varphi$$
so that $x^2 + y^2 = r^2$

Proof (cont.)

recall that
$$dx\, dy = \det J\; dr\, d\varphi$$

(figure: area element in Cartesian and polar coordinates)

Proof (cont.)

for the Jacobian, we have
$$J = \frac{\partial(x, y)}{\partial(r, \varphi)} = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \varphi} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \varphi} \end{pmatrix} = \begin{pmatrix} \cos\varphi & -r \sin\varphi \\ \sin\varphi & r \cos\varphi \end{pmatrix}$$
so that
$$\det J = r \cos^2\varphi + r \sin^2\varphi = r$$
and
$$dx\, dy = \det J\; dr\, d\varphi = r\, dr\, d\varphi$$

Proof (cont.)

accordingly, we have
$$I^2 = \int_{0}^{\infty} \int_{0}^{2\pi} r\, e^{-a r^2}\, d\varphi\, dr = 2\pi \int_{0}^{\infty} r\, e^{-a r^2}\, dr$$

next, we substitute $u = r^2 \;\Leftrightarrow\; du = 2 r\, dr$

Proof (cont.)

then
$$I^2 = 2\pi\, \frac{1}{2} \int_{0}^{\infty} e^{-a u}\, du = \pi \left[ -\frac{1}{a}\, e^{-a u} \right]_{0}^{\infty} = \pi \left( 0 - \left( -\frac{1}{a} \right) \right) = \frac{\pi}{a}$$

so that
$$I = \sqrt{\frac{\pi}{a}} = \sqrt{2 \pi \sigma^2}$$

and therefore
$$\frac{1}{\sqrt{2 \pi \sigma^2}} \cdot I = 1 \qquad \blacksquare$$

expectation of a normally distributed r.v.

assume $X \sim N(x \mid \mu, \sigma^2) \equiv f(x)$; we have
$$E[x] = \int_{-\infty}^{\infty} x\, N(x \mid \mu, \sigma^2)\, dx = \int_{-\infty}^{\infty} x\, f(x)\, dx = \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^{\infty} x\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}\, dx$$

we observe
$$\frac{d}{dx}\, \gamma\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2} = -\frac{x - \mu}{\sigma^2}\, \gamma\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$

in other words
$$\frac{d}{dx} f(x) = \frac{\mu}{\sigma^2} f(x) - \frac{x}{\sigma^2} f(x)$$

therefore
$$x\, f(x) = \mu\, f(x) - \sigma^2\, \frac{d}{dx} f(x)$$

and hence
$$E[x] = \int_{-\infty}^{\infty} x\, f(x)\, dx = \mu \int_{-\infty}^{\infty} f(x)\, dx - \sigma^2 \int_{-\infty}^{\infty} \frac{d}{dx} f(x)\, dx = \mu - \sigma^2 \Big[ f(x) \Big]_{-\infty}^{\infty} = \mu$$

variance of a normally distributed r.v.

assume $X \sim N(x \mid \mu, \sigma^2)$; we have
$$\mathrm{var}[x] = E\big[(x - \mu)^2\big] = \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}\, dx = \;\dots\; = \sigma^2$$

mean and variance of the normal distribution

(figures: normal densities $p(x)$ with $\mu \pm \sigma$ and $\mu \pm 3\sigma$ marked on the x-axis)

$$\int_{\mu - \sigma}^{\mu + \sigma} f(x)\, dx \approx 0.683 \qquad \int_{\mu - 3\sigma}^{\mu + 3\sigma} f(x)\, dx \approx 0.997$$

the farther away from the mean we look, the less likely we will find something
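
A short empirical check of these two interval probabilities (not from the slides; numpy assumed):

```python
import numpy as np

mu, sigma = 0.0, 1.0
rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, size=1_000_000)

print(np.mean(np.abs(x - mu) <= sigma))      # ≈ 0.683
print(np.mean(np.abs(x - mu) <= 3 * sigma))  # ≈ 0.997
```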

final observations

the Bernoulli distribution is a special case of the binomial distribution: $f_{\mathrm{Ber}}(x \mid q) = f_{\mathrm{Bin}}(x \mid n = 1, q)$

if $n$, $nq$, and $n(1 - q)$ are all large, the binomial distribution is well approximated by a normal distribution: $f_{\mathrm{Bin}}(x \mid n, q) \approx N\big(x \mid nq,\, nq(1 - q)\big)$
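
A sketch of this approximation (not from the slides; scipy.stats is assumed to be available) comparing the binomial pmf with the matching normal density:

```python
import numpy as np
from scipy.stats import binom, norm

n, q = 100, 0.4
k = np.arange(n + 1)

pmf = binom.pmf(k, n, q)                                          # exact binomial probabilities
approx = norm.pdf(k, loc=n * q, scale=np.sqrt(n * q * (1 - q)))   # matching normal density

print(np.abs(pmf - approx).max())   # small compared to the peak value of ≈ 0.08
```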

pathologies

Pareto distribution

a power law distribution important in sociology / economics; for all $x \geq x_0$
$$f(x \mid x_0, \alpha) = \frac{\alpha\, x_0^{\alpha}}{x^{\alpha + 1}}$$

Vilfredo F. D. Pareto (∗1848, †1923)

expectation / variance of a Pareto r.v.

assume $X \sim f(x \mid 1, 2)$; we have
$$E[x] = 2 \int_{1}^{\infty} \frac{x}{x^3}\, dx = 2$$
$$\mathrm{var}[x] = 2 \int_{1}^{\infty} \frac{(x - 2)^2}{x^3}\, dx = 2 \left[ \frac{4x - 2}{x^2} + \log(x) \right]_{1}^{\infty}$$
$$\Rightarrow \mathrm{var}[x] \text{ does not exist}$$
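
To see the missing variance numerically (not from the slides; numpy assumed), one can watch the running sample variance of Pareto($x_0 = 1$, $\alpha = 2$) draws fail to settle as the sample grows, while the running mean does approach 2:

```python
import numpy as np

rng = np.random.default_rng(4)
# Pareto(x0=1, alpha=2) via inverse transform sampling: x = (1 - u)^(-1/alpha)
u = rng.uniform(size=10_000_000)
x = (1.0 - u) ** (-1.0 / 2.0)

for n in (10**3, 10**5, 10**7):
    # running mean ≈ 2; running variance typically keeps growing with n
    print(n, x[:n].mean(), x[:n].var())
```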

Cauchy / Lorentz distribution

a rather pathological distribution
$$f(x \mid x_0, \gamma) = \frac{1}{\pi \gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^2 \right]}$$

Augustin L. Cauchy (∗1789, †1857)

standard Cauchy distribution

distribution of the ratio of two standard normal variables; for $x_0 = 0$, $\gamma = 1$, we have
$$f(x) = \frac{1}{\pi \left( 1 + x^2 \right)}$$

observe
$$\int_{-\infty}^{\infty} \frac{1}{1 + x^2}\, dx = \Big[ \arctan(x) \Big]_{-\infty}^{\infty} = \pi$$

expectation of a Cauchy r.v.

assume $X \sim f(x)$; we have
$$E[x] = \int_{-\infty}^{\infty} \frac{x}{\pi \left( 1 + x^2 \right)}\, dx$$

we observe
$$\int \frac{x}{\pi \left( 1 + x^2 \right)}\, dx = \frac{1}{2\pi} \log\left( x^2 + 1 \right)$$

$$\Rightarrow E[x] \text{ does not exist}$$
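
The missing expectation also shows up empirically (not from the slides; numpy assumed): the running mean of standard Cauchy samples never settles, no matter how many draws are used:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_cauchy(size=10_000_000)

for n in (10**3, 10**5, 10**7):
    print(n, x[:n].mean())   # running means jump around instead of converging
```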

fun with Gaussians

recall

uni-variate Gaussian distribution: for $x, \mu \in \mathbb{R}$ and $0 < \sigma^2 \in \mathbb{R}$
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}}\; e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$

multi-variate Gaussian distribution: for $\mathbf{x}, \boldsymbol{\mu} \in \mathbb{R}^m$ and non-singular $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times m}$
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^m \det \boldsymbol{\Sigma}}}\; e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$

examples

(figures: uni-variate $N(x)$, $x \in \mathbb{R}$, and bi-variate $N(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^2$)

affine transformations of multivariate Gaussian r.v.s

suppose $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and let $\mathbf{z} = \mathbf{A}\mathbf{x} + \mathbf{b}$; then
$$\mathbf{z} \sim N\big(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T\big)$$
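
A simulation sketch of this property (not from the slides; numpy assumed): transform multivariate normal samples and compare the empirical mean and covariance of z with Aµ + b and AΣAᵀ:

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([4.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
z = x @ A.T + b                     # z = A x + b applied to every sample

print(z.mean(axis=0), A @ mu + b)   # empirical mean vs. A µ + b
print(np.cov(z.T, bias=True))       # empirical covariance ≈ A Σ Aᵀ
print(A @ Sigma @ A.T)
```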



first important corollary

if $\mathbf{x} = (x_1, x_2, x_3, \dots, x_m)^T \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any subset of the $x_i$ has a marginal distribution that is also multivariate Gaussian, because subsets may be selected using projection matrices such as
$$\mathbf{A} = \begin{pmatrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 \end{pmatrix}$$
which selects $(x_1, x_3)^T$

second important corollary

if $\mathbf{x} = (x_1, x_2, x_3, \dots, x_m)^T \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any projection of $\mathbf{x}$ onto a one-dimensional subspace produces a uni-variate Gaussian, since such projections can be achieved using
$$\mathbf{A} = (w_1, w_2, w_3, \dots, w_m) = \mathbf{w}^T$$
so that
$$z = \mathbf{w}^T \mathbf{x} \sim N\big(\mathbf{w}^T \boldsymbol{\mu},\; \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}\big)$$

convolution of two uni-variate Gaussians

$$N_a\big(x \mid \mu_a, \sigma_a^2\big) * N_b\big(x \mid \mu_b, \sigma_b^2\big) = \int_{-\infty}^{\infty} N_a(x - t)\, N_b(t)\, dt = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$$

where
$$\mu = \mu_a + \mu_b \qquad \sigma^2 = \sigma_a^2 + \sigma_b^2$$
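
Since convolving densities corresponds to adding independent random variables, this result can be checked by simulation (not from the slides; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
mu_a, s_a = 1.0, 0.5
mu_b, s_b = -2.0, 1.5

xa = rng.normal(mu_a, s_a, size=1_000_000)
xb = rng.normal(mu_b, s_b, size=1_000_000)
z = xa + xb          # sum of independent Gaussians ⇔ convolution of their densities

print(z.mean())      # ≈ mu_a + mu_b = -1.0
print(z.var())       # ≈ s_a**2 + s_b**2 = 2.5
```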





convolution of two multi-variate Gaussians

$$N_a\big(\mathbf{x} \mid \boldsymbol{\mu}_a, \boldsymbol{\Sigma}_a\big) * N_b\big(\mathbf{x} \mid \boldsymbol{\mu}_b, \boldsymbol{\Sigma}_b\big) = \frac{1}{\sqrt{(2 \pi)^m \det \boldsymbol{\Sigma}}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

where
$$\boldsymbol{\mu} = \boldsymbol{\mu}_a + \boldsymbol{\mu}_b \qquad \boldsymbol{\Sigma} = \boldsymbol{\Sigma}_a + \boldsymbol{\Sigma}_b$$

uni-variate Gaussian product theorem

$$N_a\big(x \mid \mu_a, \sigma_a^2\big) \cdot N_b\big(x \mid \mu_b, \sigma_b^2\big) = \frac{1}{2 \pi\, \sigma_a \sigma_b} \exp\left( -\frac{(x - \mu_a)^2}{2 \sigma_a^2} - \frac{(x - \mu_b)^2}{2 \sigma_b^2} \right) = \frac{\gamma}{\sqrt{2 \pi s^2}} \exp\left( -\frac{(x - m)^2}{2 s^2} \right)$$

where
$$m = \frac{\mu_a \sigma_b^2 + \mu_b \sigma_a^2}{\sigma_a^2 + \sigma_b^2} \qquad s^2 = \frac{\sigma_a^2 \sigma_b^2}{\sigma_a^2 + \sigma_b^2} \qquad \gamma = \frac{\exp\left( -\frac{(\mu_a - \mu_b)^2}{2 (\sigma_a^2 + \sigma_b^2)} \right)}{\sqrt{2 \pi \left( \sigma_a^2 + \sigma_b^2 \right)}}$$
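
A pointwise numerical check of the uni-variate product theorem (not from the slides; numpy assumed), comparing the left- and right-hand sides on a grid:

```python
import numpy as np

def gauss(x, mu, var):
    """Uni-variate normal density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_a, va = 1.0, 0.5 ** 2
mu_b, vb = -1.0, 1.5 ** 2

# m, s^2, gamma as defined in the product theorem
m = (mu_a * vb + mu_b * va) / (va + vb)
s2 = va * vb / (va + vb)
gamma = np.exp(-(mu_a - mu_b) ** 2 / (2 * (va + vb))) / np.sqrt(2 * np.pi * (va + vb))

x = np.linspace(-5, 5, 1001)
lhs = gauss(x, mu_a, va) * gauss(x, mu_b, vb)
rhs = gamma * gauss(x, m, s2)

print(np.max(np.abs(lhs - rhs)))   # ≈ 0 up to floating-point error
```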

multi-variate Gaussian product theorem

$$N_a\big(\mathbf{x} \mid \boldsymbol{\mu}_a, \boldsymbol{\Sigma}_a\big) \cdot N_b\big(\mathbf{x} \mid \boldsymbol{\mu}_b, \boldsymbol{\Sigma}_b\big) = \frac{\sqrt{\det \mathbf{M}}}{\sqrt{(2 \pi)^m \det \boldsymbol{\Sigma}_a \det \boldsymbol{\Sigma}_b}} \cdot \gamma \cdot \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{m})^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m}) \right)$$

where
$$\mathbf{S} = \left( \boldsymbol{\Sigma}_a^{-1} + \boldsymbol{\Sigma}_b^{-1} \right)^{-1} \qquad \mathbf{M} = \boldsymbol{\Sigma}_a + \boldsymbol{\Sigma}_b \qquad \mathbf{m} = \mathbf{S} \left( \boldsymbol{\Sigma}_a^{-1} \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_b^{-1} \boldsymbol{\mu}_b \right)$$
$$\gamma = \frac{\exp\left( -\frac{1}{2} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_b)^T \mathbf{M}^{-1} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_b) \right)}{\sqrt{(2 \pi)^m \det \mathbf{M}}}$$

since $\det \mathbf{S}\, \det \mathbf{M} = \det \boldsymbol{\Sigma}_a\, \det \boldsymbol{\Sigma}_b$, the prefactor can equivalently be written as $\frac{1}{\sqrt{(2\pi)^m \det \mathbf{S}}}$, i.e. the product is $\gamma \cdot N(\mathbf{x} \mid \mathbf{m}, \mathbf{S})$

summary

we now know about

- several common probability distributions
- how to compute their means and variances
- the fact that means and variances may be undefined
- several convenient properties of the Gaussian distribution