Generalized Linear Models

Generalized Linear Models Poisson Regression

Ulrich Halekoh, Biometry Research Unit, Danish Institute of Agricultural Sciences

March 20, 2003 Poisson Regression


Poisson distributed data

The Poisson distribution is used to describe COUNTS.

In contrast to binomial data,
• there is (in principle) no limit on the number of events, and the maximal number of possible events is not known in advance;
• each observational unit (person, animal, plant) may contribute several times to the counts.



Poisson variable: examples

• counts in time
  - accidents that happen at different times
  - registration of particles from radioactive decay
  - number of deaths from some illness in different years or different populations
  - mutations in a population of animals
• counts in space
  - infectious organisms spread on an agar plate
  - blood cells in a blood sample
  - infected trees in a wood
  - raisins in a dough



The Poisson distribution

A Poisson distributed random variable $N$ with parameter $\lambda$ takes values in the non-negative integers $0, 1, 2, \ldots$:

$$N \sim \mathrm{Poisson}(\lambda)$$

The expectation is $E(N) = \lambda$ and the variance is $\mathrm{Var}(N) = \lambda$.

For a Poisson-distributed variable, expectation and variance are equal!
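The equality of expectation and variance can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are illustrative, and the infinite sums are truncated at an assumed cut-off `n_max`, beyond which the probabilities are negligible for the two values of λ used here.

```python
import math


def poisson_pmf(n: int, lam: float) -> float:
    """P(N = n) for N ~ Poisson(lam), evaluated on the log scale for stability."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def mean_and_variance(lam: float, n_max: int = 200):
    """Approximate E(N) and Var(N) by summing over n = 0, ..., n_max (assumed cut-off)."""
    probs = [poisson_pmf(n, lam) for n in range(n_max + 1)]
    mean = sum(n * p for n, p in enumerate(probs))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))
    return mean, var


for lam in (3.0, 10.0):
    mean, var = mean_and_variance(lam)
    print(f"lambda = {lam}: mean = {mean:.4f}, variance = {var:.4f}")
```

Up to truncation and rounding error, both printed values agree with λ in each case.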



[Figure: probabilities of the Poisson distribution for λ = 3 and λ = 10. x-axis: count (0 to 20); y-axis: probability.]


Example: Counting of organisms

Example (Blæsild, Granfeld (1995)): density of nematodes.

In samples of 40µl or 20µl, the number of nematodes is counted. There are 15 samples for each volume.

The counts of the nematodes in the samples:

volume   number of nematodes
40µl     21 27 28 28 28 31 32 33 37 38 39 39 39 41 45
20µl      9 12 13 13 14 14 14 15 16 18 20 21 21 24 24


The question of the experiment is:
• How many nematodes are there per unit volume (e.g. per 1 µl)?

Relating counts to some entity within which the counts are observed is called a RATE (we will use the abbreviation $\rho$).

A rate is counts per
• volume, e.g. ml
• time, e.g. second
• area, e.g. cm²


In the experiment it is obvious that
• different: the mean count of nematodes differs between the 40µl and the 20µl samples;
• equal: the expected number of nematodes per unit volume is the same, because the same source solution was used.



The number of observed nematodes can be described by a Poisson distribution, with different means for the two volumes.

Random part specification:

$$N_{S,i} \sim \mathrm{Poisson}(\lambda_{S,i}), \qquad S = 20\,\mu\mathrm{l},\ 40\,\mu\mathrm{l}, \quad i = 1, \ldots, 15$$

Systematic part specification: a separate mean for each volume group,

$$\lambda_{S,i} = \lambda_S \quad \text{for all } i$$


The rate $\rho$ and the mean number of counts in a volume $v$ are related via

$$\rho = \frac{\text{average number of nematodes in a volume } v}{v} = \frac{\lambda_S}{v_S}$$

Taking logarithms on both sides, we get

$$\log(\lambda_S) = \log(v_S) + \log(\rho)$$

So after applying the log function as a link function to the expected number of counts $\lambda_S$, we have a linear model with
• $\log(\rho)$: the intercept to estimate;
• $\log(v_S)$: an OFFSET, because it is known in advance and no coefficient needs to be estimated for that 'covariate'.



Nematodes: SAS program

    /* the offset has to be a variable: compute logv = log(v) in a preceding data step */
    proc genmod;
      model count = / dist=poisson link=log offset=logv lrci;
    run;

The intercept is estimated as $\widehat{\log(\rho)} = -0.18$ with 95% CI $[-0.25, -0.11]$.

Hence $\hat{\rho} = \exp(-0.18) = 0.83$, CI: $[0.78, 0.90]$, i.e. there are on average 0.83 nematodes per 1µl.
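For this intercept-only model with a log-volume offset, the maximum likelihood estimate of the rate has a closed form: total count divided by total volume. The sketch below recomputes it from the nematode counts given earlier. It is an illustration rather than a reproduction of the SAS run: the interval is a Wald interval (standard error of log ρ taken as 1/sqrt(total count)), whereas `lrci` requests likelihood-ratio intervals, so small differences are to be expected. All variable names are illustrative.

```python
import math

# Counts from the nematode example (15 samples per volume).
counts_40 = [21, 27, 28, 28, 28, 31, 32, 33, 37, 38, 39, 39, 39, 41, 45]
counts_20 = [9, 12, 13, 13, 14, 14, 14, 15, 16, 18, 20, 21, 21, 24, 24]

total_count = sum(counts_40) + sum(counts_20)
total_volume = 40 * len(counts_40) + 20 * len(counts_20)  # in microlitres

# MLE of the rate in the model log(lambda) = log(v) + log(rho).
rate = total_count / total_volume
log_rate = math.log(rate)

# Wald interval on the log scale (assumption: normal approximation).
se_log_rate = 1.0 / math.sqrt(total_count)
ci_log = (log_rate - 1.96 * se_log_rate, log_rate + 1.96 * se_log_rate)
ci_rate = (math.exp(ci_log[0]), math.exp(ci_log[1]))

print(f"log(rho) = {log_rate:.2f}, CI = [{ci_log[0]:.2f}, {ci_log[1]:.2f}]")
print(f"rho = {rate:.3f}, CI = [{ci_rate[0]:.2f}, {ci_rate[1]:.2f}]")
```

The log-rate and both intervals agree with the reported values to two decimals; the unrounded rate is about 0.838 nematodes per µl.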



Poisson regression

Relating count measurements to covariates is called POISSON REGRESSION.

Random part of the model:

$$N_j \sim \mathrm{Poisson}(\lambda(x_j))$$

Systematic part of the model (possibly after applying the link function $h$ to $\lambda(x_j)$):

$$h(\lambda(x_j)) = u_j + \sum_{k=1}^{q} \beta_k x_{k,j}$$


One common choice for $h$ is the logarithm.

The term $u_j$ denotes an OFFSET variable, which is included in the model but for which no parameter is estimated (NOTE: there is no $\beta$-parameter attached to $u_j$).

The offset is especially useful with the log link function, using $\log(u_j)$ as the offset, because the model may then be interpreted as modeling the log-rate of counts per unit entity of $u$:

$$\log \lambda(x_j) - \log u_j = \log \frac{\lambda(x_j)}{u_j} = \sum_{k=1}^{q} \beta_k x_{k,j}$$

$\exp(\beta_k)$ is then the factor by which the rate changes when the covariate $x_k$ is increased by 1.
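The interpretation of exp(β_k) as a multiplicative change in the rate can be verified directly. In the sketch below the coefficients and covariate values are hypothetical numbers chosen only for illustration; they do not come from any data set in these slides.

```python
import math


def rate(x, beta):
    """Rate lambda(x)/u under the log link: exp(sum_k beta_k * x_k)."""
    return math.exp(sum(b * xk for b, xk in zip(beta, x)))


beta = [0.4, -1.2]           # hypothetical coefficients
x = [2.0, 0.5]               # hypothetical covariate values
x_plus = [x[0] + 1.0, x[1]]  # first covariate increased by 1, second unchanged

ratio = rate(x_plus, beta) / rate(x, beta)
print(ratio, math.exp(beta[0]))  # the two numbers coincide
```

The ratio does not depend on the starting values in `x`, which is why exp(β_k) can be reported as a single rate ratio per covariate.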



Example: counts of mites

For two varieties of apple trees, the number of mites on leaves was counted. 50 leaves from different trees were examined for each variety, and the following were recorded:
• number of mites $N_{r,i}$, $r = A, B$, $i = 1, \ldots, 50$
• variety $V_r$, $r = A, B$
• size of leaf $S_{r,i}$ in cm²

The size $S$ enters the model as an offset variable (via $\log S$).


A model for the counts would be

Random part:

$$N_{r,i} \sim \mathrm{Poisson}(\lambda_{r,i})$$

Systematic part:

$$\log(\lambda_{r,i}) = \log(S_{r,i}) + \alpha_r, \qquad r = A, B, \quad i = 1, \ldots, 50$$


SAS program for mites

    ods output estimates=est;
    proc genmod;
      class variety;
      model event = variety / dist=poisson link=log offset=logsize;
      /* calculating the rates */
      estimate 'log-rate variety A' intercept 1 variety 1 0;
      estimate 'log-rate variety B' intercept 1 variety 0 1;
    run;

    /* back-transform the log-rates and their confidence limits */
    data est;
      set est;
      keep label rate low upp;
      rate = exp(estimate);
      upp = exp(uppercl);
      low = exp(lowercl);
    run;


There is a significant effect of the variety:

Analysis of deviance

Source      Deviance   DF   Chi-Square   Pr > ChiSq
Intercept   467.7857
variety     119.4826    1       348.30
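The chi-square statistic in the analysis of deviance is the drop in deviance when variety is added to the intercept-only model. The sketch below recomputes it from the two deviances shown and evaluates the upper tail probability of a chi-square distribution with 1 degree of freedom. The tail formula via `math.erfc` is a standard identity for 1 degree of freedom; the variable names are illustrative.

```python
import math

deviance_intercept_only = 467.7857
deviance_with_variety = 119.4826

# Likelihood-ratio (deviance difference) statistic for adding variety, 1 DF.
chi_square = deviance_intercept_only - deviance_with_variety

# For X ~ chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(f"chi-square = {chi_square:.2f}")
print(f"p-value = {p_value:.3g}")
```

The resulting tail probability is far below any conventional significance level, in line with the stated significant effect of variety.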

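Parameters of the Gamma distribution:
• a > 0: shape parameter
• s > 0: scale parameter

[Figure: gamma densities for several parameter choices, among them a = 3, s = 0.3 and a = 20, s = 0.3. x-axis: x (0 to 10); y-axis: density, labelled dgamma(x, shape = 0.5, scale = 1).]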
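To get a feeling for the shape parameter a and the scale parameter s, the gamma density can be evaluated directly. The sketch below assumes the usual shape/scale parameterization (the one used by the `dgamma(x, shape, scale)` label in the figure), for which the mean is a·s and the variance is a·s². These two facts are standard results rather than statements from the slides, and the integration limits and step count are assumptions chosen to suit the two parameter pairs shown.

```python
import math


def gamma_pdf(x: float, a: float, s: float) -> float:
    """Density of Gamma(shape=a, scale=s) at x, evaluated on the log scale."""
    if x <= 0.0:
        return 0.0
    log_pdf = (a - 1.0) * math.log(x) - x / s - math.lgamma(a) - a * math.log(s)
    return math.exp(log_pdf)


def mean_and_variance(a: float, s: float, upper: float = 50.0, n: int = 100_000):
    """Midpoint-rule approximation of mean and variance on (0, upper)."""
    h = upper / n
    mean = 0.0
    second_moment = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        weight = gamma_pdf(x, a, s) * h
        mean += x * weight
        second_moment += x * x * weight
    return mean, second_moment - mean ** 2


for a, s in ((3.0, 0.3), (20.0, 0.3)):
    mean, var = mean_and_variance(a, s)
    print(f"a = {a}, s = {s}: mean = {mean:.4f} (a*s = {a * s:.4f}), "
          f"variance = {var:.4f} (a*s^2 = {a * s * s:.4f})")
```

With the scale held at 0.3, increasing the shape from 3 to 20 moves the density to the right and makes it more symmetric.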


Example: Mean clotting time of blood

Mean clotting time in seconds ($T$) of blood for nine percentage concentrations of normal plasma ($C$) and two lots of clotting agent (Hurn et al., 1945).

           1    2    3    4    5    6    7    8    9
C          5   10   15   20   30   40   60   80  100
T:lot1   118   58   42   35   27   25   21   19   18
T:lot2    69   35   26   21   18   16   13   12   12

[Figure: four scatter plots of the clotting data for lot1 and lot2. Time~C: clotting time against concentration. Time~1/C: clotting time against 1/(concentration). 1/Time~C: 1/(clotting time) against concentration. 1/Time~log(C): 1/(clotting time) against log(concentration).]


A plot of the clotting time against the concentration reveals a HYPERBOLIC relationship. Either
• $T$ versus $1/C$, or
• $1/T$ versus $\log(C)$
gives a linear relation. We proceed with the latter.
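How close to linear the relation between 1/T and log(C) is can be gauged from the correlation coefficient. The sketch below uses the clotting data from the table above; the `correlation` helper is an illustrative Pearson correlation written out by hand, and the raw correlation of T with C is included only for contrast.

```python
import math

conc = [5, 10, 15, 20, 30, 40, 60, 80, 100]
times = {
    "lot1": [118, 58, 42, 35, 27, 25, 21, 19, 18],
    "lot2": [69, 35, 26, 21, 18, 16, 13, 12, 12],
}


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


log_conc = [math.log(c) for c in conc]
raw_corr = {}
transformed_corr = {}
for lot, t in times.items():
    raw_corr[lot] = correlation(conc, t)                               # T against C
    transformed_corr[lot] = correlation(log_conc, [1.0 / v for v in t])  # 1/T against log(C)
    print(f"{lot}: corr(T, C) = {raw_corr[lot]:.3f}, "
          f"corr(1/T, log C) = {transformed_corr[lot]:.3f}")
```

For both lots the correlation on the transformed scale is very close to 1, whereas the untransformed relation is clearly not linear.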



Random part of the model:

$$T_{lot,i} \sim \mathrm{Gamma}(a, s_{lot,i}), \qquad lot = \text{lot1}, \text{lot2}, \quad i = 1, \ldots, 9$$

Systematic part of the model:

$$\frac{1}{E(T_{lot,i})} = \alpha_{lot} + \beta_{lot} \log(C_{lot,i})$$

NOTE: The link function is the inverse function $h(x) = \frac{1}{x}$.

The link function is applied to the EXPECTATION of the clotting time $T$, NOT to the time itself!



The SAS program for the analysis is:

    proc genmod;
      class lot;
      model T = lot logC lot*logC / dist=gamma link=pow(-1) scale=P;
    run;

The inverse link function is specified via link=pow(-1).



The analysis of deviance is

LR Statistics For Type 1 Analysis

Source      Deviance   Num DF   Den DF   F Value   Pr > F
Intercept     7.7087
lot           6.6314        1       14    512.97
logC          0.3004        1       14   3014.59
logC*lot      0.0294        1       14    129.05
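The F values can be related to the deviances. With an estimated scale (here `scale=P`), each F statistic is, to my understanding of PROC GENMOD's Type 1 analysis, the drop in deviance per numerator degree of freedom divided by the estimated dispersion. The sketch below works under that assumption and backs out the dispersion implied by each row of the table; the variable names are illustrative.

```python
# (source, deviance) in the order the terms enter the model.
deviances = [
    ("Intercept", 7.7087),
    ("lot", 6.6314),
    ("logC", 0.3004),
    ("logC*lot", 0.0294),
]
f_values = {"lot": 512.97, "logC": 3014.59, "logC*lot": 129.05}
num_df = 1

implied_dispersion = {}
for (_, previous_deviance), (source, deviance) in zip(deviances, deviances[1:]):
    drop = previous_deviance - deviance
    # Assumed relation: F = (drop / num_df) / dispersion.
    implied_dispersion[source] = (drop / num_df) / f_values[source]
    print(f"{source}: drop in deviance = {drop:.4f}, "
          f"implied dispersion = {implied_dispersion[source]:.5f}")
```

All three rows imply the same dispersion of about 0.0021, so the table is internally consistent under the assumed relation.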