Predictive Modeling: Logistic Regression


Logistic Regression
• Consider the Multiple Linear Regression Model:

    yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi

where the response variable y is dichotomous, taking on only one of two values:
    y = 1 if a success
    y = 0 if a failure


Logistic Regression
• The response variable, y, is really just a Bernoulli trial, with E(y) = π, where π = probability of a success on any given trial.
• π can only take on values between 0 and 1.
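Since y takes the value 1 with probability π and the value 0 with probability 1 − π,

    E(y) = 1·π + 0·(1 − π) = π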

Logistic Regression
• Thus, the Multiple Regression Model

    π = µ_y|x = β0 + β1xi1 + β2xi2 + … + βkxik

is not appropriate for a dichotomous response variable, since this model assumes π can take on any value, but in fact it can only take on values between 0 and 1.


Logistic Regression
• When the response variable is dichotomous, a more appropriate linear model is the Logistic Regression Model:

    logit(π) = log( π / (1 − π) ) = α + β1xi1 + β2xi2 + … + βkxik

Logistic Regression
• The ratio

    π / (1 − π)

is called the odds, and is a function of the probability π.
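For example, if π = 0.75, the odds of a success are

    0.75 / (1 − 0.75) = 0.75 / 0.25 = 3

that is, a success is three times as likely as a failure.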


Logistic Regression
• Consider the model with one predictor (k = 1). Exponentiating both sides of the logit equation gives the odds:

    Logit:   log( π / (1 − π) ) = α + βx

    Odds:    π / (1 − π) = e^(α + βx) = e^α (e^β)^x

Logistic Regression
• Odds:

    π / (1 − π) = e^α (e^β)^x

• Every one-unit increase in x increases the odds by a factor of e^β.


Logistic Regression

    x = 1:   π / (1 − π) = e^α (e^β)
    x = 2:   π / (1 − π) = e^α (e^β)(e^β) = e^α (e^β)^2
    x = 3:   π / (1 − π) = e^α (e^β)(e^β)(e^β) = e^α (e^β)^3

Logistic Regression
• Thus, e^β is the odds ratio, comparing the odds at x + 1 with the odds at x.
• An odds ratio equal to 1 (i.e., e^β = 1) occurs when β = 0, which describes the situation where the predictor x has no association with the response y.
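In the coronary heart disease example that follows, the estimated slope is b = 0.55812, so the estimated odds ratio is

    e^0.55812 = 1.74738

that is, each one-unit increase in age group multiplies the estimated odds of CHD by about 1.75.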


Logistic Regression
• As with regular linear regression, we obtain a sample of n observations, with each observation measured on all k predictor variables and on the response variable.
• We use these sample data to fit our model and estimate the parameters.

Logistic Regression
• Using the sample data, we obtain the model (for a single predictor):

    log( p / (1 − p) ) = a + bx


Logistic Regression
• The estimate of π is

    p = e^(a + bx) / ( 1 + e^(a + bx) )
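As a small sketch of this back-transformation in SAS, the DATA step below plugs in the intercept and slope reported for the CHD example that follows (a = -2.8774, b = 0.5581); the data set and variable names are illustrative only.

    data pred_prob;
       a = -2.8774;                   /* estimated intercept from the CHD example */
       b = 0.5581;                    /* estimated slope from the CHD example     */
       do AgeGroup = 1 to 8;
          logit = a + b*AgeGroup;     /* fitted log-odds, a + bx                  */
          odds  = exp(logit);         /* fitted odds, e^(a+bx)                    */
          p     = odds / (1 + odds);  /* estimated probability of CHD             */
          output;
       end;
    run;

    proc print data=pred_prob; run;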

Example: Coronary Heart Disease

Age   Age Group (x)   CHD Present    N      p      Odds e^(a+bx)   Log(Odds) a+bx   Predicted π-hat
 25         1              1         10    0.10       0.09834         -2.31929          0.08954
 30         2              2         15    0.13       0.17184         -1.76117          0.14664
 35         3              3         12    0.25       0.30028         -1.20305          0.23093
 40         4              5         15    0.33       0.52470         -0.64493          0.34413
 45         5              6         13    0.46       0.91685         -0.08681          0.47831
 50         6              5          8    0.63       1.60209          0.47131          0.61569
 55         7             13         17    0.76       2.79947          1.02943          0.73681
 60         8              8         10    0.80       4.89175          1.58755          0.83027
Total                               100

Here p is the observed proportion with CHD (CHD Present / N); the Odds and Log(Odds) columns are the fitted values e^(a+bx) and a + bx; and Predicted π-hat = e^(a+bx) / (1 + e^(a+bx)).

b = 0.55812 (additive effect on the log-odds per one-unit increase in x)
Exp(b) = 1.74738 (multiplicative effect on the odds per one-unit increase in x)
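As a check on the first row of this table, using the fitted equation log(p/(1-p)) = -2.8774 + 0.5581·x shown in the plot below:

    a + b(1)        = -2.8774 + 0.5581(1) = -2.3193
    fitted odds     = e^(-2.3193) = 0.0983
    predicted π-hat = 0.0983 / (1 + 0.0983) = 0.0895

which matches the -2.31929, 0.09834, and 0.08954 entries for age group 1, up to rounding.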


Example: Coronary Heart Disease

[Figure: Probability of Coronary Heart Disease vs. Age Group (1-8). The plot shows the observed proportions and the predicted probabilities from the fitted model log(p/(1-p)) = -2.8774 + 0.5581·x.]

Parameter Estimation
• A 100(1 − α)% Confidence Interval for β is:

    β̂ ± z_(α/2) × a.s.e.(β̂)

where z_(α/2) is a critical value from the standard normal distribution, and a.s.e. stands for the asymptotic standard error.


Parameter Estimation
• A 100(1 − α)% Confidence Interval for the odds ratio e^β is:

    e^( β̂ ± z_(α/2) × a.s.e.(β̂) )
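A minimal DATA-step sketch of both intervals, using the slope estimate b = 0.55812 from the CHD example and a purely hypothetical asymptotic standard error (the value of ase below is illustrative, not one reported in these notes):

    data ci;
       b   = 0.55812;               /* estimated slope from the CHD example       */
       ase = 0.12;                  /* hypothetical asymptotic standard error     */
       z   = probit(0.975);         /* z_(alpha/2) for a 95% interval, about 1.96 */
       beta_low  = b - z*ase;       /* lower limit for beta                       */
       beta_high = b + z*ase;       /* upper limit for beta                       */
       or_low    = exp(beta_low);   /* lower limit for the odds ratio e^beta      */
       or_high   = exp(beta_high);  /* upper limit for the odds ratio e^beta      */
    run;

    proc print data=ci; run;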

Hypothesis Testing

    H0: β = 0
    H1: β ≠ 0

• The test statistics for testing this hypothesis are χ² statistics. There are 3 that are commonly used:


Hypothesis Testing

    Wald Test:               Q_W  = ( β̂ / a.s.e.(β̂) )²   ~  χ²_1

    Likelihood Ratio Test:   Q_LR = −2( L_α − L_(α,β) )   ~  χ²_1

    Score Test:              Q_S  ~  χ²_1

where L_α is the maximized log-likelihood for the intercept-only model and L_(α,β) is the maximized log-likelihood for the model that includes the predictor.
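As a sketch of computing the Wald statistic by hand (the standard error value below is hypothetical; PROC LOGISTIC reports all three tests directly, as shown in the output at the end of these notes):

    data wald;
       b   = 0.55812;                /* estimated slope from the CHD example   */
       ase = 0.12;                   /* hypothetical asymptotic standard error */
       qw  = (b / ase)**2;           /* Wald chi-square statistic              */
       p_value = 1 - probchi(qw, 1); /* upper-tail p-value on 1 df             */
    run;

    proc print data=wald; run;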

Goodness of Fit
• Let m = # of levels of x (m = 8 for the CHD example)
• Let n_i = number of observations in the ith level of x
• Let k = number of parameters (k = 2 for the CHD example)

    X²_Pearson  = Σ_(i=1..m) n_i ( p_i − π̂_i )² / π̂_i     ~  χ²_(m−k)

    X²_Deviance = 2 Σ_(i=1..m) n_i p_i log( p_i / π̂_i )   ~  χ²_(m−k)



Logistic Regression Example
Coronary Heart Disease vs. Age (first 25 observations, out of 100)

Obs   Age   Age Group   CHD
  1    25       1        1
  2    25       1        0
  3    25       1        0
  4    25       1        0
  5    25       1        0
  6    25       1        0
  7    25       1        0
  8    25       1        0
  9    25       1        0
 10    25       1        0
 11    30       2        1
 12    30       2        1
 13    30       2        0
 14    30       2        0
 15    30       2        0
 16    30       2        0
 17    30       2        0
 18    30       2        0
 19    30       2        0
 20    30       2        0
 21    30       2        0
 22    30       2        0
 23    30       2        0
 24    30       2        0
 25    30       2        0


SAS Code: Proc Logistic

Proc Logistic Data=CHD100 Descending;   /* DESCENDING models Pr(CHD = 1) rather than Pr(CHD = 0) */
   Model CHD = AgeGroup;
Run;
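If Wald confidence limits for β and for the odds ratio e^β are also wanted, one option (a sketch using the same data set and variables) is to add the CLPARM= and CLODDS= options to the MODEL statement:

    Proc Logistic Data=CHD100 Descending;
       Model CHD = AgeGroup / ClParm=Wald ClOdds=Wald;   /* Wald confidence limits for beta and e^beta */
    Run;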

Proc Logistic Output: CHD Example

The LOGISTIC Procedure

Testing Global Null Hypothesis: BETA=0

Test                    Chi-Square    DF    Pr > ChiSq
Likelihood Ratio           28.4851     1        <.0001
Score                      26.0782     1        <.0001
Wald                       21.4281     1        <.0001