Bootstrap Methods and their Application

Bootstrap Methods and their Application Anthony Davison c

2006 A short course based on the book ‘Bootstrap Methods and their Application’, by A. C. Davison and D. V. Hinkley c

Cambridge University Press, 1997

Anthony Davison: Bootstrap Methods and their Application,

1

Summary ◮ ◮

Bootstrap: simulation methods for frequentist inference. Useful when • standard assumptions invalid (n small, data not normal, . . .); • standard problem has non-standard twist; • complex problem has no (reliable) theory; • or (almost) anywhere else.

◮

Aim to describe • • • •

basic ideas; confidence intervals; tests; some approaches for regression


2

Motivation

Motivation

Basic notions

Confidence intervals

Several samples

Variance estimation

Tests

Regression


3

Motivation

AIDS data ◮ ◮ ◮

◮

◮

UK AIDS diagnoses 1988–1992. Reporting delay up to several years! Problem: predict state of epidemic at end 1992, with realistic statement of uncertainty. Simple model: number of reports in row j and column k Poisson, mean µjk = exp(αj + βk ). Unreported diagnoses in period j Poisson, mean X X µjk = exp(αj ) exp(βk ). k unobs

◮

k unobs

Estimate total unreported diagnoses from period j by replacing αj and βk by MLEs. • How reliable are these predictions? • How sensible is the Poisson model?


4

Motivation

Diagnosis period Year 1988

1989

1990

1991

1992

Quarter 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

Reporting-delay interval (quarters): 0† 31 26 31 36 32 15 34 38 31 32 49 44 41 56 53 63 71 95 76 67

1 80 99 95 77 92 92 104 101 124 132 107 153 137 124 175 135 161 178 181

2 16 27 35 20 32 14 29 34 47 36 51 41 29 39 35 24 48 39

3 9 9 13 26 10 27 31 18 24 10 17 16 33 14 17 23 25

4 3 8 18 11 12 22 18 9 11 9 15 11 7 12 13 12

5 2 11 4 3 19 21 8 15 15 7 8 6 11 7 11


6 8 3 6 8 12 12 6 6 8 6 9 5 6 10

··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···

≥14 6 3 3 2 2 1

Total reports to end of 1992 174 211 224 205 224 219 253 233 281 245 260 285 271 263 306 258 310 318 273 133

5

Motivation

AIDS data ◮

◮

300

+ ++ + ++ + + + + + + + + + + + + ++ + ++ + +++

200 0

100

Diagnoses

400

500

◮

Data (+), fits of simple model (solid), complex model (dots) Variance formulae could be derived — painful! but useful? Effects of overdispersion, complex model, . . .?

+ ++ + ++ + ++++ 1984

1986

1988

1990

1992

Time


6

Motivation

Goal

Find reliable assessment of uncertainty when ◮

estimator complex

◮

data complex

◮

sample size small

◮

model non-standard


7

Basic notions

Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


8

Basic notions

Handedness data Table: Data from a study of handedness; hand is an integer measure of handedness, and dnan a genetic measure. Data due to Dr Gordon Claridge, University of Oxford.

1 2 3 4 5 6 7 8 9 10

dnan 13 18 20 21 21 24 24 27 28 28

hand 1 1 3 1 1 1 1 1 1 2

11 12 13 14 15 16 17 18 19 20

dnan 28 28 28 28 28 28 29 29 29 29

hand 1 2 1 4 1 1 1 1 1 2

21 22 23 24 25 26 27 28 29 30

dnan 29 29 29 30 30 30 30 31 31 31


hand 2 1 1 1 1 2 1 1 1 1

31 32 33 34 35 36 37

dnan 31 31 33 33 34 41 44

hand 1 2 6 1 1 4 8

9

Basic notions

Handedness data Figure: Scatter plot of handedness data. The numbers show the multiplicities of the observations.

7

8

1

hand

5

6

1

4

1

3

1

2211

2 1

1

1

1 15

2 20

15534

2 25

11

30 dnan


35

40

45

10

Basic notions

Handedness data

◮

◮ ◮

Is there dependence between dnan and hand for these n = 37 individuals? Sample product-moment correlation coefficient is θb = 0.509. Standard confidence interval (based on bivariate normal population) gives 95% CI (0.221, 0.715).

◮

Data not bivariate normal!

◮

What is the status of the interval? Can we do better?


11

Basic notions

Frequentist inference ◮ ◮ ◮

Estimator θb for unknown parameter θ. iid

Statistical model: data y1 , . . . , yn ∼ F , unknown Handedness data • • •

◮

◮

y = (dnan, hand) F puts probability mass on subset of R2 θb is correlation coefficient

Key issue: what is variability of θb when samples are repeatedly taken from F ? Imagine F known — could answer question by • analytical (mathematical) calculation • simulation


12

Basic notions

Simulation with F known ◮

For r = 1, . . . , R: iid

◮

• generate random sample y1∗ , . . . , yn∗ ∼ F ; • compute θbr using y1∗ , . . . , yn∗ ;

Output after R iterations:

∗ θb1∗ , θb2∗ , . . . , θbR

◮

∗ to estimate sampling distribution of θ b Use θb1∗ , θb2∗ , . . . , θbR (histogram, density estimate, . . .)

◮

In practice R is finite, so some error remains

◮

If R → ∞, then get perfect match to theoretical calculation (if available): Monte Carlo error disappears completely


13

Basic notions

Handedness data: Fitted bivariate normal model Figure: Contours of bivariate normal distribution fitted to handedness data; parameter estimates are µ b1 = 28.5, µ b2 = 1.7, σ b1 = 5.4, σ b2 = 1.5, ρb = 0.509. The data are also shown. 10 0.020 1

8

0.015 1

hand

6

1

4

0.010

1

1 0.005

2211

2 1

1

2

15534 11

2

0

0.000 10

15

20

25

30

35

40

45

dnan


14

Basic notions

Handedness data: Parametric bootstrap samples

10 8

10 8

8

10

Figure: Left: original data, with jittered vertical values. Centre and right: two samples generated from the fitted bivariate normal distribution.

0

10

20

30 dnan

40

50

hand 0

2

4

hand 4 2 0

0

2

4

hand

6

Correlation 0.533

6

Correlation 0.753

6

Correlation 0.509

0

10

20

30 dnan


40

50

0

10

20

30

40

50

dnan

15

Basic notions

Handedness data: Correlation coefficient Figure: Bootstrap distributions with R = 10000. Left: simulation from fitted bivariate normal distribution. Right: simulation from the data by bootstrap resampling. The lines show the theoretical probability density function of the correlation coefficient under sampling from a fitted bivariate

3.5 3.0 Probability density 1.0 1.5 2.0 2.5 0.5 0.0

0.0

0.5

Probability density 1.0 1.5 2.0 2.5

3.0

3.5

normal distribution.

−0.5

0.0 0.5 Correlation coefficient

1.0

−0.5


0.0 0.5 Correlation coefficient

1.0

16

Basic notions

F unknown ◮

Replace unknown F by estimate Fb obtained

• parametrically — e.g. maximum likelihood or robust fit of distribution F (y) = F (y; ψ) (normal, exponential, bivariate normal, . . .) • nonparametrically — using empirical distribution function (EDF) of original data y1 , . . . , yn , which puts mass 1/n on each of the yj

◮

◮

Algorithm: For r = 1, . . . , R: iid • generate random sample y1∗ , . . . , yn∗ ∼ Fb ; ∗ ∗ • compute θbr using y1 , . . . , yn ;

Output after R iterations:

∗ θb1∗ , θb2∗ , . . . , θbR Anthony Davison: Bootstrap Methods and their Application,

17

Basic notions

Nonparametric bootstrap ◮

iid Bootstrap (re)sample y1∗ , . . . , yn∗ ∼ Fb, where Fb is EDF of y1 , . . . , yn

• Repetitions will occur!

◮ ◮

◮ ◮ ◮

Compute bootstrap θb∗ using y1∗ , . . . , yn∗ .

For handedness data take n = 37 pairs y ∗ = (dnan, hand)∗ with equal probabilities 1/37 and replacement from original pairs (dnan, hand) Repeat this R times, to get θb∗ , . . . , θb∗ 1

R

See picture

Results quite different from parametric simulation — why?


18

Basic notions

Handedness data: Bootstrap samples

10

15

20

25 30 dnan

35

40

45

6 1

2

3

hand 4 5

6 hand 4 5 3 2 1

1

2

3

hand 4 5

6

7

Correlation 0.491

7

Correlation 0.733

7

Correlation 0.509

8

8

8

Figure: Left: original data, with jittered vertical values. Centre and right: two bootstrap samples, with jittered vertical values.

10

15

20

25 30 dnan

35


40

45

10

15

20

25 30 dnan

35

40

45

19

Basic notions

Using the θb∗ ◮

◮ ◮

b Bootstrap replicates θbr∗ used to estimate properties of θ. Write θ = θ(F ) to emphasize dependence on F Bias of θb as estimator of θ is iid β(F ) = E(θb | y1 , . . . , yn ∼ F ) − θ(F )

estimated by replacing unknown F by known estimate Fb: ◮

iid β(Fb) = E(θb | y1 , . . . , yn ∼ Fb ) − θ(Fb )

Replace theoretical expectation E() by empirical average: β(Fb) ≈ b = θb∗ − θb = R−1


R X r=1

θbr∗ − θb 20

Basic notions

◮

Estimate variance ν(F ) = var(θb | F ) by

1 X b∗ b∗ 2 v= θ −θ R − 1 r=1 r R

◮

◮

Estimate quantiles of θb by taking empirical quantiles of ∗ θb1∗ , . . . , θbR

For handedness data, 10,000 replicates shown earlier give b = −0.046,

v = 0.043 = 0.2052


21

Basic notions

Handedness data b Figure: Summaries of the θb∗ . Left: histogram, with vertical line showing θ. Right: normal Q–Q plot of θb∗ .

0.0

−0.5

0.5

t* 0.0

Density 1.0 1.5

0.5

2.0

2.5

Histogram of t

−0.5

0.0 t*

0.5

1.0

−4 −2 0 2 4 Quantiles of Standard Normal


22

Basic notions

How many bootstraps? ◮

◮ ◮

Must estimate moments and quantiles of θb and derived quantities. Nowadays often feasible to take R ≥ 5000 Need R ≥ 100 to estimate bias, variance, etc. Need R ≫ 100, prefer R ≥ 1000 to estimate quantiles needed for 95% confidence intervals

2 Z* 0 −2 −4

−4

−2

Z* 0

2

4

R=999

4

R=199

−4

−2 0 2 Theoretical Quantiles

4


−4

−2 0 2 Theoretical Quantiles

4

23

Basic notions

Key points ◮

Estimator is algorithm • • • •

◮

◮

applied to original data y1 , . . . , yn gives original θb applied to simulated data y1∗ , . . . , yn∗ gives θb∗ θb can be of (almost) any complexity for more sophisticated ideas (later) to work, θb must often be smooth function of data

Sample is used to estimate F

• Fb ≈ F — heroic assumption

Simulation replaces theoretical calculation • removes need for mathematical skill • does not remove need for thought • check code very carefully — garbage in, garbage out!

◮

Two sources of error • statistical (Fb 6= F ) — reduce by thought • simulation (R 6= ∞) — reduce by taking R large (enough)


24


Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


25


Normal confidence intervals ◮

◮

◮

. If θb approximately normal, then θb ∼ N (θ + β, ν), where θb has bias β = β(F ) and variance ν = ν(F ) If β, ν known, (1 − 2α) confidence interval for θ would be (D1) θb − β ± zα ν 1/2 ,

where Φ(zα ) = α. Replace β, ν by estimates: . . β(F ) = β(Fb) = b = θb∗ − θb X . . (θbr∗ − θb∗ )2 , ν(F ) = ν(Fb) = v = (R − 1)−1 r

◮

giving (1 − 2α) interval θb − b ± zα v 1/2 . Handedness data: R = 10, 000, b = −0.046, v = 0.2052 , α = 0.025, zα = −1.96, so 95% CI is (0.147, 0.963)


26


Normal confidence intervals Normal approximation reliable? Transformation needed? b b − θ)}: Here are plots for ψb = 12 log{(1 + θ)/(1

0.0

0.5

Density 1.0

1.5

◮

Transformed correlation coefficient −0.5 0.0 0.5 1.0 1.5

◮

−0.5 0.0 0.5 1.0 1.5 Transformed correlation coefficient

−4


−2 0 2 Quantiles of Standard Normal

4

27


Normal confidence intervals ◮

Correlation coefficient: try Fisher’s z transformation: b = ψb = ψ(θ)

for which compute bψ = R−1

R X r=1

◮

◮ ◮

1 2

b b log{(1 + θ)/(1 − θ)} 1 X b∗ b∗ 2 , ψ −ψ R − 1 r=1 r R

b ψbr∗ − ψ,

vψ =

(1 − 2α) confidence interval for θ is o n o n 1/2 1/2 , ψ −1 ψb − bψ − zα vψ ψ −1 ψb − bψ − z1−α vψ For handedness data, get (0.074, 0.804) But how do we choose a transformation in general?


28


Pivots ◮

◮

◮

◮

∗ mimic effect of sampling from Hope properties of θb1∗ , . . . , θbR original model. Amounts to faith in ‘substitution principle’: may replace unknown F with known Fb — false in general, but often more nearly true for pivots. Pivot is combination of data and parameter whose distribution is independent of underlying model. iid

Canonical example: Y1 , . . . , Yn ∼ N (µ, σ 2 ). Then Z=

◮

Y −µ (S 2 /n)1/2

∼

tn−1 ,

for all µ, σ 2 — so independent of the underlying distribution, provided this is normal Exact pivot generally unavailable in nonparametric case.


29


Studentized statistic ◮ ◮ ◮

Idea: generalize Student t statistic to bootstrap setting Requires variance V for θb computed from y1 , . . . , yn Analogue of Student t statistic: θb − θ V 1/2 If the quantiles zα of Z known, then Z=

◮

θb − θ Pr (zα ≤ Z ≤ z1−α ) = Pr zα ≤ 1/2 ≤ z1−α V

!

= 1 − 2α

(zα no longer denotes a normal quantile!) implies that Pr θb − V 1/2 z1−α ≤ θ ≤ θb − V 1/2 zα = 1 − 2α

so (1 − 2α) confidence interval is (θb − V 1/2 z1−α , θb − V 1/2 zα )


30

Confidence intervals ◮

Bootstrap sample gives (θb∗ , V ∗ ) and hence Z∗ =

◮

b V ): R bootstrap copies of (θ, (θb1∗ , V1∗ ),

and corresponding z1∗ =

◮

◮

θb1∗ − θb

, ∗1/2

V1 ∗ ∗ z1 , . . . , zR

θb∗ − θb V ∗1/2

(θb2∗ , V2∗ ),

z2∗ =

θb2∗ − θb

, ∗1/2

V2

...,

...,

∗ (θbR , VR∗ ) ∗ zR =

∗ −θ b θbR . ∗1/2 VR

Use to estimate distribution of Z — for example, ∗ < · · · < z∗ order statistics z(1) (R) used to estimate quantiles Get (1 − 2α) confidence interval ∗ θb − V 1/2 z((1−α)(R+1)) ,


∗ θb − V 1/2 z(α(R+1))

31


Why Studentize? ◮

D

Studentize, so Z −→ N (0, 1) as n → ∞. Edgeworth series: Pr(Z ≤ z | F ) = Φ(z) + n−1/2 a(z)φ(z) + O(n−1 ); a(·) even quadratic polynomial.

◮

Corresponding expansion for Z ∗ is Pr(Z ∗ ≤ z | Fb) = Φ(z) + n−1/2 b a(z)φ(z) + Op (n−1 ).

Typically b a(z) = a(z) + Op (n−1/2 ), so

Pr(Z ∗ ≤ z | Fb) − Pr(Z ≤ z | F ) = Op (n−1 ).


32


◮

D

If don’t studentize, Z = (θb − θ) −→ N (0, ν). Then z z z Pr(Z ≤ z | F ) = Φ 1/2 +n−1/2 a′ 1/2 φ 1/2 +O(n−1 ) ν ν ν and

Pr(Z ∗ ≤ z | Fb) = Φ

z z z −1/2 ′ +n b a φ 1/2 +Op (n−1 ). νb1/2 νb1/2 νb

Typically νb = ν + Op (n−1/2 ), giving

◮

Pr(Z ∗ ≤ z | Fb ) − Pr(Z ≤ z | F ) = Op (n−1/2 ).

Thus use of Studentized Z reduces error from Op (n−1/2 ) to Op (n−1 ) — better than using large-sample asymptotics, for which error is usually Op (n−1/2 ).


33


Other confidence intervals ◮

◮

Problem for studentized intervals: must obtain V , intervals not scale-invariant Simpler approaches: • Basic bootstrap interval: treat θb − θ as pivot, get ∗ b θb − (θb((R+1)(1−α)) − θ),

∗ b θb − (θb((R+1)α) − θ).

∗ • Percentile interval: use empirical quantiles of θb1∗ , . . . , θbR :

◮

∗ , θb((R+1)α)

∗ . θb((R+1)(1−α))

Improved percentile intervals (BCa , ABC, . . .) • Replace percentile interval with ∗ θb((R+1)α ′) ,

∗ θb((R+1)(1−α ′′ )) ,

where α′ , α′′ chosen to improve properties. • Scale-invariant. Anthony Davison: Bootstrap Methods and their Application,

34


Handedness data ◮

95% confidence intervals for correlation coefficient θ, R = 10, 000: Normal 0.147 0.963 Percentile −0.047 0.758 Basic 0.262 1.043 ′ ′′ BCa (α = 0.0485, α = 0.0085) 0.053 0.792 Student 0.030 1.206 Basic (transformed) Student (transformed)

◮

0.131 0.824 −0.277 0.868

Transformation is essential here!


35


General comparison ◮

Bootstrap confidence intervals usually too short — leads to under-coverage

◮

Normal and basic intervals depend on scale.

◮

Percentile interval often too short but is scale-invariant. Studentized intervals give best coverage overall, but

◮

• depend on scale, can be sensitive to V • length can be very variable • best on transformed scale, where V is approximately constant ◮

Improved percentile intervals have same error in principle as studentized intervals, but often shorter — so lower coverage


36


Caution ◮

◮

.. .

10

◮

Edgeworth theory OK for smooth statistics — beware rough statistics: must check output. Bootstrap of median theoretically OK, but very sensitive to sample values in practice. Role for smoothing?

0 -10

-5

T*-t for medians

5

. .. ... . ........ ......... ............. . ............................. .......... ................ ............ .............. ........ . ......... . ...... . . -2

0

2

Quantiles of Standard Normal


37


Key points

◮

Numerous procedures available for ‘automatic’ construction of confidence intervals

◮

Computer does the work

◮

Need R ≥ 1000 in most cases

◮

Generally such intervals are a bit too short

◮

Must examine output to check if assumptions (e.g. smoothness of statistic) satisfied

◮

May need variance estimate V — see later


38

Several samples

Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


39

Several samples

Gravity data Table: Measurements of the acceleration due to gravity, g, given as deviations from 980,000 ×10−3 cms−2 , in units of cms−2 × 10−3 .

1 76 82 83 54 35 46 87 68

2 87 95 98 100 109 109 100 81 75 68 67

3 105 83 76 75 51 76 93 75 62

Series 4 5 95 76 90 76 76 78 76 79 87 72 79 68 77 75 71 78


6 78 78 78 86 87 81 73 67 75 82 83

7 82 79 81 79 77 79 79 78 79 82 76 73 64

8 84 86 85 82 77 76 77 80 83 81 78 78 78

40

Several samples

Gravity data

40

60

g

80

100

Figure: Gravity series boxplots, showing a reduction in variance, a shift in location, and possible outliers.

1

2

3

4

5

6

7

8

series


41

Several samples

Gravity data ◮

Eight series of measurements of gravitational acceleration g made May 1934 – July 1935 in Washington DC

◮

Data are deviations from 9.8 m/s2 in units of 10−3 cm/s2

◮

Goal: Estimate g and provide confidence interval

◮

Weighted combination of series averages and its variance estimate !−1 P8 8 2 X y × n /s i i i=1 i , ni /s2i , V = θb = P 8 2 n /s i i=1 i i=1 giving

θb = 78.54,

V = 0.592

and 95% confidence interval of θb ± 1.96V 1/2 = (77.5, 79.8)


42

Several samples

Gravity data: Bootstrap ◮

◮

Apply stratified (re)sampling to series, taking each series as a separate stratum. Compute θb∗ , V ∗ for simulated data Confidence interval based on θb∗ − θb Z ∗ = ∗1/2 , V whose distribution is approximated by simulations z1∗ = giving

◮

∗ −θ b θbR θb1∗ − θb ∗ , . . . , z = , R 1/2 1/2 V1 VR

∗ ∗ (θb − V 1/2 z((R+1)(1−α)) , θb − V 1/2 z((R+1)α) )

For 95% limits set α = 0.025, so with R = 999 use ∗ , z∗ z(25) (975) , giving interval (77.1, 80.3).


43

Several samples

..

5 0 -5

z*

.

0

2

-2

0

2


... . .. .. .. ................... . . . .................................................................. . . .. .. . .... . .. .. ... ... .. . .. ....................................................................................... . . .............. ....... . . . ................................................................. ... . . . . . . . . . . . . . .. . ................................................................ .... . .. .. ...... ..... ..... ........... . . . ..... . .. ........................ . . . . . . ............................. ........ . . . ... . . ........... ..... .................. .. .. .. . . . . .. . . ..... ...... .... . . . . .. . . . . . .. . . . . . . .

.. . ... ..................... .... .. ................... ........................................ . .......................... .................... . ......................................................... .. . . . . . . . . .............................................. . . ...................................... . . . ................................. .... . . . . ... ... .. .................................... . .. . . . ...... ..... .... . . . . . ... ... ..... . . . . .

77

78

79

80

81

sqrt(v*)


0.1 0.2 0.3 0.4 0.5 0.6 0.7

sqrt(v*)

.

.. . .... ............ ........... ..................... ......... ... ...... ........ ............. .......... .... ....... .... ............. .... . . . . . . . . ... ........

. -2

0.1 0.2 0.3 0.4 0.5 0.6 0.7

.

-10

.. .. .... .... ..... ..... . . . . . .... ......... ............ ........ ........... ......... ...... .... ......... . . . . . . .... ..... .... ...

-15

79 77

78

t*

80

81

Figure: Summary plots for 999 nonparametric bootstrap simulations. Top: normal probability plots of t∗ and z ∗ = (t∗ − t)/v ∗1/2 . Line on the top left has intercept t and slope v 1/2 , line on the top right has intercept zero and unit slope. Bottom: the smallest t∗r also has the smallest v ∗ , leading to an outlying value of z ∗ .

. -15

-10

t*


-5

0

5

z*

44

Several samples

Key points

◮

For several independent samples, implement bootstrap by stratified sampling independently from each

◮

Same basic ideas apply for confidence intervals


45

Variance estimation

Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


46

Variance estimation

Variance estimation

◮

◮

Variance estimate V needed for certain types of confidence interval (esp. studentized bootstrap) Ways to compute this: • • • •

double bootstrap delta method nonparametric delta method jackknife


47

Variance estimation

Double bootstrap

◮ ◮

◮

◮ ◮

Bootstrap sample y1∗ , . . . , yn∗ and corresponding estimate θb∗ Take Q second-level bootstrap samples y1∗∗ , . . . , yn∗∗ from y1∗ , . . . , yn∗ , giving corresponding bootstrap estimates ∗∗ , θb1∗∗ , . . . , θbQ

Compute variance estimate V as sample variance of ∗∗ θb1∗∗ , . . . , θbQ

Requires total R(Q + 1) resamples, so could be expensive . Often reasonable to take Q = 50 for variance estimation, so need O(50 × 1000) resamples — nowadays not infeasible


48

Variance estimation

Delta method ◮

◮ ◮

◮ ◮ ◮

◮

Computation of variance formulae for functions of averages and other estimators b estimates ψ = g(θ), and θb ∼. N (θ, σ 2 /n) Suppose ψb = g(θ) Then provided g′ (θ) 6= 0, have (D2) b = g(θ) + O(n−1 ) E(ψ) b = σ 2 g′ (θ)2 /n + O(n−3/2 ) var(ψ)

. 2 ′ b2 b = Then var(ψ) σ b g (θ) /n = V Example (D3): θb = Y , ψb = log θb . b = Variance stabilisation (D4): if var(θ) S(θ)2 /n, find . b =constant transformation h such that var{h(θ)} Extends to multivariate estimators, and to ψb = g(θb1 , . . . , θbd )


49

Variance estimation


50

Variance estimation

Nonparametric delta method ◮ ◮

Write parameter θ = t(F ) as functional of distribution F General approximation: n 1 X . L(Yj ; F )2 . V = VL = 2 n j=1

◮

◮

L(y; F ) is influence function value for θ for observation at y when distribution is F : t {(1 − ε)F + εHy } − t(F ) , L(y; F ) = lim ε→0 ε where Hy puts unit mass at y. Close link to robustness. Empirical versions of L(y; F ) and VL are X lj2 , lj = L(yj ; Fb), vL = n−2 usually obtained by analytic/numerical differentation.


51

Variance estimation

Computation of lj ◮ ◮

Write θb in weighted form, differentiate with respect to ε Sample average: X 1X yj = wj yj θb = y = n wj ≡1/n Change weights:

wj 7→ ε + (1 − ε) n1 ,

wi 7→ (1 − ε) n1 ,

i 6= j

so (D5)

◮

y 7→ y ε = εyj + (1 − ε)y = ε(yj − y) + y, P −1 2 giving lj = yj − y and vL = n12 (yj − y)2 = n−1 n n s Interpretation: lj is standardized change in y when increase mass on yj


52

Variance estimation

Nonparametric delta method: Ratio ◮

Population F (u, x) with y = (u, x) and Z Z θ = t(F ) = x dF (u, x)/ u dF (u, x), sample version is

◮

θb = t(Fb) =

Z

x dFb(u, x)/

Z

u dFb(u, x) = x/u

Then using chain rule of differentiation (D6),

giving

b j )/u, lj = (xj − θu

1 X vL = 2 n

b j xj − θu u


!2 53

Variance estimation

Handedness data: Correlation coefficient ◮

Correlation coefficient P may be written as a function of −1 averages xu = n xj uj etc.: θb = n

xu − x u

(x2

− x2 )(u2

−

o1/2 ,

u2 )

from which empirical influence values lj can be derived ◮

In this example (and for others involving only averages), nonparametric delta method is equivalent to delta method

◮

Get vL = 0.029

◮

for comparison with v = 0.043 obtained by bootstrapping. b — as here! vL typically underestimates var(θ)


54

Variance estimation

Delta methods: Comments ◮

Can be applied to many complex statistics

◮

Delta method variances often underestimate true variances:

◮

b vL < var(θ)

Can be applied automatically (numerical differentation) if algorithm for θb written in weighted form, e.g. X xw = wj xj , wj ≡ 1/n for x

and vary weights successively for j = 1, . . . , n, setting X wj = wi + ε, i 6= j, wi = 1

for ε = 1/(100n) and using the definition as derivative Anthony Davison: Bootstrap Methods and their Application,

55

Variance estimation

Jackknife ◮

Approximation to empirical influence values given by lj ≈ ljack,j = (n − 1)(θb − θb−j ),

where θb−j is value of θb computed from sample y1 , . . . , yj−1 ,

◮

◮ ◮

yj+1 , . . . , yn

Jackknife bias and variance estimates are X 1 1X 2 − nb2jack ljack,j , vjack = ljack,j bjack = − n n(n − 1)

Requires n + 1 calculations of θb b with Corresponds to numerical differentiation of θ, ε = −1/(n − 1)


56

Variance estimation

Key points

◮

Several methods available for estimation of variances

◮

Needed for some types of confidence interval

◮

Most general method is double bootstrap: can be expensive

◮

Delta methods rely on linear expansion, can be applied numerically or analytically

◮

Jackknife gives approximation to delta method, can fail for rough statistics


57

Tests

Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


58

Tests

Ingredients

◮

Ingredients for testing problems: • data y1 , . . . , yn ; • model M0 to be tested; • test statistic t = t(y1 , . . . , yn ), with large values giving evidence against M0 , and observed value tobs

◮

◮

P-value, pobs = Pr(T ≥ tobs | M0 ) measures evidence against M0 — small pobs indicates evidence against M0 . Difficulties: • pobs may depend upon ‘nuisance’ parameters, those of M0 ; • pobs often hard to calculate.


59

Tests

Examples ◮

Balsam-fir seedlings in 5 × 5 quadrats — Poisson sample? 0 0 1 4 3

◮

1 2 1 1 1

2 0 1 2 4

3 2 1 5 3

4 4 4 2 1

3 2 1 0 0

4 3 5 3 0

2 3 2 2 2

2 4 2 1 7

1 2 3 1 0

Two-way layout: row-column independence? 1 2 0 1 0

2 0 1 1 1

2 0 1 2 1

1 2 1 0 1

1 3 2 0 1


0 0 7 0 0

1 0 3 1 0 60

Tests

Estimation of pobs ◮

◮

◮

Estimate pobs by simulation from fitted null hypothesis c0 . model M Algorithm: for r = 1, . . . , R: c0 ; • simulate data set y1∗ , . . . , yn∗ from M • calculate test statistic t∗r from y1∗ , . . . , yn∗ .

Calculate simulation estimate

of ◮

pb =

1 + #{t∗r ≥ tobs } 1+R

c0 ). pbobs = Pr(T ≥ tobs | M

Simulation and statstical errors:

pb ≈ pbobs ≈ pobs


61

Tests

Handedness data: Test of independence ◮ ◮

◮ ◮

◮

◮

Are dnan and hand positively associated? Take T = θb (correlation coefficient), with tobs = 0.509; this is large in case of positive association (one-sided test) Null hypothesis of independence: F (u, x) = F1 (u)F2 (x) Take bootstrap samples independently from Fb1 ≡ (dnan1 , . . . , dnann ) and from Fb2 ≡ (hand1 , . . . , handn ), then put them together to get bootstrap data (dnan∗1 , hand∗1 ), . . . , (dnan∗n , hand∗n ). b so With R = 9, 999 get 18 values of θb∗ ≥ θ, 1 + 18 pb = = 0.0019 : 1 + 9999 hand and dnan seem to be positively associated To test positive or negative association (two-sided test), b gives pb = 0.004. take T = |θ|:


62

Tests

c0 Handedness data: Bootstrap from M

1

2

2

1

2

15

20

0.5

3

1 2

1

1102 3 4 3

25

30 dnan

1

35

40

0.0

1

hand 4 5

6

2

Probability density 1.0 1.5 2.0

7

2.5

8

1

45


−0.6

−0.2 0.2 Test statistic

0.6

63

Tests

Choice of R ◮

◮

◮

Take R big enough to get small standard error for pb, typically ≥ 100, using binomial calculation: 1 + #{t∗r ≥ tobs } var(b p) = var 1+R pobs (1 − pobs ) 1 . Rpobs (1 − pobs ) = = 2 R R . so if pobs = 0.05 need R ≥ 1900 for 10% relative error (D7) . Can choose R sequentially: e.g. if pb = 0.06 and R = 99, can augment R enough to diminish standard error. Taking R too small lowers power of test.


64

Tests

Duality with confidence interval ◮

Often unclear how to impose null hypothesis on sampling scheme

◮

General approach based on duality between confidence interval I1−α = (θα , ∞) and test of null hypothesis θ = θ0

◮

Reject null hypothesis at level α in favour of alternative θ > θ0 , if θ0 < θα

◮

Handedness data: θ0 = 0 6∈ I0.95 , but θ0 = 0 ∈ I0.99 , so estimated significance level 0.01 < pb < 0.05: weaker evidence than before Extends to tests of θ = θ0 against other alternatives:

◮

• if θ0 ∈ 6 I 1−α = (−∞, θα ), have evidence that θ < θ0 • if θ0 ∈ 6 I1−2α = (θα , θα ), have evidence that θ 6= θ0


65

Tests

Pivot tests ◮ ◮

◮ ◮

◮

◮

Equivalent to use of confidence intervals Idea: use (approximate) pivot such as Z = (θb − θ)/V 1/2 as statistic to test θ = θ0 Observed value of pivot is zobs = (θb − θ0 )/V 1/2 Significance level is ! θb − θ = Pr(Z ≥ zobs | M0 ) ≥ zobs | M0 Pr V 1/2 = Pr(Z ≥ zobs | F ) . = Pr(Z ≥ zobs | Fb)

Compare observed zobs with simulated distribution of ∗1/2 , without needing to construct null b Z ∗ = (θb∗ − θ)/V c0 hypothesis model M Use of (approximate) pivot is essential for success


66

Tests

Example: Handedness data ◮

◮

Test zero correlation (θ0 = 0), not independence; θb = 0.509, V = 0.1702 : θb − θ0 0.509 − 0 zobs = 1/2 = = 2.99 0.170 V Observed significance level is 1 + #{zr∗ ≥ zobs } 1 + 215 = = 0.0216 1+R 1 + 9999

0.0

Probability density 0.1 0.2 0.3

pb =

−6

−4

−2 0 2 Test statistic

4


6

67

Tests

Exact tests ◮

Problem: bootstrap estimate is c0 ) 6= Pr(T ≥ t | M0 ) = pobs , pbobs = Pr(T ≥ tobs | M

so estimate the wrong thing ◮

In some cases can eliminate parameters from null hypothesis distribution by conditioning on sufficient statistic

◮

Then simulate from conditional distribution

◮

More generally, can use Metropolis–Hastings algorithm to simulate from conditional distribution (below)


68

Tests

Example: Fir data ◮ ◮

◮

◮

◮

iid

Data Y1 , . . . , Yn ∼ Pois(λ), with λ unknown Poisson model has E(Y ) = var(Y ) = λ: base test of overdispersion on X . T = (Yj − Y )2 /Y ∼ χ2n−1 ;

observed value is tobs = 55.15 Unconditional significance level:

c0 , λ) Pr(T ≥ tobs | M

Condition on value w of sufficient statistic W = c0 , W = w), pobs = Pr(T ≥ tobs | M

P

Yj :

independent of λ, owing to sufficiency of W Exact test: simulate from P multinomial distribution of Y1 , . . . , Yn given W = Yj = 107.


69

Tests

Example: Fir data

0.04

Figure: Simulation results for dispersion test. Left panel: R = 999 values of the dispersion statistic t∗ obtained under multinomial sampling: the data value is tobs = 55.15 and pb = 0.25. Right panel: chi-squared plot of ordered values of t∗ , dotted line shows χ249 approximation to null conditional distribution.

80 60 40 20

0.0

0.01

0.02

0.03

dispersion statistic t*

.

20

40

60

.

... . .... ...... ........ . . . ..... ...... ...... ..... ....... . . . . . ...... ...... ..... ...... ...... . . . . .. ..... ... . ..

80

t*


30

40

50

60

70

80

chi-squared quantiles

70

Tests

Handedness data: Permutation test ◮ ◮ ◮

◮

Are dnan and hand related? Take T = θb (correlation coefficient) again Impose null hypothesis of independence: F (u, x) = F1 (u)F2 (x), but condition so that marginal distributions Fb1 and Fb2 are held fixed under resampling plan — permutation test Take resamples of form (dnan1 , hand1∗ ), . . . , (dnann , handn∗ )

◮

◮

where (1∗ , . . . , n∗ ) is random permutation of (1, . . . , n) Doing this with R = 9, 999 gives one- and two-sided significance probabilities of 0.002, 0.003 Typically values of pb very similar to those for corresponding bootstrap test


71

Tests

1

0.0

2

0.5

3

hand 4 5

6

7

Probability density 1.0 1.5 2.0 2.5

8

3.0

Handedness data: Permutation resample

15

20

25

30 dnan

35

40

45


−0.6

−0.2 0.2 Test statistic

0.6

72

Tests

Contingency table 1 2 0 1 0 ◮

2 0 1 1 1

2 0 1 2 1

1 2 1 0 1

1 3 2 0 1

0 0 7 0 0

1 0 3 1 0

Are row and column classifications independent: Pr(row i, column j) = Pr(row i) × Pr(column j)?

◮

Standard test statistic for independence is T =

X (yij − ybij )2 , ybij i,j

◮

ybij =

yi· y·j y··

.

Get Pr(χ224 ≥ 38.53) = 0.048, but is T ∼ χ224 ?


73

Tests

Exact tests: Contingency table ◮

For exact test, need to simulate distribution of T conditional on sufficient statistics — row and column totals

◮

Algorithm (D8) for conditional simulation: 1. choose two rows j1 < j2 and two columns k1 < k2 at random 2. generate new values from hypergeometric distribution of yj1 k1 conditional on margins of 2 × 2 table y j 1 k1 y j 1 k2 y j 2 k1 y j 2 k2 3. compute test statistic T ∗ every I = 100 iterations, say

◮

Compare observed value tobs = 38.53 with simulated T ∗ — . get pb = 0.08


74

Tests

Key points

◮ ◮

Tests can be performed using resampling/simulation Must take account of null hypothesis, by • modifying sampling scheme to satisfy null hypothesis • inverting confidence interval (pivot test)

◮

Can use Monte Carlo simulation to get approximations to exact tests — simulate from null distribution of data, conditional on observed value of sufficient statistic

◮

Sometimes obtain permutation tests — very similar to bootstrap tests


75

Regression

Motivation

Basic notions


Several samples

Variance estimation

Tests

Regression


76

Regression

Linear regression ◮

Independent data (x1 , y1 ), . . ., (xn , yn ) with yj = xTj β + εj ,

◮

b leverages hj , residuals Least squares estimates β, ej =

◮

εj ∼ (0, σ 2 )

yj − xTj βb

(1 − hj

)1/2

.

∼

(0, σ 2 )

Design matrix X is experimental ancillary — should be held fixed if possible, as b = σ 2 (X T X)−1 var(β)

if model y = Xβ + ε correct


77

Regression

Linear regression: Resampling schemes ◮

Two main resampling schemes

◮

Model-based resampling: yj∗ = xTj βb + ε∗j ,

ε∗j ∼ EDF(e1 − e, . . . , en − e)

• Fixes design but not robust to model failure • Assumes εj sampled from population ◮

Case resampling: (xj , yj )∗ ∼ EDF{(x1 , y1 ), . . . , (xn , yn )} • Varying design X but robust • Assumes (xj , yj ) sampled from population • Usually design variation no problem; can prove awkward in designed experiments and when design sensitive.


78

Regression

Cement data Table: Cement data: y is the heat (calories per gram of cement) evolved while samples of cement set. The covariates are percentages by weight of four constituents, tricalciumaluminate x1 , tricalcium silicate x2 , tetracalcium alumino ferrite x3 and dicalcium silicate x4 .

1 2 3 4 5 6 7 8 9 10 11 12 13

x1 7 1 11 11 7 11 3 1 2 21 1 11 10

x2 26 29 56 31 52 55 71 31 54 47 40 66 68

x3 6 15 8 8 6 9 17 22 18 4 23 9 8

x4 60 52 20 47 33 22 6 44 22 26 34 12 12


y 78.5 74.3 104.3 87.6 95.9 109.2 102.7 72.5 93.1 115.9 83.8 113.3 109.4

79

Regression

Cement data ◮

Fit linear model y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε and apply case resampling

◮

◮

. Covariates compositional: x1 + · · · + x4 = 100% so X almost collinear — smallest eigenvalue of X T X is l5 = 0.0012 Plot of βb1∗ against smallest eigenvalue of X ∗T X ∗ reveals that var∗ (βb∗ ) strongly variable 1

◮

Relevant subset for case resampling — post-stratification of output based on l5∗ ?


80

Regression

10

10

Cement data

. .

5 beta2hat*

-5

.

.

5 10

-10

-10

.

. .

1

. . . . .. ........ .. . .............................. . . .. .................................. ........ ................ .. . .. ... ... ................................................................................ .. . . ... ..................... . ... . . .......... ............... .. .. .. ... .. . .

. . . 0

5 0 -5

beta1hat*

. .. . . . ..... ........... . . . . .. ..... ......................................................... .... . . ...... . .. .. .............................................................. . . .. ... .. ................................................................. . . .. . .... ................ . . ... ..... . . . .. . . .

50

500

1

smallest eigenvalue


. 5 10

50

500

smallest eigenvalue

81

Regression

Cement data

Table: Standard errors of linear regression coefficients for cement data. Theoretical and error resampling assume homoscedasticity. Resampling results use R = 999 samples, but last two rows are based only on those samples with the middle 500 and the largest 800 values of ℓ∗1 .

Normal theory Model-based resampling, R = 999 Case resampling, all R = 999 Case resampling, largest 800


βb0 70.1 66.3 108.5 67.3

βb1 0.74 0.70 1.13 0.77

βb2 0.72 0.69 1.12 0.69

82

Regression

Survival data dose x survival % y

◮ ◮

117.5 44.000 55.000

235.0 16.000 13.000

470.0 4.000 1.960 6.120

705.0 0.500 0.320

940.0 0.110 0.015 0.019

1410 0.700 0.006

Data on survival % for rats at different doses Linear model: log(survival) = β0 + β1 dose •• ••

• • • 200

• 600

• 1000

0

•

• • -2

• •

• • •

•

-4

20

30

log survival %

2

•

0

10

survival %

40

50

4

•

••

• 1400

• 200

dose


600

1000

1400

dose

83

Regression

Survival data ◮ ◮

-0.004 -0.008 -0.012

estimated slope

0.0

◮

Case resampling Replication of outlier: none (0), once (1), two or more (•). Model-based sampling including residual would lead to change in intercept but not slope.

•

•

•

• •• • • •

• 1 • • •• • •• 1 ••• • 1 •••••• ••• ••• 111 1 11 1 ••• • •••••••• ••••• 11 1111 1 • 1 1 1 1 1 1 1 1 1 11 11111111 1111111 1 11 1 1111 11111 1 1 1 1 1 1 1 1 1 1 0 1 1 11 1 1 0 0 0 01000 00 0 0 0 0 0 0 00000 0 0 0 0 000 0 0 0 00 0 000 0 0 0000 00 0 00 000 0000 00 0 0

0.5

1.0

1.5

2.0

2.5

3.0

sum of squares Anthony Davison: Bootstrap Methods and their Application,

84

Regression

Generalized linear model ◮

Response may be binomial, Poisson, gamma, normal, . . . yj

◮

mean µj , variance φV (µj ),

where g(µj ) = xTj β is linear predictor; g(·) is link function. b fitted values µ MLE β, bj , Pearson residuals rP j =

◮

∼

yj − µ bj

{V (b µj )(1 − hj )}

.

1/2

∼

(0, φ).

Bootstrapped responses

yj∗ = µ bj + V (b µj )1/2 ε∗j

where ε∗j ∼ EDF(rP 1 − rP , . . . , rP n − rP ). However • possible that yj∗ 6∈ {0, 1, 2, . . . , } • rP j not exchangeable, so may need stratified resampling Anthony Davison: Bootstrap Methods and their Application,

85

Regression

AIDS data ◮

Log-linear model: number of reports in row j and column k follows Poisson distribution with mean µjk = exp(αj + βk )

◮

Log link function g(µjk ) = log µjk = αj + βk and variance function var(Yjk ) = φ × V (µjk ) = 1 × µjk

◮

Pearson residuals: rjk =

◮

Yjk − µ bjk {b µjk (1 − hjk )}1/2

Model-based simulation:

1/2

∗ =µ bjk + µ bjk ε∗jk Yjk


86

Regression

Diagnosis period Year 1988

1989

1990

1991

1992

Quarter 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

Reporting-delay interval (quarters): 0† 31 26 31 36 32 15 34 38 31 32 49 44 41 56 53 63 71 95 76 67

1 80 99 95 77 92 92 104 101 124 132 107 153 137 124 175 135 161 178 181

2 16 27 35 20 32 14 29 34 47 36 51 41 29 39 35 24 48 39

3 9 9 13 26 10 27 31 18 24 10 17 16 33 14 17 23 25

4 3 8 18 11 12 22 18 9 11 9 15 11 7 12 13 12

5 2 11 4 3 19 21 8 15 15 7 8 6 11 7 11


6 8 3 6 8 12 12 6 6 8 6 9 5 6 10

··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···

≥14 6 3 3 2 2 1

Total reports to end of 1992 174 211 224 205 224 219 253 233 281 245 260 285 271 263 306 258 310 318 273 133

87

Regression

AIDS data

0

rP

2

4

1984 1986 1988 1990 1992

-2

+ ++ + + + + ++ ++ + + + + +++ ++ ++ + ++++ + + + +++ +++++

-4

300 200 0

100

Diagnoses

400

6

500

◮

Poisson two-way model deviance 716.5 on 413 df — indicates strong overdispersion: φ > 1, so Poisson model implausible Residuals highly inhomogeneous — exchangeability doubtful

-6

◮

0

1

2

3

4

5

skewness Anthony Davison: Bootstrap Methods and their Application,

88

Regression

AIDS data: Prediction intervals ◮

To estimate prediction error: ∗ • simulate complete table yjk ; ∗ • estimate parameters from incomplete yjk • get estimated row totals and ‘truth’ X b∗ X ∗ ∗ ∗ µ b∗+,j = eαbj = eβk , y+,j yjk . k unobs

• Prediction error

k unobs

∗ y+,j −µ b∗+,j ∗1/2

µ b+,j

studentized so more nearly pivotal. ◮

Form prediction intervals from R replicates.


89

Regression

AIDS data: Resampling plans ◮

Resampling schemes: • • • •

parametric simulation, fitted Poisson model parametric simulation, fitted negative binomial model nonparametric resampling of rP stratified nonparametric resampling of rP

◮

Stratification based on skewness of residuals, equivalent to stratifying original data by values of fitted means

◮

Take strata for which µ bjk < 1,

1≤µ bjk < 2,


µ bjk ≥ 2 90

Regression

AIDS data: Results ◮

Deviance/df ratios for the sampling schemes, R = 999.

◮

Poisson variation inadequate.

◮

95% prediction limits.

600

5 4 3

1.0

1.5

2.0

2.5

0.0

0.5

1.0

1.5

2.0

2.5

deviance/df

nonparametric

stratified nonparametric

300

5 4

200

3

+ + ++ + +

1

2

+

+ ++ + + + ++ + +

1989 1990 1991 1992

0

0

1

2

3

4

5

6

deviance/df

Diagnoses

0.5

6

0.0

400

500

2 1 0

0

1

2

3

4

5

6

negative binomial

6

poisson

0.0

0.5

1.0

1.5

deviance/df

2.0

2.5

0.0

0.5

1.0

1.5

2.0

2.5

deviance/df


91

Regression

AIDS data: Semiparametric model ◮

More realistic: generalized additive model µjk = exp {α(j) + βk } ,

600 500 Diagnoses

400 300

+ ++ + + + ++ ++ + + +++++ + +++ + +++++ + + + +++ + ++++

+ +

200

0

Diagnoses

◮

where α(j) is locally-fitted smooth. Same resampling plans as before 95% intervals now generally narrower and shifted upwards 100 200 300 400 500 600

◮

+ + ++ + +

1984 1986 1988 1990 1992


+ ++ + + + ++ +

1989 1990 1991 1992

92

Regression

Key points

◮ ◮

Key assumption: independence of cases Two main resampling schemes for regression settings: • Model-based • Case resampling

◮

Intermediate schemes possible

◮

Can help to reduce dependence on assumptions needed for regression model

◮

These two basic approaches also used for more complex settings (time series, . . .), where data are dependent


93

Regression

Summary ◮ ◮

Bootstrap: simulation methods for frequentist inference. Useful when • standard assumptions invalid (n small, data not normal, . . .); • standard problem has non-standard twist; • complex problem has no (reliable) theory; • or (almost) anywhere else.

◮

Have described • • • •

basic ideas confidence intervals tests some approaches for regression


94

Regression

Books ◮

Chernick (1999) Bootstrap Methods: A Practicioner’s Guide. Wiley

◮

Davison and Hinkley (1997) Bootstrap Methods and their Application. Cambridge University Press

◮

Efron and Tibshirani (1993) An Introduction to the Bootstrap. Chapman & Hall

◮

Hall (1992) The Bootstrap and Edgeworth Expansion. Springer

◮

Lunneborg (2000) Data Analysis by Resampling: Concepts and Applications. Duxbury Press

◮

Manly (1997) Randomisation, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall

◮

Shao and Tu (1995) The Jackknife and Bootstrap. Springer


95