Contingency Tables

CS395T Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2009
The University of Texas at Austin

Unit 11: Contingency Tables


Contingency Tables, a.k.a. Cross-Tabulation

Is alcohol implicated in malformations? (The maternal drinking data table appears later, with the permutation-test analysis.) This kind of data is often used to set public policy, so it is important that we be able to assess its statistical significance.


Contingency Tables (a.k.a. cross-tabulation)

Ask: Is a gene more likely to be single-exon if it is AT-rich?

rowcon = [(g.ne == 1) (g.ne > 1)];
colcon = [(g.piso < 0.4) (g.piso > 0.6)];
crosstab(rowcon * (1:2)', colcon * (1:2)')

[crosstab's 3x3 output, not reproduced here, includes annoying counts of "other", i.e. genes that satisfy neither rowcon or colcon condition]

table = contingencytable(rowcon,colcon)      % my improved function (below)

table =
        2386         689
       13369        3982

(fewer genes AT rich than CG rich)

sum(table,1)      % column marginals

ans =
       15755        4671

ptable = table ./ repmat(sum(table,1),[2 1])

ptable =
    0.1514    0.1475
    0.8486    0.8525

So can we claim that these are statistically identical? Or is the effect here also "significant but small"?

My contingency table function:

function table = contingencytable(rowcons, colcons)
nrow = size(rowcons,2);
ncol = size(colcons,2);
table = squeeze(sum( repmat(rowcons,[1 1 ncol]) .* ...
    permute(repmat(colcons,[1 1 nrow]),[1 3 2]), 1 ));

Chi-square (or Pearson) statistic for contingency tables

Notation: the observed counts are N_ij, with row marginals N_i· = Σ_j N_ij, column marginals N_·j = Σ_i N_ij, and total N.

Null hypothesis: rows and columns are independent, so the expected value of N_ij is n_ij = N_i· N_·j / N.

The statistic is:

χ² = Σ_ij (N_ij − n_ij)² / n_ij

• Are the conditions for a valid chi-square distribution satisfied? Yes, because the number of counts in all bins is large.
• If they were small, we couldn't use the fix-the-moments trick, because of the small number of bins (no CLT). This occurs often in biomedical data.
• So what then?

table =
        2386         689
       13369        3982

nhtable = sum(table,2)*sum(table,1)/sum(sum(table))

nhtable =
   1.0e+004 *
    0.2372    0.0703
    1.3383    0.3968

chis = sum(sum((table-nhtable).^2./nhtable))

chis =
    0.4369

p = chi2cdf(chis,1)      % d.f. = 4 − 2 − 2 + 1 = 1 (four cells, minus the row and column marginal constraints, plus one because the overall total is otherwise subtracted twice)

p =
    0.4914

Wow, can't get less significant than this! No evidence of an association between single-exon and AT- vs. CG-rich.

When counts are small, some subtle issues show up. Let's look closely.

The setup is: a table of counts n_ij, with "factors" (e.g., vaccinated vs. unvaccinated) as rows and "conditions" (e.g., healthy vs. sick) as columns, plus the marginals (totals, where a dot means summed over): row marginals n_i·, column marginals n_·j, and total N.

The null hypothesis is: "Conditions and factors are unrelated." To do a p-value test we must:
1. Invent a statistic that measures deviation from the null hypothesis.
2. Compute that statistic for our data.
3. Find the distribution of that statistic over the (unseen) population.

That's the hard part! What is the "population" of contingency tables? We'll now see that it depends (albeit slightly) on the experimental protocol, not just on the counts!

Protocol 1: Retrospective analysis or "case/control study"

Column C1 already has the disease; we retrospectively look at their factors. Both column marginals are fixed by the protocol. In the null hypothesis, both columns share row probabilities q and (1−q). But we don't know q. It's a "nuisance parameter".
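In equations (a reconstruction from the stated setup: column totals n·0 and n·1 fixed, shared row-0 probability q), the probability of the table is a product of two binomials:

P(table | q) = C(n·0, n00) q^n00 (1−q)^n10 × C(n·1, n01) q^n01 (1−q)^n11
             = [ q^n0· (1−q)^n1· ] × C(n·0, n00) C(n·1, n01)

where C(a,b) is the binomial coefficient. The bracketed factor depends only on the nuisance parameter q; the combinatorial factor depends only on the counts.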


Digression on the hypergeometric distribution

The Texas Legislature has m Republicans and n Democrats. A committee of size r is chosen randomly [not realistic!]. What is the probability distribution of i, the number of Democrats on the committee?

p(i) = hyper(i; m+n, n, r) = C(n,i) C(m, r−i) / C(m+n, r)

Numerator: the number of ways to choose i Democrats, times the number of ways to choose the remaining r−i Republicans. Denominator: the total number of ways to choose the committee.
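For concreteness, this is one line of MATLAB (the helper name hyper is mine, not from the slides):

hyper = @(i,m,n,r) nchoosek(n,i) * nchoosek(m,r-i) / nchoosek(m+n,r);
hyper(8,29,24,11)    % e.g., P(8 Democrats on an 11-member committee drawn from 24 D + 29 R) = 0.0353

The same expression reappears below as myprob in the Fisher Exact Test computation.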


Protocol 2: Prospective experiment or "longitudinal study"

Identify samples with the factors, then watch to see who gets the disease. Now the row marginals are fixed by the protocol. In the null hypothesis, both rows share the column probabilities p and (1−p). But we don't know p. It's now the nuisance parameter.
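Correspondingly (again a reconstruction), with row totals n0· and n1· fixed and a shared column-0 probability p:

P(table | p) = C(n0·, n00) p^n00 (1−p)^n01 × C(n1·, n10) p^n10 (1−p)^n11
             = [ p^n·0 (1−p)^n·1 ] × C(n0·, n00) C(n1·, n10)

the same structure as Protocol 1: a nuisance factor in p times a purely combinatorial factor.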


Protocol 3: Cross-sectional or snapshot study (no fixed marginals)

E.g., test all Austin residents for both disease and factors. Only the total N is fixed; the counts follow a multinomial distribution.

Asymptotic methods (e.g., chi-square) are often equivalent to estimating p and q, and thus the nuisance factors, from the data itself.
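In equations (a reconstruction): under the null hypothesis the cell probabilities factor as p_i q_j, so

P(table | p, q) = N! / (n00! n01! n10! n11!) × Π_ij (p_i q_j)^n_ij

with p_0 = p, p_1 = 1−p the row probabilities and q_0 = q, q_1 = 1−q the column probabilities.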


Digression on the multinomial distribution

On each i.i.d. try, exactly one of K outcomes occurs, with probabilities p_1, p_2, ..., p_K, where Σ_{i=1}^K p_i = 1.

For N tries, the probability of seeing exactly the outcome n_1, n_2, ..., n_K, where Σ_{i=1}^K n_i = N, is

P(n_1, ..., n_K | N, p_1, ..., p_K) = N! / (n_1! ··· n_K!) × p_1^n_1 p_2^n_2 ··· p_K^n_K

The product p_1^n_1 ··· p_K^n_K is the probability of one specific outcome sequence; the factor N!/(n_1!···n_K!) counts the equivalent arrangements. For example, with N = 26:

abcde fgh ijklmnop q rs tuvwxyz
(12345)(123)(12345678)(1)(12)(1234567)

The 26 letters have N! arrangements, partitioned into the observed n_i's: n_1 = 5, n_2 = 3, n_3 = 8, n_4 = 1, n_5 = 2, n_6 = 7.

So, in all three cases we got a product of "nuisance" probabilities (depending on unknown p or q or both) and a "sufficient statistic" conditioned on all the marginals. Fisher's Exact Test just throws away the nuisance factors and uses the sufficient statistic.

This can also be seen to be the (purely combinatorial) probability of the table with all marginals fixed. Numerator: the number of partitions with n00 = k. Denominator: the numerator summed over k. With all marginals fixed, the table is fully determined by k alone.
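Written out (a reconstruction), this is the hypergeometric probability

P(n00 = k | all marginals) = C(n0·, k) C(n1·, n·0 − k) / C(N, n·0)

and Vandermonde's identity (next) guarantees that the denominator is indeed the numerator summed over k.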

Vandermonde's identity:

Σ_i C(n,i) C(m, r−i) = C(m+n, r)

Proof: How many ways can you choose a subcommittee of size r from a committee with n Democrats and m Republicans? Count the same thing by conditioning on how many Democrats are on the subcommittee.

A statistic is sufficient "when no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter".


That was all about the distribution of tables in the null hypothesis. Now we complete the rest of the tail test paradigm. The most popular choice of statistic for 2x2 tables is the "Wald statistic" (sorry for the slight change in notation!). It is constructed so that it asymptotically becomes a true t-value.
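In symbols (reconstructed from the wald() function coded two slides below):

t = (p1 − p2) / sqrt( p (1 − p) (1/mm + 1/nn) ),  with p1 = m/mm, p2 = n/nn, p = (m+n)/(mm+nn)

where m and n are the row-1 counts in the two columns, and mm and nn are the column totals.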

Notice that this is monotonic with m when all marginals are fixed. You could instead use the Pearson (chi-square) statistic, but not the assumption that it is chi-square distributed.


So, the Fisher Exact Test looks like this:

• Compute the statistic for the data.
• Loop over all possible contingency tables with the same marginals (for 2x2 there is just one free parameter).
• Compute the statistic for each table in the loop.
• Accumulate the weight (by hypergeometric probability) of tables whose statistic is ≥ the data statistic.
• Output the p-value (or, because of discreteness effects, the range).

Is this table (the 2x2 example [8 3; 16 26], analyzed on the next slides) a significant result?

Actually, here in the 2x2 case, all statistics monotonic in m are equivalent (except for some two-tail issues)! So the test statistic only matters in the case of larger tables, when there is more than one degree of freedom (with fixed marginals).

Compute the Fisher Exact Test for our table:

myprob = @(m) nchoosek(24,m)*nchoosek(29,11-m)/nchoosek(53,11);
ms = 0:11
ps = arrayfun(myprob,ms)

ms =
     0     1     2     3     4     5     6     7     8     9    10    11

ps =
    0.0005    0.0063    0.0363    0.1140    0.2176    0.2649    0.2097    0.1078    0.0353    0.0070    0.0007    0.0000

[sum(ps(1:8)) ps(9) sum(ps(10:12))]      % below the data value | the data value (m = 8) | add for 2-tail

ans =
    0.9570    0.0353    0.0077

sum(ps(9:12))      % the 1-tail p-value

ans =
    0.0430

bar(ms,ps)

Editorial: We will next learn an efficient way to compute the Fisher Exact test. But despite the words "Fisher" (true) and "exact" (questionable) in its name, this test isn't conceptually well grounded, since virtually never are all marginals held fixed (none of Protocols 1, 2, 3 above)! At best it is an approximation that ignores the nuisance parameters (p and/or q). I don't understand why Fisher Exact is so widely used. I think it is historical accident, due to outdated frequentist worship of sufficient statistics!

A computational alternative to the Fisher Exact Test is the Permutation Test. The idea is to break any association between the row and column variables by shuffling. This is allowed under the null hypothesis of no association!

Start with a small table:

        A   B
    f   1   3
    g   2   1

Expand the table back to data, one (row-label, column-label) pair per count:

    f A
    f B
    f B
    f B
    g A
    g A
    g B

Randomly permute the second column:

    f B
    f A
    f A
    f B
    g B
    g A
    g B

Construct the new table:

        A   B
    f   2   2
    g   1   2

Compute and save the statistic, then repeat many times. Note that the expanded data will always have just two columns, for any size contingency table. (Notice also that all marginals are preserved; we will come back to this point.)

Aha! The permutation preserves all marginals. In fact, it is a Monte Carlo calculation of the Fisher Exact Test. And it is easy to compute for any size table! Code up the Wald statistic:

function t = wald(tab)
% Wald statistic for a 2x2 table: difference of the two column
% proportions, normalized by the pooled binomial standard error
m = tab(1,1); n = tab(1,2);
mm = m + tab(2,1);               % first column total
nn = n + tab(2,2);               % second column total
p1 = m/mm; p2 = n/nn;            % row-1 proportion in each column
p = (m+n)/(mm+nn);               % pooled row-1 proportion
t = (p1-p2)/sqrt(p*(1-p)*(1/mm+1/nn));

table = [8 3; 16 26]

table =
     8     3
    16    26

tdata = wald(table)

tdata =
    2.0542

The data show about a 2 standard deviation effect, except that they're not really standard deviations because of the small counts!

In scientific papers, people can equally well say "Fisher Exact test" or "Permutation test". You might think that the former sounds more learned, but to me it sounds like they don't know exactly what their test actually did!

Expand the table and generate permutations. (Darn it, I couldn't think of a way to do this in Matlab without an explicit loop, thus spoiling my no-loop record.)*

[row col] = ndgrid(1:2,1:2);
d = [];
for k=1:numel(table);
    d = cat(1,d,repmat([row(k),col(k)],table(k),1));
end;
size(d)

ans =
    53     2

accumarray(d,1,[2,2])      % check that we recover the original table

ans =
     8     3
    16    26

gen = @(x) wald(accumarray( [d(randperm(size(d,1)),1) d(:,2)] ,1,[2,2]));
gen(1)      % try one permutation just to see it work

ans =
   -0.6676

perms = arrayfun(gen,1:100000);      % it's fast, so can easily do lots of permutations

hist(perms,(-4:.1:5))
cdfplot(perms)

*Peter Perkins (MathWorks) suggests the wonderfully obscure
d = [rldecode(table,row) rldecode(table,col)];
where rldecode is one of Peter Acklam's Matlab Tips and Tricks.

We get discrete values because only a few discrete tables are possible. The data value (with ties) sits in the upper tail; the negative of the data value sits in the lower tail.

pvalgt = numel(perms(perms > wald(table)))/numel(perms)
pvalge = numel(perms(perms >= wald(table)))/numel(perms)
pvaltt = (numel(perms(perms > wald(table))) + numel(perms(perms < -wald(table))))/numel(perms)
pvaltte = (numel(perms(perms >= wald(table))) + numel(perms(perms <= -wald(table))))/numel(perms)
pvalwrongtail = numel(perms(perms > -wald(table)))/numel(perms)

pvalgt = 0.00812
pvalge = 0.04382
pvaltt = 0.01498
pvaltte = 0.05068
pvalwrongtail = 0.99314

We reproduce the values from the Fisher Exact test done combinatorially.

To learn more, let's play with a different data set: the maternal drinking and congenital malformation data. Different "standard methods" applied to this data get p-values ranging from 0.005 to 0.190 (Agresti 1992). The Fisher Exact Test done combinatorially is not a viable option (both because of computational workload and because we only derived the 2x2 case!), so we'll try the (equivalent) permutation test.

Expand the table and generate 1000 permutations. (Now takes ~1 min. Go figure out how to do the permutation test without expanding all the data! One possibility is sketched at the end of this slide.)

table = [17066 14464 788 126 37; 48 38 5 1 1]

table =
       17066       14464         788         126          37
          48          38           5           1           1
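The pearson() function used below is not shown on these slides; a minimal version, consistent with the chi-square statistic defined earlier (its exact form here is an assumption), would be:

function chis = pearson(tab)
% Pearson chi-square statistic: sum over cells of
% (observed - expected)^2 / expected, with expected counts
% formed from the product of the marginals
nhtab = sum(tab,2)*sum(tab,1)/sum(tab(:));
chis = sum(sum((tab-nhtab).^2./nhtab));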

pearson(table)

ans =
   12.0821

[row col] = ndgrid(1:size(table,1),1:size(table,2));
d = [];
for k=1:numel(table);
    d = cat(1,d,repmat([row(k),col(k)],table(k),1));
end;
size(d)

ans =
       32574           2

Yes, it has the dimensions we expect.

tablecheck = accumarray(d,1,size(table))

tablecheck =
       17066       14464         788         126          37
          48          38           5           1           1

And we can reconstruct the original table.

gen = @(x) pearson( accumarray([ d(randperm(size(d,1)),1) d(:,2)],1,size(table)));
gen(1)

ans =
    1.3378

perms = arrayfun(gen, 1:1000);
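About the challenge above (doing the permutation test without expanding all the data): since a permutation preserves all marginals, one can instead draw the first row of the table directly from a multivariate hypergeometric distribution, column by column. A sketch, assuming a 2-row table and the Statistics Toolbox function hygernd (this helper is mine, not from the slides):

function tab = nullsamptable(table)
% draw a random 2-row table with the same row and column marginals;
% equivalent to one permutation of the expanded data, without the expansion
c = sum(table,1);              % column marginals
k = sum(table(1,:));           % row-1 items still to be placed
remaining = sum(c);            % all items still to be placed
tab = zeros(size(table));
for j = 1:numel(c)
    % of the remaining items, k are "row 1"; column j draws c(j) of them
    tab(1,j) = hygernd(remaining, k, c(j));
    tab(2,j) = c(j) - tab(1,j);
    k = k - tab(1,j);
    remaining = remaining - c(j);
end

Then perms = arrayfun(@(x) pearson(nullsamptable(table)), 1:1000); runs with no 32574-row expansion.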


hist(perms,(0:.5:40))
cdfplot(perms)

pearson(table)

ans =
   12.0821

pvalgt = numel(perms(perms>pearson(table)))/numel(perms)
pvalge = numel(perms(perms>=pearson(table)))/numel(perms)

pvalgt =
    0.0380

pvalge =
    0.0380

Our answer: the p-value is 0.038.

Two questions remain:
1. How good or bad an approximation was it to hold all marginals fixed?
2. Is there a more powerful statistical test for this data?

The more powerful statistical approach to the maternal drinking contingency table is to recognize that the table is ordinal, not just nominal. Choose a test statistic that actually reflects your hypothesis!

• The columns are ordered by an increasing independent variable: "more drinks lead to more abnormalities".
• The obvious statistic is "difference of mean number of drinks between the two rows".
• If a threshold effect is plausible, might also try "difference of mean of square" (we will discuss multiple hypothesis correction!).

With this different statistic, we do a permutation test as before.

Input the table and display the means and their differences:

table = [17066 14464 788 126 37; 48 38 5 1 1]
sum(table(:))

table =
       17066       14464         788         126          37
          48          38           5           1           1

ans =
       32574

drinks = [0 0.5 1.5 4. 6.];      % reasonable quantification of the ordinal categories: exactness isn't important, since we get to define the statistic
drinksq = drinks.^2;
norm = sum(table,2);
mudrinks = (table * drinks')./norm
mudrinksq = (table * drinksq')./norm

mudrinks =
     0.2814
    0.39247

mudrinksq =
    0.26899
    0.78226

diff = [-1 1] * mudrinks
diffsq = [-1 1] * mudrinksq

diff =
    0.11108

diffsq =
    0.51327

These are our chosen "statistics". The question is: are either of them statistically significant? We'll use the permutation test to find out.

Expand table back to dataset of length 32574:

[row col] = ndgrid(1:2,1:5)      % this tells each cell its row and column number

row =
     1     1     1     1     1
     2     2     2     2     2

col =
     1     2     3     4     5
     1     2     3     4     5

d = [];
for k=1:numel(table);
    d = [d; repmat([row(k),col(k)],table(k),1)];
end;
size(d)

ans =
       32574           2

Yes, it has the dimensions we expect.

accumarray(d,1,[2,5])

ans =
       17066       14464         788         126          37
          48          38           5           1           1

And we can reconstruct the original table.

mean(drinks(d(d(:,1)==2,2)))

ans =
    0.39247

And we get the right mean, so it looks like we are good to go...

Compute the statistic for the data and for 1000 permutations. As before, the idea is to sample from the null hypothesis (no association) while keeping the distributions of each single variable unchanged. Do this by permuting a label that is irrelevant in the null hypothesis.

diffmean = @(d) mean(drinks(d(d(:,1)==2,2))) - mean(drinks(d(d(:,1)==1,2)));
diffmean(d)

ans =
    0.11108

diffmean([d(randperm(size(d,1)),1) d(:,2)])      % try one permutation just to see it work

ans =
    0.014027

perms = arrayfun(@(x) diffmean([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(perms(perms>diffmean(d)))/numel(perms)

pval =
    0.015

hist(perms,(-.15:.01:.3))

So, as a p-value, the association is now more than twice as significant as when we ignored the column ordering. We were throwing away useful information! Reminder: a p-value is a false positive rate.

Same analysis for the squared-drinks statistic:

diffmeansq = @(d) mean(drinksq(d(d(:,1)==2,2))) - mean(drinksq(d(d(:,1)==1,2)));
diffmeansq(d)

ans =
    0.51327

permsq = arrayfun(@(x) diffmeansq([d(randperm(size(d,1)),1) d(:,2)]), [1:1000]);
pval = numel(permsq(permsq>diffmeansq(d)))/numel(permsq)

pval =
    0.011

hist(permsq,(-.3:.05:1))

Should we apply a multiple hypothesis correction to both pvals (multiply by 2)? Probably not:

• mean and mean-of-squares are highly correlated, and
• the previous result was significant
• we're not just shopping uniform p-values

But, if your data can stand it, Bonferroni is the gold standard. Can you think of a principled way to do multiple hypothesis correction with highly correlated tests?

This was not bootstrap resampling! The latter tells us how much variation in the signal one might see in repeated identical experiments. The permutation test breaks the causal connection; bootstrap doesn't. Bootstrap might be useful in understanding why another experiment didn't see the effect (false negative).

diffmean(d(randsample(size(d,1),size(d,1),true),:))      % try one resample just to see it work

ans =
    0.20703

resamp = arrayfun(@(x) diffmean(d(randsample(size(d,1),size(d,1),true),:)), [1:1000]);
bias = mean(resamp) - diffmean(d)      % if this is large, we should worry
resamp = resamp - bias;
pval = numel(resamp(resamp>0))/numel(resamp)

Back to the Fisher Exact Test and its problems: because the distribution of tables is discrete, the p-values computed with ">" and with ">=" come out different (some people split the difference). In fact, the whole construct is fragile to irrelevant "number theoretical coincidences" about the values of the marginals:

• real protocols don't fix both sets of marginals
• Fisher's elegant elimination of the nuisance parameters p and/or q is a trap

We actually need to estimate a distribution for the nuisance parameters (p's and/or q's) and marginalize over them:

• this makes us Bayesians in a non-Bayesian (p-value) world
• but we've already seen that that's OK!

Remember "conjugate distributions"? Once again, we want to estimate parameters from observed counts. Beta is conjugate to Binomial:

P(n | N, q) = C(N,n) q^n (1−q)^(N−n)

Bayes, with prior:

P(q | N, n) ∝ q^n (1−q)^(N−n) P(q)

A "conjugate prior" is one that preserves the functional form of the distribution:

P(q) ∝ q^α (1−q)^β

(α = β = 0 is a perfectly good choice: flat prior on q.) So the conjugate distribution is

P(q | N, n) = q^(n+α) (1−q)^(N−n+β) / ∫₀¹ q^(n+α) (1−q)^(N−n+β) dq
            = [ Γ(N+α+β+2) / ( Γ(n+α+1) Γ(N−n+β+1) ) ] q^(n+α) (1−q)^(N−n+β)
            ∼ Beta(n+α+1, N−n+β+1)

This Beta distribution has

mean = (n+α+1) / (N+α+β+2)
var = (n+α+1)(N−n+β+1) / [ (N+α+β+2)² (N+α+β+3) ]

Matlab, Mathematica, and NR3 all have methods for generating random Beta deviates.
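For instance, in MATLAB (a hypothetical call, with illustrative n = 8 successes in N = 24 trials and a flat prior α = β = 0):

q = betarnd(8+0+1, 24-8+0+1)    % one posterior draw of q ~ Beta(n+α+1, N−n+β+1)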


If we generalize to contingency tables other than 2x2, Dirichlet is the relevant conjugate distribution to Multinomial. Multinomial distribution (you can derive it by "repeated binomial" or combinatorics as we did earlier):

P(n_1, n_2, ... | N, q_1, q_2, ...) = N! / (n_1! n_2! ···) × q_1^n_1 q_2^n_2 ···      (Σ_i n_i = N, Σ_i q_i = 1)

Conjugate distribution, using conjugate priors:

P(q_1, q_2, ... | N, n_1, n_2, ...) ∝ q_1^(n_1+α_1) q_2^(n_2+α_2) ···

the Dirichlet distribution. Normalization turns out to be:

P(q_1, q_2, ... | N, n_1, n_2, ...) = [ Γ(N + α_1 + 1 + α_2 + 1 + ···) / ( Γ(n_1+α_1+1) Γ(n_2+α_2+1) ··· ) ] q_1^(n_1+α_1) q_2^(n_2+α_2) ···

Rather amazingly, there is a simple way to generate a (non-independent) set of q deviates from an independent set of Gamma deviates:

y_i ∼ Gamma(n_i + α_i + 1),  p(y) = y^(n_i+α_i) e^(−y) / Γ(n_i+α_i+1),  q_i = y_i / Σ_i y_i

(In fact, in the case with K = 2, this is how Beta deviates [previous slide] are usually generated.)
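In MATLAB the trick is just (a sketch with illustrative counts and flat priors α_i = 0):

n = [2386 689 3982];       % illustrative counts in K = 3 categories
y = gamrnd(n+1, 1);        % independent y_i ~ Gamma(n_i + α_i + 1)
q = y ./ sum(y)            % jointly Dirichlet; the q_i sum to 1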


So let's reanalyze assuming that the condition (column) marginals were fixed by the protocol, and we Bayes-sample the row probabilities. (You'll never believe my encapsulated function unless I go through an example!)

>> table = [8 3; 16 26]

table =
     8     3
    16    26

>> marfix = sum(table,1)'      % column marginals (transposed)

marfix =
    24
    29

>> marvar = sum(table,2)'      % row marginals (transposed)

marvar =
    11    42

>> gammas = gamrnd(marvar+1,1)      % ~ Gamma(n_i + 1)

gammas =
   12.1000   44.5735

>> q = gammas ./ sum(gammas)      % generated (random) row probabilities q_i

q =
    0.2135    0.7865

>> qmat = repmat(q,[size(table,2),1])

qmat =
    0.2135    0.7865
    0.2135    0.7865

>> tabout = mnrnd(marfix,qmat)'      % finally, generate multinomial deviates for each column, using the generated row probabilities

tabout =
     4     8
    20    21

The reason everything is done in the transpose is because of the way that Matlab's mnrnd function expects its arguments to be shaped. Sorry about that!

In case it's not obvious, sampling over q is the same as marginalizing over q. We generate a deviate pair (f, q) by choosing a q, and then, independently, an f given q, so

p(f, q) = p(q) p(f | q)

We then ignore q, so our f is drawn from the distribution

p(f) = ∫ p(f, q) dq = ∫ p(q) p(f | q) dq

which is the same as the desired marginalization

p(f) = ∫ p(f | q) p(q) dq

(This is so close to self-evident that I'm not sure the proof adds anything!)


Encapsulate the sampling process into a function. Then, generate a bunch of samples and look at their Wald statistics.

function tabout = tabnullsamp(tabin)
% sample a table from the null hypothesis, with the column marginals
% fixed and the row probabilities Bayes-sampled from their posterior
marfix = sum(tabin,1)';                        % column marginals (fixed by protocol)
marvar = sum(tabin,2)';                        % row marginals (inform the posterior on q)
q = gamrnd(marvar+1,1);                        % the Gamma trick for Dirichlet deviates
qmat = repmat(q./sum(q),[size(tabin,2),1]);
tabout = mnrnd(marfix,qmat)';                  % one multinomial column per fixed column marginal

wald(table)

ans =
    2.0542

tabnullsamp(table), tabnullsamp(table)

ans =
     6     7
    18    22

ans =
     7     9
    17    20

wald(tabnullsamp(table))

ans =
    1.0392

samps = arrayfun(@(x) wald(tabnullsamp(table)), 1:30000);
hist(samps,-4:.05:4)
cdfplot(samps)


There are still discreteness effects (after all, these are integer tables), but they are less troubling:

pval = numel(samps(samps>=wald(table)))/numel(samps)
pvaltt = (numel(samps(samps>=wald(table)))+numel(samps(samps<=-wald(table))))/numel(samps)

pval =
    0.0445

Different from the previous (all-marginals-fixed) answer: protocol matters!

This would probably be the "honest" answer if the table were nominal, not ordinal. But, as said, since it is ordinal, the previous analysis using difference of means is more powerful.

You now know all you need to know about contingency tables, and much more than almost everyone who uses them!