Phylogenetic Dependency Networks

Phylogenetic dependency networks: Inferring patterns of adaptation in HIV

Jonathan M. Carlson

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2009

Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Jonathan M. Carlson and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of the Supervisory Committee:

David Heckerman
Walter L. Ruzzo

Reading Committee:

David Heckerman
James Mullins
Walter L. Ruzzo

Date:

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted “the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform.”

Signature

Date

University of Washington

Abstract

Phylogenetic dependency networks: Inferring patterns of adaptation in HIV

Jonathan M. Carlson

Co-Chairs of the Supervisory Committee:
Affiliate Professor David Heckerman, Medical Education and Biomedical Informatics, and Microbiology
Professor Walter L. Ruzzo, Computer Science and Engineering

Populations adapt to their environment through a process of natural selection. By studying this process, one can gain insight into the specific functions of adaptive traits that provide an advantage in certain environments. HIV has proven remarkably adept at adaptation, so much so that the virus quickly adapts to each individual who is infected, effectively nullifying the immune response of most patients. By identifying the specific adaptations HIV employs against the immune system, it may be possible to identify vaccine targets that reduce HIV’s capacity to successfully adapt.

This dissertation introduces the Phylogenetic Dependency Network (PDN) for the identification of adaptive traits and the environments in which they arise. The PDN is a directed graphical model in which nodes represent measurable traits of the population and the environment, and arcs represent probabilistic dependencies among traits. The probability component of the PDN consists of a model of adaptive evolution in which each population trait adapts to a set of predictors, the traits to which it is connected in the PDN. The structure of the PDN is identified through a model selection approach and can be interpreted as an estimate of which traits directly interact.

We introduce a class of probabilistic adaptive evolution models called conditional adaptation models. These models assume that each trait has evolved independently of all other traits in the PDN until it reached the current environment, at which point the predictors act to influence adaptation of the trait. One of the key benefits of this approach over traditional methods is the ability to simultaneously model multiple interactions. Existing approaches are typically constrained to consider the evolutionary interaction of two traits at a time. In complex environments in which each trait interacts with many other traits, this constrained view of adaptation blurs the distinction between traits that are truly interacting and traits that are only indirectly correlated. By modeling these interactions using conditional adaptation models, we are able to accurately capture dense networks of interactions.

We apply our PDN approach to study adaptation of HIV to the human cellular immune response, identifying a large set of HIV adaptations that consistently arise in patients with similar immune genetics. These adaptations often take the form of multiple mutations spanning large regions of HIV proteins and indicate the presence of preferred patterns of adaptation. Although these adaptation networks are quite complex, the presence of these preferred adaptation patterns suggests weak points in viral adaptation that may be exploited by future vaccines.

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction

Chapter 2: Detecting Adaptation: Introduction and Review
    2.1 Selection and adaptation
    2.2 Phylogeny confounds the comparative method
    2.3 Related work on the comparative method
    2.4 Limitations of existing methods

Chapter 3: HIV Immune Escape: Introduction and Review
    3.1 The HLA-restricted CTL response is a major selective force driving HIV-1 evolution within an infected host
    3.2 Escape follows generally predictable patterns in response to specific immune pressures
    3.3 Immune selection pressures drive HIV evolution at the population level: but to what extent?
    3.4 Assessing the extent of HLA-driven HIV-1 evolution at the population level: challenges and controversies
    3.5 HLA-associated immune pressures influence population HIV diversity at up to 40% of positions in some proteins
    3.6 Clinical consequences of immune-mediated evolution
    3.7 Strategies to cope with viral diversity in HIV-1 vaccine design
    3.8 Remaining challenges

Chapter 4: Phylogenetic Dependency Networks
    4.1 Phylogenetically corrected distributions for one predictor trait
    4.2 Phylogenetically corrected distributions for more than one predictor trait
    4.3 q-values
    4.4 Model Details

Chapter 5: Evaluation and Application of the Univariate Model
    5.1 Technical details
    5.2 Experiments with synthetic data
    5.3 Application 1: Effect of immune pressure on HIV evolution
    5.4 Application 2: Pairwise correlations between amino acids in HIV
    5.5 Application 3: Genomic search for genotype-phenotype associations in Arabidopsis thaliana
    5.6 Studies using the univariate conditional evolution model
    5.7 Limitations of univariate conditional evolution model
    5.8 Conclusions

Chapter 6: Evaluation of Multivariate Models
    6.1 Technical Details
    6.2 Model validation on synthetic data

Chapter 7: Using PDNs to Infer Patterns of Immune Escape and Covariation in HIV
    7.1 Technical details
    7.2 Phylogenetic dependency network for Gag p17 and p24
    7.3 Discussion

Chapter 8: Summary
    8.1 Limitations and future directions

Glossary

Bibliography

Appendix A: Next Generation Sequencing: Extending the model to single genome sequences
    A.1 Likelihood calculation
    A.2 Expectation maximization

Appendix B: On computing FDR for Fisher’s exact test
    B.1 Examples of FET for sequence data
    B.2 Background
    B.3 Computing pFDR for Fisher’s exact test
    B.4 Numerical results
    B.5 Creating synthetic data sets
    B.6 Proofs and Remarks
    B.7 Discussion

LIST OF FIGURES

2.1 Phylogeny confounds the comparative method
4.1 Phylogenetic dependency network
4.2 The univariate model
4.3 The multivariate model
4.4 Decision Tree leaf distribution
4.5 Noisy Add leaf distribution
5.1 PR and calibration curves on synthetic data
5.2 PR and calibration curves over different trees
5.3 PR curves for the real, full HLA-amino-acid data
5.4 Correlated amino-acid pairs in HIV-1 p6
5.5 GWAS for Arabidopsis bacterial response
6.1 p-value calibration
6.2 Noisy Add represents real data better than Decision Tree
6.3 Performance on HOMER data
6.4 Tree built from the combined HOMER and Durban cohorts
6.5 Performance on synthetic mixed clade data
6.6 Power to detect associations
7.1 Gag PDN for combined HOMER and Contract cohorts
7.2 Number of optimal epitopes found vs. q-value rank
8.1 Univariate model with linked predictors
8.2 Noisy Add model with linked and unlinked predictors
B.1 P-value histograms
B.2 Pooled vs. marginal p-values
B.3 The advantage of using $\pi_0(\alpha)$ computed using the filtering technique over not filtering. Because filtering only affects $\hat{\pi}_0$, these gains result in proportionally reduced (yet conservative) pFDR estimates.
B.4 Estimated pFDR vs. true false discovery proportion
B.5 Power gains for proposed pFDR method

LIST OF TABLES

5.1 Predicted HLA-amino acid associations in Gag
7.1 HLA-codon associations in which consensus is the predicted resistant form
B.1 2 × 2 contingency table on binary variables X and Y
B.2 Outcomes when testing m hypotheses
B.3 Comparing $\hat{\pi}_0$ estimations for synthetic data sets derived from the Epitope data with different true $\pi_0$. Storey’s method was evaluated at $\lambda = 0.5$.
B.4 Comparing $\hat{\pi}_0$ estimations for the real data sets

ACKNOWLEDGMENTS

I would like first to thank my supervisor, David Heckerman, who has been an invaluable mentor, manager, colleague and friend, as well as Larry Ruzzo, who has been incredibly supportive of me throughout my rather non-traditional graduate school experience. I would also like to thank Jim Mullins, for his collaboration, sharing his data and postdocs, and serving on my reading committee, and Elizabeth Thompson, for serving as my GSR.

This work has been the result of numerous collaborations, for which I am extremely grateful. In particular, this dissertation would be entirely different without the collaboration of Zabrina Brumme, Chanson Brumme and Richard Harrigan, who introduced me to (and continue my education on) HIV, have provided invaluable data and ideas, and have critically read and contributed to just about all the work I have done related to HIV. Special acknowledgment also goes to Philip Goulder and Philippa Matthews, who provided invaluable data and ideas for the results in chapter 7 and first suggested we try the Decision Tree model.

I have also had incredible support at MSR. In particular, Carl Kadie has been an incredible coding resource. Much of the implementation is due directly to him, and most everything else reflects his influence. In addition, the work in Appendix B was done in close collaboration with Guy Shani, who developed the efficient algorithms and performed the experiments. And of course, thanks to Jennifer Listgarten, who generally makes MSR a more interesting and enjoyable place to work.

Thanks to my parents for their unwavering support; to Richard Peterson, who introduced me to biology; to Tom Cormen, who introduced me to computer science; to Bob Gross, who brought it together for me in computational biology and gave me incredible research opportunities as an undergraduate; to Arijit Chakravarty, for mentoring me and shaping my approach to research; to Scott Saponas, Jon Froehlich, Seth Bridges and others, who made grad school fun and helped prove that CS grad students can hold their own in IM sports; and to Kate, for absolutely everything.

This work was supported in part by funding from a Microsoft Research Graduate Fellowship.

DEDICATION

To my amazing wife Kate, who is everything to me.

Chapter 1

INTRODUCTION

Since its identification as the pathogenic cause of Acquired Immunodeficiency Syndrome (AIDS) in the early 1980s, Human Immunodeficiency Virus Type 1 (HIV-1) has emerged as a major global pandemic with an estimated 33 million infected individuals worldwide at the end of 2007 [221]. Specifically targeting the CD4+ subset of T-lymphocytes (the so-called “helper T cells”), HIV-1 causes a progressive deterioration of immune function, leaving the infected individual susceptible to a range of opportunistic infections that eventually lead to AIDS and death. Although improvements in antiretroviral therapy have dramatically reduced HIV-related morbidity and mortality among those with access to treatment [172], the search for an effective HIV-1 vaccine continues.

One of the enduring challenges facing HIV vaccine design is the remarkable rate of viral mutation and adaptation that allows the virus to evade the adaptive immune response of the host. As the immune system learns to target the virus, novel viral mutations that allow the virus to escape the attack provide an advantage over virus particles (virions) lacking the mutations. These escape mutations thus come to dominate the viral population in the host via selection, and the immune system is left learning to target what must appear to be a novel pathogen, causing the process to repeat.

The result has been devastating to HIV vaccine design. The goal of a vaccine is to train the host immune system to recognize the virus before exposure, but when the virus is constantly changing, it is extremely difficult to predict what the attacking virus will look like, and thus how to train the immune system. Even when monkeys are inoculated and subsequently challenged with the same virus (so-called homologous challenge), protection is often incomplete, in part because any virions that survive the initial attack quickly adapt.

So then, is the search for a vaccine futile? Although the power of adaptation may seem insurmountable, it should not be surprising that there is a large body of evidence that the space of viable mutations is constrained. Indeed, through all the myriad mutations the result must still be an infectious virion. It would thus seem that identification and characterization of these constraints is both possible and necessary for the advancement of the field. That, in a nutshell, is the purpose of this dissertation: to develop, test, and apply a statistical model of evolution that can robustly identify patterns and constraints in viral evolution. The result is promising. Although the patterns are complex and will require significant follow-up to tease apart, they are also dense, suggesting a promising consistency that may provide insight into weak points in viral adaptation and suggest new targets for vaccine design.

It should be evident that adaptation is not a process unique to HIV, and thus the models explored here may find many uses outside the realm of HIV. Nevertheless, the rapid rate of HIV adaptation provides a unique opportunity to capture adaptation in near real time and to analyze distinct populations that are isolated from each other (by virtue of infecting different hosts), yet have had a chance to adapt to their environment. Thus, the broad approach we take is to analyze a large cohort of individuals who have been infected for some time (to allow the adaptations to arise) and have not been exposed to antiretroviral therapy (which introduces a tremendous source of selection pressure that may obliterate the signal induced by the immune system). We do so by employing statistical models of evolution that attempt to identify common patterns of selection and adaptation.
The statistical model we propose is the phylogenetic dependency network (PDN), so called because it is an adaptation of the dependency network [94] to an evolutionary context. In brief, the PDN is a graphical model that relies heavily on local probability distributions that are conditioned on an underlying phylogeny. The graphical structure, as well as the parameters of the probability distributions, are learned from the data, with the resulting structure indicating statistical dependencies among variables. Although these dependencies cannot be interpreted as causal, they provide a useful means of understanding potential interactions among the variables and suggest simple experiments that can confirm specific biological interactions. We describe the model in detail in chapter 4.

There are several possible probability distributions that fit naturally within the PDN framework. Although we describe the details of the models in chapter 4, it is useful to explore how the distributions work in practice. Chapter 5 is devoted to the simpler model, which we call the univariate conditional adaptation model, as it describes the adaptation of one trait in response to a single source of selection pressure. We use synthetic data sets and real-world analyses to explore the properties of the model and compare it to previous approaches. Interestingly, this simple model has proven quite useful in practice. We review some recent studies that have utilized the univariate model in section 5.6.

The more complete model we propose is the multivariate conditional adaptation model, of which there are two variations. These models incorporate multiple sources of selection pressure and are, in principle, better able to describe dense interaction networks, in which each trait is influenced by myriad sources of selection pressure. These models are explored in detail in chapter 6.

In chapter 7, we apply the PDN to the study of HIV adaptation. In this work, we combine the two largest antiretroviral-naïve cohorts, comprising a diverse set of HIV sequences and host genetics, into a single analysis. Together with the PDN, this combined cohort gives us unprecedented statistical power and precision to explore the patterns of HIV adaptation in response to the immune system.
To the biologist, this chapter may be read as the main result of the dissertation. For the interested reader, the appendices provide technical discussions of tangential issues that arose as part of the dissertation research. Appendix A provides a preliminary look at the technical details of extending the models discussed in this dissertation to the case in which individual viral sequences are sampled from each individual. In Appendix B, we consider the statistical characteristics of the false discovery rate for the simplest probability distribution we encounter: the joint distribution of two independent, identically distributed binary variables evaluated using Fisher’s exact test (FET) [66].

Chapter 2

DETECTING ADAPTATION: INTRODUCTION AND REVIEW

2.1 Selection and adaptation

Let us begin with a brief, informal introduction to natural selection and adaptation, as the concepts and terminology will be useful in later discussions. The process of natural selection and adaptation can be summarized as follows. In any population of individuals, there is natural variation among those individuals arising from genetic mutations, which arise randomly and then are passed on to offspring. Most mutations are deleterious, resulting in individuals who are less fit than the rest of the population, meaning that, on average, they tend to produce fewer offspring. Because individuals who have these deleterious mutations produce fewer offspring, these mutations become rare over time and may even be eliminated from the population. This process is referred to as negative or purifying selection.

In contrast, some mutations actually increase the ability of the individual to procreate. Over time, individuals harboring these mutations come to dominate the population and the mutation may even reach fixation, meaning that all (or nearly all) surviving individuals carry the mutation. This process is referred to as positive selection, and the mutation (or set of mutations) that resulted in increased fitness is called an adaptation. Finally, many mutations make no discernible difference to the individual. If we were to follow the frequency of such mutations over a long period of time, we would see the frequency of the mutation follow a random walk pattern, referred to as neutral evolution.

The process of selection and adaptation is thus an interactive process between the population and the environment. Certain characteristics (traits) of the environment favor some mutations over others. Such traits serve as a source of selection pressure, and the resulting adaptations are thus in response to that selection pressure. Thus, when the environment changes, the selection pressures may change, favoring different adaptations.

Where the study of adaptation becomes particularly interesting is when comparable populations are compared in different environments. In this scenario, a mutation that is neutral in environment A may be beneficial in environment B. Thus, over time, this mutation will be far more prevalent in environment B than in environment A. Furthermore, the implication is that there are specific characteristics of environment B that interact (directly or indirectly) with the mutation. In the case of HIV, these interactions prove vital for the design of an effective vaccine (see chapter 3).

To ground our example, suppose we are interested in trait Y, which can take on values in {0, 1}. In addition, suppose there exists an environmental trait X ∈ {0, 1}, which exerts selection pressure on Y, such that when X = 1, there is a selective advantage for Y = 1 over Y = 0. Then we have

Pr{Y = 1 | X = 1} > Pr{Y = 0 | X = 1}.

If we sample enough individuals from the two different environments, we will be able to see this correlation in the form of statistical dependence. In this vein, the comparative method [92] seeks to identify correlated traits, with the assumption that correlation implies the interaction of selection pressure and adaptation, which may imply a specific function for the traits involved. In essence, one simply samples a large number of traits from individuals, as well as a number of environmental traits, then tests for correlations among all pairs of traits. Interactions that achieve some significance threshold are then considered candidates for experimental follow-up to determine the underlying process of selection and adaptation that leads to the apparent correlation.
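The scenario of a mutation that is neutral in one environment and beneficial in another can be illustrated with a small simulation. The following is a minimal sketch only, assuming a haploid Wright-Fisher population with a fixed size and a constant fitness advantage; this model, the function names, and all parameter values are illustrative assumptions and are not the models developed in this dissertation. The fraction of carriers averaged over replicate populations serves as a rough estimate of the probability that Y = 1 in each environment.

```python
import random


def final_frequency(rng, pop_size, generations, start_freq, advantage):
    """Frequency of a mutation after drift plus selection (haploid Wright-Fisher).

    advantage is the relative fitness benefit s of carrying the mutation:
    s = 0 is neutral evolution, s > 0 positive selection, s < 0 purifying selection.
    """
    freq = start_freq
    for _ in range(generations):
        # Selection: carriers contribute offspring in proportion to fitness 1 + s.
        weighted = freq * (1.0 + advantage)
        p_next = weighted / (weighted + (1.0 - freq))
        # Drift: the next generation is a random sample of pop_size offspring.
        carriers = sum(rng.random() < p_next for _ in range(pop_size))
        freq = carriers / pop_size
    return freq


def carrier_probability(advantage, replicates=20, seed=0):
    """Average final frequency of Y = 1 over replicate populations."""
    rng = random.Random(seed)
    runs = [final_frequency(rng, 200, 100, 0.1, advantage) for _ in range(replicates)]
    return sum(runs) / replicates


# Hypothetical environments: the mutation is neutral when X = 0, beneficial when X = 1.
pr_y1_given_x0 = carrier_probability(advantage=0.0)
pr_y1_given_x1 = carrier_probability(advantage=0.1)
print(f"estimated Pr{{Y = 1 | X = 0}} = {pr_y1_given_x0:.2f}")
print(f"estimated Pr{{Y = 1 | X = 1}} = {pr_y1_given_x1:.2f}")
```

Under these assumed settings the mutation should be far more prevalent where it is beneficial, which is the kind of statistical dependence between X and Y that the comparative method looks for.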
The challenge with the comparative method, as elegantly described by Felsenstein [61], lies in the confounding effect of the evolutionary history of the traits, which tends to make traits look more correlated than they really are.

2.2 Phylogeny confounds the comparative method

The relevant question then is how we should test for significance. Suppose we are considering the two binary traits X and Y. To determine whether these two variables are associated, we could count the number of individuals with and without each trait and apply a simple statistical test for independence, such as Fisher’s exact test. This procedure, however, ignores the phylogenetic structure among the sequences [61]. Suppose these sequences have the phylogeny shown in Figure 2.1a. In essence, there are two clusters of individuals, where individuals within a cluster are similar to each other but quite different from those in the other cluster. Now suppose we observe that traits X and Y are present in the two individuals on the top and absent in the two individuals on the bottom, as shown in Figure 2.1a. The observations of both traits are well explained by the phylogeny alone and should not be treated as independent observations. Consequently, the application of Fisher’s exact test or some other test that ignores the phylogenetic structure would overcount these observations when determining the correlation of X and Y. Such overcounting will lead to an overestimation of the statistical significance and therefore to an inflated number of false positives. In contrast, suppose we make the observations shown in Figure 2.1b. Here, observing the presence and absence of the trait Y in the same branch of the phylogeny is quite surprising, until the observations of X are taken into account. In this case, the application of a simple test would undercount the observations when determining the correlation of X and Y, leading to an underestimation of statistical significance and potentially increasing the number of false negatives. Simple statistical methods such as Fisher’s exact test assume the data to be infinitely exchangeable or independent and identically distributed (IID).
Although sequence data and other biological data are IID a priori, they are not IID once we learn their hierarchical structure. Furthermore, as we have just seen, this structure can easily confound the statistical search for associations within such data.

Figure 2.1: Examples illustrating the (a) overcounting and (b) undercounting of evidence for an association between X and Y.

An important point that has not been emphasized in previous work is that different applications may involve different evolutionary processes, leading to different kinds of confounding and requiring different solutions. For example, in one process, X and Y coevolve according to the phylogenetic tree: any change in Y during evolution influences the evolution of X, and vice versa. Here, the phylogenetic tree serves as a confounder of X and Y in the traditional sense: the tree is a hidden common cause of both X and Y, leading to spurious correlations between X and Y when the tree is ignored. In another process, only Y evolves according to the tree and the influence of X on Y occurs only at the tips of the tree. X need not evolve according to the tree, but instead can have any distribution, including one in which the observations of X are IID. We refer to these two processes as coevolution and conditional adaptation, respectively.
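The overcounting effect described in this section can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical two-clade tree in which each trait may change once on the long branch leading to each clade and rarely on the short tip branches, and it uses a plain standard-library implementation of the two-sided Fisher's exact test. The traits X and Y are evolved independently, so every significant result is a false positive. Shuffling Y across individuals removes its tree structure and serves as an exchangeable baseline. All function names and parameter values are assumptions made for this example.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k individuals in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p)


def evolve_trait(rng, tips_per_clade, p_deep=0.5, p_tip=0.05):
    """Binary trait on a hypothetical two-clade tree (long clade branches, short tips)."""
    root = rng.random() < 0.5
    tips = []
    for _ in range(2):
        ancestor = root ^ (rng.random() < p_deep)  # possible change on the long branch
        tips += [int(ancestor ^ (rng.random() < p_tip)) for _ in range(tips_per_clade)]
    return tips


def false_positive_rate(seed, shuffle, reps=1000, tips_per_clade=20, alpha=0.05):
    """Fraction of independently evolved (X, Y) pairs that the test calls significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = evolve_trait(rng, tips_per_clade)
        y = evolve_trait(rng, tips_per_clade)
        if shuffle:
            rng.shuffle(y)  # break the link between Y and the tree
        a = sum(1 for xi, yi in zip(x, y) if xi and yi)
        b = sum(x) - a
        c = sum(y) - a
        d = len(x) - a - b - c
        hits += fisher_exact_two_sided(a, b, c, d) < alpha
    return hits / reps


structured = false_positive_rate(seed=1, shuffle=False)
exchangeable = false_positive_rate(seed=1, shuffle=True)
print(f"false positive rate with tree structure: {structured:.3f}")
print(f"false positive rate after shuffling Y:   {exchangeable:.3f}")
```

In this assumed setting the tree-structured data should produce far more rejections than the nominal 5% level, whereas the shuffled data should not, which mirrors the argument that ignoring the phylogeny overstates significance.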

2.3 Related work on the comparative method

In this section, we briefly review existing comparative methods that account for phylogeny. The interested reader is referred to the textbooks of Harvey and Pagel [92] and Felsenstein [62], as well as the review by Martins [149], for a more detailed review of the field.

2.3.1 Correlated evolution of continuous variables

One of the clearest arguments for identifying correlated traits in the context of phylogenies comes from Felsenstein [61], who laid out an argument similar to that given in section 2.2 and proposed the method of independent contrasts as a remedy. In this method, the (continuous) traits are assumed to be derived from a Brownian motion (or other Gaussian) model of evolution and the n samples are converted to n − 1 differences between adjacent nodes in the tree, with the variance of the differences computed according to the branch lengths of the tree. The resulting differences are independent, allowing for regression tests between two traits.

Of course, other models of neutral evolution may be more appropriate, as the random walk of Brownian motion is rather implausible for most systems, in which at least some purifying selection provides a continuous draw toward some steady state. A variation of Felsenstein’s method using the Ornstein-Uhlenbeck (OU) process, which can be thought of as a random walk tethered by an elastic band to some center, provides a more realistic framework [89]. Butler and King [29], building on Hansen’s model [89], provide a natural framework in which different selective regimes can be tested. In essence, a specific source of selection pressure, hypothesized to act on a specific set of branches, is allowed to change the center to which the OU process gravitates. The univariate conditional model (section 4.1 and chapter 5) can be thought of as a discrete-variable version of this OU approach.

These methods are effective and widely used, but are appropriate only for continuous data. Felsenstein provided a method for modeling a discrete variable using an underlying continuous variable and a thresholding procedure, but likelihood estimation requires Markov chain Monte Carlo and is quite slow in practice [64].

2.3.2 Correlated evolution of discrete variables

For discrete data, early work focused on reconstructing ancestral states using parsimony and looking for correlations between transitions using standard methods such as the χ² test of independence [141, 190]. The problem with these approaches is that they ignore the uncertainty in the assignment of ancestral states. In some cases, choosing an ancestral state when there is roughly equal evidence for two states can alter the conclusions. To account for this uncertainty, Pagel [171] presented a maximum likelihood approach that averages over all possible configurations of the internal nodes. In effect, Pagel performs a likelihood ratio test where the null model is two binary traits evolving independently using classical maximum likelihood over phylogenies [60], and the alternative model has 2 × 2 = 4 states that define correlated evolution. At about the same time, Muse independently proposed the four-state version of Pagel’s method for the purpose of detecting compensatory mutations in RNA secondary structure [158].

2.3.3 Correlated evolution of amino acids

Work on the coevolution of discrete traits has been propelled largely by the amino acid coevolution community. Here the goal is to identify coevolving pairs of codons in the same or interacting proteins (see [39] for a recent review). Early methods ignored the problem of phylogeny and used simple methods of correlation such as mutual information [11, 122] or the correlation coefficient [80, 173, 215], but the phylogeny was soon shown to play a confounding role [168, 177, 228]. A number of methods emerged that attempted to calibrate p-values based on the phylogeny. The idea behind these approaches is that the primary concern of phylogeny-based confounding is that the resulting statistics do not follow the expected distribution, making any p-values that are computed on the assumption of a specific distribution invalid. To recalibrate, the null distribution is estimated from the data using some method that accounts for the phylogeny. Briefly, methods have been proposed based on global or local measures of average similarity [26, 127, 138, 159, 213] or flat clusters based on early bifurcations of the phylogenetic tree [11, 27]. One approach that explicitly incorporates the phylogeny is the parametric bootstrap, in which standard IID models are recalibrated by generating independently evolved amino acids according to an independence model of evolution [27, 228] (in chapter 5, we consider this method in more detail). One broad criticism of these approaches is that they leave the underlying ordering of the statistics unchanged. That is, it is implicitly assumed that the phylogeny does not affect the strength of the association, only the interpretation of the relative strength. As we have argued in section 2.2, however, the phylogeny may profoundly alter the relative strength of the statistic, even changing the ordering of associations. Thus, directly incorporating the phylogeny into the derivation of the statistic may dramatically increase power.
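To make the recalibration idea concrete, the following is a minimal sketch of a parametric bootstrap for a single pair of binary traits. It is an illustration under stated assumptions, not the procedure of [27, 228]: the balanced tree, the symmetric two-state substitution model, the fixed substitution rate (which in practice would be estimated from the data), the use of mutual information as the statistic, and all function names are hypothetical choices made for this example.

```python
import math
import random


def balanced_tree(depth, branch_length):
    """Build a hypothetical balanced binary tree; returns (children, leaves)."""
    children, leaves = {}, []

    def grow(node, remaining):
        if remaining == 0:
            leaves.append(node)
            return
        kids = [node + "0", node + "1"]
        children[node] = [(kid, branch_length) for kid in kids]
        for kid in kids:
            grow(kid, remaining - 1)

    grow("r", depth)
    return children, leaves


def flip_probability(rate, t):
    """P(state differs after time t) for a symmetric two-state Markov chain."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))


def simulate_trait(children, leaves, rate, rng):
    """Evolve one binary trait from the root down to the leaves."""
    state = {"r": rng.randint(0, 1)}
    stack = ["r"]
    while stack:
        node = stack.pop()
        for child, t in children.get(node, []):
            flipped = rng.random() < flip_probability(rate, t)
            state[child] = state[node] ^ int(flipped)
            stack.append(child)
    return [state[leaf] for leaf in leaves]


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two binary vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            joint = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            px = sum(1 for xi in x if xi == a) / n
            py = sum(1 for yi in y if yi == b) / n
            if joint > 0:
                mi += joint * math.log(joint / (px * py))
    return mi


def bootstrap_p_value(x, y, children, leaves, rate, n_sim, rng):
    """Calibrate the IID statistic against traits evolved independently on the tree."""
    observed = mutual_information(x, y)
    exceed = 0
    for _ in range(n_sim):
        null_x = simulate_trait(children, leaves, rate, rng)
        null_y = simulate_trait(children, leaves, rate, rng)
        if mutual_information(null_x, null_y) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    children, leaves = balanced_tree(depth=5, branch_length=0.1)
    x = simulate_trait(children, leaves, 1.0, rng)
    y = simulate_trait(children, leaves, 1.0, rng)
    print(bootstrap_p_value(x, y, children, leaves, 1.0, 200, rng))
```

Note that the statistic itself is computed exactly as in the IID case; only the null distribution against which it is judged comes from the tree.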

A more direct way to incorporate the phylogeny is to adapt one of the discrete methods from the evolutionary biology community. Several methods have derived from Ridley’s original proposal of reconstructing ancestral states [17, 74, 102, 182]. Similarly, Poon et al. [181] and Pollock et al. [178], which we study in some detail in chapter 5, develop special cases of Pagel’s method [171] that force the model to be reversible, and Yeang and Haussler [233] develop a version of Pagel’s method that does not require the collapse of amino acids into binary space, but rather achieves the necessary state-space reduction by setting the instantaneous rate of simultaneous mutation in both codons to a nonzero constant that is the same for all amino acids.

2.3.4 Identifying adaptation from DNA sequence alone

An alternative approach to discrete data is to simply identify codons that are under diversifying (positive) or purifying (negative) selection pressure, an approach that is particularly well suited to settings in which there are no testable hypotheses regarding the source of selection pressure. Several methods have been developed that compare the rate of synonymous mutations ds to that of nonsynonymous mutations dn to identify genes, regions, or even individual codons that experience more amino acid substitutions than expected based on the underlying neutral substitution rate [160]. One of the more common methods is PAML, which uses a codon substitution model over a phylogeny to compute dn/ds [165, 231]. Stewart et al. [207] extended this concept in the program QUASI to identify specific residues that are positively selected, although this model assumes a star phylogeny in that only deviations from consensus are considered. Peters et al. [174] identified HLA-associated polymorphisms by identifying positively selected residues in HIV using QUASI, then correlating those residues to HLA alleles using standard tests that assume independence (see chapter 3 for an introduction to the importance of HLA-derived selection pressure on HIV). Delport et al. [48] developed a codon model that specifically identifies amino acids that toggle back and forth, on the assumption that these represent specific adaptations to certain environments (assumed to be HLA in their case study) that revert in the absence of the environment. Chen and Lee [35] developed a more direct approach to identifying sources of selection pressure by defining a dn/ds ratio that was conditional on specific environmental variables (in their case, HIV antiretroviral drugs), but also did so using a naïve definition of dn/ds based on deviation from consensus. Such definitions implicitly assume a star phylogeny and ignore branch lengths (i.e., they assume the traits are IID), and are thus likely to lead to statistical bias in cases where there is some structure in the tree, as in HIV [17, 31]. Even when the phylogeny and branch lengths are considered, as in PAML, dn/ds ratios can be viewed as a means of calibrating statistics. That is, the synonymous substitution rate serves to normalize observations regarding nonsynonymous substitutions. Thus, models that explicitly model evolutionary interactions between two variables (e.g., [31, 102, 141, 158, 171, 178, 181, 190]) are expected to have greater statistical power than dn/ds ratios, due to the fact that they can upweight surprising deviations between evolutionarily similar species whereas dn and ds represent summary statistics across all species and cannot leverage such information.

2.3.5 Similarities to population genetics

Population structure in biological data has also been addressed in the area of genome-wide association studies (GWAS). Although standard population genetic models assume populations mate randomly, violations of this assumption result in latent population structure that inflates false positive rates, an effect that will only increase as study sizes increase [144, 145]. There are two rather different approaches in this community that have been used to compensate for population structure. The more commonly used approach attempts to recalibrate standard statistics by normalizing results according to the distribution of the statistic across the entire genome [50, 51]. As we shall see, solutions to calibration are insufficient, as population structure also affects discriminatory power. The other approach assumes population structure is flat and can be captured by a small number of (perhaps overlapping) clusters or continuous hidden variables [185, 186, 198, 204, 217]. Although these methods increase discriminatory power relative to simple IID models, there is mounting evidence that populations are hierarchically structured. For example, in addition to high level geographical/social constraints that impose population structure [195], structure exists within a number of subpopulations that have been studied [13, 30, 96, 223]. If hierarchical models describe the data better than flat cluster models, then it stands to reason that such models will have higher discriminatory power. Thus, several authors have suggested that a more accurate approach would be to model population structure hierarchically [8, 114, 240]. Aranzana et al. [8] described one such model in their Arabidopsis study, which we re-examine in section 5.5. A more general approach that can capture pedigree structure is the linear mixed-effects model [97, 114], most recently extended by Yu et al. [241]. This linear model includes a pairwise correlation term that models genetic relatedness, a white noise term, and an environmental impact term that is used to identify sources of selection pressure. This approach can be thought of as unifying population genetics and phylogenies, in that Brownian motion and OU processes over phylogenies can be captured in the genetic relatedness term [101, 140]. Unfortunately, modeling discrete data with these approaches is computationally slow and difficult to optimize, as it requires variational approximations (H. Kang and D. Heckerman, unpublished data).

2.4 Limitations of existing methods

Although the comparative method has a long and rich history of methodological development, we note two major gaps in the literature, each of which will be addressed by this dissertation.

2.4.1 Chains of interaction

First, we note that the traditional application of the comparative method is to look for correlations among pairs of variables. The problem with this approach can be seen by a simple example. We have been considering the case where trait X exerts selection pressure on trait Y, which can be graphically depicted as X → Y. Suppose, however, that Y in turn exerts selection pressure on another trait Z. A common example is if Y and Z are both codons in the same protein. In this scenario, a mutation in Y may serve as an adaptation in response to X, but the change in Y destabilizes the protein unless a compensatory change in Z occurs. This causal model can be graphically depicted as X → Y → Z. The problem is, when we apply the comparative method to all pairs of variables, we will find that all three pairs (X, Y), (Y, Z), and (X, Z) are correlated, even though Z is conditionally independent of X given Y. We refer to such causal models as chains of interactions. If chains are common, then indirect associations ((X, Z) in the present example) will be common, leading to a large number of false positive results. To our knowledge, only Poon et al. [182] have addressed the problem of chains of interactions when considering discrete traits in the context of phylogeny. (Several authors have employed Bayesian networks, which solve this problem, but have done so assuming the traits are IID [45, 181].) Poon’s approach can be described in three steps: (1) first, a phylogeny is inferred from all traits (they focus on amino acid covariation); (2) next, they infer the most likely state of each trait for each hidden node in the phylogeny in a method similar to [190]; (3) finally, they feed both the observed and inferred traits into a standard directed acyclic graphical (DAG) model inference algorithm [93], treating the inferred states as observed data. This paper represents a major advance in the field in that it was the first to both identify and address the problem of chains of interactions. We must, however, note several weaknesses of this approach. First, one must be cautious whenever hidden nodes are treated as observed data. In cases where there is strong evidence that the hidden node takes on a specific value, this approximation may yield good results in practice. For large phylogenies, however, it is often the case that the evidence for one state is only slightly greater than the evidence for another state, especially for nodes near the root (and thus farther removed from the observed data). Although Poon et al.
showed that their method was reasonably robust to different instantiations of the hidden nodes on the data sets they considered, power is generally gained when all the hidden nodes are integrated out. Second, the DAG model has some inherent constraints. For example, the acyclicity constraint, which is required for computational tractability, may not be a reasonable assumption. Certainly in the case of amino acid covariation one can expect cycles to exist in the true causal model. The result can be difficulty in interpreting the independencies implied by the resulting structure [94]. Additionally, the goal of the DAG model is to maximize the joint likelihood over all the attributes. This requires an exponential number of parameters relative to the number of traits considered, leading to a substantial computational burden and a requirement for a massive amount of data to infer all but the simplest interaction networks.

2.4.2 Coevolution versus conditional adaptation

As discussed in section 2.2, there are at least two evolutionary processes that can lead to phylogenetic confounding: coevolution and conditional adaptation. To our knowledge, all of the existing approaches assume a coevolutionary process, either by explicitly incorporating the assumption in a generative model (e.g., [158, 171, 178]) or by mapping both traits to the same tree (e.g., [17, 74, 102, 141, 182, 190]). In principle, most of the p-value calibration techniques could be adapted to estimate null data by randomizing only one variable with respect to the tree, though we are not aware of any explicit discussion in the literature to this effect. The problem with this assumption is clear. When one of the traits (typically the environmental trait) does not map well to the phylogeny of the other trait, forcing it to do so will hurt the modeling procedure. In practice, there are two solutions to this problem. (1) Many approaches can model an IID variable on a tree as a boundary condition of the model parameters. For example, the generative model of Pollock et al. [178] can, in principle, handle either or both traits being IID by setting the corresponding mutation rate parameter to infinity. In practice, however, we find that this model does not perform well on conditional adaptation data (see chapter 5). The difference between conditional adaptation and coevolution is, however, deeper than the fact that conditional adaptation is more likely to accommodate the two traits following two different distributions. The specific interactions assumed by the causal models are different. In the coevolution case, the assumption is typically that the two traits are either positively or negatively correlated in the traditional sense. By this we mean that, if the traits are positively correlated, the mutation of either trait to 1 will apply pressure for the other trait to mutate to 1. Conversely, the mutation of either trait to 0 will apply pressure for the other trait to mutate to 0. (The opposite is of course true for negative correlation.) The conditional adaptation assumption differs in two respects. First, the association can be directed. Specifically, changing trait X (the environmental variable in our example) may influence trait Y, but changing trait Y will have no influence on X. Second, the interaction may be only partially positive (or negative). For example, it may be that X = 1 induces positive selection pressure for Y to transition to 1, but when X = 0, Y is under neutral selection. The point is not that the coevolution model is fundamentally unsound. Indeed, there are many examples where this model is likely closer to the true causal model than is conditional adaptation, with a prime example being amino acid covariation. Rather, each model appears to better describe different evolutionary processes. In this dissertation, we propose several specific conditional adaptation models, which are inspired by the process of adaptation described in section 2.1. In chapter 5, we explicitly compare one of these models to the coevolution model of Pollock et al. [178]. As we shall see, the coevolution and conditional adaptation models are distinct, though we also find that our conditional adaptation model is able to approximate the coevolution model reasonably well.
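The contrast between the two assumptions can be illustrated with a toy pair of rate functions for two binary traits. This is a minimal sketch under assumptions introduced here for illustration only: the function names, the base rate, the selection multiplier s, the choice to slow reversion of Y while X = 1, and the restriction to one trait changing at a time are hypothetical, and the code is neither one of the models developed in later chapters nor the model of Pollock et al. [178].

```python
from itertools import product

# Joint states (x, y) of two binary traits.
STATES = list(product((0, 1), repeat=2))


def coevolution_flip_rates(x, y, base=1.0, s=4.0):
    """Symmetric positive coupling: a mismatch speeds up flips of both traits."""
    factor = s if x != y else 1.0 / s
    return base * factor, base * factor  # (rate X flips, rate Y flips)


def conditional_adaptation_flip_rates(x, y, base=1.0, s=4.0):
    """Directed, partial coupling: X ignores Y; Y is selected only when X = 1."""
    rate_x = base  # X never responds to Y
    if x == 0:
        rate_y = base  # Y is neutral when the pressure is absent
    elif y == 0:
        rate_y = base * s  # Y is pushed toward 1 when X = 1
    else:
        rate_y = base / s  # assumption: reversion is slowed while X = 1
    return rate_x, rate_y


def generator(flip_rates):
    """4 x 4 rate matrix over STATES, with one trait changing at a time."""
    q = [[0.0] * len(STATES) for _ in STATES]
    for i, (x, y) in enumerate(STATES):
        rate_x, rate_y = flip_rates(x, y)
        q[i][STATES.index((1 - x, y))] = rate_x
        q[i][STATES.index((x, 1 - y))] = rate_y
        q[i][i] = -(rate_x + rate_y)
    return q


if __name__ == "__main__":
    for name, rates in [("coevolution", coevolution_flip_rates),
                        ("conditional adaptation", conditional_adaptation_flip_rates)]:
        print(name)
        for row in generator(rates):
            print("  ", row)
```

In the coevolution generator every flip rate depends symmetrically on both traits, whereas in the conditional adaptation generator the rate at which X changes is constant and Y evolves neutrally whenever X = 0.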

Chapter 3 HIV IMMUNE ESCAPE: INTRODUCTION AND REVIEW

The extraordinary mutational capacity of HIV-1 represents a major challenge to vaccine development [76, 143]. On average, the error-prone HIV-1 Reverse Transcriptase introduces one mutational “error” per replication cycle, while template-switching and recombination represent additional mechanisms for generating alternative viral species [143]. Within an infected individual, a progressive expansion of viral diversity occurs over the disease course [205], with multiple variants co-existing as a heterogeneous swarm or quasispecies that is unique to each patient. On a global scale, HIV-1 has undergone dramatic diversification since its introduction into humans less than 100 years ago [121, 229]: nucleotide sequences from the multiple viral subtypes and circulating recombinant forms comprising HIV-1 Group “M” strains (which account for the majority of infections worldwide) may differ by up to 35–40% [76]. Achieving a broader understanding of the factors driving viral evolution on both an individual and a global level is thus of paramount importance to vaccine design.

Over the natural course of infection, the host immune response acts as a major selective force driving HIV-1 evolution in a continuous dynamic process known as immune escape [85]. Antibodies are directed against three-dimensional epitopes on the virion surface, and their role is to neutralize free-floating virus or to tag it for destruction by effector cells or complement. Escape from the HIV-1-specific antibody response thus takes the form of amino acid substitutions within the viral Envelope protein and represents a main driver of both intra-individual and global HIV Envelope diversity [226]. In contrast, the role of cytotoxic T-lymphocytes (CTL) is to eliminate virus-infected cells through recognition of short, linear peptides processed intracellularly and presented on the cell surface by Human Leukocyte Antigen (HLA) class I molecules. Since peptides from all viral proteins have the capacity to bind and be presented by class I molecules, HLA-restricted CTL select for escape mutations on a proteome-wide basis. This fact, combined with the observation that CTL likely contribute more to immune control of HIV-1 infection than antibodies [133], highlights CTL as a potentially major in vivo selective force driving genome-wide viral evolution. It is generally agreed that a successful HIV-1 vaccine will require stimulation of an effective CTL-based immune response in addition to an antibody response [133]. The recent suspension of a major CTL-based HIV-1 vaccine trial [161] underscores the need to improve our understanding of host antiviral immunity, including the impact of immune-driven viral adaptation on HIV-1 sequence diversity and its potential consequences for future vaccine strategies. In what follows, we summarize recent advances in our knowledge of CTL-driven HIV evolution at a population level and discuss the implications for vaccine design.

3.1 The HLA-restricted CTL response is a major selective force driving HIV-1 evolution within an infected host

HLA-restricted CTL are major mediators of host antiviral control during HIV-1 infection [85, 86]. A temporal correlation exists between the appearance of HIV-1-specific CTL in vivo and the decline of acute-phase viremia [123] (antibodies appear only later [85, 133]), and experimental depletion of CD8+ cells in rhesus macaques prior to Simian Immunodeficiency Virus (SIV) infection results in an inability to control virus levels [199]. In addition, a strong epidemiological link exists between specific HLA class I alleles and differential rates of HIV-1 disease progression [14, 34, 44], suggesting that the quality of the CTL response and/or the characteristics of targeted epitopes strongly influence the effectiveness of antiviral control [85]. However, the strongest evidence supporting CTL as a major determinant of HIV-1 control may be mutational immune escape. First described in 1997, selection of viral escape mutations within key CTL epitopes during primary [16, 19] and chronic [84, 85] HIV-1 infection identified immune-driven evolution as a continuous process occurring throughout the disease course. Although escape mutations are often selected within CTL epitopes (thus disrupting peptide-HLA binding or recognition of the peptide/HLA complex by the T-cell receptor), escape is by no means confined to epitope boundaries. Mutations in epitope flanking regions, which impair intracellular peptide processing and presentation, have also been described [3, 54, 239], as have secondary or compensatory sequence changes that can stabilize escape mutations selected elsewhere [41, 85, 201].

3.2 Escape follows generally predictable patterns in response to specific immune pressures

In the past decade, observational studies have identified a large number of CTL escape mutations that are reproducibly selected in the context of specific HLA restrictions [19, 84, 85, 113, 132, 201]. This has led to a monumentally important observation: HIV-1 evolution follows generally predictable patterns when specific immune pressures are applied. This phenomenon was most strikingly demonstrated in a unique case report of monozygotic twins infected with the same virus on the same day through needle sharing: over a three-year follow-up period, the kinetics and patterns of CTL and antibody escape mutations were nearly identical in both twins [53]. Even among unrelated individuals, kinetics and patterns of HIV-1 evolution are broadly predictable based on HLA restriction. The majority of persons expressing the “protective” HLA-B*57 allele [34], for example, will select for a T to N mutation at position three of the TW10 epitope in the p24 Gag protein within the first weeks after infection [85, 132]. In B*27-expressing individuals, the first mutation that arises in the immunodominant Gag epitope KK10 is an L to M change at position six, followed years later by an R to K change at position two [85, 113]. The fact that sites and pathways of escape are broadly predictable indicates that despite the extensive worldwide sequence diversity of HIV-1, substantial constraints govern the evolution of this virus [2, 53]. This raises the possibility that immunogens incorporating knowledge of common escape pathways may be designed. Until relatively recently, however, identification of escape mutations has generally been limited to smaller observational studies, and largely biased towards “protective” HLA alleles associated with long term viremic control [19, 84, 85, 113, 132, 201].

3.3 Immune selection pressures drive HIV evolution at the population level: but to what extent?

If within-host CTL escape patterns are broadly predictable based on HLA profile, the frequency and distribution of HLA alleles in humans likely shape viral evolution at the population level in a similarly predictable manner. Indeed, the HLA footprinting hypothesis states that the circulating HIV-1 consensus sequence reflects viral adaptation to the most commonly-expressed HLA alleles in a population [131, 157]. One potential mechanism underlying this hypothetical footprinting effect is the repeated selection of fitness-neutral escape mutations in the context of frequently-observed HLA alleles, eventually leading to fixation of “inactive” forms of CTL epitopes in the circulating pool of viral strains and potentially rendering HIV less immunogenic as the epidemic progresses [131, 157]. Alternatively, a bottleneck effect may have occurred early in the course of the pandemic. Under this hypothesis, escape mutations arising in the earliest patients persist in the circulating strain. In either scenario, persisting escape mutations must have minimal fitness cost in the absence of CTL to prevent reversion back to the susceptible form following transmission to patients who lack the restricting HLA allele. The role of CTL in shaping population HIV-1 sequence diversity is influenced by a complex interaction among multiple conflicting selective forces, one example being the delicate balance between the benefits of escape versus the associated costs to viral fitness [137]. While escape mutations, by definition, confer a selective advantage under active CTL pressure, these mutants may not represent the most efficiently replicating species in the absence of immune pressure. For example, the dominant T242N escape mutation at position three of the B*57-restricted TW10 epitope in p24 Gag abrogates B*57-epitope binding [132], but also confers a substantial replicative cost in the absence of CTL pressure as demonstrated by rapid reversion to wild type following transmission to a B*57-negative individual [132] and in vitro assays measuring viral replicative capacity [21, 147]. Reversion of escape mutations following transmission to an individual lacking the HLA allele, however, does not necessarily occur in all cases: the fitness cost of the substitution, the presence of compensatory mutations, and a complex array of other host and viral selective forces influence which immune-selected mutations may reach appreciable levels in the population [132]. Estimating the extent of immune imprinting on HIV-1 is additionally challenging due to the lack of a comprehensive map of HLA-associated escape sites across the viral genome. Until recently, studies of CTL escape have focused on select alleles and/or HLA-restricted epitopes in small observational studies; however, recent advances in DNA sequencing technologies have facilitated the collection of HLA and HIV-1 sequence data in large cohorts of HIV-infected individuals, thus allowing the first population-based assessments of HLA-driven imprinting on the viral genome.

3.4 Assessing the extent of HLA-driven HIV-1 evolution at the population level: challenges and controversies

The first study investigating HLA-mediated imprinting on HIV-1 at the population level was published by Moore et al. in 2002 [157]. Specific HLA alleles associated with the presence or absence of the consensus amino acid over codons 20-227 of the Reverse Transcriptase protein were identified in a cross-sectional analysis of over 400 clinically-derived HIV-1 sequences from Western Australia, using Fisher’s exact test and logistic regression. A total of 64 “positive” and 25 “negative” correlations (where the HLA was respectively associated with divergence from, or preservation of, the consensus residue) were identified. Positive correlations confirmed the predictable selection of HLA-restricted escape mutations and highlighted the substantial portion of viral codons whose evolution is driven by escape. Negative associations, on the other hand, were interpreted as evidence to support the HLA footprinting effect; in other words, confirmation that the predominant circulating HIV-1 sequence had arisen in response to selection pressures imposed by the most commonly-expressed HLA alleles in the population [157]. There is one major concern, however, with this type of statistical association study. Evolutionary biologists have pointed out that standard tests of association that assume independence (such as Fisher’s exact test or logistic regression) are inappropriate for the analysis of inter-species (or in this case, inter-strain) data, because sequences with a shared phylogenetic history do not represent statistically independent observations [61] (see chapter 2 for review). This problem is most apparent in the analysis of heterogeneous cohorts, where standard tests will identify correlations between subtype-specific viral polymorphisms and HLA alleles prevalent among individuals infected with those subtypes. In this case, results do not necessarily indicate that these polymorphisms are selected under contemporary immune pressures; rather, they more likely reflect a correlation between HLA alleles enriched in particular human populations and a subtype (or lineage)-specific viral polymorphism shared by all sequences in this branch of the tree (a so-called founder effect) [17]. Even in relatively homogeneous cohorts, failure to account for the evolutionary relatedness among HIV-1 sequences leads to increased variance in traditional statistical tests and thus uncertainty in interpreting the results (see ref. [31] and chapter 2). To address this issue, Bhattacharya et al.
[17] proposed two methods to account for the underlying phylogenetic structure of HIV-1 sequences when identifying sites of CTL-mediated selection (the second method is the univariate model described and explored in Chapters 4 and 5). These approaches attempted to identify associations that were unusual in the context of a shared lineage while statistically downplaying associations that could have been explained by neutral evolution. Their methods were used to identify HLA-associated polymorphisms in the Gag and Protease proteins in a mixed-clade data set of 96 sequences from the same cohort investigated by Moore et al. [17]. By comparing associations identified using standard versus lineage-corrected analyses, Bhattacharya et al. [17] demonstrated that a number of associations identified by the original logistic regression method actually represented spurious associations that were better explained by founder effects. The authors noted, however, that phylogenetic correction achieves more than simply the elimination of spurious associations due to lineage effects. Consideration of tree structure also identified novel associations that were overlooked by the uncorrected analysis, thus allowing the potential for increased power compared to standard methods (though a formal power analysis was not undertaken) [17]. Overall, however, in the 96 sequences analyzed by Bhattacharya et al., only 14 strong HLA allele/HIV-1 sequence associations were identified [17]. The relative paucity of associations identified by Bhattacharya et al. [17] compared to Moore et al. [157] led some to question the contribution of HLA-mediated selection pressures to viral evolution [118]. What followed was a scientific debate regarding the extent of HLA-mediated viral imprinting at the population level. Were the data that we had previously accepted as strong evidence of immune-driven viral evolution simply explained by founder effects? Not necessarily. Not only did the Moore [157] and Bhattacharya [17] analyses feature different HIV-1 gene regions sequenced from different patient groups, but the latter was based on a data set less than one-quarter the size of the former, and thus had substantially reduced power to detect associations.
Indeed, mathematical modeling experiments suggest that the false-negative rate (i.e., the percentage of true associations missed) may be over 80% in a data set of N = 100 (chapter 6), and Bhattacharya et al. themselves emphasized that their results should not be interpreted as evidence that immune pressure is a weak force in HIV-1 evolution [17]. Rather, the results underscore the importance of disentangling the effects of immune selection from founder effects and emphasize the need for larger data sets to determine the extent to which immune imprinting shapes HIV-1 diversity at the population level.

3.5 HLA-associated immune pressures influence population HIV diversity at up to 40% of positions in some proteins

Soon after the Bhattacharya et al. publication [17], we applied the phylogenetically corrected methods to the first large-scale, population-based analysis of HLA-mediated imprinting on multiple HIV-1 genes [24]. Nearly 500 HLA-associated polymorphisms in the HIV-1 Protease, Reverse Transcriptase, Vpr and Nef proteins were identified in a cohort of ≈ 700 chronically-infected, antiretroviral-naïve individuals [24]. Polymorphisms were dichotomized based on the direction of selection pressure: escape and reversion associations represented amino acids enriched in the presence or absence of a specific HLA allele, respectively. As expected, escape and reversion associations generally represented non-consensus and consensus residues, respectively, although some exceptions were noted. In addition, HIV-1 codons under diametrically opposed HLA selection pressures (where the escaped amino acid for one allele represented the reverted (immunologically susceptible) form for another, and vice versa) were identified, highlighting a dynamic tug-of-war of immune selective pressures influencing HIV-1 diversity at the population level [24]. Of note, the fact that the majority of identified associations fell outside the boundaries of known epitopes led us to conclude that escape is not confined to single point mutations selected within regions directly targeted by CTL; rather, immune pressures select for a broad range of polymorphisms on a protein- (and genome-) wide level. (As will be discussed in section 3.8 and chapter 6, however, at least some of these associations may be due to other sources of confounding.) In addition, substantial differences in the number of immune selection events were observed among HIV proteins [24]. Nef exhibited the greatest evidence for immune adaptation, with ≈ 40% of its codons harboring at least one HLA association. In contrast, 10–15% of codons in Protease, Reverse Transcriptase, and Vpr exhibited evidence for HLA-mediated selection. More recent work suggests that the portion of Gag codons exhibiting HLA-associated substitutions is slightly higher than that of Pol [25]. Taken together, these data indicate that amino acid variation at a minimum of 10–40% of HIV-1 subtype B codons is driven by HLA class I-associated immune pressures, even after correction for lineage effects [24]. In addition, we later applied the same methods to the entire clade C proteome in a cohort of 261 South Africans and found evidence for 310 distinct HLA-mediated escape events spanning the entire proteome [196], thus reconfirming that CTL immune selection substantially shapes HIV-1 diversity at the population level.

3.6 Clinical consequences of immune-mediated evolution

Immune escape is believed to be a major factor limiting the immune system’s ability to control HIV-1 in the long term [85]. However, with the exception of documented loss of viremia control following escape within the B*27-restricted KK10 epitope in Gag [84], a clear relationship between CTL escape and HIV-1 disease progression has been difficult to demonstrate. A multitude of factors influence the frequency and kinetics of viral escape, making the clinical consequences of immune-driven HIV-1 evolution challenging to quantify. Even among individuals expressing the same HLA allele, substantial differences in the magnitude and frequency of epitope targeting are often observed, rendering it problematic to study the clinical consequences of escape on a population basis. A second complication is the issue of immunodominance, an incompletely understood phenomenon describing the fact that, despite the expression of up to six HLA class I alleles (and the potential to target multiple epitopes per allele), an individual’s CTL response is often initially directed against one or a few epitopes with a single HLA restriction [134, 236, 237]. Moreover, the breadth and specificity of targeted epitopes change throughout the disease course. While the acute-phase CTL response is generally narrowly directed, a progressive broadening of the epitopic repertoire occurs over time, likely as a consequence of viral escape within the early immunodominant epitopes [83]. In addition, the balance between increased fitness due to immune escape versus decreased fitness from the resulting substitution [137], the fact that escape is not always absolute (such that, in many cases, the selected variant retains partial HLA binding and/or CTL recognition capacity [219]), and the ability of the immune response to adapt to a continuously evolving target (as demonstrated by the emergence of de novo CTL responses against newly selected escape variants [4]), all contribute to the complexity of this issue. Finally, the fact that escape does not occur in isolation must also be considered, because mutations are likely to be selected in the context of a variety of compensatory changes and at the same time that reversion of transmitted mutations is occurring. Taken together, these factors make it unlikely that the selection of any single mutation would be accompanied by a clinically detectable impact on HIV-1 disease progression. Nevertheless, the proteome-wide association study of Rousseau et al. [196] identified seven polymorphisms that each significantly predicted viral load, highlighting the role these mutations play in viremic control. In addition, the identification of HLA-associated polymorphisms in large clinically derived data sets allows, for the first time, the characterization of the relationship between escape and markers of HIV-1 disease progression on a broader level. In their 2002 study, Moore et al. reported that the presence of HLA-associated polymorphisms in RT predicted pre-treatment plasma virus load (pVL) on a population basis [157]; however, this observation was not confirmed in the larger Brumme et al. [24] study that included Protease and Reverse Transcriptase.
More recent work has, however, begun to identify classes of associations that contribute strongly to viremic control, with the importance of B alleles, the Gag protein and reversion associations particularly highlighted (see section 5.6 for details). Further research is needed to assess whether there may be gene-specific differences in the contribution of CTL escape mutations to markers of HIV disease, the answer to which may greatly inform vaccine design.

3.7 Strategies to cope with viral diversity in HIV-1 vaccine design

Numerous strategies have been proposed to address the challenge of global sequence diversity in HIV-1 vaccine design. Immunogens based on consensus or phylogenetic reconstructions of ancestral and/or “center-of-tree” sequences attempt to minimize genetic distance between the vaccine and circulating HIV-1 strains [76, 163]. Polyvalent vaccine immunogens featuring maximal coverage of viral diversity and potential epitopes in compact sequence space have also been proposed [65, 107, 164]. To address the substantial mutational capacity of HIV-1, one proposed approach is to limit vaccine design to immunogenic yet highly conserved regions, such that escape could only occur at a substantial fitness cost [7]. A complementary strategy would be to design immunogens that incorporate both “wild-type” and “escaped” variants (as long as the variant retains its ability to bind HLA), thereby blocking preferred routes of escape in infected individuals [20].

The development of phylogenetically informed methods to accurately identify viral sites under active immune selection pressure using large clinically derived HIV-1 sequence data sets represents a major advance in the study of how viral genomes are shaped by human immunogenetic selection pressures. Achieving a complete picture of the sites, pathways and kinetics of immune escape will not only help us gain an understanding of the extent to which host immunity shapes HIV-1 evolution, but will also inform the rational design of future vaccine immunogens.

3.8 Remaining challenges

Despite recent advances in the field, a genome-wide map of HLA-associated polymorphisms in HIV-1 remains far from complete. Although the importance of addressing the confounding effects of HIV-1 population structure is now recognized [17, 24], three important issues need to be addressed: statistical power, linkage disequilibrium (LD) among HLA alleles and coevolution of amino acids in viral sequences.

As we have seen, the size of the study can dramatically affect the conclusions that are drawn. To date, the largest study analyzes a fairly homogeneous cohort of approximately 700 individuals, which resulted in dramatically different conclusions than studies on smaller cohorts. We will discuss this issue in the context of our proposed model in subsection 6.2.5, but the results discussed thus far underscore the importance of extending these types of cross-sectional studies to even larger cohorts, as well as assessing other HIV-1 subtypes, where levels and patterns of HLA-mediated imprinting may differ. Indeed, one advantage of lineage-corrected methods may be the ability to combine heterogeneous cohorts to increase power, though the extent of shared epitopes among clades remains incompletely characterized.

Of major concern are the potential confounding effects of linkage disequilibrium (LD) among HLA alleles. The concern over LD arises because HLA class I alleles are situated in close proximity on the human genome and are not inherited independently [28]; therefore, an escape mutation driven by HLA-B*57, for example, may also be detected as being associated with the tightly linked Cw*06 allele. Whereas Moore et al. [157] addressed the issue of LD using logistic regression and a backwards elimination technique to identify the allele(s) that best explained the escape polymorphism, current lineage-corrected methods [17, 24] do not directly address this issue.

Similarly, none of the current approaches account for the effects of amino acid coevolution, such as when an amino acid substitution at one site preferentially occurs in the context of a secondary (or compensatory) mutation at another site [21, 201].
Failure to account for this issue may result in both the primary and compensatory mutations being identified as correlated with the restricting HLA allele, when in fact only the former is directly selected. Although Moore et al. attempted to include co-varying amino acids as regressors [157], the confounding effects of phylogeny are even greater when two amino acids share the same evolutionary history; indeed, a substantial body of literature exists to address the problem of covarying amino acids [39], though these methods have not previously been extended to incorporate HLA allele information or even correlations among multiple codons (see Chapters 2 and 5). Similarly, although Bhattacharya et al. [17] found evidence of compensatory mutations using a modified version of the evolutionary history reconstruction method proposed by Ridley in 1983 [190], the results are difficult to interpret due to the confounding of long and short range effects that occur when only pairs of variables are considered. Clearly, more comprehensive approaches are needed to disentangle the complex interactions that govern HIV-1 escape pathways.


Chapter 4

PHYLOGENETIC DEPENDENCY NETWORKS

We have argued in Chapters 2 and 3 that the study of HIV adaptation (and, by analogy, any study of adaptation) requires a framework that can model the evolutionary history of an amino acid in conjunction with selection pressure from one or more factors (HLA alleles, other amino acids, etc.). In this chapter, we describe the phylogenetic dependency network (PDN), an extension of the dependency network [94] that is well suited for modeling adaptation. We will describe the PDN generally, using the area of HLA-mediated HIV adaptation to illustrate the concepts.

A dependency network represents the probabilistic dependencies among a set of predictor and target attributes [94]. In our domain, the set of target traits, denoted Y, corresponds to the presence or absence of amino acids at all codons in an HIV protein. For a given target trait Y in the set Y, the set of predictor traits, denoted X, corresponds to the presence or absence of amino acids at all codons other than that for Y and the presence or absence of all HLA alleles. We constrain all traits to be binary, though generalizations are possible. We have found that this choice yields more statistical power in practice.

A dependency network (phylogenetically corrected or otherwise) has two components. The first component, sometimes referred to as the structure of a dependency network, is a directed graph linking nodes, where each node corresponds to one of the traits in the domain. (We use the same name, e.g., Y, for the trait and its corresponding node in the graph.) An arc from X to Y in the graph is a statement that the probability distribution for Y depends on X. Thus, in our domain, a dependency network graphically depicts which HLA and codon traits predict each codon. The second component is a collection of conditional or local probability distributions, one for every target trait of interest. The local probability distribution for target trait Y is P(Y | X̂), where X̂ ⊆ X are the parents of Y in the graph. Therefore, in our domain, a dependency network contains a probability distribution for each codon trait conditioned on various HLA and codon traits. When constructing a dependency network, each local probability distribution is learned independently. This approach is computationally efficient, although it can lead to a decrease in statistical efficiency (see Discussion).

A phylogenetic dependency network (PDN) for our HIV application is a dependency network in which each local probability distribution is corrected for the phylogenetic structure of the HIV sequences. That is, the probability that a codon in an individual is a given amino acid depends not only on the traits X̂, but also on where that individual’s HIV sequence sits in the phylogeny (Figure 4.1). Specifically, a PDN is a collection of the distributions PΨ(Y | X̂), one for each Y in Y, where PΨ refers to a distribution corrected for phylogeny.

We use a model-selection approach to identify X̂, the set of parents for Y. Specifically, we use significance tests, namely False Discovery Rate (FDR) thresholds based on likelihood-ratio tests (LRTs), to determine X̂. To avoid the inappropriate use of an LRT, we exclude traits as possible predictors when the corresponding predictor-target pair has a 2 × 2 contingency table that includes at least one bin where both the observed and expected values are at most three. This parameter was chosen based on performance with independent data (not shown).

4.1 Phylogenetically corrected distributions for one predictor trait

A simple approach for identifying a set of traits that predict a given codon (i.e., for identifying the parents of a target trait in a PDN) is to test for pairwise correlations between a target codon and each predictor trait. The details of a statistical model that follows this approach, hereafter referred to as the univariate model, are described in section 4.4 and evaluated in chapter 5. The model was first presented in [17], with the details described in [31]. We will provide a high-level overview of the model here.

Figure 4.1: Phylogenetic dependency network (PDN). A PDN is a graphical model consisting of target traits whose outcome is a probabilistic function of predictor traits. Each of these probabilistic functions takes the phylogeny of the sequences into account. Here, the target traits (green nodes) are binary and represent the presence or absence of amino acids at codons. These target traits may have dependencies on other codons (codon covariation) and/or on HLA alleles (HLA-mediated escape), which are denoted by blue nodes. Arcs represent the learned dependencies between target and predictor traits. All target traits are assumed to be influenced by the phylogeny (red arcs). The probability components of a PDN are the local conditional probabilities, each of which relates a single target trait to the phylogeny and a subset of the predictor traits. These local conditional probabilities are learned independently for each target trait. In the hypothetical example depicted here, B*57 and B*58 predict M1 and A*02 predicts A5. A5 predicts A3, and there is a cyclical dependency among M1, G2, A3 and R4, in which most of the arcs are bidirectional.

To determine whether there is a significant pairwise correlation between predictor trait X and target amino acid Y, we compare the likelihood of a null model (sometimes referred to as the single variable model) that reflects the assertion that Y is under no selection pressure to an alternative model that reflects the assertion that Y is under selection pressure induced by a single predictor trait X.

The null model assumes the target codon Y can be described completely by a model of independent evolution along a phylogenetic tree (Figure 4.2A). The leaves of the tree correspond to individuals in the study and are typically observed. The interior nodes of the tree correspond to unseen individuals infected by an HIV sequence that is a point of divergence. These nodes are hidden; that is, they are never observed. We use Yi to denote trait Y for the ith individual in the study (i = 1, ..., N). (Note that Yi is a variable in the ordinary statistical sense.) Because each target trait is binary, a natural null model is the two-state version of the continuous time Markov process commonly used in phylogenetics [60]. This model assumes evolution is independent between different branches of the phylogeny and that the only informative predictor of a node in the evolutionary tree is its parent node in the tree.

The alternative model adds a component of selection pressure derived from the predictor trait X (Figure 4.2B). We use variable Xi to denote the trait X for the ith individual in the study (i = 1, ..., N), and do not explicitly name X for the unseen individuals represented in the interior of the phylogenetic tree. Because X may not share the same evolutionary history as Y, we assume X influences Y only at the leaves of the tree. In particular, we assume that, among the variables corresponding to trait X, only Xi influences Yi for each i.
This assumption was evaluated more fully by Carlson et al. [31] and found to be a reasonable approximation, even when X and Y share the same evolutionary history.

Figure 4.2: The univariate model. (A) The null model, in which an amino acid evolves independently down the tree until it reaches a leaf. (B) The alternative model, in which an amino acid evolves independently down the tree until it reaches an individual, where it is influenced by selection pressure from the predictor. The variable Hi for the ith individual represents the variable Yi had there been no influence from Xi. Only the Yi and Xi are observed. Conditional probability distributions are not shown.

To model selection pressure at the leaves, we extend the null model by adding a hidden trait H (with corresponding variables Hi, i = 1, ..., N) that represents what Y would have been had there been no selection pressure. The probability distribution for Yi then depends on Hi and Xi. When the values of Hi and Yi are different, we say that a transition conditioned on Xi has taken place. The precise rules governing the transitions conditioned on Xi are given by the univariate leaf distribution PΨ(Y | X) = P(Yi | Hi, Xi). We assume that this leaf distribution is not a function of i; that is, this distribution is the same for each individual i = 1, ..., N. Also note that the subscript Ψ is a reminder that Yi depends not only on Xi, but also on the phylogeny through variable Hi.

In the univariate case, we define four possible leaf distributions. Escape means an individual may transition to Yi = 0 only when Xi = 1. Reversion means an individual
may transition to Yi = 1 only when Xi = 0. Attraction means an individual may transition to Yi = 1 only when Xi = 1. Repulsion means an individual may transition to Yi = 0 only when Xi = 0. Given a univariate leaf distribution, a single parameter s specifies the probability that the transition occurs given the appropriate state of Xi . Note that attraction/repulsion correspond to a positive correlation between Xi and Yi , whereas escape/reversion correspond to a negative correlation. The names of these leaf distributions correspond to various processes for selection pressure [31]. For example, the B*57-restricted CTL response selects for escape from the susceptible threonine at position 242 of the HIV Gag protein [132]. So, from the perspective of hidden and target traits that correspond to the presence and absence of threonine, the amino acid can transition from threonine to not threonine (H = 1, Y = 0) with a non-zero probability only when the individual has the B*57 allele (X = 1), which corresponds to the escape distribution just described. In addition, escape from threonine bears a fitness cost that leads to reversion in B*57-negative individuals [132]. Consequently, the amino acid can transition from not threonine to threonine (H = 0, Y = 1) with non-zero probability only when the individual lacks B*57 (X = 0), corresponding to the reversion distribution. The codon for threonine usually escapes to the resistant amino acid asparagine [132]. Continuing the example from the perspective of hidden and target traits that correspond to the presence and absence of asparagine, the amino acid can transition from not asparagine to asparagine (H = 0, Y = 1) only when the individual has the B*57 allele (X = 1), which corresponds to the attraction distribution. Finally, the amino acid can transition from asparagine to not asparagine (H = 1, Y = 0) only when the individual lacks the B*57 allele (X = 0), which corresponds to the repulsion distribution. 
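
The four leaf distributions just described can be summarized in a few lines of code. The following is a minimal illustrative sketch, not the implementation used in this work: the function name `leaf_prob`, the string labels, and the argument order are assumptions made for the example. It returns P(Yi = y | Hi = h, Xi = x) for a chosen leaf distribution with transition probability s.

```python
def leaf_prob(kind, y, h, x, s):
    """P(Y_i = y | H_i = h, X_i = x) for one univariate leaf distribution.

    kind: "escape", "reversion", "attraction" or "repulsion" (labels assumed).
    s:    probability that the transition occurs, given the appropriate X_i.
    """
    # For each distribution: (value of H before the transition,
    #                         value of X that permits the transition).
    rules = {
        "escape":     (1, 1),  # H = 1 -> Y = 0 only when X = 1
        "reversion":  (0, 0),  # H = 0 -> Y = 1 only when X = 0
        "attraction": (0, 1),  # H = 0 -> Y = 1 only when X = 1
        "repulsion":  (1, 0),  # H = 1 -> Y = 0 only when X = 0
    }
    h_from, x_required = rules[kind]
    p_flip = s if (h == h_from and x == x_required) else 0.0
    # Y differs from H only if the transition happened.
    return p_flip if y != h else 1.0 - p_flip
```

In the threonine example above, with a hypothetical s = 0.4, `leaf_prob("escape", 0, 1, 1, 0.4)` returns 0.4 for a B*57-positive individual, while `leaf_prob("escape", 0, 1, 0, 0.4)` returns 0.0 for a B*57-negative individual.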
Although there is a natural pairing between escape/reversion and attraction/repulsion, in that the former indicates a negative correlation and the latter a positive correlation, the processes are each distinct and may provide information as to the underlying mechanism (see the section on distinguishing leaf distributions in Results). Furthermore, whereas the vast majority of clinically derived HIV sequences have either threonine or asparagine at codon 242, most codons are more variable, with more than one amino acid susceptible to, or resistant to, CTL pressure mediated by the HLA allele. Consequently, escape/attraction and reversion/repulsion for alternate amino acids often provide additional information.

Note that by restricting the univariate leaf distribution to one of these four forms, we have assumed that only one process (escape, reversion, attraction, or repulsion) is occurring for a given predictor-target pair. Although in reality both escape and reversion (or attraction and repulsion) may occur with the same HLA-epitope combination, relaxing our assumption leads to substantial loss of power. Thus, we apply each of the four leaf distributions to the predictor-target pair and include only the most significant correlation in the model.

4.2 Phylogenetically corrected distributions for more than one predictor trait

The univariate model works well when there are no correlations among predictor traits or among target traits [31]. As discussed, however, use of the model in the presence of linkage disequilibrium among HLA alleles and HIV codon covariation will likely lead to spurious associations. To avoid this problem, we use a multivariate model [33], in which more than one trait can be used to predict a particular target trait. In this model, shown in Figure 4.3 for a given target trait Y, the target trait is allowed to evolve independently down the tree until it reaches a leaf in the tree corresponding to an individual in the study. At this point, selection pressure within the individual is governed by a multivariate leaf distribution, denoted PΨ(Y | X̂), which depends on the multiple predictor traits X̂. As in the univariate case, this leaf distribution is the same for each individual i = 1, ..., N.

Figure 4.3: The multivariate model. Here, an amino acid evolves independently down the tree until it reaches an individual, where it is influenced by one or more predictor traits. (Legend: hidden attribute H; predictor attribute X; observed target attribute Y.)

The set of significant predictor traits can be identified by a number of methods including forward, backward, and forward/backward selection. In this work, we concentrate on forward selection, wherein X̂ is iteratively augmented with the most significantly associated trait at each iteration. For each added trait, we record only the most significant leaf distribution (escape, reversion, attraction, or repulsion). The significance of a predictor X with respect to target trait Y is computed using false discovery rates based on an LRT in which both the null and alternative models are conditioned on all significant predictors that were identified in previous iterations of forward selection. For practical purposes, we terminate forward selection when the most significant association has a p-value greater than or equal to some threshold to be described.

There are many possibilities for the form of the multivariate leaf distribution PΨ(Y | X̂). In this paper, we consider two distributions: Decision Tree and Noisy Add.
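
The forward-selection loop described above can be sketched independently of the choice of leaf distribution. The code below is a hedged illustration rather than the procedure as implemented in this work: `forward_select`, the `p_value` callback, and `toy_p_value` are hypothetical names, the callback stands in for an LRT conditioned on the already selected predictors, and the conversion of p-values to q-values and the choice among the four leaf distributions are not shown.

```python
def forward_select(candidates, p_value, threshold):
    """Greedy forward selection of predictor traits for one target trait.

    p_value(candidate, selected) is an assumed callback returning the p-value
    for adding `candidate`, with both null and alternative models conditioned
    on the traits already in `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scored = [(p_value(c, tuple(selected)), c) for c in remaining]
        best_p, best = min(scored, key=lambda pc: pc[0])
        if best_p >= threshold:      # stop when nothing is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_p_value(candidate, selected):
    """Made-up p-values: C06 looks significant only because of LD with B57."""
    if candidate == "B57":
        return 1e-6
    if candidate == "C06":
        return 0.6 if "B57" in selected else 1e-3
    if candidate == "M28":
        return 0.01
    return 1.0


parents = forward_select(["C06", "B57", "M28"], toy_p_value, threshold=0.05)
```

With these invented p-values, B57 is added first, after which C06 no longer reaches the threshold; `parents` ends up as `["B57", "M28"]`, which is the behavior conditioning on earlier predictors is meant to achieve.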


Decision Tree

A straightforward way to represent the multivariate leaf distribution PΨ(Y | X̂) is to list the probability distribution for Y given every possible instance of the traits H and X̂. Unfortunately, the length of this list grows exponentially with the number of predictor traits. An alternative is to use a Decision Tree, which is a compact representation of such a list. The Decision Tree was recently employed as a multivariate leaf distribution by Matthews et al. [153] to account for HLA LD. Here, we describe the approach in some detail.

A graphical depiction of the Decision Tree leaf distribution is shown in Figure 4.4. Note that this tree should not be confused with the phylogenetic tree. To help avoid this confusion, we use the term tip to refer to the bottom points on the Decision Tree. Each path in the tree from root to tip defines a particular instance of a subset of the traits X̂, which in turn defines a conditioning event for the distribution of the target trait. For example, in Figure 4.4, we consider the set of predictor traits X̂ = (B57, C06, M28), with each branch labeled 0 or 1. The path that follows the value 0 for the trait B57, the value 0 for the trait C06, and the value 1 for the trait M28 corresponds to the instance (B57 = 0, C06 = 0, M28 = 1); that is, the individual has M28 but not B57 or C06. At the tip of this path sits the corresponding conditional probability distribution PΨ(Yi | B57 = 0, C06 = 0, M28 = 1). In general, each tip k in the Decision Tree is associated with the conditional distribution PΨ(Yi | X̂ = pathk), where pathk is the conditioning event corresponding to the kth path. The collection of these conditional distributions over all tips constitutes the multivariate leaf distribution.

Figure 4.4: Decision Tree leaf distribution. Each path from root to leaf yields a distinct local probability distribution. (Nodes shown: B57, C06 and M28, with branches labeled 1 and 0. Legend: predictor variable X; local probability distribution.)

A Decision Tree leaf distribution can be constructed in many ways. As mentioned, we use forward greedy search. First, we initialize the tree to a single root node, which is simply the univariate leaf distribution for the most significant trait. We then grow the tree iteratively. At each iteration, we consider extending (or splitting) a tip node k on some trait not already in the path to the tip. When splitting tip node k on a trait X, the node is replaced with two branches and two corresponding tip nodes. The left and right branches correspond to adding X = 1 and X = 0, respectively, to the conditioning event associated with the original tip node. The split is made if the resulting local distribution is a significantly better estimate than that prior to the split, as measured by an LRT. The LRT is computed using the univariate model applied to those individuals whose trait values match those described by pathk. To make the process more efficient in our HIV application, we consider splitting the tip node only under the path X = 0 for all X in X̂. That is, we repeatedly apply the univariate model to all individuals for whom X = 0 for all the previously identified significant predictor traits. We iterate this process until no significant predictors are found, using a threshold of p < 0.05.

Noisy Add

One drawback of the Decision Tree approach is that, as the tree grows, the number of samples that we use to test for the next split decreases. Rather than consider smaller and smaller subsets of the data, the Noisy Add leaf distribution models selection pressure as an additive process among the predictor traits. That is, the Noisy Add leaf
distribution is based on the assumption that each predictor trait independently contributes a positive or negative selection pressure on the target trait. These pressures then sum to determine the value of the target trait.

In the univariate case, each leaf distribution can be seen as representing three mutually exclusive and exhaustive events (for each individual): (1) the selection pressure is absent, either because the state of the predictor trait excludes selection pressure or, with probability 1 − s, no transition occurred despite the potential for selection pressure; (2) selection pressure leads to Yi = 1 (attraction or reversion); or (3) selection pressure leads to Yi = 0 (escape or repulsion). We can represent these three possible events by a hidden trait I that takes on the values 0, 1, and −1, respectively. Given a set of M predictor traits, we can associate a hidden variable Iij with the jth trait in the ith individual. Then, assuming that selection pressure across the predictor traits contributes independently and equally to the outcome of Yi, we can determine the outcome of Yi by summing the values of the Iij variables: Σi = Ii1 + · · · + IiM. If Σi is 0, then it is as if no selection occurred, and Yi takes the value of Hi. If Σi < 0, then negative selection (escape/repulsion) has occurred, and the target variable Yi will be zero. If Σi > 0, then positive selection (attraction/reversion) has occurred, and the target variable will be one. Of course, we don’t know the actual values of Iij for each predictor variable, so we must sum over the possibilities, resulting in a probability distribution over Σi. The strength or frequency of selection pressure contributed by each predictor trait j is captured by the parameter sj. Like the corresponding parameter s in the univariate model, sj is the probability that the predictor trait exerts selection pressure (Iij ≠ 0), given the appropriate state for the predictor trait.
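
The summation over the hidden Iij variables can be illustrated with a short convolution. This is a sketch under the assumptions stated in the text (independent, equally weighted pressures) and is not the precise definition given in the section on model details; the function names and the `(sign, s, active)` encoding of each predictor trait are hypothetical.

```python
def sigma_distribution(pressures):
    """Distribution of Sigma_i = I_i1 + ... + I_iM for one individual.

    pressures: one (sign, s, active) tuple per predictor trait, where sign is
    +1 (attraction/reversion) or -1 (escape/repulsion), s is the probability
    that I_ij != 0 when the trait is in its appropriate state, and active
    says whether the trait is in that state for this individual.
    """
    dist = {0: 1.0}
    for sign, s, active in pressures:
        p = s if active else 0.0
        updated = {}
        for total, prob in dist.items():
            # I_ij = 0: the running sum is unchanged.
            updated[total] = updated.get(total, 0.0) + prob * (1.0 - p)
            # I_ij = sign: the running sum moves by one step.
            if p > 0.0:
                updated[total + sign] = updated.get(total + sign, 0.0) + prob * p
        dist = updated
    return dist


def noisy_add_prob_y1(h, pressures):
    """P(Y_i = 1 | H_i = h, predictors): Y = 1 if Sigma > 0, Y = H if Sigma = 0."""
    dist = sigma_distribution(pressures)
    positive = sum(prob for total, prob in dist.items() if total > 0)
    return positive + (dist.get(0, 0.0) if h == 1 else 0.0)
```

With a single predictor the sketch reduces to the univariate case: for an escape pressure with s = 0.4 in an individual carrying the allele, `noisy_add_prob_y1(1, [(-1, 0.4, True)])` gives 0.6, matching the escape leaf distribution.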
A more precise definition of Noisy Add, including the generalization from the univariate model, specifics of learning the parameters sj, and methods for reducing computation time, can be found in the section on model details.

The contribution of a given predictor trait X^j ∈ X̂ as a predictor of target Y is quantified using an LRT against the null model consisting of X̂ − X^j. The most significant predictor trait is added to the Noisy Add model on each iteration, stopping when the most significant predictor fails to achieve p < 0.005. (We use a more aggressive threshold than that for Decision Tree because Noisy Add is more computationally intensive.)

4.3 q-values

We identify significance using q-values [211], which conservatively estimate the false discovery rate (FDR) [15] for each p-value. The FDR is defined to be the expected proportion of false positives among results called significant at a given threshold t. The q-value of t is the minimum FDR observed for all t′ ≥ t [211]. Following Storey and Tibshirani [211], we use the approximation

$$\mathrm{FDR}(t) = E\left[\frac{F(t)}{S(t)}\right] \approx \frac{E[F(t)]}{E[S(t)]}, \tag{4.1}$$

where S(t) is the number of associations called significant at t and F(t) is the number of true nulls called significant (false positives) at t. To estimate the denominator, we order the p-values of the m association tests in increasing order p1, ..., pm and use the approximation E[S(pi)] ≈ S(pi) = i. To compute E[F(t)], Storey and Tibshirani point out that uniformity of p-values allows the approximation

$$E[F(p_i)] \approx \hat{\pi}_0 \, p_i \, m, \tag{4.2}$$

where π̂0 is a (conservative) estimate of the proportion of all hypotheses that are truly null. In our case, we assume a priori that the vast majority of the many hypotheses tested will be null (i.e., most codons and HLA alleles have no direct effect on a given target trait), and so conservatively set π̂0 = 1. It should be pointed out that

$$E[F(p_i)] = \hat{\pi}_0 \, p_i \, m \tag{4.3}$$

is only guaranteed if the data are continuous and the p-values are uniformly distributed under the null hypothesis. The continuity assumption is certainly not met for genetic data, which are discrete, and the p-values are quite often not uniformly distributed. Thus, E[F(p)] often has to be estimated directly from the data. We will discuss this in more detail in sections 5.1.2 and 6.1.1.

4.4 Model Details

In this section, we provide details regarding the univariate and Noisy Add models, in addition to a brief discussion on computational requirements for the models.

4.4.1 Details of univariate model

First, let us consider the null model. Consider target trait Y that denotes the presence (Y = 1) or absence (Y = 0) of a particular amino acid at a particular codon. We use variable Yi, i = 1, ..., N, to denote the trait Y for the ith individual in the study. (We use corresponding notation for predictor traits and variables.) It is quite common to assume that the variables Y1, ..., YN are independent and identically distributed (IID). In our application, however, the variables are related through a phylogenetic tree. We can model these relationships using a probabilistic phylogenetic model as shown in Figure 4.2A. Nodes at the leaves of the tree, labeled Y1, ..., YN, correspond to the variables with the same name. (In general, we will use the same designation for both a variable and its node.) Unlabeled nodes in the interior of the tree correspond to events of divergence. We use Ψ to denote the structure (branchings and branch lengths) of the tree.

Associated with each variable (or node) B in this phylogenetic tree is a conditional probability distribution P(B | A), where A is the parent node of B. As in the probabilistic model of Felsenstein [60] for a phylogenetic tree, we assume that the conditional probability table is described by a continuous time Markov process (CTMP) and parameterized by θ = (π, λ), where π is the stationary probability of Y = 1 and λ is the rate of mutation. The conditional probability table of the CTMP from parent node A to child node B along a branch of length d is given by

$$P(B = b \mid A = a, d) = \begin{cases} e^{-\lambda d} + \pi_b \, (1 - e^{-\lambda d}) & \text{if } a = b, \\ \pi_b \, (1 - e^{-\lambda d}) & \text{if } a \neq b, \end{cases} \tag{4.4}$$

where πb = π when b = 1, and πb = 1 − π when b = 0. This evolution model is reversible, making the choice of root in the tree arbitrary [60].

Given a set of observations for (typically, all of) Y1, ..., YN, there are several criteria that can be used to identify good values for the parameters π and λ and the structure Ψ of this model (or, in the Bayesian case, a distribution over these quantities). For this and all models discussed in this paper, we choose parameters and structure using the maximum likelihood criterion, as is done in (e.g.) [60]. There are a number of methods for identifying the maximum-likelihood parameters, including gradient descent and the Expectation-Maximization (EM) algorithm. In this work, we use the EM algorithm [49] to learn θ. To learn the structure Ψ, we apply PhyML to the nucleotide sequences using the general time reversible (GTR) model with all other parameters estimated from the data [87]. We denote this null model PΨ(Y|θ), as it represents a phylogenetically corrected distribution for Y. Note that this model includes the situation where the observations of Y1, ..., YN are IID as a special case (i.e., the limit as λ tends to infinity).

Now let us consider the alternative model, which reflects the assumption that a codon is under selection pressure induced by a single predictor trait X. To construct this model, shown in Figure 4.2B, we begin with the null model and first change each Yi to Hi, which represents what Yi would have been had there been no influence from Xi. Then, we assume that, for each individual i, the probability distribution for Yi depends on Xi and Hi. Further, we assume that these conditional distributions P(Yi|Hi, Xi) are the same for each individual i, and collectively denote them by Pψ(Y|X). In general, this univariate leaf distribution can have four parameters corresponding to the four states of the conditional variables Hi and Xi.
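As a small numerical check of the null model's branch distribution, the following sketch evaluates Equation 4.4 for a binary trait and verifies the two properties noted in the text: reversibility, and convergence to IID leaves as λ grows. The function name `transition_prob` and the parameter values are illustrative assumptions, not part of the original implementation.

```python
import math

def transition_prob(b, a, d, pi, lam):
    """P(B = b | A = a, d) from Equation 4.4 for a binary trait."""
    pi_b = pi if b == 1 else 1.0 - pi
    decay = math.exp(-lam * d)
    stay = decay if a == b else 0.0
    return stay + pi_b * (1.0 - decay)

# Illustrative values only (assumed, not estimated from any data set).
pi, lam, d = 0.3, 2.0, 0.4
stationary = {0: 1.0 - pi, 1: pi}

# Each row of the conditional probability table sums to one.
for a in (0, 1):
    row_sum = sum(transition_prob(b, a, d, pi, lam) for b in (0, 1))
    assert abs(row_sum - 1.0) < 1e-12

# Reversibility (detailed balance): pi_a * P(b | a) == pi_b * P(a | b).
lhs = stationary[0] * transition_prob(1, 0, d, pi, lam)
rhs = stationary[1] * transition_prob(0, 1, d, pi, lam)
assert abs(lhs - rhs) < 1e-12

# As lambda grows, the child forgets its parent: P(B = 1 | A = a) tends to pi.
iid_limit = transition_prob(1, 0, d, pi, lam=1e6)
```

With a very large λ the exponential term vanishes, so every leaf is drawn from the stationary distribution regardless of its parent, which is the IID special case.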
In our experience, however, use of such a distribution leads to loss of power. Consequently, we consider four separate distributions (as was previously defined [31]) and, for any given association, choose the one that best fits the data:

Escape: P(Yi = 0 | Hi = 1, Xi = 1) = s > 0; P(Yi = 1 | Hi = 0, Xi = 1) = 0; P(Yi = a | Hi = a, Xi = 0) = 1. That is, Hi and Yi can be in different states only when Hi = 1 and Xi = 1.

Reversion: P(Yi = 1 | Hi = 0, Xi = 0) = s > 0; P(Yi = 0 | Hi = 1, Xi = 0) = 0; P(Yi = a | Hi = a, Xi = 1) = 1. That is, Hi and Yi can be in different states only when Hi = 0 and Xi = 0.

Attraction: P(Yi = 1 | Hi = 0, Xi = 1) = s > 0; P(Yi = 0 | Hi = 1, Xi = 1) = 0; P(Yi = a | Hi = a, Xi = 0) = 1. That is, Hi and Yi can be in different states only when Hi = 0 and Xi = 1.

Repulsion: P(Yi = 0 | Hi = 1, Xi = 0) = s > 0; P(Yi = 1 | Hi = 0, Xi = 0) = 0; P(Yi = a | Hi = a, Xi = 1) = 1. That is, Hi and Yi can be in different states only when Hi = 1 and Xi = 0.

This model is reversible in the sense that the choice of root node among non-leaf nodes does not affect the likelihood of the data. We also note that, in principle, all parameters θ = (π, λ, s) and the structure Ψ can be optimized simultaneously. In practice, however, we find that using the structure Ψ learned in the absence of information about X works well, and is computationally more efficient. In addition, it may seem counter-intuitive that the HLA alleles of the individuals corresponding to the interior nodes of the phylogeny are not being taken into account. A path from one node to the next in the phylogeny, however, presumably reflects a series of infections over many individuals, some who will have the allele and some who will not. Thus, there will be some net evolution, which we account for by optimizing the parameters π and λ for each codon individually. Finally, we note that this model can be thought
of as a (discrete) mixed-effects model, wherein the predictor variables Xi correspond to the fixed effects and the hidden variables Hi correspond to the random effects [40]. Rather than being related by (e.g.) a Gaussian covariance matrix, the random effects are related by a phylogenetic tree.

Both the null and alternative models are instances of what is known as a generative or directed acyclic graphical (DAG) model. In general, a generative model consists of a structure, a directed acyclic graph, in which nodes correspond to variables and missing arcs specify conditional independencies among the variables, and a set of conditional probability distributions, one distribution for each node. The conditional probability distribution for a given node is the distribution of the node given its parents. The conditional independencies specified by the structure of the graph allow the joint distribution of the data to be written as the product over the nodes of their conditional distributions. The independencies represented by the model facilitate computationally efficient inference, parameter estimation, and structure learning [93]. Importantly, given a set of parameters learned from real data, synthetic data can be easily generated from the model. When constructing PDNs, we separately learn a DAG model to encode each local probability distribution. As mentioned in the Discussion, however, one can restrict the arcs in a PDN to be acyclic, thus resulting in a single (phylogenetic) DAG model for all the traits in the data set.

In the following section, we consider the multiple-predictor case and again use graphical models to represent phylogenetically corrected distributions. As we shall see, the computational efficiencies afforded by graphical models will play an even more important role.

4.4.2 Details of Noisy Add model

To understand the Noisy Add leaf distribution, let us recast the univariate distribution as the generative process shown in Figure 4.5A. (Recall that this distribution is independent of i. In the text that follows, we describe this and the generalized process
for an arbitrary individual i. In the corresponding figures, we drop the subscript i to simplify the notation.) If Xi = 1 (for escape or attraction; Xi = 0 for reversion or repulsion), a coin weighted with probability s for heads is flipped. If the coin lands heads, then the intermediate variable Ii gets the value 1 (for attraction or repulsion; -1 for escape and reversion). Otherwise, Ii gets the value 0, corresponding to no selection pressure. The value of Ii is then copied to the value of another variable Σi. (The copy is not necessary here, but will help us generalize.) Finally, the target variable Yi is assigned a value based on the deterministic function shown in Figure 4.5B. With a little checking, it can be seen that this process produces precisely the univariate leaf distributions for escape, reversion, attraction, and repulsion.

The generalization of this process to multiple predictor variables X̂i = Xi1, ..., XiM is shown in Figure 4.5C. Here, there is an Xij and Iij node for each predictor variable Xij. The weight on the coin is possibly different for each predictor variable. We use sj to denote the weight for predictor variable Xij, and s to represent the collection of parameters (s1, ..., sM). The variable Σi is now a sum of the intermediate variables Ii1, ..., IiM. Finally, as in the univariate case, Yi is a deterministic function of Σi and Hi as given in Figure 4.5B. Applying this generative process to individuals i = 1, ..., N, we obtain the conditional distribution P(Y1, ..., YN, H1, ..., HN, I11, ..., INM | X11, ..., XNM, θ), where θ = (s, π, λ) are the parameters of the model. Maximum likelihood values for these parameters can be inferred efficiently. The summation Σi = Ii1 + ... + IiM can be grouped as Σi = (((Ii1 + Ii2) + Ii3) + ... + IiM), yielding the graphical model shown in Figure 4.5D.
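The grouped summation can be illustrated with a minimal sketch that accumulates the distribution of Σ one predictor at a time, so that each step touches at most 2M + 1 partial sums. The function names, the argument encoding (`active`, `sign`), and the example values are assumptions made for illustration. The rule that Σ < 0 forces Y = 0 is inferred from the requirement that the process reproduce the escape and reversion leaf distributions.

```python
def sigma_distribution(x, active, sign, s):
    """Distribution of Sigma = I^1 + ... + I^M, built one predictor at a time.

    x[j]      observed predictor value (0 or 1)
    active[j] predictor state that triggers the coin flip
              (1 for escape/attraction, 0 for reversion/repulsion)
    sign[j]   value of I^j on heads (+1 attraction/repulsion, -1 escape/reversion)
    s[j]      probability of heads for predictor j
    Returns a dict mapping each reachable value of Sigma to its probability.
    """
    dist = {0: 1.0}
    for xj, aj, gj, sj in zip(x, active, sign, s):
        if xj != aj:
            continue  # no coin flip: I^j = 0 with certainty
        nxt = {}
        for total, p in dist.items():
            nxt[total] = nxt.get(total, 0.0) + p * (1.0 - sj)      # tails: I^j = 0
            nxt[total + gj] = nxt.get(total + gj, 0.0) + p * sj    # heads: I^j = sign
        dist = nxt
    return dist

def prob_y1(h, x, active, sign, s):
    """P(Y = 1 | H = h, X = x): Y = 1 if Sigma > 0, Y = 0 if Sigma < 0, Y = H if Sigma = 0."""
    dist = sigma_distribution(x, active, sign, s)
    positive = sum(p for total, p in dist.items() if total > 0)
    return positive + dist.get(0, 0.0) * h

# Univariate escape with s = 0.3: P(Y = 0 | H = 1, X = 1) = s.
escape_flip = 1.0 - prob_y1(1, [1], [1], [-1], [0.3])

# Two predictors: attraction (s = 0.4) and escape (s = 0.5), both present.
p_h0 = prob_y1(0, [1, 1], [1, 1], [+1, -1], [0.4, 0.5])
p_h1 = prob_y1(1, [1, 1], [1, 1], [+1, -1], [0.4, 0.5])
```

In the two-predictor example, Σ = +1 with probability 0.4 × 0.5 = 0.2 and Σ = 0 with probability 0.2 + 0.3 = 0.5, so P(Y = 1) is 0.2 when H = 0 and 0.7 when H = 1.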
This grouping makes it possible to compute the distribution for Yi for any instance of the variables X̂i and Hi in time that is quadratic in M. Furthermore, given any instance of the predictor variables X̂i, Hi, and Yi, the probability distributions for Ii1, ..., IiM can be computed in time that is quadratic in M. Consequently, we can use the EM algorithm to estimate the parameters s efficiently. To estimate the full set of Noisy Add parameters θ, we embed this estimation procedure within an outer

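[Figure 4.5 residue: panel A shows nodes X, I, Σ, H, and Y with coin weight s; panel B tabulates Y as a function of Σ and H. Recoverable rows of panel B (Σ, H, Y): (0, 0, 0), (0, 1, 1), (>0, 0, 1), (>0, 1, 1). The remaining rows and the caption are missing.]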

0 associations, then the amino acid was generated according to the given multivariate model with the predictor parameters
s1, ..., sM, taken from the real data. When an observation was missing in the real data, the corresponding observation in the synthetic data was also made to be missing. We treated amino acid insertions/deletions and mixtures as missing data. Our goal was to generate data that is as realistic as possible, both in the values of the parameters used and the number of predictors deemed correlated with the target. Because our recall rate is less than 100% (see section on synthetic results), planting only those associations that are found in the real data would result in a smaller proportion of synthetic predictor-target pairs called significant than real predictor-target pairs called significant. We therefore planted two associations for every observed significant association in the real data and reduced the number of independently evolving codons accordingly. For the Noisy Add model, this procedure planted 72 HLA-codon and 612 codon-codon associations in the HOMER cohort and 114 HLA-codon and 952 codon-codon associations in the combined HOMER-Durban cohort. In hindsight, doubling the number of planted associations was an overcompensation, as experiments on this synthetic data yielded a 75% recall rate. Nonetheless, the doubling produced a reasonable result, as Noisy Add declared 0.56% of all synthetic predictor-target pairs significant at q ≤ 0.2 compared to 0.65% of all predictor-target pairs in the real data for the combined HOMER-Durban cohort.

6.1.3 Data analysis

As mentioned, we binarized all data. For example, if three amino acids were observed at a given sequence position, we created three binary attributes corresponding to the presence and absence of each amino acid. When reporting results, however, we assumed that the most relevant information was at the codon level. Thus, unless stated otherwise, HLA-codon associations refer to the most significant associations between an HLA allele and any observed amino acid at the codon under any of the four leaf distributions. Likewise, codon-codon associations refer to the most significant association between the codons over all the associations computed for the complete
repertoire of observed amino acids and possible leaf distributions at those codons. This approach was taken exclusively in the synthetic studies, though the results were similar when we looked at exact associations (at the level of observed residues and leaf distributions; data not shown).

We report power results as Precision-Recall (PR) curves, where the x-axis is recall (TP/(FN + TP)) and the y-axis is precision (TP/(TP + FP)), where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. To construct PR curves, we computed precision and recall for every observed q-value for each method. We used as a gold standard the synthetic data as described in the previous section. Accuracy of q-values, called calibration, is plotted as (1 − Precision) versus q-value. A perfectly calibrated result is a line with slope one.

To compare two PR curves, we computed p-values using the absolute value of the difference between the areas under the two curves as the statistic. The null distribution assumes the two curves will on average provide the same ranking over the predictor-target pairs and is constructed using a permutation test in which two pseudo-curves are generated by randomly swapping the ranks between the two methods for each predictor-target pair. That is, if methods M1 and M2 provide ranks of r1 and r2, respectively, for a predictor-target pair PT, then with probability 0.5, M1 will be reassigned rank r2 and M2 will be reassigned rank r1 for PT. Resulting ties in ranks were broken at random. Ranks were used rather than q-values so that the scores of two uncalibrated methods could be compared directly. Each p-value was computed from 10,000 permutations.

6.2 Model validation on synthetic data

In this section, we use synthetic data to demonstrate the power and calibration of the proposed models and to demonstrate that failure to account for the phylogenetic tree, linkage disequilibrium (LD) among HLA alleles, and covariation among the amino acids will lead to a significant drop in power and inflation in estimates of significance.

6.2.1 Noisy Add represents real data better than Decision Tree

We have described two models that can each simultaneously account for the shared evolutionary history among viral sequences, linkage disequilibrium among HLA alleles, and covariation among the HIV amino acids. Before proceeding, it is useful to determine which of the two models better represents the real data. To examine this issue, we generated synthetic data from the HOMER Gag data according to (1) the Decision Tree model fit to real data (D(DT)), and (2) the Noisy Add model fit to real data (D(NA)). We then applied both models to both data sets. In general, the model that generated the data should be the optimal model for performing inference on that data. We indeed found this to be true in our experiments, but in addition, we found that the performance of the Noisy Add model was equivalent to that of the Decision Tree model on D(DT) (there was no detectable difference between the PR curves; p = 0.46), whereas the performance of the Noisy Add model on D(NA) was significantly better than that of the Decision Tree model (p < 0.0001) (Figure 6.2). Thus, the Noisy Add model appears to be better able to capture the relationships in the true data than the Decision Tree model. Consequently, in what follows, we concentrate exclusively on the Noisy Add model. We note, however, that the Decision Tree model is computationally more favorable and may be useful when resources are limited.
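The p-values comparing PR curves in this section come from the rank-swap permutation test described under Data analysis. A minimal sketch of that test follows. The function names, the trapezoidal area computation, and the add-one correction in the returned p-value are assumptions made for illustration and are not taken from the original implementation.

```python
import random

def pr_auc(ranks, truth, rng):
    """Area under the precision-recall curve; a lower rank is more significant.

    Ties in rank are broken at random using rng. The area is accumulated with
    the trapezoid rule, starting from the point (recall 0, precision 1).
    """
    order = sorted(range(len(ranks)), key=lambda i: (ranks[i], rng.random()))
    n_pos = sum(truth)
    tp, area, prev_recall, prev_prec = 0, 0.0, 0.0, 1.0
    for k, i in enumerate(order, start=1):
        tp += truth[i]
        recall, prec = tp / n_pos, tp / k
        area += (recall - prev_recall) * (prec + prev_prec) / 2.0
        prev_recall, prev_prec = recall, prec
    return area

def pr_permutation_pvalue(ranks1, ranks2, truth, n_perm=10000, seed=0):
    """Permutation p-value for |AUC(method 1) - AUC(method 2)|."""
    rng = random.Random(seed)
    observed = abs(pr_auc(ranks1, truth, rng) - pr_auc(ranks2, truth, rng))
    hits = 0
    for _ in range(n_perm):
        p1, p2 = [], []
        for r1, r2 in zip(ranks1, ranks2):
            if rng.random() < 0.5:  # swap the two methods' ranks for this pair
                r1, r2 = r2, r1
            p1.append(r1)
            p2.append(r2)
        stat = abs(pr_auc(p1, truth, rng) - pr_auc(p2, truth, rng))
        if stat >= observed:
            hits += 1
    # Add-one correction (an assumption) keeps the estimate away from zero.
    return (hits + 1) / (n_perm + 1)

# Toy example: 40 predictor-target pairs, the first 10 truly associated.
truth = [1] * 10 + [0] * 30
good = list(range(1, 41))        # ranks the true pairs first
bad = list(range(40, 0, -1))     # ranks the true pairs last
p_same = pr_permutation_pvalue(good, good, truth, n_perm=200)
p_diff = pr_permutation_pvalue(good, bad, truth, n_perm=2000)
```

When the two methods produce identical rankings, every permuted statistic equals the observed value of zero and the p-value is 1. When one ranking is the reverse of the other, almost no permutation reproduces the observed gap, so the p-value is close to 1/(n_perm + 1).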

6.2.2 Covariation confounds simple tests

As we have discussed, there are at least three major sources of statistical confounding for HIV-HLA association tests: phylogeny (P), linkage disequilibrium among HLA alleles (L), and covariation among HIV codons (C). Previous approaches to finding HLA-associated polymorphisms have accounted for LD but not phylogeny [157], accounted for phylogeny but not LD [17], or accounted for phylogeny and LD but not covariation [24, 25, 153, 196]. None of the previous approaches considered HIV codon covariation. To compare the relative contribution of each of these sources of confounding, we constructed five models that each account for a subset of the confounding sources as well as a baseline model that does not account for any source of confounding:

1. No correction for confounding (MFET). We use Fisher’s exact test to compute exact p-values for associations between X and Y assuming X and Y are independent and identically distributed across individuals.

2. HLA LD only (ML). We use the Noisy Add model where only HLA-allele attributes are predictors and no correction for phylogenetic structure is made (achieved by fixing λ to be infinity). This model is similar to the one used by Moore et al. [157], except that Moore et al. used logistic regression rather than Noisy Add.

3. HLA LD and covariation only (MLC). We use the Noisy Add model (where both HLA-allele attributes and attributes representing other codons are predictors) with no correction for phylogenetic structure (λ set to infinity). This model is similar to a second model in Moore et al., who suggested adding other codons as covariates to their logistic regression model [157]. Bhattacharya et al. later suggested that this approach could implicitly correct for some of the effects of the phylogeny [17]. As we shall see, it does when considering HLA-codon associations, but does not when considering codon-codon associations.

4. Phylogeny only (MP). We use the univariate model where only HLA-allele attributes are predictors. This model is the second method described in [17] and fully evaluated in [31].

5. Phylogeny and HLA LD (MPL). We use the Noisy Add model where only HLA-allele attributes are predictors. Matthews et al. [153] used this approach with the Decision Tree leaf distribution. Also, this model is similar to the approach described in [24, 196], wherein the univariate model in [17] is followed by an ad hoc post processing step that identifies HLAs in LD that are most likely to be responsible for immune pressure.

6. HLA LD, covariation, and phylogeny (MPLC). We use the Noisy Add model.

[Figure 6.2 plot residue: two panels plotting precision (y-axis) against recall (x-axis); curves: Noisy Add, Decision Tree.]

Figure 6.2: Noisy Add represents real data better than Decision Tree. Synthetic data were generated according to the Decision Tree model fit to real data (A) and the Noisy Add model fit to real data (B). On both data sets, the Noisy Add model performs at least as well as the Decision Tree model. In contrast, the Decision Tree model does poorly when applied to data generated from the Noisy Add model.

Ability to identify direct HLA-codon associations. Because the primary purpose of previous studies has been to find HLA-mediated adaptations in the HIV genome, we first looked at the ability of these models to recover HLA-codon associations, ignoring codon-codon associations. Figure 6.3A shows the precision-recall (PR) curves for the six methods when run on synthetic data from the
HOMER cohort. These curves indicate that all three sources of confounding play a significant role, and failure to account for any one of them leads to a dramatic drop in power. Although confounding due to both phylogeny and HLA linkage disequilibrium has been previously recognized [17, 24, 153, 157, 196], these curves demonstrate the significant confounding effect of codon covariation. As we have discussed, this observation can be explained by the failure of the univariate model to distinguish between direct and one-hop associations. Although both associations may be considered HLA associations, there are practical implications to distinguishing a direct association, which is likely to be the primary (e.g., most common, rapidly selected or necessary) escape mutation in an HLA-restricted epitope, and an indirect association, which may (e.g.) compensate for fitness costs introduced by the primary escape mutation or provide further escape in the context of the primary escape mutation.

It is interesting to note that accounting for phylogeny and linkage disequilibrium (MPL) does not appear to increase power over accounting for linkage disequilibrium alone (ML) or even baseline (MFET), and accounting for all three confounders (MPLC) has only a modest (but significant, p = 0.009) increase in power over accounting only for linkage disequilibrium and codon covariation (MLC). One reason may be the relative homogeneity of the HOMER cohort (97% clade B), which limits the amount of power that can be gleaned from the phylogeny. It is important to note, however, that any unaccounted for structure in the data will lead to an increased bias in the LRT and thus the q statistic [31]. This effect is seen here in the poor q-value calibration of the phylogeny-naïve models shown in Figure 6.3B. Only the models that account for at least phylogeny and LD (MPLC and MPL) have calibrated q-values. In contrast, the models that do not account for phylogeny or linkage disequilibrium grossly exaggerate significance.
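The calibration curves discussed here plot (1 − precision) against the reported q-value. A minimal sketch of how such points might be computed on synthetic data with known truth is given below. The function name and the toy inputs are assumptions made for illustration.

```python
def calibration_points(qvalues, truth):
    """For each distinct reported q-value, return (q, 1 - precision).

    Precision is computed over all calls with a q-value at or below q.
    A perfectly calibrated method yields points on the diagonal.
    """
    order = sorted(range(len(qvalues)), key=lambda i: qvalues[i])
    points, tp = [], 0
    for k, i in enumerate(order, start=1):
        tp += truth[i]
        # Emit a point only once every call tied at this q-value is included.
        if k == len(order) or qvalues[order[k]] > qvalues[i]:
            points.append((qvalues[i], 1.0 - tp / k))
    return points

# Toy example: four calls, three of them true associations.
pts = calibration_points([0.1, 0.1, 0.5, 0.5], [1, 1, 1, 0])
```

Points that lie above the diagonal indicate that the observed false discovery proportion exceeds the reported q-value, which is the sense in which a model exaggerates significance.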

[Figure 6.3 plot residue: panels A and C plot precision (y-axis) against recall (x-axis); panels B and D plot 1 − precision (y-axis) against q (x-axis). Legend: PLC, LC, PL, L, P, FET.]

Figure 6.3: Performance on data generated from the 97% clade B HOMER cohort. Precision-recall (A) and calibration curves (B) of the models with respect to HLA-codon associations; precision-recall (C) and calibration curves (D) of the models with respect to both HLA-codon and codon-codon associations. Better precision-recall curves are ones that tend toward the upper right of the plot. Curves with perfect calibration follow the diagonal.

Ability to identify codon covariation.

The fact that codon covariation significantly confounds HLA-codon association statistics suggests that many of the codons are strongly influenced by polymorphisms at other positions. Indeed, prediction of covarying amino acids has a rich literature, with most methods unable to scale beyond pairs of covarying amino acids or to statistically account for the shared phylogeny [39, 182]. We therefore measured the ability of MFET , MP , MLC and MPLC to recover codon-codon associations in addition to HLA-codon associations.

The full Noisy Add model (MPLC) achieves roughly the same power as it did for HLA-codon associations (≈ 70% recall at 20% q-value; Figure 6.3C). In contrast, failure to account for phylogenetic confounding (MLC) significantly reduced power (p < 0.0001), despite the relative homogeneity of the data. Furthermore, accounting only for phylogeny (MP), as many codon-covariation models have proposed [12, 39, 47, 92, 122, 138, 146, 158, 171, 178, 187, 190], performed even worse, reflecting a tendency to pick up indirect associations. At high precision (> 70%), accounting for only phylogeny improved performance relative to baseline, though at lower precision Fisher’s exact test outperformed the more error-prone LRT-based MP. In addition, the phylogeny-only (MP) and the phylogeny-naïve (MLC and MFET) models were extremely poorly calibrated (Figure 6.3D), indicating that q-values produced by these models are misleading. In the following section, where we consider multi-clade data, we shall see a more dramatic example of this failure. Thus, these experiments demonstrate the importance of accounting for both phylogeny and multivariate covariation when inferring correlated evolution among codons, even in relatively homogeneous cohorts.

6.2.3 Results on multi-clade synthetic data

Comparing recent large cohort studies [24, 25, 153, 196] to previous smaller studies [17, 31, 157] suggests that more associations can be detected by increasing sample size, a result directly confirmed by Rousseau et al. [196]. Although substantially increasing the size of existing cohorts may not be feasible, existing cohorts can potentially be merged together. One problem with this approach is that different cohorts typically consist of populations sampled from different geographical areas that differ substantially in HIV subtype distributions and ethnic composition (and thus HLA allele distribution), and the traditional approach of stratifying by clade and/or demographics defeats the purpose of increasing sample size. By correcting for the phylogenetic structure of the sequences, however, we can attempt to exploit the larger sample size of the combined data. At the same time, we can examine the similarities and differences among associations in different clades. To do so, we combined the HOMER and Durban cohorts, yielding a mixed-clade group of 1144 individuals, with a roughly equal mix of clades B and C (Figure 6.4). As in the previous experiments, we fit the full Noisy Add model to this combined data set and then generated synthetic data from the resulting model. We then attempted to learn back the associations using the full data set and then, for comparison, by stratifying the data and running the Noisy Add model separately for each clade. As indicated by the PR and calibration curves (Figure 6.5), the Noisy Add model successfully accounted for the heterogeneity in the data, as it remained calibrated and successfully recovered 80% of HLA-codon associations and 75% of all associations at 20% FDR. 
Importantly, the model demonstrated higher power on the combined data set than on the stratified data (p < 0.0005 for both HLA-codon associations only and all associations), indicating that there is shared information at both the HLA-codon and codon-codon levels and that power can be increased by merging data sets from disparate cohorts as long as all three sources of confounding are accounted for.

Figure 6.4: Tree built from the combined HOMER (red) and Durban (blue) cohorts [129]. In the text, “clade B” refers to the predominantly red subtree and “clade C” refers to the predominantly blue subtree.

[Figure 6.5 plot residue: panels A and C plot precision (y-axis) against recall (x-axis); panels B and D plot 1 − precision (y-axis) against q (x-axis). Legend: PLC, PLC Strat, LC, LC Strat, PL, L, P, FET.]

Figure 6.5: Performance on data generated from the mixed-clade B/C data set. Precision-recall (A) and calibration curves (B) of the models with respect to HLA-codon associations; precision-recall (C) and calibration curves (D) of the models with respect to both HLA-codon and codon-codon associations. “PLC Strat” and “LC Strat” refer to running MPLC and MLC, respectively, on data stratified by clade. The curves reflect the combined results from the two strata.

We then applied the remaining five models to this mixed-clade data set, observing the founder effects demonstrated by Bhattacharya et al. [17]. In particular, using either Fisher’s exact test (MFET) or accounting for LD alone (ML), as proposed by Moore et al. [157], results in strikingly poor PR curves (Figure 6.5A) with calibration plots that indicate it is impossible to achieve greater than 10% precision (Figure 6.5B). These results are due to the founder effects demonstrated in [17] that arise from the fact that both HLA allele and HIV clade frequencies differ between human populations in different geographical areas. In contrast, using phylogeny alone (MP) to account for founder effects, as proposed by Bhattacharya et al. [17], greatly increases power in the PR curve, though calibration is still poor. In this case, accounting for LD in addition to phylogeny (MPL), as proposed by Brumme et al. [24], only moderately increases power, though it corrects the problem with calibration. Similar to the results for single clade data, accounting for LD and codon covariation (MLC) yielded further improvements in both power and calibration, though we note the peculiar nature of the PR curve, which indicates that the most significant associations are spurious. This peculiarity is even more pronounced when looking at both HLA-codon and codon-codon associations (Figure 6.5C). Inspection of the strongest spurious associations indicates that they are founder effects that serve as clade markers, meaning the strongest associations simply identify a sequence as clade B or clade C. Once these markers are accounted for in the model, the performance of the model begins to improve (where the right-hand side of the curve increases with recall). In contrast, failure to account for any confounding (MFET) results in a strikingly poor PR curve.
Given the prominent structure of the multiclade data, a natural solution is to stratify the data by clade, running the phylogeny-naïve model (MLC) separately on each clade. Although stratifying the data removes the strongest founder effects, the overall performance is not significantly different from MLC without stratification (p = 0.34 for HLA-codon associations and p = 0.09 for all associations). Nevertheless, it is interesting to note that, in the HLA-codon case, after the founder effects are
incorporated into the model, the non-stratified version of MLC appears to perform better than the stratified version, reinforcing the observation that there are common sites of escape in the two cohorts. Unfortunately, it is impossible to distinguish founder effects from true signal, limiting the practical value of this approach. In contrast, accounting for phylogeny with the full model (MPLC) significantly outperformed both the stratified and non-stratified versions of the phylogeny-naïve model (MLC) on both types of associations (p < 0.0001 in both cases) and did not suffer from founder effects. Finally, it is striking that for codon coevolution, it is better to account only for protein-wide codon covariation than to use a sophisticated phylogenetic-correction algorithm that is limited to pairwise associations, especially if the data can be stratified by gross tree topology (in this case, HIV clades), although accounting for both phylogeny and codon covariation is clearly a more powerful approach.

6.2.4 Noisy Add can distinguish among specific leaf distributions

As discussed in Methods, the Noisy Add and univariate models incorporate a model of selection pressure for each predictor attribute that can take one of four forms: escape, reversion, attraction or repulsion. Furthermore, although escape and reversion (attraction and repulsion) are negative (positive) correlations, each process is distinct. So far, we have assumed that the primary purpose of these studies is to uncover associations at a codon level, and so ignored the specific leaf distribution learned. Nevertheless, the leaf distribution may be informative. To determine how well the model can recover the true leaf distribution, we compared the leaf distribution of the model instance used to generate the synthetic data with that learned from the synthetic data. On the three synthetic data sets we have discussed (synthetic data generated by Decision Tree and Noisy Add leaf distributions on the HOMER cohort, and synthetic data generated with Noisy Add leaf distributions on the combined HOMER and Durban cohorts), the Noisy Add model recovered the correct
leaf distribution in 85–90% of cases where it recovered the correct predictor-target pair. It should be noted, however, that the forward selection scheme used by Noisy Add makes it unlikely that both complementary processes (escape/reversion or attraction/repulsion) will be recovered, even if both processes are present. Thus, when we find (e.g.) an escape association, it does not preclude the presence of a reversion association. There is great utility in being able to distinguish between escape and reversion processes. Escape is an indication of CTL pressure, whereas reversion in the absence of the allele is an indication of a replicative cost to the escape variant that leads to active reversion (as opposed to passive drift) in the absence of CTL-mediated selection pressure. Consistent with these interpretations, Matthews et al. [153] recently showed that reversion associations but not escape associations correlate with reduced plasma viral load in chronic infection. More generally, of course, the biological interpretation of these distribution forms will vary across domains of application.

6.2.5 Statistical power as a function of sample and effect sizes

Previous studies have differed widely in their results, in part because they employed different methods, and in part because they used different sample sizes (473 [157], 96 [17], ≈ 550 [24], 181 [31], 261–452 [196], and 262–666 [153]), which affects the power to detect associations. Not surprisingly, the larger studies found more associations, which suggests even larger studies may be beneficial. It is important to note, however, that the adverse effects of violating model assumptions increase with sample size, as assumption violations lead to deviation from (and statistical rejection of) the null distribution (see, for example, [144]). To measure the effects of sample size on the power of the six methods in consideration, we created additional synthetic data sets of size 143, 286, and 572 by randomly selecting 12.5%, 25% and 50% of the individuals from the mixed-clade synthetic data set. Figure 6.6 shows the dramatic increase in power that the full Noisy Add model
(MPLC ) experiences as a function of N . Here, power is defined to be the ability to detect associations at 80% precision regardless of the model’s reported q-value. In the range tested, the Noisy Add model has an approximately linear increase in the power to detect associations (HLA-codon and codon-codon associations combined) as the sample size increases. In contrast, increasing sample size for the other models has a limited effect. In particular, failure to account for codon covariation leads to a flat power curve for all models that do not account for codon covariation. The model that accounts for codon covariation and HLA allele LD but not phylogeny (MLC ) does experience a linear increase in power to detect HLA-codon associations (but not codon-codon associations), though the power is less than that of the phylogeneticallycorrected model (MPLC ) at all sample sizes. Only the full model (MPLC ) experiences any increase in power to detect codon-codon associations. Thus, simply increasing the cohort size will not lead to an increase in power if improper models are used. Rather, model calibration is likely to be negatively impacted as large numbers of spurious (yet non-null) associations are detected. Similar trends were seen on the HOMER (clade B only) cohort (data not shown). Statistical power is a function of sample size and effect size. In the results just described, the planted associations came from those detected at 20% q-value on real data at N = 1144. If we instead plant associations detected at 20% on real data at (e.g.) N = 572, those associations will presumably be stronger and hence the measured power should be greater. To demonstrate this supposition, we ran Noisy Add on a random subsample of the mixed clade cohort, and then generated new synthetic data sets based on these associations. 
The dashed blue lines (labeled “PLC Half”) in Figures 6.6A/B show the increased power to detect planted associations originally found at N = 572 compared to planted associations originally found at N = 1144. Thus, if associations of the strength detected by Brumme et al. [24] (N ≈ 550) are desired, then N = 1150 will provide sufficient power to recover 90% of the associations. If, however, more subtle effects are sought, then larger cohorts are necessary. It is our opinion that only post hoc analyses of larger cohorts will determine the minimum relevant effect size. Furthermore, it should be noted that, whereas power can be increased by combining data from multiple cohorts, if only associations for a single cohort are of interest, then greater power will clearly be achieved from a single-clade cohort of size N than from a multi-clade cohort of size N. Nevertheless, the power increase from combining cohorts of different clades will prove useful in situations where single-clade cohorts cannot be expanded in practice.

[Figure 6.6, panels A and B: recall (y-axis, 0 to 1) versus sample size N (x-axis, 0 to 1000) for the PLC, PLC Half, LC, PL, L, P, and FET models.]

Figure 6.6: Power to detect both HLA-codon and codon-codon associations (A) or just HLA-codon associations (B) in the mixed-clade cohort at 80% precision. The “PLC Half” curve plots the power of MPLC on synthetic data generated using only associations that were identified from a cohort one half the size of the full cohort. The curves show how power is affected by the strengths of the planted associations.
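The power measure used in this section (the ability to detect planted associations at 80% precision, regardless of reported q-values) can be sketched as follows. This is only one plausible reading of that definition: reported associations are ranked by significance, and power is taken to be the largest recall at any rank cutoff whose precision is at least 80%. The function name, the toy ranking, and the number of planted associations are assumptions made for illustration and are not taken from the study.

```python
from typing import Sequence


def recall_at_precision(ranked_hits: Sequence[bool], n_planted: int,
                        min_precision: float = 0.8) -> float:
    """Largest recall at any rank cutoff whose precision is >= min_precision.

    ranked_hits[i] is True when the (i+1)-th most significant reported
    association is one of the n_planted associations. Reported q-values are
    ignored; only the ordering matters.
    """
    best_recall = 0.0
    true_positives = 0
    for cutoff, hit in enumerate(ranked_hits, start=1):
        true_positives += int(hit)
        if true_positives / cutoff >= min_precision:
            best_recall = max(best_recall, true_positives / n_planted)
    return best_recall


# Made-up ranking of ten reported associations against ten planted ones.
toy_ranking = [True, True, True, False, True, True, False, False, True, False]
toy_power = recall_at_precision(toy_ranking, n_planted=10)
```

In the toy ranking, the deepest cutoff that still has at least 80% precision is rank 6 (five true positives), so the sketch reports a recall of 0.5.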

Chapter 7

USING PDNS TO INFER PATTERNS OF IMMUNE ESCAPE AND COVARIATION IN HIV

7.1 Technical details

7.1.1 Data

These methods were applied to population-based HIV sequences from chronically infected, antiretroviral-naïve and HLA-typed individuals from two cohorts: the HOMER cohort from British Columbia, Canada, consisting of 567 predominantly clade B gag sequences [25], and the Durban cohort, consisting of 522 predominantly clade C p17/p24 gag sequences from Durban, South Africa [116, 196]. Individuals in the HOMER and Durban cohorts were HLA-typed to two- and four-digit resolution, respectively. Here, we truncate the Durban data to two digits for comparison with the HOMER cohort. Viral sequences were determined by nested reverse-transcriptase polymerase chain reaction (RT-PCR) amplification of extracted plasma HIV RNA followed by bulk sequencing, as previously described [24, 25, 196]. Phylogenies were constructed from these sequences using PhyML [87], run using the general time reversible model over the HIV sequences and estimating all parameters via maximum likelihood.

7.1.2 Data analysis

When we refer to associations involving codons, we will sometimes find the following notation useful. T242 will refer to an amino acid (in this case threonine) at a specific codon (242). If the association is escape or reversion, then T242 is the susceptible form. If the association is attraction or repulsion, then T242 is the resistant form. The PDN will often find complementary associations. For example, T242 is the susceptible form with respect to B*57, and N242 is the resistant form. We will sometimes refer to such associations as T242N. For simplicity, we will usually report only the smaller q-value of the two associations. If only the susceptible association is significant (q ≤ 0.2), we will sometimes write T242X. Likewise, if only the resistant association is significant, we will sometimes write X242N.

Optimally defined epitopes [70] were taken from http://www.hiv.lanl.gov/content/immunology/tables/optimal_ctl_summary.html, accessed on December 21, 2007. To allow the inclusion of processing mutations [54], we called an association a match to the optimal epitope if it was within three codons of the epitope boundary, as described elsewhere [24, 196]. When using the optimal epitopes as a bronze standard for comparing methods, we considered only the most significant HLA-codon association per HLA-epitope pair to prevent double counting that arises due to the extent of within-epitope covariation (see Results). Similarly, in cases where an HLA-codon pair was not within three codons of an optimal epitope, we computed the most likely predicted epitope using Epipred [95] so that neighboring associations in putative epitopes were not double counted.

7.2 Phylogenetic dependency network for Gag p17 and p24

Having established the Noisy Add model’s ability to reliably recover associations in mixed-clade data sets, we now turn to an analysis of the actual associations that were recovered on the mixed clade B/C data. The Noisy Add model found 149 HLA-codon associations and 1386 codon-codon associations at q ≤ 0.2, representing 100 distinct HLA-codon pairs and 716 distinct codon-codon pairs. To explore these dense networks we developed a dependency-network viewer, PhyloDv, designed for intraprotein networks. PhyloDv draws the protein as a circle, with the N-terminus at the “3 o’clock” position and the protein extending counter-clockwise around the circle. Codon-codon associations are drawn as headless arcs (or edges) within the circle, whereas HLA-codon associations are drawn as external edges. Edge color reflects the strength of the association. Figure 7.1 shows the full Gag dependency network at 20% q-value. The program, which includes interactive detailed views of each codon to explore the specific associations, is available upon request. The individual associations are available as Supplementary Table 1.

The dense PDN reveals broad patterns of codon covariation and HLA-mediated substitutions. For example, pairs of codons are more likely to covary within a subprotein (N = 528) than between subproteins (N = 188; p < 10^-31), and a greater proportion of p17 codons (24%) than of p24 codons (13%) is associated with HLA alleles (p = 0.009). Not surprisingly, a disproportionate number of covarying codon pairs were within 10 positions of each other (162/716; p < 10^-55). (We compute p-values by using Fisher’s exact test to estimate the significance of a contingency table that compares observed associations against null codon pairs, which we define to be the set of all codon pairs that were not called significant by the PDN but which did pass the minimum count pre-processing filter.) Interestingly, of the 62 codons that are associated with at least one HLA allele, 59 (95%) predict substitutions at other codons. Furthermore, each HLA-associated codon predicts substitutions at 7.0 other codons on average (range 0–25). These two observations highlight the complexity of HLA-mediated escape. Also of note is that, of the 181 codons that covary with at least one other codon, 60 (33%) have an association with at least one HLA allele, suggesting that HLA-mediated selection pressure will confound analyses of codon coevolution when unaccounted for.
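The contingency-table test described above can be sketched with a small standard-library implementation of the two-sided Fisher’s exact test. This is a minimal illustration, not the code used in the study; the function names are hypothetical. The example counts are the linearly proximal codon-pair counts reported in Section 7.2.2 (75 of 121 covarying pairs and 1075 of 2417 non-covarying pairs within 5 Å), arranged as covarying versus null rows and near versus far columns.

```python
from math import exp, lgamma


def log_choose(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_choose(row1 + row2, col1)

    def log_prob(x: int) -> float:
        # Hypergeometric log-probability of x in the top-left cell.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_total

    observed = log_prob(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= observed + 1e-9:  # tables no more probable than the observed one
            p_value += exp(lp)
    return min(1.0, p_value)


# Rows: covarying vs. null (non-covarying) linearly proximal codon pairs.
# Columns: within 5 angstroms vs. farther apart (counts from Section 7.2.2).
p_proximal = fisher_exact_two_sided(75, 121 - 75, 1075, 2417 - 1075)
```

Working in log space avoids overflow in the binomial coefficients, which become very large for tables with thousands of null pairs.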
Among the 68 HLA-codon associations with escape/reversion leaf distributions, 33 represent escape/reversion from a residue that is consensus in both clades B and C, 7 represent escape/reversion from clade B consensus only, and 11 represent escape from clade C consensus only, where we define clade B consensus based on the HOMER cohort and clade C consensus based on the Durban cohort. Interestingly, of the 11 clade C susceptible associations, 5 had predicted resistant forms matching clade B consensus (A*29 F79Y, A*68 F79Y, B*35 D260E, A*11 F301Y, B*44 D312E), and 2 of the 3 clade B susceptible associations had a predicted resistant form matching clade C consensus (A*01 Y79F, A*31 R91K).

Figure 7.1: Gag phylogenetic dependency network for combined HOMER and Durban cohorts. Gag p17 and p24 are drawn counterclockwise, with the N-terminus of p17 at the 3 o’clock position. Arcs indicate association between codons (inside the circle) or between HLA alleles and codons (outside the circle). Colors indicate q-values of the most significant association between the two attributes.

In all, there were 21 HLA-codon associations for which the predicted resistant form was clade B or C consensus (Table 7.1). These associations may represent instances in support of the “HLA footprinting” hypothesis, which states that the current circulating viral sequences are a reflection of escape from prominent HLA alleles in different human subpopulations [131, 157]. Indeed, 17 of these 21 associations involved common HLA alleles that are found in at least 10% of individuals in at least one of the two cohorts (Table 7.1). Four of these 21 associations lie in optimal epitopes, which is reasonable given that such responses are less likely to be identified using overlapping peptide scanning technologies that seek to maximize consensus sequence coverage. In three of those four optimal epitopes, the predicted susceptible form matches the optimal epitope. B*07-associated S357G is the one exception, where G is both clade B and C consensus and is also the amino acid in the optimal epitope sequence. This association may represent an instance where the so-called optimal epitope was actually a partially escaped form. It is interesting to note, however, that B*07 is a very common allele in both the HOMER and Durban cohorts and, in one recent study, all studied optimal B*07 epitopes in Gag, Pol and Nef were found to contain at least one association predicting that the optimal epitope actually contained an escape polymorphism [23]. As noted in the synthetic results, Noisy Add can distinguish with 85% accuracy the difference between reversion and escape leaf distributions (though it cannot discern whether both are present). On the current data set, 5 HLA-codon associations were identified as primarily reversion: B*14 K302, B*15 K26, B*57 T242, B*58 G55, and B*81 L184.
These associations most likely have a corresponding escape association that we are not detecting, but these associations are nonetheless notable in that there is a strong statistical pull towards the “susceptible” form in the absence of the associated allele, which may suggest that fitness costs are associated with the resistant form [6, 132, 153]. Indeed, in the case of T242, the resistant form N242 is known to reduce in vitro fitness [21, 132].

Table 7.1: HLA-codon associations in which clade B and/or clade C consensus is the predicted resistant form. Rows that match optimal epitopes list the epitope sequence in the Optimal column. X indicates no significant association.

                            Consensus    HLA Freq (%)
HLA    Pos   Susc   Res     B     C      B       C       Optimal
A*01    76   R      K       K     K      23.5     9.9
A*01    79   Y      F       Y     F      23.5     9.9
A*11    93   X      E       E     E      12.3     0.3
A*11   301   F      Y       Y     F      12.3     0.3
A*24    30   K      R       R     M      19.2     5.4    KYKLKHIVW
A*29    79   F      Y       Y     F       8.4    15.5
A*30    67   X      A       S     A       5.2    34.1
A*31    91   R      K       R     K       8.0     0.9
A*68    79   F      Y       Y     F       9.3    24.2
A*74   109   X      N       N     N       0.3     9.9
B*07   357   S      G       G     G      23.1     9.9    GPGHKARVL
B*14   147   X      I       I     I       6.5     5.5
B*15    28   X      K       K     H      18.8    34.8
B*15   147   X      I       I     I      18.8    34.8
B*35   260   D      E       E     D      16.9     3.6    PPIPVGDIY/NPVPVGNIY
B*42    30   X      R       R     M       0.4    22.4
B*44   312   D      E       E     D      19.5    14.9    AEQASQDVKNW
B*57    54   A      S       S     S       6.4     9.5
B*81   163   X      A       A     A       0.1    11.8
C*06   146   P      A       A     A      13.7    28.3
C*06   242   N      T       T     T      13.7    28.3

7.2.1 Known escape pathways are predicted by the PDN

In some cases, CTL escape requires a set of secondary substitutions that may stabilize protein structure, compensate for lost protein function, or facilitate further escape [21, 41, 72, 105, 113, 175, 176, 201, 234]. To date, however, the identification of such complex pathways has been largely limited to studies of single immunodominant epitopes restricted by HLA alleles that are known to be protective against infection. The PDN systematically predicts potential escape pathways across all epitopes and HLA restrictions. Here, we assess the quality of these predictions by checking whether well-studied escape pathways are found in our PDN.

The first HIV escape pathway that was described in detail is escape from the B*27-restricted KK10 epitope in p24 (Gag codons 263–272) [84, 113]. In the mid 1990s, it was demonstrated that the R264K/G mutations abrogated B*27 recognition of the KK10 epitope [84, 166]. Here, we find that B*27 is strongly correlated with escape from R264 (Noisy Add q = 0.01), with the result being evenly distributed between K (q = 0.08) and G (q = 0.11). Kelleher et al. later reported that the R264K but not R264G mutation was typically preceded by L268M. Accordingly, in the PDN, we find that L268M predicts R264K (q < 0.001) but not R264G, and that L268M is itself predicted by B*27 (q = 0.001). Kelleher et al. also reported that R264G was associated with E260D, and Schneidewind et al. [200] confirmed that E260D compensates for R264G but not for R264K. We find the same association in the PDN. (Note that, although E is clade B consensus and D is clade C consensus, every individual in both cohorts with G264 has D260.) Recently, Schneidewind et al. demonstrated that S173A compensates for the loss of replicative capacity incurred by R264K [201]; this R264K substitution (but not R264G) is strongly associated with S173A in our model (q < 0.001). In addition, we note that R264K is strongly associated with substitution I267V (q < 0.001) within the KK10 epitope and with L215M (q = 0.01). Residue 264 is within 3 angstroms of both codons 215 and 173 in folded p24 [214], which may explain the compensatory relationship between codons 173 and 264 [201] and predicts a similar relationship between codons 264 and 215. Finally, although it is not known if there are any determinants that predict whether KK10 escape occurs via the R264K or R264G pathway [200], we find several associations that predict one pathway or the other. Most strikingly, of the 7 individuals with R264G, 4 have Q136R (q = 0.0001), a substitution which also strongly predicts the E260D substitution of the R264G pathway (q < 0.001). In addition, A146P is associated with maintaining wild type L268 (not the R264K pathway) whereas A163X predicts I267V (R264K pathway). Both A146P [54] and A163X [41] are B*57-mediated escape substitutions (see below), though no individuals in the cohorts are both B*57 and B*27 positive, making interpretation difficult.

The B*57 and B*5801 alleles have been strongly associated with effective HIV control [5, 110, 115, 155], an effect that may be due in part to successful targeting of Gag epitopes [116] and the high cost to viral load of CTL escape from some epitopes targeted by these alleles [21, 147]. Recently, the details of escape from the B*57-restricted TW10 epitope in Gag codons 240–249 have been described [21, 85, 147]. TW10 escape begins with a T242N escape mutation, which partially abrogates B*57 binding, but also elicits a measurable fitness cost to the virus in part by disrupting cyclophilin A (CypA) interactions [147]. The fitness costs of this mutation may be partially restored by compensatory substitutions H219Q, I223V and M228I [21]. This escape pattern is captured by the dependency network, which finds a direct HLA-codon association between B*57 and 242 (q < 0.001).
The T242N substitution predicts (q < 0.001) further escape at G248A (position 9 of the TW10 epitope) and a single compensatory mutation, E210D (q = 0.01), in the CypA binding loop, whereas the G248A substitution predicts compensatory substitutions V218A (q = 0.02) and M228V (q = 0.08), and G248T predicts H219Q (q = 0.07) and M228I (q < 0.001), of the CypA binding loop. Although 228 is (in at least some structural models) in direct contact with 248 (3 angstroms) [214], the other associations are more distant (10–20 angstroms). Nevertheless, the CypA substitutions have been shown to compensate for the 242 and 248 mutations [21], underscoring the fact that compensatory mutations may be of a more functional nature and not strictly due to protein structural constraints.

Previous studies have reported alternative escape pathways in the A*02-restricted SLYNTVATL epitope (Gag positions 77–85) at epitope positions 3, 6, and 8 (though the Y79F escape at epitope position 3 is clade C consensus) [105]. Although the model finds associations with three HLA alleles that restrict known epitopes that overlap this region (A*01, A*11, A*29), no correlations were observed with A*02. The lack of signal may be due to several factors that will each dilute the signal: multiple escape pathways that occur in different sites, dilution from the overlapping epitopes for which there is a stronger signal, and evidence that a lack of fitness cost will lead to low rates of reversion [105], which, coupled with the high rate of A*02 in the population (40.5% in the combined cohort), suggests that many non-A*02 individuals will have escape variants.

7.2.2 Codon covariation reflects three-dimensional protein structure

In the first study of its kind, Poon and Chao [179] reported that 70% of artificially induced, fitness-reducing mutations selected for partially restoring compensatory mutations in the DNA Bacteriophage φX174. No studies have systematically explored this phenomenon for immune escape in HIV Gag or other viruses, but the case studies of B*27 KK10 and B*57 TW10 make it clear that compensation happens in response to at least some CTL escape mutations. Poon and Chao further reported that compensatory mutations tended to cluster in linear and/or three-dimensional space, though many exceptions were noted. Indeed, the KK10 and TW10 case studies reveal two patterns of compensation: distal compensation in the case of TW10, where compensatory mutations are distal in three-dimensional space but alter functional dependencies, and proximal compensation in the case of KK10, where codon pairs in a compensatory pathway are in close proximity and are likely required to maintain structural fidelity. Although the PDN predicted both known pathways, only the latter form of compensation can be easily and independently verified computationally by computing distances between covarying codon pairs.

To determine the proportion of covarying codon pairs that are in direct contact, we computed codon-codon distances against the p17 trimer crystal structure [98] and the p24 and p17p24 polyprotein NMR structures [214]. The distances were computed as the minimum distance between any reported atoms for each codon in a single PDB model, taken over all models and all three structures. The distances for the p17 trimer crystal structure tended to be larger than for the p24 and p17p24 NMR structures as hydrogen atoms, which tend to be on the periphery of amino acid molecules, are not mapped in crystal structures. To compute the significance of the results, we also computed three-dimensional distances among null codon pairs, which we defined to be all codon pairs for which no significant direct associations were found even though both codons exhibited enough variability to pass our minimum count filter. Among the 424 significant (q ≤ 0.2) linearly distal codon pairs (> 10 codons apart) that could be mapped to at least one structure, 37 (8.7%; p < 10^-11, Fisher’s exact test against null codon pairs) were within 5 Å of each other and 113 (26.7%; p < 10^-20) were within 10 Å. Even among linearly proximal codon pairs (2–10 codons apart), covarying pairs were more likely (75/121; 62%) than non-covarying pairs (1075/2417; 44%) to be within 5 Å in the three-dimensional structure (p = 0.0001).
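The distance definition above (the minimum distance between any reported atoms of two codons within a single model, taken over all models and structures) can be sketched as follows. This is a minimal illustration under stated assumptions: each model is represented as a plain dictionary from codon number to atom coordinates, parsing of PDB files is omitted, and the coordinates in the example are made up rather than taken from any structure.

```python
from itertools import product
from math import dist, inf
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]
Model = Dict[int, List[Atom]]  # codon number -> atom coordinates in one model


def min_codon_distance(models: List[Model], codon_a: int, codon_b: int) -> float:
    """Minimum atom-atom distance between two codons, taken over all models.

    Atoms are only compared within the same model. Returns infinity when the
    pair cannot be mapped to any model.
    """
    best = inf
    for model in models:
        atoms_a = model.get(codon_a, [])
        atoms_b = model.get(codon_b, [])
        for p, q in product(atoms_a, atoms_b):
            best = min(best, dist(p, q))
    return best


# Two hypothetical models with made-up coordinates (in angstroms).
toy_models: List[Model] = [
    {173: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], 264: [(4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]},
    {173: [(0.0, 0.0, 0.0)], 264: [(0.0, 2.5, 0.0)]},
]
toy_distance = min_codon_distance(toy_models, 173, 264)
in_contact = toy_distance <= 5.0  # the 5 angstrom threshold used in the text
```

Pooling all models from all three structures into one list gives the "taken over all models and all three structures" behavior; keeping the comparison within each model avoids mixing coordinates from different frames.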
To further validate the ability of the model to distinguish direct associations within a chain of interactions, we computed pairwise distances among all linearly distal one-hop associations, excluding instances where a direct association was also inferred. The median distance between direct association codon pairs (15.9 Å) was significantly smaller than the median distance between one-hop codon pairs (19.2 Å, p < 0.0001). The direct codon-codon associations were also significantly closer than those of MLC (median 18.4 Å, p = 0.003; 5.3% < 5 Å, p = 0.04), which does not account for phylogeny, and MP (median 21.8 Å, p < 10^-12; 3.2% < 5 Å, p < 10^-5), which computes pairwise, phylogenetically corrected associations. Fisher’s exact test (median 22.1 Å) was indistinguishable from the null codon pairs (median 22.9 Å, p = 0.63). The complete set of distances is reported as Supplementary Table 2. It should be noted that long range distances do not preclude a compensatory relationship, as long range effects are common [179] and both p17 and p24 form complexes, suggesting that some structural compensations may exist at the interface between two instances of the same protein. Nevertheless, those codon pairs for which we observe both strong dependencies and colocalization in the three-dimensional structure are strong candidates for further study with regard to compensation.

7.2.3 Codon covariation reflects correlated epitope targeting

The epitopes targeted by CTL are not a simple function of the individual’s HLA repertoire. Rather, specific patterns of epitope targeting are often observed. For example, epitope targeting by CTL often follows patterns of immunodominance [237], wherein initially only one or a few epitopes (the dominant epitopes) are strongly targeted by the T-cell response. However, a shift in immunodominance patterns occurs over the course of infection, as the T-cell response broadens to target additional epitopes [83, 109]. Given that patterns of immunodominance appear to be largely consistent at the population level in at least some cases [7, 23, 75, 85], the sequential selection of escape mutations restricted by the same HLA allele that results from sequential targeting of HLA-restricted epitopes over the course of infection may also be reflected as patterns of codon covariation. In the case where escape is sequential, escape in subdominant epitopes may be better predicted by escape in dominant epitopes than by the presence of the restricting HLA allele. To use the immunodominant B*57 allele as an example, the earliest and most frequently targeted B*57-restricted epitope is TW10 [5]. TW10, however, is not the only B*57-restricted Gag epitope. Other epitopes exist in codons 162–172 (KF11) [41], 147–155 (IW9) [54], and 308–316 (QW9) [70]. Recent results indicate that TW10 tends to escape most rapidly, followed by IW9 then KF11 [23] (QW9 was not studied). On the combined Durban-HOMER data set, the dependency network predicts direct HLA-codon associations between B*57 and codons in TW10, IW9, KF11 and positions 54–62, which we will refer to as the putative pSG9 epitope, as well as striking codon covariation. For example, the antigen-processing escape A146P [54] (one codon upstream of the IW9 epitope) is predicted by both the presence of B*57 and the presence of the T242N TW10 escape mutation, suggesting that escape in IW9 often occurs in the context of escape in TW10 (but not always, as indicated by the direct B*57-146 association). Similarly, A163G KF11 escape is predicted by escape substitutions T242N (TW10) and I147M (IW9), and lack of escape at 310 (QW9), reflecting previous reports of the targeting order of Gag B*57-restricted epitopes [5, 23, 41], whereas pSG9 escape is correlated with escape at TW10, IW9 and QW9.

It is important to note that the order in which direct escape associations arise cannot be inferred from the PDN. Rather, the presence of arcs between epitopes suggests that targeting of epitopes restricted by the same allele is somehow correlated. Immunodominance is one biological mechanism that may induce such CTL-mediated codon covariation. Another may be the overall strength of the CTL response and/or the strength of the CTL response to epitopes targeted by a given allele. In the most extreme example, epitopes are either targeted or not depending on the strength of the immune system.
Individuals who are targeting the epitopes will tend to select for escape mutations in all the epitopes, whereas individuals who have weakened immune systems may not target any of the epitopes (or will target with less strength). In this scenario, escape from epitope A implies that the immune system is active, thus increasing the likelihood of escape from epitope B, and vice versa. In a less extreme example, suppose that some individuals with a given allele mount a response to epitopes restricted by that allele, whereas other individuals do not. This situation will lead to codon-codon dependencies among associations in epitopes restricted by the allele.

In addition, several studies have noted the accumulation of multiple or alternative escape substitutions within the same epitope [24, 31, 41, 100, 105, 132, 196, 200, 201, 241]. We would therefore expect to see codon-codon dependencies within the same epitope as well.

To determine how much of the observed codon covariation may be CTL-mediated, we looked at covariation in the PDN with regard to known, optimally defined epitopes. Among linearly proximal codon pairs, both co-evolving codons were within the same HLA-restricted optimal epitope 138 of 162 times (85.2%, compared to 72.0% of null codon pairs; p = 0.0002). Among linearly distal codon pairs, 213 of 554 (38.4%; p = 0.003) occurred within different optimally defined epitopes restricted by the same HLA allele (compared to 32.3% of null codon pairs). If we also include predicted epitopes, defined here as the region ±5 codons from a direct HLA-codon association, then 304 (54.8%; p < 10^-17) linearly distal codon pairs are in known or predicted epitopes restricted by the same HLA allele (compared to 36.6% of null codon pairs). We thus conclude that a majority of codon covariation in Gag p17/p24 is attributable to CTL-mediated selection pressures, though the specific mechanism of CTL-mediated covariation cannot be identified from this study.

7.2.4 Direct HLA-codon associations map to known epitopes

The observation that a majority of codon-codon associations occur within or proximal to epitopes restricted by the same HLA allele suggests that CTL escape is driving much of the observed HIV codon variation. Indeed, Brumme et al. [23] recently showed that at least 36% of observed Gag substitutions in acutely infected individuals are due to HLA-associated polymorphisms (possibly including indirect associations), a proportion that may increase once the full PDN is considered. It may therefore be surprising that there are only 100 direct HLA-codon associations. The synthetic studies showed that the Noisy Add model can successfully recover the primary escape mutations and is not prone to hallucinating indirect associations, indicating that we can accept the direct–indirect distinction with some confidence. Thus, there appears to be a dense network of correlated escape among epitopes, with a relatively sparse set of primary escape mutations that are most rapidly and/or most frequently selected for.

Teasing apart the underlying causality and accuracy of this network requires a large number of longitudinal samples and laborious experimental data. Nevertheless, the accuracy of the PDN can be approximately tested by evaluating which associations lie in optimal epitopes. Specifically, if the PDN is accurate, then direct associations are more likely to lie in or near epitopes than are one-hop associations (H → B in the H → A → B chain, where H → B is not directly inferred by the algorithm). Thus, we categorized every direct and one-hop association based on whether or not it was within three codons of an optimally defined epitope, using a strict matching criterion that required that an optimal epitope exactly matched the consensus sequence among either clade B or clade C HIV sequences that had the predicted susceptible amino acid and the codon in question. Figure 7.2 shows the number of in-epitope associations found as a function of the q-value rank of the association. To prevent double counting, only the most significant association per HLA-epitope pair was considered (see subsection 7.1.2). The plots suggest that direct associations may be more likely than one-hop associations to lie in epitopes, although the difference is not statistically significant. Given that most codon-codon covariations are between epitopes restricted by the same allele, it should not be surprising that many one-hop associations lie in epitopes.
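The epitope-matching step can be sketched as follows, assuming a simple representation that is not taken from the study: an association counts as in-epitope when its codon lies within three codons of an epitope restricted by the same allele, and only the most significant association per HLA-epitope pair is kept. The epitope coordinates come from the text (TW10 240–249, IW9 147–155, KF11 162–172, KK10 263–272). The class and function names, the q-values for codons 146 and 244, and the associations at codons 244 and 300 are illustrative placeholders. The strict sequence-matching criterion described above is not modeled.

```python
from typing import Dict, List, NamedTuple, Tuple


class Epitope(NamedTuple):
    hla: str
    name: str
    start: int  # first codon of the epitope
    end: int    # last codon of the epitope


class Association(NamedTuple):
    hla: str
    codon: int
    q_value: float


def near_epitope(codon: int, epitope: Epitope, flank: int = 3) -> bool:
    """True when the codon is inside the epitope or within `flank` codons of it."""
    return epitope.start - flank <= codon <= epitope.end + flank


def best_per_epitope(associations: List[Association], epitopes: List[Epitope],
                     flank: int = 3) -> Dict[Tuple[str, str], Association]:
    """Keep only the most significant association for each HLA-epitope pair."""
    best: Dict[Tuple[str, str], Association] = {}
    for assoc in associations:
        for ep in epitopes:
            if ep.hla == assoc.hla and near_epitope(assoc.codon, ep, flank):
                key = (ep.hla, ep.name)
                if key not in best or assoc.q_value < best[key].q_value:
                    best[key] = assoc
    return best


epitopes = [
    Epitope("B*57", "TW10", 240, 249),
    Epitope("B*57", "IW9", 147, 155),
    Epitope("B*57", "KF11", 162, 172),
    Epitope("B*27", "KK10", 263, 272),
]
associations = [
    Association("B*57", 242, 0.001),
    Association("B*57", 244, 0.05),  # hypothetical second hit inside TW10
    Association("B*57", 146, 0.02),  # one codon upstream of IW9; placeholder q
    Association("B*27", 264, 0.01),
    Association("B*27", 300, 0.10),  # hypothetical, not near any listed epitope
]
matched = best_per_epitope(associations, epitopes)
```

Here codon 146 is counted toward IW9 because it falls within the three-codon flank, and the weaker hypothetical association at codon 244 is dropped in favor of codon 242 for TW10.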
We thus additionally plotted only those one-hop associations that did not lie in an optimal epitope that was already predicted by a direct association (Figure 7.2, clean one-hop). Only four such associations were found with q ≤ 0.4 (p < 0.0001 compared to direct associations with a permutation test), indicating that most of the one-hop epitopes were epitopes that additionally had direct associations. It is therefore not surprising that MPL, which fails to account for codon-codon covariation, identified escape mutations within almost as many optimal epitopes as the full model MPLC (Figure 7.2). We further compared the HLA-codon associations of the remaining models to the optimal epitopes. Only ML and MFET, which fail to account for both phylogeny and codon covariation and are thus quite prone to founder effects [17], performed significantly worse than the other models (p < 0.0001). The models that roughly account for clade differences, either through codon covariation (MLC) or phylogeny (MP), performed slightly worse than the full model, though these differences were not significant.

[Figure 7.2: number of known epitopes (y-axis, 0 to 20) versus q-value rank (x-axis, 0 to 150) for Direct, One Hop, Clean One Hop, LC, PL, L, and P.]

Figure 7.2: Number of associations in optimal epitopes as a function of q-value rank.

7.3 Discussion

We have presented the first approach to simultaneously account for viral phylogeny, codon covariation, and HLA linkage disequilibrium in population-based association studies. This study is also the first large-scale multiclade analysis of HLA-mediated escape in HIV-1. The large number of direct HLA-codon associations confirms a substantial role of the HLA-restricted CTL response in driving HIV evolution, and supports the observation that patterns of HIV evolution are broadly predictable based on host immunogenetic profiles [2, 17, 24, 25, 157, 196]. Moreover, the results demonstrate that escape and reversion mutations often arise in the context of a complex set of correlated substitutions that reflect both compensatory mutations and dependencies among epitopes. On the whole, the phylogenetic dependency network predicts that a major proportion of p17 (41%) and p24 (20%) codons are under selective pressure from at least one HLA allele, a result that confirms a dominant role of T-cell responses in driving viral evolution [2, 32, 85].

This study also represents a significant step forward by providing a statistical approach that can help differentiate direct (H → A) HLA escape polymorphisms from indirect or, more specifically, one-hop (H → B) escape polymorphisms in situations where the true interaction is the chain H → A → B. Although the direct–indirect distinction can arise under several mechanisms, the explicit statistical interpretation is as follows: a direct HLA-polymorphism association H → A means the HLA allele H is a strong predictor of the polymorphism A, whereas an indirect HLA-polymorphism association H → A → B means the polymorphism B is better predicted by the polymorphism A than by the HLA allele H. Although B is in a sense HLA-associated, the distinction of direct versus indirect associations may have important biological implications.
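The direct versus one-hop distinction can be illustrated on a toy network. The sketch below assumes the inferred arcs are stored as a mapping from each predictor to the set of targets it directly predicts. The node labels follow the abstract H → A → B notation used above rather than any association reported in this study, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple


def one_hop_associations(arcs: Dict[str, Set[str]], source: str) -> Set[Tuple[str, str]]:
    """Return (intermediate, target) pairs reachable from `source` in two steps.

    Targets that `source` also predicts directly are excluded, so every
    returned target is HLA-associated only through an intermediate codon.
    """
    direct = arcs.get(source, set())
    hops: Set[Tuple[str, str]] = set()
    for intermediate in direct:
        for target in arcs.get(intermediate, set()):
            if target != source and target not in direct:
                hops.add((intermediate, target))
    return hops


# Toy network: H -> A, H -> C, A -> B, A -> C, C -> D.
toy_arcs = {"H": {"A", "C"}, "A": {"B", "C"}, "C": {"D"}}
toy_hops = one_hop_associations(toy_arcs, "H")
```

In the toy network, B and D are one-hop associations of H (through A and C, respectively), whereas C is reported only as a direct association even though it is also predicted by A.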
For example, many of the indirect associations identified by the dependency network for the B*57-restricted TW10 and B*27-restricted KK10 epitopes are consistent with known compensatory mutations associated with escape in these epitopes [21, 201]. In addition to these described pathways, the dependency network reports a number of covarying amino acids. Many of these are in close physical contact, and thus likely candidates for compensatory pathways that can be tested via in vitro replication capacity assays, although distal covarying codons may also exhibit
compensatory relationships [38, 142, 179, 232, 242]. Understanding the specifics of compensatory-based covariation has important implications for T-cell-based vaccine design, as escapes that require multiple compensatory mutations may take longer to arise due to chance and the compensatory mutations may not completely recover lost fitness [21, 85, 200, 201]. Compensation is not the only potential causal mechanism of codon covariation. Other mechanisms include those associated with CTL-mediated covariation. Indeed, the PDN indicates that up to 50% of the observed codon-codon covariation occurs between epitopes restricted by the same HLA allele, suggesting much of the observed codon covariation in HIV is CTL-mediated. Two possible mechanisms of CTL-mediated covariation include inter-patient variability in the immune system’s ability to target epitopes and consistent patterns of epitope targeting due to immunodominance. Distinguishing between these two mechanisms may have direct relevance to vaccine design, but will require comparing the results of the PDN to clinical response data that can measure epitope targeting and longitudinal samples that can identify order of escape. Although it is well known that the order in which the epitopes of some HLA alleles are targeted is broadly consistent [7, 18, 23], identifying new patterns may yield new vaccine candidates. Specifically, it is possible that HLA alleles that are currently considered non-protective target ineffective dominant epitopes during acute infection. Redefining the immunodominance hierarchy via immunogen exposure may thus increase the effectiveness of these alleles upon subsequent HIV challenge [73]. A major challenge to vaccine design is global HIV diversity [76, 143]. 
Although accumulating evidence suggests that patterns of escape are broadly predictable [24, 25, 32, 157, 174, 196], the underlying studies have been limited to relatively small sample sizes or to cohorts consisting predominantly of a single clade. Although a comparison of the Durban and British Columbia results showed instances of both consistency and divergence of associated escape in the two clades [196], these
studies were run separately, did not account for codon covariation, and used different methods for determining associations. Thus, the extent to which escape pathways are shared across clades was largely unknown. Our results, which reflect data equally distributed between clade B and clade C sequences and are evaluated by taking HLA LD, viral lineage and codon covariation into account, confirm the existence of common escape pathways. This similarity suggests that a broadly reactive vaccine may be possible, though more work to further characterize inter-clade similarities and differences will be necessary. Despite the broad similarities seen between clades B and C, we noted several intriguing examples where the resistant form of an epitope matched the consensus sequence for one of the clades. Such examples support the HLA footprinting hypothesis [131, 157], which proposes that consensus sequences of circulating strains in a population are a result of consistent escape (and lack of reversion) from the most common HLA alleles in that population, a hypothesis that is especially well founded in cases where the consensus polymorphisms differ between populations. For example, 53% of individuals in the British Columbia cohort [25] have A*01, whereas only 24% of the Durban cohort [116, 196] have A*01, and F79 (clade B consensus) is the resistant form of the association. Furthermore, alleles A*29 and A*68 have higher frequencies in the South African cohort, and Y79 (clade C consensus) is the resistant form of their associations. Thus, at codon 79, there appears to be broad selection pressure for evolutionary fixation of F in the British Columbian cohort and fixation of Y in the South African cohort. 
Our analyses identified a total of 21 codons (four with independent experimental support [70]) where the predicted escape matched clade B or C consensus, adding support to the hypothesis that CTL pressure serves a broad, population-level role in shaping HIV evolution, and may even serve a key role in clade differentiation [157]. We have focused this study on the highly immunogenic Gag p17 and p24 proteins, which are believed to serve a key role in effective control of HIV [56, 77, 116, 243].

Moving forward, it will be important to extend such studies to full length genomes, where patterns of covariation may reflect sites of protein-protein interaction [39, 88, 224] and may further reveal broad patterns of immunodominance. Furthermore, as the number and diversity of large cohorts of HIV-infected, HLA-typed individuals continue to grow [25, 116, 157, 174, 196], it will be important to combine data sets in order to increase statistical power and further detail the similarities and differences among clades that may inform broad-coverage immunogen design. One limitation of our two-clade study is that, because the HLA data in the HOMER cohort had only 2-digit resolution, we truncated the HLA data in the Durban cohort to 2-digit types as well. Although closely related HLA alleles often target the same or similar epitopes [70], making 2-digit resolution an appropriate choice for some allele-epitope pairs, important differences do exist. An example is the distinction between the B*5801 allele, which is associated with effective viral control, and B*5802, which is associated with poor viral control [115, 162]. In cases where the prevalence of four-digit resolution types differs substantially between cohorts (as is the case with B*58) and the four-digit types target different epitopes, truncation to two-digit types before combining cohorts will lead to confounding in which the two-digit types from one cohort will tend to lead to escape whereas the two-digit types from another cohort do not. In principle, a better approach would be to include both 2-digit and 4-digit HLA alleles (or any other grouping of alleles) as predictor attributes in our model. For example, if all B*57 alleles select for the same escape mutation, then B*57 would be chosen by the model as a stronger predictor than B*5701, whereas escape mutations selected only by B*5701 would lead to the 4-digit allele being chosen. 
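The idea of offering both resolutions to the model can be sketched as a preprocessing step. The following is a minimal illustration only, not the code used in this study: the function names, the patient identifiers, and the assumed allele format (locus, an asterisk, then two or four digits, as in B*5701) are all hypothetical.

```python
from typing import Dict, Iterable, Set


def allele_groupings(allele: str) -> Set[str]:
    """Return an allele name together with its 2-digit truncation.

    Assumes names such as 'B*5701'; a 2-digit name such as 'B*57'
    maps to itself only.
    """
    locus, digits = allele.split("*")
    return {f"{locus}*{digits[:2]}", f"{locus}*{digits}"}


def predictor_table(patients: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, int]]:
    """Build binary predictor attributes at both resolutions for every patient."""
    expanded: Dict[str, Set[str]] = {}
    for pid, alleles in patients.items():
        names: Set[str] = set()
        for allele in alleles:
            names |= allele_groupings(allele)
        expanded[pid] = names
    # One column per allele name seen anywhere in the cohort.
    columns = sorted(set().union(*expanded.values()))
    return {pid: {col: int(col in names) for col in columns}
            for pid, names in expanded.items()}


if __name__ == "__main__":
    # Hypothetical two-patient cohort.
    cohort = {"patient1": ["B*5801", "A*0201"], "patient2": ["B*5802"]}
    for pid, row in predictor_table(cohort).items():
        print(pid, row)
```

With attributes built this way, a selection procedure is free to pick B*58 when both subtypes behave alike, or B*5801 alone when only that subtype is associated with escape.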
Of course, these facts should encourage researchers to perform high resolution typing on individuals in their cohorts. In addition, Listgarten et al. [135] have developed a statistical approach for inferring high resolution HLA alleles from low resolution haplotypes. Although incorporating the uncertainty of those predictions into the PDN is beyond the scope of this paper, the ability to infer high resolution HLA data will
allow for more effective evaluation of large, multi-cohort studies. The comparative method has long been used to generate hypotheses regarding traits and the environment [92, 148, 149]. Because (quasi-) species share a common history, the inherent population structure (in this case, the phylogeny) must be accounted for [61], and numerous methods that do so have been proposed (e.g., [64, 92, 148] and references therein). Our study on HLA immune escape mutations suggests, however, that simply accounting for population structure is not enough, as HLA linkage disequilibrium (structure among environmental predictors) and codon covariation (structure among target traits) are at least as important as phylogeny in both increasing statistical power and avoiding false positives. This issue is relevant to applications beyond those studied here. Specifically, whenever chains of interactions are common, pairwise methods will tend to identify direct as well as indirect correlations. This effect was most dramatically seen in the synthetic codon-covariation tests, in which using a logistic regression-like approach (which accounts for chains) dramatically outperformed the phylogenetic pairwise approach. Although many phylogeny-aware comparative methods have been developed for codon-covariation [39], the problem of chains of interactions has only recently been addressed [45, 182]. The PDN provides an efficient framework in which chains of interactions can be identified in the context of both the phylogeny and confounding from external sources of selection pressure (here, HLA-mediated CTL response). The first approach to identifying chains of interactions in a phylogenetic context was recently provided by Poon et al. [182]. They employed a directed acyclic graphical (DAG) model rather than a dependency network. In a DAG model, arcs from predictor to target attributes form a directed acyclic graph and local distributions take the same form as in a PDN. 
(A DAG model is often referred to as a Bayesian network, although the latter name is misleading as non-Bayesian procedures can be used to construct DAG models.) When learning the distributions in the DAG model, Poon et al. took phylogeny into account, although in a way different from our approach.

In particular, when learning the distribution of an attribute given its parents in the DAG model, they imputed for each individual the value of the attribute corresponding to the ancestor of that individual in the phylogeny. These imputed values were then treated as observed data and fed to a standard DAG model structure learning algorithm. The PDN provides an alternative approach that leverages the strengths of dependency networks. The most apparent difference is that dependency networks allow cycles, resulting in a network that is easier for the non-expert to interpret than is the DAG model [94]. In addition, Poon et al. used unrestricted local distributions in contrast to our use of Noisy Add. The use of Noisy Add, where the number of parameters is linear in the number of parents rather than exponential, results in a substantial increase in power. Finally, because the PDN is concerned only with local probabilities, only the target variable is conditioned on the phylogeny, allowing the PDN to efficiently model associations with attributes, such as HLA alleles, that are not expected to follow the phylogeny, as well as attributes, such as other codons, that are expected to follow the same phylogeny [31]. The result is an efficient method that can simultaneously incorporate a diverse range of selection pressure attributes. One drawback of a dependency network relative to a DAG model is that the local distributions among the target attributes overlap and yet are learned independently. (For example, the local distribution for A given B and the local distribution for B given A are closely related, yet are learned independently.) This independent learning leads to a decrease in statistical efficiency. In practice, however, this decrease is typically minimal [94]. 
Another drawback of a dependency network is that the presence of cycles makes inference of the joint distribution cumbersome, requiring an inefficient modified Gibbs sampling procedure to estimate the joint likelihood [94]. One possible solution is to modify the method for constructing a PDN to yield a DAG model. In particular, we can choose a random ordering for the attributes, and then build a PDN wherein the allowed predictors of a target attribute are only those
that precede the target attribute in the ordering. The resulting collection of local probability distributions defines a DAG model (where acyclicity is guaranteed by the ordering constraint). The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. The resulting DAG is a generative model that can be used to perform inference on the joint distribution.
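The ordering-constrained construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: `learn_local` is a hypothetical stand-in for the per-target predictor selection, assumed to return the chosen predictors and a score (higher is better), and the model-level criterion is assumed to be the sum of the local scores.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical interface: (target, allowed predictors) -> (chosen predictors, local score).
LocalLearner = Callable[[str, Sequence[str]], Tuple[List[str], float]]
Model = Tuple[Dict[str, List[str]], float]


def dag_from_ordering(order: Sequence[str], learn_local: LocalLearner) -> Model:
    """Learn one local model per attribute, using only earlier attributes as predictors."""
    parents: Dict[str, List[str]] = {}
    total = 0.0
    for i, target in enumerate(order):
        chosen, score = learn_local(target, order[:i])
        parents[target] = list(chosen)
        total += score  # assumes the criterion decomposes over local models
    return parents, total


def best_dag(attributes: Sequence[str], learn_local: LocalLearner,
             n_orderings: int = 12, seed: int = 0) -> Model:
    """Try several random orderings and keep the highest-scoring DAG."""
    rng = random.Random(seed)
    best: Optional[Model] = None
    for _ in range(n_orderings):
        order = list(attributes)
        rng.shuffle(order)
        candidate = dag_from_ordering(order, learn_local)
        if best is None or candidate[1] > best[1]:
            best = candidate
    assert best is not None, "n_orderings must be at least 1"
    return best
```

Because every predictor precedes its target in the ordering, the returned parent sets cannot contain a cycle, whatever `learn_local` selects.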

Chapter 8

SUMMARY

This dissertation has introduced the Phylogenetic Dependency Network, a framework for the identification of adaptive traits and the specific sources of selection pressure that drive adaptation. The idea of the PDN is to fit each potentially adaptive target trait to a probabilistic model of evolution that conditions the target on both the phylogeny and other predictive traits that exert selection pressure on the target. The structure of the PDN then consists of arcs connecting targets to the sources of selection pressure that putatively drive the adaptation of each target. We have defined and explored three specific probabilistic models of conditional adaptation, which can be used as the probability components of the PDN. These models are similar in that they each make the simplifying assumption that each target trait has evolved independent of the predictor traits throughout its evolutionary history. It is only when the target trait reaches the environment in which we are able to observe it that we model the interaction between predictor and target. Although this assumption certainly does not describe the true biological process, we have provided empirical evidence that it is a useful and reasonable approximation. Specifically, we considered two extreme examples. In the first, the predictor trait(s) is IID and uniformly distributed. In this case, models based on this assumption are indistinguishable from models that integrate out the interaction of the predictor(s) in each hidden internal node, as this integration is indistinguishable from increasing the rate of evolution in the independence model. In the opposite extreme, the predictor and target are coevolving. In our simulation studies, we found that the conditional adaptation model is nearly as good as the generative coevolution model, suggesting
that conditional adaptation is a reasonable approximation even when the true causal model is coevolution. Of the three conditional adaptation models we proposed, we found that Noisy Add is the most expressive. In this model, multiple predictor traits are combined to drive the adaptation of the target. The parameters of the model specify whether each predictor exerts positive or negative selection pressure on the trait, as well as the probability that any non-zero selection pressure is exerted. Because there is only a single parameter for each predictor trait, the number of parameters grows linearly with the number of traits in the model. This represents a major advance over previous approaches, in all of which the number of parameters grows exponentially with the number of traits in the model. This property allows us to fit a large number of traits to the model, even with a modest amount of data, and to do so in a reasonable amount of computational time. Because the conditional adaptation assumption can reasonably model interactions of target traits with predictors that exhibit a wide range of distributions relative to the target trait’s, the Noisy Add model is able to simultaneously incorporate a diverse range of predictor traits into the model. Traditionally, approaches to the comparative method have assumed that the predictor traits follow a specific, prespecified distribution. Most often, the assumption is that the predictor has coevolved with the target along the same phylogeny. Although this model is reasonable for many domains, such as the study of codon covariation in proteins, we have argued that it is not appropriate for many applications in which the predictor trait is derived from the environment. Specifically, we have considered the case where we model the adaptation of HIV to the specific HLA alleles of the human host. 
In this case, we have shown that a coevolution assumption does not describe the data well and is significantly outperformed by the conditional adaptation model. Furthermore, because the conditional adaptation model can describe coevolution data reasonably well, the Noisy Add model is able to simultaneously incorporate a diverse range of predictors, including those that have
coevolved with the target traits and those whose distributions are quite different from that of the target. In addition, our forward selection approach to determining the predictors for a given target can effectively account for correlations among predictors, which we have shown is important for increasing accuracy. Importantly, the ability to simultaneously model the influence of a diverse range of predictor traits has enabled us to explore HIV adaptation from a rich new perspective. Focusing on HIV adaptation to the HLA-mediated adaptive immune response, we have argued that there are three major sources of statistical confounding: (1) population structure due to the HIV phylogeny, (2) linkage disequilibrium among HLA alleles, and (3) HIV codon covariation. Although these sources of confounding are often acknowledged, limitations of previous approaches have prevented an in depth study of HIV adaptation that can correctly account for these problems. In contrast, we have demonstrated that a PDN that uses the Noisy Add model is able to simultaneously account for all three sources of confounding. The result is that we are able to more accurately predict which specific HLA alleles are directly correlated with specific HIV codon substitutions. When we applied our model to a large and diverse cohort, we found a rich network of interactions. It was not uncommon for a single HIV target codon to be directly correlated with a handful of HLA alleles and a large number of other HIV codons. The density of the dependency network underscores the importance of using a model that can simultaneously account for multiple interactions. As we demonstrated with both synthetic data and arguments based on observations of the real data, failure to do so results in a large number of spurious associations, especially as the size of the data set (and the corresponding power to detect deviations from the null model) increases. 
Even when the goal of a study is simply to identify coevolving codons, the limitation of previous coevolution approaches to comparing pairs of traits leads to a large number of spurious associations that derive from indirect interactions. Our application of the PDN to the largest cohort of HLA-typed, chronically infected and antiretroviral-naïve patients studied to date reveals the complexity, yet promising consistency, of HIV adaptation to the cellular immune response. In addition to identifying several of the handful of well-studied escape pathways, we identified a plethora of putative escape adaptations, many of which lie within or adjacent to known epitopes, but many of which also lie in putative epitopes. Remarkably, many of these adaptations correlate strongly with coevolutionary adaptations elsewhere in the protein. Although we cannot infer a causal mechanism for the coevolutionary changes, the combination of recently characterized compensatory mutations and the observation that coevolving pairs of codons in our data tend to be proximal in the folded protein suggests that many of the coevolving pairs may represent compensatory mutations. Moving forward, a major goal of T-cell vaccine design is to identify the specific epitopes that should be included in a vaccine. Given the patterns of escape and compensation that have been characterized in some epitopes that are targeted by protective alleles, the dense map of escape and coevolution that we have identified may provide key new insights into which regions of the virus are more vulnerable to an effective immune attack and should therefore be included in a vaccine. In conclusion, the main contributions of this dissertation can be succinctly summarized as follows:

1. The introduction of the phylogenetic dependency framework, which uses the assumption of conditional adaptation to map sources of selection pressure to adaptive traits.

2. The definition and exploration of three probabilistic models of conditional adaptation. In particular, the Noisy Add distribution, which efficiently and parsimoniously models the effect of selection pressure derived from multiple predictor traits on a single target trait.

3. The application of Noisy Add and the PDN to the study of HLA-mediated adaptation in HIV.

The source code and executables for the PDN, the distributions discussed, and the PhyloDv viewer that we use to display the results are freely available at http://www.codeplex.com/MSCompBio.

8.1

Limitations and future directions

We now discuss some limitations of the PDN framework and the conditional adaptation models we have discussed, as well as some future directions that could address many of these problems.

8.1.1

Numerical and computational issues

Perhaps the most obvious limitation is the amount of computing power required to run these models. Indeed, even the largest data set considered in this dissertation can be analyzed using Fisher’s exact test with only an hour or two of CPU time, whereas running Noisy Add took the better part of a year’s worth of computing time. Nevertheless, as we have shown, the vast increase in computing time brings with it a substantial increase in statistical power and confidence in the calibration of the resulting significance scores. Indeed, in our largest synthetic studies, the application of FET yielded all but unusable predictions. In addition, given the advent of cheap, high performance computing, we feel that the computational cost is a small price to pay for the increase in accuracy. This tradeoff is especially worthwhile if the algorithm can be run on a shared resource to defray the marginal cost. As such, we are developing a web-based, cluster-backed server as a service to the biological community so that computing costs are not an issue for researchers.

On the technical side, there is always a concern that EM optimization will settle on a local optimum. In principle, with enough random restarts this problem can be eliminated, though restarts are costly computationally and we have found that the fitness landscape tends to be such that, when a local optimum is a problem, it takes a large number of restarts to find the global optimum. In practice, we have found that the best solution is to initialize all parameters with “reasonable” starting values, which can be estimated from the data by, for example, assuming the data are IID. When local optima are a problem, there are three possibilities: (1) only the calculation of the alternative likelihood falls into a local optimum, (2) only the calculation of the null likelihood falls into a local optimum, or (3) both calculations fall into local optima. Typically the most damaging scenario is case (2), as the result can be a wildly inflated significance measure. Fortunately, the fact that the null model is nested in the alternative model implies that if case (2) occurs when there is no true correlation, then there is probably a configuration of the null model parameters that will yield a likelihood close to that of the alternative model. This fact suggests the following heuristic: for every significant association, refit the null model using the applicable parameters from the alternative model as a starting point for EM. While this heuristic has no guarantees of helping, we have found that in practice it greatly reduces the number of spurious associations arising from case (2).
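The refitting heuristic can be written as a small wrapper around the two fits. The sketch below is illustrative only: `fit_null` and `fit_alt` are hypothetical callables standing in for EM runs that take starting parameters and return the fitted parameters with their log-likelihood, and `shared` names the parameters that the nested null model has in common with the alternative.

```python
from typing import Callable, Dict, Iterable, Tuple

Params = Dict[str, float]
# Hypothetical EM wrapper: starting parameters -> (fitted parameters, log-likelihood).
Fitter = Callable[[Params], Tuple[Params, float]]


def guarded_log_likelihood_ratio(fit_null: Fitter, fit_alt: Fitter,
                                 null_start: Params, alt_start: Params,
                                 shared: Iterable[str]) -> float:
    """Log-likelihood ratio in which the null model gets a second, warm-started fit."""
    _, null_ll = fit_null(dict(null_start))
    alt_params, alt_ll = fit_alt(dict(alt_start))

    # Restart the null fit from the alternative model's values for the shared
    # parameters, and keep whichever null fit has the higher likelihood.
    warm_start = {name: alt_params[name] for name in shared}
    _, refit_ll = fit_null(warm_start)
    null_ll = max(null_ll, refit_ll)

    return alt_ll - null_ll
```

Keeping the better of the two null fits can only lower the ratio, so the wrapper guards against inflation from case (2) without affecting associations whose null fit was already at its optimum.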

8.1.2

Problems with trees

A natural criticism of our approach is that we condition on a phylogeny, which of course cannot be known with complete certainty. There are, in fact, two major criticisms that can be leveled with regard to our use of phylogenies: (1) we ignore any uncertainty in the phylogeny, and (2) we infer the phylogeny assuming there is no selection pressure, a rather circular assumption given that we then set out to identify selection pressure acting on individual traits from which the phylogeny was inferred. Let us consider each of these criticisms in turn.

Uncertainty in phylogeny inference

The myriad assortment of phylogeny inference algorithms and their parameter settings makes it clear that inferring phylogenies is an imprecise art. Indeed, when we run a number of available programs on the data sets considered in this dissertation, each yields a slightly different phylogeny. Can we really just pick one and assume it’s the correct one? The good news is, we have found that the precise phylogeny used in our method does not have a huge impact on the results, and what impact it does have appears to be confined to reducing statistical power and does not appear to inflate our estimates of statistical significance (see subsection 5.2.1). In other unpublished work, we have found this observation to hold up on real data as well. Specifically, in one example, we ran the model on five “similar” trees (meaning the major topological features are preserved), then found that of those associations identified using any of the five trees at q < 0.05, 80% were found using all of the five trees.

Although we can be reasonably optimistic that the model is robust to imperfect trees, a solution that incorporated the phylogenetic uncertainty may increase statistical power and would at the very least be more intellectually satisfying. One possibility is to take a Bayesian perspective and compute the posterior distribution of phylogenies (using, for example, MrBayes [103], which will provide such a distribution if given enough time). We could then sample trees from this posterior distribution and create a PDN for each tree. Because the PDN consists of local probability distributions, the question then focuses on how to integrate over the sampled trees in the local conditional probabilities. One simple possibility would be to compute the alternative and null likelihoods as the averages, over the sampled trees, of the per-tree alternative and null likelihoods. The resulting likelihood ratio would then reflect the uncertainty in the tree structure.
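A minimal sketch of this averaging, assuming that each sampled tree carries equal weight and that the per-tree log-likelihoods of the alternative and null models have already been computed (both function names are hypothetical). Likelihoods of this kind are typically far too small to exponentiate directly, so the average is taken in log space.

```python
import math
from typing import Sequence


def log_mean_exp(log_values: Sequence[float]) -> float:
    """Return log(mean(exp(v) for v in log_values)) without underflow."""
    largest = max(log_values)
    scaled = sum(math.exp(v - largest) for v in log_values)
    return largest + math.log(scaled / len(log_values))


def tree_averaged_log_ratio(alt_log_liks: Sequence[float],
                            null_log_liks: Sequence[float]) -> float:
    """Log-likelihood ratio after averaging each likelihood over the sampled trees."""
    return log_mean_exp(alt_log_liks) - log_mean_exp(null_log_liks)
```

Averaging the likelihoods themselves, rather than the per-tree ratios, matches the proposal above; whether the resulting statistic keeps the same null distribution as the single-tree ratio would need to be checked empirically.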

How does selection pressure influence inference of the phylogeny?

Even if we ignore the uncertainty in inferring the phylogeny, one must still wonder about the common assumption in the inference process that each site in the sequence is evolving neutrally. Not only is this assumption violated in general, but the whole point of the PDN is to identify where and how it is violated. Indeed, some of our collaborators have recently pointed out that the strongest associations can strongly bias the inference of the phylogeny [152]. In their case, simply removing the sites known to escape in response to B*57 drastically altered the inferred phylogeny. This observation suggests a simple, iterative solution:

1. Infer the phylogeny using all sites.

2. Build the PDN using the inferred phylogeny.

3. Identify the target variables involved in the strongest associations and remove the underlying codons from the DNA sequences.

4. Infer a new phylogeny using the DNA sequences modified in step three.

5. If the phylogeny is very similar to that from the last iteration, stop. Otherwise, continue with step two.

In the end, we will be left with a phylogeny built from sites that are not under strong selection pressure (or at least, those that are under strong selection pressure contribute relatively little to the phylogeny inference procedure). The PDN from this last step can thus be considered as the most reliable. One concern might be that, because the first iterations of this algorithm involve a tree that is apparently incorrect, the associations we identify from those trees are likely incorrect and thus the wrong sites are removed, leading us down a wrong path. Two responses can be given to this. First, the fact that we are reasonably robust to wrong trees should guard against this problem. Second, and perhaps more important, any association that is strong enough to alter the phylogeny inference is likely strong enough to be identified given most any phylogeny. Indeed, the associations identified in Matthews et al. were those that have long been known precisely because they can be identified using just about any method (including those that ignore the phylogeny altogether).

Nevertheless, a more principled approach can be taken to this problem. Specifically, one could infer the structure of the phylogeny and the PDN simultaneously. One approach could be to use an EM-like procedure. That is, as we have demonstrated, given a phylogeny we can easily identify associations. The reverse is also likely to be true. Given a PDN, in which we model the selection pressure acting on each target variable, we should be able to infer a phylogeny that incorporates that selection pressure. By iterating back and forth between these two steps, we may expect to improve both our phylogenetic inference and the structure of the PDN.
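The five-step iterative procedure described earlier in this subsection can be sketched as a loop. Everything named here is hypothetical: `infer_tree` stands for any phylogeny inference program applied to a subset of sites, `strongest_sites` for building the PDN on a given tree and returning the sites involved in its strongest associations, and `tree_distance` for any measure of tree dissimilarity (the Robinson-Foulds distance would be one choice).

```python
from typing import Callable, List, Set, Tuple, TypeVar

Tree = TypeVar("Tree")


def refine_phylogeny(n_sites: int,
                     infer_tree: Callable[[List[int]], Tree],
                     strongest_sites: Callable[[Tree], Set[int]],
                     tree_distance: Callable[[Tree, Tree], float],
                     tolerance: float = 0.0,
                     max_iterations: int = 10) -> Tuple[Tree, Set[int]]:
    """Alternate between PDN construction and phylogeny inference until the tree stabilizes."""
    removed: Set[int] = set()
    tree = infer_tree(list(range(n_sites)))                # step 1
    for _ in range(max_iterations):
        removed |= strongest_sites(tree)                   # steps 2 and 3
        kept = [site for site in range(n_sites) if site not in removed]
        new_tree = infer_tree(kept)                        # step 4
        converged = tree_distance(tree, new_tree) <= tolerance
        tree = new_tree
        if converged:                                      # step 5
            break
    return tree, removed
```

The iteration cap is an addition for safety, since nothing in the procedure itself guarantees that successive trees converge.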

8.1.3

Deterministic cycles

A more fundamental limitation of the PDN is an inherent limitation of dependency networks in general. Because a dependency network consists of local conditional probability distributions, in certain circumstances, true causal interactions can be impossible to detect. Consider the following causal model: X → Y → Z. Further, suppose the conditional probability table for Y → Z is the deterministic relation

\[
\Pr\{Z = 1 \mid Y = 1\} = 1, \qquad \Pr\{Z = 0 \mid Y = 0\} = 1.
\]
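The consequence of such a deterministic relation can be seen in a small numerical sketch. The counts below are invented for illustration, the local models are unrestricted empirical conditional tables rather than Noisy Add, and the phylogeny is ignored; the function name is likewise hypothetical. Under those assumptions, once Z is available as a predictor of Y, adding X cannot increase the likelihood, even though X → Y is a true causal arc.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Row = Dict[str, int]


def conditional_log_likelihood(rows: List[Row], target: str,
                               predictors: Sequence[str]) -> float:
    """Maximized log Pr(target | predictors) using empirical conditional frequencies."""
    joint = Counter((tuple(r[p] for p in predictors), r[target]) for r in rows)
    context = Counter(tuple(r[p] for p in predictors) for r in rows)
    return sum(n * math.log(n / context[ctx]) for (ctx, _), n in joint.items())


# Invented data: X raises the chance that Y = 1, and Z always equals Y.
rows: List[Row] = (
    [{"X": 1, "Y": 1, "Z": 1}] * 3 + [{"X": 1, "Y": 0, "Z": 0}] * 1 +
    [{"X": 0, "Y": 1, "Z": 1}] * 1 + [{"X": 0, "Y": 0, "Z": 0}] * 3
)

if __name__ == "__main__":
    for predictors in ([], ["X"], ["Z"], ["Z", "X"]):
        score = conditional_log_likelihood(rows, "Y", predictors)
        print(predictors, round(score, 3))
```

In this toy data X alone improves on the empty model, but Z alone already attains the maximum possible log-likelihood of zero, so a greedy procedure that has selected Z has no reason to add X.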

Now when we fit the local conditional probability distributions for the target variables Y and Z, we will find that Y is fully sufficient to predict Z and Z is fully sufficient to predict Y. Thus, if the relation between X and Y is non-deterministic, our forward selection procedure will never find X → Y, because there can be no gain in likelihood over the relation Z → Y (even though in the true causal model, the direction of causation is in reverse). An example of where we have seen this problem is in the case of B27-mediated escape in the epitope KK10. In the combined HOMER-Durban cohort, we were able to recover the known escape pathway B27 → R264K → L268M, which is one of two possible escape pathways originating at position 264. When we look only at the HOMER cohort, however, we do not see this pathway. What we do see is the pair of arcs R264K → L268M and L268M → R264K. Upon closer inspection, we find that the selection parameter in both cases is 1, meaning if escape occurs in one position, escape must occur in the other position. Indeed, in the HOMER cohort, every patient with one of the polymorphisms has the other polymorphism. Consequently, we fail to find the association of these escape mutations with B27, which is the known cause of the escapes. One solution to this problem is to use a directed acyclic graphical model (DAG) and compute the joint likelihood. This is the approach of Poon et al. [182]. We could still achieve the benefits of the Noisy Add model (linearity in the number of parameters and the conditional adaptation assumption) by using the procedure outlined in section 7.3. Namely, create a random ordering over all the traits, then run the PDN as described, but with the restriction that the only allowable predictors for a given target trait are those that precede the target in the random ordering. The result is a DAG, for which the joint likelihood can be easily computed. The resulting model can be improved substantially by applying the above procedure to a dozen or so random orderings, and then choosing the best model according to some criterion (e.g., a Bayesian criterion or cross validation) [104]. Because the resulting model is acyclic, we cannot learn the (incorrect) circular model Y ↔ Z, and thus are likely to learn the (correct) model X → Y → Z.

8.1.4

Binarization of data and results

On finite data sets, binarization of multistate discrete data is a simple means of increasing the power to detect associations. In our case, a model that can incorporate all 20 possible amino acid states would represent an enormous increase in the number of parameters (up to 380, i.e., 20 × 19 ordered pairs of states, for the null model, if all possible transitions were to be considered; many more to capture selection pressure). Nevertheless, binarization has its drawbacks. The most obvious drawback comes in interpreting the results. Typically, the biologist wants to know what specific mutations are likely to arise in response to a given selection pressure. In the binarization process, we create multiple traits for each codon (one for each observed amino acid at that codon). In the best case, it can be difficult to reconstruct the escape process from the set of traits at that position. In the worst case, noise in the data can lead to inconsistencies or an incomplete picture.

Furthermore, any evolutionary model that does not model the mutation of codons directly (i.e., one that either binarizes the data or uses some compression of the codon space, such as the 20 amino acids) is not a first-order continuous-time Markov process (CTMP), as our models assume. That is, the first-order CTMP assumption states that the rate of transition between two states is constant. Because of redundancy in the mapping from trinucleotides to amino acids, however, this assumption does not hold. For example, the rate of transition from Leucine (Leu) to Phenylalanine (Phe)
is dependent on the specific trinucleotide that underlies the Leu. If the trinucleotide is CUU, then a single mutation is required to yield UUU, a Phe trinucleotide. If, however, the Leu trinucleotide is CUG, then two mutations are required to yield either UUU or UUC (the two possible Phe trinucleotides). Thus, the probability of transitioning is not constant, but rather depends on the underlying trinucleotide, which in turn is a function of the previous amino acid state that was visited. For example, if the trait had just transitioned from Phe to Leu, then it is more likely to transition back to Phe than if it had previously been a Valine, whose GUG trinucleotide is one mutation away from Leu’s CUG. The violation of the Markov assumption becomes even more pronounced when the data are binarized, as the transition from not Leu to Leu can require anywhere from one to three DNA mutations, and thus the probability of the transition is a function of how long the trait has not been a Leu (the longer it has not been a Leu, the more mutations a transition back to Leu is likely to require).

Nevertheless, we must point out that the empirical results we have presented strongly suggest that the PDN is robust to violations of the CTMP assumption. Furthermore, for many traits, unmodeled purifying selection will, in practice, constrain the range of possible amino acid states, effectively minimizing the likelihood that the rate of transition from not Leu to Leu (for example) deviates from a constant. Indeed, for many of the codons we saw in the real data sets, only 2–4 amino acids were actually observed at that position.

Although it is not feasible to learn the structure of the PDN using multistate distributions (especially one that fully satisfies the CTMP assumption), one could use a hybrid approach. Specifically, one could first fit the structure of the PDN using binarized data.
The resulting structure could then be fixed, and the binarized traits could be collapsed back into multistate traits. Finally, the parameters of the multistate model could be fit to the data, assuming the PDN structure from the binarized data. There are, of course, several open questions that would have to be addressed, including how to deal with inconsistencies in the structure when binary traits are collapsed back into their original multistate traits, and the precise definition of selection pressure over multiple states (the simple definition of positive and negative selection would no longer apply). Nevertheless, the result would be the statistical confidence in the structure provided by binarizing the data, with an easier task of interpreting the parameters of the model and the specific adaptations that are selected for in specific circumstances.

In addition, this process, coupled with converting the PDN to a DAG as described in the previous section, would greatly facilitate the calculation and interpretation of joint inference. Such inference could, for example, yield the most likely viral sequence to arise through adaptation to a particular HLA repertoire, a functionality that could prove extremely useful for vaccine design, especially if we incorporate a model of which epitopes are actually targeted in an individual (as governed by the HLA repertoire in addition to immunodominance and/or vaccination). As the model currently stands, it is not possible to determine the rate of targeting a specific epitope. But as we shall see in the next section, such an adaptation of the model may be possible with the advent of next generation sequencing data.

8.1.5 Extension to single genome sequencing data

The input data we used for this dissertation consisted of “bulk” sequences, which represent the consensus of the HIV population infecting an individual. The primary reason for this is that bulk sequences are cheap and easy to obtain. They are also intuitively pleasing, as they presumably represent an estimate of the “optimal” HIV sequence given the current immune system conditions, insofar as the most common virus is the most fit virus. The bulk sequence is, however, an imperfect reduction of the underlying data. A more accurate sequencing technology is generically referred to as single genome sequencing (SGS), which involves the sequencing of individual virions. Using traditional sequencing methods, the isolation and sequencing of individual virions is quite tedious. Thus, these data sets are currently rare and tend to be much smaller
than the data sets considered here. However, next generation sequencing technology promises to provide a high throughput means of sequencing thousands of individual virions in each patient. Is the PDN applicable to such data sets? The answer depends on the specific question. For example, if codon coevolution is the sole interest of the study, then the PDN can be applied to the SGS sequences without modification. As we have discussed, however, the ideal scenario is to account for environmental variables in addition to covariation. Indeed, in the studies we have considered, the interaction of HIV with the host environment is of primary interest. In this case, the Noisy Add model as it currently exists is inadequate.

Consider the case where we want to compare a single HLA trait X to a single HIV trait Y using the escape leaf distribution. The parameterization of this model includes a single parameter s for selection pressure. In the notation of this dissertation, $s = \Pr\{Y_i = 0 \mid X_i = 1, H = 1\}$; that is, the probability that Y transitions to state 0 given that it started in state 1 and individual i has HLA X. If we consider the underlying biology, however, we will notice that s should really be decomposed further. The probability of escape is not simply a function of whether the individual has HLA X; it is also a function of whether the individual’s immune system is actively targeting (using HLA X) the epitope containing Y. That is,

$$ s = \Pr\{Y_i = 0 \mid L_i = 1, H = 1\} \cdot \Pr\{L_i = 1 \mid X_i = 1\} \tag{8.1} $$

where $L_i = 1$ means individual i’s immune system is actively targeting the epitope containing Y. The new (hidden) trait L is conditioned on X because the epitope cannot be targeted without the HLA, but $\Pr\{L \mid X\}$ is non-deterministic because immunodominance implies that individuals who have the capacity to target the epitope may not do so (chapter 3). (Also, note that L is independent of H, because H corresponds to the state of Y in the absence of selection pressure.)
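As a hedged aside, Equation (8.1) can be read as a marginalization over the hidden targeting trait L. The sketch below is not taken from the dissertation's own derivation. It assumes the chain structure X → L → Y (so that Y depends on X only through L), uses the stated independence of L and H, and assumes that without targeting Y simply keeps the state H it would have in the absence of selection pressure, so that no escape occurs when $L_i = 0$.

```latex
\begin{align*}
s &= \Pr\{Y_i = 0 \mid X_i = 1, H = 1\} \\
  &= \sum_{\ell \in \{0,1\}} \Pr\{Y_i = 0 \mid L_i = \ell, H = 1\}\,
     \Pr\{L_i = \ell \mid X_i = 1\} \\
  &= \Pr\{Y_i = 0 \mid L_i = 1, H = 1\}\,\Pr\{L_i = 1 \mid X_i = 1\}
     + \underbrace{\Pr\{Y_i = 0 \mid L_i = 0, H = 1\}}_{=\,0 \text{ (assumed)}}\,
       \Pr\{L_i = 0 \mid X_i = 1\} \\
  &= \Pr\{Y_i = 0 \mid L_i = 1, H = 1\} \cdot \Pr\{L_i = 1 \mid X_i = 1\}.
\end{align*}
```

Under these assumptions the untargeted term drops out, leaving exactly the two factors of Equation (8.1).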


Figure 8.1: Univariate model with linked predictors. This phylogeny represents single genome sequences taken from two individuals (three sequences for the bottom individual, two for the top individual).

When we have only one sequence per patient, the two components of Equation (8.1) are unidentifiable. With multiple sequences per patient, however, the two components are identifiable, and failure to decompose s may result in a misleading model. Indeed, continuing our univariate example, a better model is the one given in Figure 8.1. In this model, the (observed) predictor HLA X points to an (unobserved) linker variable L, which in turn points to the observed target variable Y. The selection pressure s is now decomposed into two parts, $s = (s_1, s_2)$, such that $s_1 = \Pr\{L = 1 \mid X = 1\}$ and $s_2 = \Pr\{Y = 0 \mid L = 1, H = 1\}$. Semantically, $s_1$ is the probability that the immune system targets the relevant epitope given that the patient has HLA X, and $s_2$ is the probability that escape occurs given that the epitope is targeted. Note that $s_2$ can be defined for each of the leaf distributions as in chapter 4, but $s_1$ remains the same for all leaf distributions (i.e., $\Pr\{L = 1 \mid X = 0\} = 0$ for all leaf distributions).

More generally, this linked conditional adaptation model is suitable for any case in which the presence of selection pressure from a predictive source is stochastic and applies equally to groups of individuals (that is, if the selection pressure is present (or absent) for one individual in the group, then it is present (or absent) for all individuals in the group). Following this line of thinking, a linked version of both the univariate and Noisy Add models can be constructed (see Figure 8.2 for a mixed linked/unlinked example of Noisy Add). The resulting model should be a more faithful representation of the underlying biological interaction between HLA alleles and SGS sequences. Furthermore, the parameters $s_1$ and $s_2$ are identifiable and may provide valuable insight into the underlying biological mechanism. For example, does an apparently moderate-to-low rate of escape s derive from a low rate $s_1$ of epitope targeting, or from a low rate $s_2$ of escape in actively targeted epitopes? Note that, in the context of Noisy Add, it is straightforward to include unlinked traits as well. In this case, other codons would be represented as unlinked traits. The interested reader is referred to Appendix A for further technical details of this model, though thorough experimentation and application are left for future work.
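The identifiability claim above can be illustrated with a deliberately simplified sketch. It is not the model of Appendix A: the function name `patient_likelihood` and the parameter values are hypothetical, the phylogeny is ignored, H = 1 is assumed for every sequence, and sequences within a patient are assumed to escape independently once the epitope is targeted. Under those assumptions, the likelihood for one HLA-positive patient mixes a targeted branch (probability s1) with an untargeted branch in which no sequence escapes.

```python
from math import comb


def patient_likelihood(n_seqs, n_escaped, s1, s2):
    """Probability that n_escaped of n_seqs sequences from one HLA-positive
    patient carry the escape, under a simplified linked model.

    Illustrative assumptions: no phylogeny, H = 1 for every sequence, and
    sequences escape independently once the epitope is targeted.
    """
    # Targeted branch (L = 1, probability s1): each sequence escapes with
    # probability s2, so the number of escapes is binomial.
    targeted = (
        s1 * comb(n_seqs, n_escaped)
        * s2 ** n_escaped * (1 - s2) ** (n_seqs - n_escaped)
    )
    # Untargeted branch (L = 0, probability 1 - s1): no sequence escapes.
    untargeted = (1 - s1) if n_escaped == 0 else 0.0
    return targeted + untargeted


# Two hypothetical settings with the same product s = s1 * s2 = 0.2.
low_targeting = (0.25, 0.8)  # rarely targeted, escape likely when targeted
low_escape = (0.8, 0.25)     # often targeted, escape unlikely when targeted

for n in (1, 5):
    a = patient_likelihood(n, n, *low_targeting)
    b = patient_likelihood(n, n, *low_escape)
    print(f"n_seqs={n}: all escaped -> {a:.5f} vs {b:.5f}")
```

With a single sequence per patient the two settings give the same likelihood, because only the product s1 · s2 enters. With five sequences the likelihoods differ, which is the sense in which multiple sequences per patient separate the two components.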

Figure 8.2: Noisy Add model with unlinked (X1) and linked (X2) predictors. The model takes the same form as in Figure 4.5, except that the linked predictor variables (e.g., HLA traits) have an extra hidden node. As in the univariate case (Figure 8.1), given X2 = 1, L = 1 with probability s1. Given L1 = 1, the intermediate nodes I2 are 1 with probability s2.

GLOSSARY

ADAPTATION: A particular mutation that gives rise to an increase in fitness in a given environment and thus has become (or will become) prevalent in the population.

ALLELE: One particular gene variation.

AIDS: Acquired Immunodeficiency Syndrome.

ARV: Antiretroviral Therapy.

ANTIRETROVIRAL THERAPY: Therapy involving any of a number of drugs designed to thwart HIV infection. Major classes of antiretrovirals include drugs that prevent the proper function of HIV Protease, Reverse Transcriptase and Integrase, as well as drugs that interfere with HIV’s ability to fuse with certain CD4+ T-cells.

ATTRACTION: In the context of a conditional adaptation model, attraction describes the process whereby the presence of the predictor trait leads to mutation toward the amino acid described by the target trait. When an HLA allele is the predictor trait, attraction is typically toward a specific resistant mutation.

BAYESIAN INFORMATION CRITERION: An asymptotic approximation to the marginal log likelihood of a model, a Bayesian measure of how well the model represents the data.

BIC: Bayesian Information Criterion.

BULK SEQUENCING: Broadly, any sequencing method that results in a single consensus sequence for each individual.

CALIBRATION CURVE: A curve plotting (1 − Precision) against q-value.

CELLULAR IMMUNE RESPONSE: Broadly, the part of the adaptive immune response that is mediated by T-cells. Here, we focus on the targeting of infected cells for destruction through HLA-I-presented epitopes.

CD4+ T-LYMPHOCYTES: Also known as helper T cells or effector T cells, these cells are involved in coordinating the immune response. Importantly, these are the cells preferentially targeted by HIV.

CD8+ T-LYMPHOCYTES: A class of T-cells, including CTL, that express CD8. These cells are typically involved in the direct or indirect killing of infected cells.

CLADE: HIV-1 group M (the predominant group) is divided into nine distinct subtypes, or clades. These clades can be easily identified by building a phylogeny from a large sample of group M sequences. Each major cluster in the tree is a clade. In this work, we consider sequences from clade B (the predominant clade in North American infections) and clade C (the predominant clade in South African infections).

CODON: Although codon is more precisely defined to be the sequence of three RNA bases that encode an amino acid, here we will often use the term to refer to a position in the protein. For example, codon 3 is the 3rd position.

CODON-CODON ASSOCIATION: An association between two HIV codons. Throughout this work, codon-codon association will typically refer to the strongest association involving any amino acid at either of the two codons.

COEVOLUTION: Broadly, any process by which two or more traits have evolved together along the same phylogeny, each trait mutually influencing the other.

COMPARATIVE METHOD: Broadly, an approach that seeks to identify the adaptive function of a trait by comparing the presence of that trait in different populations or species.

COMPENSATORY MUTATION: Mutation A is compensatory with respect to mutation B if mutation B induces a fitness cost that is relieved by mutation A.

CONDITIONAL ADAPTATION MODEL: A probabilistic model we propose for the PDN. In this model, the target trait is assumed to have evolved independently throughout evolutionary time until it reaches the environment in which we are able to observe it. In this environment, a model of selection pressure is constructed and the target trait is allowed to adapt to this selection pressure.

CTL: Cytotoxic T-Lymphocyte.

CYTOTOXIC T-LYMPHOCYTE: Cells that recognize HLA-I-presented epitopes and cause the death of infected cells.

DAG: Directed Acyclic Graphical model.

DECISION TREE MODEL: A multivariate conditional adaptation model in which the selection pressure derived from a set of predictor traits is computed using a Decision Tree, which is a compact way of describing the unique selection pressure that arises from every possible configuration of predictor traits. See page 39 for details.

DEPENDENCY NETWORK: A probabilistic graphical model. The structure of the model consists of nodes and directed arcs, such that the arc A → B denotes that B is statistically dependent on A. The probability component of the model is a set of local conditional probabilities, one for each node, that describes the probability distribution for each node conditioned on all other nodes that point to it in the graph.

DIRECTED ACYCLIC GRAPHICAL MODEL: A graphical model similar to the DN, except that directed cycles are not allowed and the goal of the model is to maximize the joint likelihood over all the nodes in the graph.

DIRECT ASSOCIATION: If the inferred dependency network includes the chain A → B → C, then A → B and B → C are direct associations. In some contexts, direct association refers to directly interacting traits in the true causal chain.

DISCRIMINATION CURVE: See Precision-Recall curve.

DN: Dependency Network.

ELISPOT ASSAY: See interferon-γ ELISpot assay.

EPITOPE: Broadly, the part of a molecule that is the target of an immune response. In the case of antibodies, an epitope is a region of exposed viral protein. In the case of the cellular immune response, an epitope is a linear fragment of HIV protein that is presented by HLA class I or class II molecules.

ESCAPE: See immune escape for a general definition. In the context of a conditional adaptation model, escape describes the process whereby the presence of the predictor trait leads to mutation away from the amino acid described by the target trait.

FDR: False Discovery Rate.

FALSE DISCOVERY RATE: The expected proportion of tests called significant that are truly null.

FET: Fisher’s exact test.

FISHER’S EXACT TEST: A standard test of independence between two binary traits. Assumes each trait is IID.

FITNESS: Roughly, the reproductive capacity of a specific virion. Virions that are more fit will reproduce more rapidly; thus, their genetic material will come to dominate the population. Fitness can be estimated in vitro using, for example, viral replication or competition assays. Selection pressure changes which virions are more fit than others, and the specific mutations that increase fitness are called adaptations.

FIXATION: A mutation is said to be fixed in a population when the vast majority of individuals have the mutation.

FOLLOW THE TREE: A trait is said to follow the tree or phylogeny if that phylogeny (approximately) describes the evolutionary history of that trait.

GAG: An HIV polyprotein involved with the structure of the virion. Upon maturation, the Gag protein is cleaved into several smaller proteins, including p17 (Matrix) and p24 (Capsid), which we look at in chapter 7. CTL responses against Gag have been weakly correlated with control of viremia [116].

GWAS: Genome Wide Association Study.

GENOME WIDE ASSOCIATION STUDY: An approach to identifying genetic variations that cause specific phenotypes by comparing a large set of SNPs to phenotypes in a large number of individuals.

HLA: Human Leukocyte Antigen.

HLA-CODON ASSOCIATION: An association between an HLA allele and an HIV codon. Throughout this work, HLA-codon association will typically refer to the strongest association between a given HLA allele and any amino acid at that codon.

HLA-I: HLA class I.

HUMAN LEUKOCYTE ANTIGEN: The human major histocompatibility complex proteins. There are two classes, I and II. In this work, we focus on class I, which are primarily responsible for presenting viral epitopes on the surface of infected cells. Among class I, there are three loci: A, B and C. Thus, each individual has six HLA genes (three from each parent).

HIV: In this document, shorthand for HIV-1.

HIV-1: Human Immunodeficiency Virus Type 1. The lentivirus (a member of the retrovirus family) that causes AIDS.

IID: Independent and identically distributed.

IMMUNE ESCAPE: An immune escape mutation is an adaptation that reduces the ability of an individual’s immune system to target HIV.

IMMUNODOMINANCE: An epitope is immunodominant if it elicits a strong CTL response. It is subdominant if it elicits a weak response. The immunodominance hierarchy can refer either to the relative strength of epitope targeting at a given time point, or to the progression of which epitopes are immunodominant throughout the course of infection. Patterns of immunodominance are largely consistent in similar patients, though there appears to be a considerable amount of randomness as well.

IMMUNODOMINATION: The (incompletely understood) process by which the CTL response actively keeps some epitopes subdominant while others are maintained as dominant.

INDEPENDENT AND IDENTICALLY DISTRIBUTED: A collection of random variables is independent and identically distributed if each random variable has the same probability distribution and all random variables are statistically independent of the others.

INDIRECT ASSOCIATION: If the inferred dependency network includes the chain A → B → C, then A → C is an indirect association. In some contexts, indirect association refers to indirectly interacting traits in the true causal chain.

INTERFERON-γ ELISPOT ASSAY: Measures the number of cells secreting interferon-γ under certain conditions. In this work, ELISpot specifically refers to tests designed to determine whether a patient’s immune system can target an epitope. This is done by incubating patient-derived cells with the epitope, then testing for the secretion of interferon-γ, which indicates the active targeting of the epitope by CTL.

LD: Linkage Disequilibrium.

LEAF DISTRIBUTION: The specific probability distribution of a conditional adaptation model that describes the probability of a given adaptation in response to selection pressure.

LINKAGE DISEQUILIBRIUM: Two genetic traits are in linkage disequilibrium if they tend to be inherited together, such that two random variables representing the two traits cannot be considered independent. In this work, the linkage disequilibrium among HLA alleles is considered.

LIKELIHOOD: The probability of the data given a specific model.

LIKELIHOOD RATIO TEST: A method for comparing two models, by which the maximum likelihood of model A is divided by the maximum likelihood of model B. If B is a special case of A (typically, B represents a null model that is a simplified version of A), then twice the logarithm of this ratio is asymptotically χ² distributed, allowing us to compute a p-value analytically.

LRT: Likelihood Ratio Test.

MAXIMUM LIKELIHOOD: The maximum likelihood over all possible parameterizations of a model.

MULTIVARIATE CONDITIONAL ADAPTATION MODEL: Generally, a conditional adaptation model in which there is more than one predictor variable.

MUTATION: Typically refers to a spontaneous change in genomic DNA (RNA in the case of HIV). In this document, mutation is often used as a synonym for an amino acid substitution, as we are primarily interested in mutations that lead to amino acid substitutions.

NEGATIVE SELECTION: The process by which deleterious mutations are removed from the population due to their negative impact on fitness. In the context of Noisy Add, negative selection refers to pressure to move away from the amino acid (assuming by analogy that the amino acid is deleterious).

NEUTRAL EVOLUTION: If in a given position no mutation is favored by selection pressure, then neutral evolution describes the process whereby the frequency of each mutation in the population follows a random walk.

NOISY ADD MODEL: A multivariate conditional adaptation model in which selection pressure is modeled as a stochastic, additive process from a set of predictor traits, each of which may contribute positive or negative selection pressure on the target trait.

NONSYNONYMOUS MUTATION: A mutation in DNA/RNA that leads to an amino acid change.

PDN: Phylogenetic Dependency Network.

PFDR: Positive False Discovery Rate.

PHENOTYPE: Any non-genetic, observable trait.

PHYLOGENETIC DEPENDENCY NETWORK: A dependency network in which the local conditional probability component for each node is conditioned on the phylogeny using a model of evolution.

PHYLOGENY: A tree structure that defines the ancestor-descendant relationships of a group of species or populations.

PLASMA VIRAL LOAD: The quantity of virus per milliliter of plasma.

POL: The HIV polyprotein that is cleaved into Protease, Reverse Transcriptase, and Integrase.

POLYMORPHISM: A specific genetic variation at a specific site. We refer to polymorphisms both at the DNA/RNA level and at the protein level.

POSITIVE FALSE DISCOVERY RATE: The FDR conditioned on the rejection of at least one test.

POSITIVE SELECTION: The process by which an adaptation moves to fixation in the population. In the context of Noisy Add, positive selection refers to selective pressure that favors the amino acid.

PR CURVE: Precision-Recall curve.

PRECISION: The proportion of tests called significant that are true alternative tests (i.e., associations that are in fact “real”).

PRECISION-RECALL CURVE: A curve that plots precision against recall for a group of methods. The curve for a given method is constructed by ordering all of the predictions of that method by q-value. The precision and recall are calculated for each q-value threshold and the resulting curve is plotted.

PREDICTOR TRAIT: In the context of a PDN, the predictor trait is any node on which a target trait is conditioned.

PURIFYING SELECTION: A synonym for negative selection.

P-VALUE: A measure of statistical significance. Roughly, the probability of seeing something at least this significant under the null hypothesis.

PVL: Plasma Viral Load.

Q-VALUE: The FDR analogue of the p-value. The q-value of a test is the minimum FDR for any significance threshold that would cause this test to be rejected.

QUASISPECIES: The distinct population of HIV virions infecting a single patient. Due to the rapid rate of mutation and adaptation in HIV, the HIV population infecting each patient is typically quite different from that infecting any other patient.

RECALL: The proportion of true alternative tests (i.e., associations that are “real”) that are called significant.

REPULSION: In the context of a conditional adaptation model, repulsion describes the process whereby the absence of the predictor trait leads to mutation away from the amino acid described by the target trait.

RESISTANT FORM: In the context of HLA-codon associations, the resistant form is that correlated with the presence of the HLA, and thus presumed to provide relative resistance against HLA-mediated epitope targeting.

REVERSION: Reversion occurs when a codon mutates back to the susceptible form following the removal of the selection pressure. It is, in effect, an adaptation that undoes a previous adaptation, and typically occurs when the escape adaptation incurs a fitness cost on the virus. In this case, when the virus is transmitted to an HLA-mismatched recipient, the selection pressure favoring the adaptation is removed, and the adaptation is actively selected against due to the fitness cost. In the context of a conditional adaptation model, reversion describes the process whereby the absence of the predictor trait leads to mutation toward the amino acid described by the target trait.

SELECTION PRESSURE: Any environmental trait that reduces the fitness of some mutations and favors the fitness of other mutations is said to exert selection pressure.

SGS: Single Genome Sequencing.

SINGLE GENOME SEQUENCING: Broadly, any method by which single virion sequences are obtained.

SIV: Simian Immunodeficiency Virus. An HIV-like virus that infects some species of monkeys. HIV is believed to have arisen from SIV around the turn of the 20th century [229].

SNP: Single Nucleotide Polymorphism.

SINGLE NUCLEOTIDE POLYMORPHISM: A genetic variation or mutation that occurs at a single position.

SUSCEPTIBLE FORM: In the context of HLA-codon associations, the susceptible form is that correlated with the absence of the HLA, and thus presumed to be susceptible to HLA-mediated epitope targeting.

SYNONYMOUS MUTATION: A mutation in DNA/RNA that does not lead to an amino acid change.

TARGET TRAIT: In the context of a PDN, the target trait refers to the node for which we are currently defining the local conditional probability distribution.

TRAIT: A certain characteristic of an individual, population or environment. In this work, typical traits include whether or not a patient has a given HLA allele, and the specific amino acid observed at a specific position in an HIV protein.

UNIVARIATE CONDITIONAL ADAPTATION MODEL: A specific conditional adaptation model in which there is only one predictor trait.

VARIABLE: In the context of a PDN, a variable is shorthand for a random variable that represents a given trait for a given individual. For example, Yi is a random variable for trait Y in individual i.

VIRAL LOAD: See plasma viral load.

VIREMIA: The presence of virus in the bloodstream.

VIRION: An individual virus particle.

BIBLIOGRAPHY [1] Addo MM, Yu XG, Rathod A, Cohen D, Eldridge RL, Strick D, Johnston MN, Corcoran C, Wurcel AG, Fitzpatrick CA, Feeney ME, Rodriguez WR, Basgoz N, Draenert R, Stone DR, Brander C, Goulder PJR, Rosenberg ES, Altfeld M and Walker BD (2003). Comprehensive epitope analysis of human immunodeficiency virus type 1 (HIV-1)-specific T-cell responses directed against the entire expressed HIV-1 genome demonstrate broadly directed responses, but no correlation to viral load. Journal of Virology 77(3):2081–2092. 198 [2] Allen TM, Altfeld M, Geer SC, Kalife ET, Moore C, O’Sullivan KM, DeSouza I, Feeney ME, Eldridge RL, Maier EL, Kaufmann DE, Lahaie MP, Reyor L, Tanzi G, Johnston MN, Brander C, Draenert R, Rockstroh JK, Jessen H, Rosenberg ES, Mallal SA and Walker BD (2005). Selective escape from CD8+ T-cell responses represents a major driving force of human immunodeficiency virus type 1 (HIV-1) sequence diversity and reveals constraints on HIV-1 evolution. Journal of Virology 79(21):13239–13249. 21, 113 [3] Allen TM, Altfeld M, Yu XG, O’Sullivan KM, Lichterfeld M, Le Gall S, John M, Mothe BR, Lee PK, Kalife ET, Cohen DE, Freedberg KA, Strick DA, Johnston MN, Sette A, Rosenberg ES, Mallal SA, Goulder PJR, Brander C and Walker BD (2004). Selection, transmission, and reversion of an antigen-processing cytotoxic T-lymphocyte escape mutation in human immunodeficiency virus type 1 infection. Journal of Virology 78(13):7069–7078. 20 [4] Allen TM, Yu XG, Kalife ET, Reyor LL, Lichterfeld M, John M, Cheng M, Allgaier RL, Mui S, Frahm N, Alter G, Brown NV, Johnston MN, Rosenberg ES,

151

Mallal SA, Brander C, Walker BD and Altfeld M (2005). De novo generation of escape variant-specific CD8+ T-cell responses following cytotoxic T-lymphocyte escape in chronic human immunodeficiency virus type 1 infection. Journal of Virology 79(20):12952–12960. 27 [5] ALTFELD M, ADDO MM, ROSENBERG ES, HECHT FM, LEE PK, VOGEL M, G. YX, DRAENERT R, JOHNSTON MN, STRICK D, ALLEN TM, FEENEY ME, KAHN JO, SEKALY RP, LEVY JA, ROCKSTROH JK, GOULDER PJR and WALKER BD (2003). Influence of HLA-B57 on clinical presentation and viral control during acute HIV-1 infection. AIDS 17(18):2581–2591. 105, 109 [6] Altfeld M and Allen TM (2006). Hitting HIV where it hurts: an alternative approach to HIV vaccine design. Trends in Immunology 27(11):504–510. 104 [7] Altfeld M, Kalife ET, Qi Y, Streeck H, Lichterfeld M, Johnston MN, Burgett N, Swartz ME, Yang A, Alter G, Yu XG, Meier A, Rockstroh JK, Allen TM, Jessen H, Rosenberg ES, Carrington M and Walker BD (2006). HLA alleles associated with delayed progression to AIDS contribute strongly to the initial CD8 T cell response against HIV-1. PLoS Medicine 3(10):e403. 28, 108, 114 [8] Aranzana MJ, Kim S, Zhao K, Bakker E, Horton M, Jakob K, Lister C, Molitor J, Shindo C, Tang C, Toomajian C, Traw B, Zheng H, Bergelson J, Dean C, Marjoram P and Nordborg M (2005). Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes. PLoS Genetics 1(5):e60. 13, 57, 68 [9] Asquith B, Edwards CTT, Lipsitch M and McLean AR (2006). Inefficient cytotoxic T lymphocyte-mediated killing of HIV-1-infected cells in vivo. PLoS Biology 4(4):e90.

152

[10] Asquith B and McLean AR (2007). In vivo CD8+ T cell control of immunodeficiency virus infection in humans and macaques. Proceedings of the National Academy of Sciences of the United States of America 104(15):6365. [11] Atchley WR, Terhalle W and Dress A (1999). Positional dependence, cliques, and predictive motifs in the bHLH protein domain. Journal of Molecular Evolution 48(5):501–516. 10, 11 [12] Atchley WR, Wollenberg KR, Fitch WM, Terhalle W and Dress AW (2000). Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Molecular Biology and Evolution 17(1):164–178. 89 [13] Auranen M, Varilo T, Alen R, Vanhala R, Ayers K, Kempas E, Ylisaukko-Oja T, Peltonen L and Jarvela I (2003). Evidence for allelic association on chromosome 3q25-27 in families with autism spectrum disorders originating from a subisolate of Finland. Molecular Psychiatry 8(10):879–884. 13 [14] Baker B, Block B, Rothchild A and Walker B (2009). Elite control of HIV infection: implications for vaccine design. Expert Opinion on Biological Therapy 9(1):55–69. 19 [15] Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society - Series B: Statistical Methodology 57(1):289–300. 42, 202 [16] Bernardin F, Kong D, Peddada L, Baxter-Lowe LA and Delwart E (2005). Human immunodeficiency virus mutations during the first month of infection are preferentially found in known cytotoxic T-lymphocyte epitopes. Journal of Virology 79(17):11523–11528. 20 [17] Bhattacharya T, Daniels M, Heckerman D, Foley B, Frahm N, Kadie C, Carlson J, Yusim K, McMahon B, Gaschen B, Mallal S, Mullins JI, Nickle DC, Herbeck


J, Rousseau C, Learn GH, Miura T, Brander C, Walker B and Korber B (2007). Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science 315(5818):1583–1586. 11, 12, 16, 23, 24, 25, 28, 29, 30, 32, 63, 71, 72, 84, 86, 87, 90, 93, 95, 112, 113 [18] Bihl F, Frahm N, Di Giammarino L, Sidney J, John M, Yusim K, Woodberry T, Sango K, Hewitt HS, Henry L, Linde CH, Chisholm I John V., Zaman TM, Pae E, Mallal S, Walker BD, Sette A, Korber BT, Heckerman D and Brander C (2006). Impact of HLA-B alleles, epitope binding affinity, functional avidity, and viral coinfection on the immunodominance of virus-specific CTL responses. Journal of Immunology 176(7):4094–4101. 114 [19] Borrow P, Lewicki H, Wei X, Horwitz MS, Peffer N, Meyers H, Nelson JA, Gairin JE, Hahn BH, Oldstone MB and Shaw GM (1997). Antiviral pressure exerted by HIV-1-specific cytotoxic T lymphocytes (CTLs) during primary infection demonstrated by rapid selection of CTL escape virus. Nature Medicine 3(2):205–211. 20, 21 [20] Brander C, Frahm N and Walker BD (2006). The challenges of host and viral diversity in HIV vaccine design. Current Opinion in Immunology 18(4):430–437. 28 [21] Brockman MA, Schneidewind A, Lahaie M, Schmidt A, Miura T, DeSouza I, Ryvkin F, Derdeyn CA, Allen S, Hunter E, Mulenga J, Goepfert PA, Walker BD and Allen TM (2007). Escape and compensation from early HLA-B57-mediated cytotoxic T-lymphocyte pressure on human immunodeficiency virus type 1 Gag alter capsid interactions with cyclophilin A. Journal of Virology 81(22):12608–12618. 22, 29, 104, 105, 106, 113, 114 [22] Brown AJ, Korber BT and Condra JH (1999). Associations between amino


acids in the evolution of HIV type 1 protease sequences under indinavir therapy. AIDS Research and Human Retroviruses 15(3):247–253. [23] Brumme ZL, Brumme CJ, Carlson JM, Streeck H, John M, Eichbaum Q, Block BL, Baker B, Kadie C, Markowitz M, Jessen H, Kelleher AD, Rosenberg E, Kaldor J, Yuki Y, Carrington M, Allen TM, Mallal S, Altfeld M, Heckerman D and Walker BD (2008). Marked epitope- and allele-specific differences in rates of mutation in human immunodeficiency type 1 (HIV-1) Gag, Pol, and Nef cytotoxic T-lymphocyte epitopes in acute/early HIV-1 infection. Journal of Virology 82(18):9216–9227. 75, 102, 108, 109, 110, 114 [24] Brumme ZL, Brumme CJ, Heckerman D, Korber BT, Daniels M, Carlson J, Kadie C, Bhattacharya T, Chui C, Szinger J, Mo T, Hogg RS, Montaner JSG, Frahm N, Brander C, Walker BD and Harrigan PR (2007). Evidence of differential HLA class I-mediated viral evolution in functional and accessory/regulatory genes of HIV-1. PLoS Pathogens 3(7):e94. 25, 26, 27, 28, 29, 71, 72, 75, 77, 84, 86, 87, 90, 93, 95, 96, 98, 99, 110, 113, 114 [25] Brumme ZL, Tao I, Szeto S, Brumme CJ, Carlson JM, Chan D, Kadie C, Frahm N, Brander C, Walker B, Heckerman D and Harrigan PR (2008). Human leukocyte antigen-specific polymorphisms in HIV-1 Gag and their association with viral load in chronic untreated infection. AIDS 22(11):1277–1286. 26, 71, 75, 77, 84, 90, 98, 113, 114, 115, 116, 198 [26] Bruno WJ (1996). Modeling residue usage in aligned protein sequences via maximum likelihood. Molecular Biology and Evolution 13(10):1368–1374. 11 [27] Buck MJ and Atchley WR (2005). Networks of coevolving sites in structural and functional domains of serpin proteins. Molecular Biology and Evolution 22(7):1627–1634. 11, 65


[28] Bugawan TL, Klitz W, Blair A and Erlich HA (2000). High-resolution HLA class I typing in the CEPH families: analysis of linkage disequilibrium among HLA loci. Tissue Antigens 56(5):392–404. 29 [29] Butler MA and King AA (2004). Phylogenetic comparative analysis: a modeling approach for adaptive evolution. The American Naturalist 164(6):683–695. 9 [30] Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG and Hirschhorn JN (2005). Demonstrating stratification in a European American population. Nature Genetics 37(8):868–872. 13 [31] Carlson J, Kadie C, Mallal S and Heckerman D (2007). Leveraging hierarchical population structure in discrete association studies. PLoS One 2(7):e591. 12, 13, 23, 34, 36, 37, 45, 71, 80, 86, 87, 90, 95, 110, 118 [32] Carlson JM and Brumme ZL (2008). HIV evolution in response to HLA-restricted CTL selection pressures: a population-based perspective. Microbes and Infection 10(5):455–461. 113, 114 [33] Carlson JM, Brumme ZL, Rousseau CM, Brumme CJ, Matthews P, Kadie C, Mullins JI, Walker BD, Harrigan PR, Goulder PJR and Heckerman D (2008). Phylogenetic dependency networks: inferring patterns of CTL escape and codon covariation in HIV-1 Gag. PLoS Computational Biology 4(11):e1000225. 37 [34] Carrington M and O’Brien SJ (2003). The influence of HLA genotype on AIDS. Annual Review of Medicine 54(1):535–551. 19, 20 [35] Chen L and Lee C (2006). Distinguishing HIV-1 drug resistance, accessory, and viral fitness mutations using conditional selection pressure analysis of treated versus untreated patient samples. Biology Direct 1(1):14. 12


[36] Cheng C and Pounds S (2007). False discovery rate paradigms for statistical analyses of microarray gene expression data. Bioinformation 1(10):436–446. 204, 228 [37] Chio A, Schymick JC, Restagno G, Scholz SW, Lombardo F, Lai SL, Mora G, Fung HC, Britton A, Arepalli S, Gibbs JR, Nalls M, Berger S, Kwee LC, Oddone EZ, Ding J, Crews C, Rafferty I, Washecka N, Hernandez D, Ferrucci L, Bandinelli S, Guralnik J, Macciardi F, Torri F, Lupoli S, Chanock SJ, Thomas G, Hunter DJ, Gieger C, Wichmann HE, Calvo A, Mutani R, Battistini S, Giannini F, Caponnetto C, Mancardi GL, La Bella V, Valentino F, Monsurro MR, Tedeschi G, Marinou K, Sabatelli M, Conte A, Mandrioli J, Sola P, Salvi F, Bartolomei I, Siciliano G, Carlesi C, Orrell RW, Talbot K, Simmons Z, Connor J, Pioro EP, Dunkley T, Stephan DA, Kasperaviciute D, Fisher EM, Jabonka S, Sendtner M, Beck M, Bruijn L, Rothstein J, Schmidt S, Singleton A, Hardy J and Traynor BJ (2009). A two-stage genome-wide association study of sporadic amyotrophic lateral sclerosis. Human Molecular Genetics in press. doi:10.1093/hmg/ddp059. 199 [38] Choi SS, Vallender EJ and Lahn BT (2006). Systematically assessing the influence of 3-dimensional structural context on the molecular evolution of mammalian proteomes. Molecular Biology and Evolution 23(11):2131–2133. 114 [39] Codoñer FM and Fares MA (2008). Why should we care about molecular coevolution? Evolutionary Bioinformatics 4:237–246. 10, 30, 89, 116, 117 [40] Conaway MR (1990). A random effects model for binary data. Biometrics 46(2):317–328. 46 [41] Crawford H, Prado JG, Leslie A, Hue S, Honeyborne I, Reddy S, van der Stok M, Mncube Z, Brander C, Rousseau C, Mullins JI, Kaslow R, Goepfert P, Allen S, Hunter E, Mulenga J, Kiepiela P, Walker BD and Goulder PJR (2007).


Compensatory mutation partially restores fitness and delays reversion of escape mutation within the immunodominant HLA-B*5703-restricted Gag epitope in chronic human immunodeficiency virus type 1 infection. Journal of Virology 81(15):8346–8351. 20, 104, 105, 109, 110 [42] Dalmasso C, Broet P and Moreau T (2005). A simple procedure for estimating the false discovery rate. Bioinformatics 21(5):660–668. 207, 226, 227 [43] Dangl JL and Jones JD (2001). Plant pathogens and integrated defence responses to infection. Nature 411(6839):826–833. 68 [44] Deeks SG and Walker BD (2007). Human immunodeficiency virus controllers: mechanisms of durable virus control in the absence of antiretroviral therapy. Immunity 27(3):406–416. 19 [45] Deforche K, Silander T, Camacho R, Grossman Z, Soares MA, Van Laethem K, Kantor R, Moreau Y, Vandamme AM and on behalf of the non B Workgroup (2006). Analysis of HIV-1 pol sequences using Bayesian networks: implications for drug resistance. Bioinformatics 22(24):2975–2979. 15, 117 [46] Degnan JH and Rosenberg NA (2006). Discordance of species trees with their most likely gene trees. PLoS Genetics 2(5):e68. [47] Dekker JP, Fodor A, Aldrich RW and Yellen G (2004). A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments. Bioinformatics 20(10):1565–1572. 89 [48] Delport W, Scheffler K and Seoighe C (2008).

Frequent toggling between alternative amino acids is driven by selection in HIV-1. PLoS Pathogens 4(12):e1000242. 12


[49] Dempster A, Laird N and Rubin D (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society - Series B: Statistical Methodology 39(1):1–38. 44, 191 [50] Devlin B and Roeder K (1999). Genomic control for association studies. Biometrics 55(4):997–1004. 13 [51] Devlin B, Roeder K and Wasserman L (2001). Genomic control, a new approach to genetic-based association studies. Theoretical Population Biology 60(3):155–166. 13 [52] Dominicus A, Skrondal A, Gjessing HK, Pedersen NL and Palmgren J (2006). Likelihood ratio tests in behavioral genetics: problems and solutions. Behavior Genetics 36(2):331–340. [53] Draenert R, Allen TM, Liu Y, Wrin T, Chappey C, Verrill CL, Sirera G, Eldridge RL, Lahaie MP, Ruiz L, Clotet B, Petropoulos CJ, Walker BD and Martinez-Picado J (2006). Constraints on HIV-1 evolution and immunodominance revealed in monozygotic adult twins infected with the same virus. Journal of Experimental Medicine 203(3):529–539. 20, 21 [54] Draenert R, Le Gall S, Pfafferott KJ, Leslie AJ, Chetty P, Brander C, Holmes EC, Chang SC, Feeney ME, Addo MM, Ruiz L, Ramduth D, Jeena P, Altfeld M, Thomas S, Tang Y, Verrill CL, Dixon C, Prado JG, Kiepiela P, Martinez-Picado J, Walker BD and Goulder PJ (2004). Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. Journal of Experimental Medicine 199(7):905–915. 20, 99, 105, 109 [55] Duda A, Lee-Turner L, Fox J, Robinson N, Dustan S, Kaye S, Fryer H, Carrington M, McClure M, McLean AR, Fidler S, Weber J, Phillips RE, Frater AJ, and the SPARTAC Trial Investigators (2009). HLA-associated clinical progression


correlates with epitope reversion rates in early human immunodeficiency virus infection. Journal of Virology 83(3):1228–1239. [56] Edwards BH, Bansal A, Sabbaj S, Bakari J, Mulligan MJ and Goepfert PA (2002). Magnitude of functional CD8+ T-cell responses to the Gag protein of human immunodeficiency virus type 1 correlates inversely with viral load in plasma. Journal of Virology 76(5):2298–2305. 115 [57] Efron B, Tibshirani R, Storey J and Tusher V (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96(456):1151–1160. 196 [58] Evans DM and Cardon LR (2006). Genome-wide association: a promising start to a long race. Trends in Genetics 22(7):350–354. [59] Falush D, Stephens M and Pritchard JK (2003). Inference of population structure using multilocus genotype data linked loci and correlated allele frequencies. Genetics 164(4):1567–1587. [60] Felsenstein J (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6):368–376. 10, 34, 43, 44, 187 [61] Felsenstein J (1985). Phylogenies and the comparative method. The American Naturalist 125(1):1–15. 6, 7, 9, 23, 117 [62] Felsenstein J (2004). Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, MA. 9 [63] Felsenstein J (2005). PHYLIP (Phylogeny Inference Package) version 3.6. Tech. rep., Department of Genome Sciences, University of Washington, Seattle, WA. 58


[64] Felsenstein J (2005). Using the quantitative genetic threshold model for inferences between and within species. Philosophical Transactions of the Royal Society of London B Biological Sciences 360(1459):1427–1434. 10, 117 [65] Fischer W, Perkins S, Theiler J, Bhattacharya T, Yusim K, Funkhouser R, Kuiken C, Haynes B, Letvin NL, Walker BD, Hahn BH and Korber BT (2007). Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nature Medicine 13(1):100–106. 28 [66] Fisher RA (1922). On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society 85(1):87–94. 4, 196, 200 [67] Fodor AA and Aldrich RW (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56(2):211–221. 65 [68] Fossen T, Wray V, Bruns K, Rachmat J, Henklein P, Tessmer U, Maczurek A, Klinger P and Schubert U (2005). Solution structure of the human immunodeficiency virus type 1 p6 protein. Journal of Biological Chemistry 280(52):42515–42527. 66 [69] Frahm N, Kiepiela P, Adams S, Linde CH, Hewitt HS, Sango K, Feeney ME, Addo MM, Lichterfeld M, Lahaie MP, Pae E, Wurcel AG, Roach T, St John MA, Altfeld M, Marincola FM, Moore C, Mallal S, Carrington M, Heckerman D, Allen TM, Mullins JI, Korber BT, Goulder PJR, Walker BD and Brander C (2006). Control of human immunodeficiency virus replication by cytotoxic T lymphocytes targeting subdominant epitopes. Nature Immunology 7(2):173–178. [70] Frahm N, Linde C and Brander C (2006). Identification of HIV-derived, HLA


class I restricted CTL epitopes: insights into tcr repertoire, CTL escape and viral fitness. In HIV Molecular Immunology, Korber BTM, Brander C, Haynes BF, Koup R, Moore JP, Walker BD, and Watkins DI, Eds. Los Alamos National Laboratory, Theoretical Biology and Biophysics, pp. 03–28. 99, 109, 115, 116 [71] Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN and Altshuler D (2004). Assessing the impact of population stratification on genetic association studies. Nature Genetics 36(4):388–393. [72] Friedrich TC, Frye CA, Yant LJ, O’Connor DH, Kriewaldt NA, Benson M, Vojnov L, Dodds EJ, Cullen C, Rudersdorf R, Hughes AL, Wilson N and Watkins DI (2004). Extraepitopic compensatory substitutions partially restore fitness to simian immunodeficiency virus variants that escape from an immunodominant cytotoxic-T-lymphocyte response. Journal of Virology 78(5):2581–2585. 104 [73] Friedrich TC, Valentine LE, Yant LJ, Rakasz EG, Piaskowski SM, Furlott JR, Weisgrau KL, Burwitz B, May GE, Leon EJ, Soma T, Napoe G, Capuano I Saverio V., Wilson NA and Watkins DI (2007). Subdominant CD8+ T-cell responses are involved in durable control of AIDS virus replication. Journal of Virology 81(7):3465–3476. 114 [74] Fukami-Kobayashi K, Schreiber D and Benner S (2002). Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences. Journal of Molecular Biology 319(3):729–743. 11, 16 [75] Gao X, Bashirova A, Iversen AKN, Phair J, Goedert JJ, Buchbinder S, Hoots K, Vlahov D, Altfeld M, O’Brien SJ and Carrington M (2005). AIDS restriction HLA allotypes target distinct intervals of HIV-1 pathogenesis. Nature Medicine 11(12):1290–1292. 108


[76] Gaschen B, Taylor J, Yusim K, Foley B, Gao F, Lang D, Novitsky V, Haynes B, Hahn BH, Bhattacharya T and Korber B (2002). Diversity considerations in HIV-1 vaccine selection. Science 296(5577):2354–2360. 18, 28, 114 [77] Geldmacher C, Currier JR, Herrmann E, Haule A, Kuta E, McCutchan F, Njovu L, Geis S, Hoffmann O, Maboko L, Williamson C, Birx D, Meyerhans A, Cox J and Hoelscher M (2007). CD8 T-cell recognition of multiple epitopes within specific Gag regions is associated with maintenance of a low steady-state viremia in human immunodeficiency virus type 1-seropositive patients. Journal of Virology 81(5):2440–2448. 115 [78] Genovese C and Wasserman L (2004). A stochastic process approach to false discovery control. Annals of Statistics 32(3):1035–1061. 207, 227 [79] Gilbert PB (2005). A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Journal of the Royal Statistical Society - Series C: Applied Statistics 54(1):143–158. 198, 228 [80] Gobel U, Sander C, Schneider R and Valencia A (1994). Correlated mutations and residue contacts in proteins. Proteins 18(4):309–317. 10 [81] Goepfert PA, Lumm W, Farmer P, Matthews P, Prendergast A, Carlson JM, Derdeyn CA, Tang J, Kaslow RA, Bansal A, Yusim K, Heckerman D, Mulenga J, Allen S, Goulder PJR and Hunter E (2008). Transmission of HIV-1 Gag immune escape mutations is associated with reduced viral load in linked recipients. Journal of Experimental Medicine 205(5):1009–1017. 74 [82] Goulder P, Bunce M, Krausa P, McIntyre K, Crowley S, Morgan B, Edwards A, Giangrande P, Phillips R and McMichael A (1996). Novel, cross-restricted, conserved, and immunodominant cytotoxic T lymphocyte epitopes in slow progressors in HIV type 1 infection. AIDS Research and Human Retroviruses 12(18):1691–8. [83] Goulder PJ, Altfeld MA, Rosenberg ES, Nguyen T, Tang Y, Eldridge RL, Addo MM, He S, Mukherjee JS, Phillips MN, Bunce M, Kalams SA, Sekaly RP, Walker BD and Brander C (2001). Substantial differences in specificity of HIV-specific cytotoxic T cells in acute and chronic HIV infection. Journal of Experimental Medicine 193(2):181–194. 27, 108 [84] Goulder PJ, Phillips RE, Colbert RA, McAdam S, Ogg G, Nowak MA, Giangrande P, Luzzi G, Morgana B, Edwards A, McMichael AJ and Rowland-Jones S (1997). Late escape from an immunodominant cytotoxic T lymphocyte response associated with progression to AIDS. Nature Medicine 3(2):212–217. 20, 21, 26, 104 [85] Goulder PJR and Watkins DI (2004). HIV and SIV CTL escape: implications for vaccine design. Nature Reviews Immunology 4(8):630–640. 18, 19, 20, 21, 26, 105, 108, 113, 114 [86] Goulder PJR and Watkins DI (2008). Impact of MHC class I diversity on immune control of immunodeficiency virus replication. Nature Reviews Immunology 8(8):619–630. 19 [87] Guindon S and Gascuel O (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 52(5):696–704. 44, 57, 98 [88] Halperin I, Wolfson H and Nussinov R (2006). Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families. Proteins 63(4):832–845. 116


[89] Hansen TF (1997). Stabilizing selection and the comparative analysis of adaptation. Evolution 51(5):1341–1351. 9 [90] Hansen TF, Pienaar J, Orzack SH and Crandall K (2008). A comparative method for studying adaptation to a randomly evolving environment. Evolution 62(8):1965–1977. [91] Harrigan PR, Hogg RS, Dong WWY, Yip B, Wynhoven B, Woodward J, Brumme CJ, Brumme ZL, Mo T, Alexander CS and Montaner JSG (2005). Predictors of HIV drug-resistance mutations in a large antiretroviral-naive cohort initiating triple antiretroviral therapy. The Journal of Infectious Diseases 191(3):339–347. 197 [92] Harvey PH and Pagel MO (1991). The Comparative Method in Evolutionary Biology. Oxford University Press, New York, NY. 6, 9, 89, 117 [93] Heckerman D (1998). A tutorial on learning with Bayesian networks. In Learning in Graphical Models, Jordan MI, Ed. MIT Press, Cambridge, MA, pp. 301–354. 15, 46, 49, 187, 192 [94] Heckerman D, Chickering DM, Meek C, Rounthwaite R and Kadie C (2001). Dependency networks for inference, collaborative filtering, and data visualization. The Journal of Machine Learning Research 1:49–75. 2, 16, 31, 118 [95] Heckerman D, Kadie C and Listgarten J (2007). Leveraging information across HLA alleles/supertypes improves epitope prediction. Journal of Computational Biology 14(6):736–746. 99 [96] Helgason A, Yngvadottir B, Hrafnkelsson B, Gulcher J and Stefansson K (2005). An Icelandic example of the impact of population structure on association studies. Nature Genetics 37(1):90–95. 13


[97] Henderson CR (1984). Applications of linear models in animal breeding. University of Guelph, Guelph. 14 [98] Hill C, Worthylake D, Bancroft D, Christensen A and Sundquist W (1996). Crystal structures of the trimeric human immunodeficiency virus type 1 matrix protein: implications for membrane association and assembly. Proceedings of the National Academy of Sciences of the United States of America 93(7):3099–3104. 107 [99] Hirschhorn JN and Daly MJ (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6(2):95–108. [100] Honeyborne I, Prendergast A, Pereyra F, Leslie A, Crawford H, Payne R, Reddy S, Bishop K, Moodley E, Nair K, van der Stok M, McCarthy N, Rousseau CM, Addo M, Mullins JI, Brander C, Kiepiela P, Walker BD and Goulder PJR (2007). Control of human immunodeficiency virus type 1 is associated with HLA-B*13 and targeting of multiple Gag-specific CD8+ T-cell epitopes. Journal of Virology 81(7):3667–3672. 110 [101] Housworth EA, Martins EP and Lynch M (2004). The phylogenetic mixed model. The American Naturalist 163(1):84–96. 14 [102] Huelsenbeck JP, Nielsen R and Bollback JP (2003). Stochastic mapping of morphological characters. Systematic Biology 52(2):131–158. 11, 13, 16 [103] Huelsenbeck JP, Ronquist F, Nielsen R and Bollback JP (2001). Bayesian inference of phylogeny and its impact on evolutionary biology.

Science 294(5550):2310–2314. 126 [104] Hulten G, Chickering DM and Heckerman D (2001). Learning Bayesian networks from dependency networks: a preliminary study. In Proceedings of the


Ninth International Workshop on Artificial Intelligence and Statistics, Bishop CM and Frey BJ, Eds. 119, 130 [105] Iversen AKN, Stewart-Jones G, Learn GH, Christie N, Sylvester-Hviid C, Armitage AE, Kandl R, Beattie T, Lee JK, Li Y, Chotiyarnwong P, Dong T, Xu X, Luscher MA, MacDonald K, Ullum H, Klarlund-Pedersen B, Skinhoj P, Fugger L, Buus S, Mullins JI, Jones EY, van der Merwe PA and McMichael AJ (2006). Conflicting selective forces affect T cell receptor contacts in an immunodominant human immunodeficiency virus epitope. Nature Immunology 7(2):179–189. 104, 106, 110 [106] Jiao S and Zhang S (2008). On correcting the overestimation of the permutation-based false discovery rate estimator. Bioinformatics 24(15):1655. 196, 211, 212 [107] Jojic N, Jojic V, Frey B, Meek C and Heckerman D (2006). Using “epitomes” to model genetic diversity: rational design of HIV vaccine cocktails. In Advances in Neural Information Processing Systems 18, Weiss Y, Schölkopf B, and Platt J, Eds. MIT Press, Cambridge, MA, pp. 587–594. 28 [108] Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ and Eskin E (2008). Efficient control of population structure in model organism association mapping. Genetics 178(3):1709–1723. [109] Karlsson AC, Iversen A, Chapman JM, de Oliviera T, Spotts G, McMichael AJ, Davenport MP, Hecht FM and Nixon DF (2007). Sequential broadening of CTL responses in early HIV-1 infection is associated with viral escape. PLoS One 2(2):e225. 108 [110] Kaslow R, Carrington M, Apple R, Park L, Munoz A, Saah A, Goedert J, Winkler C, O’Brien S, Rinaldo C, Detels R, Blattner W, Phair J, Erlich H and Mann D (1996). Influence of combinations of human major histocompatibility


complex genes on the course of HIV-1 infection. Nature Medicine 2(4):405–411. 105 [111] Kass I and Horovitz A (2002). Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations. Proteins 48(4):611–617. [112] Kawashima Y, Pfafferott K, Frater J, Matthews P, Payne R, Addo M, Gatanaga H, Fujiwara M, Hachiya A, Koizumi H, Kuse N, Oka S, Duda A, Prendergast A, Crawford H, Leslie A, Brumme Z, Brumme C, Allen T, Brander C, Kaslow R, Tang J, Hunter E, Allen S, Mulenga J, Branch S, Roach T, John M, Mallal S, Ogwu A, Shapiro R, Prado JG, Fidler S, Weber J, Pybus OG, Klenerman P, Ndung’u T, Phillips R, Heckerman D, Harrigan PR, Walker BD, Takiguchi M and Goulder P (2009). Adaptation of HIV-1 to human leukocyte antigen class I. Nature in press. doi:10.1038/nature07746. [113] Kelleher AD, Long C, Holmes EC, Allen RL, Wilson J, Conlon C, Workman C, Shaunak S, Olson K, Goulder P, Brander C, Ogg G, Sullivan JS, Dyer W, Jones I, McMichael AJ, Rowland-Jones S and Phillips RE (2001). Clustered mutations in HIV-1 Gag are consistently required for escape from HLA-B27-restricted cytotoxic T lymphocyte responses. Journal of Experimental Medicine 193(3):375–386. 20, 21, 104 [114] Kennedy BW, Quinton M and van Arendonk JA (1992). Estimation of effects of single genes on quantitative traits. Journal of Animal Science 70(7):2000–2012. 13, 14 [115] Kiepiela P, Leslie AJ, Honeyborne I, Ramduth D, Thobakgale C, Chetty S, Rathnavalu P, Moore C, Pfafferott KJ, Hilton L, Zimbwa P, Moore S, Allen T, Brander C, Addo MM, Altfeld M, James I, Mallal S, Bunce M, Barber LD, Szinger J, Day C, Klenerman P, Mullins J, Korber B, Coovadia HM, Walker BD and Goulder PJR (2004). Dominant influence of HLA-B in mediating the


potential co-evolution of HIV and HLA. Nature 432(7018):769–775. 56, 105, 116 [116] Kiepiela P, Ngumbela K, Thobakgale C, Ramduth D, Honeyborne I, Moodley E, Reddy S, de Pierres C, Mncube Z, Mkhwanazi N, Bishop K, van der Stok M, Nair K, Khan N, Crawford H, Payne R, Leslie A, Prado J, Prendergast A, Frater J, McCarthy N, Brander C, Learn GH, Nickle D, Rousseau C, Coovadia H, Mullins JI, Heckerman D, Walker BD and Goulder P (2007). CD8 T-cell responses to different HIV proteins have discordant associations with viral load. Nature Medicine 13(1):46–53. 98, 105, 115, 116, 141, 198 [117] Kimmel G, Jordan M, Halperin E, Shamir R and Karp R (2007). A randomization test for controlling population stratification in whole-genome association studies. The American Journal of Human Genetics 81(5):895–905. [118] Klenerman P and McMichael A (2007). AIDS/HIV: finding footprints among the trees. Science 315(5818):1505. 24 [119] Kloetzel P (2001). Antigen processing by the proteasome. Nature Reviews Molecular Cell Biology 2(3):179–187. 64 [120] Korber B, Gaschen B, Yusim K, Thakallapally R, Kesmir C and Detours V (2001). Evolutionary and immunological implications of contemporary HIV-1 variation. British Medical Bulletin 58(1):19–42. [121] Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn B, Wolinsky S and Bhattacharya T (2000). Timing the ancestor of the HIV-1 pandemic strains. Science 288(5472):1789–1796. 18 [122] Korber BT, Farber RM, Wolpert DH and Lapedes AS (1993). Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope


protein: an information theoretic analysis. Proceedings of the National Academy of Sciences of the United States of America 90(15):7176–7180. 10, 89 [123] Koup R, Safrit J, Cao Y, Andrews C, McLeod G, Borkowsky W, Farthing C and Ho D (1994). Temporal association of cellular immune responses with the initial control of viremia in primary human immunodeficiency virus type 1 syndrome. Journal of Virology 68(7):4650–4655. 19 [124] Kuhner MK and Felsenstein J (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution 11(3):459–468. [125] Lander ES and Schork NJ (1994). Genetic dissection of complex traits. Science 265(5181):2037–2048. [126] Langaas M, Lindqvist BH and Ferkingstad E (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society - Series B: Statistical Methodology 67(4):555–572. 203, 207, 227 [127] Larson SM, Di Nardo AA and Davidson AR (2000). Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. Journal of Molecular Biology 303(3):433–446. 11 [128] Lawrence RW, Evans DM and Cardon LR (2005). Prospects and pitfalls in whole genome association studies. Philosophical Transactions of the Royal Society of London B Biological Sciences 360(1460):1589–1595. [129] Lee B, Nachmanson L, Robertson GG, Carlson J and Heckerman D (2008). Det. (distance encoded tree): a scalable visualization tool for mapping multiple traits


to large evolutionary trees. Tech. Rep. MSR-TR-2008-97, Microsoft Research. 91 [130] Leek JT and Storey JD (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9):e161. [131] Leslie A, Kavanagh D, Honeyborne I, Pfafferott K, Edwards C, Pillay T, Hilton L, Thobakgale C, Ramduth D, Draenert R, Le Gall S, Luzzi G, Edwards A, Brander C, Sewell AK, Moore S, Mullins J, Moore C, Mallal S, Bhardwaj N, Yusim K, Phillips R, Klenerman P, Korber B, Kiepiela P, Walker B and Goulder P (2005). Transmission and accumulation of CTL escape variants drive negative associations between HIV polymorphisms and HLA. Journal of Experimental Medicine 201(6):891–902. 21, 102, 115 [132] Leslie AJ, Pfafferott KJ, Chetty P, Draenert R, Addo MM, Feeney M, Tang Y, Holmes EC, Allen T, Prado JG, Altfeld M, Brander C, Dixon C, Ramduth D, Jeena P, Thomas SA, John AS, Roach TA, Kupfer B, Luzzi G, Edwards A, Taylor G, Lyall H, Tudor-Williams G, Novelli V, Martinez-Picado J, Kiepiela P, Walker BD and Goulder PJR (2004). HIV evolution: CTL escape mutation and reversion after transmission. Nature Medicine 10(3):282–289. 20, 21, 22, 36, 104, 110 [133] Letvin NL (2006). Progress and obstacles in the development of an AIDS vaccine. Nature Reviews Immunology 6(12):930–9. 19 [134] Lichterfeld M, Yu XG, Le Gall S and Altfeld M (2005). Immunodominance of HIV-1-specific CD8+ T-cell responses in acute HIV-1 infection: at the crossroads of viral and host genetics. Trends in Immunology 26(3):166–171. 26 [135] Listgarten J, Brumme Z, Kadie C, Xiaojiang G, Walker B, Carrington M, Goulder P and Heckerman D (2008). Statistical resolution of ambiguous HLA typing data. PLoS Computational Biology 4(2):e1000016. 116 [136] Listgarten J and Heckerman D (2007). Determining the number of non-spurious arcs in a learned DAG model: investigation of a Bayesian and a frequentist approach. In Proceedings of the 23rd Annual Conference on Uncertainty in Artificial Intelligence, UAI Press. 80 [137] Liu Y, McNevin J, Zhao H, Tebit DM, Troyer RM, McSweyn M, Ghosh AK, Shriner D, Arts EJ, McElrath MJ and Mullins JI (2007). Evolution of human immunodeficiency virus type 1 cytotoxic T-lymphocyte epitopes: fitness-balanced escape. Journal of Virology 81(22):12179–12188. 22, 27 [138] Lockless SW and Ranganathan R (1999). Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286(5438):295–299. 11, 89 [139] Lohmueller KE, Pearce CL, Pike M, Lander ES and Hirschhorn JN (2003). Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genetics 33(2):177–182. [140] Lynch M (1991). Methods for the analysis of comparative data in evolutionary biology. Evolution 45(5):1065–1080. 14 [141] Maddison DR (1990). Phylogenetic Inference of Historical Pathways and Models of Evolutionary Change. PhD thesis, Harvard University, Cambridge, MA. 10, 13, 16 [142] Malcolm BA, Wilson KP, Matthews BW, Kirsch JF and Wilson AC (1990). Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345(6270):86–89. 114 [143] Malim MH and Emerman M (2001). HIV-1 sequence variation. Cell 104(4):469–472. 18, 114


[144] Marchini J, Cardon LR, Phillips MS and Donnelly P (2004). The effects of human population structure on large genetic association studies. Nature Genetics 36(5):512–517. 13, 95 [145] Marchini J, Donnelly P and Cardon LR (2005).

Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 37(4):413–417. 13 [146] Martin LC, Gloor GB, Dunn SD and Wahl LM (2005). Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124. 89 [147] Martinez-Picado J, Prado JG, Fry EE, Pfafferott K, Leslie A, Chetty S, Thobakgale C, Honeyborne I, Crawford H, Matthews P, Pillay T, Rousseau C, Mullins JI, Brander C, Walker BD, Stuart DI, Kiepiela P and Goulder P (2006). Fitness cost of escape mutations in p24 Gag in association with control of human immunodeficiency virus type 1. Journal of Virology 80(7):3617–3623. 22, 105 [148] Martins EP (1996). Phylogenies and the Comparative Method in Animal Behavior. Oxford University Press, New York, NY. 117 [149] Martins EP (2000). Adaptation and the comparative method. Trends in Ecology & Evolution 15(7):296–299. 9, 117 [150] Martins EP, Diniz-Filho JAF and Housworth EA (2002). Adaptive constraints and the phylogenetic comparative method: a computer simulation test. Evolution 56(1):1–13. [151] Martins EP and Hansen TF (1997). Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. The American Naturalist 149(4):646–667.

[152] Matthews PC, Leslie AJ, Katzourakis A, Crawford H, Payne R, Prendergast A, Power K, Kelleher AD, Klenerman P, Carlson J, Heckerman D, Ndung’u T, Walker BD, Allen TM, Pybus OG and Goulder PJR (2009). HLA footprints on HIV-1 are associated with inter-clade polymorphisms and intra-clade phylogenetic clustering. Journal of Virology in press. doi:10.1128/JVI.02017-08. 127 [153] Matthews PC, Prendergast A, Leslie A, Crawford H, Payne R, Rousseau C, Rolland M, Honeyborne I, Carlson J, Kadie C, Brander C, Bishop K, Mlotshwa N, Mullins JI, Coovadia H, Ndung’u T, Walker BD, Heckerman D and Goulder PJR (2008). Central role of reverting mutations in HLA associations with human immunodeficiency virus set point. Journal of Virology 82(17):8548– 8559. 39, 73, 84, 86, 87, 90, 95, 104 [154] McMichael AJ and Rowland-Jones SL (2001). Cellular immune responses to HIV. Nature 410(6831):980–987. [155] Migueles SA, Sabbaghian MS, Shupert WL, Bettinotti MP, Marincola FM, Martino L, Hallahan CW, Selig SM, Schwartz D, Sullivan J and Connors M (2000). HLA B*5701 is highly associated with restriction of virus replication in a subgroup of HIV-infected long term nonprogressors. Proceedings of the National Academy of Sciences of the United States of America 97(6):2709–2714. 105 [156] Miyahira Y, Murata K, Rodriguez D, Rodriguez JR, Esteban M, Rodrigues MM and Zavala F (1995). Quantification of antigen specific CD8+ T cells using an ELISPOT assay. Journal of Immunological Methods 181(1):45–54. 198 [157] Moore CB, John M, James IR, Christiansen FT, Witt CS and Mallal SA (2002). Evidence of HIV-1 adaptation to HLA-restricted immune responses at a population level. Science 296(5572):1439–1443. 21, 22, 23, 24, 27, 29, 56, 63, 65, 84, 85, 86, 87, 90, 93, 95, 102, 113, 114, 115, 116

[158] Muse SV (1995). Evolutionary analyses of DNA sequences subject to constraints of secondary structure. Genetics 139(3):1429–1439. 10, 13, 16, 89 [159] Neher E (1994). How frequent are correlated changes in families of protein sequences? Proceedings of the National Academy of Sciences of the United
States of America 91(1):98–102. 11 [160] Nei M (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution 3(5):418–426. 12 [161] News in Brief (2007). HIV vaccine failure prompts merck to halt trial. Nature 449(7161):390. 19 [162] Ngumbela K, Day C, Mncube Z, Nair K, Ramduth D, Thobakgale C, Moodley E, Reddy S, de Pierres C, Mkhwanazi N, Bishop K, van der Stok M, Ismail N, Honeyborne I, Crawford H, Kavanagh D, Rousseau C, Nickle D, Mullins J, Heckerman D, Korber B, Coovadia H, Kiepiela P, Goulder P and Walker B (2008). Targeting of a CD8 T cell env epitope presented by HLA-B*5802 is associated with markers of HIV disease progression and lack of selection pressure. AIDS Research and Human Retroviruses 24(1):72–82. 116 [163] Nickle DC, Jensen MA, Gottlieb GS, Shriner D, Learn GH, Rodrigo AG, Mullins JI, Gao F, Bhattacharya T, Gaschen B, Taylor J, Moore JP, Novitsky V, Yusim K, Lang D, Foley B, Beddows S, Alam M, Haynes B, Hahn BH and Korber B (2003). Consensus and ancestral state HIV vaccines. Science 299(5612):1515c– 1518. 28 [164] Nickle DC, Rolland M, Jensen MA, Pond S, Deng W, Seligman M, Heckerman D, Mullins JI and Jojic N (2007). Coping with viral diversity in HIV vaccine design. PLoS Computational Biology 3(4):e75. 28

[165] Nielsen R and Yang Z (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148(3):929–936. 12 [166] Nietfeld W, Bauer M, Fevrier M, Maier R, Holzwarth B, Frank R, Maier B, Riviere Y and Meyerhans A (1995). Sequence constraints and recognition by CTL of an HLA-B27-restricted HIV-1 Gag epitope. The Journal of Immunology 154(5):2189–97. 104 [167] Nodelman U, Shelton C and Koller D (2005). Expectation maximization and complex duration distributions for continuous time bayesian networks. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence, UAI Press, pp. 431–44. 49, 192, 193 [168] Noivirt O, Eisenstein M and Horovitz A (2005). Detection and reduction of evolutionary noise in correlated mutation analysis. Protein Engineering Design and Selection 18(5):247–253. 10 [169] Nordborg M, Borevitz JO, Bergelson J, Berry CC, Chory J, Hagenblad J, Kreitman M, Maloof JN, Noyes T, Oefner PJ, Stahl EA and Weigel D (2002). The extent of linkage disequilibrium in Arabidopsis thaliana. Nature Genetics 30(2):190–193. 59 [170] Olmea O, Rost B and Valencia A (1999). Effective use of sequence correlation and conservation in fold recognition. Journal of Molecular Biology 293(5):1221–1239. [171] Pagel M (1994). Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Philosophical Transactions of the Royal Society of London B Biological Sciences 255(1342):37–45. 10, 11, 13, 16, 89

[172] Palella FJ, Delaney KM, Moorman AC, Loveless MO, Fuhrer J, Satten GA, Aschman DJ, Holmberg SD and Investigators THOS (1998). Declining morbidity and mortality among patients with advanced human immunodeficiency virus infection. New England Journal of Medicine 338(13):853–860. 1 [173] Pazos F, Helmer-Citterich M, Ausiello G and Valencia A (1997). Correlated mutations contain information about protein-protein interaction. Journal of Molecular Biology 271(4):511–523. 10 [174] Peters HO, Mendoza MG, Capina RE, Luo M, Mao X, Gubbins M, Nagelkerke NJD, MacArthur I, Sheardown BB, Kimani J, Wachihi C, S. T and Plummer FA (2008). An integrative bioinformatic approach for studying escape mutations in human immunodeficiency virus type 1 Gag in the pumwani sex worker cohort. Journal of Virology 82(4):1980–1992. 12, 114, 116 [175] Peyerl FW, Barouch DH, Yeh WW, Bazick HS, Kunstman J, Kunstman KJ, Wolinsky SM and Letvin NL (2003). Simian-human immunodeficiency virus escape from cytotoxic T-lymphocyte recognition at a structurally constrained epitope. Journal of Virology 77(23):12572. 104 [176] Peyerl FW, Bazick HS, Newberg MH, Barouch DH, Sodroski J and Letvin NL (2004). Fitness costs limit viral escape from cytotoxic T lymphocytes at a structurally constrained epitope. Journal of Virology 78(24):13901. 104 [177] Pollock DD and Taylor WR (1997). Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Engineering 10(6):647–657. 10 [178] Pollock DD, Taylor WR and Goldman N (1999). Coevolving protein residues: maximum likelihood identification and relationship to structure. Journal of Molecular Biology 287(1):187–198. 11, 13, 16, 17, 51, 52, 65, 89

[179] Poon A and Chao L (2005). The rate of compensatory mutation in the DNA bacteriophage φX174. Genetics 170(3):989–999. 106, 108, 114 [180] Poon A, Davis BH and Chao L (2005). The coupon collector and the suppressor mutation: estimating the number of compensatory mutations by maximum likelihood. Genetics 170(3):1323–1332. [181] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). Evolutionary interactions between N-linked glycosylation sites in the HIV-1 envelope. PLoS Computational Biology 3(1):e11. 11, 13, 15 [182] Poon AFY, Lewis FI, Pond SLK and Frost SDW (2007). An evolutionarynetwork model reveals stratified interactions in the V3 loop of the HIV-1 Envelope. PLoS Computational Biology 3(11):e231. 11, 15, 16, 89, 117, 129 [183] Pounds S and Cheng C (2006). Robust estimation of the false discovery rate. Bioinformatics 22(16):1979. 204, 209, 221, 227 [184] Press W, Teukolsky S, Vetterling W and Flannery B (1992). Numerical Recipes in C. Cambridge University Press, New York, NY. 52, 53 [185] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA and Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8):904–909. 13 [186] Pritchard JK, Stephens M, Rosenberg NA and Donnelly P (2000). Association mapping in structured populations. The American Journal of Human Genetics 67(1):170–181. 13 [187] Pritchard L, Bladon P, M O Mitchell J and J Dufton M (2001). Evaluation of a novel method for the identification of coevolving protein residues. Protein Engineering 14(8):549–555. 89

[188] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ and Sham PC (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3):559–575. [189] Reiner AP, Ziv E, Lind DL, Nievergelt CM, Schork NJ, Cummings SR, Phong A, Burchard EG, Harris TB, Psaty BM and Kwok PY (2005). Population structure, admixture, and aging-related phenotypes in African American adults: the Cardiovascular Health Study. The American Journal of Human Genetics 76(3):463–477. [190] Ridley M (1983). The Explanation of Organic Diversity: The Comparative Method and Adaptations for Mating. Oxford University Press, Oxford. 10, 13, 15, 16, 30, 89 [191] Risch NJ (2000). Searching for genetic determinants in the new millennium. Nature 405(6788):847–856. [192] Rodrigue N, Lartillot N, Bryant D and Philippe H (2005). Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347(2):207–217. [193] Rodriguez F, Harkins S, Slifka MK and Whitton JL (2002). Immunodominance in virus-induced CD8+ T-cell responses is dramatically modified by DNA immunization and is regulated by gamma interferon. Journal of Virology 76(9):4251. [194] Rolland M, Nickle DC and Mullins JI (2007). HIV-1 group M conserved elements vaccine. PLoS Pathogens 3(11):e157. 73 [195] Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA and Feldman MW (2002). Genetic structure of human populations. Science 298(5602):2381–2385. 13

[196] Rousseau CM, Daniels MG, Carlson JM, Kadie C, Crawford H, Prendergast A, Matthews P, Payne R, Rolland M, Raugi DN, Maust BS, Learn GH, Nickle DC, Coovadia H, Ndung’u T, Frahm N, Brander C, Walker BD, Goulder PJR, Bhattacharya T, Heckerman DE, Korber BT and Mullins JI (2008). HLA class I-driven evolution of human immunodeficiency virus type 1 subtype C proteome: immune escape and viral load. Journal of Virology 82(13):6434–6446. 26, 27, 72, 75, 77, 84, 86, 87, 90, 95, 98, 99, 110, 113, 114, 115, 116, 198 [197] Sanchez-Merino V, Farrow M, Brewster F, Somasundaran M and Luzuriaga K (2008). Identification and characterization of HIV-1 CD8+ T cell escape variants with impaired fitness. The Journal of Infectious Diseases 197(2):300–8. [198] Satten GA, Flanders WD and Yang Q (2001). Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. The American Journal of Human Genetics 68(2):466–477. 13 [199] Schmitz JE, Kuroda MJ, Santra S, Sasseville VG, Simon MA, Lifton MA, Racz P, Tenner-Racz K, Dalesandro M, Scallon BJ, Ghrayeb J, Forman MA, Montefiori DC, Rieber EP, Letvin NL and Reimann KA (1999). Control of viremia in simian immunodeficiency virus infection by CD8+ lymphocytes. Science 283(5403):857–860. 19 [200] Schneidewind A, Brockman MA, Sidney J, Wang YE, Chen H, Suscovich TJ, Li B, Adam RI, Allgaier RL, Mothe BR, Kuntzen T, Oniangue-Ndza C, Trocha A, Yu XG, Brander C, Sette A, Walker BD and Allen TM (2008). Structural and functional constraints limit options for cytotoxic T-lymphocyte escape in the immunodominant HLA-B27-restricted epitope in human immunodeficiency virus type 1 capsid. Journal of Virology 82(11):5594–5605. 104, 105, 110, 114 [201] Schneidewind A, Brockman MA, Yang R, Adam RI, Li B, Le Gall S, Rinaldo CR, Craggs SL, Allgaier RL, Power KA, Kuntzen T, Tung CS, LaBute MX, Mueller SM, Harrer T, McMichael AJ, Goulder PJR, Aiken C, Brander C, Kelleher AD and Allen TM (2007). Escape from the dominant HLA-B27-restricted cytotoxic T-lymphocyte response in Gag is associated with a dramatic reduction in human immunodeficiency virus type 1 replication. Journal of Virology 81(22):12382–12393. 20, 21, 29, 104, 105, 110, 113, 114 [202] Schneidewind A, Brumme ZL, Brumme CJ, Power KA, Reyor LL, O’Sullivan K, Gladden A, Hempel U, Kuntzen T, Wang YE, Oniangue-Ndza C, Jessen H, Markowitz M, Rosenberg ES, Sekaly RP, Kelleher AD, Walker BD and Allen TM (2008). Transmission and long-term stability of compensated CD8 escape mutations. Journal of Virology in press. doi:10.1128/JVI.01108-08. [203] Self SG and Liang KY (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 82(398):605–610. [204] Setakis E, Stirnadel H and Balding DJ (2006). Logistic regression protects against population structure in genetic association studies. Genome Research 16(2):290–296. 13 [205] Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL and Mullins JI (1999). Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73(12):10489–10502. 18 [206] Sinha H, Nicholson BP, Steinmetz LM and McCusker JH (2006). Complex genetic interactions in a quantitative trait locus. PLoS Genetics 2(2):e13.

[207] Stewart JJ, Watts P and Litwin S (2001). An algorithm for mapping positively selected members of quasispecies-type viruses. BMC Bioinformatics 2(1):1. 12 [208] Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society - Series B: Statistical Methodology 64(3):479–498. 196, 202, 203, 206, 208, 216, 221, 226 [209] Storey JD (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 31(6):2013–2035. 196, 205, 208, 209, 221, 226 [210] Storey JD, Taylor JE and Siegmund D (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society - Series B: Statistical Methodology 66(1):187–205. 203 [211] Storey JD and Tibshirani R (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100(16):9440–9445. 42, 196, 203, 209 [212] Streeck H, Lichterfeld M, Alter G, Meier A, Teigen N, Yassine-Diab B, Sidhu HK, Little S, Kelleher A, Routy JP, Rosenberg ES, Sekaly RP, Walker BD and Altfeld M (2007). Recognition of a defined region within p24 Gag by CD8+ T cells during primary human immunodeficiency virus type 1 infection in individuals expressing protective HLA class I alleles. Journal of Virology 81(14):7725–7731. [213] Suel GM, Lockless SW, Wall MA and Ranganathan R (2003). Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nature Structural and Molecular Biology 10(1):59–69. 11

[214] Tang C, Ndassa Y and Summers MF (2002). Structure of the N-terminal 283-residue fragment of the immature HIV-1 Gag polyprotein. Nature Structural and Molecular Biology 9(7):537–543. 105, 106, 107 [215] Taylor WR and Hatrick K (1994). Compensating changes in protein multiple sequence alignments. Protein Engineering 7(3):341–348. 10 [216] Thomas DC, Haile RW and Duggan D (2005). Recent developments in
genomewide association scans: a workshop summary and review. The American Journal of Human Genetics 77(3):337–345. [217] Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D and Buckler ES (2001). Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28(3):286–289. 13 [218] Tillier ERM and Lui TWH (2003). Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755. [219] Turnbull EL, Lopes AR, Jones NA, Cornforth D, Newton P, Aldam D, Pellegrino P, Turner J, Williams I, Wilson CM, Goepfert PA, Maini MK and Borrow P (2006). HIV-1 epitope-specific CD8+ T cell responses strongly associated with delayed disease progression cross-recognize epitope variants efficiently. The Journal of Immunology 176(10):6130–6146. 27 [220] Ueno T, Motozono C, Dohki S, Mwimanzi P, Rauch S, Fackler OT, Oka S and Takiguchi M (2008). CTL-mediated selective pressure influences dynamic evolution and pathogenic functions of HIV-1 Nef. The Journal of Immunology 180(2):1107. [221] UNAIDS/WHO (2007). AIDS epidemic update. http://www.unaids.org. 1

[222] Vogel TU, Horton H, Fuller DH, Carter DK, Vielhuber K, O’Connor DH, Shipley T, Fuller J, Sutter G, Erfle V, Wilson N, Picker LJ and Watkins DI (2002). Differences between T cell epitopes recognized after immunization and after infection. The Journal of Immunology 169(8):4511–4521. [223] Voight BF and Pritchard JK (2005). Confounding from cryptic relatedness in case-control association studies. PLoS Genetics 1(3):e32. 13 [224] Wang Y and DeLisi C (2006). Inferring protein-protein interactions in viral proteins by co-evolution of conserved side chains. Genome 17(1):23–35. 116 [225] Wang YE, Li B, Carlson JM, Streeck H, Gladden AD, Goodman R, Schneidewind A, Power KA, Toth I, Frahm N, Alter G, Brander C, Carrington M, Walker BD, Altfeld M, Heckerman D and Allen TM (2009). Protective HLA class I alleles that restrict acute-phase CD8+ T-cell responses are associated with viral escape mutations located in highly conserved regions of human immunodeficiency virus type 1. Journal of Virology 83(4):1845–1855. 73 [226] Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong PD and Shaw GM (2003). Antibody neutralization and escape by HIV-1. Nature 422(6929):307–312. 18 [227] Williams SG and Lovell SC (2009). The effect of sequence evolution on protein structural divergence. Molecular Biology and Evolution in press. doi:10.1093/molbev/msp020. [228] Wollenberg KR and Atchley WR (2000). Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Sciences of the United States of America 97(7):3288–3291. 10, 11

[229] Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, Bunce M, Muyembe JJ, Kabongo JMM, Kalengayi RM, Van Marck E, Gilbert MTP and Wolinsky SM (2008). Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455(7213):661–664. 18, 148 [230] Xie Y, Pan W and Khodursky AB (2005). A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics 21(23):4280–4288. 196, 211, 212 [231] Yang Z, Nielsen R, Goldman N and Pedersen AMK (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics
155(1):431–449. 12 [232] Yanofsky C, Horn V and Thorpe D (1964). Protein structure relationships revealed by mutational analysis. Science 146(3651):1593–1594. 114 [233] Yeang CH and Haussler D (2007). Detecting coevolution in and among protein domains. PLoS Computational Biology 3(11):e211. 11 [234] Yeh WW, Cale EM, Jaru-Ampornpan P, Lord CI, Peyerl FW and Letvin NL (2006). Compensatory substitutions restore normal core assembly in simian immunodeficiency virus isolates with Gag epitope cytotoxic T-lymphocyte escape mutations. Journal of Virology 80(16):8168–8177. 104 [235] Yewdell J (2005). The seven dirty little secrets of major histocompatibility complex class I antigen processing. Immunological Reviews 207(1):8–18. [236] Yewdell JW (2006). Confronting complexity: real-world immunodominance in antiviral CD8+ T cell responses. Immunity 25(4):533–543. 26 [237] Yewdell JW and Bennink JR (1999). Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses. Annual Review of Immunology 17(1):51–88. 26, 108

[238] Yewdell JW and Haeryfar SM (2005). Understanding presentation of viral antigens to CD8+ T cells in vivo: the key to rational vaccine design. Annual Review of Immunology 23(1):651–682. [239] Yokomaku Y, Miura H, Tomiyama H, Kawana-Tachikawa A, Takiguchi M, Kojima A, Nagai Y, Iwamoto A, Matsuda Z and Ariyoshi K (2004). Impaired processing and presentation of cytotoxic-T-lymphocyte (CTL) epitopes are major escape mechanisms from CTL immune pressure in human immunodeficiency virus type 1 infection. Journal of Virology 78(3):1324. 20 [240] Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S and Buckler ES (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38(2):203–208. 13 [241] Yu XG, Lichterfeld M, Chetty S, Williams KL, Mui SK, Miura T, Frahm N, Feeney ME, Tang Y, Pereyra F, LaBute MX, Pfafferott K, Leslie A, Crawford H, Allgaier R, Hildebrand W, Kaslow R, Brander C, Allen TM, Rosenberg ES, Kiepiela P, Vajpayee M, Goepfert PA, Altfeld M, Goulder PJR and Walker BD (2007). Mutually Exclusive T-Cell Receptor Induction and Differential Susceptibility to Human Immunodeficiency Virus Type 1 Mutational Escape Associated with a Two-Amino-Acid Difference between HLA Class I Subtypes. Journal of Virology 81(4):1619–1631. 14, 110 [242] Zhang J and Rosenberg HF (2002). Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates. Proceedings of the National Academy of Sciences of the United States of America 99(8):5486–5491. 114 [243] Zuñiga R, Lucchetti A, Galvan P, Sanchez S, Sanchez C, Hernandez A, Sanchez H, Frahm N, Linde CH, Hewitt HS, Hildebrand W, Altfeld M, Allen TM, Walker BD, Korber BT, Leitner T, Sanchez J and Brander C (2006). Relative dominance of Gag p24-specific cytotoxic T lymphocytes is associated with human immunodeficiency virus control. Journal of Virology 80(6):3122–3125. 115

Appendix A

NEXT GENERATION SEQUENCING: EXTENDING THE MODEL TO SINGLE GENOME SEQUENCES

In this appendix, we briefly outline the likelihood calculation and EM steps for the Single Genome Sequencing (SGS) case, which was introduced and motivated in subsection 8.1.5. These calculations follow directly from known methods (see, e.g., [93]), but we provide a derivation here to clarify the model. A thorough exploration of the behavior of the model on real and synthetic data is left for future work.

A.1 Likelihood calculation

The calculation of likelihood on a tree is a well studied problem, first made popular in the phylogenetics community by Felsenstein [60]. Although we heavily use this approach in all of our models, we have thus far omitted a thorough discussion of likelihood calculation as it is generally familiar to the community. We will find it useful here, however, to briefly review the likelihood calculation.

A.1.1 Notation

For a given phylogeny $\Psi$, observed target variables $D = Y_1, \ldots, Y_N$, observed predictor variables $X = X_1, \ldots, X_n$, and model parameters $\theta$, the likelihood of the model with respect to $\theta$ is written

\[
L_\Psi(D \mid \theta, X) = \Pr\{D \mid \theta, \Psi, X\}. \tag{A.1}
\]

As the tree is held constant throughout, we will drop $\Psi$ for the rest of this discussion. The goal is first to compute $L(D \mid \theta, X)$, then to find the $\theta$ that maximizes $L$.

First, let us introduce some notation. Let $\{V_k\}$ represent the set of nodes in the phylogeny. If we are considering the model of conditional adaptation, the $\{V_k\}$ will include the hidden nodes described in chapter 4. The ordering is arbitrary, but for convenience, let $V_0$ be the root (which is itself arbitrary, as the models we consider are reversible), and let $V_1, \ldots, V_N$ denote the nodes corresponding to variables $Y_1, \ldots, Y_N$ (as previously, we will sometimes simply refer to node $V_i$, $1 \le i \le N$, by its corresponding variable $Y_i$). We denote the parent of a node $V_k$ as $p(V_k)$ and the set of children of $V_k$ as $c(V_k)$. The observed data $D$ consist of the values of the target and predictor attributes for each individual $i$, each of which corresponds to a leaf on the tree. Let $D(V_k)$ represent the data belonging to the subset of individuals in the subtree rooted at node $V_k$. Recall that the parameters of our model are $\theta = (\pi, \lambda, s)$.

A.1.2 Likelihood calculation in the unlinked case

Given this notation, the likelihood can be written as

\[
L(D \mid \theta, X) = \sum_{v \in \{0,1\}} \Pr\{V_0 = v \mid \theta\} \prod_{V_c \in c(V_0)} L(D(V_c) \mid \theta, X, V_0 = v), \tag{A.2}
\]

where $L(D(V_k) \mid \theta, X, V_0 = v)$ is the likelihood of the subtree rooted at $V_k$ conditioned on $V_0 = v$. Note that $\Pr\{V_0 = 1 \mid \theta\} = \pi$ and $\Pr\{V_0 = 0 \mid \theta\} = 1 - \pi$. The conditional likelihood of an arbitrary node given its parent can be recursively defined as

\[
L(D(V_k) \mid \theta, X, p(V_k) = p) = \sum_{v \in \{0,1\}} \Pr\{V_k = v \mid \theta, p(V_k) = p\} \prod_{V_c \in c(V_k)} L(D(V_c) \mid \theta, X, V_k = v) \tag{A.3}
\]

if $V_k$ is a branch node. Note that $\Pr\{V_k = v \mid p(V_k) = p, \theta\}$ is the transition probability and is a function of the parameters $\pi$ and $\lambda$ and the branch length $t_k$ between $V_k$ and its parent. If $V_k$ is a leaf in the phylogeny, then the likelihood is given by the leaf distribution as defined in chapter 4, which we write as

\[
L(D(V_i) \mid \theta, X, p(V_i) = p) = \sum_{y \in \{0,1\}} \Pr\{Y = y \mid \theta, X, H = p\} \, \mathbf{1}\{Y_i = y\}, \tag{A.4}
\]

where the right hand side of the equation is the familiar leaf distribution, which computes the probability that the trait $Y$ will be $y$ given that the corresponding hidden node is $p$. Note that we only want to sum the probabilities for the observed value of $Y_i$. This condition is provided by the indicator function $\mathbf{1}\{\cdot\}$, which evaluates to 1 if the condition is true and 0 otherwise. Thus, the likelihood is simply equal to the leaf distribution evaluated at the observed value of trait $Y$.
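The recursion in Equations (A.2) through (A.4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the `Node` class, the function names, and the `noisy_leaf` placeholder (standing in for the leaf distribution of chapter 4) are all hypothetical, and `transition_prob` assumes the standard two-state reversible process with stationary probability pi and rate lambda.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical phylogeny node; leaves carry the observed Y_i and X_i."""
    branch_length: float = 0.0               # t_k, distance to the parent
    children: List["Node"] = field(default_factory=list)
    y: Optional[int] = None                  # observed target (leaves only)
    x: Optional[int] = None                  # observed predictor (leaves only)


def transition_prob(v: int, p: int, pi: float, lam: float, t: float) -> float:
    """Pr{V_k = v | p(V_k) = p}. ASSUMPTION: two-state reversible process
    with stationary Pr{1} = pi and rate lam."""
    stationary = pi if v == 1 else 1.0 - pi
    same = 1.0 if v == p else 0.0
    return stationary + (same - stationary) * math.exp(-lam * t)


def subtree_likelihood(node, p, pi, lam, leaf_prob) -> float:
    """L(D(V_k) | theta, X, p(V_k) = p)."""
    if not node.children:
        # Equation (A.4): leaf distribution at the observed value of Y_i.
        return leaf_prob(node.y, p, node.x)
    total = 0.0
    for v in (0, 1):  # Equation (A.3): sum over the state of V_k
        term = transition_prob(v, p, pi, lam, node.branch_length)
        for child in node.children:
            term *= subtree_likelihood(child, v, pi, lam, leaf_prob)
        total += term
    return total


def tree_likelihood(root, pi, lam, leaf_prob) -> float:
    """Equation (A.2): sum over the root state, weighted by its prior."""
    total = 0.0
    for v in (0, 1):
        term = pi if v == 1 else 1.0 - pi
        for child in root.children:
            term *= subtree_likelihood(child, v, pi, lam, leaf_prob)
        total += term
    return total


def noisy_leaf(y: int, h: int, x: Optional[int]) -> float:
    """Placeholder leaf distribution (NOT the model of chapter 4):
    Y copies its hidden node H with probability 0.9, ignoring X."""
    return 0.9 if y == h else 0.1


def two_leaf_tree(y1: int, y2: int) -> Node:
    """Root with two hidden nodes, each the parent of one observed leaf."""
    h1 = Node(branch_length=0.5, children=[Node(y=y1, x=0)])
    h2 = Node(branch_length=1.5, children=[Node(y=y2, x=0)])
    return Node(children=[h1, h2])


if __name__ == "__main__":
    print(tree_likelihood(two_leaf_tree(1, 0), pi=0.3, lam=1.0, leaf_prob=noisy_leaf))
```

Because every factor is a proper conditional distribution, the likelihoods of all possible leaf assignments sum to one, which gives a quick sanity check on any implementation of the recursion.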

A.1.3 Likelihood calculation in the SGS case

Now consider the case where our data consist of SGS sequences, such that we have multiple HIV sequences for each individual. As before, our first step will be to construct a phylogeny for all the sequences. Due to the rapid rate of evolution of HIV, we expect that sequences sampled from a given individual will form a subtree within the larger tree, though this is not guaranteed. For simplicity, however, we will assume this is the case, though the derivation discussed in this appendix can be easily generalized to avoid this assumption.

Previously, we had one HIV sequence per individual. When considering the target trait $Y$ and the predictor trait $X$, we therefore labeled individual variables $Y_i$ and $X_i$ for each individual $i$. Now, each individual has multiple sequences. Therefore, define $Y_{i,g}$ to be the variable corresponding to the $g$th sequence from the $i$th individual, with $H_{i,g}$ denoting the corresponding hidden variable. For simplicity, we will focus on the univariate model with the trait $X$ corresponding to an HLA allele (though the results generalize to Noisy Add as well). Therefore, we will continue to index $X$ as $X_i$.

Consider Figure 8.1 on page 134. The addition of the linked nodes $L$ has two implications. First, it changes the leaf distribution calculation to Equation (8.1), as discussed in subsection 8.1.5. Second, our likelihood calculation must now integrate out the linked nodes. Intuitively, this can be done by first fixing the values of the linked nodes, then computing the likelihood. We then simply compute the weighted average of the likelihoods under every possible assignment of linked nodes, weighted by the prior probability of a given configuration, which is given by the parameter $s_1$ and the observed predictor variables $X$.

To make this explicit, we can expand Equation (A.1) around the linked nodes. Specifically, let $L_1, \ldots, L_N$ be the set of variables representing the linked trait $L$ for each individual $i \in \{1, \ldots, N\}$. Then the likelihood can be written as

\[
L(D \mid \theta, X) = \sum_{l_1 \in \{0,1\}} \cdots \sum_{l_N \in \{0,1\}} L(D \mid \theta, X, L_1 = l_1, \ldots, L_N = l_N) \times \Pr\{L_1 = l_1, \ldots, L_N = l_N \mid X\}. \tag{A.5}
\]

The structure of the graph implies that each linked node $L_i$ is conditionally independent of all other linked nodes, given $X$. Thus, we can compute

\[
\Pr\{L_1 = l_1, \ldots, L_N = l_N \mid X\} = \prod_i \Pr\{L_i = l_i \mid X_i\}, \tag{A.6}
\]

where

\[
\Pr\{L_i = l_i \mid X_i\} =
\begin{cases}
s_1 & \text{if } X_i = 1 \text{ and } l_i = 1 \\
1 - s_1 & \text{if } X_i = 1 \text{ and } l_i = 0 \\
0 & \text{if } X_i = 0 \text{ and } l_i = 1 \\
1 & \text{if } X_i = 0 \text{ and } l_i = 0
\end{cases}
\tag{A.7}
\]
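As a small illustration of Equations (A.6) and (A.7), the following Python sketch evaluates the prior over linked-node configurations. The function names are hypothetical and the value of s1 is arbitrary; the enumeration at the end also makes concrete that the outer sum in Equation (A.5) ranges over 2^N configurations.

```python
from itertools import product
from typing import Sequence


def linked_prior(l: int, x: int, s1: float) -> float:
    """Pr{L_i = l | X_i = x}, Equation (A.7)."""
    if x == 1:
        return s1 if l == 1 else 1.0 - s1
    # Without the predictor (X_i = 0), the linked node is always off.
    return 0.0 if l == 1 else 1.0


def joint_linked_prior(ls: Sequence[int], xs: Sequence[int], s1: float) -> float:
    """Pr{L_1 = l_1, ..., L_N = l_N | X}, Equation (A.6)."""
    prob = 1.0
    for l, x in zip(ls, xs):
        prob *= linked_prior(l, x, s1)
    return prob


# Three individuals, two of whom carry the predictor (illustrative values).
xs = [1, 0, 1]
s1 = 0.3
configurations = list(product((0, 1), repeat=len(xs)))  # 2^N assignments
prior_mass = sum(joint_linked_prior(ls, xs, s1) for ls in configurations)
```

Only configurations with l_i = 0 wherever X_i = 0 receive nonzero prior mass, so many terms of Equation (A.5) vanish, but the number of remaining terms still grows exponentially in the number of individuals with X_i = 1.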

A straightforward computation of Equation (A.5) would involve an exponential number of terms. However, the independencies implied by the tree allow us to simplify the expression in terms of subtrees, much as we did in Equation (A.3). Notice how each linked node points to a set of individuals. Let $M_i$ be the most recent common ancestor (MRCA) of the leaves pointed to by $L_i$, where the MRCA is defined to be the root of the smallest subtree (i.e., the subtree with the fewest nodes) that contains all of the leaves in question. For simplicity, suppose each linked node corresponds to a unique MRCA. That is, for $i \neq j$, $M_i \neq M_j$. Further, suppose that for all $i \neq j$, $M_i$ is not a descendant of $M_j$, meaning $M_j$ is not on the path from $M_i$ to the root node. This will typically be the case for the univariate model, though it clearly is not the case for Noisy Add. Nevertheless, the results easily generalize and this assumption will simplify the following discussion. Given these assumptions, the (in)dependencies implied by the phylogeny (Figure 8.1) imply that

\[
L(D(M_i) \mid \theta, X, L_1, \ldots, L_N) = L(D(M_i) \mid \theta, X, L_i). \tag{A.8}
\]

That is, the likelihood of the subtree of $M_i$ is conditionally independent of all linked nodes except $L_i$. Thus, the conditional likelihood of the subtree rooted at $M_i$ is easily computed using

\[
L(D(M_i) \mid \theta, X, p(M_i) = p) = \sum_{l \in \{0,1\}} L(D(M_i) \mid \theta, X, p(M_i) = p, L_i = l) \, \Pr\{L_i = l \mid X_i\}, \tag{A.9}
\]

where

\[
L(D(V_i) \mid \theta, X, p(V_i) = p, L_i = l) =
\begin{cases}
\displaystyle \sum_{y \in \{0,1\}} \Pr\{Y = y \mid \theta, X_i, H = p, L_i = l\} \, \mathbf{1}\{Y_i = y\} & \text{if } V_i \text{ is a leaf} \\[2ex]
\displaystyle \sum_{v \in \{0,1\}} \Pr\{V_i = v \mid \theta, p(V_i) = p\} \prod_{V_c \in c(V_i)} L(D(V_c) \mid \theta, X_i, V_i = v, L_i = l) & \text{otherwise}
\end{cases}
\tag{A.10}
\]

This observation allows us to use each $M_i$ as a cut node. That is, the likelihood calculation proceeds as in the unlinked case, except when an MRCA node is encountered. When an MRCA is encountered, we calculate the likelihood of the subtree rooted at the MRCA using Equation (A.9).
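The saving from treating each MRCA as a cut node can be checked numerically. The sketch below is a simplified illustration rather than the full tree recursion: it assumes the parent states are held fixed, so that the conditional likelihood of each individual's subtree given its linked node is just a number (the values in `subtree_lik` are made up), and it compares the brute-force sum over all 2^N configurations with the product of the per-subtree sums of Equation (A.9). All names are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def linked_prior(l: int, x: int, s1: float) -> float:
    """Pr{L_i = l | X_i = x}, Equation (A.7)."""
    if x == 1:
        return s1 if l == 1 else 1.0 - s1
    return 0.0 if l == 1 else 1.0


def mrca_likelihood(lik_given_l: Tuple[float, float], x: int, s1: float) -> float:
    """Equation (A.9): marginalize L_i out of one MRCA subtree.
    lik_given_l[l] stands for L(D(M_i) | theta, X, p(M_i) = p, L_i = l)
    at one fixed parent state p."""
    return sum(lik_given_l[l] * linked_prior(l, x, s1) for l in (0, 1))


def brute_force(subtree_lik: Sequence[Tuple[float, float]],
                xs: Sequence[int], s1: float) -> float:
    """Sum over every joint assignment (l_1, ..., l_N): 2^N terms."""
    total = 0.0
    for ls in product((0, 1), repeat=len(xs)):
        term = 1.0
        for lik, l, x in zip(subtree_lik, ls, xs):
            term *= lik[l] * linked_prior(l, x, s1)
        total += term
    return total


def factorized(subtree_lik: Sequence[Tuple[float, float]],
               xs: Sequence[int], s1: float) -> float:
    """Product of N two-term sums, one per cut node."""
    total = 1.0
    for lik, x in zip(subtree_lik, xs):
        total *= mrca_likelihood(lik, x, s1)
    return total


# Made-up conditional likelihoods (for l = 0 and l = 1) of three subtrees.
subtree_lik = [(0.20, 0.05), (0.10, 0.40), (0.30, 0.30)]
xs = [1, 0, 1]
```

The two functions agree, but `factorized` touches 2N terms instead of 2^N, which is the reason the cut-node formulation keeps the SGS likelihood tractable.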

A.2 Expectation maximization

Having described the computation of the likelihood under the linked model, and assuming the reader is familiar with expectation maximization (EM) [49] on trees, we can now briefly describe the EM procedure for the linked model. Recall that EM consists of two steps: the Expectation (E) step, in which the expected values of the hidden variables are computed with respect to their posterior probability distribution, and the Maximization (M) step, in which the parameters of the model are chosen so as to maximize the likelihood of the model conditioned on the expected values of the hidden variables. In the case of a continuous time Markov process (CTMP), the M step requires the expected transition probabilities between every parent-child pair in the tree [167]. Using the notation developed in the previous section, we can write the posterior joint probability between node $V_k$ and its parent $p(V_k)$ as $\Pr\{p(V_k), V_k \mid \theta, X, D\}$. Note that the posterior is conditioned on all the observed data, not just the data in the subtree rooted at $V_k$.

For the unlinked case, these probabilities can be easily computed using standard message passing algorithms (e.g., [93]). Briefly, during the likelihood calculation, the partial posterior conditional probability

\[
\Pr\{V_k \mid \theta, X, D(V_k), p(V_k)\} = \frac{\Pr\{V_k \mid \theta, p(V_k)\} \prod_{V_c \in c(V_k)} L(D(V_c) \mid \theta, X, V_k)}{L(D(V_k) \mid \theta, X, p(V_k))} \tag{A.11}
\]

is computed and stored. Similarly, a byproduct of computing the likelihood at the root is the posterior marginal probability

\[
\Pr\{V_0 \mid \theta, X, D\} = \frac{\Pr\{V_0 \mid \theta\} \prod_{V_c \in c(V_0)} L(D(V_c) \mid \theta, X, V_0)}{L(D \mid \theta, X)}. \tag{A.12}
\]

Given the posterior marginal probability of any node, we can recurse back down the tree, computing posterior joint probabilities along the way. Given the posterior marginal probability of node $V_k$’s parent, the posterior joint probability is simply

\[
\Pr\{p(V_k), V_k \mid \theta, X, D\} = \Pr\{V_k \mid \theta, X, D(V_k), p(V_k)\} \times \Pr\{p(V_k) \mid \theta, X, D\}
\tag{A.13}
\]

and the posterior marginal probability of $V_k$ is obtained by summing Equation (A.13) over all possible values of $p(V_k)$. Note that the conditional independencies afforded by the tree structure of the model mean that $\Pr\{V_k \mid \theta, X, D(V_k), p(V_k)\}$ is independent of all observed data not in the subtree of $V_k$, making this simple message passing scheme possible.

Given these posterior joint probabilities, we can now compute the expected value of the CTMP sufficient statistics, as well as maximize the parameters with respect to those expected values, using the method of [167]. As discussed in section 4.4, this process also yields the posterior joint probability distribution between the hidden variables $H$ and their corresponding observed target variables $Y$, $\Pr\{Y_i, H_i \mid \theta, D, X_i\}$, from which we can maximize the selection parameter $s$ according to whichever conditional adaptation model we are using.

A.2.1 EM for the linked case

EM for the linked case proceeds much like the likelihood calculation. That is, we use $M_i$ as a cut node and calculate the posterior probabilities for each node in the subtree rooted at $M_i$ for each value of $L_i$. The first step in this process is to compute the posterior $\Pr\{L_i \mid \theta, X, D\}$. To do so, we first condition on the parent of $M_i$, isolating $L_i$ from the rest of the tree, then use Bayes’ rule to calculate the partial posterior:

\[
\Pr\{L_i \mid \theta, X_i, D(M_i), p(M_i)\} = \frac{\Pr\{D(M_i) \mid \theta, L_i, p(M_i), X_i\} \Pr\{L_i \mid X_i\}}{L(D(M_i) \mid \theta, X_i, p(M_i))}.
\tag{A.14}
\]

Note that each term on the right-hand side of the equation follows naturally from our linked likelihood calculations. Now we simply multiply Equation (A.14) by the posterior marginal probability of $p(M_i)$ (computed as in Equation (A.12)) and sum over its values to yield

\[
\Pr\{L_i \mid \theta, X, D\} = \sum_{p \in \{0,1\}} \Pr\{L_i \mid \theta, X_i, D(M_i), p(M_i) = p\} \times \Pr\{p(M_i) = p \mid \theta, X, D\}.
\tag{A.15}
\]

The parameter $s_1$ can then be maximized by

\[
s_1 = \frac{\sum_i \Pr\{L_i = 1 \mid \theta, D, X_i = 1\}\, \mathbf{1}\{X_i = 1\}}{\sum_i \mathbf{1}\{X_i = 1\}}.
\tag{A.16}
\]
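The two steps above, the posterior of $L_i$ in Equations (A.14) and (A.15) and the update of $s_1$ in Equation (A.16), can be sketched as follows. This is only an illustration under assumed inputs: the function names and the argument layout (`lik[p][l]`, `prior_l[l]`, `parent_post[p]`) are hypothetical, and the inputs are taken as already computed by the linked likelihood pass.

```python
def linked_posterior(lik, prior_l, parent_post):
    """Return [Pr{L_i = 0 | D}, Pr{L_i = 1 | D}] following Eqs. (A.14) and (A.15).

    lik[p][l]      = Pr{D(M_i) | p(M_i) = p, L_i = l}  (assumed precomputed)
    prior_l[l]     = Pr{L_i = l | X_i}
    parent_post[p] = posterior marginal probability that p(M_i) = p
    """
    post = [0.0, 0.0]
    for p in (0, 1):
        weighted = [lik[p][l] * prior_l[l] for l in (0, 1)]   # numerator of (A.14)
        norm = sum(weighted)                                  # L(D(M_i) | p(M_i) = p)
        for l in (0, 1):
            post[l] += parent_post[p] * weighted[l] / norm    # summand of (A.15)
    return post


def update_s1(post_l1, x):
    """Eq. (A.16): average Pr{L_i = 1 | ...} over the indices i with X_i = 1."""
    selected = [q for q, xi in zip(post_l1, x) if xi == 1]
    if not selected:
        raise ValueError("no index has X_i = 1")
    return sum(selected) / len(selected)
```

For example, `update_s1([0.2, 0.8, 0.5], [1, 1, 0])` averages only the first two posteriors, because the third index has $X_i = 0$.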

With the posterior of $L_i$ in hand, we can now easily compute the posterior joint probability between any node and its parent in the subtree rooted at $M_i$ by conditioning on $L_i$. Specifically, Equation (A.13) becomes

\[
\Pr\{p(V_k), V_k \mid \theta, X, D\} = \sum_{l \in \{0,1\}} \Pr\{V_k \mid \theta, X, D(V_k), p(V_k), L_i = l\} \times \Pr\{p(V_k) \mid \theta, X, D, L_i = l\} \Pr\{L_i = l \mid \theta, X, D\}.
\tag{A.17}
\]

Note that $\Pr\{p(M_i) \mid \theta, X, D, L_i = l\} = \Pr\{p(M_i) \mid \theta, X, D\}$, allowing us to restrict this process of conditioning on $L_i$ to the subtree rooted at $M_i$. To maximize $s_2$, we first compute the posterior

\[
\Pr\{H_{i,g}, Y_{i,g} \mid \theta, X_i = 1, D, L_i = 1\} = \Pr\{Y_{i,g} \mid \theta, X, D, H_{i,g}, L_i = 1\} \times \Pr\{H_{i,g} \mid \theta, X, D, L_i = 1\} \Pr\{L_i = 1 \mid \theta, X, D\},
\tag{A.18}
\]

then maximize $s_2$ according to the appropriate leaf distribution and the posterior observed for each sequence.

Finally, recall that throughout this discussion we have made the simplifying assumption that each linked node has a unique MRCA and no MRCA is a descendant of any other MRCA. When these restrictions are removed, each linked node is no longer independent of all other linked nodes. Consider the case where two linked nodes $L_i$ and $L_j$ share the same MRCA (as will likely happen for the Noisy Add model). The above discussion can be easily extended to include this case by noting that we now must condition on $L_i$ and $L_j$ simultaneously. Likewise, if $M_j$ is a descendant of $M_i$, then all calculations within the subtree rooted at $M_j$ will need to be conditioned on both $L_i$ and $L_j$. In general, the computational complexity will scale exponentially with the number of linked nodes on which we must simultaneously condition. In practice, this scenario is most likely to occur in the Noisy Add model, in which case the calculation will tend to be exponential in the number of linked nodes that are included as predictors for a given target variable. In the examples discussed in this dissertation, linked predictor variables correspond to HLA alleles. In our data, it is uncommon for a single target variable to be predicted by more than a handful of HLA alleles; thus, in this domain, it is unlikely that the computational burden of the linked model will be large enough to make this approach impractical.

Appendix B

ON COMPUTING FDR FOR FISHER’S EXACT TEST

The model selection approach we have used throughout this work is based on estimated false discovery rates. Although our FDR estimates have generally been well calibrated on synthetic data, the theory of FDR estimation was developed for continuous statistics and does not always perform well on discrete data, especially when the number of observations is small. In this appendix, we digress from the main focus of the work to discuss the implications of FDR on discrete data. In particular, we focus on the application of FDR estimation to two-by-two contingency tables, for which p-values can be exactly calculated using Fisher’s exact test (FET) [66].

Although the results discussed in this appendix are not directly applicable to phylogenetic dependency networks, owing to the impracticality of computing exact p-values over our phylogenetic models, there are a number of cases in sequence analysis in which the phylogeny has no bearing on the null distribution of variables, allowing us to use FET directly. In these cases, the results presented here provide a substantial increase in power by leading to less-biased FDR estimates while maintaining provably conservative asymptotic guarantees.

The idea is as follows. A key assumption in Storey’s [208, 209, 211] estimation of FDR is that the p-values under the null distribution are distributed according to Uniform[0,1]. In this case, for any α ∈ [0, 1] the probability that we sample a p-value that is at most α is simply α. In discrete data, however, the p-value distribution under the null is often heavily skewed toward one. In such cases, as we have noted (subsection 5.1.2), one can estimate the FDR based on an empirical null distribution that is sampled from, for example, randomization testing (see refs. [57, 106, 230]). In the case of FET, we can analytically perform exhaustive randomization testing, allowing us to compute the exact null distribution. As we will show, this exact calculation allows us to compute a tight estimate of FDR, as well as to prove some asymptotic conservative guarantees.

B.1 Examples of FET for sequence data

Let us begin by describing some examples in which FET is the appropriate test for significance. Whereas the bulk of this dissertation has focused on techniques involving phylogenetic correction, here we briefly outline some examples where population structure is non-existent, making FET the most appropriate test.

Longitudinal HIV resistance data. The most direct way to identify sequence associations is through longitudinal analysis. In this study design, the HIV is sequenced in each individual at the start of the trial, and is resequenced at the end of the trial. This approach is most appropriate when the researcher has complete control over the application of the predictor variable. One such example is provided by the study of Richard Harrigan and colleagues [91]. In this study, approximately 700 patients who had not been on antiretroviral drugs were enrolled and their consensus HIV sequences determined (this is the HOMER cohort from chapter 7). These patients then initiated one of several types of HIV therapy. Several years later, the consensus HIV sequences were again determined for each patient and compared to the original sequences. The goal was to identify mutations that were associated with specific antiretroviral drugs. When we used the data as an application for the PDN, we considered only the initial sequences (as we needed sequences that did not reflect drug-related adaptation). In that case, it was important to correct for phylogeny. Here, however, we can observe the transitions taking place, rendering the phylogeny irrelevant. For each amino acid $a_k$ at position $k$, we construct a binary variable $A$ such that $A = 1$ if $a_k$ was observed only in the pre-trial sequence, and $A = 0$ if $a_k$ was observed in both the pre-trial and post-trial sequences. The variable $A$ is then tested for independence against each of the drugs in the trial. Here, we tested three classes of drugs over 1194 variables representing observed amino acids at positions in the HIV Protease and Reverse Transcriptase proteins, resulting in 3582 total tests, each with 281 observations. We will refer to this data set as “Resistance”.

Epitope mapping. The cellular arm of the immune response identifies and destroys infected cells. Such cells can be identified by the unique strings of viral protein fragments, called epitopes, that are displayed on the surface of infected cells. These epitopes are displayed by human leukocyte antigen (HLA) proteins. Thousands of HLA variations have been identified in humans. Thus, a critical component of HIV vaccine design is the identification and characterization of the set of epitopes that each HLA allele can present on the cell surface. One high-throughput experimental method for epitope mapping is the use of overlapping peptide (OLP) scans, in which a sliding window of 10-15 amino acid peptide fragments is created based on the viral genome of interest. Each OLP is then tested against cells from dozens of patients, each of whom carries six HLA alleles, using interferon-γ ELISpot assays [1, 156]. We then attempt to map HLA alleles to responses. Thus, we test whether patients with the given HLA allele are more likely to have a positive ELISpot assay for a given OLP than patients without the HLA allele. Here, we reanalyze the data of Kiepiela et al. [116], comparing 219 HLA alleles with 343 OLPs, resulting in 74,774 total tests, with an average of 724 observations per test. These data were collected as part of the Durban cohort analyzed in chapter 7. This data set is denoted “Epitope”.

Sieve analysis. In sieve analysis we try to identify positions in the viral genome at which the variability differs between two different populations [79].
For example, if the position is more variable in patients from one region than those from another, it may indicate that different forces are acting on that region. Following Gilbert, we have created a data set consisting of 567 HIV clade B sequences from the HOMER cohort [25] and 522 HIV clade C sequences from the Durban cohort [116, 196]. For each of 363 positions in the HIV Gag protein, we compared the frequency at which sequences match the consensus for their respective clades. We will call this data set “Sieve”.

Linkage disequilibrium mapping. Linkage disequilibrium (LD) occurs when two sites on a chromosome are not statistically independent of each other. This typically occurs when the sites are nearby on the chromosome, such that inheritance of a given allele at one site is correlated with inheritance of a given allele at the other site. When performing genome-wide association studies (GWAS), it is helpful to identify a set of variations (SNPs) that are in LD with each other, so that potentially redundant results can be identified. We have taken the SNP data from a recent GWAS [37], binarized the data, and tested for independence between each of 401,017 pairs of neighboring SNPs. There was an average of 1085 observations per test. We call this data set “SNP”.

Synthetic data. To demonstrate various claims throughout this appendix, we use synthetic data sets constructed to closely mimic the above real data sets. We will describe the construction of these data sets in section B.5, after we have introduced some useful notation and concepts.

B.2 Background

B.2.1 Contingency Tables and Fisher’s Exact Test

Consider an experiment that tests the independence of two random binary variables $X$ and $Y$. Suppose there are $n$ observations, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The results can be summarized in a contingency table $t = (a, b, c, d)$, as defined in Table B.1. We denote the marginal counts of $t$ as $\theta^t = (\theta_X, \theta_{\bar X}, \theta_Y, \theta_{\bar Y})$, which represent the number of times each variable is observed in each state. These counts capture the maximum likelihood estimators for the marginal probabilities $\Pr\{X = 1\}$ and $\Pr\{Y = 1\}$. We will often drop the superscript $t$ when it is clear from context.

Table B.1: 2 × 2 contingency table on binary variables X and Y.

           Y = 1           Y = 0           Σ_X
X = 1      a               b               θ_X = a + b
X = 0      c               d               θ_X̄ = c + d
Σ_Y        θ_Y = a + c     θ_Ȳ = b + d     n
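As a small illustration of the notation, the following sketch tallies paired binary observations into a table $t = (a, b, c, d)$ in the layout of Table B.1 and returns its marginal counts. The function names are hypothetical and chosen only for this example.

```python
def contingency_table(xs, ys):
    """Summarize paired binary observations as t = (a, b, c, d), laid out as in Table B.1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    pairs = list(zip(xs, ys))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    d = sum(1 for x, y in pairs if x == 0 and y == 0)
    return a, b, c, d


def marginals(t):
    """Return (theta_X, theta_Xbar, theta_Y, theta_Ybar) for the table t."""
    a, b, c, d = t
    return a + b, c + d, a + c, b + d
```

The four marginal counts always sum to $2n$, which gives a quick consistency check on any table built this way.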

Our goal is to test the null hypothesis $H = 0$ that $X$ and $Y$ are independent. Let $T$ be the random variable representing a table with marginals $\theta$. Fisher [66] showed that, if $X$ and $Y$ are independent and each is independent and identically distributed (IID), then the probability of observing $T = t$ is given by the hypergeometric distribution:

\[
\mathrm{pr}(t) \triangleq \Pr\{T = t \mid H = 0, \theta^t\} = \frac{\binom{\theta_X}{a} \binom{\theta_{\bar X}}{c}}{\binom{n}{\theta_Y}}.
\tag{B.1}
\]

From Equation (B.1), a two-tailed marginal p-value can be computed for $t$ using

\[
p(t) = \Pr\{\mathrm{pr}(T) \le \mathrm{pr}(t) \mid \theta^T = \theta^t\} = \sum_{t' \in \mathrm{perm}(t)} \mathrm{pr}(t') \cdot \mathbf{1}\{\mathrm{pr}(t') \le \mathrm{pr}(t)\},
\tag{B.2}
\]

where $\mathrm{perm}(t) = \{t' : \theta^{t'} = \theta^t\}$, and $\mathbf{1}\{\cdot\}$ is the indicator function that evaluates to one if the constraints are satisfied and zero otherwise. We call Equation (B.2) a marginal p-value because it is conditioned on the marginals $\theta$.

Many statistics have a uniform distribution of p-values under the null. That is, if $P$ is a continuous random variable representing a p-value, $\Pr\{P \le \alpha \mid H = 0\} = \alpha$. It is evident from Equation (B.1), however, that the distribution of FET p-values under the null may be strongly skewed toward one and therefore $\Pr\{P \le \alpha \mid H = 0\} \le \alpha$. For example, the minimum achievable p-value for a given $n$ is on the order of $1/\binom{n}{n/2}$, and that can only be achieved when the marginals are evenly distributed (i.e., when $\theta_X = \theta_{\bar X} = n/2$). Figure B.1 shows the distribution of p-values from our various data sets.
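Equations (B.1) and (B.2) translate almost directly into code. The sketch below is an illustration rather than the implementation used for the results in this appendix; the function names are hypothetical, and the small relative tolerance `rel_tol` is an added assumption that guards the comparison of table probabilities against floating-point ties.

```python
from math import comb


def table_prob(a, b, c, d):
    """Hypergeometric probability of the table (a, b, c, d) given its marginals, Eq. (B.1)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_two_tailed(a, b, c, d, rel_tol=1e-9):
    """Two-tailed marginal p-value of Eq. (B.2): total probability of all tables with the
    same marginals that are no more probable than the observed table."""
    theta_x, theta_xbar, theta_y = a + b, c + d, a + c
    p_obs = table_prob(a, b, c, d)
    lo = max(0, theta_y - theta_xbar)      # smallest feasible value of the top-left cell
    hi = min(theta_x, theta_y)             # largest feasible value of the top-left cell
    total = 0.0
    for a2 in range(lo, hi + 1):
        c2 = theta_y - a2
        p = table_prob(a2, theta_x - a2, c2, theta_xbar - c2)
        if p <= p_obs * (1 + rel_tol):
            total += p
    return min(total, 1.0)
```

For the table $(3, 1, 1, 3)$, the tables with top-left cell 0, 1, 3, and 4 are no more probable than the observed one, so the p-value is $(1 + 16 + 16 + 1)/70 = 34/70$.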

Figure B.1: The histogram of marginal p-values for the various data sets (panels: A Resistance, B Epitope, C Sieve, D SNPs). p-values are transformed using log2 p.

B.2.2 Positive False Discovery Rates

Suppose that m hypothesis tests over 2 × 2 contingency tables t1 , . . . , tm are simultaneously tested using Fisher’s exact test, with corresponding marginal p-values p1 , p2 , . . . , pm , and we wish to estimate or control the false positive rate in some way. Given the values of H1 , H2 , . . . , Hm , where Hi = 0 if the ith null hypothesis is true and Hi = 1 if the null hypothesis is false, and a rejection region Γ, the possible results are summarized in Table B.2. Note that only m, R and W are observed. Here, we will focus on nested rejection regions defined by the Type I error rate α. That is, given an α ∈ [0, 1], we will reject all tests with P ≤ α. For convenience, we will use α to denote this rejection region.

Table B.2: Outcomes when testing m hypotheses.

Hypothesis          Accept    Reject    Total
Null true           U(Γ)      V(Γ)      m0
Alternative true    T(Γ)      S(Γ)      m1
Total               W(Γ)      R(Γ)      m

Benjamini and Hochberg [15] proposed controlling the False Discovery Rate (FDR), which they defined to be

\[
FDR(\alpha) \triangleq E\left[\frac{V(\alpha)}{R(\alpha)}\right].
\tag{B.3}
\]

When $R(\alpha) = 0$, this quantity is undefined. In this case, Benjamini and Hochberg defined $FDR(\alpha)$ to be 0, which is equivalent to defining

\[
FDR(\alpha) \triangleq E\left[\frac{V(\alpha)}{R(\alpha)} \,\middle|\, R(\alpha) > 0\right] \cdot \Pr\{R(\alpha) > 0\}.
\tag{B.4}
\]

In practice, we are only interested in p-value thresholds that result in the rejection of at least one test in our data. Thus, Storey [208] proposed the positive false discovery rate (pFDR), which is conditioned on the rejection of at least one test:

\begin{align*}
pFDR(\alpha) &= E\left[\frac{V(\alpha)}{R(\alpha)} \,\middle|\, R(\alpha) > 0\right] \tag{B.5} \\
&= \frac{1}{\Pr\{R(\alpha) > 0\}} \cdot FDR(\alpha). \tag{B.6}
\end{align*}

In order to estimate $pFDR(\alpha)$, we need to make some assumptions about the hypothesis tests. One useful set of assumptions proposed by Storey [208] is that the random variables $H_i$ are IID Bernoulli variables with prior probability $\Pr\{H_i = 1\} = 1 - \pi_0$ and that the p-values are IID with

\[
P_i \mid H_i \sim (1 - H_i) \cdot F_0 + H_i \cdot F_1,
\tag{B.7}
\]

for some null distribution $F_0$ and some alternative distribution $F_1$. Under these assumptions, Storey [208] showed that

\begin{align*}
pFDR(\alpha) &= \Pr\{H = 0 \mid P \le \alpha\} \tag{B.8} \\
&= \frac{\pi_0 \cdot \Pr\{P \le \alpha \mid H = 0\}}{\Pr\{P \le \alpha\}} \tag{B.9} \\
&= \frac{E[V(\alpha)]}{E[R(\alpha)]}. \tag{B.10}
\end{align*}

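For small samples, it may be quite likely that no tests are rejected at threshold $\alpha$. Therefore, Storey [208] proposed the following estimator for pFDR:

\[
\widehat{pFDR}(\alpha) \triangleq \frac{\hat\pi_0 \cdot \widehat{\Pr}\{P \le \alpha \mid H = 0\}}{\widehat{\Pr}\{P \le \alpha\} \cdot \widehat{\Pr}\{R(\alpha) > 0\}}.
\tag{B.11}
\]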

Equation (B.11) provides estimators for each term of Equation (B.9), with an added correction term that estimates $\Pr\{R(\alpha) > 0\}$. (Note that we did not employ this term in the previous chapters, as we were concerned with large FDR estimates over a large number of samples, meaning $\Pr\{R(\alpha) > 0\}$ was close to 1.) For p-values that are not biased towards 0 under the null distribution, Storey [208] showed that

\[
\widehat{pFDR}(\alpha) \triangleq \frac{W(\lambda) \cdot \alpha}{(1 - \lambda)\, R(\alpha)\, (1 - (1 - \alpha)^m)},
\tag{B.12}
\]

for some well-chosen $\lambda$, $0 \le \lambda < 1$, is conservative both asymptotically and in expectation for finite samples, using the intuition that $\Pr\{P \le \alpha \mid H = 0\} \le \alpha$ and $\pi_0 \approx \frac{W(\lambda)}{(1 - \lambda) m}$. Storey [208, 211] and others (e.g., [126]) have proposed methods for choosing $\lambda$.

In practice, Storey suggested estimating $pFDR(\alpha)$ for each observed p-value $p_i$ and proved that, under the aforementioned independence assumptions, the estimates are simultaneously conservative for all $p_i$ [210]. When the tests are continuous and uniformly distributed, $\Pr\{P \le p_i \mid H = 0\} = p_i$, making Storey’s estimator quite tight in practice. When the tests are discrete, however, $\Pr\{P \le p_i \mid H = 0\}$ can be significantly less than $p_i$. For example, in the case of Fisher’s exact test, each observed p-value $p_i$ is a marginal p-value dependent on the marginals $\theta_i$. Although $\Pr\{P \le p_i \mid H = 0, \theta_i\} = p_i$, for some test $j \ne i$, $\Pr\{P \le p_i \mid H = 0, \theta_j\} \le p_i$. Thus, the overall probability $\Pr\{P \le p_i \mid H = 0\} \le p_i$, making Storey’s estimate overly conservative.

Although the marginal p-values do not provide good estimators of the overall probability $\Pr\{P \le \alpha \mid H = 0\}$, precluding a simple procedure of computing pFDR for each $p_i$ using Storey’s approach, the null distribution can be estimated by randomization tests, as has been discussed in the microarray literature (for review, see [36]). Nevertheless, Pounds and Cheng [183] point out that, even assuming $\Pr\{P \le \alpha \mid H = 0\} = \alpha$, estimation methods for $\pi_0$ that rely on continuous assumptions are not applicable to discrete data. Instead, they propose defining

\[
\hat\pi_0 \triangleq 2\bar p,
\tag{B.13}
\]

where $\bar p$ is the arithmetic mean of the observed marginal p-values. Although Equation (B.13) also assumes a uniform p-value distribution under the null hypothesis, this conservative estimator tends to yield tighter estimates than other methods when the true $\pi_0$ is sufficiently small [183].

B.3 Computing pFDR for Fisher’s exact test

Fisher’s exact test provides an efficient means of computing $\Pr\{P \le \alpha \mid H = 0, \theta\}$ exactly. Although the resulting marginal p-values should not be applied directly to pFDR estimation, the computation can be efficiently leveraged to provide a tight estimate for each of the terms in Equation (B.11). In this section, we define each of these estimates and show that each estimate is unbiased or conservative (in this context, an estimator is conservative if it is expected to overestimate terms in the numerator of Equation (B.11) or underestimate terms in the denominator of Equation (B.11)). We evaluate the asymptotic convergence properties of Equation (B.11), showing that it results in a conservative FDR estimate, and characterize conditions under which the estimates will be closer to the true values. In addition, we discuss ways in which the exact test can be further leveraged to increase power by filtering irrelevant tests.

In what follows, we begin with Storey’s [209] assumptions. Specifically, we assume that p-values are IID and follow the mixture model of Equation (B.7) and that the $H_i$ are IID Bernoulli random variables. Because we are using Fisher’s exact test, we also need to consider the marginals $\theta$, which we assume are IID. Although we make no assumptions about the specific distribution of $\theta$, we assume that the p-values depend on the marginals according to the mixture model

\[
P \mid H, \theta \sim (1 - H) \cdot F_0(\theta) + H \cdot F_1(\theta),
\tag{B.14}
\]

where $F_0(\theta)$ follows the hypergeometric distribution as defined in Equation (B.2) and $F_1(\theta)$ is the generating function of the alternative model. We further assume that $\theta$ is independent of $H$, though we will later relax this assumption.

B.3.1 Estimating Pr{P ≤ α | H = 0} from pooled p-values

To estimate $\Pr\{P \le \alpha \mid H = 0\}$, we use a pooled p-value estimate, which is the average probability that each test has $P \le \alpha$:

\[
\widehat{\Pr}\{P \le \alpha \mid H = 0\} \triangleq \frac{1}{m} \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H = 0, \theta_i\}.
\tag{B.15}
\]

Because FET computes $\Pr\{P \le \alpha \mid H = 0, \theta_i\}$ by considering every permutation of the data that is consistent with $\theta_i$, it can be seen that Equation (B.15) is the result of computing every possible permutation of the data. Furthermore, given that $H$ and $\theta$ are independent, it follows that the pooled p-values are unbiased estimates of $\Pr\{P \le \alpha \mid H = 0\}$ (see Lemma 1 in section B.6). Finally, because $\Pr\{P \le \alpha \mid H = 0, \theta\} \ll \alpha$ for many marginals observed in practice, Equation (B.15) can provide a substantial gain in statistical power over the estimate $\widehat{\Pr}\{P \le \alpha \mid H = 0\} \triangleq \alpha$. Figure B.2 shows the advantage of using pooled p-values over $\alpha$.
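The pooled estimate of Equation (B.15) can be sketched by enumerating, for each test, every table consistent with its marginals. This is an illustrative sketch, not the code used to produce the figures; the function names and the representation of marginals as a tuple `(theta_x, theta_xbar, theta_y)` are assumptions made for the example.

```python
from math import comb


def null_pvalues(theta_x, theta_xbar, theta_y):
    """Return [(p_value, probability)] over all tables consistent with the marginals,
    i.e. the exact null distribution of the two-tailed FET p-value."""
    n = theta_x + theta_xbar
    lo, hi = max(0, theta_y - theta_xbar), min(theta_x, theta_y)
    probs = [comb(theta_x, a) * comb(theta_xbar, theta_y - a) / comb(n, theta_y)
             for a in range(lo, hi + 1)]
    # The tolerance keeps equally probable tables tied despite rounding.
    return [(min(1.0, sum(r for r in probs if r <= q * (1 + 1e-9))), q) for q in probs]


def null_cdf(alpha, marginals):
    """Pr{P <= alpha | H = 0, theta} for one set of marginals."""
    return sum(q for pval, q in null_pvalues(*marginals) if pval <= alpha)


def pooled_null_cdf(alpha, all_marginals):
    """Pooled estimate of Pr{P <= alpha | H = 0}, Eq. (B.15)."""
    return sum(null_cdf(alpha, th) for th in all_marginals) / len(all_marginals)
```

With one balanced test, `(4, 4, 4)`, and one unbalanced test, `(1, 7, 4)`, the pooled value at $\alpha = 0.05$ is $(2/70 + 0)/2 = 1/70$, well below 0.05, which is the kind of gap the text describes.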

Figure B.2: Pooled vs. marginal p-values (panels: A Resistance, B Epitope, C Sieve, D SNPs). The dashed line shows the α estimator. The distance between the dashed line and the solid line is the gain from using the pooled p-values.

B.3.2 Estimating Pr{P ≤ α} and Pr{R(α) > 0}

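Given that $R(\alpha)$ is the number of observed tests with $p \le \alpha$, it follows that

\[
E\left[\frac{R(\alpha)}{m}\right] = \Pr\{P \le \alpha\}.
\tag{B.16}
\]

We therefore follow Storey [208] in defining

\[
\widehat{\Pr}\{P \le \alpha\} \triangleq \frac{R(\alpha)}{m}.
\tag{B.17}
\]

Because Equation (B.11) would be undefined for $R = 0$, Storey [208] defines

\[
\widehat{\Pr}\{P \le \alpha\} \triangleq \frac{R(\alpha) \vee 1}{m},
\tag{B.18}
\]

where $R(\alpha) \vee 1 = \max(R(\alpha), 1)$, though in practice we are typically only interested in values for $\alpha$ that were observed in our data set.

To determine $\widehat{pFDR}(\alpha)$, we also require the estimate $\widehat{\Pr}\{R(\alpha) > 0\}$. It follows from our independence assumptions that

\begin{align*}
\Pr\{R(\alpha) > 0\} &= 1 - (1 - \Pr\{P \le \alpha\})^m \tag{B.19} \\
&\ge 1 - (1 - \Pr\{P \le \alpha \mid H = 0\})^m, \tag{B.20}
\end{align*}

where the inequality holds whenever true alternatives are at least as likely as true nulls to produce p-values below $\alpha$, making

\[
\widehat{\Pr}\{R(\alpha) > 0\} \triangleq 1 - \left(1 - \widehat{\Pr}\{P \le \alpha \mid H = 0\}\right)^m
\tag{B.21}
\]

a conservative estimate of $\Pr\{R(\alpha) > 0\}$.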
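Putting the pieces together, the plug-in estimate of Equation (B.11) can be sketched as below, using Equation (B.18) for $\widehat{\Pr}\{P \le \alpha\}$ and Equation (B.21) for $\widehat{\Pr}\{R(\alpha) > 0\}$. The function name and argument names are hypothetical; `pi0_hat` and `null_cdf_alpha` are assumed to have been estimated separately.

```python
def pfdr_estimate(pi0_hat, null_cdf_alpha, r_alpha, m):
    """Plug-in estimate of pFDR(alpha), Eq. (B.11).

    pi0_hat        : estimate of pi_0
    null_cdf_alpha : estimate of Pr{P <= alpha | H = 0}, e.g. the pooled estimate (B.15)
    r_alpha        : number of tests with p <= alpha
    m              : total number of tests
    """
    if not 0.0 < null_cdf_alpha <= 1.0:
        raise ValueError("null_cdf_alpha must lie in (0, 1]")
    pr_reject = max(r_alpha, 1) / m                     # Eq. (B.18)
    pr_any = 1.0 - (1.0 - null_cdf_alpha) ** m          # Eq. (B.21)
    return pi0_hat * null_cdf_alpha / (pr_reject * pr_any)
```

The sketch does not truncate the result, so values above 1 are possible when few tests are rejected; whether to cap the estimate at 1 is left to the caller.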

B.3.3 Estimating π0

The final step in the pFDR computation is to estimate $\pi_0$. In this section, we use a general framework for conservatively estimating $\pi_0$ [42] and show how existing methods fit within this framework. Given the mixture model of Equation (B.14), we can write

\[
\Pr\{P = p\} = \pi_0 \cdot \Pr\{P = p \mid H = 0\} + \pi_1 \cdot \Pr\{P = p \mid H = 1\},
\tag{B.22}
\]

where $\pi_1 = \Pr\{H = 1\} = 1 - \pi_0$ [42, 78, 126]. Thus, it follows that

\[
\Pr\{P = p\} \ge \pi_0 \cdot \Pr\{P = p \mid H = 0\}
\tag{B.23}
\]

and

\[
\pi_0 \le \frac{\Pr\{P = p\}}{\Pr\{P = p \mid H = 0\}}.
\tag{B.24}
\]

Moreover, for any non-negative function $\rho(\cdot)$ we can also write

\[
\pi_0 \le \frac{\sum_p \rho(p) \Pr\{P = p\}}{\sum_p \rho(p) \Pr\{P = p \mid H = 0\}}.
\tag{B.25}
\]

Therefore, estimating $\pi_0$ using

\[
\hat\pi_0 \triangleq \frac{\sum_{i=1}^{m} \sum_p \rho(p) \cdot \mathbf{1}\{p_i = p\}}{\sum_{i=1}^{m} \sum_p \rho(p) \cdot \Pr\{P = p \mid H = 0, \theta_i\}},
\tag{B.26}
\]

where $\rho(\cdot)$ is any non-negative function, is always a conservative estimate in expectation (see Lemma 2). Furthermore, in the limit, Equation (B.26) asymptotically converges to

\[
\pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H = 1]}{E[\rho(p) \mid H = 0]},
\]

which implies that the $\rho(\cdot)$ function minimizing $\frac{E[\rho(p) \mid H = 1]}{E[\rho(p) \mid H = 0]}$ will yield the least biased estimator (see Lemma 3).

Equation (B.26) gives us great flexibility in computing $\pi_0$ estimates. One such estimation method is given by Storey [208, 209]:

\[
\hat\pi_0(\lambda) = \frac{\#\{p_i > \lambda\}}{(1 - \lambda) m}
\tag{B.27}
\]

for some tuning parameter $0 \le \lambda < 1$. For uniformly distributed statistics,

\[
(1 - \lambda) m = E[\#\{p_i > \lambda\} \mid H = 0].
\tag{B.28}
\]

For contingency tables, we can estimate $\Pr\{P > \lambda \mid H = 0\}$ using the pooled estimator of Equation (B.15), that is, $1 - \widehat{\Pr}\{P \le \lambda \mid H = 0\}$, which results in an unbiased estimate for $E[\#\{p_i > \lambda\} \mid H = 0]$. Therefore, Equation (B.27) is a special case of Equation (B.26) in which

\[
\rho(p) =
\begin{cases}
0 & \text{if } p \le \lambda, \\
1 & \text{otherwise.}
\end{cases}
\tag{B.29}
\]

As $\lambda \to 0$, we have increasingly conservative bias, with $\hat\pi_0 = 1$ when $\lambda = 0$, whereas the variance of the $\hat\pi_0$ estimate increases as $\lambda \to 1$ due to the decreasing number of observations. Indeed, different heuristic approaches have been proposed to balance the bias-variance tradeoff inherent in picking $\lambda$. Equation (B.26) suggests an orthogonal heuristic that may be useful in estimating $\pi_0$: choosing a weighting function $\rho(\cdot)$ such that more weight is applied to tests with high p-values. A natural choice for a weighting function is $\rho(p_i) = \Pr\{P \le p_i \mid H = 0, \theta_i\} = p_i$, which is equivalent to

\[
\hat\pi_0 \triangleq \frac{E[P]}{E[P \mid H = 0]}.
\tag{B.30}
\]

Under this weighting function, tests with low p-values will still contribute to the $\pi_0$ estimate, but not as much as tests with high p-values. In principle, we could define $\rho(\cdot)$ such that we are effectively summing p-values over the range $\lambda < p \le 1$; in practice, however, we have found the $\hat\pi_0$ estimate to be quite stable over a wide range of $\lambda$, and so simply set $\lambda = 0$.

Pounds and Cheng [183] suggested the estimator $\hat\pi_0 \triangleq 2\bar p$ for discrete statistics, where $\bar p$ is the average observed marginal p-value. Estimator (B.30) is similar to their estimate, except that for contingency tables we can compute $E[P \mid H = 0]$ exactly, rather than assuming $E[P \mid H = 0] = 0.5$. Table B.3 compares the true $\pi_0$ for synthetic data against the $\pi_0$ estimators of Equation (B.26) and of Storey [209] (evaluated for $\lambda = 0.5$; see footnote 1) and Pounds and Cheng [183], computed using marginal p-values. Equation (B.26) provides a tight and conservative estimate, substantially increasing power relative to methods that assume a uniform p-value distribution. Similarly, accounting for the exact p-value distribution results in a substantially lower $\pi_0$ estimate on each of the real data sets (Table B.4).
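The weighted estimator of Equation (B.26), with the choice $\rho(p) = p$ of Equation (B.30) as the default and the indicator weighting of Equation (B.29) as an alternative, can be sketched as follows. The function names are hypothetical, each test is represented here as an assumed pair `(p_value, (theta_x, theta_xbar, theta_y))`, and the estimate is left untruncated.

```python
from math import comb


def null_pvalues(theta_x, theta_xbar, theta_y):
    """Exact null distribution of the two-tailed FET p-value: [(p_value, probability)]."""
    n = theta_x + theta_xbar
    lo, hi = max(0, theta_y - theta_xbar), min(theta_x, theta_y)
    probs = [comb(theta_x, a) * comb(theta_xbar, theta_y - a) / comb(n, theta_y)
             for a in range(lo, hi + 1)]
    return [(min(1.0, sum(r for r in probs if r <= q * (1 + 1e-9))), q) for q in probs]


def pi0_weighted(tests, rho=lambda p: p):
    """Eq. (B.26): sum of rho over observed p-values divided by its exact null expectation,
    summed over the marginals of every test. The default rho(p) = p gives Eq. (B.30)."""
    numerator = sum(rho(p) for p, _ in tests)
    denominator = sum(rho(pval) * q
                      for _, theta in tests
                      for pval, q in null_pvalues(*theta))
    return numerator / denominator


def pi0_indicator(tests, lam):
    """Indicator weighting of Eq. (B.29): rho(p) = 1 if p > lam, else 0."""
    return pi0_weighted(tests, rho=lambda p: 1.0 if p > lam else 0.0)
```

Because the denominator uses the exact null expectation for each test’s marginals, no uniformity assumption such as $E[P \mid H = 0] = 0.5$ is needed.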

B.3.4 Convergence properties

We have presented estimates for each component of pFDR, showing how each is either unbiased or conservative. It follows that $\widehat{pFDR}$ is asymptotically conservative as $m \to \infty$. We can, however, be more precise in characterizing the convergence properties of our pFDR estimate, as stated in Theorem 1 below.

Footnote 1: We used the available R code from http://genomics.princeton.edu/storeylab/qvalue/index.html. Results are truncated at 1. λ = 0.5 was chosen because the spline-fitting method [211] always results in estimates of π0 = 1 for these discrete data sets.

Table B.3: Comparing π̂0 estimates for synthetic data sets derived from the Epitope data with different true π0. Storey’s method was evaluated at λ = 0.5.

π0        E[P]/E[P|H=0]    2p̄       Storey
0.001     0.086            0.12     0.11
0.05      0.13             0.18     0.17
0.1       0.18             0.25     0.24
0.2       0.27             0.38     0.37
0.3       0.36             0.52     0.51
0.4       0.45             0.64     0.63
0.5       0.58             0.69     0.68
0.6       0.66             0.79     0.79
0.7       0.75             0.91     0.90
0.8       0.83             1.01     1
0.9       0.92             1.13     1
1         1.003            1.24     1

Theorem 1. Given $m$ tests, in which the p-values are IID and distributed according to the mixture model of Equation (B.14), the $H$ are IID Bernoulli random variables, and for each test $i$, $\theta_i$ is independent of $H_i$, and given a non-negative function $\rho(\cdot)$:

\[
\lim_{m \to \infty} \widehat{pFDR}(\alpha) \overset{a.s.}{=} \frac{\pi_0 + \pi_1 \frac{E[\rho(p) \mid H = 1]}{E[\rho(p) \mid H = 0]}}{\pi_0} \cdot pFDR(\alpha).
\tag{B.31}
\]

Proof. Provided in section B.6.

This convergence theorem shows that, for large samples, our pFDR estimate is conservative and becomes tightest when $\hat\pi_0$ is computed using a $\rho(\cdot)$ function that is expected to be much lower for true alternative hypotheses than for true null hypotheses.
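As a rough numerical illustration of Equation (B.31), consider hypothetical values that are not taken from any of the data sets above: $\pi_0 = 0.9$, $\pi_1 = 0.1$, and a weighting function for which the ratio of expected weights under the alternative and the null is 0.2. The asymptotic inflation factor of the estimate would then be:

```latex
\[
\frac{\pi_0 + \pi_1 \frac{E[\rho(p) \mid H = 1]}{E[\rho(p) \mid H = 0]}}{\pi_0}
= \frac{0.9 + 0.1 \times 0.2}{0.9}
= \frac{0.92}{0.9}
\approx 1.022 .
\]
```

Under these assumed values the estimate would overshoot the true pFDR by about 2%. With a ratio of 1, which corresponds to a weighting function that does not distinguish alternatives from nulls, the factor would instead be $1/\pi_0 \approx 1.11$.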

Table B.4: Comparing π̂0 estimates for the real data sets.

data set      E[P]/E[P|H=0]    2p̄       Storey
Sieve         0.63             0.82     0.69
SNPs          0.34             0.44     0.41
Epitope       0.99             1.84     1
Resistance    0.97             1.38     1

B.3.5 Dependent marginals

Until now, we have assumed the following mixture model:

\[
P \mid H, \theta \sim (1 - H) \cdot F_0(\theta) + H \cdot F_1(\theta),
\tag{B.32}
\]

where $\theta$ is independent of $H$. It is conceivable, however, that true alternative tests will tend to have more balanced marginals that permit lower p-values. For example, in the Resistance data, it is possible that positions in which no mutations confer drug resistance may be more conserved due to purifying selection, an evolutionary process that results in less observed variation. Under these conditions, we might expect the pFDR estimate to become more conservative, because a substantial proportion of the balanced marginals included in our pooled p-value estimate belong to true alternative tests, causing us to overestimate $\Pr\{P \le \alpha \mid H = 0\}$. This conservative bias is analogous to that observed in the microarray community, where a bias arises from permutation testing over alternative data that has a higher variance than the null data [106, 230].

In this section, we formalize the problem for discrete statistics and show that our pFDR estimate becomes asymptotically more conservative as the tendency increases for true alternative tests to have more evenly distributed marginals than null tests. Note that these results hold only for $\pi_0$ estimators in which $\rho(\cdot)$ is non-decreasing, a slightly more restrictive definition than we used in the previous sections. In principle, the conservative bias could be reduced using heuristic measures similar to those proposed in the microarray community [106, 230]; however, these heuristics are not guaranteed to result in a conservative pFDR estimate, so we will not explore them further here.

Let us consider a special case in which $\Pr\{\theta \mid H = 1\} \ne \Pr\{\theta \mid H = 0\}$. Specifically, we shall consider the case where the marginals of true alternative hypotheses tend to have larger $n$ and/or are more balanced, meaning that the marginals of the true alternative hypotheses tend to permit smaller p-values than the marginals of the true null hypotheses. In this case,

\[
\sum_{\theta} \Pr\{P \le \alpha \mid H = 0, \theta\} \cdot \Pr\{\theta \mid H = 0\} \overset{\text{assumed}}{\le} \sum_{\theta} \Pr\{P \le \alpha \mid H = 0, \theta\} \cdot \Pr\{\theta \mid H = 1\}.
\tag{B.33}
\]

Under these assumptions, we can derive the following large sample theorem:

Theorem 2. Given $m$ tests that follow the assumptions of Theorem 1, a non-decreasing function $\rho(\cdot)$, and Assumption (B.33), then

\[
\lim_{m \to \infty} \widehat{pFDR}(\alpha) \ge \frac{\pi_0 + \pi_1 \frac{E[\rho(p) \mid H = 1]}{E[\rho(p) \mid H = 0]}}{\pi_0} \cdot pFDR(\alpha).
\tag{B.34}
\]

Hence, under the above stated assumptions, and provided $\rho(p)$ is non-decreasing in $p$, our pFDR and FDR estimates will be asymptotically more conservative than the case where the marginals are independent of $H$. In practice, it is not known whether $H$ and $\theta$ are independent. Consequently, we recommend using the more restricted class of $\rho(\cdot)$ functions allowed by Theorem 2.

B.3.6 Filtering irrelevant tests

The discreteness of the data provides a unique opportunity to increase power further. For any discrete test, there exists an $\alpha$ such that $\Pr\{P \le \alpha\} = 0$. It turns out that including such tests in our pFDR estimate will typically increase the conservative bias. Thus, power can often be improved by first filtering out all tests that could not possibly achieve the significance threshold $\alpha$. As we shall see, such filtering will still result in an asymptotically conservative estimate and, in some cases, will provably increase power.

Because the range of the data is finite, computation of the most significant achievable p-value given the marginals is possible. We can define the minimum achievable p-value for fixed marginals as

\[
p^*(\theta) \triangleq \min_{t \in \mathrm{perm}(\theta)} p(t).
\tag{B.35}
\]

When computing $pFDR(\alpha)$, we can ignore (filter) all tests $i$ such that

\[
p^*(\theta_i) > \alpha.
\tag{B.36}
\]
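The filtering rule of Equations (B.35) and (B.36) can be sketched as follows. The function names are hypothetical, marginals are represented as an assumed tuple `(theta_x, theta_xbar, theta_y)`, and the sketch relies on the fact that, under Equation (B.2), the least probable table always attains the smallest two-tailed p-value.

```python
from math import comb


def min_achievable_p(theta_x, theta_xbar, theta_y):
    """p*(theta) of Eq. (B.35): the smallest two-tailed FET p-value that any table
    with these marginals can attain."""
    n = theta_x + theta_xbar
    lo, hi = max(0, theta_y - theta_xbar), min(theta_x, theta_y)
    probs = [comb(theta_x, a) * comb(theta_xbar, theta_y - a) / comb(n, theta_y)
             for a in range(lo, hi + 1)]
    smallest = min(probs)
    # Tables tied for least probable share the same (minimum) p-value.
    return min(1.0, sum(q for q in probs if q <= smallest * (1 + 1e-9)))


def filter_tests(all_marginals, alpha):
    """Keep only the tests that could reach significance, i.e. drop those with
    p*(theta_i) > alpha as in Eq. (B.36)."""
    return [theta for theta in all_marginals if min_achievable_p(*theta) <= alpha]
```

For instance, a test with marginals `(1, 7, 4)` has only two possible tables, each with probability 0.5, so its minimum achievable p-value is 1 and it is removed at any threshold below 1.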

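For a set of contingency tables $\mathbf{T}$, we can now write

\[
\mathbf{T} = \mathbf{T}^+_\alpha \cup \mathbf{T}^-_\alpha
\tag{B.37}
\]

for the disjoint sets

\[
\mathbf{T}^+_\alpha = \{t_i : t_i \in \mathbf{T} \wedge p^*(\theta_i) > \alpha\}, \text{ and}
\tag{B.38}
\]

\[
\mathbf{T}^-_\alpha = \{t_i : t_i \in \mathbf{T} \wedge p^*(\theta_i) \le \alpha\}.
\tag{B.39}
\]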

Filtering on $p^*(\theta) > \alpha$ can be seen as estimating pFDR over $T^-_\alpha$. Let $pFDR^*(\alpha)$ and $\widehat{pFDR}^*(\alpha)$ be the true and estimated pFDR, respectively, for $T^-_\alpha$. Then the following theorem holds.

Theorem 3. Given $m$ tests that follow the assumptions of Theorem 1,

\[ pFDR^*(\alpha) = pFDR(\alpha). \tag{B.40} \]

Proof. Recall that, for independent, identically distributed tests,

\[ pFDR(\alpha) = \frac{E[V(\alpha)]}{E[R(\alpha)]}. \tag{B.41} \]

Because

\[ E[R(\alpha) \mid \Pr\{p(T) \le \alpha\} = 0] = 0, \tag{B.42} \]

it follows that $pFDR(\alpha)$ will be the same for $T$ and $T^-_\alpha$. Therefore, $pFDR^*(\alpha) = pFDR(\alpha)$.

Thus, filtering on $p^*(\theta_i) > \alpha$ does not change the true pFDR. Furthermore, under the assumptions of Theorem 2, we have

\[ pFDR^*(\alpha) \overset{a.s.}{\le} \lim_{m \to \infty} \widehat{pFDR}^*(\alpha). \tag{B.43} \]

Consequently, we can compute $\widehat{pFDR}^*(\alpha)$ instead of $\widehat{pFDR}(\alpha)$. Moreover, if true alternative tests are monotonically more concentrated at lower p-values, that is, if

\[ \frac{\Pr\{P \le \alpha \mid H = 1\}}{\Pr\{P \le \alpha \mid H = 0\}} \tag{B.44} \]

is non-increasing in α, then

\[ \lim_{m \to \infty} \widehat{pFDR}^*(\alpha) \overset{a.s.}{\le} \lim_{m \to \infty} \widehat{pFDR}(\alpha), \tag{B.45} \]

meaning that filtering is asymptotically guaranteed to provide a tighter estimate and therefore increase power (see Lemma 4 in the Appendix for details). Although this condition is often not met for discrete data, it is often approximately met and provides a good rationale for filtering. In addition, because both $\widehat{pFDR}(\alpha)$ and $\widehat{pFDR}^*(\alpha)$ are asymptotically greater than $pFDR(\alpha)$, for large samples we can compute both the filtered and unfiltered estimates and choose whichever yields the lower value.

The only estimate that filtering changes is $\hat{\pi}_0$. Let $\hat{\pi}_0(\alpha)$ denote the estimated $\pi_0$ over $T^-_\alpha(\alpha)$, and $\hat{\pi}_0(1)$ denote the estimated $\pi_0$ over $T$. It may be that $\hat{\pi}_0(\alpha) \le \hat{\pi}_0(1)$, and $\hat{\pi}_0(\alpha)$ is not a conservative estimate of the true $\pi_0$ of the original set of tests $T$. $\hat{\pi}_0(\alpha)$ is, however, a conservative estimate of $\pi_0$ among the filtered set of tests $T^-_\alpha(\alpha)$. That is, $\hat{\pi}_0(\alpha)$ is a conservative estimate of the proportion of tests that are truly null among those tests that could achieve significance level α. In addition to providing increased power, $\hat{\pi}_0(\alpha)$ may provide valuable information in cases where a large proportion of tests could not achieve α. In such cases, the overall $\pi_0$ may be quite high, but the $\pi_0$ among tests that could achieve α (those that we are interested in) may be much lower. Figure B.3 demonstrates the advantage of using $\hat{\pi}_0(\alpha)$ over $\hat{\pi}_0(1)$.

[Figure B.3 plot: pi(a) on the vertical axis (0.4 to 1) against α on a logarithmic horizontal axis (0.0001 to 1), with one curve each for the Epitope, Resistance, SNPs, and Sieve data sets.]

Figure B.3: The advantage of using $\pi_0(\alpha)$ computed using the filtering technique over not filtering. Because filtering only affects $\hat{\pi}_0$, these gains result in proportionally reduced (yet conservative) pFDR estimates.

B.4 Numerical results

To explore the applicability of our proposed pFDR estimator, we created a number of Epitope-derived synthetic data sets with different numbers of tables that follow the mixture model assumptions above, allowing for an unequal distribution of marginals as defined in Assumption (B.33) (see the Appendix for details). For each of these data sets, we plotted the estimated pFDR(α) against the true proportion of false discoveries using p < α as the threshold (Figure B.4).

[Figure B.4 plot: both axes logarithmic, with ticks from 0.0001 to 0.1; legend entries 70k, 10k, and 35k.]

Figure B.4: Estimated pFDR vs. true false discovery proportion for synthetic data with increasing number of tables generated from the Epitope data set. Estimates above the dashed line are conservative.

In practice, it is often the case that $pFDR(\alpha) > pFDR(\beta)$ for some $\beta > \alpha$. Therefore, there is no reason to choose α as the rejection region, because choosing β will result in more rejected tests and a lower proportion of false positives among those rejected tests. For this reason, Storey [208] proposed the q-value, defined to be

\[ q(\alpha) \triangleq \min_{\beta \ge \alpha} \widehat{pFDR}(\beta). \tag{B.46} \]
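Definition (B.46) amounts to a running minimum taken from the largest threshold downwards. The following sketch assumes the pFDR estimates have already been computed on a grid of thresholds sorted in increasing order; the function name `q_values` and the example numbers are illustrative only.

```python
def q_values(pfdr_estimates):
    """q-values for thresholds sorted in increasing order.

    pfdr_estimates[k] is the estimated pFDR at the k-th smallest threshold;
    the q-value at that threshold is the minimum estimate over all
    thresholds at least as large.
    """
    q = list(pfdr_estimates)
    # Sweep from the largest threshold to the smallest, carrying the minimum.
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])
    return q


# The estimate at the smallest threshold (0.30) exceeds the estimate at the
# next one (0.10), so its q-value is lowered to 0.10.
example = q_values([0.30, 0.10, 0.20, 0.25])
```

By construction the resulting q-values are non-decreasing in the threshold, even when the underlying pFDR estimates are not.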

To demonstrate the power gains of our method in practice, we conclude by comparing the number of significant results for each of our example data sets as a function of the q-value threshold (Figure B.5). As can be seen, our conservative estimates result in a substantial increase in the number of tests called significant at a variety of thresholds.

[Figure B.5 plots: four panels, A Resistance, B Epitope, C Sieve, and D SNPs, each with a horizontal axis starting at 0 and marked in steps of 0.1.]

Figure B.5: Plotting the portion of rejected cases vs. q-values for the real data sets. The solid line is the proposed method for discrete data and the dotted line is the S&T method using marginal p-values.

B.5 Creating synthetic data sets

In real data sets, the properties of the data that we estimate through our pFDR computation, such as π0 or the true pFDR, are of course unknown. In such cases, synthetic data sets that allow manipulation of these properties can provide insights into the usefulness of various estimators. It is important, however, to create synthetic data sets that are as close as possible to the real data. In this section we explain the procedure that was used to create the synthetic data sets. To simulate the real data, we used only marginals that were observed. Given such marginals, we first decide whether the synthetic table that we create will be null or alternative. For example, if we are interested in fixing π0, we can thus ensure that a proportion π0 of the tables that we create are nulls.

B.5.1 Creating null and alternative tables from given marginals

To create a null table given a set of marginals $\theta = \{\theta_X, \theta_{\bar{X}}, \theta_Y, \theta_{\bar{Y}}\}$, we simulate $n$ tests where each test has a result $\{X, Y\}$ such that $X$ is independent of $Y$. For each such test we select the result $X \in \{1, 0\}$ following $\Pr\{X = 1 \mid H_0\} = \theta_X / n$. We select $Y \in \{1, 0\}$ following $\Pr\{Y = 1 \mid H_0\} = \theta_Y / n$.

To create an alternative table, in which $X$ and $Y$ are not independent, we simulate tests by first selecting $X$ using the same procedure as above, and select $Y \mid X$ using the following distribution:

\[ \Pr\{Y = 1 \mid H_1, X = 1\} = \frac{a}{\theta_X} \quad \text{and} \quad \Pr\{Y = 1 \mid H_1, X = 0\} = \frac{c}{\theta_{\bar{X}}}. \]

B.5.2 Selecting marginals

We have created two different types of data sets: one where all the marginals come from the same distribution, and one where the distribution of the marginals depends on whether the table is null or alternative. In the case of a single distribution of marginals, we divided the observed marginals into 10 exponential bins [1, 1/2], (1/2, 1/4], (1/4, 1/8], . . . and placed each marginal θ into a bin according to $\min\{\theta_X, \theta_{\bar{X}}, \theta_Y, \theta_{\bar{Y}}\} / \max\{\theta_X, \theta_{\bar{X}}, \theta_Y, \theta_{\bar{Y}}\}$. We then choose a bin uniformly, and select a set of marginals uniformly from the bin. We then designate the selected marginal as null with probability π0 and generate the table accordingly. This approach biases us towards choosing marginals that permit lower p-values, which enables us to generate interesting alternative tables, even when we force π0 to be much lower than it is in the real data. When the distribution of marginals depends on whether the table is null or alternative, we draw θ from bin $b \in \{1, \ldots, 10\}$ with probability $1/2^{10-b-1}$ for a null table and with probability $1/2^{b}$ for an alternative table.
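The table-generation step of Section B.5.1, together with the null/alternative designation described above, can be sketched as follows. This is an illustration under stated assumptions rather than the code used to produce the synthetic data: it assumes that $a$ and $c$ are the observed counts of the cells (X = 1, Y = 1) and (X = 0, Y = 1), that $\theta_{\bar{X}} = n - \theta_X$, and that $0 < \theta_X < n$. The binning of marginals is omitted, and the function names are hypothetical.

```python
import random


def simulate_table(n, theta_x, theta_y, rng, alt_cells=None):
    """Simulate n (X, Y) draws and return the counts [[a, b], [c, d]].

    Rows are X = 1 and X = 0; columns are Y = 1 and Y = 0.
    With alt_cells=None the table is null (X independent of Y).
    With alt_cells=(a, c), Y depends on X through a / theta_x and
    c / (n - theta_x), where a and c are assumed to be observed cell counts.
    """
    p_x = theta_x / n
    if alt_cells is None:
        p_y_x1 = p_y_x0 = theta_y / n
    else:
        a, c = alt_cells
        p_y_x1 = a / theta_x
        p_y_x0 = c / (n - theta_x)
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        x = rng.random() < p_x
        y = rng.random() < (p_y_x1 if x else p_y_x0)
        table[0 if x else 1][0 if y else 1] += 1
    return table


def simulate_from_observed(observed, pi0, rng):
    """Designate a table as null with probability pi0, then simulate it.

    observed is a 2x2 table [[a, b], [c, d]] whose marginals are reused.
    """
    (a, b), (c, d) = observed
    n, theta_x, theta_y = a + b + c + d, a + b, a + c
    is_null = rng.random() < pi0
    table = simulate_table(n, theta_x, theta_y, rng,
                           None if is_null else (a, c))
    return is_null, table


rng = random.Random(0)
is_null, table = simulate_from_observed([[8, 2], [3, 7]], pi0=0.5, rng=rng)
```

Because each draw is sampled independently, the marginals of a simulated table match the requested marginals only in expectation; the table total $n$ is always preserved.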

B.6 Proofs and Remarks

In this appendix, we formalize the theoretical results from the main paper. For brevity, we will write $H_0$ to mean the event $H = 0$ and $H_1$ to mean the event $H = 1$.

Lemma 1. Given $m$ tests, in which the P-values are IID and distributed according to the mixture Equation (B.14), the $H$ are IID Bernoulli random variables, and for each test $i$, $\theta_i$ is independent of $H_i$,

\[ E\left[\frac{1}{m} \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H_0, \theta_i\}\right] = \Pr\{P \le \alpha \mid H_0\}. \tag{B.47} \]

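Proof of Lemma 1. Because θ is independent of $H$, we can write

\begin{align*}
\Pr\{P \le \alpha \mid H_0\} &= \sum_{\theta} \Pr\{P \le \alpha \mid H_0, \theta\} \cdot \Pr\{\theta \mid H_0\} \tag{B.48} \\
&= \sum_{\theta} \Pr\{P \le \alpha \mid H_0, \theta\} \cdot \Pr\{\theta\} \tag{B.49}
\end{align*}

where the summation is over all possible marginals. Furthermore,

\[ E\left[\frac{1}{m} \sum_{j=1}^{m} 1\{\theta_j = \theta\}\right] = \Pr\{\theta\}. \tag{B.50} \]

Thus,

\begin{align*}
\Pr\{P \le \alpha \mid H_0\} &= \sum_{\theta} \Pr\{P \le \alpha \mid H_0, \theta\} \, E\left[\frac{1}{m} \sum_{j=1}^{m} 1\{\theta_j = \theta\}\right] \tag{B.51} \\
&= E\left[\frac{1}{m} \sum_{j=1}^{m} \sum_{\theta} \Pr\{P \le \alpha \mid H_0, \theta\} \cdot 1\{\theta_j = \theta\}\right] \\
&= E\left[\frac{1}{m} \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H_0, \theta_i\}\right]. \tag{B.52}
\end{align*}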

Lemma 2. Let $\rho(\cdot)$ be any non-negative function. Then, under the assumptions of Lemma 1,

\[ E\left[\frac{\sum_{i=1}^{m} \sum_{p} \rho(p) \cdot 1\{p_i = p\}}{\sum_{i=1}^{m} \sum_{p} \rho(p) \cdot \Pr\{P = p \mid H_0, \theta_i\}}\right] \ge \pi_0. \tag{B.53} \]

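Proof of Lemma 2. Recall that

\[ \Pr\{P = p\} = \pi_0 \cdot \Pr\{P = p \mid H_0\} + \pi_1 \cdot \Pr\{P = p \mid H_1\}, \tag{B.54} \]

where $\pi_1 = \Pr\{H_1\} = 1 - \pi_0$. Thus, it follows that

\[ \Pr\{P = p\} \ge \pi_0 \cdot \Pr\{P = p \mid H_0\} \tag{B.55} \]

and

\[ \sum_{p} \rho(p) \Pr\{P = p\} \ge \sum_{p} \rho(p) \Pr\{P = p \mid H_0\} \cdot \pi_0 \tag{B.56} \]

for any non-negative function $\rho(\cdot)$. Thus, it follows that

\[ \pi_0 \le \frac{\sum_{p} \rho(p) \Pr\{P = p\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}}. \tag{B.57} \]

It follows analogously to the proof of Lemma 1 that

\[ \Pr\{P = p \mid H_0\} = E\left[\frac{1}{m} \sum_{i=1}^{m} \Pr\{P = p \mid H_0, \theta_i\}\right] \tag{B.58} \]

and

\[ \Pr\{P = p\} = E\left[\frac{1}{m} \sum_{i=1}^{m} 1\{p_i = p\}\right]. \tag{B.59} \]

Thus, it follows that

\begin{align*}
\frac{\sum_{p} \rho(p) \Pr\{P = p\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}} &= \frac{\sum_{p} \rho(p) \frac{1}{m} E\left[\sum_{i=1}^{m} 1\{p_i = p\}\right]}{\sum_{p} \rho(p) \frac{1}{m} E\left[\sum_{i=1}^{m} \Pr\{P = p \mid H_0, \theta_i\}\right]} \tag{B.60} \\
&= \frac{E\left[\sum_{i=1}^{m} \sum_{p} \rho(p) \, 1\{p_i = p\}\right]}{E\left[\sum_{i=1}^{m} \sum_{p} \rho(p) \Pr\{P = p \mid H_0, \theta_i\}\right]}. \tag{B.61}
\end{align*}

Because $\sum_{p} \rho(p) \Pr\{P = p\}$ is a linearly increasing function of $\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}$, it follows from Jensen's inequality that

\[ \frac{E\left[\sum_{i=1}^{m} \sum_{p} \rho(p) \, 1\{p_i = p\}\right]}{E\left[\sum_{i=1}^{m} \sum_{p} \rho(p) \Pr\{P = p \mid H_0, \theta_i\}\right]} \le E\left[\frac{\sum_{i=1}^{m} \sum_{p} \rho(p) \, 1\{p_i = p\}}{\sum_{i=1}^{m} \sum_{p} \rho(p) \Pr\{P = p \mid H_0, \theta_i\}}\right]. \tag{B.62} \]

Thus, $E[\hat{\pi}_0] \ge \pi_0$.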

Remark 1. Storey [208, 209] argued that, for continuous statistics, we would expect most of the observations with $p$ close to 1 to be true null, and thus a natural estimate for $\pi_0$ is

\[ \hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{(1 - \lambda) m} \tag{B.63} \]

for some tuning parameter $0 \le \lambda < 1$. This procedure assumes a continuous underlying distribution, such that $(1 - \lambda) = \Pr\{p_i > \lambda \mid H_i = 0\}$ for all $i$. It can be shown that Equation (B.27) is a special case of Equation (B.26) in which

\[ \rho(p) = \begin{cases} 0 & \text{if } p \le \lambda, \\ 1 & \text{otherwise.} \end{cases} \tag{B.64} \]

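Proof.

\begin{align*}
\hat{\pi}_0 &= \frac{\sum_{p} \sum_{i=1}^{m} \rho(p) \, 1\{p_i = p\}}{\sum_{p} \sum_{i=1}^{m} \rho(p) \Pr\{P = p \mid H_0, \theta_i\}} \tag{B.65} \\
&= \frac{\sum_{i=1}^{m} \sum_{p > \lambda} 1\{p_i = p\}}{\sum_{i=1}^{m} \sum_{p > \lambda} \Pr\{P = p \mid H_0, \theta_i\}} \tag{B.66} \\
&= \frac{\#\{p_i > \lambda\}}{\sum_{i=1}^{m} \Pr\{P > \lambda \mid H_0, \theta_i\}} \tag{B.67} \\
&= \frac{\#\{p_i > \lambda\}}{m \cdot \widehat{\Pr}\{P > \lambda \mid H_0\}}. \tag{B.68}
\end{align*}

For discrete statistics,

\[ \Pr\{p_i > \lambda \mid H_0\} \ge (1 - \lambda), \tag{B.69} \]

thus, it follows that

\[ \hat{\pi}_0 \le \frac{\#\{p_i > \lambda\}}{(1 - \lambda) m}, \tag{B.70} \]

making it a tighter estimate of $\pi_0$ than we get when assuming the statistics are continuous.

Remark 2. In an argument similar to what led to Equation (B.26), Pounds and Cheng [183] pointed out that

\[ \pi_0 \le \frac{E[P]}{E[P \mid H_0]}. \tag{B.71} \]

Assuming

\[ E[P] = \bar{p} \triangleq \frac{1}{m} \sum_{i=1}^{m} p_i, \tag{B.72} \]

and $E[P \mid H_0] \ge \frac{1}{2}$, Pounds and Cheng suggest defining

\[ \hat{\pi}_0 \triangleq 2 \cdot \bar{p}. \tag{B.73} \]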

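It turns out that this Pounds-Cheng approach is a special case of Equation (B.26), with a conservative approximation for $E[p \mid H_0]$.

Proof. Let $\rho(p) = p$, then

\begin{align*}
\hat{\pi}_0 &= \frac{\sum_{i=1}^{m} \sum_{p} p \cdot 1\{p_i = p\}}{\sum_{i=1}^{m} \sum_{p} p \cdot \Pr\{P = p \mid H_0, \theta_i\}} \tag{B.74} \\
&= \frac{\frac{1}{m} \sum_{i=1}^{m} p_i}{\frac{1}{m} \sum_{i=1}^{m} E[P \mid H_0, \theta_i]} \tag{B.75} \\
&= \frac{\bar{p}}{\sum_{\theta} E[P \mid H_0, \theta] \cdot \widehat{\Pr}\{\theta\}} \tag{B.76} \\
&= \frac{\bar{p}}{\widehat{E}[P \mid H_0]} \tag{B.77} \\
&\le \frac{\bar{p}}{0.5}, \tag{B.78}
\end{align*}

where $\widehat{E}[p \mid H_0]$ is our unbiased estimate of $E[P \mid H_0]$ and we define $\widehat{\Pr}\{\theta\}$ as in Equation (B.50).

Lemma 3. Under the assumptions of Lemma 2,

\[ \lim_{m \to \infty} \hat{\pi}_0 \overset{a.s.}{=} \pi_0 + \pi_1 \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]}. \tag{B.79} \]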

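Proof. By the strong law of large numbers, Equations (B.58) and (B.59) imply that $\widehat{\Pr}\{P = p \mid H_0\}$ converges almost surely to $\Pr\{P = p \mid H_0\}$ and $\widehat{\Pr}\{P = p\}$ converges almost surely to $\Pr\{P = p\}$. Thus, it follows from Equation (B.60) that

\[ \lim_{m \to \infty} \hat{\pi}_0 \overset{a.s.}{=} \frac{\sum_{p} \rho(p) \Pr\{P = p\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}}. \tag{B.80} \]

Furthermore, it follows from Equation (B.22) that

\[ \frac{\sum_{p} \rho(p) \Pr\{P = p\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}} = \pi_0 + \pi_1 \cdot \frac{\sum_{p} \rho(p) \Pr\{P = p \mid H_1\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}}. \tag{B.81} \]

Thus,

\begin{align*}
\lim_{m \to \infty} \hat{\pi}_0 &\overset{a.s.}{=} \pi_0 + \pi_1 \cdot \frac{\sum_{p} \rho(p) \Pr\{P = p \mid H_1\}}{\sum_{p} \rho(p) \Pr\{P = p \mid H_0\}} \\
&= \pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]}. \tag{B.82}
\end{align*}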
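The estimator analyzed in Lemmas 2 and 3, written out in (B.65), can be illustrated numerically. The sketch below assumes each test's exact null distribution is supplied as a dictionary from achievable p-values to their null probabilities; the toy distribution, the observed p-values, and all function names are invented for illustration and are not taken from the data sets in this work.

```python
def expected_rho_under_null(null_dist, rho):
    """E[rho(P) | H0, theta] for one test's exact null distribution."""
    return sum(rho(p) * prob for p, prob in null_dist.items())


def pi0_estimate(observed_p, null_dists, rho):
    """Ratio of observed rho(p_i) to their exact null expectations, summed over tests."""
    numerator = sum(rho(p) for p in observed_p)
    denominator = sum(expected_rho_under_null(d, rho) for d in null_dists)
    return numerator / denominator


# Hypothetical discrete null: three achievable p-values with Pr{P <= p} <= p.
toy_null = {0.2: 0.2, 0.6: 0.4, 1.0: 0.4}
null_dists = [toy_null] * 4
observed_p = [0.2, 0.2, 0.6, 1.0]

# rho(p) = p with the exact null, versus the approximation E[P | H0] = 0.5.
exact_identity = pi0_estimate(observed_p, null_dists, lambda p: p)
pounds_cheng = 2 * sum(observed_p) / len(observed_p)

# rho(p) = 1{p > 0.5} with the exact null, versus the continuous (1 - lambda) m.
exact_indicator = pi0_estimate(observed_p, null_dists,
                               lambda p: 1.0 if p > 0.5 else 0.0)
storey = sum(p > 0.5 for p in observed_p) / ((1 - 0.5) * len(observed_p))
```

On this toy input the exact-null estimates are roughly 0.735 and 0.625, whereas both continuous approximations give 1.0, which mirrors the inequalities (B.70) and (B.78). The estimate is not capped at 1 here, and a ρ(·) whose null expectation is zero for every test would make the denominator vanish.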

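Proof of Theorem 1.

\begin{align*}
\lim_{m \to \infty} \widehat{pFDR}(\alpha) &= \lim_{m \to \infty} \frac{\hat{\pi}_0 \cdot \widehat{\Pr}\{P \le \alpha \mid H_0\}}{\widehat{\Pr}\{P \le \alpha\} \cdot \widehat{\Pr}\{R(\alpha) > 0\}} \tag{B.83} \\
&= \frac{\lim_{m \to \infty} \hat{\pi}_0 \cdot \lim_{m \to \infty} \widehat{\Pr}\{P \le \alpha \mid H = 0\}}{\lim_{m \to \infty} \widehat{\Pr}\{P \le \alpha\} \cdot \lim_{m \to \infty} \widehat{\Pr}\{R(\alpha) > 0\}}. \tag{B.84}
\end{align*}

By the strong law of large numbers, Lemma 1 implies that $\widehat{\Pr}\{P \le \alpha \mid H = 0\}$ converges almost surely to $\Pr\{P \le \alpha \mid H_0\}$, Equation (B.16) implies that $\widehat{\Pr}\{P \le \alpha\}$ converges almost surely to $\Pr\{P \le \alpha\}$, and $\widehat{\Pr}\{R(\alpha) > 0\}$ converges almost surely to 1. Thus

\begin{align*}
\lim_{m \to \infty} \widehat{pFDR}(\alpha) &\overset{a.s.}{=} \frac{\lim_{m \to \infty} \hat{\pi}_0 \cdot \Pr\{P \le \alpha \mid H_0\}}{\Pr\{P \le \alpha\}} \tag{B.85} \\
&= \frac{\lim_{m \to \infty} \hat{\pi}_0}{\pi_0} \cdot pFDR(\alpha). \tag{B.86}
\end{align*}

Finally, it follows from Lemma 3 that

\[ \lim_{m \to \infty} \widehat{pFDR}(\alpha) \overset{a.s.}{=} \frac{\pi_0 + \pi_1 \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]}}{\pi_0} \cdot pFDR(\alpha). \tag{B.87} \]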
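For concreteness, the large sample form of the estimate, which (B.105) writes as $\hat{\pi}_0 \cdot \sum_i \Pr\{P \le \alpha \mid H_0, \theta_i\} / (R(\alpha) \vee 1)$, can be sketched as below. The sketch omits the $\widehat{\Pr}\{R(\alpha) > 0\}$ factor of (B.83), which converges to 1, and takes $\hat{\pi}_0$ as given. The null distributions, observed p-values, and the value of $\hat{\pi}_0$ are hypothetical.

```python
def pfdr_estimate(alpha, observed_p, null_dists, pi0_hat):
    """Large-sample pFDR estimate at threshold alpha.

    null_dists[i] maps each achievable p-value of test i to its exact
    null probability; pi0_hat is a previously computed pi0 estimate.
    """
    # Expected number of null tests with p <= alpha, summed over tests.
    expected_null = sum(prob
                        for dist in null_dists
                        for p, prob in dist.items()
                        if p <= alpha)
    # R(alpha) v 1: number of rejections, floored at one.
    rejected = max(sum(1 for p in observed_p if p <= alpha), 1)
    return pi0_hat * expected_null / rejected


toy_null = {0.2: 0.2, 0.6: 0.4, 1.0: 0.4}
null_dists = [toy_null] * 4
observed_p = [0.2, 0.2, 0.6, 1.0]
estimate = pfdr_estimate(0.2, observed_p, null_dists, pi0_hat=0.625)

# A test that can never reach alpha contributes nothing to either sum, so
# for a fixed pi0_hat the estimate is unchanged when it is appended.
unreachable = {1.0: 1.0}
with_extra = pfdr_estimate(0.2, observed_p + [1.0],
                           null_dists + [unreachable], pi0_hat=0.625)
```

With a fixed $\hat{\pi}_0$ the two estimates coincide, which is consistent with the observation that filtering changes only the $\pi_0$ estimate.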

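Proof of Theorem 2. The proof follows analogously to that of Theorem 1 by noting that the present assumptions lead to

\begin{align*}
\lim_{m \to \infty} \widehat{\Pr}\{P \le \alpha\} &\overset{a.s.}{=} \Pr\{P \le \alpha\} \tag{B.88} \\
\lim_{m \to \infty} \widehat{\Pr}\{P \le \alpha \mid H_0\} &\overset{a.s.}{\ge} \Pr\{P \le \alpha \mid H_0\} \tag{B.89} \\
\lim_{m \to \infty} \hat{\pi}_0 &\overset{a.s.}{\ge} \pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]}. \tag{B.90}
\end{align*}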

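We shall prove each of these statements in turn. Equation (B.88) follows immediately by noting that our estimate $\widehat{\Pr}\{P \le \alpha\} \triangleq \frac{R \vee 1}{m}$ is not affected by the distribution of θ. Equation (B.89) can be seen by noting that we can no longer use equality (B.49) and must instead use

\[ \Pr\{P \le \alpha \mid H_0\} = \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\}. \tag{B.91} \]

Thus, we have

\begin{align*}
&\lim_{m \to \infty} \widehat{\Pr}\{P \le \alpha \mid H_0\} \tag{B.92} \\
&\quad \overset{a.s.}{=} \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta'\} \tag{B.93} \\
&\quad = \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \times \left(\Pr\{\theta' \mid H_0\} \cdot \pi_0 + \Pr\{\theta' \mid H_1\} \cdot \pi_1\right) \tag{B.94} \\
&\quad = \pi_0 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} + \pi_1 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_1\} \tag{B.95} \\
&\quad \ge \pi_0 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} + \pi_1 \sum_{\theta'} \Pr\{P \le \alpha \mid H_0, \theta'\} \cdot \Pr\{\theta' \mid H_0\} \tag{B.96} \\
&\quad = \Pr\{P \le \alpha \mid H_0\}, \tag{B.97}
\end{align*}

where the inequality follows from Assumption (B.33). Finally, Inequality (B.90) follows from the fact that the added assumptions of Theorem 2 only affect the denominator of our $\pi_0$ estimate, Equation (B.26). Furthermore, Inequality (B.89) implies

\[ \lim_{m \to \infty} \widehat{\Pr}\{P \ge \alpha \mid H_0\} \overset{a.s.}{\le} \Pr\{P \ge \alpha \mid H_0\}, \tag{B.98} \]

from which it follows that

\[ \lim_{m \to \infty} \sum_{p} \rho(p) \cdot \widehat{\Pr}\{P = p \mid H_0\} \overset{a.s.}{\le} \sum_{p} \rho(p) \cdot \Pr\{P = p \mid H_0\} \tag{B.99} \]

for any non-decreasing function $\rho(p)$. Thus, it follows that

\begin{align*}
\lim_{m \to \infty} \hat{\pi}_0 &\overset{a.s.}{\ge} \frac{E[\rho(p)]}{E[\rho(p) \mid H_0]} \\
&= \pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]}. \tag{B.100}
\end{align*}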

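Lemma 4. Under the assumptions of Theorem 3, if

\[ \frac{\Pr\{P \le \alpha \mid H_1\}}{\Pr\{P \le \alpha \mid H_0\}} \tag{B.101} \]

is non-increasing in α, then

\[ \lim_{m \to \infty} \widehat{pFDR}^*(\alpha) \overset{a.s.}{\le} \lim_{m \to \infty} \widehat{pFDR}(\alpha). \tag{B.102} \]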

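Proof. Recall our large sample estimate

\begin{align*}
\widehat{pFDR}(\alpha) &= \frac{\hat{\pi}_0 \cdot \widehat{\Pr}\{P \le \alpha \mid H_0\}}{\widehat{\Pr}\{P \le \alpha\}} \tag{B.103} \\
&= \frac{\hat{\pi}_0 \cdot m \cdot \frac{1}{m} \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H_0, \theta_i\}}{R(\alpha) \vee 1} \tag{B.104} \\
&= \frac{\hat{\pi}_0 \cdot \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H_0, \theta_i\}}{R(\alpha) \vee 1}. \tag{B.105}
\end{align*}

Removing $n$ tests with $p^*(\theta) > \alpha$ will have no effect on $(R(\alpha) \vee 1)$ or on

\[ \sum_{i=1}^{m} \Pr\{P \le \alpha \mid H_0, \theta_i\}. \]

We will show, however, that, under the present assumptions, our $\pi_0$ estimate under filtering will almost surely be lower than our $\pi_0$ estimate without filtering. Let $p^+$ denote the event $p^*(\theta) > \alpha$ and $p^-$ denote the event $p^*(\theta) \le \alpha$. From Equation (B.81), we can write

\begin{align*}
\lim_{m \to \infty} \hat{\pi}_0 &\overset{a.s.}{=} \pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H_1]}{E[\rho(p) \mid H_0]} \\
&= \pi_0 + \pi_1 \cdot \frac{E[\rho(p) \mid H_1, p^+] \Pr\{p^+\} + E[\rho(p) \mid H_1, p^-] \Pr\{p^-\}}{E[\rho(p) \mid H_0, p^+] \Pr\{p^+\} + E[\rho(p) \mid H_0, p^-] \Pr\{p^-\}}. \tag{B.106}
\end{align*}

Let

\[ \hat{\pi}_0(\alpha) \triangleq \frac{E[\rho(p) \mid p^-]}{E[\rho(p) \mid H_0, p^-]} \tag{B.107} \]

be the estimated $\pi_0$ over $T^-_\alpha$. We wish to show that

\[ \lim_{m \to \infty} \hat{\pi}_0(\alpha) \le \lim_{m \to \infty} \hat{\pi}_0(1), \tag{B.108} \]

which, by Equation (B.106), is true if and only if

\[ \frac{E[\rho(p) \mid H_1, p^+] \Pr\{p^+\} + E[\rho(p) \mid H_1, p^-] \Pr\{p^-\}}{E[\rho(p) \mid H_0, p^+] \Pr\{p^+\} + E[\rho(p) \mid H_0, p^-] \Pr\{p^-\}} \ge \frac{E[\rho(p) \mid H_1, p^-]}{E[\rho(p) \mid H_0, p^-]}. \tag{B.109} \]

Thus, it follows that Equation (B.108) is true if and only if

\[ \frac{E[\rho(p) \mid H_1, p^+]}{E[\rho(p) \mid H_0, p^+]} \ge \frac{E[\rho(p) \mid H_1, p^-]}{E[\rho(p) \mid H_0, p^-]}. \tag{B.110} \]

Now Assumption (B.44) implies that

\[ \frac{\Pr\{P > \alpha \mid H_1, p^+\}}{\Pr\{P > \alpha \mid H_0, p^+\}} \ge \frac{\Pr\{P > \alpha \mid H_1, p^-\}}{\Pr\{P > \alpha \mid H_0, p^-\}}, \tag{B.111} \]

from which Inequality (B.110), and hence Lemma 4, follows from the constraint that $\rho(\cdot)$ is non-decreasing.

B.7 Discussion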

The false discovery rate has proven to be an extremely useful tool when performing large numbers of tests, as it allows the researcher to balance the number of significant results with an estimate of the proportion of those results that are truly null. Storey presented novel methods for estimating pFDR and q-values for general test statistics [208, 209]. He factored the pFDR computation into several components and suggested estimators for each component. Perhaps the most discussed component is π0, the proportion of tests that are expected to be null over the entire data set. For example, Dalmasso and colleagues [42] derived a class of π0 estimators for continuous distributions that take the same form as Equation (B.26) and explored properties of ρ(·). They proved that a certain class of convex ρ(·) functions yielded provably less biased π0 estimators than ρ(p) = p. Similarly, Genovese and Wasserman [78] explore several estimators under a mixture model framework that assumes a uniform continuous null distribution and provide estimates of confidence intervals, and Langaas and colleagues [126] use the mixture model to define π0 estimators that perform particularly well under certain continuous convexity assumptions.

When the data are finite, however, some of the underlying assumptions used by the above methods, such as the uniform distribution of p-values under the null and the convexity and monotone distribution of p-values under the alternative, are violated [183]. In such cases, some of the methods developed for general statistics become overly conservative, and some may provide anti-conservative estimates. For example, the estimators of Dalmasso et al. [42] assume that the null distribution is non-increasing in p. As we have seen, contingency tables provide a common example where these assumptions are grossly violated, even when the number of observations in each table is quite high. In these cases, the use of marginal p-values leads to severe conservative bias in the FDR estimation.

Pounds and Cheng [183] addressed the conservative bias of FDR estimation on finite data by proposing a new π0 estimator. This estimator avoids the extreme conservative bias of Storey's spline-fitting method on finite data, in which π0 estimates at λ = 1 may have more bias rather than less. On our data sets, the method of Pounds and Cheng was comparable to Storey's estimator at λ = 0.5. A key assumption in the method of Pounds and Cheng is that the expected p-value under the null hypothesis is 0.5, which was grossly violated in all of our contingency table data sets. Replacing this assumption with the exact null distribution substantially decreased the bias in all our tests.
Our theoretical results indicate that the optimal ρ(·) is the one that minimizes the ratio of the expected ρ(·) under the alternative hypothesis to the expected ρ(·) under the null hypothesis. Other ρ(·) functions than those described here may thus yield less biased estimates. Several authors have proposed randomization testing as a means of dealing with non-uniform or unknown p-value distributions, with a focus on non-uniform continuous distributions (see [36] for review). Focusing on Fisher's exact test allows us to implement exact permutation tests efficiently even for very large data sets, resulting in exact estimation of the pooled null distribution, a straightforward analysis of the convergence properties, and the removal of numerical error from the estimation. Furthermore, the exact null distribution allows us to identify and remove tests that cannot be called significant, thereby increasing power. This approach was first proposed by Gilbert [79], who suggested choosing a p-value threshold p0 and removing a priori all tests for which no permutation of the contingency table results in p ≤ p0. To choose p0, Gilbert suggested using a derivative of the Bonferroni adjusted p-value. Unfortunately, it can be shown that this threshold is too aggressive and will often remove tests that should be considered significant. In contrast, choosing p0 = α leaves the true pFDR unchanged while often achieving an increase in statistical power.

This paper provides estimators for the various components of the pFDR, based on a permutation testing approach. We combine here several ideas that were previously suggested, adapting them to the important case of contingency tables. As we have shown above, our methods can rapidly provide tight estimates of pFDR and q-values for very large data sets. Although we have chosen to focus on Fisher's exact test, analogous results can be derived for any discrete test for which all permutations of the data can be efficiently computed.

VITA

Jonathan Carlson was born and raised in Beaverton, Oregon. In 2003, he graduated from Dartmouth, where he met a beautiful girl named Kate, pole vaulted, and batted cleanup for the Fighting Mullets. Although the Mullets made the intramural championships several times, they had a propensity for choking and never came away with a T-shirt. In 2004, Jonathan married Kate, finally said goodbye to Dartmouth, and moved back west, hoping to find a team that could come through in the clutch. In 2006, he signed with the Infrared Sox in the University of Washington co-rec league and with the Fleas in the men’s league. He went on to win two T-shirts with the IR Sox, setting team records for home runs and slugging percentage, and one T-shirt with the Fleas. In 2009 he graduated with his Ph.D. in computer science and engineering from the University of Washington. He currently resides in Marina del Rey, California, is a researcher for the eScience group of Microsoft Research, and is a free agent.
