Statistical Approaches for Molecular and Systems Biology

INAUGURAL DISSERTATION
for the attainment of the doctoral degree
of the Faculty of Mathematics and Physics
of the Albert-Ludwigs-Universität Freiburg im Breisgau

submitted by Clemens Kreutz, November 2011

Dean: Prof. Dr. Kay Königsmann
Thesis supervisor: Prof. Dr. Jens Timmer
Referee: Prof. Dr. Jens Timmer
Co-referee: Prof. Dr. Martin Schumacher
Examiner, Theoretical Physics: Prof. Dr. Thomas Filk
Examiner, Experimental Physics: Prof. Dr. Hanspeter Helm
Date of announcement of the examination results: 12 December 2011

Publications

Gene expression/DNA microarrays:

1. S. Zellmer, W. Schmidt-Heck, A. Bauer, C. Meyer, T. Lehmann, T. Sparna, P. Godoy, P. Amin, W. Schormann, E. Bedawy, S. Hammad, C. Kern, C. Kreutz, J. Timmer, G. Walz, F. von Weizsäcker, P. Thürmann, S. Dooley, I. Merfort, R. Guthke, J. Hengstler, R. Gebhardt (2010). The transcription factors ETF, E2F and SP-1 are involved in cytokine-independent proliferation of murine hepatocytes. HEPATOLOGY, 52(6): 2127-2136.

2. S. Lassmann, C. Kreutz, A. Schoepflin, U. Hopt, J. Timmer and M. Werner (2009). A novel approach for reliable microarray analysis of microdissected tumor cells from formalin-fixed and paraffin-embedded colorectal cancer resection specimens. JOURNAL OF MOLECULAR MEDICINE, 87: 211-224.

3. B. Rumberger, C. Kreutz, Ch. Nickel, M. Klein, S. Lagoutte, S. Teschner, J. Timmer, P. Gerke, G. Walz, J. Donauer (2009). Combination of Immunosuppressive Drugs Leaves Specific Fingerprint on Gene Expression in-vitro. IMMUNOPHARMACOLOGY AND IMMUNOTOXICOLOGY, 1-10.

4. K. Bartholomé, C. Kreutz, and J. Timmer (2009). Estimation of Gene Induction Enables a Relevance-based Ranking of Gene Sets. JOURNAL OF COMPUTATIONAL BIOLOGY, 16: 959-967.

5. B. Rumberger, O. Vonend, C. Kreutz, J. Wilpert, J. Donauer, K. Amann, R. Rohrbach, J. Timmer, G. Walz, P. Gerke (2007). cDNA microarray analysis of adaptive changes after renal ablation in a sclerosis-resistant mouse strain. KIDNEY & BLOOD PRESSURE RESEARCH, 30(6): 377-387.

6. M. Lindenmeyer, C. Kern, T. Sparna, J. Donauer, J. Wilpert, J. Schwager, D. Porath, C. Kreutz, J. Timmer, I. Merfort (2007). Microarray analysis reveals influence of the sesquiterpene lactone parthenolide on gene transcription profiles in human epithelial cells. LIFE SCIENCES, 80(17): 1608-1618.

7. D. Pfeifer, M. Pantic, I. Skatulla, J. Rawluk, C. Kreutz, U. Martens, P. Fisch, J. Timmer, H. Veelken (2007). Genome-wide analysis of DNA copy number changes and LOH in CLL using high-density SNP arrays. BLOOD, 109(3): 1202-1210.

8. G. Schieren, B. Rumberger, M. Klein, C. Kreutz, J. Wilpert, M. Geyer, D. Faller, J. Timmer, I. Quack, L. Rump, G. Walz, J. Donauer (2006). Gene profiling of polycystic kidneys. NEPHROLOGY DIALYSIS TRANSPLANTATION, 21(7): 1816-1824.

9. X. Fang, M. Zeisel, J. Wilpert, B. Gissler, R. Thimme, C. Kreutz, T. Maiwald, J. Timmer, W. Kern, J. Donauer, M. Geyer, G. Walz, E. Depla, F. von Weizsäcker, H. Blum, T. Baumert (2006). Host cell responses induced by hepatitis C virus binding. HEPATOLOGY, 43(6): 1326-1336.

10. P. Goerttler, C. Kreutz, J. Donauer, D. Faller, T. Maiwald, E. März, B. Rumberger, T. Sparna, A. Schmitt-Gräff, J. Wilpert, J. Timmer, G. Walz, H.L. Pahl (2005). Gene expression profiling in polycythaemia vera: overexpression of transcription factor NF-E2. BRITISH JOURNAL OF HAEMATOLOGY, 129(1): 138-150.

Western blotting:

11. C. Kreutz, M. M. Bartolome-Rodriguez, T. Maiwald, M. Seidl, H. E. Blum, L. Mohr, J. Timmer (2007). An error model for protein quantification. BIOINFORMATICS, 23(20): 2747-2753.

12. M. Schilling, T. Maiwald, S. Bohl, M. Kollmann, C. Kreutz, J. Timmer, U. Klingmüller (2005). Quantitative data generation for systems biology: the impact of randomisation, calibrators and normalisers. IEE PROCEEDINGS SYSTEMS BIOLOGY, 152(4): 193-200.

13. M. Schilling, T. Maiwald, S. Bohl, M. Kollmann, C. Kreutz, J. Timmer, U. Klingmüller (2005). Computational processing and error reduction strategies for standardized quantitative data in biological networks. FEBS JOURNAL, 272(24): 6400-6411.

Experimental design:

14. C. Kreutz and J. Timmer (2009). Systems Biology: Experimental Design. FEBS JOURNAL, 276: 923-942.

15. U. Klingmüller, A. Bauer, S. Bohl, P. Nickel, K. Breitkopf, S. Dooley, S. Zellmer, C. Kern, I. Merfort, T. Sparna, J. Donauer, G. Walz, M. Geyer, C. Kreutz, M. Hermes, F. Götschel, A. Hecht, D. Walter, Egger, K. Neubert, C. Borner, M. Brulport, W. Schormann, C. Sauer, F. Baumann, R. Preiss, S. MacNelly, P. Godoy, E. Wiercinska, L. Ciuclan, J. Edelmann, K. Zeilinger, M. Heinrich, U. Zanger, R. Gebhardt, T. Maiwald, R. Heinrich, J. Timmer, F. von Weizsäcker, J. Hengstler (2006). Primary mouse hepatocytes for systems biology approaches: a standardized in vitro system for modelling of signal transduction pathways. IEE PROCEEDINGS SYSTEMS BIOLOGY, 153(6): 433-447.

16. T. Maiwald, C. Kreutz, A. Pfeifer, S. Bohl, U. Klingmüller, J. Timmer (2007). Dynamic pathway modeling: Feasibility analysis and optimal experimental design. ANNALS OF THE NEW YORK ACADEMY OF SCIENCES, 1115: 212-220.

Model identification:

17. A. Raue, C. Kreutz, T. Maiwald, U. Klingmüller, J. Timmer (2011). Addressing Parameter Identifiability by Model-Based Experimentation. IET SYSTEMS BIOLOGY, 5(2): 120-130.

18. A. Raue, C. Kreutz, T. Maiwald, J. Bachmann, M. Schilling, U. Klingmüller, J. Timmer (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. BIOINFORMATICS, 25(15): 1923-1929.

19. S. Hengl, C. Kreutz, J. Timmer, T. Maiwald (2007). Data-based identifiability analysis of non-linear dynamical models. BIOINFORMATICS, 23(19): 2612-2618.

Dynamic modeling applications:

20. J. Bachmann, A. Raue, M. Schilling, M. Bohm, C. Kreutz, D. Kaschek, H. Busch, N. Gretz, W. Lehmann, J. Timmer, U. Klingmüller (2011). Division of labor by dual feedback regulators controls JAK2/STAT5 signaling over broad ligand range. MOLECULAR SYSTEMS BIOLOGY, 7: 516-531.

21. M. Schilling, T. Maiwald, S. Hengl, D. Winter, C. Kreutz, W. Kolch, W. Lehmann, J. Timmer, U. Klingmüller (2009). Theoretical and experimental analysis links isoform-specific ERK signalling to cell fate decisions. MOLECULAR SYSTEMS BIOLOGY, 5: 334-352.

Submitted publications:

22. C. Kreutz, A. Raue, J. Timmer. Likelihood based observability analysis and confidence intervals for predictions of dynamic models. SUBMITTED. Preprint version: http://arxiv.org/abs/1107.0013.

23. C. Kreutz, J.S. Gehring, D. Lang, R. Reski, J. Timmer, S.A. Rensing. TSSi - An R package for transcription start site identification from 5' mRNA tag data. SUBMITTED.

24. K. Sa Ferreira, C. Kreutz, S. MacNelly, K. Neubert, A. Haber, M. Bogyo, J. Timmer, C. Borner. Caspase-3 feeds back on caspase-8, Bid and XIAP in type I Fas signaling in primary mouse hepatocytes. SUBMITTED.

25. A. Raue, C. Kreutz, F. Theis, J. Timmer. Joining Forces of Bayesian and Frequentist Methodology: A Study for Inference in the Presence of Non-Identifiability. SUBMITTED.

Publicly available software:

1. C. Kreutz, J.S. Gehring, D. Lang, R. Reski, J. Timmer, S.A. Rensing (2011). TSSi: Transcription Start Sites Identification from 5' mRNA tag data. Bioconductor R package.

2. J.S. Gehring, C. Kreutz, J. Timmer (2009). les: Identifying Differential Effects in Tiling Microarray Data. Bioconductor R package.

3. J.S. Gehring, K. Bartholome, C. Kreutz, J. Timmer (2007). GSRI: Gene Set Regulation Index. Bioconductor R package.

Awards:

1. Team crux: C. Kreutz, A. Raue, B. Steiert, J. Timmer. Dialogue for Reverse Engineering Assessments and Methods (DREAM6): Parameter Estimation and Experimental Design Challenge. Best performing participants, 2011.

Contents

Introduction

1 Gene expression and DNA chips
   1.1 Gene expression
   1.2 DNA microarrays
   1.3 Data heterogeneity and processing of two color microarrays
   1.4 Estimation and testing of differential expression
   1.5 High dimensionality and multiple testing
   1.6 Gene expression after partial hepatectomy
       1.6.1 Biological background
       1.6.2 Data analysis
       1.6.3 Results
       1.6.4 Conclusions
   1.7 Ranking of genes: significance vs. relevance
       1.7.1 Assessment of the performance of rankings
       1.7.2 Simulation study
       1.7.3 Results
       1.7.4 Discussion
       1.7.5 Conclusions
   1.8 Confidence across methods and technologies
       1.8.1 Introduction
       1.8.2 Data and methodology
       1.8.3 Results
       1.8.4 Conclusions
   1.9 Applicability of microarrays to fixed tissues
       1.9.1 Introduction
       1.9.2 Hierarchical clustering
       1.9.3 Data processing
       1.9.4 Results
       1.9.5 Discussion and Summary
   1.10 Identification of housekeeping genes
       1.10.1 Data
       1.10.2 Existing approaches
       1.10.3 Methodological considerations
       1.10.4 Identification strategy
       1.10.5 Validation experiment
       1.10.6 Discussion
       1.10.7 Summary
   1.11 Summary

2 An error model for immunoblot data
   2.1 Introduction
   2.2 Immunoblotting
   2.3 Experimental data
   2.4 Additive vs. multiplicative noise
   2.5 Mixed effects models
   2.6 Assessing required effects
   2.7 Results
       2.7.1 Background correction and response variables
       2.7.2 Simulation study
       2.7.3 Error model selection for housekeeping proteins
       2.7.4 Validation of the housekeeping assumption
       2.7.5 Application to the time course measurements
       2.7.6 Model selection for the superior error model
   2.8 Matrix notation
   2.9 Conclusions

3 Modeling of flow cytometry data
   3.1 Flow Cytometry
   3.2 Methodology
       3.2.1 2D-Analysis
       3.2.2 1D-Analysis
       3.2.3 Quality control
       3.2.4 Further analyses
   3.3 Results
       3.3.1 Sensitivity to the data processing strategies
       3.3.2 Estimation bias
       3.3.3 Dynamic model for insulin binding
   3.4 Discussion
   3.5 Summary
   3.6 Conclusions

4 Experimental design in systems biology
   4.1 Introduction
   4.2 The design problem
       4.2.1 The mathematical models
       4.2.2 External perturbations
       4.2.3 Measurement times
       4.2.4 Observables
       4.2.5 Experimental constraints
       4.2.6 Prior knowledge
   4.3 Determination of optimal designs
       4.3.1 Experimental design for parameter estimation
       4.3.2 Experimental design for model discrimination
   4.4 Illustrative examples
   4.5 Limitations
   4.6 Sampling strategies
   4.7 Confounding
   4.8 Conclusions

5 Observability analysis and confidence intervals for model predictions
   5.1 Introduction
   5.2 Methodology
       5.2.1 The prediction profile likelihood
       5.2.2 The validation profile likelihood
       5.2.3 Re-parametrization
       5.2.4 Profile likelihood threshold
       5.2.5 Comparison of prediction and validation confidence intervals
       5.2.6 Prior information
       5.2.7 Availability and implementation
   5.3 Results
       5.3.1 Small illustration model
       5.3.2 MAP kinase signaling model
   5.4 Discussion
   5.5 Summary

Summary

Acknowledgement

Danksagung

Bibliography

Introduction

It is a common phenomenon in nature that complex macroscopic behavior emerges from basic microscopic principles. A classical example is the variety of physical phenomena arising from the four elementary forces, or the diversity of chemical compounds and substances which results from basic electromagnetic interactions between atoms and molecules. In cell biology, the complexity and variety arise from profoundly coupled elementary biochemical reactions in the cells. The dynamics of these molecular interactions is in most cases sufficiently described by an elementary law, namely the law of mass action, which leads to a coupled system of first-order ordinary differential equations for the kinetics of cellular compounds.

How can the dynamic complexity of life arise from such simple molecular interactions? How do cells and organisms become robust against a wide range of perturbations from the environment? How can cells with the same genome show the variety of morphologies, behaviors, and responses that is realized, for example, in higher organisms? How could such complexity of biochemical processes develop during evolution? These fundamental questions, as well as the insight that many severe diseases originate from dysfunctions of biochemical processes, drive the research in molecular biology.

Remarkable insights about the molecular interactions at the inter- and intracellular level governing cellular processes have been achieved in this field in the last decades. For many organisms, almost all molecular compounds of the cells have been identified. However, at the same time it has been realized that knowing the constituents alone is not sufficient to understand and predict cellular behavior. The reason is that substantial properties emerge from the precisely tuned dynamic interplay of cellular constituents, resulting in complex molecular interaction networks.

For some cell systems, a rather complete view of the molecular interactions governing the major processes has been accomplished. For other systems, especially for higher organisms and for in vivo conditions, biologists are often still far away from a complete understanding. Here, the molecular mechanisms in the cells are often so complex that, despite remarkable insights obtained for specific conditions, it is difficult to assess whether the current knowledge already constitutes a rather complete picture for a wide range of physiological conditions, or whether the view gained under specific experimental settings is only the tip of the iceberg and several essential processes governing the cellular behavior are still undiscovered.

The classical experimental approach of identifying one molecular interaction after the other by perturbing single compounds has turned out to be rather inefficient for deciphering biochemical interaction networks. Instead, it is essential to integrate a wide variety of experimental conditions and observations and to compare them precisely with predictions in order to validate or refine the current state of knowledge. This demand is addressed by systems biology, a rather young discipline merging molecular biology and the theoretical sciences. Systems biology intends to turn relationships between molecular compounds, which are often only available at the level of correlations, into mechanistic understanding in terms of mathematical models of the causal interactions.
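As a concrete illustration of how the law of mass action turns a reaction scheme into such a coupled ODE system, the following minimal sketch integrates the hypothetical reaction A + B -> C with rate constant k; the reaction, the rate constant, and all numerical values are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch: mass-action kinetics of the illustrative reaction
# A + B -> C with rate constant k, integrated with SciPy.
import numpy as np
from scipy.integrate import solve_ivp

k = 0.5  # hypothetical rate constant

def rhs(t, x):
    a, b, c = x
    v = k * a * b          # law of mass action: rate = k * [A] * [B]
    return [-v, -v, v]     # coupled first-order ODE system

sol = solve_ivp(rhs, (0.0, 10.0), y0=[1.0, 0.8, 0.0],
                t_eval=np.linspace(0.0, 10.0, 50))
print(sol.y[:, -1])        # concentrations of A, B, C at t = 10
```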


This procedure allows the inference of the dynamics of reaction networks. The models are also utilized to predict cellular behavior, to test hypotheses by validation experiments, and to interpret modules of reaction networks as elementary systemic components like switches, filters, amplifiers, or fluctuation suppressors.

During the last years, rapid progress has been made in the development of new experimental techniques allowing for improved quantitative analyses of cellular constituents. Nowadays, high-throughput techniques have been established that allow simultaneous monitoring of up to thousands of different molecules if some restrictions in terms of experimental conditions are fulfilled. Weakly abundant molecular compounds can be targeted, visualized, or even amplified by specific labeling techniques. Sometimes, quantification is feasible at the level of single cells or even for intracellular compartments. The improved temporal and spatial resolution promoted the applicability of mathematical modeling in molecular biology and the emergence of systems biology.

However, despite the achieved technological progress, restrictions remain, and the systems investigated in molecular biology are still quite difficult to access experimentally, at least in comparison to other scientific fields. Most experimental techniques are only applicable under restricted conditions, e.g. only for cell cultures or for microorganisms which respond more specifically to experimental manipulations. Moreover, it is often impossible to study a target of interest in isolation from its environment. Cells of an organ, as an example, behave in many respects completely differently from a cell culture. Another example is receptor proteins, which cannot be studied adequately in vitro because realistic in vivo behavior of the receptors requires an intact intracellular molecular environment as well as interactions with the cell membrane.

A further challenge in molecular biology is the circumstance that there are no exact copies of the experimental units, e.g. of the cells or the individuals. The biological micro-environment cannot be controlled by the experimenter, although it can have a remarkable impact on the experimental outcomes. Therefore, in contrast to experimental physics or other more technical scientific disciplines, experimental conditions in the life sciences can only be controlled very restrictively. For complex cellular systems, it is difficult or even impossible to repeatedly obtain equal, or even nearly comparable, experimental conditions. Heterogeneities of the experimental samples caused by a different history and environment of the system of interest have to be regarded as biological noise. This constitutes a challenge for experimental planning as well as for the statistical approaches used for the analyses of the experimental data to draw valid conclusions. In addition, the applied experimental techniques suffer from a relatively large amount of observational noise, e.g. in comparison to experiments in physics. This technical noise, in combination with biological variability, is often in a similar range as the investigated effects of interest. This demands sufficient experimental repetitions, but also statistical approaches that account for the heterogeneity of the collected data.
The heterogeneity of the experimental data in molecular biology often occurs in an application-specific manner, requiring adaptations and extensions of established methods for data processing, data analysis, and modeling to appropriately account for all sources of noise.


In this thesis, comprehensive methods for analyzing data from three biochemical techniques have been established and applied: one- and two-color DNA microarrays (Chapter 1), Western and immunoblotting (Chapter 2), and flow cytometry (Chapter 3).

Up to now, modeling in systems biology has predominantly focused on the biological processes of interest, i.e. on a proper translation of the existing knowledge into spatial or kinetic models. However, one insight of this work is that for statistical inference, not only a model of the underlying processes is needed. Primarily, models of the data are required, comprising the underlying processes as well as the biological variability between the evaluated samples and the observational noise of the measurement techniques. This demands a new methodology for modeling, but also for optimal experimental design, which is established in Chapter 4 by linking the classical design considerations for dynamic systems with the existing statistical methodology applied in biostatistics, e.g. for designing clinical studies.

In comparison to classical biostatistics, a major difference of the statistical methodology required for systems biology is that the models are predominantly not phenomenological. Instead, mechanistic models are applied, where the major components of the model are given by real processes like biochemical reactions. Such molecular interactions can be translated, e.g. by the rate equation approach, into systems of ordinary differential equations. Appropriately processed measurements are then utilized to identify the mechanisms or model structures and their parameters which are in sufficient agreement with the experimental observations. Statistically, this is approached by model discrimination and parameter estimation methods. A resulting mechanistic model allows for conclusions very efficiently. However, statistical inference for such models is usually challenging because the models are typically highly nonlinear and characterized by high-dimensional parameter spaces, causing new issues like parameter non-identifiability. In Chapter 5, approaches for the investigation of such non-identifiabilities, as well as for calculating appropriate confidence intervals for model parameters and model predictions, have been developed. This methodology is very general and is therefore also applicable in other scientific disciplines.


1 Gene expression and DNA chips

The focus of this chapter is the statistical analysis of gene expression data measured by DNA microarrays. This measurement technique allows a parallel quantification of the transcription of thousands of genes. On the one hand, this mass of data improves the identification of systematic errors and offers possibilities to adjust for them. On the other hand, new methodological issues arise, like appropriate preprocessing, multiple testing, or efficient and unbiased estimation of effects.

The DNA contains the bulk of the inheritable information about a cell's components and properties. However, knowledge of the DNA sequence is by no means sufficient to understand, or even predict, the behavior of a cell. Cells of the same organism contain the same DNA. Nevertheless, a variety of different cell types occurs in higher organisms, e.g. the cells from different organs, which show a wide spectrum of very different properties and functions. This variety in the cells' behavior is a consequence of different activities of the genomic loci, i.e. of the frequency with which coding regions of the DNA are read. DNA microarrays constitute an experimental technique for the quantification of gene expression, i.e. the activity of the genes, for almost the complete genome.

Major technical advances for microarrays were made in the middle of the 1990s. Until 1995, fewer than a hundred scientific manuscripts based on gene expression microarrays had been published. In 2000, there were more than three thousand, and in 2005 even more than 24,000. Because of this relevance for biomedical research, and because of the promising amount of experimental data, the statistical analysis of gene expression data has been studied extensively within the last ten years. Statisticians were challenged by the high dimensionality of the experimental data and by the variety of strategies for their interpretation.

This chapter constitutes an introduction to the microarray technique and the basic data processing strategies. Then, some methodological aspects like multiple testing and the identification of target genes on the basis of linear statistical models are discussed. Three applications are presented, namely the identification of the genes regulated during liver regeneration, the identification of housekeeping genes for the investigation of signal-transduction-induced gene regulation, and the evaluation of the applicability of microarrays to formalin-fixed paraffin-embedded tissues. In addition, the bias-variance trade-off in the identification of target genes as well as the sensitivity of the outcomes to the preprocessing have been studied.


Figure 1.1: Gene expression refers to the process of information flow from the DNA sequence to the functional units, in most cases the proteins. In a first step, the so-called transcription, the DNA is transcribed into mRNA. In the second step, the translation, the mRNA sequences are translated into proteins. Transcription is regulated, e.g. by transcription factors or DNA folding; translation can be regulated by the number of ribosomes or, for example, by inhibitory so-called small interfering RNA (siRNA) molecules.

1.1 Gene expression

The functioning and behavior of a cell is mainly determined by the abundances of intracellular proteins. Proteins are polypeptides, i.e. linear chains of amino acids, which are multiply folded. The sequence information of the proteins is coded in the DNA. The process of protein production from the sequence information in the DNA is summarized in Figure 1.1. The initial step, namely reading and transcribing the DNA sequence into a corresponding RNA sequence, is called transcription. RNA molecules which encode protein sequences are called messenger RNAs or mRNAs. Messenger RNA is transported to the ribosomes and there copied again into a protein sequence. This step is called translation. The rate of transcription, or more precisely, the mRNA concentration at a certain point in time, is called the gene expression.

Cells of the same organism have the same genome, i.e. an identical DNA sequence.¹ The cell type and the cellular state are mainly determined by the available protein amounts, which are in turn a consequence of the level of the genes' transcription. This level of transcription is regulated by a variety of processes. Transcription factors that bind to the DNA are necessary to initiate transcription. Cofactor molecules can enhance or inhibit transcription factor binding and activity. The binding rate of the transcription factors can be further regulated by DNA folding, by methylation of the DNA, or by the chromatin structure. Synthesized RNA molecules can be inhibited by small interfering RNA (siRNA) molecules or regulated via degradation.

Most physiological functions of a cell are controlled via regulation of transcription. Hormones, as an example, can bind to receptors at the cell membrane and thereby activate signaling cascades which enhance and/or inhibit the transcription of certain genes.

¹ There are exceptions to this general statement, e.g. genetic polymorphisms, i.e. natural variations of specific genomic regions; polyploidy, i.e. more than two pairs of chromosomes; or mutations [KELADA et al 2003].



Figure 1.2: The basic principle of the DNA-chip technology. RNA is reverse-transcribed into stable cDNA. After labeling with a fluorophore, the cDNA is hybridized to complementary sequences located on the chip. The fluorescence intensity measured at one site on the chip, where a specific DNA sequence is spotted, corresponds to the concentration of the complementary cDNA and the respective RNA.

Also, during the cell cycle, i.e. the process of cell division, the transcriptional level of a majority of the genes is regulated. Similarly, circadian rhythms are induced at the molecular level via changes of the gene expression levels. Since transcription is closely coupled to many cellular processes, the comparison of gene expression between types of cells or between different experimental conditions constitutes a valuable tool for the investigation of the molecular mechanisms in living cells. In addition, dysfunctions of the regulation of gene expression are causes of many severe diseases. Thus, the identification of genes that are differentially expressed between patients and healthy controls can be utilized for diagnosis and possibly for the identification of the biochemical malfunctions leading to the disease.

1.2 DNA microarrays

DNA microarrays or DNA chips allow for a high-throughput quantification of RNA abundances. A DNA microarray consists of up to more than 50,000 single-stranded DNA nucleotide sequences, captured as spots on a coated glass plate (see Figure 1.2). For measuring the RNA abundance in a certain sample, the RNA molecules are isolated and reverse-transcribed into complementary cDNA sequences. These reverse-transcribed sequences are labeled with fluorophores [DO & CHOI 2007] and are then hybridized to the spotted complementary sequences on the chip. The measured intensity of a spot correlates with the concentration of the corresponding RNA sequence in the sample.



The microarray technology allows for an efficient simultaneous quantification of the gene expression of almost the whole genome of an organism. A restriction is the fact that only the expression averaged over many cells can be quantified. In addition, a single cell preparation can only be evaluated once. Therefore, different experimental conditions cannot be compared for the same biological sample.

1.3 Data heterogeneity and processing of two color microarrays

Preprocessing denotes the step from the raw measurements, e.g. the scanned intensities, to the data which are used for the statistical analyses addressing the questions of interest. In the following section, the commonly applied preprocessing steps for data from spotted two-color DNA chips are summarized.

There are two major, technically different microarray approaches, which also differ in the data processing procedure. For spotted or two-color microarrays, the capture DNA is spotted onto a glass slide. Panel (A) in Figure 1.3 shows a section of a spotted DNA chip. To account for different spot sizes, i.e. different amounts of capture DNA, a standard cDNA sample of constant concentration, labeled with another fluorescent dye, is hybridized as a reference together with the sample of interest. Then, the relative data, i.e. fluorescence intensity ratios of the probe over the reference sample, are analyzed [CHEN et al 1997].

The most commonly used fluorescent dyes are the two cyanines Cy3 and Cy5. Cy3 has an emission wavelength of around 570 nm, i.e. in the green range of the spectrum. The corresponding emission of Cy5 is red, at a wavelength of around 670 nm. The raw outcomes of a single microarray experiment are two scanned images of the slide, one for each emission wavelength. The two bitmaps of pixel intensities are processed by an image processing software to automatically detect and characterize the spots of interest. For each spot, averages of foreground and background intensities, as well as characteristics of the size and shape of the spot, are determined as illustrated in Figure 1.3, panel (B). In addition, the homogeneity of the foreground intensities within a spot is evaluated, e.g. as a coefficient of variation.

Two types of systematic errors are observed for expression data measured with the two-color microarray technique [GORYACHEV et al 2001]. The first bias originates from a different stability of Cy3 and Cy5, and from unequal affinities of mRNA molecules which are labeled with distinct dyes. Due to these characteristics, it is commonly observed that spots with larger overall intensities show systematically larger Cy5 intensities than spots with small intensities. This bias is illustrated in panel (C) of Figure 1.3. Here, the log-ratios log2(Cy5/Cy3) of both intensity channels are plotted against log2(Cy3 · Cy5). This so-called Bland-Altman or MA-plot constitutes a 45-degree rotated scatterplot of the log-intensities of Cy3 against Cy5. A within-slide normalization procedure is applied after background subtraction to adjust for this bias.



Figure 1.3: Panel (A) shows a section of a scan of a two-color microarray. Here, three blocks are shown. The displayed color is the superposition of the Cy3 and Cy5 intensity channels. Panel (B) illustrates the determination of the foreground and background intensities for a single spot. In panel (C), the bias of both fluorophores is shown and the within-slide normalization is illustrated. In panel (D), boxplots of the intensities measured on different microarrays are displayed before and after a between-chip normalization. Here, quantile normalization is applied, yielding identical distributions.

Usually, the bias is estimated and subtracted by the locally weighted scatterplot smoothing (LOWESS) algorithm [BERGER et al 2004, BOLSTAD et al 2003, CLEVELAND 1979].

In addition to the bias due to the different fluorophores, systematic shifts between measurements obtained on different chips are observed. In panel (D) of Figure 1.3, boxplots of intensities originating from different slides are displayed. Because thousands of features are measured in parallel, it is expected for a common experimental setup that only a minority of genes is differentially expressed. Therefore, similar intensity distributions on all chips are expected. The comparability of different slides can be ensured by an additional between-slide normalization step, as illustrated in Figure 1.3, panel (D). Here, the data could be standardized to the same means and variances or, alternatively, monotonically transformed to obtain the same quantiles of the data distributions of the analyzed microarrays. This strategy is called quantile normalization. Note that the between-slide normalization depends on all analyzed microarrays. Adding or removing a single microarray sample from an analysis has an impact on all processed data points.
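The following sketch illustrates both normalization steps on simulated intensities. The LOWESS fit (via statsmodels) and the simple quantile normalization shown here are generic illustrations under assumed data, not the exact implementation used for the data analyzed in this thesis.

```python
# Sketch of within-slide (LOWESS on the MA-plot) and between-slide
# (quantile) normalization for simulated two-color intensities.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
cy3 = rng.lognormal(mean=7.0, sigma=1.0, size=2000)
cy5 = cy3 * rng.lognormal(mean=0.2, sigma=0.3, size=2000)  # dye bias

# MA coordinates: M = log-ratio, A = overall log-intensity
M = np.log2(cy5 / cy3)
A = np.log2(cy3 * cy5)

# Within-slide normalization: estimate the intensity-dependent bias
# by LOWESS and subtract it from the log-ratios.
fit = lowess(M, A, frac=0.3)                    # returns sorted (A, fit)
M_norm = M - np.interp(A, fit[:, 0], fit[:, 1])

# Between-slide quantile normalization: map each array (column) onto
# the mean quantile profile so that all distributions become identical.
def quantile_normalize(X):
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

arrays = rng.lognormal(mean=7.0, sigma=1.0, size=(2000, 4))
arrays_norm = quantile_normalize(arrays)
```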



Figure 1.4: Panel (A) shows the major steps of the preprocessing of gene expression data measured by the two-color microarray technique. In panel (B), the major decisions for a processing strategy are summarized.

This fact can be utilized for experimental design considerations, i.e. for choosing the sample size large enough that removing a chip has an acceptably small impact.

Besides the two major systematic errors, namely the between-chip variability and the bias generated by the two different fluorophores, spatial noise within a slide also occurs. A special type of such spatial noise originates from the printing process of the spots' capture DNA on the glass slide. For the custom-made cDNA chips used in the Core Facility Genomics in Freiburg, 16 needles are used in parallel to spot the capture DNA of 16 different DNA sequences. After printing one set of 16 capture DNAs, the needles are cleaned and used to pick up the next capture DNA sequences. Spots which are printed by the same needle are located in one block of spots on the slide (see Figure 1.3, panel (A)). Spatially correlated noise emerges if the printing quality of the needles differs and/or the needles are not perfectly washed between consecutive spottings. Such spatial noise is partially eliminated by a local background subtraction. The within-slide normalization as illustrated in panel (C) can be applied for each block independently to account for a potential bias between the spots of different needles.



In addition, the amount of spatially correlated noise is examined as an additional criterion to decide whether the quality of the data fulfills the requirements. In the case of strong spatial noise, a chip is removed from further analyses. Other characteristics for the evaluation of the data quality are the global ratio of foreground over background intensities, the magnitude of the bias between both fluorophores, and the RNA integrity number (RIN), which is a measure of the extent of RNA degradation [SCHRÖDER et al 2006]. Because there are no well-established thresholds for the mentioned data quality criteria, two strategies are pursued in our analyses: a stringent and a non-stringent quality filtering strategy. The sensitivity of the outcomes to both strategies is compared to ensure that the interpretation of the data is not affected by the stringency of the quality control.

In addition to the chip quality, the data quality of individual spots is assessed by the spot size, the spot shape, the within-spot intensity variation, and the measured absolute intensities. The absolute intensities have to be evaluated to ensure that the measurements differ sufficiently from the background and are not affected by saturation effects. Suspicious spots are removed from the analyses if the stringent analysis strategy is applied.

In panel (A) of Figure 1.4, the preprocessing steps of two-color microarray data are summarized. Panel (B) shows alternative data preprocessing strategies. The impact of the choice of the data preprocessing strategy on the outcomes of the statistical analyses is discussed in Section 1.8.
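A minimal sketch of such a spot-level quality filter is given below; the coefficient-of-variation and foreground-to-background thresholds are illustrative assumptions, not the values used in the actual analyses.

```python
# Sketch of a spot-level quality filter: flag spots whose within-spot
# coefficient of variation is too large or whose foreground does not
# sufficiently exceed the local background. Thresholds are illustrative.
import numpy as np

def spot_ok(pixel_intensities, background, cv_max=0.5, fg_bg_ratio_min=1.5):
    fg = np.mean(pixel_intensities)
    cv = np.std(pixel_intensities) / fg      # within-spot heterogeneity
    return (cv <= cv_max) and (fg >= fg_bg_ratio_min * background)

rng = np.random.default_rng(1)
pixels = rng.normal(loc=800.0, scale=120.0, size=60)  # one spot's pixels
print(spot_ok(pixels, background=350.0))              # -> True
```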

1.4 Estimation and testing of differential expression

This section provides a model-based introduction to classical statistical tests like the t-test or ANOVA. The described methodology is applicable in the absence of systematic errors, which is only valid for microarray data after preprocessing.

In a basic experimental setup, microarray data would be generated under several experimental conditions to identify genes which are regulated, i.e. differentially expressed under the examined conditions. Estimates of the expression differences and their significance determined by statistical tests are used to identify these genes of interest. In the case of normally distributed noise, the difference of the means is the maximum likelihood estimate, and the t-test is the corresponding uniformly most powerful (UMP) unbiased test to check whether the difference is significantly different from zero [NEYMAN & PEARSON 1936, 1938]. Uniformly most powerful means that the power to detect a violation of the null hypothesis is maximal, independent of the desired significance level.

The t-test can only be applied to pairs of experimental conditions. In more complex experimental setups with more than two conditions or more than one factorial predictor variable, a more general formulation is required. In a linear model with more than one predictor variable, a measurement response is a sum of several influences. Thereby, the effects of the different predictor variables on the responses have to be estimated from experimental observations. These tasks are accomplished by statistical models which provide a general mathematical formulation and are applicable to a wide range of issues.



The central limit theorem of statistics states that the distribution of a sum

$$\varepsilon = \sum_m \rho_m \tag{1.1}$$

of independent, identically distributed sources of noise $\rho_m$ converges to a normal distribution $\varepsilon \sim N(\mu, \sigma^2)$ for $m \to \infty$. This theorem holds independently of the specific form of the distribution of the $\rho_m$; it is only required that the distribution has a finite expectation value and variance. Because of the central limit theorem, the observational noise $\varepsilon$ is very commonly assumed to be normally distributed or Gaussian. Additive noise yields measurements

$$y_{ij} = x_i + \varepsilon_{ij}\,, \qquad \varepsilon_{ij} \sim N(0, \sigma^2) \tag{1.2}$$

where $x_i$ denotes the true values given the experimental conditions $i = 1, \ldots, n$. The index $j = 1, \ldots, N_i$ enumerates the $N_i$ experimental evaluations of condition $i$. Here, experimental condition is used as a general term to distinguish data points which cannot be considered as repetitions of the same experimental setup. The term experimental condition comprises, for example, the choice of the biological entity, the applied treatment, or the points in time of the measurement. The model (1.2) is linear with respect to its parameters and can therefore be rewritten as

$$\vec{y} = X\vec{\theta} \tag{1.3}$$

in matrix notation, where the $N$ data points $\vec{y} \in M(N \times 1)$ are reordered as a vector. The matrix elements $(X)_{nk} \in \{0, 1\}$ of the design matrix $X \in M(N \times n_\theta)$ indicate whether parameter $\theta_k$, $k = 1, \ldots, n_\theta$, has an impact on the $n$th data point. In the case of Gaussian noise, maximum likelihood estimation is equivalent to least squares [HONERKAMP 2002]. Least squares estimates of linear parameters are obtained via multiplication

$$\hat{\vec{\theta}} = X^\dagger \vec{y} \tag{1.4}$$

of the data $\vec{y}$ with the pseudo- or generalized inverse

$$X^\dagger := \left(X^\top X\right)^{-1} X^\top \in M(n_\theta \times N) \tag{1.5}$$

of the design matrix $X$. The parameter estimates are normally distributed,

$$\hat{\vec{\theta}} \sim N\!\left(\vec{\theta}, \Pi\right) \tag{1.6}$$

with covariance matrix

$$\Pi = \sigma^2 \left(X^\dagger X^{\dagger\top}\right)^{\!\top} = \sigma^2 \left(X^\top X\right)^{-1}. \tag{1.7}$$
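A numerical sketch of equations (1.3)-(1.7) for a hypothetical design with two conditions and three replicates each, assuming a known noise level $\sigma$ (all numbers are illustrative):

```python
# Sketch of equations (1.3)-(1.7): least squares via the pseudo-inverse
# of a 0/1 design matrix for two conditions with three replicates each.
import numpy as np

sigma = 0.3
theta_true = np.array([1.0, 2.5])            # condition means

# Design matrix X: rows = data points, columns = parameters
X = np.array([[1, 0], [1, 0], [1, 0],
              [0, 1], [0, 1], [0, 1]], dtype=float)

rng = np.random.default_rng(2)
y = X @ theta_true + rng.normal(0.0, sigma, size=6)   # eq. (1.2)/(1.3)

X_dagger = np.linalg.inv(X.T @ X) @ X.T               # eq. (1.5)
theta_hat = X_dagger @ y                              # eq. (1.4)

Pi = sigma**2 * np.linalg.inv(X.T @ X)                # eq. (1.7)
se = np.sqrt(np.diag(Pi))                             # standard errors
print(theta_hat, se)                                  # se = sigma/sqrt(3)
```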

It is convenient to account for the amount of observational noise by rescaling the measurements

$$y'_n := \frac{y_n}{\sigma} \tag{1.8}$$

as well as the model predictions

$$K\vec{\theta} := \frac{X\vec{\theta}}{\sigma}\,, \tag{1.9}$$

where the matrix elements of $K$ are given by

$$(K)_{nk} = \frac{(X)_{nk}}{\sigma}\,. \tag{1.10}$$

Then, equivalently to equation (1.4), least squares parameter estimates are obtained by multiplication of the rescaled data

$$\hat{\vec{\theta}} = K^\dagger \vec{y}\,' \tag{1.11}$$

with the pseudo-inverse of $K$. The covariance matrix

$$\Pi = \left(K^\dagger K^{\dagger\top}\right)^{\!\top} = \left(K^\top K\right)^{-1} \in M(n_\theta \times n_\theta) \tag{1.12}$$

of the parameter estimates $\hat{\theta}_k$ yields pointwise confidence intervals or standard errors

$$SE(\hat{\theta}_k) = \sqrt{\Pi_{kk}}\,. \tag{1.13}$$

In the heteroscedastic case, i.e. the case of a varying amount of observational noise, e.g. $\sigma \to \sigma_n$, only the rescaling equations (1.8) and (1.10) change, whereas equations (1.11) and (1.12) remain unchanged. The standard error

$$SE(\hat{\theta}_k) \propto \sigma \tag{1.14}$$

is proportional to the square root of the sampling variance for linear models. If a set of $N$ measurements is repeated $n$ times, the standard error decreases,

$$SE(\hat{\theta}_k) \propto \frac{1}{\sqrt{n}}\,, \tag{1.15}$$

proportionally to one over the square root of the number of repetitions. For known $\sigma^2$, the standard score or z-score

$$z = \frac{\hat{\theta}_k}{SE(\hat{\theta}_k)} \tag{1.16}$$

is normally distributed. In a z-test, this fact is utilized to test $\hat{\theta}$ against zero.

In contrast, if the variance $\sigma^2$ is estimated from the same data, the ratio is t-distributed,

$$\frac{\hat{\theta}_k}{\widehat{SE}(\hat{\theta}_k)} \sim t(df) \tag{1.17}$$

with $df = N - n_\theta$ degrees of freedom. In correspondence to a t-test, the cumulative density function $\mathrm{cdf}_{t(df)}\!\left(\hat{\theta}_k / \widehat{SE}(\hat{\theta}_k)\right)$ of the t-distribution is evaluated to test whether $\theta_k$ is significantly different from zero. The p-value for the one-sided test of the null hypothesis that $\theta_k$ equals zero against the alternative hypothesis $\theta_k > 0$ is

$$p = 1 - \mathrm{cdf}_{t(df)}\!\left(\frac{\hat{\theta}_k}{\widehat{SE}(\hat{\theta}_k)}\right). \tag{1.18}$$
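Before turning to the two-sided case below, a numerical sketch of equations (1.17) and (1.18) for a hypothetical two-condition comparison, cross-checked against SciPy's pooled two-sample t-test:

```python
# Sketch of equations (1.17)-(1.18): t-statistic and one-sided p-value
# for the difference between two conditions, checked against SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(1.0, 0.3, size=6)
b = rng.normal(2.0, 0.3, size=6)

theta_hat = b.mean() - a.mean()              # estimated difference
N, n_theta = 12, 2                           # data points, parameters
df = N - n_theta                             # eq. (1.17)

# Pooled variance estimate and standard error of the difference
s2 = (np.sum((a - a.mean())**2) + np.sum((b - b.mean())**2)) / df
se = np.sqrt(s2 * (1/6 + 1/6))

p_one_sided = 1.0 - stats.t.cdf(theta_hat / se, df)   # eq. (1.18)

# Consistency check: SciPy's two-sample t-test (two-sided p-value)
t_scipy, p_two_sided = stats.ttest_ind(b, a)
print(p_one_sided, p_two_sided / 2)          # agree for theta_hat > 0
```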


For the alternative hypothesis $\theta_k \neq 0$, i.e. for the two-sided test, the extremes of the lower and upper quantiles are evaluated to obtain a p-value

$$p = 2 \cdot \min\!\left[\mathrm{cdf}_{t(df)}\!\left(\frac{\hat{\theta}_k}{\widehat{SE}(\hat{\theta}_k)}\right),\; 1 - \mathrm{cdf}_{t(df)}\!\left(\frac{\hat{\theta}_k}{\widehat{SE}(\hat{\theta}_k)}\right)\right]. \tag{1.19}$$

After fitting a linear model like equation (1.3), the within-group sum of squares, i.e. the sum

$$SS_W = \sum_{i=i_1,\ldots,i_{m_W}} \sum_{n=1}^{N_i} \left(\frac{y_n - (X)_{nk}\hat{\theta}_k}{\sigma}\right)^{\!2} \tag{1.20}$$

of squared standardized residuals, is $\chi^2$ distributed with $N - m_W$ degrees of freedom. Similarly, the between-group sum of squares, i.e. the sum

$$SS_B := \sum_{i=i_1,\ldots,i_{m_B}} \left(\frac{\hat{\theta}_i - \mathrm{mean}_{j=i_1,\ldots,i_{m_B}}\,\hat{\theta}_j}{SE(\hat{\theta}_i)}\right)^{\!2} \tag{1.21}$$

over $m_B$ standardized group means, is $\chi^2$ distributed with $m_B - 1$ degrees of freedom if $\theta_{k_1} = \cdots = \theta_{k_m} = \mathrm{const}$. The ratio

$$\frac{\frac{1}{m_B - 1}\, SS_B}{\frac{1}{N - m_W}\, SS_W} \sim F(m_B - 1,\, N - m_W) \tag{1.22}$$

of both corresponding mean squares is $F(m_B - 1, N - m_W)$ distributed with $m_B - 1$ and $N - m_W$ degrees of freedom. Because the sampling variance $\sigma^2$ cancels out in equation (1.22),

$$\frac{\frac{1}{m_B - 1}\, SS_B}{\frac{1}{N - m_W}\, SS_W} = \frac{\frac{1}{m_B - 1} \displaystyle\sum_{i=i_1,\ldots,i_{m_B}} \left(\frac{\hat{\theta}_i - \mathrm{mean}_{j=i_1,\ldots,i_{m_B}}\,\hat{\theta}_j}{1/\sqrt{N_i}}\right)^{\!2}}{\frac{1}{N - m_W} \displaystyle\sum_{k=k'_1,\ldots,k'_{m_W}} \sum_{n=1}^{N_k} \left(y_n - (X)_{nk}\hat{\theta}_k\right)^{\!2}} \sim F(m_B - 1,\, N - m_W)\,, \tag{1.23}$$

this ratio can also be used in the case of an unknown sampling variance. The numerator of equation (1.22), i.e. the sum (1.21), can be restricted to any subset of groups to test a subset of parameters of interest. It is therefore utilized to test several parameters against a constant. Note that the distributional statements in equations (1.17) and (1.22) are exact, i.e. they hold for any number of repetitions and do not require asymptotic assumptions. Asymptotically, i.e. for a large number of repetitions, both equations (1.17) and (1.23) become equivalent to the $\chi^2$ statistic used in a likelihood ratio test.

The introduced least squares estimation is equivalent to maximum likelihood estimation in the case of Gaussian noise. If the noise is not normally distributed, e.g. in the presence of outliers, a nonparametric procedure could be applied. As a generally applicable approach, a rank transformation of the data would be applied as an initial step. Then, the same least squares approach is applicable to the rank-transformed data to obtain results which are robust against outliers [CONOVER 1971]. This procedure is called nonparametric because no parametric assumption about the distribution of the noise is made.
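The following sketch illustrates the F-ratio of equation (1.22) numerically for the balanced one-way case, where the number of fitted group means equals $m_B = m_W$; the simulated data are illustrative, and the result is cross-checked against scipy.stats.f_oneway:

```python
# Sketch of equation (1.22): one-way ANOVA F-ratio computed from the
# between- and within-group sums of squares, checked against SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(mu, 0.5, size=5) for mu in (1.0, 1.2, 2.0)]

m_B = len(groups)                       # number of group means tested
N = sum(len(g) for g in groups)         # total number of data points
grand = np.mean(np.concatenate(groups))

ss_between = sum(len(g) * (g.mean() - grand)**2 for g in groups)
ss_within = sum(np.sum((g - g.mean())**2) for g in groups)

F = (ss_between / (m_B - 1)) / (ss_within / (N - m_B))
p = 1.0 - stats.f.cdf(F, m_B - 1, N - m_B)

F_scipy, p_scipy = stats.f_oneway(*groups)
print(F, F_scipy)                       # identical up to rounding
```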



The corresponding tests are the Mann-Whitney U-test [BAUER 1972] for testing equality of a response under two conditions, and the Kruskal-Wallis test [HOLLANDER & WOLFE 1973] for testing equality under more than two conditions.

Because fluorescence intensities are non-negative and usually mainly affected by multiplicative noise, the statistical computations are often done on the logarithmic intensity scale [QUACKENBUSH 2002]. The log-transformation of the data leads to an approximately symmetric noise distribution for microarray data [HUBER et al 2005]. As an established standard, the logarithm of base two is used. In this case, an effect size equal to one after the transformation corresponds to a change by a factor of two in the underlying mRNA concentration.
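A small numerical illustration of the log2 scale mentioned above, using simulated intensities with multiplicative noise (all values are illustrative):

```python
# Illustration: on the log2 scale a twofold change in mRNA concentration
# corresponds to an effect size of one, and multiplicative noise becomes
# additive and approximately symmetric.
import numpy as np

rng = np.random.default_rng(5)
control = 500.0 * rng.lognormal(0.0, 0.2, size=10000)   # multiplicative noise
treated = 1000.0 * rng.lognormal(0.0, 0.2, size=10000)  # twofold regulation

effect = np.log2(treated).mean() - np.log2(control).mean()
print(round(effect, 2))        # close to 1.0 = log2(2)

# On the log scale the noise is Gaussian, hence symmetric
print(round(np.median(np.log2(control)) - np.log2(500.0), 2))  # close to 0
```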

1.5 High dimensionality and multiple testing

The large number of features measured with microarrays is often considered a disadvantage of microarray data because the adjustment of the p-values to "account for multiple testing" decreases the significance of any outcome. In this section, I try to clarify the consequences of measuring and testing many thousands of features for statistical reasoning.

The term multiple testing classically refers to the situation where a single hypothesis is tested repeatedly. Here, the claim would already be dismissed in the case of a single significant outcome, i.e. if only one out of all tests significantly rejects the null hypothesis. However, because the chance of obtaining a single significant test increases with the number of applied tests even if the null hypothesis is fulfilled, the number of false positive rejections also increases with the number of tests. The probability of at least a single significant result in a series or family of tests, given that $H_0$ is fulfilled, is called the familywise error rate (FWER). The familywise error rate can be controlled either by a more stringent significance level for the individual tests, or by a corresponding adjustment of the p-values [WRIGHT 1992]. If the familywise error rate is controlled, the adjusted p-values $p'$ have to satisfy

$$\mathrm{Prob}\!\left(\min(p') \leq \alpha\right) = \alpha\,. \tag{1.24}$$

In this way, the probability of at least one significant test is equal to a predefined significance level $\alpha$. This is the classical adjustment of p-values to account for multiple testing of a single hypothesis. Procedures for controlling the familywise error rate have been proposed by BONFERRONI [1935], HOCHBERG [1988], HOLM [1979] and HOMMEL [1988].

For biomarker detection, however, usually not the classical setting of multiple testing is realized. Here, typically not many genes are used to test a single hypothesis; rather, many similar or equivalent null hypotheses

$H_0^{\text{Gene A}}$: Gene A is constantly expressed
$H_0^{\text{Gene B}}$: Gene B is constantly expressed
$H_0^{\text{Gene C}}$: Gene C is constantly expressed
...



are tested simultaneously for many genes. The distinction is essential because in such a setting, there is no need to adjust p-values for valid statistical reasoning. Here, the familywise error rates are identical to the p-values because every hypothesis is only tested once. Consequently, the false positive rate, i.e. the proportion of false positive predictions, is still equal to the significance level $\alpha$ which is used to reject the tested hypotheses.

However, simultaneous statistical testing of a large number of features leads to an increasing absolute number of false positive predictions. As an example, for a commonly used significance level of $\alpha = 0.01$, a false positive rate of one percent is expected. Because the microarray technique allows for simultaneous testing of up to fifty thousand genes, a false positive rate of one percent corresponds to up to five hundred false positive predictions, even if the null hypothesis is fulfilled for all genes. Therefore, in exploratory studies with the objective of discovering interesting genes instead of validating previously stated hypotheses, it is meaningful to adjust the p-values to control the false discovery rate (FDR). The false discovery rate is defined as the proportion of falsely rejected hypotheses out of all rejected hypotheses. Because a uniform distribution of the p-values is expected under the null hypothesis, the false discovery rate can be estimated from the deviation of the empirical distribution of the p-values from a uniform distribution [BARTHOLOME et al 2009, STOREY & TIBSHIRANI 2003]. The false discovery rate can be controlled by adjusting the p-values according to

$$\frac{\mathrm{Prob}\!\left(p'_{FDR} \leq \alpha \,\wedge\, H_0 \text{ true}\right)}{\mathrm{Prob}\!\left(p'_{FDR} \leq \alpha\right)} = \alpha\,. \tag{1.25}$$

Then, the ratio of the number of falsely rejected hypotheses over all rejected hypotheses is equal to the significance level. Reviews of the techniques for controlling the false discovery rate are given in [BROBERG 2005, DUDOIT et al 2003].

For microarray data, the classical situation of multiple testing, which requires control of the familywise error rate, would be a test of the null hypothesis H0: There is no change in the gene expression. In a corresponding example, stated in Figure 1.5 (A), three biomarkers yielding equivalent information about liver damage are evaluated to test a single hypothesis, namely whether the state of the liver is normal. This is a classical situation where an adjustment for multiple testing is required. The same holds if the patient in the example were to repeat the analysis of the blood in several laboratories.

For exploratory studies, e.g. the second example (B), it is often intended to control the false discovery rate. In the example, athletes are probed for illegal doping without reasonable suspicion. A similar situation is very common in exploratory gene expression studies. Here, the goal is the identification of regulated genes without the use of prior knowledge. This also demands controlling the false discovery rate.

In contrast, if prior to a microarray study there is already a predefined candidate gene for which a hypothesis is intended to be tested, then no adjustment of the p-value would be required if the gene is tested on the basis of microarray data. In this case, it does not matter that other genes are measured in parallel. Example (C) in Figure 1.5 shows an analogous situation.
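A sketch of both adjustment strategies on a vector of raw p-values: the Bonferroni correction controls the familywise error rate, and the Benjamini-Hochberg step-up procedure is one standard way to control the false discovery rate (it is consistent with, but not explicitly named in, the text):

```python
# Sketch: Bonferroni (FWER) and Benjamini-Hochberg (FDR) adjustments
# applied to a vector of raw p-values.
import numpy as np

def bonferroni(p):
    return np.minimum(p * len(p), 1.0)

def benjamini_hochberg(p):
    n = len(p)
    order = np.argsort(p)
    adjusted = np.empty(n)
    # step-up: enforce monotonicity from the largest p-value downwards
    running_min = 1.0
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

p = np.array([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
print(bonferroni(p))
print(benjamini_hochberg(p))   # e.g. 0.039 -> 0.039*6/3 = 0.078 before
                               # monotonicity enforcement
```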



(A) Repeated testing of a single hypothesis: the familywise error rate should be controlled. A person with a healed liver disease wants to know whether there is still some liver damage. Three standard markers indicating the death of liver cells are examined. In addition, the patient initiates several repeated examinations to be very sure. In both situations, reasonable conclusions require adjustment for multiple testing, i.e. controlling the familywise error rate.

(B) Many distinct hypotheses are tested in parallel: the false discovery rate should be controlled. The blood of a sportsman is probed in a laboratory, e.g. as a regular check for doping. Several different markers are examined to identify any potential abnormality. In this exploratory situation, the false discovery rate could be controlled. Further examinations would then be initiated if markers are strongly suspicious.

(C) Besides a hypothesis of interest, additional tests are performed in parallel: no adjustment for multiple testing is required. The blood of a patient suffering from type II diabetes is examined to determine the glucose level. If a standard laboratory examination of the blood involves further biomarker evaluations which are specific for other diseases, no adjustment for multiple testing is required for the glucose test.

Figure 1.5: Examples illustrating the choice of an appropriate procedure to control the number of false positive predictions.

is a reasonable suspicion for a patient of having a disease, then a p-value for the corresponding biomarker would not be adjusted for multiple testing, even if other ancillary markers are routinely evaluated in parallel. The risk of drawing invalid conclusions and the necessity of adjustments for multiple testing is sometimes interpreted as a disadvantage of high-throughput techniques. However, the challenges in the interpretation of the data are related to the exploratory nature of the studies rather than to the high-throughput measurement technique itself. If standard "low-throughput" techniques such as quantitative reverse transcription polymerase chain reaction (RT-PCR) do not have a superior precision in terms of experimental noise, then nothing argues against a corresponding high-throughput technique like microarrays. They provide much more complete experimental information about the gene expression, i.e. about the realized biological noise. The large number of features allows for a reliable control of the false discovery rate, which is impossible for classical techniques. In addition, more advanced preprocessing strategies become feasible as the amount of measurements increases. High-throughput data allow for outlier and artifact detection and enhance the possibilities for the adjustment of systematic errors. Low-throughput techniques require a restriction of the experimental focus before the experiment. Then, any analysis is limited to this setting and there is no possibility to adjust for the effects of other genes. For high-throughput techniques, prior knowledge about the genes of interest can be utilized in the same way, but without this limitation.


In the case of a restricted set of genes of interest, the gene expression data of the uninteresting genes can provide essential information about the biological sample which has been evaluated. If, as an example, it turns out that the previously stated hypothesis is sensitive to circadian rhythms, then genes related to circadian processes could be evaluated to determine whether the samples differed by chance. In such a situation, the data of these genes could be utilized to adjust for such unwanted effects. An analogous argument holds for any disturbing effect which is related to transcription. For these reasons, a high-throughput technique like microarrays seems preferable. Here, it is possible to extend the scope of the results by conditioning on the expression of other genes.

1.6 Gene expression after partial hepatectomy

In this section, microarray gene expression data is analyzed to discover key regulatory genes of the regeneration of the murine liver after partial hepatectomy. It illustrates estimation and testing of effects in a multifactorial setting. In addition, this data set is utilized for methodological considerations in Section 1.8.

1.6.1 Biological background

The liver is the organ which is mainly responsible for the detoxification of the body from harmful substances. Therefore, the liver has to be robust against damage of the cells or even of parts of the tissue. Due to this physiological function, the liver shows a unique potential to regenerate after injury. As an example, after a two-thirds hepatectomy, i.e. after removing two thirds of the liver via resection, the mass of the liver in mice regenerates to its original value within around three days. The original histology and function is restored within eight to ten days [PAHLAVAN et al 2006, HSIEH et al 2009]. The goal of the study presented in this section is the identification of the genes which are mainly involved in this regeneration process. In addition to the common hepatocyte-driven regeneration, there is another mode of liver regeneration which is induced if the mature hepatocytes are unable to proliferate. In this case, oval cells, a certain type of stem cells, migrate into the liver, proliferate, and differentiate into hepatocytes. The second goal of the study was the investigation of the differences between hepatocyte- and oval cell-driven regeneration at the gene expression level. For both types of regeneration processes, seven different points in time have been evaluated, namely before hepatectomy and after 1, 2, 3, 4, 6 and 8 days (see Figure 1.6, panel (A)). The gene expression was measured with custom made 22k two-color cDNA microarrays produced in the Core Facility Genomics in Freiburg. For each experimental condition, three mice have been examined, yielding 42 samples in total. The oval cell-driven regeneration is induced by a treatment with acetylaminofluorene (AAF), which blocks the regular proliferation of hepatocytes. Note that each cell preparation, i.e. each biological sample, can be evaluated only once.


1.6.2 Data analysis

The hybridized chips have been scanned with an Axon 4000A scanner and processed by the GenePix 3.0 image processing software to detect the spots and quantify their fluorescence. Then, the median background was subtracted from the foreground intensities for each spot. The median of the pixel-wise intensity ratios of both dyes has been used as the measurement response. Within-slide normalization has been performed using the locally weighted scatterplot smoothing (LOWESS) [YANG et al 2002] algorithm to adjust for the bias between the green and the red fluorophores. Between-slide normalization has been performed by a linear scaling to obtain the same mean and standard deviation of the logarithmic intensity ratios within each slide. The impact of the choice of the preprocessing strategy on the outcomes is examined in detail in Section 1.8. After preprocessing, each gene is analyzed separately. The effects of the experimental conditions on the expression of a gene are estimated using the linear model

$$y_{stj} = G + S_s + T_t + A_t + \varepsilon_{stj}\,, \qquad \varepsilon_{stj} \sim N(0, \sigma^2) \qquad (1.26)$$

with the gene expression response $y_{stj}$ after treatment s at the t-th point in time in the j-th experimental repetition. The parameter G constitutes the offset of the expression of the analyzed gene, $S_s$ is the effect of the treatment s = 1 (none) or s = 2 (AAF). The parameters $T_t$, $t \in \{1, \ldots, 7\}$, account for the time dependency of the gene expression without AAF treatment. Similarly, $A_t$ represents the interaction between time and treatment, i.e. the AAF-specific effect on the expression over time. The experimental design does not allow one to discriminate the gene expression offset G, the effect of the standard treatment $S_1$, and the time effects $T_1$ and $A_1$ at the first evaluated point in time. This over-parameterization is resolved by defining the offset G as the expression log-ratio of the analyzed gene under the standard treatment at the first measured time point. Then, the other confounding parameters $S_1$ and $T_1$ are zero and can be removed from the model. Also, $A_1$ is set to zero. Thereby, parameter $S_2$ is the treatment effect for t = 1. Altogether, fourteen effects have been estimated, corresponding to the fourteen different experimental conditions. The interpretation of the parameters is depicted in panel (B) of Figure 1.6. The model (1.26) is linear with respect to the parameters. If the parameters are redefined as a vector

$$\vec\theta := (G, S_2, T_2, T_3, T_4, T_5, T_6, T_7, A_2, A_3, A_4, A_5, A_6, A_7)^\top\,, \qquad (1.27)$$

matrix notation yields

$$\vec y = X \vec\theta + \vec\varepsilon \qquad (1.28)$$

where $y_{stj}$ and $\varepsilon_{stj}$ have also been reordered as vectors $\vec y$ and $\vec\varepsilon$, respectively. The matrix elements $(X)_{ni} \in \{0, 1\}$ of the design matrix X indicate whether the i-th parameter has an effect on the n-th data point. Because the time dependency of the gene expression, namely the difference to the first time point, is of primary interest for both treatments, the model parameters are not orthogonal. For the chosen parameterization, a noise realization $\varepsilon_{stj}$ has an impact on more than a single parameter, leading to a non-diagonal covariance matrix of the parameter estimates.
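To make the structure of the design matrix concrete, the following sketch builds X for the parameterization (1.27) with two treatments, seven time points and three replicates, and computes the least-squares estimates. It is a toy illustration under the coding described above (interaction columns active only for the AAF-treated samples); the variable names and the random response are not part of the original analysis:

```python
import numpy as np

# Sketch: design matrix X for model (1.26) with the parameterization (1.27).
# Two treatments (s = 1: none, s = 2: AAF), seven time points, three replicates.
ntime, nrep, npar = 7, 3, 14
rng = np.random.default_rng(1)

rows, y = [], []
for s in (1, 2):
    for t in range(1, ntime + 1):
        for _ in range(nrep):
            x = np.zeros(npar)
            x[0] = 1                    # offset G (theta_1)
            if s == 2:
                x[1] = 1                # treatment effect S_2 (theta_2)
            if t > 1:
                x[2 + (t - 2)] = 1      # time effect T_t (theta_3 .. theta_8)
                if s == 2:
                    x[8 + (t - 2)] = 1  # interaction A_t (theta_9 .. theta_14)
            rows.append(x)
            y.append(rng.normal())      # toy response; in reality: log-ratios

X = np.array(rows)                      # 42 x 14 matrix with entries in {0, 1}
theta_hat, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)

# Covariance of the estimates (up to the factor sigma^2) is non-diagonal,
# reflecting the non-orthogonal parameterization discussed above.
cov_up_to_sigma2 = np.linalg.inv(X.T @ X)
```

Inspecting the off-diagonal entries of this matrix makes the correlation of the parameter estimates mentioned above directly visible.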


Figure 1.6: Panel (A) shows the experimental conditions examined in this project. Seven different times have been evaluated for the hepatocyte-driven liver regeneration as well as for the stem cell-driven regeneration which is induced by treatment with AAF. Three replicates have been generated for each condition. Panel (B) shows the parametrization of the experimental conditions. Besides a global offset parameter θ1, a treatment effect θ2 (blue color) affecting all conditions with AAF treatment, time effects representing the treatment-independent induction of expression during the regeneration (red color), as well as interaction effects between the treatment and the time dependency (green color) are estimated. As indicated by the red arrows, the parameters θ3, . . . , θ8 are defined as the difference in the expression to the first point in time. Similarly, θ9, . . . , θ14 are defined as the interactions between time and treatment effects relative to the first point in time.

The alternative orthogonal parameterization

$$y_{stj} = \theta_{st} + \varepsilon_{stj} \qquad (1.29)$$

would not allow for a direct estimation of the differential expressions of interest. Note that the choice of the parameterization has no influence on the statistical test of the hypothesis that the expression of a gene is constant over time for a given treatment. Equation (1.4) has been utilized to estimate the parameters, e.g. the fold change $\hat{T}_2$ between the first two points in time. The corresponding confidence intervals, based on standard errors such as $SE(\hat{T}_2)$, are obtained by equations (1.12) and (1.13). The parameter estimates $\hat{T}_2$ and $\hat{A}_2$ have been used to identify the so-called immediate early genes which are regulated at the beginning of the regeneration process. Equation (1.22) has been utilized to test

$$\theta_3 = \cdots = \theta_8 = 0 \;\Longleftrightarrow\; T_2 = \cdots = T_7 = 0\,, \qquad (1.30)$$

i.e. whether there is a significant regulation over time, or to test

$$\theta_9 = \cdots = \theta_{14} = 0 \;\Longleftrightarrow\; A_2 = \cdots = A_7 = 0\,, \qquad (1.31)$$

i.e. whether there is a significant interaction of time and treatment effects. Both tests discriminate between the full model (1.26) and the nested models

$$y_{stj} = G + S_s + A_t + \varepsilon_{stj} \qquad (1.32)$$

and

$$y_{stj} = G + S_s + T_t + \varepsilon_{stj}\,, \qquad (1.33)$$

respectively. In addition, the mean squares

$$\mathrm{MSQ}_T = \frac{1}{6} \sum_{t=1}^{7} \left( \hat{T}_t - \operatorname{mean}_{j=1,\ldots,7}(\hat{T}_j) \right)^2 \qquad (1.34)$$

of the time effects and the corresponding p-values $p_T$, as well as the mean squares

$$\mathrm{MSQ}_A = \frac{1}{6} \sum_{t=1}^{7} \left( \hat{A}_t - \operatorname{mean}_{j=1,\ldots,7}(\hat{A}_j) \right)^2 \qquad (1.35)$$

of the interaction effects and the corresponding p-values $p_A$, have been determined. Note that $\hat{T}_1$ and $\hat{A}_1$ are zero due to the non-identifiabilities. All p-values have been adjusted to control the false discovery rate (FDR) by the method of Benjamini-Hochberg [BENJAMINI & HOCHBERG 1995] as introduced in Section 1.5.
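The tests (1.30) and (1.31) are standard F-tests comparing the full model (1.26) against the nested models (1.32) and (1.33). As an illustration of the generic computation (a sketch under standard linear-model assumptions, not the thesis code; the test statistic itself, equation (1.22), is defined earlier in the chapter), one may write:

```python
import numpy as np
from scipy import stats

def nested_f_test(y, X_full, X_nested):
    """F-test of a full linear model against a nested submodel.

    Tests the hypothesis that the parameters absent from the nested
    model are all zero, e.g. hypothesis (1.30) when the six
    time-effect columns are removed from the design matrix.
    """
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid

    rss_full, rss_nested = rss(X_full), rss(X_nested)
    df1 = X_full.shape[1] - X_nested.shape[1]   # number of tested parameters
    df2 = len(y) - X_full.shape[1]              # residual degrees of freedom
    F = ((rss_nested - rss_full) / df1) / (rss_full / df2)
    return F, stats.f.sf(F, df1, df2)           # F-statistic and p-value
```

Applied gene by gene, the resulting p-values are then FDR-adjusted as described above.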

1.6.3 Results

Table 1.1 shows the number of genes with an FDR-adjusted p-value smaller than 0.05 for all parameters as well as for all time and interaction effects. After the adjustment, no gene has been found to be significant at this level for the interaction effects $A_2, \ldots, A_7$, nor for the time effects $T_4$ and $T_7$. In contrast, a large number of genes is significantly induced by the AAF treatment and a large number of genes showed significant regulation over the whole time interval. The histograms in panel (A) of Figure 1.7 and the corresponding empirical cumulative distribution functions of the p-values confirm this outcome. The distribution is clearly shifted towards zero for the p-values of the time effect T (red color) and the AAF effect S (green color). In contrast, the p-values $p_A$ for the interaction are almost uniformly distributed. This is the expected outcome if the tested null hypothesis of no interaction is fulfilled for all genes.


Effect                        Parameter   Number of genes with $p_{FDR} \leq 0.05$
Time                          $\mathrm{MSQ}_T$      1348
Treatment                     $\hat{S}$             2099
Time-treatment interaction    $\mathrm{MSQ}_A$         0

Effect         Parameter        Number of genes with $p_{FDR} \leq 0.05$
$\hat{T}_2$    $\theta_3$       421
$\hat{T}_3$    $\theta_4$        27
$\hat{T}_4$    $\theta_5$         0
$\hat{T}_5$    $\theta_6$         4
$\hat{T}_6$    $\theta_7$         1
$\hat{T}_7$    $\theta_8$         0
$\hat{A}_2$    $\theta_9$         0
$\hat{A}_3$    $\theta_{10}$      0
$\hat{A}_4$    $\theta_{11}$      0
$\hat{A}_5$    $\theta_{12}$      0
$\hat{A}_6$    $\theta_{13}$      0
$\hat{A}_7$    $\theta_{14}$      0

Table 1.1: The numbers of significant genes after adjustment to control the false discovery rate (FDR). 1348 genes showed a significant regulation over time and even more genes, namely 2099, showed a significant treatment effect. No gene has been found with $p_{FDR} \leq 0.05$ for the time-treatment interaction. After one day, more than four hundred genes are significantly induced, i.e. have a significant fold change $\hat{T}_2$. After the second day, almost no gene is significant for the chosen significance level $p_{FDR} \leq 0.05$. Similarly, no gene has a significant interaction effect $A_2, \ldots, A_7$.

In panel (B), the ranking of the genes on the basis of the p-values is compared with the corresponding ranking on the basis of the effect size estimates, namely on the basis of the treatment effect $\hat{S}_2$ and the mean squares $\mathrm{MSQ}_T$ and $\mathrm{MSQ}_A$. The concordance is quantified by the number of commonly identified genes relative to the number of considered top-ranking genes. The outcomes of the two identification approaches differ strongly. For a hundred identified genes, only an agreement of around 20% has been found. The sensitivity of the outcome to the identification approach is further examined in Sections 1.7 and 1.8. Table 1.2 shows ten candidate genes identified for each comparison of interest. In part (A), the genes with the largest regulation over time according to the mean squares $\mathrm{MSQ}_T$ are denoted. Part (B) shows the corresponding genes with the most significant regulation over time. Parts (C) and (D) display the genes with the largest interaction effects and with the most significant interactions, respectively. However, none of these genes is significant after the adjustment to control the FDR. In addition, the most strongly up- and down-regulated genes at the beginning of the regeneration process are denoted in parts (E) and (F), respectively, as well as the most significantly early regulated genes (G). Lipocalin 2 (Lcn2) has been found as the most prominently upregulated gene after partial hepatectomy. It is found as the most regulated gene at the beginning as well as over the whole examined time interval. This gene is also among the most significant candidates according to $p_{T_2}$ and has the largest, however insignificant, time-treatment interaction. Lipocalin 2 is known to participate in the mammalian innate immune response of the liver [LIU & NILSEN-HAMILTON 1995, SUNIL et al 2007].


[Table 1.2 occupies a full page in the original; its numeric columns are not recoverable from the extracted text. For each selection criterion, the table lists the GenBank IDs and gene symbols of the ten top-ranking genes together with the statistics $\mathrm{MSQ}_T$, $p_T$, $p_T^{FDR}$, $\mathrm{MSQ}_A$, $p_A$, $p_A^{FDR}$, $\hat{T}_2$, $SE(\hat{T}_2)$, $p_{T_2}$ and $p_{T_2}^{FDR}$; the row groups are (A) regulated over time, (B) most significantly regulated over time, (C) largest interaction effects, (D) most significant interaction, (E) initially up-regulated, (F) initially down-regulated, and (G) most significantly regulated initially.]

Table 1.2: The regulated genes after partial hepatectomy. Ten identified genes are displayed for maximal regulation over time on the basis of mean squares (A) and p-values (B), as well as for the AAF-specific regulation over time in (C) and (D), respectively. Because there are only very small time-treatment interaction effects, the genes with the largest interaction effects are not significant according to the FDR-adjusted p-values. In addition, the early up- (E) and down-regulated genes (F) on the basis of the fold change estimates, and on the basis of the p-values (G), are displayed. The criteria according to which the genes are selected are highlighted in bold face type.


Figure 1.7: Panel (A) shows histograms of the p-values as well as the cumulative distribution functions for the time and the treatment effects, as well as for the time-treatment interaction, namely the AAF-specific regulation over time. The shift of the p-values towards zero for the time and treatment effects indicates a strongly significant regulation. In contrast, almost no AAF-specific regulation is observed. Panel (B) shows the agreement of the lists of target genes on the basis of the significance and on the basis of the absolute effect size estimates. For a hundred identified genes, the overlap is below 20%. For one thousand genes, the concordance is between 30% and 40%.

Its production is induced via Toll-like receptors (TLR) during immune responses [FLO et al 2004]. Lipocalin 2 has already been described as an early marker for liver damage [BYKOV et al 2007]. Metallothionein 1 (Mt1) and Metallothionein 2 (Mt2) are also found to be strongly regulated at the beginning as well as over the whole time interval. Metallothioneins are capable of binding both physiological and xenobiotic metals, e.g. zinc and copper, or cadmium and silver, respectively. It is known that regenerating hepatocytes require large amounts of zinc within a short time [CHERIAN & KANG 2006]. The Metallothioneins as well as Lipocalin 2 are known to be induced by Interleukin-6 (IL6) [DE et al 1990, HERNANDEZ et al 2000], which is a major regulator during liver regeneration [PAHLAVAN et al 2006]. The role of Metallothionein in liver regeneration is also discussed in [CAIN & GRIFFITHS 1984, MARGELI et al 1994, WEBB 1987].


1.6.4 Conclusions

In this section, the gene expression data obtained after partial hepatectomy has been analyzed using a two-factorial linear model allowing for a separate estimation of the treatment effects, the time effects, and the treatment-time interaction. Many genes are regulated over time and many genes have been found to be regulated due to the AAF treatment. However, almost no interaction of time and treatment effects has been observed. The major regeneration-specific gene regulation seems to occur within the first two to three days; after three days, almost no gene is significant according to a false discovery rate of 0.05. In this setting, the discrimination between a general treatment effect and the interaction with the time dependency is absolutely essential. Otherwise, e.g. if the expression data for the hepatocyte- and the oval cell-driven regeneration were directly compared at a certain point in time, primarily treatment-specific genes would be identified which are not related to the oval cell-specific regeneration. In our experimental setup, the population-average expression of all cells in the liver tissue is examined. Oval cells constitute only a minor fraction of the examined cells, less than 10%. Therefore, the regulation of the gene expression in the oval cells is diluted in the observations by the expression of the other cell types. This is one explanation for the fact that no oval cell-specific expression pattern has been found during the regeneration process. The most strongly regulated genes have been identified on the basis of the magnitude of the regulation and on the basis of the significance. The outcome, i.e. the ranking of the genes, is highly dependent on the utilized statistic, i.e. on whether the genes are ranked according to estimated effects or according to significance. A ranking of the genes according to effect size seems to produce biologically more reasonable results; the significance-based rankings yield almost only genes with a very small absolute induction. This point is addressed in the following Section 1.7. Some genes which have already been known to be related to liver damage, inflammation and proliferation could be identified as strongly regulated in our experimental setting. In this section, primarily the methodological issues have been discussed. A further biological interpretation, e.g. in terms of the underlying pathways or transcription factors which could be responsible for the regulation, is still a methodological challenge in computational biology. This issue is for example discussed in [DAS & DAI 2007].

For this project, I acknowledge my collaborators from the Core Facility Genomics of Prof. Gerd Walz in Freiburg. His group performed the experiments and generated the gene expression data.


1.7 Ranking of genes: significance vs. relevance

Identification and ranking of genes based on microarray data is usually performed according to the significance of statistical tests, i.e. according to calculated p-values. In this section, it will be shown that a ranking based on estimates of effect sizes is preferable in standard settings of microarray studies. This is a general outcome which is relevant for all tasks aiming at the identification of specific items out of a large number of possible candidates.

In early applications of the microarray technique in the 1990s, the ratios or fold changes of the genes' expression were measured under various experimental conditions and used as the primary experimental outcome, e.g. for the identification of regulated genes. Then, statisticians emphasized the necessity of experimental repetitions and the application of statistical tests to obtain significance statements. The identification of genes of interest was subsequently performed according to the level of significance, namely according to the obtained p-values. In 2005, the U.S. Food and Drug Administration (FDA) performed a large study called the Microarray Quality Control Project (MAQC) investigating the reproducibility of microarray data [SHI et al 2006, STRAUSS 2006]. One outcome of this study was that estimates of the effect size, i.e. rankings according to the fold change estimates, are more reproducible between different laboratories and between different technological platforms than rankings based on significance [GUO et al 2006]. This surprising result caused many discussions, because an appropriately chosen test statistic contains the condensed, complete information of the available experimental data and was therefore supposed to be superior to the fold change, which constitutes an estimate of only the first moment of the distribution of the measured differences. In this section, the results of a simulation study are presented showing the benefit of rankings based on effect size estimates. A major conclusion is that the dimension reduction of microarray data sets, which is a major step in various statistical approaches, can be improved if the reduction is not exclusively based on gene-wise significance. In another project, we could validate this conclusion: there, it was shown that the performance of classification approaches is improved if the feature selection is done according to the estimated amount of differential expression.

1.7.1 Assessment of the performance of rankings

A statistical procedure for the identification of differentially expressed genes can be interpreted as a classifier dividing the evaluated genes into "regulated" and "unregulated". If the truth is known, the performance of such a classifier can be assessed in terms of sensitivity and specificity. The receiver operating characteristic (ROC) curve is the dependency of the true positive rate, i.e. the sensitivity, on the false positive rate, i.e. one minus the specificity, for different numbers of identified genes. By chance, i.e. if a random classifier were applied, the sensitivity would increase proportionally to the false positive rate, resulting in a straight line with slope equal to one as the ROC curve. The area under the curve (AUC) is often used to summarize the ROC curve by a single number. A random classifier has an expected AUC of 0.5.
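As an illustration of how such a ROC curve and its AUC can be computed from a ranking statistic, consider the following sketch (the function name and the toy data are illustrative only, not part of the original analysis):

```python
import numpy as np

def roc_auc(score, truly_regulated):
    """ROC curve and AUC for a score separating regulated from unregulated genes."""
    order = np.argsort(score)[::-1]                   # most suspicious genes first
    labels = truly_regulated[order].astype(float)
    tpr = np.cumsum(labels) / labels.sum()            # sensitivity
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # 1 - specificity
    return fpr, tpr, np.trapz(tpr, fpr)

rng = np.random.default_rng(5)
truth = np.arange(1000) < 100                     # 10% truly regulated genes
score = truth + rng.normal(scale=1.0, size=1000)  # a noisy detection statistic
fpr, tpr, auc = roc_auc(score, truth)             # auc ~ 0.76 here; 0.5 by chance
```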


Instead of a binary classification into regulated and unregulated genes, gene lists are often desired to agree with the true underlying extent of regulation of the genes. Then, the performance of a statistic is preferably assessed by a comparison of the ranks $\vec R(\vec\Delta)$ of the genes according to their true differential expression $\vec\Delta$ with the ranking $\vec R(\vec T)$ obtained from a statistic $\vec T$. The incidence of the N top-ranking genes

$$\hat{G}_T(N) := \{\, i \mid R_i(\vec T) \geq n_{\text{genes}} - N \,\} \qquad (1.36)$$

with the most strongly regulated genes

$$G(N) := \{\, i \mid R_i(\vec\Delta) \geq n_{\text{genes}} - N \,\} \qquad (1.37)$$

can be assessed with the Dice coefficient [DICE 1945, OCH et al 2003]

$$d_T(N) = \frac{2\,|G(N) \cap \hat{G}_T(N)|}{|G(N)| + |\hat{G}_T(N)|}\,, \qquad (1.38)$$

i.e. the size of the intersection of the two gene sets relative to their average size. The coefficient $d_T(N) \in [0, 1]$ is equal to one if both gene lists comprise the same genes, i.e. $\hat{G}_T(N) \equiv G(N)$. The dependency of the Dice coefficient on the number of selected genes can be plotted to assess the performance of a statistic. This indicates the expected overlap of a gene list with the underlying truth. By chance, i.e. for a ranking based on a statistic without any relationship to the target, $d_T(N)$ is expected to be equal to $N/n_{\text{genes}}$. This yields a line with slope $1/n_{\text{genes}}$ and an AUC of $0.5 \cdot n_{\text{genes}}$. Note that unregulated genes have equal ranks; for equal values, it is common to assign average ranks. If a fraction $P \leq 1$ of the genes is assumed to be differentially expressed, the unregulated genes obtain the ranks

$$R_i(\vec\Delta) = \operatorname{mean}_{j=(1-P)\,n_{\text{genes}},\ldots,n_{\text{genes}}}(j)\,, \quad \forall i \in \{(1-P)\,n_{\text{genes}}, \ldots, n_{\text{genes}}\}\,. \qquad (1.39)$$

In this case, the ranks are not strictly monotonically increasing, and a random statistic therefore does not result in a straight line. In fact, a step

$$d_{\text{rand}}(N) = \begin{cases} \dfrac{N}{P\,n_{\text{genes}}} & \text{for } N < (1 - \tfrac{1}{2}P)\,n_{\text{genes}} \\[2mm] \dfrac{N}{P\,n_{\text{genes}}} + (1 - P) & \text{for } N \geq (1 - \tfrac{1}{2}P)\,n_{\text{genes}} \end{cases} \qquad (1.40)$$

is obtained for $d_{\text{rand}}(N)$ at $N = (1 - \tfrac{1}{2}P)\,n_{\text{genes}}$. To avoid such an unintended step, the expectation of the Dice coefficient is calculated for a random assignment of the ranks if equal values occur. Then the Dice coefficient still fulfills the intended property $d_{\text{rand}}(N) = N/n_{\text{genes}}$. The Dice coefficient is also applied in the following sections, to assess the incidence of rankings obtained by different data processing strategies in Section 1.8, and for the evaluation of the data quality for formalin-fixed paraffin-embedded tissues in Section 1.9.
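A direct implementation of the Dice coefficient (1.38) for the top-N genes of two rankings is straightforward. The following sketch (illustrative only; it does not implement the averaged tie handling described above) compares a noisy estimate against the true regulation:

```python
import numpy as np

def dice_top_n(stat_true, stat_test, n):
    """Dice coefficient (1.38) between the top-n gene sets of two statistics."""
    top_true = set(np.argsort(stat_true)[-n:])  # n most regulated genes, G(N)
    top_test = set(np.argsort(stat_test)[-n:])  # n top-ranking genes, G^_T(N)
    return 2 * len(top_true & top_test) / (len(top_true) + len(top_test))

rng = np.random.default_rng(2)
delta = np.concatenate([rng.uniform(0, 2, 100), np.zeros(900)])  # 10% regulated
estimate = delta + rng.normal(scale=0.5, size=1000)              # noisy estimate
print(dice_top_n(delta, estimate, 100))                          # < 1 due to noise
```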

1.7.2 Simulation study

A simulation study has been performed to compare the performance of rankings based on effect size estimates $\vec T_1 := \hat{\vec\Delta}$ with rankings based on the t-statistics $\vec T_2 := \vec t$. In the following, the ranking $\vec R(\hat{\vec\Delta})$ is called fold change ranking. The ranking $\vec R(\vec t)$ is equivalent to a ranking according to p-values calculated from the t-statistic and is therefore simply called p-value ranking. For the simulation, measurements

$$y_{igj} = \mu_{ig} + \varepsilon_{igj}\,, \quad i = 1, \ldots, n_{\text{genes}};\; g = 1, 2;\; j = 1, \ldots, N_j \qquad (1.41)$$

with Gaussian observational noise $\varepsilon_{igj} \sim N(0, \sigma_i^2)$ and gene dependent variance $\sigma_i^2$ are assumed. Because only changes

$$\Delta_i = \mu_{i2} - \mu_{i1} \qquad (1.42)$$

of the true underlying expression $\mu_{ig}$ between the two groups g = 1, 2 of samples are of interest, the true expression $\mu_{i1}$ in the first group of experiments is set to zero,

$$\mu_{i1} := 0\,, \;\forall i \quad\Rightarrow\quad \Delta_i \equiv \mu_{i2}\,, \qquad (1.43)$$

without loss of generality. In the simulations, the true differential expression $\Delta_i$ is assumed to be uniformly distributed,

$$\Delta_i \sim U(0, \Delta_{\max})\,, \quad i = 1, \ldots, P \cdot n_{\text{genes}}\,, \qquad (1.44)$$

for a fraction of $P \leq 1$ regulated genes. The other $(1-P)\,n_{\text{genes}}$ genes are assumed to be unregulated, i.e.

$$\Delta_i = 0\,, \quad i = P \cdot n_{\text{genes}} + 1, \ldots, n_{\text{genes}}\,. \qquad (1.45)$$

In the simulations, P has been chosen as 1% and 10%, respectively. The mean differential expression of the regulated fraction of genes is chosen as the measurement scale, i.e. all numbers are defined in physical units of

$$\operatorname{mean}_{i=1,\ldots,P \cdot n_{\text{genes}}}(\Delta_i) = 1\,. \qquad (1.46)$$

Because the true regulation is assumed to be uniformly distributed, a mean fold change of one for the regulated genes corresponds to a maximal fold change $\Delta_{\max} = 2$. The variances of the observational noise are defined as gene dependent with a uniformly distributed order of magnitude, i.e.

$$\sigma_i \sim \sigma_{\text{median}} \cdot d_\sigma^{\,U(-1,1)}\,. \qquad (1.47)$$

The parameter $d_\sigma$ denotes the extent of the gene dependency. In practical applications, we observed a gene dependency corresponding to a value of $d_\sigma = 2$. As an example, the measurements used in the application presented in Section 1.10 yield 0.40 for the ratio of the 10% quantile of the empirical standard deviations over the median, and 2.01 for the ratio of the 90% quantile over the median. In the application presented in Section 1.9, the two quantiles are 0.51 and 1.78, respectively. Therefore, $d_\sigma = 2$ seems a rather realistic value for the gene dependency of the noise level. A setting with a higher gene dependency is simulated with $d_\sigma = 5$, i.e. the $\sigma_i$ vary between $1/5 \cdot \sigma_{\text{median}}$ and $5 \cdot \sigma_{\text{median}}$. A median standard deviation of $\sigma_{\text{median}} = 1/2$ is chosen for a representative setting of a large signal-to-noise ratio. A setting with a worse signal-to-noise ratio is simulated with $\sigma_{\text{median}} = 2$. All chosen parameters for the simulation study are summarized in Table 1.3. As will be shown later, the gene dependency $d_\sigma$ is the critical parameter in the simulations which determines whether fold change rankings or p-value rankings are superior.


Parameter                                            Value
Number of genes                                      $n_{\text{genes}} = 10000$
True fold change                                     $\Delta \sim U(0, 2)$
Observational noise                                  $\varepsilon_{ij} \sim N(0, \sigma_i^2)$
Gene dependent amount of noise                       $\sigma_i \sim \sigma_{\text{median}} \cdot d_\sigma^{\,U(-1,1)}$
Median amount of observational noise                 $\sigma_{\text{median}} \in \{0.5, 1\}$
Factor for gene dependency of the noise              $d_\sigma \in \{2, 5\}$
Sample size within each experimental condition       $N_j \in \{3, 10\}$
Proportion of differentially expressed genes         $P \in \{1\%, 10\%\}$

Table 1.3: Parameters chosen for the simulation study.
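The simulation setting of Table 1.3 can be reproduced in a few lines. The following sketch (one parameter combination only; all names are illustrative) draws one data set according to (1.41)-(1.47) and computes both ranking statistics, the fold change and the t-statistic:

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, P, Nj = 10000, 0.10, 3          # one parameter combination of Table 1.3
sigma_median, d_sigma = 0.5, 2.0

n_reg = int(P * n_genes)
delta = np.zeros(n_genes)
delta[:n_reg] = rng.uniform(0, 2, n_reg)                       # (1.44), mean 1
sigma = sigma_median * d_sigma ** rng.uniform(-1, 1, n_genes)  # (1.47)

# Two groups of Nj replicates each; mu_i1 = 0 by convention (1.43)
y1 = rng.normal(0.0, sigma[:, None], (n_genes, Nj))
y2 = rng.normal(delta[:, None], sigma[:, None], (n_genes, Nj))

fold = y2.mean(axis=1) - y1.mean(axis=1)                    # effect size estimate
s2 = (y1.var(axis=1, ddof=1) + y2.var(axis=1, ddof=1)) / 2  # pooled variance
t = fold / np.sqrt(s2 * (1 / Nj + 1 / Nj))                  # t-statistic (1.49)
```

Ranking the genes once by the absolute fold change and once by the absolute t-statistic and comparing both against the true delta, e.g. with the Dice coefficient introduced above, reproduces the kind of comparison shown in Figures 1.8 to 1.10.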

Figure 1.8: Comparison of rankings on the basis of effect size estimates and p-values for simulated data. In the simulation settings, all parameter combinations shown in Table 1.3 are examined. Panel (A) shows rank correlations between the true ranking and the fold change rankings (black) and between the true ranking and the p-value rankings (red). The gene lists obtained on the basis of estimated fold changes yield better rank correlations for all simulation settings. Panel (B) shows the distribution of the rank of the most strongly regulated gene over many noise realizations. For a medium gene dependency of the noise, namely dσ = 2, the fold change rankings are preferable. For highly gene dependent noise, i.e. dσ = 5, the significance-based ranking is superior for some noise realizations. However, in most circumstances the fold change criterion yields superior ranks.

1.7.3 Results

Panel (A) of Figure 1.8 shows boxplots of the rank correlations for every simulation setting, obtained for a hundred noise realizations. The boxes indicate the inter-quartile range between the 25% and the 75% quantiles. The dashed lines at the ends of the boxes, so-called whiskers, show the spread of all values. The maximal whisker length is limited to 1.5 times the inter-quartile range. Outliers beyond the whiskers are depicted as dots. Medians are plotted as horizontal lines inside the boxes.


A rank correlation of one would correspond to an identical order of the genes. Random rankings would yield an expected rank correlation of zero. For all simulation settings, i.e. for all parameter combinations summarized in Table 1.3, the fold change rankings (black color) yield superior, i.e. larger, rank correlations. Panel (B) shows the ranks obtained for the most strongly regulated gene. In this figure, a rank equal to one corresponds to the gene which is predicted to be most strongly regulated. For a typical gene dependency of the noise, namely for $\sigma_i$ between $1/2 \cdot \sigma_{\text{median}}$ and $2 \cdot \sigma_{\text{median}}$, the fold change ranking is again superior. For highly gene dependent noise, i.e. if $\sigma_i$ is chosen between $1/5 \cdot \sigma_{\text{median}}$ and $5 \cdot \sigma_{\text{median}}$, there is a strong dependency of the outcome on the noise realization. Although both methods yield similar performance here, a slight advantage of the fold change rankings is observed. Figure 1.9 shows Dice coefficients $d_t(N)$ using p-values (red lines) and $d_{\hat\Delta}(N)$ using fold change estimates (black lines). The average curves over one hundred noise realizations are plotted. The standard errors of the averages are very small and hardly visible. For standard deviations which differ by a factor of two in both directions, the fold change rankings are superior (compare panel (A)). If the noise differs by a factor of five in both directions, p-value rankings are superior if a small number of genes is identified (panel (B)). This outcome is qualitatively independent of the other parameters of the simulation study. Figure 1.10 shows ROC curves for all chosen simulation settings. Again, rankings based on the effect size estimates are superior for noise which varies moderately between the genes. Panel (B) shows the outcome for strongly gene dependent noise amounts. Here, the p-value rankings are superior for a small number of identified genes. Altogether, the comparison of the performance of rankings based on fold change estimates and p-values depends only slightly on the proportion of regulated genes, on the average signal-to-noise ratio, and on the number of experimental repetitions for realistic settings of microarray studies. The decisive parameter is the gene dependency of the noise $d_\sigma$. This outcome is plausible because fold change estimates

$$\hat\Delta_i = \hat\mu_{i2} - \hat\mu_{i1} \qquad (1.48)$$

are calculated independently of the observed variances $\hat\sigma_i^2$, i.e. the spread of the measurement repetitions is ignored. This yields biased and therefore suboptimal rankings if the measurement noise is gene dependent. In a common analysis, a threshold for the statistic is defined to identify genes of interest which are then further evaluated. Such a threshold is usually chosen to be stringent in terms of false positive predictions: the number of false positives is intended to be kept at a small level at the price of an increased number of false negatives. This is achieved by identification of only the strongly noticeable genes. Genes which are measured accurately are not likely to exceed such a threshold by chance. In contrast, genes with a large amount of observational noise are more likely to be identified because larger fluctuations of the fold change estimates occur. The t-statistic

$$t_i = \frac{\hat\Delta_i}{\widehat{SE}(\hat\Delta_i)} \qquad (1.49)$$

is defined as the ratio of the fold change estimate over the corresponding standard error.


Figure 1.9: The Dice coefficient for different numbers of identified genes. The Dice coefficient $d_{\hat\Delta}(N)$ as a measure of the incidence of gene lists is superior for $d_\sigma = 2$, i.e. for a noise level varying by a factor between 1/2 and 2 for different genes (panel (A)). The green arrows indicate an increasing proportion P of regulated genes, the orange arrows an increasing noise level. The parameter $N_j$ denotes the number of experimental repetitions. The insets show the same for the interesting part of the gene lists, namely the outcome for the hundred top-ranking genes. In panel (B), a strong gene dependency of the noise level is assumed; here, the p-value rankings are superior.

The standard error is given by

$$\widehat{SE}(\hat\Delta_i) = \hat\sigma_i \sqrt{1/N_1 + 1/N_2}\,. \qquad (1.50)$$

Here, the fold change estimates are considered relative to the estimation accuracy. It can be shown that the distribution of the t-statistic is independent of the true amount of observational noise $\sigma_i$; exactly this insight is utilized for the t-test. Therefore, every gene has the same chance of being identified and the obtained gene lists are unbiased with respect to a gene dependency of the noise. However, because the second moment, namely the variance $\sigma_i^2$, has to be estimated in addition


Figure 1.10: Receiver operating characteristic curves. Rankings based on effect size estimates are superior for $d_\sigma = 2$ (panel (A)), independently of the number of regulated genes, of the signal-to-noise ratio, and of the number of experimental repetitions. In contrast, for the setting with strongly gene dependent noise shown in panel (B), i.e. $d_\sigma = 5$, the p-value rankings are superior for very specific analyses, namely at the lower left corner of the ROC curves. The area under the curve is, however, clearly superior for the fold change rankings.

to the first moment from the data, the variance of the t-statistic is increased in comparison to the fold change statistic. This is another instance of the general bias-variance trade-off of estimators [COX & HINKLEY 1994]. Rankings based on effect size estimation are biased, but they show decreased variability. The decisive parameter determining which of the two statistics yields superior performance is the gene dependency of the noise. This parameter specifies whether a biased but more precise, or an unbiased but less accurate statistic is preferable.


1.7.4 Discussion

If an analysis is intended to uncover the differential expression of genes and to reproduce the true amount of regulation, rank correlations as well as Dice coefficients are appropriate objective functions to assess the quality of a gene ranking. For this goal, fold changes, which are the maximum likelihood estimates of the differential expression, are a reasonable choice for the statistic. The fold change, i.e. the mean, is known to be a sufficient statistic for the estimation of the location of a Gaussian distribution in the case of known variances [COX & HINKLEY 1994]. The term sufficient denotes that by calculating the mean, the complete information of the data for the estimation of the parameter of the measurement distribution is utilized. Therefore, fold change estimates are maximally informative for a ranking of genes in two extreme situations, namely in the case of a gene independent, or in the case of a known, amount of measurement noise. In these situations, the estimation of the variances from the data does not yield additional information. Fold change estimates are biased because the variability of the measurements of the genes is not considered. Because of fluctuations, noisy genes are more likely to be found at the extreme ends of fold change rankings, whereas more precisely measured genes tend to obtain medium ranks. This bias could be reduced by shrinking the fold change estimates towards zero for genes with larger noise levels, with the amount of shrinkage chosen according to $\hat\sigma_i$. This strategy would finally lead to a statistic which is similar, or even equivalent, to the t-statistic. In this way, the bias would decrease, but at the same time the variability of the rankings would increase. However, if the goal of an analysis is the separation between regulated and unregulated genes, then the t-test would be applied. For Gaussian noise, the t-statistic is uniformly most powerful for one-sided testing of the hypothesis that the regulation of a gene is zero [SAMUEL KOTZ 1985]. This means that there is no test statistic with a superior power for detecting deviations from this hypothesis. In this setting, the performance should be assessed with respect to the classification into regulated and unregulated genes, i.e. by the ROC curve. Here, the sensitivity and specificity of the identification of regulated genes is determined without taking the true amount of regulation into account. It has been shown that the values of the t-statistic show increased variability because the calculation requires the estimation of both the effect size and the amount of noise for every gene. This drawback can be reduced in a Bayesian framework, namely if prior knowledge about the genes' noise levels is utilized [SMYTH 2004]. In a regularized t-test, the gene specific variances are estimated as

$$\hat\sigma_i'^2 = (1 - \lambda)\,\hat\sigma_i^2 + \lambda\,\sigma_{(0)i}^2\,, \quad 0 \leq \lambda \leq 1\,, \qquad (1.51)$$

a weighted sum of the sampling variances and the priors $\sigma_{(0)i}^2$ [BALDI & LONG 2001, CUI et al 2005]. This decreases the variability, but at the same time introduces a bias due to imperfect prior knowledge. The larger the regularization parameter λ, the smaller the variability and the larger the bias. For maximal regularization, i.e. if exclusively prior knowledge is used (λ = 1), the outcome is the same as the fold change ranking. If no prior knowledge is utilized, the regularized t-test is equivalent to the standard t-test.
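A minimal sketch of the regularized t-statistic based on (1.51) is given below. The default prior (the median of the gene-wise variances) and the fixed λ are illustrative simplifications; empirical-Bayes methods such as [SMYTH 2004] estimate these quantities from the data:

```python
import numpy as np

def regularized_t(fold, s2, n1, n2, lam=0.5, prior_var=None):
    """Regularized t-statistic with shrunken gene-wise variances, eq. (1.51).

    lam = 0 recovers the ordinary t-statistic; lam = 1 with a gene
    independent prior makes the ranking equivalent to the fold change ranking.
    """
    if prior_var is None:
        prior_var = np.median(s2)        # simple gene independent prior
    s2_shrunk = (1 - lam) * s2 + lam * prior_var
    return fold / np.sqrt(s2_shrunk * (1 / n1 + 1 / n2))
```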

Both the interpretation of the t-statistic as a shrunken estimate of the fold change and the interpretation of the fold change as a maximally regularized t-statistic show that the two approaches are related. The fold statistic is adequate in the special case of perfectly known or gene independent noise levels. The p-value rankings are adequate if the noise is strongly gene dependent and no prior knowledge about its variation is available. Here, I introduced the Dice coefficient as an assessment criterion on the basis of the true amount of regulation. In addition, the ROC curve is utilized to assess the performance of the statistics with respect to the prediction whether a gene is differentially regulated. The outcomes of the simulations are only slightly dependent on the choice between both criteria.

1.7.5 Conclusions

The t-statistic is the uniformly most powerful statistic for the one-sided test for equal means in the case of Gaussian noise. Therefore, rankings of genes based on this statistic were expected to be superior to rankings obtained from effect size estimates, namely estimates of the fold change of differential expression. In practice, however, gene lists obtained from p-values have surprisingly been shown to be less reproducible in new experiments. In this section, I confirmed this empirical observation by simulations. It was shown that the gene dependency of the measurement variability, which in practice is due to technical as well as biological noise, is the decisive parameter determining the performance of both statistics. Estimates of the effect size are sufficient for the extreme case of known sampling variances. The t-statistic is optimal for the other extreme situation of large but unknown sampling variances. It has been discussed how both statistics are linked via the regularized t-test and a shrunken estimation of fold changes. The amount of shrinkage or regularization required for an optimal performance depends on the gene dependency of the noise level; accordingly, this parameter turned out to be decisive in the simulations. For gene expression data, prior knowledge about the variance of a gene can be derived from the measurements of the other genes. This estimation can take into account the absolute level of the measurements, i.e. the prior variances $\sigma_{(0)i}^2 \to \sigma_{(0)i}^2(\hat\mu_i)$ are defined as depending on the measurement averages. Publicly available data could also be used to estimate the genes' technical variability [KIM & PARK 2004]. Whether regularized methods really improve the outcome of an analysis heavily depends on the quality of the prior knowledge. A reliable determination of such prior knowledge is hampered by the fact that the noise level differs between experiments. This holds especially for the biological variability, which is usually highly context dependent. Although there are many established statistical approaches using regularization, the items of interest are commonly identified purely on the basis of significance. However, in the case of many competing items, it could be preferable to base the identification on effect size estimates. As an example, this comprises the issue of biomarker identification on the basis of high-throughput data, as well as classifier training, i.e. the identification of informative genes for the prediction of class memberships.


1.8 Confidence across methods and technologies

Within typical gene expression studies, several ambiguous choices between competing statistical approaches have to be made. For microarray data, slight changes in the data analysis often alter the results noticeably. In this section, the impact of changes in the data processing is systematically evaluated and compared to the consistency of results obtained for two different assays, i.e. for the cDNA platform as well as for the Affymetrix GeneChip platform. This evaluation was a major step in the establishment of a standard data processing strategy for the cDNA platform at the Core Facility Genomics in Freiburg. Studying the agreement of different strategies is essential to reliably examine the level of confidence in the experimental outcomes.

1.8.1 Introduction

In recent years, several approaches have been developed for microarray data preprocessing as well as for addressing the typical hypotheses in gene expression studies. In Section 1.3, the decisions to be made during the data preprocessing of two-color microarrays have been summarized. The ambiguity comprises the aggregation of foreground and background pixel intensities of a scan, the extent of the adjustment for systematic errors, and the stringency of the quality demands (see Figure 1.4). Similar issues also exist for other microarray platforms, like the most frequently used Affymetrix GeneChip technology. For this platform, there are around half a dozen well-established data processing procedures whose reasonable eligibility has been demonstrated. Although intensive research has been performed to compare and assess the different approaches, for most issues no definite and generally preferable strategy could be found. For a given question, the performances of the competing approaches strongly depend on the amount and quality of the data, on the biological variability, and on the correlations and magnitude of the gene regulation. Even slight variations in the setting or in the goal of a study can have a noticeable impact on the efficiency of a statistical approach. However, in experiments addressing a biological issue, the data sets are often too small and the experimental design too restrictive to determine an application-specific optimal statistical approach. In Section 1.7, it has been argued that even for the very classical issue of identifying marker genes, it is a priori impossible to decide whether it is more efficient to identify genes on the basis of effect size estimates or on the basis of significance. Moreover, there are several reasonable procedures both for the estimation and for the statistical testing. The approaches can differ in the statistical model, in the assumptions about the observational noise, as well as in the amount of regularization. Finally, different technological platforms can also be utilized to address a biological question. Again, there is no generally preferable technology [SHI et al 2006, SHIPPY et al 2006]. The current opinion is that different platforms have distinct preferences and provide complementary information about the underlying gene expression. For a comparison of the outcomes obtained from different platforms, it is essential to apply several data processing strategies to achieve general results.


Label   Technology   Description
A       cDNA         Standard setting
B       cDNA         Ratio of medians (instead of median of ratios) over pixel intensities within a spot
C       cDNA         Mean of ratios (instead of median) over the pixel intensities within a spot
D       cDNA         Bayesian background correction (instead of local subtraction)
E       cDNA         LOWESS smoothing parameter of 0.2 (instead of 0.1)
F       cDNA         Block-wise LOWESS correction (instead of global)
G       cDNA         Elimination of data points close to the background and/or with large within-spot CV
H       cDNA         Elimination of data points with potential saturation effects
I       cDNA         Quantile normalization (instead of linear scaling)
g       Affymetrix   Gene Chip Robust Multi-Array Analysis (GCRMA) [WU et al 2004]
r       Affymetrix   Robust Multi-Array Analysis (RMA) [IRIZARRY et al 2003b]
m       Affymetrix   Affymetrix Microarray Suite 5.0 (MAS5) algorithm [AFFYMETRIX 2002]
v       Affymetrix   Variance Stabilization Normalization (VSN) [HUBER et al 2002]
p       Affymetrix   Probe Logarithmic Intensity Error Estimate (PLIER) [MILLER & HUBBELL 2007]

Table 1.4: Summary of the nine applied data processing strategies for the cDNA data, and the five approaches applied for the Affymetrix data sets.

In this section, several alternative data processing procedures are systematically compared for the two-color cDNA and for the one-color Affymetrix microarray technology. The same mRNA samples have been quantified with both platforms. It will be shown that within both technologies, slight changes in the data analysis have a noticeable effect on the results. The disagreement is even more prominent between the two platforms. The presented approach constitutes a general methodology to assess the confidence in the experimental results, i.e. in the identification of biochemical targets on the basis of high-throughput data, if several competing experimental and statistical strategies are available.

1.8.2 Data and methodology

In this section, a subset of the hepatectomy data set which has already been examined in Section 1.6 is utilized, namely the gene expression measurements generated at day zero, one and two after partial hepatectomy of livers of black six (C57BL/6) mice. The RNA of three biological replicates has been measured by the custom made cDNA chip of the Core Facility Genomics in Freiburg. In addition, the experiments have been repeated by a quantification of the same RNA extracts with the Affymetrix Mouse Genome 430 2.0 Arrays. For the cDNA data, nine different preprocessing strategies have been applied. These strategies are summarized in Table 1.4. As the default setting for the preprocessing, the medians of the pixel-wise intensity ratios between the red and the green channels have been chosen as the measurement response, the spot-specific background is subtracted, and the locally weighted scatterplot smoothing (LOWESS) [YANG et al 2002] is applied for the within-slide normalization. In addition, the data is linearly scaled between different slides. The same setting has been used in Section 1.6.


As alternative measurement responses, "means of ratios" and "ratios of medians" of the pixel intensities of both dyes within a spot have been examined. Bayesian background subtraction [IRIZARRY et al 2003b] has been applied as an alternative approach for background correction. For the within-slide normalization, the smoothing parameter has been changed and LOWESS has been applied block-wise. Quantile normalization has been applied as an alternative between-slide normalization. Also, the impact of the stringency of the quality filtering was appraised: spots with a foreground over background intensity ratio below two, and spots with a coefficient of variation of the within-spot intensities above two, have been removed. As another alternative strategy, spots with intensities close to the saturation threshold have been eliminated. Because the pixel intensities are stored as 16 bit integers, the intensity range is between 0 and 65,535. To evaluate the impact of saturation, all spots with an intensity above 65,000 have been removed. Data preprocessing of the Affymetrix one-color microarrays comprises background correction, adjustment for unspecific binding, condensation of the measurements within a single probe set to a single value, and normalization between different chips [BOLSTAD et al 2003, IRIZARRY et al 2003a]. For the Affymetrix microarray platform, five well-established data processing approaches have been applied, namely the Robust Multi-Array Analysis (RMA) [IRIZARRY et al 2003b], the Gene Chip Robust Multi-Array Analysis (GCRMA) [WU et al 2004], the Affymetrix Microarray Suite 5.0 (MAS5) algorithm [AFFYMETRIX 2002], the Variance Stabilization Normalization (VSN) [HUBER et al 2002], and the Probe Logarithmic Intensity Error Estimate (PLIER) [MILLER & HUBBELL 2007, THERNEAU & BALLMAN 2008]. For the identification of the genes which are involved in the early regeneration phase after hepatectomy, i.e. which are regulated within the first two days after hepatectomy, an analysis of variance (ANOVA) has been performed; ANOVA has been discussed in Section 1.4. The genes are ranked according to the mean squares as well as according to the calculated p-values. Mean squares are the counterparts of the effect size estimates for more than two experimental conditions. The ANOVA p-values depend monotonically on the test statistic, here the F-statistic; therefore a ranking of the genes according to the p-values is equivalent to a ranking according to the test statistic. For the comparison of the outcomes of the two approaches, Dice coefficients have been determined to assess the concordance of the gene rankings. Dice coefficients have been introduced in Section 1.7 as a measure for the overlap of gene lists. If a gene is regulated between the samples, measurements of oligo or cDNA sequences of such a gene are expected to be correlated. Therefore, the performances of the individual procedures have been assessed by evaluating the correlation of measurements of sequences which are mapped to the same gene. These correlations have been compared with the outcome for the same number of randomly chosen pairs of sequences.
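As an illustration of strategy "I" in Table 1.4, the following sketch implements a simplified quantile normalization that forces all slides to share the same intensity distribution (ties are ignored; this is a conceptual sketch, not the code used at the Core Facility):

```python
import numpy as np

def quantile_normalize(x):
    """Simplified quantile normalization of a genes-by-slides matrix.

    Every value is replaced by the mean, across slides, of the values
    sharing its within-slide rank, so all slides end up with an
    identical intensity distribution. Ties are ignored in this sketch.
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # rank within each slide
    reference = np.sort(x, axis=0).mean(axis=1)        # common target distribution
    return reference[ranks]

x = np.random.default_rng(4).lognormal(size=(1000, 6))  # 1000 genes, 6 slides
x_norm = quantile_normalize(x)  # columns now share identical sorted values
```

In contrast to the linear scaling used as the default, this approach equalizes the entire distribution rather than only its first two moments.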


Figure 1.11: In panel (A), the ranked data, the ranks of the mean squares, and the ranks according to the ANOVA test statistics are compared for several preprocessing strategies. The strategies “C” and “F”, namely the use of within-spot intensity averages instead of the median and the block-wise within-slide normalization, have a very noticeable impact on the outcome. Panel (B) shows Dice coefficients measuring the concordance of the gene rankings. The red lines indicate the outcome obtained according to the test statistic, the black lines show the results according to the mean square values. In the plots above the diagonal, the Dice coefficient is plotted for the range up to all genes; below the diagonal, the concordance within the hundred top-ranking genes is depicted. Here, the rankings on the basis of the mean squares are less sensitive to changes in the data processing strategy. In panel (C), the correlations of the data of multiple cDNA sequences from the same genes are plotted. The gray lines indicate the reference obtained from random pairs of sequences. Here, all approaches yield a similar performance. Only if data points close to the background are removed from the analysis, i.e. for method “G”, are the correlations increased. However, this is an artifact due to a decreased number of data points; the same effect is therefore also observed for the reference curve, as indicated by the black arrows.


1.8.3 Results

Panel (A) in Figure 1.11 shows scatterplots of the ranked data, the ranked mean square values and the corresponding ranked test statistics. The deviations from the diagonal are due only to minor changes within the preprocessing. The test statistic seems to be slightly more affected than the mean square values. Method “C”, i.e. the use of intensity means over the pixels within a spot instead of the median, has a very noticeable effect on the processed data. Also method “F”, namely the block-wise within-slide normalization, has a very strong effect on the outcomes.

Panel (B) shows Dice coefficients as measures of the concordance of the rankings. The outcomes obtained by ranking the genes according to the mean squares are plotted as black lines, the results according to the test statistic are displayed in red. In the figures above the diagonal, the range from a single gene up to all genes is evaluated. In the plots below the diagonal, only the range up to a hundred genes is depicted; this corresponds to the interesting part of the gene rankings. Here, Dice coefficients range roughly between 0.7 and 0.8 for the mean squares. The results obtained on the basis of the test statistics are clearly more affected by changes in the data processing, which is indicated by a decreased concordance. If the average over the within-spot intensities is used as the measurement response, i.e. for method “C”, a very low concordance according to the test statistic is observed, although the rankings on the basis of the mean squares are quite reproducible. The same holds if spots with a minor signal-to-noise ratio are eliminated, i.e. for method “G”. Here, the number of data points is decreased, which obviously strongly affects the test statistic.

In panel (C), the cumulative distribution of the correlations between pairs of cDNA sequences which correspond to the same gene is depicted (colored lines). This provides an objective criterion for the reproducibility of the measurements. All data processing approaches yield a similar performance. Only for method “G”, i.e. if spots with a minor signal-to-noise ratio or intensities close to the background have been removed, are the correlations slightly increased. However, a decreased number of data points automatically leads to a broader distribution of the correlations and to a less sigmoidal shape of the cumulative distribution. As indicated by the black arrows, this effect is also observed for method “G” in the reference distribution obtained for randomly selected pairs of spots. Therefore, the deviation observed for method “G” seems to be primarily an artifact.

For the Affymetrix one-color technology, there is a very strong effect of the data processing strategy on the single data points, on the mean squares, and on the test statistic. This is depicted in panel (A) of Figure 1.12. Panel (B) shows the outcome for the subset of genes which have been identified as present in the majority of the samples. This requirement is fulfilled for almost 21 000 out of the 45 000 probe-sets. The plots in panel (B) indicate that this filtering step essentially removes the sensitive genes, which considerably improves the reproducibility between the different preprocessing approaches. Figure 1.13 shows the concordance of the gene rankings obtained by the five data processing strategies. Dice coefficients are depicted for all genes in panel (A) and for the “present” genes in panel (B).
Again, the plots below the diagonal depict the concordance up to one hundred genes, and the figures above the diagonal show the outcome for the complete range up to all genes.
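A minimal sketch of the correlation-based reproducibility criterion used throughout this comparison follows; it correlates measurements of feature pairs mapped to the same gene and contrasts them with random feature pairs. The gene-to-feature mapping and all data are synthetic.

```python
# Same-gene feature pairs should be more correlated than random pairs if the
# data carries information about gene regulation. Synthetic illustration.
import numpy as np

def pair_correlations(data, pairs):
    return np.array([np.corrcoef(data[i], data[j])[0, 1] for i, j in pairs])

rng = np.random.default_rng(2)
n_genes, reps = 200, 9
signal = rng.normal(size=(n_genes, reps))
# two features per gene = shared regulation signal + independent noise
data = np.vstack([signal + 0.7 * rng.normal(size=signal.shape),
                  signal + 0.7 * rng.normal(size=signal.shape)])
same_gene = [(i, i + n_genes) for i in range(n_genes)]
random_pairs = [tuple(rng.choice(2 * n_genes, 2, replace=False))
                for _ in range(n_genes)]
print("same gene:", pair_correlations(data, same_gene).mean())
print("random   :", pair_correlations(data, random_pairs).mean())
```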


Figure 1.12: The ranked data, the ranks of the mean squares, and the ranks according to the test statistics are compared for several data processing approaches. For the one-color microarrays, the outcomes differ strongly between the approaches if all genes are examined (panel (A)). If only the genes with present calls in the majority of the samples are evaluated, the results are clearly more reproducible (panel (B)). At the level of single genes, the ranks of the mean squares and the ranks of the test statistic show similar variability.

Up to a hundred genes, the results on the basis of the mean squares are more reproducible. The method “p”, i.e. the PLIER approach, yields a very low concordance if all genes are analyzed (panel (A)). Here, the difference in comparison to the result after utilizing the present calls is so obvious that it seems advisable to use the PLIER approach only in combination with a filtering of the genes by the present calls. Panel (C) shows the cumulative distributions of the correlation coefficients of the data of probe-sets from the same gene. Here, the GCRMA approach outperforms the alternatives, whereas the MAS5 algorithm yields the worst rating. If only present-called genes are analyzed, the RMA, VSN, and PLIER approaches are superior; again, the MAS5 approach yields the worst outcome.

For the comparison of the results obtained by both technological platforms, the probe set identifiers of the oligonucleotide sequences on the Affymetrix one-color microarray had to be mapped to the GenBank accession IDs used for the cDNA sequences on the two-color array. For this purpose, the gene conversion tool of the Database for Annotation, Visualization and Integrated Discovery (DAVID) has been utilized [DENNIS et al 2003].


Figure 1.13: Dice coefficients are depicted for the outcome of different data processing methods for one-color microarrays. The lower left part of the plots in panels (A) and (B) shows Dice coefficients for the first hundred genes; this constitutes the interesting part of the gene lists. Here, rankings according to the mean square statistics are mostly less dependent on the data processing than rankings based on the ANOVA test statistic. The upper right figures show the Dice coefficients for the whole range up to all genes. Panel (B) shows the same after filtering out the genes which yield data close to the background of unspecific binding; thereby, the sensitivity to the data processing clearly decreases. Panel (C) shows the correlations of probe-sets of the same genes (colored lines) in comparison to randomly selected pairs of probe-sets (gray lines). The GCRMA approach yields superior correlations. However, this outcome changes if only genes with present calls are analyzed; here, RMA, VSN, and PLIER are superior. The MAS5 algorithm yields the worst performance in both cases.

For 8020 out of 22 555 GenBank accession identifiers, one or several corresponding probe set IDs could be found within the present-called probe-sets on the one-color microarray. In the case of an ambiguous assignment, a matching identifier is randomly selected to obtain the same number of features for both technologies, which is a prerequisite for the comparisons. Figure 1.14 shows the concordance of the gene rankings within and between the platforms.
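A minimal sketch of the random resolution of ambiguous assignments, with a toy mapping table in place of the actual DAVID output:

```python
# Resolve ambiguous GenBank-accession-to-probe-set mappings by random choice,
# so that both platforms contribute exactly one feature per accession.
# The mapping table below is a toy example, not real DAVID output.
import random
random.seed(0)
mapping = {"BC001234": ["1417263_at", "1450678_s_at"],
           "AK008765": ["1422345_at"]}
one_to_one = {acc: random.choice(ids) for acc, ids in mapping.items()}
print(one_to_one)
```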


Figure 1.14: Dice coefficients are plotted for all approaches for both technologies. The rankings according to the mean squares (black curves) are in general more reproducible within, but especially between, the different platforms. Again, the upper right figures show the outcome for all genes; the lower left figures show the overlap for the hundred top-ranking genes.

Within the first hundred genes, the Dice coefficients decrease to a level of around 0.2 to 0.3 for the rankings on the basis of the mean squares (black lines) between both technologies. For rankings according to the test statistic, the Dice coefficients even decrease below a level of 0.1 (red lines). For the PLIER approach, the concordance is close to zero for both statistics. The concordance on the range of all genes is displayed in the figures above the diagonal. All curves are rather close to the diagonal line, which would be the expected outcome for random rankings. The within-technology comparisons in Figure 1.14 differ slightly from the previous Figure 1.13 because only the subset of features is depicted for which a mapping between both technologies has been found.

A hierarchical clustering of all processed data sets has been applied to evaluate the similarity of the measurements on the level of all genes. Hierarchical clustering is discussed in more detail in Section 1.9.2.


Figure 1.15: Hierarchical clustering of the processed data shows that, on the level of all genes, the experimental condition, namely the time after hepatectomy, is the dominating effect. This constitutes a minimal prerequisite for a reliable analysis of the gene expression. However, the data tend to cluster more according to the technology than according to the samples.

To adjust for the systematically different measurement scales of both technologies, the mean of the logarithmic expression values for each individual gene was subtracted within both technologies. This step has no influence on the results discussed above, i.e. on the mean squares and on the ANOVA test statistic. For the clustering, Pearson correlations have been used as a pairwise measure of similarity, and average linkage has been chosen for the comparison of groups of items. Figure 1.15 shows the expression data as a heatmap after clustering of both dimensions, i.e. of the processed data sets and of the genes.

All samples, both technologies, and all data processing approaches cluster in three separate groups according to the experimental condition, i.e. according to the point in time after the partial hepatectomy. This confirms that the factor of interest is the dominating effect. The least difference is observed between the different data processing strategies within a technology for the same sample, which confirms that all data processing procedures yield informative outcomes. Another reasonable outcome is that the expression data measured at day one and day two are more similar to each other than to the expression profiles before hepatectomy, i.e. at day zero. However, at the level of the technologies and samples, the processed data sets tend to cluster rather according to the technology than according to the biological sample. Only the data sets of three out of the nine samples cluster completely together, namely the samples one, three and six. In agreement with the comparison of the concordance of the gene lists, this is a clear hint that there are systematic measurement deviations between both technologies which are not removed by any data processing approach.


Identification frequency over the 18 method-statistic combinations (methods A-I, each with a fold-change-based and a p-value-based selection of ten genes):

Gene            Frequency
Lcn2            94%
2310016E02Rik   50%
Mfsd2           50%
Mt1             50%
Mt2             50%
BQ557415        44%
Cxcl1           44%
P4hb            44%
Rpl13a          44%
Rps29           44%
BG075190        39%
BG085369        39%
Eef1b2          39%
Pole4           39%
Rpl38           39%
Tnfrsf12a       39%
Arf1            33%
Mrap            33%
Apoc2           22%

Overlap of each method's gene list with method A (fold change / p-value): B 40%/100%, C 90%/50%, D 60%/100%, E 100%/100%, F 80%/30%, G 90%/90%, H 100%/100%, I 100%/80%.

Table 1.5: Comparison of the ten immediate-early genes identified after hepatectomy by each data analysis approach. Lipocalin (Lcn2) is identified in 17 out of 18 cases. The other genes are identified in only half or less than half of the cases. Some genes are preferentially selected on the basis of significance, others on the basis of the fold change estimates.

Another, but minor, reason for this outcome could be an imperfect mapping of the features evaluated by both technologies.

In the previous Section 1.6, the so-called “immediate-early genes”, which are upregulated initially after hepatectomy, have been identified on the basis of the standard data processing method. Ten genes are selected on the basis of the fold change estimates and another ten genes on the basis of the ANOVA p-values. Table 1.5 shows the outcome for all data processing strategies. Only Lipocalin is identified by almost all data analysis strategies.

1.8.4 Conclusions

Microarray gene expression data can be processed and analyzed by several reasonable approaches. The performances of the competing methods usually depend on the experimental setting; therefore, there are no generally preferable strategies. However, the outcomes usually depend noticeably on slight changes in the analysis of the data. In this section, a systematic evaluation of this impact has been presented. The correlations of features which are assigned to the same gene are calculated to assess the performance of single strategies. Dice coefficients of gene rankings for one up to all genes are calculated to compare the outcomes within and between the cDNA and the Affymetrix platforms.

Moreover, it could be shown that both technologies and all applied data processing approaches yield informative data. A hierarchical clustering yields reasonable results, and the data of features corresponding to the same gene have been found to be clearly correlated. Here, the Affymetrix technology yields a superior outcome. On the other hand, at the level of single genes, the results have been shown to depend noticeably on the data processing.


Technologies | Compared conditions | # replicates | Affy data processing | # features | Approaches for the comparison | Ref.
Affy, cDNA, RT-PCR | 2 treatments, cell line | 3 | MAS5 | 47 | Correlations | [YUEN et al 2002]
Affy, cDNA, cDNA | 4 breast cancer cell lines | 1, 3, 3 | MAS5, RMA | ≈ 1-2k | Correlations | [JÄRVINEN et al 2004]
Affy, Affy, cDNA | Murine livers, 2 strains | 4 | RMA | ≈ 1k, 4k | Correlations, Venn diagrams, PCA, log-odds concordance of p-values | [WOO et al 2004]
Affy, cDNA | 5 healthy human tissues, colonic mucosa | 1 | MAS5 | ≈ 700 | Correlations, present calls concordance | [MAH et al 2004]
Affy, bead | Dilution series, 6 mixtures of human blood and placenta | 2 | RMA | ≈ 36k | Correlation with dilution | [BARNES et al 2005]
Affy, cDNA, RT-PCR | Angiotensin-treated mice vs. control, 2 times | 2-4 | DCHIP | ≈ 6k-12k | Correlations, clustering, PCA | [LARKIN et al 2005]
Affy, cDNA, RT-PCR | Human HNSC carcinoma, controls, metastases | 6, 4, 2 | MAS5, VSN, GCRMA | ≈ 4k | Correlations, gene ontology | [SCHLINGEMANN et al 2005]
Affy, cDNA, RT-PCR | Dilution series, mixtures of four cell lines | 2 | RMA, MAS5 | ≈ 4k-12k | Correlations, Venn diagrams, rank concordance | [IRIZARRY et al 2005]
cDNA, Affy, one-c, RT-PCR | 3 human cell lines, reference RNA | 3 | RMA | ≈ 1k | Correlations, Venn diagrams, clustering | [DE REYNIÈS et al 2006]
Affy, one-c | Human breast cancer cell line, treatment vs. control | 2 | RMA | ≈ 8.5k | Correlations, clustering, gene ontology | [SEVERGNINI et al 2006]
One-c, one-c, cDNA, RT-PCR | Mixtures of UHRR and HBRR | ≥ 15 | - | ≈ 12-20k | Correlations, Venn diagrams | [PATTERSON et al 2006]
Affy, one-c, one-c, one-c, bead | Dilution of std. reference RNA, 4 mixtures of UHRR and HBRR | 10-20 | MAS5, RMA, GCRMA, PLIER | ≈ 12k | Correlations, titration response | [SHIPPY et al 2006]
Affy, one-c, one-c, one-c, bead, RT-PCR | Dilution of std. reference RNA, 4 mixtures of UHRR and HBRR | ≥ 15 | PLIER | ≈ 11.5k | Correlations, Venn diagrams, clustering | [CHEN et al 2007b]
Affy, cDNA | Human liver, lung, spleen | 5 | MAS5, GCRMA, VSN, DCHIP | ≈ 9k | Correlations, Venn diagrams, clustering, discriminant analysis | [STAFFORD & BRUN 2007]
Affy, Illumina | Human monocytes vs. macrophages | 2 | RMA | ≈ 13k | Correlations, Venn diagrams, present call concordance, gene ontology | [MAOUCHE et al 2008]
cDNA, cDNA, RT-PCR | Human cell line, 2 treatments, 2 times | ≥ 3 | - | ≈ 11.5k | Correlations, Venn diagrams, clustering, PCA | [HOCKLEY et al 2009]

Table 1.6: Summary of the studies comparing different microarray technologies. “Affy” refers to the Affymetrix platform, “one-c” to another one-color microarray platform, and “cDNA” to two-color platforms. “UHRR” and “HBRR” refer to the Universal Human Reference RNA and the Human Brain Reference RNA, respectively. In most studies, the outcome is assessed only by Venn diagrams and/or on the basis of correlations, either of the data or of the fold change estimates. Also clustering and principal component analyses (PCA) have been performed. Only in a single study, namely in [IRIZARRY et al 2005], is the concordance of gene rankings systematically analyzed. Most studies used only a single preprocessing strategy. The applied algorithm for the Affymetrix technology is denoted.

Between both technologies, a very strong disagreement has been found, at least for the purpose of identifying the most strongly regulated genes. On the basis of the mean squares, the cDNA microarray data yield Dice coefficients of around 80% for a hundred identified genes. This means that only 80% of the genes are still identified if a single parameter within the preprocessing of the cDNA data is changed to another reasonable value. The rankings on the basis of the ANOVA p-values have been found to be more sensitive to changes in the data processing. Here, the concordances vary strongly between the processing strategies; in most comparisons, Dice coefficients were roughly around 50% for the hundred top-ranking genes. For the Affymetrix oligonucleotide arrays, filtering according to the present calls has been shown to be essential. After filtering, Dice coefficients are between 50% and 95% for the hundred top-ranking genes for the five applied preprocessing approaches if the mean squares are utilized. For the ANOVA test statistic, Dice coefficients were roughly around 50%.


In the existing literature, it could already be shown that filtering by the present call removes probe-sets which are likely to be unreliable while preserving regulated genes [MCCLINTICK & EDENBERG 2006]. In this study, the utilization of mismatch probes for the determination of the present calls has again been shown to be essential for reliable results. This contradicts the decisions which have been made in the development of recent chip technologies; as an example, the latest Affymetrix tiling and exon array technologies do not contain any mismatch probes.

For the utilized data, the comparisons between both technologies yield Dice coefficients for the top hundred genes in the range of around 20% to 30% for rankings on the basis of the mean squares. Here, the Dice coefficient was even close to zero if the F-statistic is used for the ranking. In general, inter-platform studies are expected to yield a lower level of concordance than intra-platform comparisons because more sources of variability contribute. As in intra-platform comparisons, distinct data processing approaches have to be applied. In addition, systematic errors, another realization of the technical noise, and possibly an imperfect mapping between the features on both microarray platforms contribute. In our study, the same realization of the biological noise is analyzed by both techniques because the same RNA extracts have been measured.

Each experimental microarray technique has its specific preferences. In this sense, the scope of the results obtained by an experiment is always restricted to the utilized platform. Therefore, validation experiments should preferably be performed with another technique to obtain more general conclusions.

Several other studies have been performed to assess the reproducibility of gene expression measurements across several platforms. The largest studies have been performed within the Microarray Quality Control (MAQC) project, which was initiated by the U.S. Food and Drug Administration (FDA) in 2005-2006. For the comparison of microarray platforms [PATTERSON et al 2006], the manufacturers’ standard data processing approaches have been applied. In analogy to the present call filter, around half of the features on the arrays have been removed before the comparison due to quality control filters. Mixtures of two standardized mRNA samples have been quantified, namely of the Stratagene Universal Human Reference RNA (UHRR) and of the Ambion Human Brain Reference RNA (HBRR). In comparison to our project, more reproducible results are expected because the standard RNA samples are not affected by biological variability between replicates and therefore have a very small level of intra-group variability. On the other hand, the “universal human” vs. “brain” samples are expected to be more heterogeneous than the measurements in our hepatectomy project, where only a single tissue is examined. In this MAQC study, the concordance of genes with p-values below 0.01 and 0.05 and a fold change above 1.5, two, and four has been examined. For these settings, Dice coefficients between 35% and 88% have been determined [PATTERSON et al 2006].

Table 1.6 summarizes the existing literature. In most cases, the argumentation is performed on the level of correlations, either of the single data points or of the fold change estimates. However, the observed correlations are difficult to generalize because they strongly depend on the magnitude of the underlying gene regulation.
In most studies, Venn diagrams are also determined to assess the concordance of the identified genes. Venn diagrams allow for a proper interpretation concerning the intended analyses, but are very sensitive to the chosen threshold, i.e. to the number of identified genes.


Clustering and principal component analysis (PCA) are performed in most studies to confirm that all platforms yield qualitatively the same outcome at the level of all genes. Only in a single study, namely [IRIZARRY et al 2005], is the concordance of gene lists analyzed systematically, i.e. independently of thresholds and for several data processing approaches. Here, a calibration-like design has been chosen, i.e. different mixtures of four human cell lines have been analyzed, and ten laboratories contributed to this project. A concordance within the range of 0.3 to 0.8 has been derived in this study.

For our project, it was essential to evaluate the level of reproducibility of the custom cDNA platform which is utilized at the Core Facility Genomics in Freiburg. Although the generalization of the results to other projects and platforms is restricted, the applied methodology is very general and could also be applied in other projects. In addition to the single gene level, an alternative interpretation of the gene expression data is obtained by gene set analyses [BARTHOLOME et al 2009]. Here, groups of genes are classified according to their functional annotation, and the functional groups are analyzed and ranked according to their regulation. Equivalent to the procedure presented in this section, the impact of the chosen data processing strategy, or the comparison between both technologies, could also be assessed at the level of gene sets.

A major drawback of classical gene expression analyses is that only single preprocessing methods are considered. However, the statistical measures for the confidence hold only within a certain approach and do not include the sensitivity with respect to the preprocessing. This uncertainty in the outcome due to the choice of a specific analysis can only be appraised if several approaches are applied in parallel. In the hepatectomy project, only a single one of the nineteen target genes, namely Lipocalin, is identified independently of the preprocessing for both the mean square and the F-statistic. In addition, four transcripts are identified across all preprocessing strategies for a single statistic.

For this project, I acknowledge my collaborators from the Core Facility Genomics of Prof. Gerd Walz, especially Dr. Thorsten Kurz for intensive and stimulating discussions. He also performed the bulk of the experiments.

1.9 Applicability of microarrays to fixed tissues

In this section, an approach is presented which allows for reliable microarray analyses of formalin-fixed, paraffin-embedded tissues. The data quality is assessed relative to a standard cell preparation, and a strategy is introduced to adjust for the systematic noise. The results of this chapter have been published in LASSMANN et al [2009].


1.9.1 Introduction

Formalin fixation and paraffin embedding (FFPE) constitutes a standard sample preparation procedure in histopathology, i.e. in the microscopic diagnostics of cancer tissues obtained by biopsy, by resection during surgery, or even by autopsy. During the last decades, large tumour tissue banks with thousands up to millions of FFPE samples have been established in western countries. As an example, in the tumor bank of the Ludwig Heilmeyer Tumor Center at the University Hospital in Freiburg, around 50 000 new samples are collected per year. Such amounts of human cancer samples constitute an invaluable resource in cancer research. In addition, FFPE tissues would allow for retrospective clinical studies with less effort and cost than fresh tissues. However, the microarray technique was so far only applicable to fresh or to fresh frozen (FF) tissues. The reasons for the limited applicability of routinely processed FFPE tissues are RNA fragmentation and cross-linking of the RNA with other molecules.

The research efforts enabling microarray analyses for FFPE samples comprise the use of shorter RNA sequences and the improvement of RNA extraction and processing protocols [BIBIKOVA et al 2004, CHEN et al 2007a, KARSTEN et al 2002, LOUDIG et al 2007]. These efforts were partly successful for microarray hybridization [SCICCHITANO et al 2006]. Further improvements were achieved by using new microarray techniques [BIBIKOVA et al 2008], by cDNA-mediated annealing, selection, extension, and ligation [BIBIKOVA et al 2004, RAVO et al 2008], as well as by random priming for the detection of degraded RNA. Some proof-of-concept studies, also evaluating the limitations, have been provided [COUDRY et al 2007, FRANK et al 2007, LINTON et al 2008, SRIVASTAVA et al 2008].

This study includes resection specimens from patients who underwent surgery to remove a colorectal cancer. In our application, five cases, i.e. five patients with samples of both fresh frozen and corresponding FFPE cancer tissues, have been evaluated. To examine the data quality in a clinical setting, the five FFPE data sets have been additionally compared with three healthy FFPE tissues from the same patients and with four additional cases suffering from another histotype of colorectal tumour. Four 5 µm serial sections of FFPE tissues have been deparaffinized. The RNA of the FFPE and FF tissues has been isolated, amplified and hybridized to the Affymetrix hgu133plus2 chip with 54 675 oligonucleotide probe sets. Exemplary sections of FFPE tissues are depicted in Figure 1.16.

We discovered that, due to strong systematic noise, the data from FFPE tissues cannot directly be compared to data from FF tissues. However, it will be shown that, after an adjustment for this bias, the FFPE data can still be used to discriminate different cases and different tumor subtypes. This constitutes a highly relevant finding for the evaluation of the clinical applicability of microarrays for FFPE tissues.

1.9.2 Hierarchical clustering

Hierarchical clustering is a pattern recognition approach which is widely applied in microarray data analyses to find groups of samples or groups of genes with similar expression patterns (see e.g. SPEED [2003]).


Figure 1.16: Panel (A) shows a formalin-fixed, paraffin-embedded colorectal tissue section of a healthy sample. Panel (B) shows the tubular histotype of a malignant colorectal tumour tissue, and panel (C) depicts the mucinous histotype.

In general, cluster analyses are applied to partition a set of observations into subsets with high similarity within, and less similarity between, the subsets. Cluster analyses can be interpreted as unsupervised classification because the clusters are assigned without any information about joint properties of the observations. The outcome of hierarchical clustering is a dendrogram, i.e. a tree diagram of hierarchically nested subsets. For hierarchical clustering over all evaluated samples, the major branches of the dendrogram indicate the dominating effects on the gene expression measurements.

After the definition of an appropriate measure of similarity, a similarity matrix is calculated containing the pairwise similarities of two samples. In our application, one minus the Pearson correlation coefficient is used. The Pearson correlation enables a straightforward interpretation; moreover, it is only slightly dependent on the number of genes and independent of a possibly varying scaling of the intensities from different microarrays. Prominent alternative distance measures are the Euclidean and the Mahalanobis distance. The similarity measure, or the similarity matrix respectively, allows only for pairwise comparisons of single samples. Therefore, a linkage rule has to be specified, defining the similarity of two groups of samples using the pairwise similarities. Average linkage, i.e. the mean of one minus the pairwise correlation, has been chosen for our cluster analyses. Alternative linkage rules are complete linkage, using the least pairwise similarity, i.e. the maximal distance, or single linkage, i.e. the best pairwise similarity.

The similarity matrix and the linkage rule allow for the construction of a nested sequence of clusters. In agglomerative clustering, the dendrogram is obtained starting with each observation forming its own cluster of size one. Larger clusters are then iteratively built by combining the most similar observations or groups of observations. Divisive clustering is the contrary approach.


Here, the dendrogram is constructed by iterative divisions, starting with a cluster containing all observations. Overviews of hierarchical clustering are provided in [DOUGHERTY et al 2002, JIANG et al 2004, TIBSHIRANI et al 1999, RAHNENFÜHRER 2005].

In this application, the effect of the fixation procedure on the gene expression measurements is to be compared with changes in the expression between different cases and different types of tumors. The expression of all genes is used jointly to compare the measurement outcomes for different samples. For this purpose, the discovery of the dominating effects on the gene expression, clustering constitutes an appropriate methodology.
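As an illustration of the procedure described above, the following Python sketch clusters synthetic samples with one minus the Pearson correlation as pairwise distance and average linkage; data, shapes and names are invented for illustration.

```python
# Agglomerative clustering with distance = 1 - Pearson correlation and
# average linkage, on a synthetic samples-by-genes matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
samples = rng.normal(size=(10, 500))        # 10 samples x 500 genes
corr = np.corrcoef(samples)                 # pairwise Pearson correlations
dist = 1.0 - corr                           # similarity -> distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])                          # leaf order of the dendrogram
```

Replacing method="average" by "complete" or "single" corresponds to the alternative linkage rules mentioned above.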

1.9.3 Data processing

The raw data in the *.cel Affymetrix standard file format has been processed with the Gene Chip robust multi-array (GCRMA) approach [WU et al 2004]. An extra quantile normalization step has been applied to the data set to ensure that patterns in the data are not due to global changes in the intensity levels of different chips.

For a single probe set, only relative changes between the samples are informative. The average intensity is not informative because the intensities of different probe sets are not comparable. Therefore, for constantly expressed genes, essentially only observational noise around the probe-set-specific average would be measured. As many genes are expected to be only weakly regulated between the patients, it is meaningful to remove these “noisy” genes from the analyses; otherwise, important effects could disappear in the observational noise of the unregulated genes. Dimension reduction by the elimination of uninformative genes has to be applied carefully (see e.g. [SIMON et al 2003]). Usually, it is intended to apply an unsupervised dimension reduction; otherwise, it is likely that biased estimates or artifacts are produced in further analyses.

My first strategy is filtering the genes according to their estimated regulation over all examined samples. Different numbers of probe sets i with the largest variance Var_j(y_ij) over all samples j have been selected. This dimension reduction strategy is unsupervised because no information about the samples is used. However, on the one hand, genes showing a larger technical or biological variability are preferred. On the other hand, this filter strategy assumes that the samples are representative with respect to the dominating sources of gene regulation. Assume, as an example, a subset of genes which is informative for a certain property of the samples. If the samples are not representative, e.g. if the property is realized only in a single sample, then this subset of genes has a worse chance of being selected in comparison to a subset of genes which is informative for a property realized in half of the samples. Thus, unequal amounts of noise for the probe sets and not perfectly representative samples could lead to a not perfectly representative subset of genes after dimension reduction. In extreme cases, this could in turn lead to biased estimates and to misinterpretations.

Therefore, as a second noise reduction strategy, Affymetrix’ present call is examined. Present calls have already been discussed in Section 1.8.3. Only probe sets which are labeled as “present” in all samples have been evaluated. Thereby, weakly expressed genes with measurements close to the background are removed.


Because, especially on the log scale, the amount of regulation is assumed to be independent of the total expression, this alternative strategy yields a different dimension reduction than the first strategy and therefore seems adequate for the purpose of confirmation. In our data set, the dimension is reduced from 54 675 probe sets to around 12 150 probe sets. The second dimension reduction strategy prefers the selection of genes with a large total expression. Here, a bias could be generated if genes with low expression behave differently from highly expressed genes. A realistic issue in our project is that the FFPE procedure is expected to yield a decreased amount of measurable RNA. In this case, the FFPE data quality would be overestimated if probe sets close to the background were removed in an initial dimension reduction step. Therefore, the second dimension reduction strategy is only applied as a validation of the results generated by the first strategy.
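Both dimension reduction strategies can be summarized in a few lines; the following sketch uses random data and mocked present calls purely for illustration.

```python
# Two unsupervised dimension reductions: (1) keep the k probe sets with the
# largest variance over all samples; (2) keep probe sets called "present"
# in every sample. Present calls are mocked with random booleans here.
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(54675, 10))            # probe sets x samples
present = rng.random((54675, 10)) > 0.5        # mock present calls

k = 1000
top_var_idx = np.argsort(-data.var(axis=1))[:k]        # strategy 1
present_idx = np.where(present.all(axis=1))[0]         # strategy 2
print(len(top_var_idx), len(present_idx))
```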

1.9.4 Results

In panel (A) of Figure 1.17, the upper right part of the matrix shows correlation coefficients of the gene expression data for the five cases for fresh frozen (FF) and formalin-fixed paraffin-embedded (FFPE) tissues. Obviously, the preparation procedure is the dominating source of data heterogeneity, leading to decreased correlations. Within a preparation, the correlations are around 0.95 for FF and between 0.91 and 0.95 for FFPE. Between different cell preparations, the correlations decrease to around 0.9. For the interpretation of the correlations, it is important to consider that there are three different effects determining the order of magnitude of the correlation coefficient. First, a better reproducibility in terms of technical and biological variability increases the correlations. In addition, the true extent of the differences in transcription between the samples, i.e. whether there is a strong or weak regulation, as well as the range of intensities for the probe sets on the chip obtained by a certain preprocessing strategy, also have a major impact. Therefore, claims about the reproducibility of the data based on the absolute values of the correlation coefficient have to be made carefully.

The lower left part of Figure 1.17 shows Dice coefficients as introduced in Section 1.7.1. Again, between both cell preparation approaches, the reproducibility is decreased. Within a preparation, the overlap of the 100 genes with maximal intensity is in the range of 62% - 82%. Between different cell preparations, the overlap decreases to 37% - 67%. Hierarchical clustering (Figure 1.18) confirms that the cell preparation is the dominating influence on the expression data: the first branch separates the data sets clearly according to the preparation.

All these preliminary results show that the FFPE preparation procedure introduces strong systematic errors. The question emerges whether there is any information in the FFPE data at all. If the data contains information about the gene expression, different probe sets coding for the same gene should be correlated within the FFPE samples if the gene is regulated between the cases. If genes are constantly expressed, then also probe sets for the same gene are expected to be uncorrelated. Addressing this idea, the pairwise correlations have been calculated for the 10 000 probe sets with maximal regulation over all samples.


[Figure 1.17; panels (A)-(E). Recoverable panel titles: “All Samples” (C), “FFPE Samples” (D), “Fresh Frozen Samples” (E); axis labels: Data set, No. of probe sets, Dice coefficient, Correlation, Frequency.]
Figure 1.17: Panel (A) shows correlation coefficients (upper right part, red colors) and Dice coefficients (lower left part, brown colors) of pairs of samples for the hundred genes with the largest intensities. The fresh frozen samples are labelled in black (S1-S5), the FFPE samples have blue labels (S1-S5). As in previous sections, the black curves depict the dependency of the Dice coefficients on the number of considered top-ranking genes. Both the correlation and the similarity of the gene rankings decrease if data sets from different preparations are compared. Panel (B) shows the Dice coefficients of a ranking according to the regulation within the FF and within the FFPE samples. An overlap of 44% is observed for the 100 probe sets with the largest variance; the black line depicts the overlap expected by chance. In panels (C)-(E), histograms of correlation coefficients between pairs of probe sets corresponding to the same gene are depicted and compared with correlations for random pairs of probe sets. Probe sets from the same gene are clearly more correlated. Both panels (B) and (C) confirm that there is information about the gene regulation in the FFPE data.


Figure 1.18: The heat map of the expression data obtained after hierarchical clustering of both the samples (vertical axis) and the probe sets (horizontal axis), as well as the dendrogram on the left, shows that the preparation procedure is the dominating effect on the gene expression measurements. Here, the thousand most regulated genes according to the variance over all samples have been used. The data has been normalized to a mean of zero and a standard deviation of one for each gene.

Figure 1.19: Heat map obtained after the adjustment for the systematic error generated by the FFPE procedure. Now, the data sets cluster according to the cases (S1-S5). This indicates that an adequate measurement procedure and data analysis enable informative microarray data for FFPE tissues. If FFPE and FF tissues are analyzed together, the adjustment is essential.

In Figure 1.17, panels (C)-(E), histograms of these correlation coefficients (blue) are compared with correlations obtained for randomly chosen pairs of probe sets (red) out of the same portion of most regulated probe sets. The shift of the distribution in panel (D) indicates that the FFPE data contains information about the gene expression, although the shift is more pronounced for the fresh frozen samples (see panel (E)).


There seems to be useful information about the gene expression in the FFPE data which is, however, covered by the bias introduced by the fixation. A linear model

    y_{ijp} = G_{ij} + P_p + ε_{ijp} ,   ε_{ijp} ~ N(0, σ_i²)            (1.52)

is used to estimate the cell preparation effects P_1 (FF) and P_2 (FFPE) and to adjust the data of gene i and case j for this systematic noise:

    y′_{ijp} = y_{ijp} − P̂_p .                                           (1.53)

In essence, the log-intensities for each probe set are thereby centered to a mean of zero in both preparation groups.
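A minimal sketch of this adjustment, assuming a genes-by-samples matrix on the log scale and a boolean mask marking the FFPE samples; the names and data are illustrative:

```python
# Adjustment (1.52)-(1.53): per probe set, estimate the preparation effect
# as the group mean and center both preparation groups to zero.
import numpy as np

def adjust_preparation(y, is_ffpe):
    """y: genes x samples (log scale); is_ffpe: boolean mask over samples."""
    y_adj = y.copy()
    for mask in (is_ffpe, ~is_ffpe):
        y_adj[:, mask] -= y[:, mask].mean(axis=1, keepdims=True)
    return y_adj

rng = np.random.default_rng(5)
y = rng.normal(size=(100, 10))
is_ffpe = np.array([False] * 5 + [True] * 5)
y[:, is_ffpe] += 2.0                      # simulated systematic FFPE bias
print(adjust_preparation(y, is_ffpe)[:, is_ffpe].mean().round(6))  # ~0
```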

This elimination step is independent of the information which sample belongs to which case. Therefore, the corrected data can still be considered as unsupervised with respect to the five cases. Figure 1.19 shows a hierarchical clustering using the top 100 regulated probe sets. In the dendrogram, the FFPE and FF pairs of each case cluster together. Figure 1.19 thus shows that, after the adjustment for the FFPE bias, the gene expression data identify the different cases. This confirms our expectation that an adequate experimental protocol enables microarray analyses of FFPE tissues. The adjustment for the systematic error is essential, especially if FFPE tissues are analyzed together with another cell preparation procedure.

In a clinical application, however, the data is required to detect more subtle changes in gene expression than those between two individuals. As an example, one would intend to discriminate malignant from healthy tissues or to predict a malignant cancer subtype. In order to confirm the outcome, the gene expression of healthy FFPE tissues of three cases was additionally measured. Further, four FFPE tissues from other patients suffering from the mucinous histotype of colorectal cancer have been evaluated. Using this data, we could show that the FFPE microarray data is capable of separating healthy from malignant tissues as well as the mucinous from the tubular histotype. Figure 1.20 shows that a hierarchical clustering with the thousand most regulated genes yields three branches with a perfect separation of the healthy and malignant tissues and even between the two tumour subtypes. A hundred up to five thousand genes have been selected to confirm that this outcome is qualitatively independent of the number of selected genes used for the cluster analysis. Also, the second dimension reduction, using only present-called probe sets, leads to the same result.

1.9.5 Discussion and Summary

Formalin fixation and paraffin embedding (FFPE) is the standard cell preparation in histopathology, and many diagnostic examinations are established in clinical practice for FFPE tissues. In addition, routinely processed FFPE tissue samples represent an extensive and valuable source for large-scale microarray-based retrospective studies. Up to now, the RNA abundances in FFPE tissues are measured using RT-PCR. Using microarrays as a high-throughput measurement technique opens the door for refinements in clinical diagnostics and promises progress in cancer research. Moreover, many other techniques applied in molecular biology require the fixation step to preserve the cells. As an example, for flow cytometric analyses, as presented in Chapter 3, the fixation of the cells by formalin is also required.


Figure 1.20: FFPE microarray data in a clinically relevant setting. The FFPE data enables the separation of malignant from healthy tissues. Further, it allows the identification of the mucinous (“muc”) and the tubular tumour histotype.

Here, we could demonstrate that the use of the latest microarray technique, in combination with an adequate RNA isolation, amplification and random priming and an adaptation of the data analysis procedure, enables the generation of reliable microarray data from FFPE tissues. This finding constitutes an important step in the establishment of clinical routine diagnostic procedures for FFPE tissues based on microarray gene expression data. Further details of this work are published in [LASSMANN et al 2009].

For this project, I acknowledge Dr. Silke Lassmann from the group of Prof. Werner at the Pathological Institute of the University Hospital in Freiburg for the very fruitful collaboration. She initiated the project and generated the microarray data.

1.10 Identification of housekeeping genes

In this section, a methodology for the identification of constantly expressed genes is introduced. The classical statistical approaches for the converse issue, the detection of biomarkers, i.e. of differentially expressed genes, are only of limited applicability for this purpose. Six candidates have been determined on the basis of microarray data. Four of them could be validated experimentally by quantitative real-time RT-PCR.

Housekeeping genes are constitutively expressed genes, i.e. they are continuously transcribed at a constant level. These genes are thought to be involved in basic processes, e.g. for the sustenance of the cells. A permanent constant expression is thought to be an essential requirement for the roles of these genes. In practice, housekeeping genes are important as experimental controls and for normalization.


For example, RNA concentration measurements by quantitative real-time reverse transcription polymerase chain reaction (RT-PCR) are not comparable between different experimental runs because RNA isolation and reverse transcription have varying efficiencies [HOLODNIY 1999]. Since the efficiency is assumed to be a multiplicative source of noise, ratios of the expression of the target genes to a housekeeping gene are used as a comparable experimental outcome. For such applications, it is obvious that valid housekeeping genes are essential for the quantitative interpretation of the data.

There are some classical housekeeping genes which have been widely used in the past, namely β-actin, which is a major component of the cytoskeleton, the enzyme Glyceraldehyde 3-phosphate dehydrogenase (GAPDH), which is essential in the glucose metabolism, or the ribosomal RNA sequence 18S. However, recent technological advances improving the precision of gene expression measurements showed that these classical housekeepers are by no means constantly expressed under all circumstances [BAS et al 2004, BUSTIN 2000, SCHMITTGEN & ZAKRAJSEK 2000, SELVEY et al 2001]. Recently, it has even been questioned whether it is in general possible to find valid housekeeping genes which are constant under all, or at least under a wide range of, physiological conditions [ANDERSEN et al 2004]. A way out of this issue is a context-dependent selection of the control genes. If the scope of an experimental study is defined, a methodology is then required for an application-specific identification of a valid set of housekeepers. In this section, such a strategy is introduced. On the basis of microarray data, candidate housekeeping genes are proposed for the investigation of signal transduction in primary mouse hepatocytes. These candidates have been validated experimentally by qRT-PCR.
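The following toy calculation illustrates why the target-to-housekeeper ratio is comparable across runs under a multiplicative efficiency; all numbers are invented.

```python
# Under a run-specific multiplicative efficiency, absolute measurements
# differ between runs, but the target/housekeeper ratio stays constant.
efficiency_run1, efficiency_run2 = 1.0, 0.6   # unknown in practice
true_target, true_hk = 80.0, 40.0
for eff in (efficiency_run1, efficiency_run2):
    measured_target = eff * true_target
    measured_hk = eff * true_hk
    print(measured_target / measured_hk)       # identical in both runs
```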

1.10.1 Data

The goal of this project is the identification of housekeeping genes in primary mouse hepatocytes. These genes are intended to be used as control genes for studies of the gene expression by qRT-PCR after stimulation of the cells’ signaling pathways, namely after stimulation with Transforming Growth Factor-β (TGFβ), with Hepatocyte Growth Factor (HGF), or with Interleukin-6 (IL6), three signalling molecules which are known to promote cell-division processes and immune responses.

The Affymetrix Mouse Genome 430 2.0 Array with 45 101 probe sets has been used for the microarray experiments. 57 microarray data sets were generated for the experimental conditions of interest; thereof, 49 could be used to identify potential housekeeping genes, while eight data sets had to be excluded because of artifacts. Details about the examined experimental conditions for the microarray data are summarized in Table 1.7. The HGF and cell preparation experiments were performed in “laboratory A” of Prof. Jan Hengstler at the Department for Molecular and Forensic Toxicology in Leipzig. Here, hepatocytes from three different individuals are examined. The IL6 and TGFβ experiments have been performed in “laboratory B” of Dr. Ursula Klingmüller at the German Cancer Research Center (DKFZ) in Heidelberg. Here, three biological replicates have been used, each consisting of pooled cell extracts out of two mice livers. All these experimental conditions and cell preparation procedures are representative for future studies for which proper housekeeping genes are required.


Table 1.7: Overview of the experimental conditions for the microarray data. Eight data sets had to be excluded from the analysis because of artifacts. The excluded microarrays are indicated in brackets.

For the experimental validation, the new candidate genes have been evaluated by quantitative real-time RT-PCR, together with known housekeeping genes and some target genes. Here, comparable experimental settings were evaluated. Two points in time have been measured, namely 8h and 24h after the HGF and TGFβ treatments and 2h and 6h after the IL6 stimulation. In parallel to each stimulation experiment, untreated cells were evaluated at time point 0h. As an out-of-sample condition in terms of the biological setting, cells have been treated with Insulin and measured after 1h and 3h; the corresponding negative control at time point 0h was also conducted. The different points in time were necessary because the cells respond on different time scales to the different treatments. Three biological replicates have been used for the qRT-PCR experiments. These validation experiments were performed completely in “laboratory B” in Heidelberg.

1.10.2 Existing approaches

There are two applications [LEE et al 2007, SU et al 2007] where microarray data has been used to assess traditional housekeepers or to identify new candidates. In addition, there are some approaches on the basis of qRT-PCR data. In the following, these existing approaches are briefly summarized.

Univariate nonparametric tests, i.e. the Mann-Whitney and the Wilcoxon test, were used in [BEILLARD et al 2003] to identify valid housekeeping candidates using qRT-PCR gene expression data of leukemia patients.


In a multi-center retrospective study of more than 250 patients, 14 housekeeping candidates were evaluated. Only a single gene had the desired properties, i.e. it showed a non-significant change between different groups of patients, e.g. between healthy and leukemia patients or between different leukemia subtypes.

Microarray data of the expression of classical housekeeping genes has been analyzed by an ANOVA approach in [LEE et al 2007]. Here, the measurements of a gene are sorted and a linear regression is applied. The ratio of the slope over R², i.e. over the amount of explained variation, is then used to assess a gene’s variability. An out-of-sample validation of the identified candidate control genes was attempted; however, it could be shown that their expression levels still vary out-of-sample.

The confidence interval for the differential expression or fold change Δ̂, obtained from the t-distribution, was used in [HALLER et al 2004] to determine genes which are least regulated between two experimental conditions. An equivalence test for the hypotheses

    H_0 : Δ ∉ [−a, a]   vs.   H_1 : Δ ∈ [−a, a]                                      (1.54)

has been constructed by checking whether the confidence interval

    CI_α(Δ̂) = { Δ | cdf_t⁻¹(α/2) ≤ (Δ̂ − Δ) / (σ̂ / √n) ≤ cdf_t⁻¹(1 − α/2) }          (1.55)

of the estimate Δ̂ at a level α is completely contained in the equivalence region [−a, a]:

    CI_α(Δ̂) ⊂ [−a, a] .                                                              (1.56)

For a gene which fulfills equation (1.56), it was concluded that it is certainly not regulated by more than a between the two conditions. Such genes are potential housekeeping candidates. The equivalence test methodology for the identification of housekeeping genes has two free parameters, the size of the equivalence region defined by a and the significance level α. In [HALLER et al 2004], equivalence regions of [log2(1/2), log2(2)] and [log2(1/3), log2(3)], corresponding to fold changes of a factor of two and three, respectively, have been applied to qRT-PCR data of known housekeeping genes. Genes for which H_0 is rejected, i.e. genes for which the confidence region is completely included in the equivalence region, have been identified as housekeeping candidates.
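A hedged sketch of such an equivalence screen for a single gene, using a pooled two-sample t-based confidence interval; parameter values and data are illustrative assumptions rather than the original implementation.

```python
# Equivalence test (1.54)-(1.56): a gene is a housekeeping candidate if the
# (1 - alpha) t-based confidence interval of its estimated log2 fold change
# lies entirely within the equivalence region [-a, a].
import numpy as np
from scipy import stats

def is_equivalent(x, y, a=1.0, alpha=0.05):
    """x, y: log2 expression values under two conditions."""
    n1, n2 = len(x), len(y)
    delta = x.mean() - y.mean()
    s2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(s2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    lo = delta + stats.t.ppf(alpha / 2, df) * se
    hi = delta + stats.t.ppf(1 - alpha / 2, df) * se
    return (lo >= -a) and (hi <= a)

rng = np.random.default_rng(6)
constant = rng.normal(0, 0.2, (2, 6))     # same mean in both conditions
print(is_equivalent(constant[0], constant[1], a=np.log2(2)))
```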

A so-called GeNorm approach is introduced and applied in [VANDESOMPELE et al 2002]. Here, the variability of the expression of a potential housekeeping gene I is assessed by the average standard deviation

    M_I^SD := mean_{i≠I} ( SD_j( y_{Ij} − y_{ij} ) )                                  (1.57)

of the differences to all other n_I − 1 potential housekeeping genes i, with n_I denoting the number of evaluated candidate genes. A corresponding average of the variances

    M_I^Var := mean_{i≠I} Var_j( y_{Ij} − y_{ij} )                                            (1.58)
             = mean_{i≠I} ( Var_j(y_{Ij}) + Var_j(y_{ij}) − 2 Cov_j(y_{Ij}, y_{ij}) )         (1.59)
             = mean_{i≠I} ( Var_j(y_{Ij}) + Var_j(y_{ij}) ) ,   iff uncorrelated              (1.60)
             = [ (n_I − 1) Var_j(y_{Ij}) + Σ_{i≠I} Var_j(y_{ij}) ] / (n_I − 1)                (1.61)
             = [ (n_I − 1) Var_j(y_{Ij}) + n_I mean_i Var_j(y_{ij}) − Var_j(y_{Ij}) ] / (n_I − 1)   (1.62)
             = (n_I − 2) / (n_I − 1) · Var_j(y_{Ij}) + const.                                 (1.63)

would lead to a ranking according to the variances across all experimental conditions j if the expression profiles of the candidate housekeeping genes are assumed to be uncorrelated. This shows that the GeNorm approach is closely related to a ranking according to the total variances. However, the outcome critically depends on the assumption that the evaluated genes have uncorrelated regulation and observational noise: the GeNorm approach prefers genes with small correlations to other genes. The assumption of uncorrelated changes is only reasonable for preselected housekeeping candidates. In contrast, high-throughput analyses of microarray data are expected to yield a large amount of correlated fluctuations due to the functional relationships of the genes. Therefore, an application of the GeNorm approach is problematic for microarray data.
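For completeness, a minimal sketch of the GeNorm-type stability measure M_I from equation (1.57), computed on synthetic candidate genes; names and data are illustrative.

```python
# GeNorm-type stability: for each candidate gene I, average the standard
# deviations of its pairwise log-expression differences to all others.
import numpy as np

def genorm_m(data):
    """data: candidate genes x samples (log scale). Returns M_I per gene."""
    n = data.shape[0]
    M = np.empty(n)
    for I in range(n):
        diffs = data[I] - np.delete(data, I, axis=0)   # y_I - y_i for i != I
        M[I] = diffs.std(axis=1, ddof=1).mean()
    return M

rng = np.random.default_rng(8)
data = rng.normal(size=(14, 20))          # 14 candidates, 20 samples
print(np.argsort(genorm_m(data))[:3])     # most stable candidates first
```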

In [ANDERSEN et al 2004], the model

    y_{igj} = α_{ig} + β_{gj} + ε_{igj} ,   ε_{igj} ~ N(0, σ²)                        (1.64)

with the expression level α_{ig} of gene i in the experimental condition g has been applied to determine housekeeping genes from quantitative real-time RT-PCR data. β_{gj} represents the amount of RNA in the sample j. Under the assumptions α_{ig} ~ N(0, γ²) and Σ_i α_{ig} independent of g, a stability value was defined as the average α̂_i = mean_g α̂_{ig} of the best linear unbiased predictor (BLUP) α̂_{ig} plus the standard error SE(α̂_i).

This approach constitutes a generalization of the equivalence test approach described above. Here, more than two experimental conditions can be examined. In addition, one free parameter of the equivalence test, namely the size a of the equivalence region (1.56), is eliminated by sorting the genes according to the smallest a for which the equivalence null hypothesis is rejected. The second free parameter of the equivalence test, the confidence level α, is here effectively set to 0.68 because one standard error is added to the BLUP.

Another linear model, namely

    y_{ij} = µ + G_i + T_j + ε_{ij} ,   ε_{ij} ~ N(0, σ_i²)                           (1.65)

with a global offset µ, tissue effects T_j and gene effects G_i, has been used to identify housekeeping genes in breast cancer [SZABO et al 2004]. The estimated error variance σ̂_i² has been used as a criterion to select housekeeping candidates. In addition, two extensions of model (1.65) with more complex correlation structures of ε_{ij} were applied. However, both extensions were found to be inferior with respect to the Bayesian Information Criterion.


Bootstrap estimates of the inter-quartile range were used in [SU et al 2007] to identify a new reference gene on the basis of 66 microarray data sets of lung adenocarcinoma tissues. A new candidate housekeeping gene was identified which yields an improved correlation between microarray and qRT-PCR data in comparison to the classical housekeeping gene Gapdh.

1.10.3 Methodological considerations

Traditional statistical procedures address the identification of differentially expressed genes. Here, the null hypothesis that a gene is constantly expressed is tested. Thereby, genes are identified which show a significant induction of gene expression, i.e. which clearly stand out from the noise. Such statistical tests are only capable of rejecting the hypothesis; they are not directly applicable to the converse issue. In other words, only genes which are not appropriate as housekeeping control genes are identified. For an unknown magnitude of the observational noise, it is impossible to make confident statements about whether the expression of a gene is constant, because such conclusions always depend on the accuracy and amount of the available data. Analysis of variance (ANOVA), as an example, is a well-established procedure to identify genes which are not constantly expressed across several experimental conditions. Although a significant result indicates that a gene is not suited as a housekeeping gene, an insignificant result does not allow for the conclusion that the gene is constant. In fact, the insignificant outcome could also be due to a superimposed large amount of noise and/or an insufficient number of experimental repetitions. Nevertheless, statistical models like equations (1.64) and (1.65), which underlie ANOVA, can be utilized to estimate the between-group as well as the within-group variation of the genes. The existing model-based approaches differ in the way they combine the different types of variation into a scalar utility function which is evaluated to obtain a ranking of the genes. A proper combination of several variance components requires a clear definition of the desired property of a housekeeping gene, i.e. a specification of how between- and within-experiment noise should be penalized. In our project, primary cells have been evaluated which are in general difficult to extract and prepare, and are known to have many potential sources of unintended transcriptional regulation. The intention of our project was the identification of “globally constant” control genes. It was not desired to distinguish between all potential sources of noise. Therefore, it was decided to use the total variation over all samples as a utility function to rank the genes as potential housekeeping candidates. Such a strategy is valid if the microarray data is representative, i.e. the data which is used to predict housekeeping genes has to be generated under experimental conditions comparable to the intended application. Otherwise, the housekeeping candidates would preferentially be selected for being constant under irrelevant or over-represented conditions. As an example, time course measurements after a stimulation of hepatocytes with TGFβ were intended, with three repetitions planned for different points in time. Therefore, the experimental setup for the microarray data has been chosen equivalently, namely three replicates have been measured for different times after stimulation.


Figure 1.21: Overview of the processing steps for the identification of new housekeeping candidates.

The available knowledge about the experimental conditions of the data has been utilized via ANOVA, i.e. genes which showed significant variations, e.g. a time- or dose-dependency, have been eliminated. Global effects between the experimental runs affecting all genes, like T_j in equation (1.65), did not have to be considered because they are eliminated for microarray data during the preprocessing, namely by normalization. Bootstrap resampling of the samples has been utilized to obtain results that are more robust with respect to the available samples. Further, biological knowledge gained from public databases has been used. In the following section, a detailed description of the identification procedure is provided and six new housekeeping candidates are determined.

1.10.4 Identification strategy

The sequence dependent robust multi-array analysis (GCRMA) algorithm [WU et al 2004] has been applied to process the raw microarray data. This algorithm comprises a log-transformation of the intensities, a sequence dependent background subtraction and adjustment for unspecific binding, as well as condensing of the measurements of probes belonging to the same probe set. Here, and in the whole Section 1.10, the terms “probe set” and “gene” are used interchangeably. Figure 1.21 provides an overview of all steps applied for the housekeeping candidate identification on the basis of the microarray data. Initially, the present call criterion [POUNDS & CHENG 2005] is utilized to eliminate all genes which are strongly affected by unspecific binding,


Figure 1.22: In panel (A), the empirical variances of the genes which fulfill the “present” criterion are plotted against the average signals. Signals in the medium intensity range show increased variability in comparison to the lower and upper intensity ranges. The three defined intervals are depicted in light blue. The six housekeeping candidates are plotted as red dots. Panel (B) shows the empirical frequency distribution of the genes' standard deviations. One challenge in the identification of housekeepers arises from the large number of competing features with low variability, i.e. the large number of genes with a standard deviation close to zero. The inset shows the distribution on a logarithmic scale.

or yield measurements close to the background. Genes with an “absent call” in more than 50% of the samples have been eliminated in this initial filtering step. Control genes used for qRT-PCR experiments should have an expression level similar to the genes of interest. This minimizes the risk of an intensity dependent bias. Therefore, three intensity ranges, each containing one third of all genes, have been defined to obtain housekeeping candidates with a low, a medium, and a high expression level. In panel (A) of Figure 1.22, the intervals are depicted as vertical lines. Here, the variance is plotted against the average intensity for all genes. The finally selected housekeeping candidates are highlighted with red dots. Panel (A) also shows a slight “banana”-like shape, i.e. genes in the medium intensity range have an increased minimal variability. This observation could be a hint for a detection bias at low intensities and a saturation bias occurring at large intensities. The definition of several intensity intervals is essential because otherwise only genes with very large intensities would be identified as housekeeping candidates. As a third filtering step, the 2% of the genes with the least total variance over all samples have been selected within each of the three intensity intervals. In addition, bootstrap resampling of the microarray samples was conducted to obtain bootstrap estimates of the total variance. Genes which are not within the least varying 2% in at least 95% of the resampling estimates are removed from the list of candidates. This step decreases the dependency on single data sets and ensures that only genes with an accurate estimate of the variance remain.
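The combination of variance filtering and bootstrap resampling described above can be sketched as follows; the 2% and 95% thresholds are those named in the text, while the simulated data and all names are illustrative.

```python
import numpy as np

def bootstrap_variance_filter(expr, n_boot=1000, frac=0.02, stability=0.95, seed=0):
    """Keep genes that fall into the least-varying 2% of total variances
    in at least 95% of bootstrap resamples of the arrays (columns)."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = expr.shape
    n_keep = max(1, int(frac * n_genes))
    hits = np.zeros(n_genes)
    for _ in range(n_boot):
        cols = rng.integers(0, n_samples, size=n_samples)  # resample arrays
        variances = expr[:, cols].var(axis=1, ddof=1)
        least_varying = np.argsort(variances)[:n_keep]
        hits[least_varying] += 1
    return np.where(hits / n_boot >= stability)[0]

rng = np.random.default_rng(1)
expr = rng.normal(size=(500, 20))
expr[:10] *= 0.1                      # ten genes with clearly smaller variance
print(bootstrap_variance_filter(expr))
```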


Then, ANOVA is applied to eliminate genes with a significant variation between the different conditions. Samples with the same cell preparation procedure, the same stimulation and the same time point of measurement have been considered as repetitions. Thereby, genes with a significant time-treatment interaction or a dependency on the cell preparation were removed from the list of housekeeping candidates. Here, a significance level of p < 0.05 has been used. This filtering procedure, i.e. steps one to five in Figure 1.21, leads to 41 candidates for the low intensity range, nine genes for the medium, and 23 genes for the high intensity interval. These candidates have been further evaluated with respect to available knowledge concerning their biological function. For this purpose, the Database for Annotation, Visualization and Integrated Discovery (DAVID) [HUANG et al 2007] has been utilized. This database comprises 14 categories of gene annotation, e.g. the prominent functional categories according to the Gene Ontology (GO) [THE GENE ONTOLOGY CONSORTIUM 2001, HARRIS et al 2004], the relationship of genes to signaling pathways, e.g. according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [KANEHISA & GOTO 2000, OGATA et al 1999], as well as sequence information. Genes with any conceivable functional relationship to the planned studies have been removed. In addition, the final candidate genes are intended to have distinct biological functions and sequence motifs. This minimizes the risk of unintended confounding in later applications. A reference gene needs additional biochemical properties to allow for an accurate quantification by qRT-PCR (see e.g. [UDVARDI et al 2008]). Therefore, the remaining candidates were evaluated with respect to their applicability in qRT-PCR. Finally, six genes covering the three intensity intervals have been selected for experimental validation, namely the genes encoding the solute carrier family 25 member 29 (Slc25a29), the ubiquitin-conjugating enzyme E2R 2 (Ube2r2), the ATP-binding cassette, sub-family B MDR/TAP, member 8 (Abcb8), the ankyrin repeat and FYVE domain containing 1 (Ankfy1), the ribosomal protein L31 (Rpl31), and the procollagen type III N-endopeptidase (Poln3).
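The ANOVA filtering step could, for instance, be realized by a per-gene F-test across the experimental conditions, removing genes with p < 0.05. The following sketch uses scipy's one-way ANOVA as a simplified stand-in for the factorial design described above; the data and all names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_filter(expr, condition, alpha=0.05):
    """Remove genes whose expression depends significantly on the condition.

    expr: (n_genes, n_samples); condition: label per sample (length n_samples).
    Samples sharing a label are treated as repetitions, cf. the text above.
    """
    labels = np.unique(condition)
    keep = []
    for g in range(expr.shape[0]):
        groups = [expr[g, condition == lab] for lab in labels]
        _, p = f_oneway(*groups)
        if p >= alpha:          # no significant condition dependency
            keep.append(g)
    return np.array(keep)

rng = np.random.default_rng(2)
cond = np.repeat(np.arange(4), 3)            # 4 conditions, 3 replicates each
expr = rng.normal(size=(50, cond.size))
expr[0] += cond * 2.0                        # gene 0 is clearly regulated
print(0 in anova_filter(expr, cond))         # -> False, gene 0 is filtered out
```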

1.10.5 Validation experiment

The six new housekeeping candidates determined on the basis of the microarray data were subjected to validation by qRT-PCR. The candidates have been measured together with five classical housekeeping genes, namely 18S, β-actin, Gapdh, Hprt, and TBP. In addition, three target genes, Cyclin D1, Hamp and Pai1, have been evaluated to enable a comparison of the variability of the housekeeping candidates with typical target genes. These target genes are expected to be transcribed as a response to a stimulation of the TGFβ pathway. Coefficients of variation have been calculated for the concentration estimates obtained by qRT-PCR. Figure 1.23 shows that the new housekeeping candidates tend to be more stable than the traditional ones. Only one traditional housekeeper, namely Tbp, shows the same stability as the best candidates. Tbp has indeed already been found to be a stable gene in the regenerating liver [TATSUMI et al 2008]. The traditional housekeeping gene 18S showed a very large coefficient of variation.


Figure 1.23: The coefficients of variation (CV) obtained in the validation experiment using qRT-PCR. The housekeeping candidates (labeled with a blue “C”) outperform the traditional housekeeping genes (black “H”). In addition, three target genes (red “T”) have been evaluated. Panel (A) shows an “in-sample” validation experiment, i.e. experimental conditions comparable to those of the microarray data were chosen. Panel (B) shows data obtained after stimulation with Insulin. This constitutes an out-of-sample validation because a different pathway is activated than for the identification of the housekeeping candidates. Panel (C) shows the outcome for the complete data set.

For the results displayed in panel (A), different cells but an experimental setup comparable to that of the microarray data have been used. In terms of the biological setting, this constitutes an in-sample validation. The performance obtained by in-sample validation experiments is often strongly biased if candidates have been selected out of a large number of competing features. Therefore, it was supposed that the new housekeeping candidates are more specific to the experimental setting used for the identification than the traditional housekeepers, whereas the traditional ones were expected to be applicable under more general circumstances. For an exploration of this hypothesis, the hepatocytes have been treated with Insulin. Insulin activates gene transcription via the Insulin signaling pathway and thereby constitutes a biological out-of-sample setup. Under this setting, the coefficients of variation increase for all evaluated non-target genes (see Figure 1.23, panel (B)). Again, the new candidates tend to be more stable than the classical housekeeping genes. This indicates that the housekeeping candidates are also applicable in related experimental setups. After Insulin treatment, Rpl31 and Ube2r2 are found to be superior. Panel (C) shows the performance for the complete data set. In all three settings, two candidate genes, namely Slc25a29 and Ankfy1, showed a poorer performance. Rpl31 and Ube2r2 seem to be the most promising candidates. Also Abcb8 and Pold3 showed coefficients of variation which are superior to the classical housekeeping genes. Tbp showed the most stable expression among the classical housekeepers. The largest variability was observed for 18S.


1.10.6 Discussion

For microarray data sets, a reliable identification of housekeeping genes is hampered by the large number of potential candidates. Here, hundreds or even thousands of features compete for being selected as a candidate. Panel (B) of Figure 1.22 shows a histogram of the sampling variances of all genes on the microarrays. Because the bulk of genes have similarly small variations, a reliable identification of housekeepers is at least as challenging as the converse issue, the identification of regulated genes. Therefore, biological prior knowledge about the genes should be used. Thousands of microarray studies have been performed during the last decades and many genes have been identified as being regulated under certain circumstances. Utilizing this knowledge allows for more robust decisions. In this project, biological knowledge was incorporated by the use of gene annotations which are available in public databases. Another strategy would be the direct use of publicly available experimental microarray data. Because the identification of housekeeping genes is a very general task, it is reasonable to utilize experimental data which was generated under a very wide range of different conditions. Gene expression measurements are influenced by several experimental parameters. In addition to the biological noise, the experimental procedure, namely the cell preparation, the RNA extraction, and the hybridization, introduces variations. Because some sources of noise have a common impact on several samples and genes, a non-trivial correlation structure of the noise is expected. Such correlated noise can be accounted for by assuming non-trivial covariances for the noise distribution, or by the introduction of predictor variables to adjust for known sources of noise. Such extensions are essential to obtain unbiased estimates of differential expression. The issue of adjusting for systematic errors is also discussed in Chapter 2 in the context of protein quantification. The drawback of error model extensions is that additional parameters have to be estimated from the data. This increases the variability of the estimates of all variance components. In other words, more advanced error models decrease the bias in the estimates of the variance components, but increase the variability of these estimates. This is another instance of the general bias-variance trade-off which was already discussed in Section 1.7.3 in the context of gene rankings for the identification of differentially expressed genes. For the identification of constant genes, biased estimation is a smaller issue, because the strength of any bias is related to the magnitude of the different sources of noise, which are by definition small for housekeeping genes. Even in the case of a reliable estimation with a detailed error model, the variance components have to be aggregated into a scalar objective function. In our project, this issue is resolved by the choice of an experimental design which is similar to the designs planned for the applications of the housekeepers. Then, the total variance is the natural and proper aggregation of all possibly occurring variance components. A further requirement for housekeeping genes emerges if they are intended to be used as control genes for qRT-PCR experiments. Control genes should have absolute expression levels similar to the genes of interest. This makes qRT-PCR experiments more robust against systematic errors which are related to absolute expression levels.
In our application, the genes were assigned to three intensity intervals to obtain housekeeping candidates with a low, a medium, and a


high expression. This also allowed us to avoid selecting too many of the ribosomal genes clustered in the high intensity interval. However, microarray data does in general not allow for a quantitative comparison of the intensities between different features. Although the measured intensities are correlated with the RNA abundances for the individual probe sets, the data for different probe sets cannot be compared due to different biochemical properties. For that reason, the division into three intensity ranges is a rather indefinite classification of the genes with respect to the true underlying expression level. Finally, I would like to state that, strictly speaking, the scope of the identified housekeeping candidates is restricted to the scope of the utilized knowledge. The conclusions drawn on the basis of the microarray data are restricted to the examined experimental conditions. To some extent, a more general scope is achieved by the use of biological prior knowledge, e.g. functional annotations. However, it is a priori unknown how the housekeeping candidates perform out-of-sample, i.e. under new experimental conditions. The same argumentation holds for different technological platforms. If genes are accurately measurable by a certain microarray platform, it is not ensured that the same holds for qRT-PCR. Conversely, microarray specific variance components can emerge for some genes which are in turn not relevant for qRT-PCR applications.

1.10.7 Summary

Housekeeping genes are essential as control genes in many experimental approaches. For this purpose, some traditional housekeeping genes are widely applied, although it has been shown that there are many exceptions where these genes are clearly regulated. Therefore, more robust housekeeping genes have to be identified, or the classical housekeeping genes have to be validated before they are intensively used in applications. There are some well-established statistical approaches for the identification of regulated genes. In this section, it has been argued that these procedures are not applicable to the converse issue without adaptations. The identification of constant genes is hampered by the fact that this issue cannot be directly addressed by statistical tests. In addition, many potential features compete for being selected as a candidate if microarray data is used. In contrast to the identification of marker genes, housekeepers will never stand out in analyses of experimental data. They always have to be identified out of the large pool of unremarkable features with no obvious pattern of regulation. Here, I proposed a strategy for the identification of new housekeeping candidates on the basis of microarray data which was generated in two collaborating laboratories. For this purpose, a set of 57 microarray data sets which are representative for later applications of the housekeeping control genes has been analyzed. A set of six biologically reasonable new housekeeping genes has been identified. The traditional housekeepers have been evaluated and new housekeeping candidates have been proposed. Four of them showed very stable behavior in the qRT-PCR validation experiments; they actually outperform the five evaluated classical housekeeping genes. Rpl31 and Ube2r2 showed the most


stable expression. In the future, they will be used as additional control genes in gene expression studies. For this project, I acknowledge Dr. Peter Nickel, who did the major part of the experimental work. In addition, I thank Dr. Seong-Hwan Rho, who coordinated the project and performed the functional analyses of the candidate genes.

1.11 Summary

In this chapter, the statistical analysis of microarray gene expression data has been investigated, comprising all steps from the raw data up to the identification of target genes. It has been shown that several data processing steps are required to obtain unbiased outcomes. In addition, it was shown that minor changes in the data processing strategy can lead to noticeable changes in the results. This emphasizes the need for an appropriate data processing as well as the necessity of evaluating the sensitivity of the outcomes to the data processing. The estimation and testing of microarray gene expression data on the basis of statistical models has been introduced and applied to identify the key regulatory genes during liver regeneration. It could be shown that, in a common setting for microarray studies, an identification of target genes on the basis of effect size estimates is superior to the commonly used ranking according to significance. This outcome could be interpreted in terms of the bias-variance trade-off. Further, a generally applicable workflow for the identification of housekeeping genes has been established. This strategy was applied to identify new housekeeping candidates which could be validated experimentally. In another application, microarray data could be adjusted for a cell preparation specific bias, i.e. a bias caused by formalin-fixation and paraffin-embedding, which is a major preparation procedure in histopathology. It could be shown that the adjustment improves the data quality sufficiently for clinical applications. The microarray technology was the first quantitatively working high-throughput technique in molecular biology. On the one hand, the vast amount of experimental data allowed for the application of very advanced statistical approaches. On the other hand, some well-established statistical procedures turned out to be only restrictedly applicable in the high-dimensional setting. The recent improvements of microarray technology enable a more refined investigation of transcription as well as of the genome itself. As examples, the high-throughput evaluation of splice variants of the genes is feasible, as is the identification of disease related genomic aberrations. There are also efforts to establish chip technologies for the quantification of protein abundances. Moreover, the recently developed next generation sequencing techniques constitute a fundamentally different approach to the experimental investigation of transcription. This rapid technical progress raises new methodological issues in parallel. The pace at which efficient and practically applicable statistical approaches are demanded remains a great challenge for the statistical community.


2 An error model for immunoblot data

In molecular biology, immunoblotting is a widely applied technique for protein quantification. In this chapter, statistical models are applied to immunoblotting data in order to identify sources of systematic errors like background, technical and biological variability, as well as their distribution. In contrast to a sequential data processing approach, the entire data processing is done in a single, comprehensive step to estimate the time and treatment dependency of the protein abundances. A log-transformation of the data is suggested to obtain additive normally distributed noise because the main sources of variability were identified as multiplicative and log-normally distributed. The error model accounting for technical as well as biological variability allows for a more precise estimation of the underlying dynamics of protein concentrations. In comparison to a standard data processing approach, the signal-to-noise ratio can be improved by up to a factor of ten. The superior error model has been validated out of sample. The performed procedure is very general and can also be applied to derive error models for other experimental techniques. The results of this chapter have been published in [KREUTZ et al 2007].

2.1 Introduction

Studies of protein abundance by Western blotting or immunoblotting have been widely used for biological as well as biochemical investigations [KURIEN & SCOFIELD 2006]. Immunoblotting allows the analysis of protein concentrations in cell populations without enhancing their basal expression, even for low abundant proteins. Additionally, post-translational modifications, e.g. phosphorylation, can be quantified by immunoblotting. Such modifications are crucial for the biological functions of proteins in signaling pathways. The quantitative analysis of protein phosphorylation dynamics is therefore essential for the establishment of systems biological approaches. Unfortunately, the immunoblotting technique displays a low signal-to-noise ratio and it is difficult to obtain reproducible quantitative measurements. A further source of noise is the biological variability, especially if e.g. primary cells are used for experiments. Another problem is that the common assumption of normally distributed noise is strongly violated. The distribution of measurements for given experimental conditions is described by error models. Error models can be used to detect and adjust for systematic noise. Such error reduction procedures are very common for other experimental technologies. For example, it has been shown that microarray data should be transformed to fulfill desired statistical properties [DURBIN & ROCKE 2003, HUBER et al 2002, ROCKE & DURBIN 2003]. It has also been shown that log-transformed microarray intensities have to be corrected for systematic errors [BOLSTAD et al 2003, YANG et al 2002] and that error models can be used to estimate differential expression


[IDEKER et al 2000, PAVELKA et al 2004]. The statistical analysis of high-throughput techniques like microarray intensities has been studied extensively. A challenging but necessary next step is the development of accurate statistical approaches for other experimental techniques. The determination of an appropriate error model comprises the following decisions:

• Which background correction procedure is appropriate?

• Which data transformation should be applied to obtain a desired noise distribution?

• Which sources of systematic errors exist and should therefore be accounted for in an error model?

• Which systematic errors can be modeled as normally distributed random variables?

Because these issues depend on each other, they cannot be answered separately. Therefore, several combinations of transformation, background correction and systematic errors have to be considered. For this purpose, a set of 26 error models has been investigated. The optimal error model has been determined for housekeeping proteins and is then validated out of sample for time resolved protein measurements. The superior model requires a log-transformation of the measured intensities and accounts for biological as well as experimental noise. Thereby, normally distributed residuals are obtained. It will be shown that this model improves the reproducibility of the data and increases the signal-to-noise ratio by more than a factor of ten in comparison to raw background subtracted intensities. Additionally, it is demonstrated how error models are extended to estimate the time dependency of protein concentrations after stimulation together with reliable confidence intervals. Despite a large amount of measurement noise, this approach leads to reliable time course estimates which can be further evaluated in dynamic modeling approaches.

2.2 Immunoblotting

A Western blot or immunoblot is a biochemical assay for the detection of specific proteins [BURNETTE 1981, 2009]. Western blots can be used in combination with a separation procedure like centrifugation, allowing for the investigation of sub-cellular components, e.g. the cytoplasm or the nuclei, separately. Within the experimental protocol, the cells are first extracted and prepared according to the desired experimental conditions. By lysis, the cells' membranes are broken using osmotic mechanisms to get access to the intracellular compounds. After the destruction of the proteins' secondary and tertiary structure by strongly reducing reactants, the proteins are separated according to their charge and molecular weight using gel electrophoresis, e.g. with sodium dodecyl sulfate (SDS) polyacrylamide gels. The samples are loaded in different lanes on the gel, enabling measurements of different experimental conditions in parallel. In order to make the proteins in the gel accessible for antibodies which bind specifically to certain amino acid sequences, the proteins are transferred to a membrane, e.g. a nitrocellulose or polyvinylidene fluoride (PVDF) membrane. The transfer can be done using capillary or electrostatic


Figure 2.1: Each preparation in our data set has been repeatedly measured on 8-16 gels. The quantified foreground intensities of the spots correspond to protein concentrations. The background was determined locally around the spots.

forces. For the quantification of protein amounts, the membrane is usually probed with a combination of two antibodies. The primary antibody is specific for the protein of interest and is generated in a specific host organism like rabbit or goat. After incubation with the primary antibody, a secondary antibody like anti-rabbit or anti-goat is used to detect all primary antibodies. The secondary antibody is linked to a reporter enzyme which becomes chemiluminescent after supply of an appropriate substrate. If more than one secondary antibody binds to the primary antibody, the signal is enhanced. A scanner or film is used for the quantification of the luminescence, which yields data corresponding to the measured protein amounts. Immunoprecipitation is an extension of Western blotting. Here, a second primary antibody is used to quantify protein complexes. The first primary antibody is used to fish out a certain protein sequence including all bound compounds. Then the second primary antibody is used together with the standard Western blot technique to quantify the fished protein complexes. In comparison to other experimental techniques, Western blotting and immunoprecipitation require relatively large experimental efforts. However, both techniques are among the most sensitive methods for the quantification of low protein abundances. Another benefit of Western blotting is that certain protein modifications, e.g. phosphorylations, can be detected. A disadvantage of Western blotting is that only the average protein amount of many cells, i.e. the cell population average, can be measured.

2.3 Experimental data

For this project, measurements of twelve observables, i.e. of housekeeping proteins and of proteins involved in the Insulin signaling pathway, have been used. Time courses of activated (indicated by ‘∗’) Insulin receptor (IR∗) and Insulin receptor substrate (IRS-1∗), binding of phosphoinositide kinase (PI3K) to IRS-1, as well as activation of the extracellular regulated kinases (ERK-1∗ and ERK-2∗) as functional outcome of the Insulin pathway have been measured for different Insulin stimulations.


Additionally, total IR, IRS-1, ERK-1 and ERK-2 concentrations as well as some housekeeping proteins (glycoprotein gp96, cellular heat shock cognate hsc70, β-actin) have been measured. Further, one positive control of total activation after pervanadate stimulation and one negative control without addition of Insulin have been performed within each Insulin stimulation experiment. The total concentrations and the housekeeping proteins are not affected by the Insulin treatment and are therefore considered to be constant. Altogether, 3642 data points are analyzed, comprising 2108 measurements of time-dependent events, 1123 measurements of housekeeping proteins, and 411 controls. A cell preparation consists of primary hepatocytes obtained from two murine livers. On each gel, 10-20 different probes have been measured in adjacent lanes. The probes have been loaded onto the gels in a randomized order, i.e. not in chronological order [SCHILLING et al 2005a]. Foreground as well as background intensities have been determined for each spot as indicated in Figure 2.1. Details of the cell preparation and stimulation are described in [KREUTZ et al 2007].

2.4 Additive vs. multiplicative noise

The protein concentrations as the variables of interest cannot be observed directly. Therefore, other quantities y, e.g. fluorescence intensities, which are related to the variables of interest x, are determined experimentally. Experimental techniques are usually optimized to achieve a linear dependency of y on x, e.g. to avoid saturation effects. Nevertheless, the measurements y are always affected by measurement noise. The most common measurement errors are additive. If many independent additive sources ε_i contribute to the observed noise ε = Σ_i ε_i, the measurements

y = α + βx + ε ,   ε ∼ N(0, σ_ε²)   (2.1)

are normally distributed with variance σ_ε². The constant offset α represents a systematic shift, β denotes a scaling factor. In accordance with equation (2.1), most statistical procedures assume a linear relationship between the measurements and the underlying variable of interest, affected by additive normally distributed errors. If these assumptions are violated, the statistical analyses have to be refined or, usually the easier way [ATKINSON 1981a], the measurements have to be transformed. Under weak assumptions, multiplicative noise η = Π_i ε_i leads to log-normally distributed measurement errors η ∼ e^{N(0, σ_η²)}. Multiplicative noise is often observed for non-negative data, e.g. fluorescence intensities [ECKHARD LIMPERT & ABBT 2001]. For measurements ỹ with such multiplicative errors,

ỹ = β0 + β1 x^{β2} η ,   η ∼ e^{N(0, σ_η²)}   (2.2)

holds. According to this model, ỹ − β0 is log-normally distributed with parameters β1, β2, and σ_η. A log-transformation y = log(ỹ − β0) of this data leads to equation (2.1) for log(x) with α = log(β1), β = β2, ε = log(η), and σ_ε = σ_η. After such a log-transformation, all statistical methods assuming normally distributed noise can be applied, e.g. averaging can be performed for the calculation of expectation values. Additionally,


error bars which are asymmetric for the observations ỹ become symmetric on the logarithmic scale. For immunoblotting, an error model

y̆ = β0 + β1 x (1 + η) ,   η ∼ N(0, σ_η)   (2.3)

assuming additive errors with a standard deviation proportional to the signal intensity was already introduced [SCHILLING et al 2005b, SWAMEYE et al 2003]. For small multiplicative noise, i.e. σ_η ≪ 1, this model represents a first-order approximation of the multiplicative error model (2.2) for the special case β2 = 1. This approximation exhibits intensity dependent measurement error bars. But, in contrast to model (2.2), model (2.3) yields symmetric error bars on the measurement scale, i.e. for intensities. Recapitulating, in the case of multiplicative noise a log-transformation can be used to allow the application of statistical methods which assume additive Gaussian errors. The following analyses show that immunoblotting data is dominated by multiplicative noise and that a log-transformation is sufficient to obtain Gaussian errors.
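The benefit of the log-transformation can be reproduced directly: data simulated according to equation (2.2) (with β0 = 0 and β2 = 1 for simplicity) is strongly skewed on the intensity scale, but symmetric after taking logarithms. All parameter values are illustrative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = 100.0                                   # true concentration
eta = rng.normal(0.0, 0.4, size=10000)      # log of multiplicative noise
y = x * np.exp(eta)                         # model (2.2) with beta0 = 0, beta2 = 1

print(skew(y))           # clearly positive: asymmetric on the intensity scale
print(skew(np.log(y)))   # approximately zero: symmetric (Gaussian) after log
```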

2.5 Mixed effects models

Equation (2.1) describes a linear relationship of, possibly transformed, observations with the true underlying constant protein concentrations x. In general, the variable y on the left-hand side, which represents the experimental observations, is called the response or dependent variable. The variables on the right-hand side, which are used to explain variation in the data y, are called predictor variables or independent variables. For stimulation dependent proteins, or if systematic errors should be accounted for, the model has to be extended by introducing new predictor variables, i.e. x is replaced by parameters depending on the experimental conditions, as described in this section. Overall systematic errors are accounted for in equation (2.1) by the offset α. Additional errors like a preparation or gel specific systematic shift would lead to a biased estimate of the underlying concentrations x. If the sources of such systematic influences are known, the data can be adjusted by extending model (2.1). Auxiliary new predictor variables, so-called effects, can be added on the right-hand side of (2.1). In the case of multiplicative influences, β0 in equation (2.2) can be replaced by a product of different effects. An effect represents the influence of one experimental parameter on the measurements. Finding a suitably enlarged model for the measurements enables an unbiased estimation of the underlying concentrations. Discrete experimental parameters are modeled as factorial effects. For example, gel differences can be accounted for by the estimation of gel effects G_g. The index g = 1, . . . , n_gels enumerates the different gels, and n_gels parameters G_g have to be estimated to adjust for these systematic errors. More than one index indicates that the magnitude of an effect depends on more than one experimental parameter. For factorial predictor variables, the number of parameters can be reduced if an effect is interpreted as a random variable. Then, in contrast to a fixed effect, a normally distributed random


Effect name      Type of effect            Representation in the models
Observable O     discrete, fixed           always considered
Preparation P    discrete, fixed/random    considered or discarded
Gel number G     discrete, fixed/random    always nested within P
Background B     continuous, fixed         predictor or response variable
Time effect T    discrete, fixed           not required for housekeeping proteins

Table 2.1: Properties of regarded effects. The error models differ in the way in which they account for background, protein-, preparation- and gel-specific effects.

effect could be assumed. In this case, only one parameter, i.e. the variance of the random effect, has to be estimated. In our example, a random gel effect is accounted for by estimating the gel-to-gel variability σ_gel. If the assumption of a normal distribution is fulfilled, modeling effects as random variables avoids over-parametrization, e.g. it avoids an underestimation of the effects of interest caused by an overestimation of the systematic errors. Statistical models with fixed and random effects are called mixed effects models. Mixed effects models constitute an established statistical framework for modeling multiple sources of variation [PINHEIRO & BATES 2000]. Influences of continuous variables, e.g. the background intensity, are modeled via continuous predictor variables with one regression parameter. In signal transduction, protein concentrations usually depend on time and on the stimulation. Since the exact functional relationship is unknown until an appropriate dynamic model is derived, time and stimulation cannot be modeled as continuous predictor variables. Then, the time dependent protein concentrations x in equation (2.1) are replaced by discrete time effects T_ost. The index o enumerates the different observables, e.g. proteins, s = 1, . . . , n_s(o) enumerates the different stimulation treatments and t = 1, . . . , n_t(o, s) enumerates the different times after stimulation. The number of applied stimulations n_s(o) depends on the observable, and the number of measured times n_t(o, s) depends on the observable and the applied stimulation. The matrix notation of the considered mixed effects models is described in Section 2.8. Further, a cell preparation effect P is introduced to account for the biological variability. The overall difference between the observables, e.g. caused by distinct specificities of the antibodies, is modeled via an observable effect O_o. Table 2.1 lists the considered effects for our models. Different combinations of the regarded effects and varying assumptions about the distribution of the effects yield different error models. The experimental data will be utilized in the following to discriminate between the competing models. For this purpose, several methods are introduced in the next section to decide which effects are required and which effects have to be considered as fixed or random variables.
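A mixed effects model of this kind can, for instance, be fitted with the statsmodels package. The sketch below uses a single random intercept for the preparation as a simplified stand-in for the nested preparation/gel structure discussed in the text; the simulated data and all names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_prep, n_rep = 8, 6
prep = np.repeat(np.arange(n_prep), n_rep)
observable = rng.choice(["proteinA", "proteinB"], size=prep.size)
prep_effect = rng.normal(0, 0.5, n_prep)[prep]       # random preparation effect
y = (1.0 + (observable == "proteinB") * 0.8          # fixed observable effect
     + prep_effect
     + rng.normal(0, 0.2, prep.size))                # observational noise

data = pd.DataFrame({"y": y, "observable": observable, "prep": prep})
# fixed effect: observable; random intercept: preparation
model = smf.mixedlm("y ~ observable", data, groups=data["prep"])
fit = model.fit()
print(fit.summary())
```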

2.6 Assessing required effects

The Akaike Information Criterion (AIC) [AKAIKE 1974, SAKAMOTO et al 1986] and the Bayes' Information Criterion (BIC) [SCHWARZ 1978] are used to assess the relative fit of competing


error models. Both AIC and BIC are very well established criteria for model discrimination [KONISHI & KITAGAWA 1996]. The Akaike Information Criterion is defined as

AIC = −2 log(L(y|M)) + 2 n_par   (2.4)

where L is the likelihood function, i.e. the probability of the data y given an error model M with n_par parameters. The Bayes' Information Criterion

BIC = −2 log(L(y|M)) + n_par log(n)   (2.5)

is similarly defined but takes into account the number of data points n. Small values of these criteria are preferable; they are obtained by a large value of the likelihood function and a small number of parameters. Usually, BIC tends to prefer smaller models than AIC, especially for a large number of data points. Analysis of variance (ANOVA) is used to check which fixed effects are not capable of significantly explaining variance of the measurements. Such effects should be withdrawn from an error model. Otherwise, overfitting by too many parameters occurs and the size of the confidence intervals of the estimated parameters inflates, especially in the case of multicollinear effects [MARKOVITZ 2005]. For random effects, likelihood-ratio tests [PINHEIRO & BATES 2000] and confidence intervals of the estimated parameters are used to check significance. Signal-to-noise ratios

SNR = sd(predictions) / sd(residuals)   (2.6)

allow a better interpretation of unexplained variability than AIC, BIC or the p-values of ANOVA and are therefore calculated for the regarded error models, too. The prediction of a model is its right-hand side, i.e. the sum of all predictor variables in equation (2.1) or (2.2). A primary goal of the analysis is an accurate estimation of the concentration dynamics, i.e. of the time effects. Therefore, models leading to an accurate estimation of the dynamics are preferable, i.e. the standard errors SE(T_ost) are intended to be small in comparison to the estimated responses max_t(T_ost) − min_t(T_ost) after stimulation. For time dependent proteins, a robust median time response has been defined as

TR = median_os [ ( max_t(T_ost) − min_t(T_ost) ) / median_t( SE(T_ost) ) ] .   (2.7)

Models with a small TR are not informative for the estimation of time courses. The issue of an accurate but unbiased estimation is an instance of the bias-variance trade-off. Error models which do not adjust for noticeable systematic errors yield biased and, in many circumstances, also noisy estimates. On the other hand, error models with a large number of parameters to account for possible sources of noise yield inflated errors for the parameter estimates, i.e. large confidence intervals for the estimates of the time effects. Although a main purpose of our approach is the determination of the time effects, the time response TR is not used to select an appropriate model. Thereby, an underestimation of the confidence intervals by in-sample optimization of the model is avoided. Nevertheless, the time response constitutes an important property to validate the efficiency of a predetermined model.
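For a model with i.i.d. Gaussian residuals, the criteria (2.4)-(2.6) reduce to a few lines; the sketch below assumes the maximum likelihood estimate of the noise variance and is meant as an illustration, not as the original implementation.

```python
import numpy as np

def model_criteria(y, predictions, n_par):
    """AIC, BIC and SNR according to equations (2.4)-(2.6),
    assuming i.i.d. Gaussian residuals with the ML variance estimate."""
    resid = y - predictions
    n = y.size
    sigma2 = np.mean(resid ** 2)                       # ML noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    aic = -2 * loglik + 2 * n_par
    bic = -2 * loglik + n_par * np.log(n)
    snr = predictions.std(ddof=1) / resid.std(ddof=1)  # equation (2.6)
    return aic, bic, snr

rng = np.random.default_rng(0)
pred = rng.normal(size=200)
y = pred + rng.normal(0, 0.5, 200)
print(model_criteria(y, pred, n_par=3))
```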


Figure 2.2: Overview of the applied analysis steps (flowchart: measurements of housekeeping proteins and experimental parameters → first analyses (distribution of intensities, residuals of replicates) → set of possible models → ranking of all models, superior model → estimation of time courses using the measurements of dynamic proteins, validation by a new ranking). First analyses of the housekeeping proteins suggest a log-transformation and the use of ratios foreground over background as response. In combination with different experimental parameters, 26 error models are introduced and evaluated. The superior model is used for the determination of the time courses.

In addition, a leave-one-out cross-validation procedure [EFRON 1987] is applied to determine the predictive power of the models. Here, a model is iteratively fitted to all but one data point. This data point is then predicted by the fitted model. The accuracy of the out-of-sample predictions is measured via the correlation between the predicted and the measured data and is utilized to evaluate the generalization error of the model. Under-parametrized or over-parametrized models would result in poor predictions. A consistency check of an error model is that the observed residuals correspond to the assumed distribution of the measurement errors. Differences between the observed and the theoretical distributions are assessed by a Kolmogorov-Smirnov test [CONOVER 1971]. On the one hand, such a statistical test provides p-values p_ks to assess significance; on the other hand, every error model will asymptotically, i.e. for large sample sizes, be rejected by a test, because theoretically assumed distributions are never realized exactly. Nevertheless, the order of magnitude of the p-values allows for comparisons of different models.
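The leave-one-out procedure can be sketched as follows, using ordinary least squares as a simplified stand-in for the mixed effects models considered here; each data point is predicted from a fit to the remaining points, and the correlation between predictions and measurements quantifies the predictive power.

```python
import numpy as np

def loo_correlation(X, y):
    """Leave-one-out cross-validation for a linear model y = X beta + eps.
    Returns the correlation between LOO predictions and measurements."""
    n = y.size
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        preds[i] = X[i] @ beta
    return np.corrcoef(preds, y)[0, 1]

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 40)
print(loo_correlation(X, y))
```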

2.7 Results

The benefits of log-transformation and multiplicative background correction are described in Section 2.7.1. In Section 2.7.2, a simulation study is performed to analyze the multiplicative background correction. In Section 2.7.3, models are fitted to the housekeeping protein measurements to determine the most appropriate error model. The assumption that the housekeeping proteins are really constant, i.e. independent of time and treatment, is validated in Section 2.7.4. The superior model is used to estimate time courses in Section 2.7.5. Additionally, the selected error model is validated out of sample using data from the time dependent proteins. Figure 2.2 gives a schematic overview of the major analysis steps.


Figure 2.3: Foreground intensities are correlated with background intensities. These correlations cannot be removed by background subtraction. This is indicated by the arrows.

2.7.1 Background correction and response variables

Background constitutes a systematic bias of the measurements. In immunoblotting, signals S are usually calculated by background subtraction

S = F − B   (2.8)

to eliminate this bias. However, if the background measurements are very noisy, this step may introduce additional variability. Then, a trade-off between bias and variance occurs: background correction reduces the bias but increases the variance. Depending on the strength of the background and on the precision of the background measurements, it could in some cases be preferable to abandon background correction. If foreground and background intensities are strongly correlated, e.g. by a common multiplicative error, it could be superior to apply a background adjustment by the calculation of ratios

R = F/B .   (2.9)

The ratios R are better correlated with the underlying true protein concentrations if there is a strong common multiplicative effect in the foreground as well as in the background intensities. This circumstance is illustrated by a simulation study in Section 2.7.2. Indeed, Figure 2.3 reveals that the measured foreground and background intensities are strongly correlated. Background subtraction does not remove these correlations completely, as indicated by the arrows in Figure 2.3. To check the hypothesis that background correction by division is feasible, repeated measurements of the housekeeping proteins on the same gel have been used to assess reproducibility.
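The reproducibility comparison can be emulated with simulated replicate spots that share a common multiplicative gel effect; under this assumption, the coefficient of variation of the ratio F/B is the smallest of the three responses, while background subtraction even increases the variability slightly. All parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
common = np.exp(rng.normal(0, 0.3, n))              # shared multiplicative gel effect
F = 50.0 * common * np.exp(rng.normal(0, 0.1, n))   # foreground replicates
B = 10.0 * common * np.exp(rng.normal(0, 0.1, n))   # background replicates

def cv(z):
    return z.std(ddof=1) / z.mean()

print("F:    ", cv(F))        # raw foreground
print("F - B:", cv(F - B))    # subtraction: slightly larger CV than raw F
print("F / B:", cv(F / B))    # ratio: smallest CV, common effect cancels
```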


Figure 2.4: Ratios foreground over background yield smaller coefficients of variation for measurements on the same gel (left panel) and for measurements of the same preparation (right panel). In fact, a subtraction of the background even increases the variability in comparison with uncorrected raw foreground intensities. The boxes around the medians indicate the 50% quantiles of the observed relative residuals.

Figure 2.4 shows the relative variability for raw foreground intensities, for background subtracted intensities, as well as for ratios foreground over background. The latter show the smallest variability, around 14% for measurements on the same gel and 19% within the same preparation. In contrast, raw foreground intensities have a variability of around 27% and 40%, respectively. Propagation of the errors in foreground and background intensities leads to a variability of the signals obtained by background subtraction of around 38% within gels and 57% within preparations. The decrease in variability through multiplicative background correction is a consequence of the inhomogeneity of the gels, which appears to be a multiplicative influence on foreground as well as on background intensities. Finally, besides raw foreground intensities and signals, the ratios R as well as log(S) and log(R) are considered as response variables. In addition to the six response variables shown in the upper part of Table 2.2, the background can be adjusted by estimating a regression parameter b between foreground and background intensities (lower part of Table 2.2). This proportionality constant can be fitted on the intensity scale or on the log-scale. The special case b = 1 corresponds to background subtraction on the intensity scale and coincides with intensity ratios for log-transformed intensities. For measurements of housekeeping proteins, the underlying protein concentrations are constant, i.e. independent of time and stimulation. Therefore, repeated observations can be used to determine the distribution of the technical and biological noise. The frequencies of the measured intensities for housekeeping proteins are in agreement with a log-normal distribution but disagree with a normal distribution (Figure 2.5). The parameters of the normal and the log-normal distribution


Response y                  Abbreviation   Background correction
Foreground                  F              none
Signal                      S              F − B
Signal ratio                R              F/B
log(foreground)             log(F)         none
log(signal)                 log(S)         log(F − B)
log(signal ratio)           log(R)         log(F) − log(B)
Foreground, b fitted        -              F − bB
log(foreground), b fitted   -              log(F) − b log(B)

Table 2.2: Overview of the considered response variables of the evaluated error models. The response variables differ in the log-transformation, or in the way of accounting for the background.

Figure 2.5: The measured distribution of foreground intensities agrees with a log-normal distribution. Its asymmetry contradicts a normal distribution.

are estimated by the arithmetic mean and sampling variance of the raw and the log-transformed measurements. Another possibility to assess the assumptions about the noise distribution is a plot of the theoretical quantiles against the observed quantiles. If the distribution of the observations coincides with the theoretical distribution, such a quantile-quantile plot or qq-plot results in a straight line with slope one. A deviation from a straight line is a qualitative criterion to assess the assumption of normally distributed noise. QQ-plots of raw and log-transformed foreground intensities against the normal distribution confirm that the log-transformation is required in order to obtain normally distributed measurements (see Figure 2.6). Additionally, log-transformed intensities agree better with the assumption of normally distributed noise. The Kolmogorov-Smirnov test leads to p_ks = 0.22 for testing against a log-normal


Figure 2.6: A qq-plot of the foreground intensities without and after log-transformation against the normal distribution shows the benefit of the transformation.

Response variable   Abbreviation   Kolmogorov-Smirnov test
Foreground          F              p < 1e−19
Signals             S              p = 2.2e−19
Signal ratios       R              p = 1.8e−8
log-foreground      log(F)         p = 0.224
log-signals         log(S)         p = 0.0014
log-ratios          log(R)         p = 0.00027

Table 2.3: Comparison of the repeated measurements of the housekeeping proteins with a normal distribution. A Kolmogorov-Smirnov test shows that the log-transformation is required to obtain normally distributed noise.

distribution and rejects the hypothesis of normally distributed noise with p_ks < 10^−19. The p-values in Table 2.3 obtained after a log-transformation are orders of magnitude larger than those of the untransformed responses. This indicates that the main sources of the noise are multiplicative.
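The Kolmogorov-Smirnov comparison can be reproduced for simulated intensities by testing the raw and the log-transformed values against a fitted normal distribution. Note that estimating the parameters before testing makes the p-values optimistic; for comparing orders of magnitude, as done here, this is sufficient. The data is illustrative.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
raw = np.exp(rng.normal(4.0, 0.5, 2000))       # log-normally distributed intensities

for name, z in [("raw", raw), ("log", np.log(raw))]:
    z_std = (z - z.mean()) / z.std(ddof=1)     # standardize with fitted parameters
    print(name, kstest(z_std, "norm").pvalue)  # tiny for raw, large for log
```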

2.7.2 Simulation study

The measured intensity is proportional to the number of emitted photons. This means that the intensities are additive, i.e. the measured total intensity is the sum over all sources of light. Therefore, the scanned foreground intensity is a sum of the background and the concentration dependent signal intensity, and the evident background correction procedure for estimating the signal S would be a background subtraction Ŝ = F − B.


However, the error model approach shows that a multiplicative adjustment for the background is preferable because immunoblotting suffers from several systematic multiplicative errors. These errors affect both the foreground and the background intensities. For such correlated multiplicative noise, signal ratios R = F/B of foreground over background lead to a more accurate estimation of the underlying concentrations than background subtraction. A simulation study is performed in this section to illustrate that multiplicative background correction is superior if strong multiplicative noise affects both the foreground and the background intensities. The foreground measurements are simulated according to

F = rS + B   (2.10)

with a signal-to-background ratio parameter r. The signals

S = x γ e^{η1} + x (1 − γ) e^{η2} + ε1 ,   η_i ∼ N(0, σ_{ηi}²), i = 1, 2 ,   ε1 ∼ N(0, σ_{ε1}²)   (2.11)

consist of multiplicative log-normally distributed noise e^{η1} and e^{η2} and additive Gaussian noise ε1. It is assumed that η1 is the common error which also affects the background. The parameter γ ∈ [0, 1] is used to control the proportion of this common error. The true underlying protein concentration is denoted by x. Background intensities are simulated according to

B = γ e^{η1} + (1 − γ) e^{η3} + ε2 ,   η3 ∼ N(0, σ_{η3}²) ,   ε2 ∼ N(0, σ_{ε2}²)   (2.12)

where η3 denotes the uncorrelated fraction of multiplicative noise, corresponding to η2 for the signals. The intention of a background correction procedure is to improve the determination of the underlying protein dynamics. Correlations with the underlying true protein concentrations are used to compare background subtraction with signal ratios. For positive differences

∆c = cor(R, x) − cor(F − B, x)   (2.13)

of the correlations, signal ratios are the more accurate estimators of the underlying protein concentration dynamics. Figure 2.7 shows ∆c obtained by simulations for different values of the parameter γ. For small γ, foreground and background intensities are only weakly correlated and the signals F − B should be used for background correction. However, for large γ, a background correction by division is superior. For our simulation, above a threshold of γ = 0.65, corresponding to a correlation between foreground and background of around 0.7, the signal ratios are more accurate than the background subtracted intensities. The simulation study has been performed for realistic parameters, e.g. for the amount of the additive and the multiplicative noise. The assumed parameters are summarized in Table 2.4. The obtained result depends only weakly on the exact choice of these parameters. Since the ratios R diverge for background intensities close to zero, the lower 1% quantile of the background intensities has been excluded. This assumption is in agreement with our experiments because the magnitude of the smallest intensities has been observed to be sufficiently larger than zero, and no spot yielded background intensities larger than the corresponding foreground.
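The complete simulation defined by equations (2.10)-(2.12), including the exclusion of the lowest 1% of the background intensities and the comparison (2.13), fits into a short script; the parameter values follow Table 2.4.

```python
import numpy as np

def delta_c(gamma, N=1000, r=10.0, seed=0):
    """Difference of correlations, equation (2.13), for one simulated data set."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, N)                         # true concentrations
    eta1, eta2, eta3 = rng.normal(0, 1, (3, N))     # multiplicative noise terms
    eps1, eps2 = rng.normal(0, 0.1, (2, N))         # additive noise terms
    S = x * gamma * np.exp(eta1) + x * (1 - gamma) * np.exp(eta2) + eps1   # (2.11)
    B = gamma * np.exp(eta1) + (1 - gamma) * np.exp(eta3) + eps2           # (2.12)
    F = r * S + B                                                          # (2.10)
    keep = B > np.quantile(B, 0.01)                 # exclude lowest 1% backgrounds
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    R = F[keep] / B[keep]
    return corr(R, x[keep]) - corr(F[keep] - B[keep], x[keep])             # (2.13)

for gamma in (0.0, 0.5, 0.8, 1.0):
    dc = np.mean([delta_c(gamma, seed=s) for s in range(100)])
    print(f"gamma = {gamma:.1f}: mean delta_c = {dc:+.3f}")
```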



Figure 2.7: For γ > 0.65, which corresponds to a correlation of foreground and background of about 0.7, the signal ratios are better correlated to the underlying protein concentrations than the signals obtained after background subtraction.

Parameter   Description                      Value
r           Signal to background ratio       10
n           Number of simulations            100
N           Number of data points            1000
γ           Fraction of common error         {0, 0.1, . . . , 1}
x           True concentrations              ∼ N(0, 1)
ε1          Additive noise in signals        ∼ N(0, 0.1²)
ε2          Additive noise in background     ∼ N(0, 0.1²)
η1          Common multiplicative noise      ∼ N(0, 1)
η2          Multiplicative noise of S        ∼ N(0, 1)
η3          Multiplicative noise of B        ∼ N(0, 1)

Table 2.4: Chosen parameter setting for the simulation study.
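For concreteness, the following Python sketch reproduces the comparison of equation (2.13). It is an illustration only: it assumes the parameter values of Table 2.4, and all variable names are chosen here for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_sim, N = 10, 100, 1000          # signal-to-background ratio, simulations, data points

def delta_c(gamma):
    """Average correlation difference cor(R, x) - cor(F - B, x), eq. (2.13)."""
    diffs = []
    for _ in range(n_sim):
        x = rng.normal(0, 1, N)                        # true concentrations
        eta1, eta2, eta3 = rng.normal(0, 1, (3, N))    # multiplicative noise terms
        eps1, eps2 = rng.normal(0, 0.1, (2, N))        # additive noise terms
        S = x * gamma * np.exp(eta1) + x * (1 - gamma) * np.exp(eta2) + eps1  # eq. (2.11)
        B = gamma * np.exp(eta1) + (1 - gamma) * np.exp(eta3) + eps2          # eq. (2.12)
        F = r * S + B                                                         # eq. (2.10)
        keep = B > np.quantile(B, 0.01)                # exclude lowest 1% of backgrounds
        ratio = F[keep] / B[keep]
        diffs.append(np.corrcoef(ratio, x[keep])[0, 1]
                     - np.corrcoef(F[keep] - B[keep], x[keep])[0, 1])
    return float(np.mean(diffs))

for gamma in np.arange(0.0, 1.01, 0.1):
    print(f"gamma = {gamma:.1f}  delta_c = {delta_c(gamma):+.3f}")
```

For large γ the printed differences become positive, i.e. the ratio-based correction outperforms the subtraction, in line with Figure 2.7.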

2.7.3 Error model selection for housekeeping proteins

Error models accounting for different systematic influences on the measurements, like the observable effect O for the different proteins, the preparation effect P accounting for biological variability, and the gel effect G, are fitted to our data using the response variables shown in Table 2.2. Altogether, 26 combinations are considered as hypothetical error models. The random effects are denoted by ε^(1), . . . , ε^(4), the observational noise by ε.

No.  Model                                                                      npar    AIC     BIC    p_KS     SNR    corCV
 1   F∗ = O_o + ε∗                                                                6   39100   39100   5.4e-10  0.34   0.12
 2   F∗ = O_o + P_p + G_pg + ε∗                                                  93   38500   39000   3.3e-10  1.1    0.68
 3   F∗ = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                           8   38500   38500   1.1e-10  1.0    0.27
 4   S∗ = O_o + ε∗                                                                6   38000   38100   3.7e-13  0.41   0.36
 5   S∗ = O_o + P_p + G_pg + ε∗                                                  93   37600   38100   5.9e-8   1.0    0.61
 6   S∗ = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                           8   37500   37600   3.2e-8   0.93   0.30
 7   R∗ = O_o + ε∗                                                                6    2510    2540   3.7e-7   0.78   0.70
 8   R∗ = O_o + P_p + G_pg + ε∗                                                  93    1990    2460   2.7e-6   1.4    0.79
 9   R∗ = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                           8    2120    2160   1.3e-5   1.3    0.67
10   log(F∗) = O_o + ε∗                                                           6    2300    2330   0.38     0.31   0.13
11   log(F∗) = O_o + P_p + G_pg + ε∗                                             93    1450    1920   0.016    1.3    0.78
12   log(F∗) = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                      8    1620    1660   0.029    1.2    0.34
13   log(S∗) = O_o + ε∗                                                           6    3120    3150   4.8e-4   0.36   0.35
14   log(S∗) = O_o + P_p + G_pg + ε∗                                             93    2580    3050   6e-6     1.1    0.65
15   log(S∗) = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                      8    2720    2760   2.3e-5   0.98   0.40
16   log(R∗) = O_o + ε∗                                                           6     593     623   0.84     0.78   0.57
17   log(R∗) = O_o + P_p + G_pg + ε∗                                             93     -95     372   0.016    1.6    0.78
18   log(R∗) = O_o + ε^(1)_p + ε^(2)_pg + ε∗                                      8      75     115   0.023    1.5    0.43
19   F∗ = O_o + b B∗ + ε∗                                                         7   37600   37600   7.8e-6   1.8    0.87
20   F∗ = O_o + b B∗ + P_p + G_pg + ε∗                                           94   37200   37600   9.5e-7   2.5    0.95
21   F∗ = O_o + b B∗ + ε^(1)_p + ε^(2)_pg + ε∗                                    9   37100   37200   3.1e-6   2.5    0.89
22   F∗ = O_o + (b + ε^(3)_p + ε^(4)_pg) B∗ + ε^(1)_p + ε^(2)_pg + ε∗            13   36900   36900   7.0e-9   3.0    0.84
23   log(F∗) = O_o + b log(B∗) + ε∗                                               7     566     601   0.96     2.0    0.85
24   log(F∗) = O_o + b log(B∗) + P_p + G_pg + ε∗                                 94     -93     379   0.015    3.1    0.95
25   log(F∗) = O_o + b log(B∗) + ε^(1)_p + ε^(2)_pg + ε∗                          9      83     128   0.032    3.1    0.88
26   log(F∗) = O_o + (b + ε^(3)_p + ε^(4)_pg) log(B∗) + ε^(1)_p + ε^(2)_pg + ε∗  13     -98     -33   0.008    3.7    0.90

Table 2.5: Performance of the error models for the measurements of the housekeeping proteins. A '∗' is used as an abbreviation for all occurring indices in a model, e.g. the indices of all predictor variables and an index for replicate measurements. A log-transformation improves the performance of the models. Model 26 is superior for three out of the five criteria.

The different model assessment criteria introduced in Section 2.6 are displayed in Table 2.5 for all 26 models. The Akaike Information Criterion and the Bayes Information Criterion are superior for log(R) as response, or for log(F) as response together with a fitted relationship to the logarithmic background log(B) (models 16-18 and 23-26). This advantage does not strongly depend on the considered fixed and random effects. Additionally, the residuals of these models are more consistent with a normal distribution, as indicated by the orders of magnitude larger p-values obtained by Kolmogorov-Smirnov tests. Although both response variables show similar performance regarding the model selection criteria and the distribution of the residuals, the models with a fitted background on the log-scale (models 23-26) are also superior concerning the signal-to-noise ratio and the cross-validation procedure. Therefore, a log-transformation of the measured intensities in combination with a background correction on the log-scale is recommended. Additionally, there are strong indications that the impact of the background differs from gel to gel. The best performance concerning the Akaike and Bayes criteria as well as the signal-to-noise ratio according to equation (2.6) is obtained for model No. 26,

log(F_opgi) = O_o + (b + ε^(3)_p + ε^(4)_pg) log(B_opgi) + ε^(1)_p + ε^(2)_pg + ε_opgi ,   ε_opgi ∼ N(0, σ²) ,    (2.14)


Model                                 npar    AIC    BIC    p_KS     SNR
Model 26                                13    -98    -33    0.008    3.7
Model 26 + treatment effects            34   -9.1    162    0.0019   3.7
Model 26 + time effects                 92    268    735    0.18     3.9
Model 26 + time and treat. effects     245    705   1981    0.0056   4.2

Table 2.6: Model assessment criteria for testing the assumption of time and treatment independence of the housekeeping proteins. According to the AIC and BIC, a model without time and treatment effects is preferred. Therefore, the choice of the housekeeping proteins based on our biological prior knowledge is valid.

with random preparation and gel effects

ε^(1)_p ∼ N(0, σ1²) ,   ε^(2)_pg ∼ N(0, σ2²) ,

as well as a random preparation- and gel-specific contribution to the background correction

ε^(3)_p ∼ N(0, σ3²) ,   ε^(4)_pg ∼ N(0, σ4²) .

This model requires only 13 parameters because the background dependency as well as the variability between preparations and gels are modeled as random effects, each with only a single parameter. As a consistency check, an ANOVA has been performed to assess the necessity of the terms in equation (2.14). The ANOVA of the fixed effects yields significant p-values. Additionally, the 95% confidence intervals of the estimated random effects do not contain the value zero. Both outcomes indicate that the preparation and gel effects should be accounted for if immunoblotting is used as a quantitative technique. This tendency is confirmed for the other regarded response variables. In addition to the discussed experimental parameters, no improvement could be observed for effects accounting for the size of the spots or for lane specific differences within gels. The spot size effects show strong correlations with the underlying concentrations and therefore lead to inflated standard errors of the estimated protein concentrations. Lane effects are partly accounted for by the locally determined background and are only poorly identifiable due to the limited number of observables within a lane.
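To make the structure of such a fit concrete, the following Python sketch estimates a model of the form of equation (2.14) with statsmodels' MixedLM (the thesis relies on REML estimation of mixed effects models in the sense of Pinheiro & Bates and Harville). The simulated data frame and all column names are illustrative assumptions; note that, unlike the diagonal covariance assumed above, this sketch allows the per-preparation intercept and slope to be correlated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate data with the structure of eq. (2.14): 4 preparations, 3 gels each,
# 5 observables, 4 replicates; all parameter values are arbitrary
rng = np.random.default_rng(0)
rows = []
for p in range(4):
    e1, e3 = rng.normal(0, 0.3), rng.normal(0, 0.05)       # eps(1)_p, eps(3)_p
    for g in range(3):
        e2, e4 = rng.normal(0, 0.3), rng.normal(0, 0.05)   # eps(2)_pg, eps(4)_pg
        for o in range(5):
            for _ in range(4):
                logB = rng.normal(5.0, 0.5)
                logF = 0.2 * o + (1.0 + e3 + e4) * logB + e1 + e2 + rng.normal(0, 0.1)
                rows.append((logF, logB, f"O{o+1}", p, f"p{p}g{g}"))
df = pd.DataFrame(rows, columns=["logF", "logB", "observable", "prep", "gel"])

# fixed effects O_o and b; random intercept and logB slope per preparation;
# gel intercept and gel-specific background slope as variance components
vc = {"gel": "0 + C(gel)", "gel_slope": "0 + C(gel):logB"}
fit = smf.mixedlm("logF ~ 0 + C(observable) + logB", df, groups="prep",
                  re_formula="1 + logB", vc_formula=vc).fit(reml=True)
print(fit.summary())
```

The summary lists the estimated regression coefficient of log(B) together with the variance components for preparations and gels, mirroring the interpretation of equation (2.14).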

2.7.4 Validation of the housekeeping assumption

β-actin, gp96 and hsc70 are widely used housekeeping proteins [LI et al 2002, PICARD 2002, SCHILLING et al 2005b, SUZUKI et al 2000] and are therefore considered as constant. In addition, the total Insulin receptor and the total Insulin receptor substrate concentrations are considered as constant because there are no biological indications that either molecule's concentration is changed in mouse hepatocytes within the first hour after Insulin stimulation. Based on this biological prior knowledge, β-actin, gp96, hsc70, IRtotal and IRS-1total were used to determine the error model in the preceding Section 2.7.3.



Observable   Description       IP      Antibody   Insulin treatment
O1           Activated IR      IR      Py20       1e-7 M
O2           Activated IR      IR      Py20       1e-5 M
O3           Activated IRS-1   IRS-1   Py20       1e-7 M
O4           Activated IRS-1   IRS-1   Py20       1e-5 M
O5           PI3K binding      IRS-1   PI3K       1e-7 M
O6           PI3K binding      IRS-1   PI3K       1e-5 M
O7           Activated ERK-1   -       ppERK-1    1e-7 M
O8           Activated ERK-1   -       ppERK-1    1e-5 M
O9           Activated ERK-2   -       ppERK-2    1e-7 M
O10          Activated ERK-2   -       ppERK-2    1e-5 M

Table 2.7: Overview of the measured time dependent observables of the Insulin signaling pathway.

To validate the assumption that the considered housekeeping proteins are indeed independent of the stimulation and constant over time, the obtained superior error model (2.14) for the housekeeping proteins is extended by time and treatment effects. Table 2.6 shows that the AIC and BIC are clearly superior for the model without time and treatment effects. The p-values obtained by a Kolmogorov-Smirnov test indicate that the violations of the assumption of normally distributed residuals are similar for the four considered models. Because the variance of the residuals is always decreased by an enlargement of model 26, the signal-to-noise ratio SNR is not very meaningful for comparing the extended models with their submodel 26.

2.7.5 Application to the time course measurements

To illustrate the benefits of the log-transformation and of the elimination of the systematic noise, the error models are now applied to estimate concentration time courses in Insulin signaling. Therefore, the models which have been evaluated for the housekeeping proteins are extended by time effects T_ost. The dependency of these time effects on the observable and on the stimulation is indicated by the indices o and s. The extension of model (2.14), which was the superior model for the housekeeping proteins, yields

log(F_ostpgi) = O_o + T_ost + (b + ε^(3)_p + ε^(4)_pg) log(B_ostpgi) + ε^(1)_p + ε^(2)_pg + ε_ostpgi .    (2.15)

Because a change of the observable effects can be compensated by the time effects, the observable effects and the time effects are not completely identifiable. This issue is resolved by the biologically reasonable constraint T_ost0 = 0. Thereby, the observable effects O_o are defined as the intensity shifts measured at the initial time point t0 = 0. In other cases, where a basal activity T_ost0 ≠ 0 would be biologically reasonable, positive controls or calibration data would have to be utilized to circumvent this constraint. Table 2.8 shows the model assessment criteria for the time dependent proteins.



No.  Model                                                                              npar    AIC     BIC    p_KS      SNR    TR
 1'  F∗ = O_o + T_ost + ε∗                                                                74   56500   56900   1.6e-58   0.28   2.1
 2'  F∗ = O_o + T_ost + P_p + G_pg + ε∗                                                  185   54400   55400   8.6e-26   2      7.4
 3'  F∗ = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                           76   52300   52700   4.6e-28   1.9    6.5
 4'  S∗ = O_o + T_ost + ε∗                                                                74   55300   55700   3.4e-83   0.3    2.4
 5'  S∗ = O_o + T_ost + P_p + G_pg + ε∗                                                  185   53700   54700   2.4e-38   1.6    5.1
 6'  S∗ = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                           76   51700   52100   9.5e-39   1.6    5.8
 7'  R∗ = O_o + T_ost + ε∗                                                                74    3130    3530   3.2e-13   0.59   4.3
 8'  R∗ = O_o + T_ost + P_p + G_pg + ε∗                                                  185    2740    3730   7.4e-5    1      6.8
 9'  R∗ = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                           76    2980    3390   1.3e-5    0.92   8.2
10'  log(F∗) = O_o + T_ost + ε∗                                                           74    3900    4290   0.00014   0.47   3.5
11'  log(F∗) = O_o + T_ost + P_p + G_pg + ε∗                                             185    2630    3620   0.02      1.5    8.2
12'  log(F∗) = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                      76    2870    3280   0.12      1.4    8.6
13'  log(S∗) = O_o + T_ost + ε∗                                                           74    4860    5260   0.0019    0.59   5.4
14'  log(S∗) = O_o + T_ost + P_p + G_pg + ε∗                                             185    4100    5080   0.0013    1.3    8.8
15'  log(S∗) = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                      76    4260    4670   0.00011   1.2    9.5
16'  log(R∗) = O_o + T_ost + ε∗                                                           74     807    1200   0.031     0.62   5.1
17'  log(R∗) = O_o + T_ost + P_p + G_pg + ε∗                                             185     430    1420   0.77      1      8.0
18'  log(R∗) = O_o + T_ost + ε^(1)_p + ε^(2)_pg + ε∗                                      76     773    1180   0.68      0.93   8.8
19'  F∗ = O_o + T_ost + b B∗ + ε∗                                                         75   54700   55100   3.6e-60   1.6    2.6
20'  F∗ = O_o + T_ost + b B∗ + P_p + G_pg + ε∗                                           186   53700   54700   2.8e-40   2.6    4.2
21'  F∗ = O_o + T_ost + b B∗ + ε^(1)_p + ε^(2)_pg + ε∗                                    77   51600   52100   6e-41     2.5    3.5
22'  F∗ = O_o + T_ost + (b + ε^(3)_p + ε^(4)_pg) B∗ + ε^(1)_p + ε^(2)_pg + ε∗             81   51200   51600   4.8e-45   2.9    2.6
23'  log(F∗) = O_o + T_ost + b log(B∗) + ε∗                                               75     799    1200   0.024     2.9    5.1
24'  log(F∗) = O_o + T_ost + b log(B∗) + P_p + G_pg + ε∗                                 186     382    1370   0.5       3.6    9.0
25'  log(F∗) = O_o + T_ost + b log(B∗) + ε^(1)_p + ε^(2)_pg + ε∗                          77     737    1150   0.63      3.5    7.9
26'  log(F∗) = O_o + T_ost + (b + ε^(3)_p + ε^(4)_pg) log(B∗) + ε^(1)_p + ε^(2)_pg + ε∗   81     490     923   0.0052    4.2    8.0

Table 2.8: Comparison of the error models for the time course measurements. The abbreviation '∗' is used instead of all occurring indices in a model, i.e. the indices of all predictor variables and an index for replicate measurements. In accordance with the results obtained from the housekeeping proteins, the log-transformation improves the performance. The preferred model 26' is superior for two out of the five criteria.

In accordance with the results obtained from the housekeeping proteins, model 26', i.e. equation (2.15), is superior for two out of the five criteria. Affirming the results obtained for the housekeepers, a log-transformation of the measurements, a multiplicative background correction, and an adjustment for preparation and gel effects improve the models' performance in terms of model discrimination, signal-to-noise ratio, and normal distribution of the residuals. The benefit of model 26' in terms of the model assessment criteria also results in a more accurate estimation of the time effects. As an example, the estimated time dependency of the phosphorylated IR∗ after a stimulation with 100 nM Insulin is plotted in Figure 2.8 with the respective standard errors. Using the complete model (Figure 2.8, panel (d)) leads to a reliable estimation of the time dependency and to small confidence intervals. In contrast, without a log-transformation, or if the systematic errors are not regarded, the signal-to-noise ratio and the smoothness of the estimated time courses are decreased. Figure 2.8, panel (a) shows time course estimates from signal intensities without considering preparation or gel effects. Panel (b) shows the same for log-ratios as response according to model No. 16'. In panel (c), model 6' is used, which corresponds to the superior model without log-transformation. The obtained time courses for all considered models are displayed in Figures 2.9 to 2.11. The panels in one row show the outcomes of different error models for the same observable. The abbreviations of the observables are summarized in Table 2.7.



Figure 2.8: The activation of the Insulin receptor has been estimated (a) from signals, (b) from log-ratios, (c) from signals with elimination of systematic errors and (d) for the full model with log-transformation and multiplicative background correction.

A zigzag shape of some time courses emerges for models with a fixed gel effect because neighboring time points are mostly not on the same gel. This causes a poorly identifiable parameter which determines how the time effects of even and uneven time points have to be combined. This is not a problem if the gel effects are modeled as random variables, because then only a single parameter has to be estimated for each gel. Figure 2.9 shows that the error models without a log-transformation yield large confidence intervals for the time course estimates. If a log-transformation is applied and systematic errors are treated multiplicatively, the error bars are decreased (Figure 2.10). Qualitatively similar results can be seen in Figure 2.11, where a regression parameter is estimated for the background correction. This step improves the model assessment criteria and leads to the overall best model No. 26'. For this model, the log-transformed foreground intensities are used as the response variable, preparation and gel effects are modeled as random variables, and a gel specific random regression parameter is estimated for the background correction. Although most models yield qualitatively similar shapes for the time dependency, the estimated dynamic behavior depends on the chosen model. One possibility to avoid this dependency would be a model averaging procedure. Here, a weighted average of all estimated time courses would be calculated. The weights w_M are given by the posterior probability of the considered model M. This posterior probability can be approximated up to first order by the exponential of the Bayes Information criterion BIC_M of model M [KASS & RAFTERY 1994], i.e.

w_M = exp(−BIC_M / 2) / Σ_m exp(−BIC_m / 2) .    (2.16)

Because model 26' has a clearly superior BIC, this model would contribute dominantly in a model averaging process, i.e. w_26' ≈ 1. Recapitulating, all appropriate error models lead to qualitatively similar time courses.
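A minimal sketch of this weighting, equation (2.16), in Python. The BIC values below are illustrative placeholders; the shift by the minimum leaves the weights unchanged and only avoids numerical underflow.

```python
import numpy as np

def bic_weights(bics):
    """First-order posterior model weights from BIC values, eq. (2.16)."""
    b = np.asarray(bics, dtype=float)
    b -= b.min()                 # shifting leaves the weights unchanged,
    w = np.exp(-b / 2.0)         # but avoids numerical underflow
    return w / w.sum()

# illustrative BIC values of four competing models; the clearly smallest
# BIC receives essentially all of the weight (w ~ 1)
print(bic_weights([923.0, 1150.0, 1200.0, 1370.0]))
```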



Figure 2.9: Estimated time effects of the ten measured observables for the error models 1'-9'. For the models 1'-3' (blue) the raw foreground intensities are used as the response variable. Background subtraction is applied for models 4'-6' (black), and the ratios of foreground over background are used for models 7'-9' (magenta). Models with a fixed gel effect show a zigzag shape of the time course due to a poorly identifiable parameter.

Nevertheless, the obtained time effects depend on the applied models. This emphasizes the need for proper error models for the analysis of immunoblotting and immunoprecipitation measurements.

2.7.6 Model selection for the superior error model

The submodels obtained by omitting an effect are tested by likelihood-ratio tests against the full models 26 and 26', respectively.



Figure 2.10: The estimated time effects of the regarded observables after log-transformation. Log-foreground intensities are used as the response variable for models 10'-12' (gray). The models 13'-15' with log-signals as response are displayed in green, and the models for the log-ratios (16'-18') are plotted in red.

For this purpose, the parameters are fitted by maximum likelihood estimation [PINHEIRO & BATES 2000] instead of the restricted maximum likelihood method [HARVILLE 1977], which has been used by default for the parameter estimation of the mixed effects models. The p-values obtained by the likelihood-ratio tests are displayed in Table 2.9. Only the effect ε^(3)_p accounting for the preparation specific background correction is not clearly significant. For the full data set, a regression coefficient of b = 0.98 ± 0.04 was estimated for the background correction. The confidence interval is in agreement with ratios log(F)/log(B), corresponding to b = 1. Nevertheless, a likelihood-ratio test is significant with p < 2.2 × 10^-16.
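Schematically, such a likelihood-ratio test of a submodel against the full model can be computed as follows. This is a sketch with hypothetical log-likelihood values; both models must be fitted by maximum likelihood, not REML.

```python
from scipy.stats import chi2

def lr_test(loglik_sub, loglik_full, df_diff):
    """Asymptotic p-value of a likelihood-ratio test between nested models."""
    lr = 2.0 * (loglik_full - loglik_sub)   # likelihood-ratio statistic
    return chi2.sf(lr, df_diff)             # chi-squared tail probability

# hypothetical ML log-likelihoods of a submodel and the full model,
# differing by one parameter
print(lr_test(loglik_sub=120.0, loglik_full=160.0, df_diff=1))
```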



Figure 2.11: The estimated time effects for the error models with a fitted regression parameter for the background correction. Untransformed foreground intensities are displayed in blue whereas the log-transformed intensities are plotted in gray color. The model 26’ is the superior model concerning the model assessment criteria.

The qq-plot in Figure 2.12 for the superior error model No. 26' shows the consistency of the residuals with the normal distribution. Additionally, a scatter plot of the residuals against the predicted values shows no pattern (see Figure 2.13).

2.8 Matrix notation

The superior mixed model 26', i.e. equation (2.15), can be written in matrix notation as

F_pg = X_pg β + Z_pg γ_pg + ε_pg ,    (2.17)


Tested effect                               Symbol     p-value (model 26)   p-value (model 26')
Time and stimulation                        T          < 2.2e-16            < 2.2e-16
Background                                  b          1.58e-15             6.67e-16
Preparation specific background correction  ε^(3)_p    0.095                1.67e-3
Preparation effects                         ε^(1)_p    4.13e-5              4.01e-3
Gel specific background correction          ε^(4)_pg   < 2.2e-16            < 2.2e-16
Gel effects                                 ε^(2)_pg   < 2.2e-16            < 2.2e-16

Table 2.9: Model selection for the superior models: p-values obtained by likelihood-ratio tests of the superior error models against their submodels.

Figure 2.12: A qq-plot shows that the residuals of the superior model are normally distributed.

where

• F_pg ∈ M(n_pg × 1) are vectors containing all n_pg log-foreground intensities of preparation p on gel g.

• X_pg ∈ M(n_pg × n_fixed) are model matrices. Without loss of generality, X_pg can be written as

X_pg = ( X^O_pg   X^T_pg   B_pg ) ,    (2.18)

where, for all i, j, (X^O_pg)_ij ∈ {0, 1} indicates whether observable j has an influence on data point i of preparation p on gel g. Similarly, (X^T_pg)_ij ∈ {0, 1} indicates whether time effect j has an influence on data point i. B_pg are the log-background intensities of the considered preparation and gel.



Figure 2.13: The residuals of the superior error model show no dependency on the predictions of the model.

• β ∈ M(n_fixed × 1) is a vector containing all fixed effects. Model 26' has

n_fixed = 1 + n_o + Σ_{o,s} n_t(o, s)    (2.19)

fixed parameters. If the model matrices X_pg are chosen as described above,

β = (b, O, T)ᵀ    (2.20)

holds, i.e. the first entry of β is the regression coefficient b of the background correction, followed by the observable and time effects.

• Z_pg ∈ M(n_pg × 4) are model matrices for the random effects. In our model, every data point is influenced by all four random effects. This yields

Z_pg = [ log((B_1)_pg)       log((B_1)_pg)       1   1
         ...                 ...                 ... ...
         log((B_n_pg)_pg)    log((B_n_pg)_pg)    1   1 ] .    (2.21)

Again, n_pg is the number of data points of preparation p on gel g.

• γ_pg ∈ M(4 × 1), γ_pg ∼ N(0, Ψ), denote the vectors containing the random effects coefficients of preparation p and gel g. The covariance matrix is

Ψ = diag(σ4², σ3², σ2², σ1²) .    (2.22)



• ε_pg ∼ N(0, σ²I) denotes uncorrelated observational noise of variance σ².
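As a small illustration, the random-effects design matrix Z_pg of equation (2.21) can be assembled as follows, using toy log-background values.

```python
import numpy as np

logB_pg = np.array([4.8, 5.1, 5.3])          # toy log-background values, n_pg = 3
Z_pg = np.column_stack([logB_pg, logB_pg,    # slope columns for eps(4)_pg and eps(3)_p
                        np.ones(3),          # intercept column for eps(2)_pg
                        np.ones(3)])         # intercept column for eps(1)_p
print(Z_pg)                                  # shape (n_pg, 4), cf. eq. (2.21)
```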

2.9 Conclusions

Molecular biology has already contributed remarkably to our understanding of biological systems at the cellular level. However, recently it became obvious that screening all molecular components is not enough to understand the complexity of cellular processes. Time resolved measurements in combination with mathematical models have to be used to uncover the dynamics and systems properties of biochemical networks. Despite the development of new experimental techniques, quantitative reliability still constitutes a main bottleneck of new experimental approaches. Many methods show a poor signal-to-noise ratio and are only applicable for time resolved measurements in a limited context, especially if primary cells are studied. Statistical analysis of measurement noise by error models has been successfully applied to some other experimental techniques, e.g. to high throughput screening [MALO et al 2006], analytical chemistry [ROCKE & LORENZATO 1995], microarrays [LOVE et al 2002, WENG et al 2006] or polymerase chain reactions [PFAFFL 2001]. A challenging next step is the statistical analysis of data obtained by other experimental techniques.

In this chapter, time resolved immunoblotting data of the Insulin signaling pathway has been used to develop and validate an error model. Immunoblotting constitutes an essential technique in molecular biology because it is very sensitive to small protein amounts and allows for discriminating between protein modifications like phosphorylations. A set of 26 error models for immunoblotting intensities has been introduced and evaluated with respect to their ability to explain the variability of repetitive measurements. It could be shown by a model discrimination procedure that the predominant errors are multiplicative. A log-transformation of the measured intensities is required to obtain desirable statistical properties like a Gaussian noise distribution. These statistical properties are necessary for dynamic modeling of protein interactions, e.g. in signal transduction. A consequence is that error bars for immunoblotting intensities are not symmetric on the raw intensity scale. Although outliers towards large values are often observed, this asymmetry of the error bars has so far not been considered in practice.

Based on this work, it is recommended that background correction procedures are applied on the log-scale. Furthermore, a gel specific background correction seems preferable. Moreover, it is strongly recommended to determine the background intensities carefully. An overestimation of the local background, e.g. by contamination with foreground signals, would lead to a lower signal-to-noise ratio after correction. To avoid this risk, it has been shown that the same results are obtained by using background intensities determined "far away" from the signal spots on our gels. Furthermore, a strong biological variability between the cell preparations as well as technical noise between different gels have been revealed. An elimination of these sources of bias improves the reproducibility of the data significantly and results in smaller confidence intervals of the estimated protein concentration time courses.

In comparison to separate calibration experiments [SCHILLING et al 2005b], the shown approach does not require additional experiments.



The amount of observational noise is determined by replicate measurements from the time course measurements themselves. The advantage of this approach is that calibration experiments can be avoided and that the error model is developed in exactly the same experimental setup as the time courses, e.g. with the same cells, proteins, and antibodies. A coefficient of variation within gels of around 14% has been found. This result can be utilized for Monte-Carlo simulation studies: a realistic setting for the typical noise obtained for immunoblotting data would be log-normally distributed noise with a similar coefficient of variation.

Modeling of the measurements and of the measurement errors allows for statistical testing and for the elimination of systematic errors. The illustrated mixed effects model approach can be extended to correct for arbitrary undesired influences, e.g. by different experimental techniques, different cell types or tissues. Therefore, mixed effects models constitute an appropriate approach for analyzing data obtained under various experimental conditions, e.g. in different laboratories. Such an integration of various sources of data is an important and essential step for systems biology, because the reconstruction of molecular interaction networks requires a vast amount of experimental information.

The applied multivariate statistical models are very powerful for processing experimental data. In such a framework, model assessment criteria can be calculated to determine optimally performing procedures. Further, it allows for a systematic comparison of the dependency of any outcome on changes in the processing of the data. After all, the development of appropriate error models accounting for biological inhomogeneity and experimental noise is one step in the key challenge of systems biology, the generation of reliable and biologically relevant mathematical models.

For the project presented in this chapter, I thank Dr. María Bartolomé-Rodríguez and her group for their great experimental efforts.


3 Modeling of flow cytometry data

In this chapter, single cell measurements of insulin binding of primary hepatocytes obtained by flow cytometry are examined. A comprehensive data processing strategy is introduced and applied to estimate the insulin binding dynamics. Two subpopulations of primary hepatocytes have been discovered, showing a distinct magnitude of insulin binding. For both subpopulations, the average amount of bound insulin as well as the cell-to-cell variability have been estimated using Gaussian mixture models. Statistical models accounting for the variability between the experimental runs as well as for the effects of data processing parameters are applied. In contrast to the identification of an error model for immunoblotting presented in Chapter 2, a comprehensive evaluation of all possible models was not feasible due to combinatorial complexity. Therefore, a stepwise backward elimination procedure has been performed to identify the relevant effects. Because there is no objective criterion for specifying ambiguous data processing parameters with such an effect on the results, the estimates of the time and dose effects obtained for the alternative data processing strategies have been merged by a mixed effects model. This approach can be interpreted as a statistical meta-analysis. A detailed examination of the impact of the chosen data processing strategy on the results showed that the average binding in both subpopulations can be determined reliably. In contrast, the subpopulations' sizes and the within-subpopulation variability cannot be analyzed quantitatively because of a strong sensitivity to the chosen data processing strategy. An ODE model for the binding dynamics of insulin in both subpopulations has been established. The kinetic parameters have been estimated and a model discrimination procedure has been applied to determine the subpopulation dependent parameters.

3.1 Flow Cytometry

Flow cytometry is an experimental technique which allows for high-throughput single cell evaluations of fluorescently labeled molecular compounds. The technical principle of flow cytometry is illustrated in Figure 3.1. After the treatment of the cells according to a desired experimental protocol, antibodies labeled with a fluorophore bind to the molecular compounds of interest. During the measurements, the fluorophores are excited by a laser beam. Several detectors are used to determine the scatter intensities in forward and perpendicular directions at the excitation wavelength, as well as the emission intensities of the fluorophores at their characteristic wavelengths.


Figure 3.1: Schematic overview of the flow cytometric quantification. Tagged antibodies are used to quantify a molecular compound of interest. In the flow cytometer, the fluorophores of single cells are excited by a laser. The scattered light is quantified by several detectors. The forward and side scatter intensities at the excitation wavelength of the laser are used to characterize the cells' geometry. The intensities at the emission wavelengths of the fluorophores are used to quantify the compounds of interest.

The forward scatter (FSC) intensities are known to be correlated with the cells' size; the perpendicular or side scatter (SSC) intensities indicate the geometry or granularity of the cells, i.e. their shape and their amount of cytoplasmic granules. The measurements of these two channels are utilized to physically characterize the examined cells and to select the cells of interest. In general, the use of different fluorophores with distinct emission spectra allows for a quantification of several compounds of the same cell in parallel. The fluorescence of the different fluorophores is quantified using several band-pass filters, as shown in Figure 3.1. More technical details about flow cytometry can be found in [EDWARDS et al 2004, HERZENBERG et al 2002, 2006]. The calibration of cytometers is described in [KRAAN et al 2003]. A summary of the excitation spectra of commonly used lasers as well as of the emission spectra of some fluorophores is given in [SHAPIRO 1994].



In our application, primary mouse hepatocytes are treated with different amounts of insulin which has been labeled with covalently bound fluorescein isothiocyanate (FITC). FITC is a fluorophore with a maximal excitation efficiency at a wavelength of 495 nm. The emission is in the range between 500 nm and 600 nm and has its maximal intensity at around 519 nm. A band-pass filter between 515 nm and 545 nm has been used for the quantification of the FITC emissions. The forward and side scatter light is measured at the excitation wavelength of the Argon laser at 488 nm with a bandwidth of ±5 nm. If the measurements are evaluated in real time, the flow cytometry technique can be utilized to sort the cells according to the measured fluorescence. This is called Fluorescence Activated Cell Sorting (FACS). FACS allows for a further experimental evaluation of cells with certain amounts of the compounds of interest. However, in our biological application, i.e. the investigation of the insulin signaling pathway in primary mouse hepatocytes, this step could not be established because the remaining yield of viable cells was too small for further experimental analyses.

3.2 Methodology

3.2.1 2D-Analysis

In flow cytometric studies, the emissions of the fluorophores which are bound to single cells are examined as the signals of interest. The measured intensities are proportional to the number of molecular compounds of interest. As an initial step, i.e. before the intensity channels of interest are evaluated, the forward and side scatter intensities characterizing the cells' size and granularity are examined to select the cell population of interest. This step is called the 2D-analysis in the following. The demands on the 2D-analysis are highly application specific, because the amount of contaminating cells and the definition of the cells of interest depend on the experimental issue. This probably constitutes the reason why there are currently no automated selection methodologies available, and the cells of interest usually have to be selected manually. One commonly chosen strategy is the definition of two thresholds for the forward and the side scatter intensities, respectively, as indicated in Figure 3.2, panel (A). Another commonly applied strategy is the manual definition of an arbitrarily shaped region, like a polygon, as illustrated in panel (B). An automated selection of the cells of interest would enable more objective and reproducible analyses. Further, it would allow for a systematic evaluation of the sensitivity of the outcomes to the selection step and enable comparisons of the performance of alternative strategies. Finally, an algorithmic selection would save time if many data sets are intended to be analyzed. Panel (C) in Figure 3.2 demonstrates that the selection of the cells of interest has a noticeable impact on the outcomes, namely on the emission intensities of the FITC fluorophore. According to the radial distance from the origin, the same fractions of cells are labeled by each color in the scatter plot of FSC vs. SSC. The distribution of the fluorescence measurements, i.e. of the amount of bound insulin molecules, depends on the selected region in the scatter plot. Only the cells of interest, i.e. the viable hepatocytes which are characterized by large forward and side scatter intensities, are able to bind large amounts of insulin. For the viable hepatocytes, a bimodal distribution of the FITC fluorescence is observed. The location and the shape of this bimodal distribution also slightly depend on the selected cells.



Figure 3.2: Panel (A) illustrates a selection of the cells of interest by a definition of two thresholds for the forward and the side scatter intensities, respectively. Panel (B) shows a corresponding selection strategy on the basis of a polygon which is defined manually. These two strategies are widely applied in practice to define the cells which are examined in the fluorescence channel of interest. Panel (C) shows the dependency of the FITC intensities, i.e. of the amount of bound insulin, on the forward and side scatter intensities. Here, the same number of cells is labeled by each color.

In our application, the goal of the 2D-analysis is the separation of the viable hepatocytes from dead hepatocytes and from other cell types occurring in liver tissue, like hepatic stellate cells or Kupffer cells. For this purpose, the measurements obtained for the different cell types are modeled with Gaussian mixture models [BAILEY & ELKAN 1994, GIBBONS et al 1984, LEISCH 2004], i.e. the forward and side scatter intensities, f and s respectively, are described as a sum of two bivariate distributions

ρ(f, s) = π^I ρ^I(f, s) + (1 − π^I) ρ^U(f, s) ,   0 ≤ π^I ≤ 1 .    (3.1)

Here, ρ^I(f, s) denotes the measurement distribution for the cells of interest I, ρ^U(f, s) the alternative distribution for the uninteresting cells U, and π^I is the fraction of the cells of interest. In our application, both probability distributions ρ^I(f, s) and ρ^U(f, s) of the mixture model (3.1) have been assumed to be bivariate Gaussian distributions

ρ^I(f, s) ∼ N(µ^I, Σ^I)    (3.2)

and

ρ^U(f, s) ∼ N(µ^U, Σ^U)    (3.3)

with expectations µ^I ∈ R² and µ^U ∈ R², and covariance matrices Σ^I ∈ M(2 × 2) and Σ^U ∈ M(2 × 2), respectively. Because the covariance matrices are symmetric, there are eleven independent model parameters, which are collected in the vector

Θ := (π^I, µ^I_1, µ^I_2, Σ^I_11, Σ^I_12, Σ^I_22, µ^U_1, µ^U_2, Σ^U_11, Σ^U_12, Σ^U_22) .    (3.4)

These parameters have been estimated for each data set separately. Figure 3.3 shows in panel (A) the forward and side scatter measurements as a scatterplot and in panel (B) as a two-dimensional histogram. In panel (C), the fitted bivariate mixture is depicted. An example of the selected cells, i.e. the viable hepatocytes which are evaluated for their amount of insulin binding, is shown in panel (D).

Figure 3.3: Panel (A) shows a scatter plot of the forward vs. the side scatter intensities. In panel (B), the scatterplot is depicted as a histogram in two dimensions. Panel (C) shows the corresponding bivariate mixture model after parameter estimation, and panel (D) illustrates the selection of the cells of interest.

After the parameter estimation, the distributions

ρ̂^I(f_j, s_j) ≡ ρ^I(f_j, s_j | µ̂^I, Σ̂^I)    (3.5)

and

ρ̂^U(f_j, s_j) ≡ ρ^U(f_j, s_j | µ̂^U, Σ̂^U)    (3.6)

• Parameter initialization: Θ^(0) := (π^I(0), µ^I(0), µ^U(0), Σ^I(0), Σ^U(0)).

• Iterate the following for i = 0, 1, . . . until convergence:

  1. E-step: calculation of the posterior probabilities

     P(I | f_j, s_j, Θ^(i)) = π^I(i) ρ^I(f_j, s_j | Θ^(i)) / [ π^I(i) ρ^I(f_j, s_j | Θ^(i)) + (1 − π^I(i)) ρ^U(f_j, s_j | Θ^(i)) ] ,
     P(U | f_j, s_j, Θ^(i)) = 1 − P(I | f_j, s_j, Θ^(i)) ,

     for the class memberships of cell j for both classes I and U.

  2. M-step: estimation of the expectations as posterior-weighted means,

     µ^I(i+1) = Σ_j P(I | f_j, s_j, Θ^(i)) (f_j, s_j)ᵀ / Σ_j P(I | f_j, s_j, Θ^(i))

     and analogously µ^U(i+1) with the weights P(U | f_j, s_j, Θ^(i)); estimation of the covariance matrices Σ^I(i+1) and Σ^U(i+1) as the correspondingly weighted sample covariance matrices of (f_j − µ^I_1, s_j − µ^I_2) and (f_j − µ^U_1, s_j − µ^U_2), respectively; and estimation of the mixture proportion

     π^I(i+1) = (1 / n_cells) Σ_{j=1,...,n_cells} P(I | f_j, s_j, Θ^(i))

     of the class of interest.

• After convergence, the final parameter estimates Θ̂ and the posterior distributions ρ^I(f, s | Θ̂) and ρ^U(f, s | Θ̂) have been found.

Figure 3.4: The EM-algorithm implemented for the 2D-analysis, i.e. for the parameter estimation of a mixture of the two Gaussian distributions ρ^U and ρ^I.

have been evaluated for each cell j = 1, . . . , n_cells in the data set in order to obtain the posterior probabilities

P(I | f_j, s_j) = π̂^I ρ̂^I(f_j, s_j) / [ π̂^I ρ̂^I(f_j, s_j) + (1 − π̂^I) ρ̂^U(f_j, s_j) ]    (3.7)

of cell j belonging to the class I of viable hepatocytes.

Because the class memberships of the cells are unknown, the parameters of the mixture model have to be estimated without an assignment of the cells to the populations I and U. For this purpose, the expectation maximization (EM) algorithm [DEMPSTER et al 1977] has been applied



for parameter estimation. The EM-algorithm iteratively alternates between an expectation or E-step and a maximization or M-step. The E-step is the calculation of the expected assignment of the cells to the considered two classes for given parameter values. In the M-step, the parameter estimation is performed on the basis of the assignments derived in the E-step of the current iteration. The EM-algorithm for the 2D-analysis is depicted in more detail in Figure 3.4. The EM-algorithm is a well-established approach for parameter estimation in the case of unobserved or latent variables. If the maximization step is realized as maximum likelihood estimation, the EM-algorithm converges to a local maximum [XU & JORDAN 1995]. For unimodal likelihood densities ρ^I(f, s) and ρ^U(f, s), the EM-algorithm converges to the global maximum [WU 1983]. The raw forward and side scatter intensities have been log-transformed,

γ1(f) = log10(f) ,   γ1(s) = log10(s) ,    (3.8)

to account for the asymmetry of the distribution of the intensity data. In addition to the logarithmic transformation, an inverse hyperbolic sine (asinh) transformation

γ2(f) = asinh(f − f_min) ,   γ2(s) = asinh(s − s_min)    (3.9)

has been applied after a subtraction of the minimal scatter intensities f_min and s_min of each data set, to be able to study the impact of the chosen transformation on the outcomes. Figure 3.3 shows that after this transformation, the scatter intensities are approximately normally distributed. The transformation

γ2(x) = asinh(x) = log( x + √(x² + 1) )    (3.10)

is, for large values of the argument x, proportional to the logarithmic transformation. In general, the asinh transformation is known to be more stable than the log-transformation for small values of x because there are no outliers towards minus infinity for x < 1.

Eight different selection strategies have been applied to investigate the sensitivity of the outcomes to the selection strategy. Two different thresholds α for the posterior probability P(I | f, s), namely α1 = 0.8 and α2 = 0.95, have been used for the 2D-analysis after a log- and after an asinh transformation. In addition, a further restriction on the measurement density of the cells of interest in the scatterplot, namely ρ̂^I(f, s) ≥ β with β ∈ {0, 0.2}, has been evaluated. Thereby, eight different selection strategies have been performed in parallel for every data set. A summary of the chosen data processing settings is provided in Table 3.1.
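The 2D-analysis can be reproduced with an off-the-shelf EM implementation; the sketch below uses scikit-learn's GaussianMixture on simulated scatter data and applies a posterior threshold of α = 0.8. All numbers are illustrative assumptions, and the rule for labeling the component of interest is a simplification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# simulated (transformed) scatter intensities of two cell populations
rng = np.random.default_rng(2)
fsc = np.concatenate([rng.normal(4.0, 0.2, 3000), rng.normal(3.0, 0.3, 2000)])
ssc = np.concatenate([rng.normal(3.5, 0.2, 3000), rng.normal(2.5, 0.3, 2000)])
X = np.column_stack([fsc, ssc])

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
post = gmm.predict_proba(X)                  # posterior probabilities, cf. eq. (3.7)
interest = int(np.argmax(gmm.means_[:, 0]))  # component with larger FSC = cells of interest
selected = post[:, interest] >= 0.8          # posterior threshold alpha = 0.8
print(f"selected fraction: {selected.mean():.2f}, "
      f"estimated pi_I: {gmm.weights_[interest]:.2f}")
```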

3.2.2 1D-Analysis

After the selection of the cells of interest in the 2D-analysis, the emission e of the FITC fluorophore is evaluated, which is related to the number of bound insulin molecules. This step of the analysis is denoted as the 1D-analysis in the following. Figure 3.5 shows the FITC emission intensities e obtained for viable hepatocytes in response to four different insulin treatments. The cell specific amount of insulin binding clearly shows a bimodal distribution. This has already been suggested by panel (C) in Figure 3.2.



Strategy   2D transformation γ   Posterior threshold α   Density threshold β   1D transformation δ
AA         log10                 0.8                     none                  log10
AB         log10                 0.8                     none                  asinh
AC         log10                 0.8                     none                  boxcox
BA         asinh                 0.8                     none                  log10
BB         asinh                 0.8                     none                  asinh
BC         asinh                 0.8                     none                  boxcox
CA         log10                 0.95                    none                  log10
CB         log10                 0.95                    none                  asinh
CC         log10                 0.95                    none                  boxcox
DA         asinh                 0.95                    none                  log10
DB         asinh                 0.95                    none                  asinh
DC         asinh                 0.95                    none                  boxcox
EA         log10                 0.8                     0.2                   log10
EB         log10                 0.8                     0.2                   asinh
EC         log10                 0.8                     0.2                   boxcox
FA         asinh                 0.8                     0.2                   log10
FB         asinh                 0.8                     0.2                   asinh
FC         asinh                 0.8                     0.2                   boxcox
GA         log10                 0.95                    0.2                   log10
GB         log10                 0.95                    0.2                   asinh
GC         log10                 0.95                    0.2                   boxcox
HA         asinh                 0.95                    0.2                   log10
HB         asinh                 0.95                    0.2                   asinh
HC         asinh                 0.95                    0.2                   boxcox

Table 3.1: Overview of the applied data processing strategies. Eight different settings have been applied for the 2D-analysis, namely two different transformations, two thresholds for the posterior probabilities, and two density thresholds for ρ̂^I(f, s). For each of these eight settings, three transformations of the FITC intensities have been applied in the 1D-analysis. Altogether, every data set is processed 24 times to evaluate the impact of the data processing on the outcomes.
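The 24 strategies of Table 3.1 are simply the Cartesian product of the four processing choices, e.g.:

```python
from itertools import product

strategies = list(product(["log10", "asinh"],                # 2D transformation gamma
                          [0.8, 0.95],                       # posterior threshold alpha
                          [None, 0.2],                       # density threshold beta
                          ["log10", "asinh", "boxcox"]))     # 1D transformation delta
print(len(strategies))   # 24
```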

Primary hepatocytes seem to consist of two subpopulations showing a distinct affinity to insulin. Therefore, the viable hepatocytes are subdivided into two subpopulations to account for the distinct affinities for insulin binding. Here, a univariate mixture

ρ(e) = π^(1) ρ^(1)(e) + (1 − π^(1)) ρ^(2)(e)    (3.11)

of two Gaussian distributions

ρ^(1)(e) ∼ N(µ^(1), σ^(1)²)    (3.12)

and

ρ^(2)(e) ∼ N(µ^(2), σ^(2)²)    (3.13)

has been used to estimate the average magnitude of insulin binding, µ̂^(1) and µ̂^(2), in both subpopulations of the viable hepatocytes, as well as the cell-to-cell variabilities σ̂^(1) and σ̂^(2) and the proportion π^(1) of the subpopulation which binds less insulin.


Figure 3.5: Illustration of the 1D-analysis. Because the viable hepatocytes show a bimodal distribution of the bound insulin, a mixture of two univariate Gaussian distributions is used to estimate the amount of bound insulin and the cell-to-cell variability in both subpopulations. Here, exemplary experimental outcomes are depicted for four different insulin stimulations measured after 15 minutes. The black lines indicate a smooth density estimate using a Gaussian kernel, the blue and red lines depict the fitted distributions.

The fraction of cells in the second subpopulation, which binds the larger amount of insulin, is given by 1 − π̂^(1).

For the purpose of measuring the impact of the chosen data transformation, a log-transformation

δ1(e) = log10(e)    (3.14)

and an asinh transformation

δ2(e) = asinh(e)    (3.15)

have been used. Because it turned out that the transformation within the 1D-analysis has a noticeable impact on the results, a third transformation, namely a Box-Cox transformation [BOX & COX 1964], has been applied. The Box-Cox transformation

δ3(e, λ) = (e^λ − 1)/λ   if λ ≠ 0 ;   log(e)   if λ = 0    (3.16)



constitutes a general rank-preserving transformation for positive numbers. The parameter λ is optimized to make the distribution of the transformed measurements as similar as possible to a Gaussian distribution, i.e. λ is estimated by maximizing a Gaussian likelihood of the transformed data. For λ = 0, the Box-Cox transformation is defined as the log-transformation. A linear (λ = 1), a quadratic (λ = 2), and a hyperbolic transformation (λ = −1) are also special cases of the Box-Cox family of transformations. The expectations obtained for the asinh and the Box-Cox transformations have been mapped via δ1 ∘ δi⁻¹(µ) for i = 2, 3 to enable a direct comparison between the estimates obtained after different data transformations. Similarly, the estimates of the spread of the distributions are compared by a projection δ1 ∘ δi⁻¹ of the interval [µ̂ − 0.5 σ̂, µ̂ + 0.5 σ̂] to the logarithmic scale.
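A minimal sketch of the Box-Cox step: scipy's boxcox implements equation (3.16) and chooses λ by maximizing the Gaussian likelihood of the transformed data. The input intensities here are synthetic.

```python
import numpy as np
from scipy.stats import boxcox

# positive toy intensities; log-normal data should yield lambda close to 0
e = np.random.default_rng(1).lognormal(mean=2.0, sigma=0.5, size=5000)
e_bc, lam = boxcox(e)                   # transformed data and ML estimate of lambda
print(f"estimated lambda = {lam:.3f}")  # near 0, i.e. nearly a log-transformation
```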

3.2.3 Quality control

A reliable analysis strategy should enable an identification of outliers which could be due to a failed cell preparation or a miscarried measurement procedure. The introduced mixture model approach provides several possibilities for quality control. In our analysis, the fraction π^I of identified viable hepatocytes and the fraction N0 of cells which do not bind any insulin are used as quality criteria for the cell preparation. In addition, the estimated mixture model parameters µ̂^I and µ̂^U have been compared between the data sets. Panel (A) in Figure 3.6 shows the parameter estimates for the different cell preparations. Preparation 1, which is displayed in dark yellow color, yields systematically larger side scatter components of µ^U. Panel (B) shows an exemplary data set out of this preparation. Here, the cells have been treated with a 100 nM insulin solution. Although less insulin binding has been observed than expected, the data sets in this preparation show almost no contamination by dead cells or other cell types. In comparison to the other cell preparations, the 2D scatter plots and the 1D FITC distributions are found to be distorted. Finally, it turned out that, accidentally, a different amplification gain has been used for these measurements. Because the comparability of these data sets is not ensured, the preparation has been removed from further analyses. Preparation 7, displayed in black color in panel (A), also seems suspicious. Panel (C) shows that for this preparation only a very small proportion of the cells behave like viable hepatocytes. Here, only around 4%-10% viable hepatocytes could be identified within the 2D-analysis. Further, these cells bind almost no insulin, as indicated by the large number of cells with FITC intensities close to zero. For these reasons, this preparation has also been removed from further analyses. In addition to the quality control at the level of the cell preparations, also the data sets within a preparation and the data points within a data set can be evaluated for outlier detection. At the level of the data sets, no artifacts have been found. Within a data set, the discretization of the forward scatter intensities, which is due to the 16-bit analog-to-digital converter and can be seen in Figure 3.3, constitutes the major issue for data quality. For the performed analyses, data points close to the background, namely cells with a forward scatter intensity in the smallest intensity bin, have been removed to eliminate systematic errors due to the limited measurement range.



Figure 3.6: Panel (A) shows the estimated expectations µ̂^U and µ̂^I of the bivariate Gaussian mixture model used for the 2D-analysis. The parameters µ̂^U and µ̂^I have been evaluated for the purpose of quality control. Cell preparation 1 (dark yellow) yields systematically different estimates of µ^U. Panel (B) shows an example data set from this preparation after a treatment with 100 nM insulin. There are fewer contaminating cells, all scatter plots of this preparation show a distortion in comparison to the other preparations, and the FITC channel yields a smaller amount of insulin binding than expected for this stimulation. Panel (C) shows an exemplary data set of cell preparation 7 (black in panel (A)). Here, a very low number of viable hepatocytes is obtained. This leads to distinct estimates of µ^I for this preparation.

Saturation at high intensity levels occurred to a noticeable extent only for preparation 1. Because this preparation has already been removed from further analyses, no further restrictions had to be defined to account for saturation effects at large intensities.

3.2.4 Further analyses

After the analyses of the single data sets, the experimental outcomes are analyzed with respect to our biological issues. For this purpose, the outcomes of the single experiments, i.e. the estimated parameters of the univariate mixture models, have to be condensed. This comprises the averaging over the experimental replicates, the estimation of the impact of the experimental parameters of interest, namely of the chosen treatment, time and temperature, and the determination of the corresponding confidence intervals.



Condition   Treatment [M]   Time [min]   Temperature [°C]   # replicates   # bio. replicates
 1          0               15           4                  18 (-9)        4 (-2)
 2          0               15           37                 25 (-3)        7 (-1)
 3          10^-10          15           4                  18 (-9)        4 (-2)
 4          10^-10          15           37                 6 (-3)         2 (-1)
 5          10^-9           15           4                  18 (-9)        4 (-2)
 6          10^-9           15           37                 6 (-3)         2 (-1)
 7          10^-8           15           4                  18 (-9)        4 (-2)
 8          10^-8           15           37                 6 (-3)         2 (-1)
 9          10^-7           15           4                  18 (-9)        4 (-2)
10          10^-7           1            37                 16             2
11          10^-7           2            37                 16             2
12          10^-7           5            27                 16             2
13          10^-7           15           37                 22 (-3)        4 (-1)
14          10^-7           30           37                 16             2
15          10^-6           15           4                  18 (-9)        4 (-2)
16          10^-6           1            37                 8              2
17          10^-6           2            37                 4              1
18          10^-6           5            37                 8              2
19          10^-6           15           37                 14 (-3)        4 (-1)
20          10^-6           30           37                 8              2
21          10^-5           15           4                  18 (-9)        4 (-2)
22          10^-5           1            37                 8              2
23          10^-5           2            37                 4              1
24          10^-5           5            37                 8              2
25          10^-5           15           37                 22 (-3)        6 (-1)
26          10^-5           30           37                 8              2

Table 3.2: Overview of the experimental conditions. The insulin binding has been evaluated for 26 different experimental conditions. Time courses and dose-responses have been generated for two different temperatures, namely for 4°C and 37°C. Altogether, 347 data sets have been generated using eight different cell preparations. The number of data sets removed due to the quality control criteria is denoted in brackets.

These steps are discussed in Section 3.3. Briefly, statistical models are used with all experimental parameters as predictor variables and with the estimated parameters of the univariate mixture model as response. The predictor variables comprise time, treatment and temperature effects, and also parameters which account for sources of systematic noise. This approach allows for the identification of the dominating sources of noise, and for the estimation of the time and treatment dependencies and the corresponding confidence intervals, adjusted for the systematic errors.



3.3 Results

The mixture model approaches for the 2D-analysis and the 1D-analysis as described in the previous Section have been applied to 347 data sets. Altogether, eight different cell preparations have been analyzed. Insulin binding has been measured at different times and after treatments with different amounts of insulin. The measurements have been performed at a temperature of either 4°C or 37°C. A summary of the experimental conditions is provided in Table 3.2. For each data set, the 24 data processing strategies summarized in Table 3.1 have been applied to estimate the average insulin binding in both subpopulations of viable primary hepatocytes, as well as the cell-to-cell variability and the fractions of both populations.

3.3.1 Sensitivity to the data processing strategies

The five parameters of the univariate mixture model in equations (3.11)-(3.13), namely µ^(1), µ^(2), σ^(1), σ^(2), and π^(1), have been estimated for all experiments and all data processing strategies. Thereby, 6216 estimates for each parameter have been determined out of the 259 experiments for the six cell preparations used after filtering according to the quality control criteria. These outcomes are further analyzed using linear statistical models to evaluate the sensitivity of the estimates to the data processing, and to estimate the effects of interest, namely the impact of time, treatment and temperature on the measurement distributions. It turned out that both aims require distinct statistical models. The identification of the sources of heterogeneity requires a model containing all relevant predictor variables. For the estimation of the impact of time, treatment, and temperature, a model without confounding with these variables is required. A reliable determination of confidence intervals of the estimates also demands regarding the hierarchical structure of the data, i.e. accounting for the fact that the different outcomes of all preprocessing strategies are based on the same experimental noise realizations. The sensitivity to the data processing strategy has been evaluated with the model

R_cpeαβγδ = R0 + C_c + P_p + E_e + α_α + β_β + γ_γ + δ_δ + (Cα)_cα + (Cβ)_cβ + (Cγ)_cγ + (Cδ)_cδ + ε_cpeαβγδ .    (3.17)

The left-hand side denotes the experimental response R_cpeαβγδ, i.e. R is a placeholder for one of the parameters µ̂^(1), µ̂^(2), σ̂^(1), σ̂^(2), or π̂^(1), which are estimated for the c'th experimental condition, the p'th cell preparation, and the e'th experiment, using the preprocessing parameters indicated by the indices α, β, γ, and δ. The remaining observational noise is denoted by ε. The effects of interest, i.e. the effects of the treatment, the time and the temperature, are estimated as the parameters C_c. Further, cell preparation effects P_p as well as effects E_e accounting for systematic noise between different experiments e have been considered. Because, obviously, a single experiment can only be performed for a single experimental condition, the effect E_e is nested within the experimental condition effect C_c. The dependency on the parameters of the 2D-analysis is accounted for by the effects α_α, β_β and γ_γ for the two posterior probability thresholds, the two density thresholds, and the two transformations, respectively. The choice of the transformation used for the 1D-analysis is accounted for by the effect δ_δ. Also, interactions (Cα), (Cβ), (Cγ), and (Cδ) between the effect of interest C and the preprocessing effects have been regarded.



(Cβ), (Cγ), and (Cδ) between the effect of interest C with the preprocessing effects have been regarded. The indices in equation (3.17) take values corresponding to the levels of the effect, e.g. index c = 1, . . . , nc for the nc = 26 different experimental conditions, or p = 1, . . . , np for the np = 6 different cell preparations after quality control. The notation and interpretation of equation (3.17) is exactly the same as introduced in Sections 1.6 and 2.5. The essential effects, i.e. the effects showing a significant impact on the response R ∈ {b µ(1) , µ b(2) , σ b(1) , σ b(2) , π b(1) }, have been determined by Analysis of Variances (ANOVA). Here, the total variance of the response is split to variance components originating from all predictor variables on the right hand side of the model. ANOVA has been introduced in Section 1.4. Starting with the model (3.17), an iterative backward elimination procedure [C HAMBERS & H ASTIE 1992] is applied to iteratively remove single effects which do not have an impact on the response. Starting from the least significant effect, all effects with a p-value larger than a significance level of 0.01 have been removed. Because of the relatively large number of data points, some effects yield significant results although the nominal effect size has been determined as quite small in comparison to the effect of interest. As a more stringent criterion, also effects with a mean square value below 1‰ of the maximal mean square value have been removed from the model to come up with only the essential effects. The p-values and the mean squares can only be reasonably interpreted for the interaction terms if the corresponding main effects are included in the model. Therefore, main effects are only removed from the model, if the corresponding interaction has already been removed. A summary of the backward elimination procedure is shown in Figure 3.7. For ANOVA, an intercept parameter R0 , i.e. an offset parameter, is required to allow for comparing the mean squares related to different effects. A summary of the results of the backward model discrimination procedure is depicted in Table 3.3. Panels (A) and (B) show the outcomes after the backward elimination procedure for the expectations µ b(1) and µ b(2) as the responses. Here, the backward elimination procedure yields (h)

    \hat\mu^{(h)}_{cpe\alpha\beta\gamma\delta} = R_0 + C_c + P_p + E_e + \delta_\delta + (C\delta)_{c\delta} + \varepsilon_{cpe\alpha\beta\gamma\delta}    (3.18)

for both responses h = 1, 2, i.e. there are significant and noticeable effects between the preparations and between the experiments. Moreover, there is an impact of the transformation used for the 1D-analysis, indicated by δ, as well as a significant interaction (Cδ) with the experimental condition. The mean squares allow for comparing the nominal magnitudes of the different effects. For µ̂(1), the preparations differ by a similar magnitude as the experimental conditions. For µ̂(2), the experimental condition is the dominating effect.
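The backward elimination loop itself is straightforward to implement. The following minimal sketch uses statsmodels and assumes a hypothetical long-format data frame df with columns R (the response), cond, prep, exp, alpha, beta, gamma, and delta for the factors of model (3.17); it is an illustration of the described procedure, not the original implementation.

import statsmodels.api as sm
import statsmodels.formula.api as smf

def backward_elimination(df, p_max=0.01, ms_fraction=1e-3):
    """Iteratively remove effects that are neither significant nor sizable;
    main effects stay in the model while one of their interactions remains."""
    terms = ["cond", "prep", "exp", "alpha", "beta", "gamma", "delta",
             "cond:alpha", "cond:beta", "cond:gamma", "cond:delta"]
    while True:
        fit = smf.ols("R ~ " + " + ".join(terms), data=df).fit()
        anova = sm.stats.anova_lm(fit, typ=1).drop(index="Residual")
        interactions = [t for t in terms if ":" in t]
        # a main effect may only be removed once its interactions are gone
        removable = [t for t in terms
                     if ":" in t or not any(t in i.split(":") for i in interactions)]
        weak = anova.loc[removable]
        weak = weak[(weak["PR(>F)"] > p_max) |
                    (weak["mean_sq"] < ms_fraction * anova["mean_sq"].max())]
        if weak.empty:
            return fit, anova                    # only essential effects remain
        terms.remove(weak["PR(>F)"].idxmax())    # drop the least significant one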

Panels (C) and (D) show that for the standard deviations σ̂(1) and σ̂(2), all effects in model (3.17) were significant and yielded a mean square value larger than 1‰ of the maximal mean square value. Therefore, the model (3.17) could not be reduced for these two responses. This indicates that the estimation of the variances of both distributions is highly dependent on the data processing. Note that for the expectations as well as for the standard deviations, the backward


Figure 3.7: Overview of the backward elimination procedure applied to determine the parameters of the experiments and of the data processing which affect the experimental responses µ̂(1), µ̂(2), σ̂(1), σ̂(2), and π̂(1). Effects which are not significant, or have a smaller impact than 1‰ of the strongest effect in terms of mean square values, have been iteratively removed from the initial model (3.17). The star '*' indicates the additional restriction that main effects are only selected for removal if there are no interactions of the effect in the evaluated model.

elimination procedure gave consistent results, i.e. yielded identical models for both subpopulations. Panel (E) shows the outcome for the proportion π̂(1) of cells in both subpopulations. Here, a significant and relevant dependency on the transformations used for the 2D- and the 1D-analysis is observed. The sum of squares of the main effect C and the interaction Cδ can be utilized to assess the dependency of the estimates Ĉ_c on the data processing. In panels (C)-(E), the interactions with some preprocessing parameters have a similar magnitude as the main effect. This is another indication of the sensitivity of these responses to the data processing.

After the identification of the experimental and data processing parameters having an impact on the outcomes of the 1D-analysis, the dependency on the experimental conditions is now estimated accounting for these sources of systematic noise. For this purpose, the linear models have to be adapted. First, the intercept parameter, the experimental effects E and the interactions with C have been removed from the models obtained by the backward elimination procedure, i.e. these parameters are set to zero, to allow for a direct estimation of the condition effects C_c. Then, the


Figure 3.8: The parameter estimates of the two Gaussian distributions vary between the data processing strategies. The blue lines are the outcomes of all data processing strategies for the first subpopulation, the red lines for the second, without regarding preparation or experiment effects. The black lines indicate the estimates and the standard errors obtained by the mixed effects model. Panel (A) shows the estimates for the expectations, panel (B) for the standard deviations and panel (C) for the proportions of the two mixtures, each plotted against the experimental condition.


(A) Response: µ̂(1)
              Df     Sum Sq.    Mean Sq.      F value        Pr(>F)
  C           25      418.66    16.75         42731.805      < 2.2e-16
  P            7      139.23    19.89         50752.224      < 2.2e-16
  E          226       32.81     0.15           370.423      < 2.2e-16
  δ            2        0.06     0.03            80.980      < 2.2e-16
  Cδ          50        1.53     0.03            78.022      < 2.2e-16
  Residuals 5905        2.31     0.0003919

(B) Response: µ̂(2)
              Df     Sum Sq.    Mean Sq.      F value        Pr(>F)
  C           25     1893.56    75.74         46405.316      < 2.2e-16
  P            7      185.36    26.48         16223.660      < 2.2e-16
  E          226       64.63     0.29           175.199      < 2.2e-16
  δ            2        1.02     0.51           311.902      < 2.2e-16
  Cδ          50        4.59     0.09            56.239      < 2.2e-16
  Residuals 5905        9.64     0.001632

(C) Response: σ̂(1)
              Df     Sum Sq.    Mean Sq.      F value        Pr(>F)
  C           25     11.0805    0.4432         1072.2212     < 2.2e-16
  P            7      1.3181    0.1883          455.5150     < 2.2e-16
  E          226      2.8513    0.0126           30.5214     < 2.2e-16
  δ            2      1.8777    0.9388         2271.2137     < 2.2e-16
  α            1      0.0008    0.0008            1.9382     0.163918
  β            1      0.1612    0.1612          389.8487     < 2.2e-16
  γ            1      0.0422    0.0422          102.1415     < 2.2e-16
  Cδ          50      0.9922    0.0198           48.0053     < 2.2e-16
  Cα          25      0.0192    0.0008            1.8570     0.005872
  Cβ          25      0.0355    0.0014            3.4376     1.542e-08
  Cγ          25      0.0503    0.0020            4.8638     1.749e-14
  Residuals 5827      2.4087    0.0004

(D) Response: σ̂(2)
              Df     Sum Sq.    Mean Sq.      F value        Pr(>F)
  C           25     12.0575    0.4823         1975.9389     < 2.2e-16
  P            7      6.3521    0.9074         3717.7025     < 2.2e-16
  E          226      3.8618    0.0171           70.0070     < 2.2e-16
  δ            2      0.1841    0.0920          377.0928     < 2.2e-16
  α            1      0.0100    0.0100           40.9340     1.697e-10
  β            1      0.0152    0.0152           62.1647     3.744e-15
  γ            1      0.0357    0.0357          146.3195     < 2.2e-16
  Cδ          50      1.0848    0.0217           88.8859     < 2.2e-16
  Cα          25      0.0195    0.0008            3.1997     1.315e-07
  Cβ          25      0.0160    0.0006            2.6183     1.923e-05
  Cγ          25      0.2156    0.0086           35.3255     < 2.2e-16
  Residuals 5827      1.4223    0.0002

(E) Response: π̂(1)
              Df     Sum Sq.    Mean Sq.      F value        Pr(>F)
  C           25     68.244     2.730           813.1419     < 2.2e-16
  P            7     23.568     3.367          1002.9181     < 2.2e-16
  E          226     49.852     0.221            65.7077     < 2.2e-16
  γ            1      0.050     0.050            14.9794     0.0001099
  δ            2      1.345     0.672           200.2730     < 2.2e-16
  Cγ          25      0.584     0.023             6.9606     < 2.2e-16
  Cδ          50     12.467     0.249            74.2712     < 2.2e-16
  Residuals 5879     19.736     0.003

Table 3.3: Linear models have been utilized to identify sources of systematic impact on the parameter estimates of the univariate mixture model. ANOVA analyses of the parameter estimates in combination with a backward elimination procedure show that in addition to the effect of interest, namely the dependency on the experimental condition C, the outcomes also show systematic variations between different preparations P and between different experiments E. In addition, all responses, i.e. all parameters of the mixture model, are significantly affected by the transformation chosen for the 1D-analysis. The corresponding main effect is denoted as δ, the interaction with the experimental condition as Cδ. The estimates of the standard deviations σ̂(1) and σ̂(2) depend on all data processing parameters α, β, γ, and δ. The estimates of the proportion π̂(1) showed a significant dependency on γ and δ, i.e. on the transformations used for both the 2D- and the 1D-analysis.

estimates Ĉ_c correspond to the average response over the experimental replicates after adjusting for preparation and for the respective relevant data processing effects.

As discussed in Section 1.6, there is a freedom in the parametrization of linear models, e.g. of equation (3.17). This freedom does not play a role for ANOVA, but is crucial for the estimated parameters, in particular for Ĉ_c. All parameters except C_c can be regarded as nuisance parameters, i.e. their estimates are not relevant for the experimental issues. For these effects, it is essential to use sum contrasts [Chambers & Hastie 1992], i.e. to parametrize the


model so that the nuisance parameter estimates sum up to zero. After these adaptations required for the estimation of the effects of interest, the linear models for the five responses could be applied to estimate the respective C_c. As an example, the response µ(1) is analyzed after averaging

    \hat\mu^{(1)}_{cpe\delta} = \mathrm{mean}_{\alpha\beta\gamma}\, \hat\mu^{(1)}_{cpe\alpha\beta\gamma\delta}    (3.19)

over the data processing strategies which do not substantially affect the estimates. This averaging is essential in order not to artificially increase the sample size by using different data processing strategies. Then, the adapted model

    \hat\mu^{(1)}_{cpe\delta} = C_c + P_p + \delta_\delta + (C\delta)_{c\delta} + \varepsilon_{cpe\delta}    (3.20)

is used to estimate the impact of the experimental conditions on the average insulin binding in the subpopulation H1. However, the confidence intervals obtained by model (3.20) are strongly under-estimated, i.e. are determined as too small. This is due to correlated outcomes, i.e. correlated residuals, obtained for the three transformations of the 1D-analysis. If 3 × 259 estimates µ̂(1) are used for fitting C_c for the three transformations, the confidence intervals are biased because only 259 independent experimental noise realizations are considered, which are analyzed in three different manners. As introduced in Section 2.5, statistical models accounting for such correlated noise are called mixed effects models. For the expectations, the mixed effects model

    \hat\mu^{(h)}_{cpe\delta} = C_c + \delta_\delta + p + pe + pe\delta + \varepsilon_{cpe\delta}    (3.21)

has been used to estimate C_c. This model adjusts for systematic variations between the outcomes for the three transformations and contains four Gaussian distributed sources of noise, namely p between the preparations, pe between the experiments, peδ between the transformations for a given experiment, and the observational noise ε_{cpeδ}. For the random effects p ∼ N(0, σ_p²), pe ∼ N(0, σ_e²) and peδ ∼ N(0, σ_δ²), only single parameters are estimated, namely the variances σ_p² between the preparations, σ_e² between the experiments, and σ_δ² between the transformations.

Figure 3.8 shows the estimates for all five response variables for all data processing parameters. The underlying blue and red lines indicate the outcomes if only a single data processing strategy is applied. In agreement with the backward elimination procedure, the expectations show a minor dependency on the data processing procedure (panel (A)). In contrast, the estimates of the standard deviations σ̂(1) and σ̂(2) are very sensitive to the 2D- and 1D-analysis, as depicted in panel (B). A comparison of µ̂ with σ̂ suggests that the magnitude of the cell-to-cell variability is correlated with the amount of bound insulin.
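To illustrate how a variance components model in the spirit of (3.21) can be fitted in practice, the following sketch uses the MixedLM implementation of statsmodels on synthetic data; all column names, sizes and variance values are invented for the example and do not correspond to the real measurements.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data mimicking the averaged estimates (3.19): each
# experiment belongs to one preparation and one condition and is analyzed
# with three 1D-transformations.
rng = np.random.default_rng(0)
rows = []
for prep in range(6):
    p_eff = rng.normal(0, 0.3)                    # random effect p
    for e in range(10):
        cond = int(rng.integers(0, 5))
        pe_eff = rng.normal(0, 0.2)               # random effect pe
        for trafo in ("log", "asinh", "boxcox"):
            ped_eff = rng.normal(0, 0.1)          # random effect pe*delta
            mu = 1.0 + 0.2 * cond + p_eff + pe_eff + ped_eff + rng.normal(0, 0.05)
            rows.append(dict(mu=mu, cond=str(cond), trafo=trafo,
                             prep=str(prep), exp=f"{prep}_{e}"))
df = pd.DataFrame(rows)

# Fixed effects for condition and transformation; random intercepts for
# preparation, experiment within preparation, and transformation within
# experiment.
model = smf.mixedlm(
    "mu ~ 0 + cond + trafo", df, groups="prep", re_formula="1",
    vc_formula={"exp": "0 + C(exp)",
                "exp_trafo": "0 + C(exp):C(trafo)"},
)
print(model.fit().summary())   # condition estimates with standard errors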

Panel (C) shows the estimates for the proportion π(1). Here, also a strong dependency on the choice of the preprocessing strategy is observed. The median over all experimental conditions is estimated as 0.606, i.e. around 60% of the cells belong to the subpopulation H1, which binds less insulin. For experimental conditions with little or even no insulin, the subpopulation H2 is estimated as larger than H1. In the next section, it will be shown by a simulation study that this outcome is related to an estimation bias.


Figure 3.9: In panel (A), the bootstrap approach for the evaluation of the estimation bias of ∆µ̂ is summarized. First, representative noise distributions have been taken from the experimental data obtained without an insulin treatment. In the second step, bootstrap samples for both subpopulations have been drawn. Then, a shift ∆µ is introduced and estimated from the merged data of both populations by the univariate mixture model. Panel (B) illustrates the estimation bias in the case of a unimodal distribution, i.e. for a small difference ∆µ. In the unimodal case, the disagreement of both expectations is overestimated. In panel (C), the outcome of the simulation study is summarized. The bias occurs below a critical threshold of ∆µ̂ ≈ 0.4. The different colors indicate the results obtained for noise realizations from different cell preparations.

3.3.2 Estimation bias

For the negative controls, i.e. for the measurements without any insulin treatment, only the background fluorescence is measured and therefore identical outcomes are expected for both subpopulations of viable hepatocytes. Therefore, the estimated expectations of both univariate Gaussian distributions are expected to coincide. However, Fig. 3.8 shows that µ̂(1) and µ̂(2) differ noticeably for the negative control condition as well as for small insulin stimulations. This observation is related to a bias in the estimation of the parameters µ(h) when a mixture


of two univariate Gaussian distributions is fitted to unimodal data. Even in the perfectly unimodal case, i.e. for µ(1) = µ(2), equal estimates µ̂(1) = µ̂(2) are never realized in practice. Especially for unimodal noise which is not perfectly Gaussian distributed, a fit with µ̂(1) ≠ µ̂(2) is always superior.

For further analyses of this bias, a simulation study has been performed. All available negative control data sets have been chosen as representative observational noise realizations. Bootstrap samples S_1 and S_2 of these noise realizations have been drawn, i.e. data points have been randomly selected with replacement. The proportion of the sizes of both subpopulations

    \pi^{(1)} = |\vec{S}_1| / |\vec{S}_2| := 0.606 = \hat\pi^{(1)}    (3.22)

is assumed as it has been determined previously. Then a mixture of two univariate Gaussian distributions has been fitted to the common set S_1 ∪ S_2 of both bootstrap samples to evaluate the accuracy of the estimates. The bimodality is simulated by a shift

    \vec{S}_2 \to \vec{S}_2 + \Delta\mu \,, \quad \Delta\mu \geq 0    (3.23)

of the resampled data. The simulation study is summarized in Figure 3.9, panel (A). Panel (B) illustrates the bias for a unimodal distribution (left figure). Only if the data obtained for both populations is bimodally distributed (right figure), the estimates of the parameters of the mixture model are unbiased. In panel (C) of Figure 3.9, the median over the bootstrap samples of the estimated shift

    \widehat{\Delta\mu} = \hat\mu_2 - \hat\mu_1    (3.24)

is plotted against the true shift ∆µ. The available noise realizations, i.e. all data sets of the negative control treatment, are displayed in different colors. The 95% quantile over the outcomes of different bootstrap samples is plotted for one data set (black color). Below a threshold of ∆µ̂_crit ≈ 0.4, the shift ∆µ is overestimated. For the extreme case ∆µ = 0, i.e. if both resampled data sets are drawn from the same distribution, the estimates of the shift are still in the range between 0.2 and 0.4.

This simulation indicates that below the threshold ∆µ̂_crit, both distributions cannot be resolved accurately. Then the estimates of the parameters of the mixture distribution are not accurate. On the one hand, this constitutes an intrinsic problem of univariate Gaussian mixture models. On the other hand, the issue is enhanced by the fact that the measurements are not perfectly Gaussian distributed. For simulated data, the threshold for reliable estimates has been found to be at least a factor of two smaller than for the experimental data. In order to account for this bias in the analyses of the experimental data, only estimates µ̂(1) and µ̂(2) with

    \hat\mu^{(2)} - \hat\mu^{(1)} = \widehat{\Delta\mu} > \widehat{\Delta\mu}_{crit} := 0.4    (3.25)

have been used in further analyses. For estimates which do not fulfill this criterion, only the averages

    \hat\mu := \hat\pi^{(1)} \hat\mu^{(1)} + \left(1 - \hat\pi^{(1)}\right) \hat\mu^{(2)}    (3.26)

over both subpopulations are used instead. Also, conclusions on the basis of the estimates σ̂(1), σ̂(2), and π̂(1) are only drawn if the condition (3.25) is fulfilled.
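The essence of this bootstrap study can be sketched in a few lines. The example below uses synthetic Gaussian noise in place of the negative control measurements and scikit-learn's GaussianMixture for the univariate fit; sample sizes and the noise scale are arbitrary illustrative choices.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
noise = rng.normal(loc=1.0, scale=0.3, size=5000)  # stand-in for a negative control

def estimated_shift(delta_mu, pi1=0.606, n=5000):
    """Draw two resampled populations, shift the second by delta_mu and
    return the estimated shift of a two-component Gaussian mixture fit."""
    n1 = int(n * pi1)
    s1 = rng.choice(noise, size=n1, replace=True)
    s2 = rng.choice(noise, size=n - n1, replace=True) + delta_mu
    data = np.concatenate([s1, s2]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5, random_state=1).fit(data)
    mu = np.sort(gmm.means_.ravel())
    return mu[1] - mu[0]

for delta in (0.0, 0.2, 0.4, 0.8):
    est = np.median([estimated_shift(delta) for _ in range(20)])
    print(f"true shift {delta:.1f} -> median estimated shift {est:.2f}")

Already at delta = 0.0, such a run returns a clearly positive estimated shift, which mirrors the bias described above.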


Figure 3.10: Parameter dependency of the observational function for flow cytometry data. The parameter b_1 = 10^{θ_8} determines the sensitivity of the experimental data for low amounts of fluorescent molecules. Below 1/b_1, flow cytometry becomes insensitive. For given b_1, the second parameter b_0 is determined by the measured background B_0. The black dotted line indicates the observational function obtained after parameter estimation.

3.3.3 Dynamic model for insulin binding

The time dependency of insulin binding can be described according to the rate equation approach [Melke et al 2006, Wolkenhauer 2006] by a system

    \dot{I} = -k_a\, I\, R + k_d\, IR    (3.27)
    \dot{R} = -k_a\, I\, R + k_d\, IR    (3.28)
    \dot{IR} = k_a\, I\, R - k_d\, IR    (3.29)

of Ordinary Differential Equations (ODE). I denotes the concentration of unbound insulin molecules, R denotes the concentration of unoccupied insulin receptors and IR the concentration of receptor-insulin complexes. All three concentrations are molecule numbers relative to the volume of the supernatant, i.e. relative to the volume of the insulin solution. The parameter k_a is the association and the parameter k_d the dissociation rate. In our application, the cells are stimulated with insulin. Therefore, the initial conditions

    I(0) := I_0    (3.30)
    R(0) := R_0    (3.31)
    IR(0) = 0    (3.32)

hold at a stimulation with an insulin concentration I_0. The experimental observation that there are two distinct subpopulations of hepatocytes h = 1, 2


is regarded by introducing a subpopulation dependency of the parameters

    R_0 \to R_0^{(h)}    (3.33)
    k_a \to k_a^{(h)}    (3.34)
    k_d \to k_d^{(h)} .    (3.35)

Equivalently, IR^{(1)}(t) and IR^{(2)}(t) are defined as the concentrations of complexes of receptors and insulin for both types of cells. The dynamics of the reactions is then described by

    \dot{I} = -k_a^{(1)} I R^{(1)} - k_a^{(2)} I R^{(2)} + k_d^{(1)} IR^{(1)} + k_d^{(2)} IR^{(2)}    (3.36)
    \dot{R}^{(1)} = -k_a^{(1)} I R^{(1)} + k_d^{(1)} IR^{(1)}    (3.37)
    \dot{R}^{(2)} = -k_a^{(2)} I R^{(2)} + k_d^{(2)} IR^{(2)}    (3.38)
    \dot{IR}^{(1)} = k_a^{(1)} I R^{(1)} - k_d^{(1)} IR^{(1)}    (3.39)
    \dot{IR}^{(2)} = k_a^{(2)} I R^{(2)} - k_d^{(2)} IR^{(2)} .    (3.40)
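For concreteness, the two-subpopulation system (3.36)-(3.40) can be integrated numerically along the following lines; the parameter values below are arbitrary placeholders, not the estimates reported later in Table 3.4.

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, x, ka1, ka2, kd1, kd2):
    """Right-hand side of (3.36)-(3.40); x = (I, R1, R2, IR1, IR2)."""
    I, R1, R2, IR1, IR2 = x
    v1 = ka1 * I * R1 - kd1 * IR1   # net binding flux, subpopulation 1
    v2 = ka2 * I * R2 - kd2 * IR2   # net binding flux, subpopulation 2
    return [-v1 - v2, -v1, -v2, v1, v2]

# Illustrative parameters and initial conditions (3.30)-(3.32)
ka1, ka2, kd1, kd2 = 1e5, 1e5, 0.1, 0.1    # association/dissociation rates
I0, R1_0, R2_0 = 1e-7, 1e-7, 1e-6          # insulin dose and receptor levels
sol = solve_ivp(rhs, t_span=(0, 3600), y0=[I0, R1_0, R2_0, 0.0, 0.0],
                args=(ka1, ka2, kd1, kd2), method="LSODA", dense_output=True)
IR1, IR2 = sol.sol(1800)[3:5]   # complex concentrations after 30 minutes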

Because the parameters could vary by several orders of magnitude, the analyses have been performed in the logarithmic parameter space. A log-transformation of the parameters simplifies the interpretation and graphical visualization of the results. In addition, the nonlinearity of the objective function for parameter estimation, i.e. of the log-likelihood, with respect to the parameters is decreased. After a log-transformation of the parameters, the components of the gradient of the log-likelihood function do not differ by several orders of magnitude, which improves the numerical stability of the parameter estimation procedure. Moreover, the log-likelihood yields more symmetric confidence intervals in the logarithmic parameter space. The model parameters are defined as

    \theta_1 := \log_{10} R_0^{(1)}    (3.41)
    \theta_2 := \log_{10} R_0^{(2)} - \log_{10} R_0^{(1)}    (3.42)
    \theta_3 := \log_{10} k_a^{(1)}    (3.43)
    \theta_4 := \log_{10} k_a^{(2)} - \log_{10} k_a^{(1)}    (3.44)
    \theta_5 := \log_{10} k_d^{(1)}    (3.45)
    \theta_6 := \log_{10} k_d^{(2)} - \log_{10} k_d^{(1)} .    (3.46)

The parameters θ_2, θ_4 and θ_6 have been defined as the differences of the orders of magnitude of the concentrations or rates between both types h = 1, 2 of hepatocytes. This definition is convenient because a major task of the dynamic modeling is the identification of the distinctions between both subpopulations. The flow cytometric measurements had to be analyzed on the logarithmic intensity scale to obtain Gaussian noise and symmetrically distributed estimates of the average amount of insulin binding in both hepatocyte subpopulations. Because the fluorescence intensities are proportional to the molecule concentrations, the concentrations in the dynamic model are linked to the experimental outcomes by logarithmic observational functions

    g_k^{(h)} = g(IR_k^{(h)}) = \log_{10}\!\left( b_1 \left( IR_k^{(h)} + b_0 \right) \right) + \varepsilon \,, \quad \varepsilon \sim N(0, \sigma_k^2)    (3.47)


with the parameter b_0 adjusting for an intensity background, and b_1 mapping the physical units of the concentrations to flow cytometric intensity units. The index k denotes different experimental conditions, i.e. enumerates the measurement times and treatments. The standard deviation σ_k of the observational noise ε is given by the uncertainty of the estimate µ̂_k^{(h)}, i.e. by the standard error SE(µ̂_k^{(h)}) obtained after fitting the mixed effects model (3.21).

Below the critical threshold ∆µ̂_crit, where the estimates µ̂^{(h)} are biased, the predictions

    g(IR_k^{(1)}, IR_k^{(2)}) = \log_{10}\!\left( b_1 \left( \hat\pi^{(1)} IR_k^{(1)} + (1 - \hat\pi^{(1)}) IR_k^{(2)} + b_0 \right) \right)    (3.48)

of the dynamic model are fitted to the average µ̂_k of the measured insulin binding in both subpopulations according to equation (3.26). The parameters b_0 and b_1 are again analyzed after log-transformation

    \theta_7 := \log_{10} b_0    (3.49)
    \theta_8 := \log_{10} b_1 .    (3.50)

The measured outcome for the negative control, i.e. the background obtained without insulin treatment, relates the values of both parameters. For a background B_0, it holds

    b_0 = \frac{10^{B_0}}{b_1} .    (3.51)

Figure 3.10 shows the dependency of the observational function for the estimated background B_0 = 0.289. For FITC concentrations below 1/b_1 = 10^{−θ_8}, flow cytometry measurements become insensitive. Below this threshold, the observational function is noticeably nonlinear and concentration changes cannot be resolved optically from the background measurements. The black dashed line indicates the observational function for the parameter estimates b̂_0 and b̂_1, i.e. after fitting of the dynamic model to the data.
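The saturation of the observational function can be illustrated directly. The sketch below evaluates g for the estimates of Table 3.4 (θ̂_7 = −7.82, θ̂_8 = 8.11) and checks the background relation (3.51); treating (3.47) deterministically, i.e. without the noise term, is an illustrative simplification.

import numpy as np

theta7, theta8, B0 = -7.82, 8.11, 0.289
b0, b1 = 10.0 ** theta7, 10.0 ** theta8

def g(c):
    """Observational function (3.47) without the noise term."""
    return np.log10(b1 * (c + b0))

print(g(0.0))                                        # background, close to B0 = 0.29
print(np.isclose(b0, 10.0 ** B0 / b1, rtol=1e-2))    # relation (3.51)
for c in (1e-10, 1e-9, 1e-8, 1e-7, 1e-6):
    print(f"{c:.0e} -> {g(c):.2f}")                  # flat below roughly 1/b1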

The ODE system (3.36)-(3.40) has been integrated numerically. In order to reliably estimate the parameters, the first derivatives of the objective function with respect to the model parameters are required. These derivatives, i.e. the components of the gradient

    \nabla_j L(\vec\theta) = \frac{dL(\vec\theta)}{d\theta_j} = \sum_{ikh} \frac{\partial L}{\partial g_k^{(h)}} \frac{\partial g_k^{(h)}}{\partial x_i} \frac{dx_i}{d\theta_j} + \sum_{kh} \frac{\partial L}{\partial g_k^{(h)}} \frac{\partial g_k^{(h)}}{\partial \theta_j}    (3.52)

of the log-likelihood L(\vec\theta) depend on the sensitivities

    S_{ij}(t, \vec\theta) := \frac{dx_i(t, \vec\theta)}{d\theta_j} ,    (3.53)

i.e. on the derivatives of the predicted concentrations

    \vec{x}(t, \vec\theta) := \left( I(t, \vec\theta),\, R^{(1)}(t, \vec\theta),\, R^{(2)}(t, \vec\theta),\, IR^{(1)}(t, \vec\theta),\, IR^{(2)}(t, \vec\theta) \right)    (3.54)

with respect to the parameters \vec\theta. The index k in equation (3.52) denotes again the experimental conditions which determine the time point t as well as the observational function g_k = g(\vec{x}(t_k)).


Figure 3.11: Panel (A) shows the experimental data used for the establishment of the dynamic model. The blue color indicates population h = 1, the red color h = 2, and the gray color the average over both populations. The black data points are averages below the bias threshold which are used for fitting. The gray data points are averages above the threshold and are not used for parameter estimation. In panel (B), the same data is displayed to illustrate the agreement of the predictions of the dynamic model M0 with the experimental outcomes. The dose response is depicted in the first plot, the binding kinetics are shown in the other plots.

The sensitivities S_{ij} have been determined by numerical integration of the sensitivity equations

    \dot{S}_{ij}(t, \vec\theta) = \frac{d}{dt} \frac{dx_i(t, \vec\theta)}{d\theta_j}    (3.55)
    = \frac{d}{d\theta_j} \frac{dx_i(t)}{dt}    (3.56)
    = \frac{d}{d\theta_j} f_i(\vec{x}(t, \vec\theta), \vec\theta)    (3.57)
    = \sum_k \frac{\partial f_i}{\partial x_k} \frac{dx_k(t, \vec\theta)}{d\theta_j} + \frac{\partial f_i}{\partial \theta_j}    (3.58)
    = \sum_k \frac{\partial f_i}{\partial x_k} S_{kj}(t, \vec\theta) + \frac{\partial f_i}{\partial \theta_j}    (3.59)

where f_i denotes the right-hand side of the differential equations (3.36)-(3.40) of the i-th dynamic variable x_i. The initial conditions for the integration of the sensitivity equations are

    S_{ij}(0) = \begin{cases} 1 & \text{for } j = 1,\; i = 2 \\ 1 & \text{for } j = 2,\; i = 3 \\ 0 & \text{otherwise,} \end{cases}    (3.60)

i.e. the initial conditions are equal to one for the parameters defining the initial concentrations. The parameters of the observational function θ_7 and θ_8 do not contribute to the sensitivities. Also,


the derivatives

    \frac{\partial f_i}{\partial \theta_j} = 0 \,, \quad \forall i \,, \ \text{for } j \in \{1, 2\}    (3.61)

with respect to the initial receptor concentrations θ_1, θ_2 vanish. The ODE system (3.36)-(3.40) is denoted as model M0. Figure 3.11, panel (A) shows the experimental data used for the calibration of the parameters of the dynamic model. Panel (B) shows the agreement of the model predictions with the experimental data after parameter estimation. For model M0, χ²(θ̂) = −2 L(θ̂) = 15.98 is obtained for the 38 different µ̂_k^{(h)} and µ̂_k. Here, a significant rejection according to a 95% significance level would require χ²(θ̂) > 53.4.
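As an illustration of equations (3.55)-(3.59), the following sketch augments the single-population model (3.27)-(3.29) with its sensitivity equations with respect to k_a and k_d; the hand-coded Jacobians and the parameter values are illustrative assumptions, not the original implementation.

import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, z, ka, kd):
    """States x = (I, R, IR) followed by the sensitivities S = dx/d(ka, kd),
    stored as a flattened 3x2 matrix."""
    x, S = z[:3], z[3:].reshape(3, 2)
    I, R, IR = x
    v = ka * I * R - kd * IR
    f = np.array([-v, -v, v])
    # df/dx and df/dtheta of the single-population model
    dfdx = np.array([[-ka * R, -ka * I,  kd],
                     [-ka * R, -ka * I,  kd],
                     [ ka * R,  ka * I, -kd]])
    dfdtheta = np.array([[-I * R,  IR],
                         [-I * R,  IR],
                         [ I * R, -IR]])
    Sdot = dfdx @ S + dfdtheta          # equation (3.59)
    return np.concatenate([f, Sdot.ravel()])

ka, kd = 1e5, 0.1                        # illustrative rates
x0 = [1e-7, 1e-6, 0.0]                   # I0, R0, IR(0)
z0 = np.concatenate([x0, np.zeros(6)])   # S(0) = 0: ka, kd do not set x(0)
sol = solve_ivp(augmented_rhs, (0, 600), z0, args=(ka, kd), method="LSODA")
S_end = sol.y[3:, -1].reshape(3, 2)      # dx_i/dtheta_j at the final time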

The parameters \vec\theta have been estimated by maximum likelihood. Because the noise is assumed to be Gaussian, maximum likelihood is equivalent to least squares estimation. Throughout, π̂(1) = 0.605 has been assumed as determined in Sec. 3.3.1, although the results are not sensitive to this assumption. The outcomes of the parameter estimation procedure are summarized in Table 3.4.

The confidence intervals have been determined by the profile likelihood approach [Raue et al 2009]. The profile likelihood

    L_{PL}(\theta_p) = \max_{\theta_{j \neq p}} \ln \rho(y | \theta_1, \dots, \theta_8)    (3.62)

for a parameter θ_p is given by the maximization of the log-likelihood ln ρ(y|θ_1, ..., θ_8) over all other parameters θ_{j≠p}. The confidence interval C_α(θ̂_p) of a parameter estimate θ̂_p to a predefined confidence level α is then given by the domain in the parameter space

    C_\alpha(\hat\theta_p) = \{ \theta_p \,|\, L_{PL}(\theta_p) \geq L_{crit}(\alpha) \}    (3.63)

where the profile likelihood L_{PL} is greater than or equal to the threshold

    L_{crit}(\alpha) := L(\hat\theta_1, \dots, \hat\theta_8) - \tfrac{1}{2}\, \mathrm{cdf}^{-1}(\chi^2_1, \alpha) ,    (3.64)

i.e. the maximum of the likelihood minus half the α quantile

of the χ² distribution with one degree of freedom. Here, cdf⁻¹(χ²_1, α) denotes the α-quantile of the chi-square distribution, i.e. the inverse cumulative distribution function of the chi-square distribution evaluated at the level α. In terms of the least-squares approach and with χ²_{PL} := −2 L_{PL}, this is equivalent to

    C_\alpha(\hat\theta_p) = \left\{ \theta_p \,|\, \chi^2_{PL}(\theta_p) \leq \chi^2_{crit}(\alpha) \right\}    (3.65)

with

    \chi^2_{crit}(\alpha) := \chi^2(\hat\theta_1, \dots, \hat\theta_8) + \mathrm{cdf}^{-1}(\chi^2_1, \alpha) .    (3.66)
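A generic profile likelihood scan following (3.62)-(3.66) can be written in a few lines. In the sketch below, chisq(theta) is a placeholder for the model's χ² objective; at each grid point of the profiled parameter, the remaining parameters are re-optimized. The toy objective at the end is purely illustrative.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def profile(chisq, theta_hat, p, grid, alpha=0.95):
    """Profile chi^2 of parameter p on a grid; returns the profile values
    and a Boolean mask of grid points inside the confidence interval."""
    chi2_crit = chisq(theta_hat) + chi2.ppf(alpha, df=1)   # equation (3.66)
    free = [j for j in range(len(theta_hat)) if j != p]
    start = np.asarray(theta_hat, dtype=float)
    prof = []
    for val in grid:
        def fixed_obj(free_vals):
            th = start.copy()
            th[p] = val
            th[free] = free_vals
            return chisq(th)
        res = minimize(fixed_obj, start[free], method="Nelder-Mead")
        start[free] = res.x             # warm start for the next grid point
        prof.append(res.fun)
    prof = np.array(prof)
    return prof, prof <= chi2_crit      # equation (3.65)

# Toy example with a quadratic objective around theta = (1, 2):
chisq = lambda th: (th[0] - 1.0) ** 2 / 0.1 + (th[1] - 2.0) ** 2
grid = np.linspace(0.0, 2.0, 41)
prof, inside = profile(chisq, theta_hat=[1.0, 2.0], p=0, grid=grid)
print(grid[inside].min(), grid[inside].max())   # approximate 95% CI for theta_0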

Figure 3.12 shows the profile likelihood and the 95% confidence intervals C_{0.95}(θ̂) for all parameters. The parameter estimates are indicated by the vertical black lines. For the parameters θ_2, θ_3 and θ_4, the function χ²_{PL} = −2 L_{PL} shows a clear minimum and rises above the threshold χ²_{crit}(α) for α = 0.95. This is plotted in panels (B)-(D).

The parameters θ_5 and θ_6 are practically non-identifiable to a confidence level α = 0.95 (see panels (E) and (F)). Here, the profile likelihood does not yield finite confidence intervals. The profile likelihood for parameter θ_6 has three local minima, showing a qualitatively distinct dynamic


Figure 3.12: The profile likelihood functions yield confidence intervals (gray background) for the parameter estimates, which are indicated by vertical lines. The estimate θ̂_2 = log10(R_0^{(2)}) − log10(R_0^{(1)}) and its respective confidence interval show that the second subpopulation has more than one order of magnitude more receptors. The order of magnitude of the binding rate θ̂_3 = log10(k_a^{(1)}) is identified in the range between 4.6 and 5.7. For the parameters θ_4 and θ_6 indicating differences between both populations, the confidence regions contain zero on the log-scale (panels (D) and (F)), i.e. both populations did not show significant differences in insulin binding and dissociation rates. The parameter θ_5 is practically not identifiable, i.e. its confidence interval does not have a finite size. Here, the lower confidence limit for θ̂_5 could not be identified by the given measurements, i.e. a dissociation rate k_d^{(1)} of zero is not significantly rejected by the experimental data. θ_6 is only identifiable towards +∞ due to constraints of the parameter space. The profile likelihood which would be obtained for the corresponding issue without any constraint of the parameter space is plotted as a dashed colored line. Thereby, the sensitivity of the profile likelihood and the confidence intervals to the constraints is evaluated. The colored circles indicate restrictions of the profile likelihood due to boundaries of the parameter space. Three scenarios are distinguished: Red circles indicate a constraint of the parameter of interest. In this case, the boundary of the confidence interval coincides with the boundary of the parameter space. Purple and yellow circles indicate that another parameter gets restricted. If the shape of the profile likelihood is sensitive to such a restriction, the circles are plotted in purple. Otherwise, i.e. if there is no visible impact on the profile likelihood, the circles are plotted in yellow. The black arrows in panel (F) indicate local optima of the profile likelihood.


Figure 3.13: The interrelation of the parameters as obtained during the profile likelihood evaluation is plotted for all eight parameters. The curves show how the other parameters are adjusted to explain the data if the parameter on the horizontal axis is altered. As in Figure 3.12, the circles indicate points where a parameter is restricted due to a constraint of the parameter space. The black arrows in panel (F) indicate local optima of the profile likelihood.

behavior. The two local maxima are indicated by black arrows in panel (F) of Figures 3.12 and 3.13. If θ_6 is changed beyond the two local maxima, i.e. for θ_6 < −0.44 or θ_6 > 0.18, the receptor numbers required to explain the experimental data increase noticeably. At the same time, the estimated dissociation rate θ_5 decreases. In these scenarios, the model predicts that more insulin from the supernatant is bound by the cells. As an example, for a stimulation with 100 nM, the model prediction at the global minimum is that after thirty minutes less than 50% of the insulin is bound to the cells. For the two local minima, the model predicts that more than 70% of the insulin is bound by the cells. Therefore, measuring the amount of remaining insulin in the supernatant under this experimental condition would be an efficient experimental setup to distinguish these scenarios.

The impact of constraints of the parameter domain on the confidence intervals has been evaluated by the profile likelihood functions for the corresponding unconstrained issue. In Figure 3.12,


Parameter   Meaning                                   Estimate   95% confidence interval   Constraints   Estimate for M_ana
θ̂_1         log10(R_0^{(1)})                           -7.23      [-8.53**, -6.74]          [-10, -3]     -7.34
θ̂_2         log10(R_0^{(2)}) − log10(R_0^{(1)})        1.209      [1.004, 1.412]            [0, 2]        1.19
θ̂_3         log10(k_a^{(1)})                           5.03       [4.60, 5.71**]            [0, 10]       5.01
θ̂_4         log10(k_a^{(2)}) − log10(k_a^{(1)})        -0.0969    [-0.839**, 0.341]         [-1, 1]       -0.0892
θ̂_5         log10(k_d^{(1)})                           -0.862     [-10*, 0*]                [-10, 0]      -0.652
θ̂_6         log10(k_d^{(2)}) − log10(k_d^{(1)})        -0.1617    [-2*, 1*]                 [-2, 1]       -0.2822
θ̂_7         log10(b_0)                                 -7.82      [-9*, -7.31]              [-9, 1]       -7.97
θ̂_8         log10(b_1)                                 8.11       [7.61, 9.35**]            [0, 10]       8.25

Table 3.4: Estimated parameters of the dynamic model. Parameters which could only be identified due to constraints of the parameter space are indicated by stars at the boundaries of the confidence interval. A single star means that the boundary is equal to the constraint of the considered parameter. Boundaries of the confidence region which are determined by a constraint of another parameter are indicated by two stars. The parameter estimates obtained for the analytical approximation are given in the last column.

these profile likelihood functions are plotted as dashed lines. The sensitive points of the profile likelihood are highlighted by purple circles. Here, the profile likelihood is noticeably affected by the constraints. Restrictions without a visible impact on the profile likelihood are indicated by yellow circles. For the identification of the parameters θ_1 and θ_8, the constraints are crucial. These two parameters are not identifiable to the 95% confidence level exclusively on the basis of the available experimental data. In Table 3.4, the confidence intervals which coincide with the boundary of the domain of the parameter of interest are indicated by a star. Boundaries which are sensitive to a constraint of other parameters are highlighted by two stars.

Figure 3.13 shows the relationships of the parameters during the calculation of the profile likelihood. The nonlinearity of these curves indicates the nonlinearity of the model with respect to the parameters. For each profile likelihood, several or even all parameters had to be adjusted. This points out that the estimates of the parameters are highly interrelated.

The model M0 has three population-dependent parameters (3.33)-(3.35). In addition to M0, three alternative models with a single population-independent parameter have been evaluated. For a model M1, the receptor number is assumed as population independent, i.e. R_0^{(1)} = R_0^{(2)}. Model M2 has equal association rates k_a^{(1)} = k_a^{(2)}, and model M3 has the same dissociation rate k_d^{(1)} = k_d^{(2)} for both populations.

Model M1 with equal receptor numbers in both types of hepatocytes is significantly rejected (χ² = 36.92) with a p-value of p = 4.7 × 10⁻⁶ obtained by a likelihood ratio test. This result is equivalent to the fact that the confidence region of θ̂_2 does not contain the value θ_2 = 0. This outcome suggests that both subpopulations of hepatocytes differ by the number of receptors per cell.

The model M2 with k_a^{(1)} = k_a^{(2)} is not rejected (χ² = 16.74) with p = 0.74. Also, model M3 with k_d^{(1)} = k_d^{(2)} could not be rejected. Here, χ² = 17.04 and p = 0.52 have been obtained. Moreover, a model M4 with k_a^{(1)} = k_a^{(2)} and k_d^{(1)} = k_d^{(2)} is in agreement with the experimental data (χ² = 17.07, p = 0.50). Therefore, the measured insulin binding does not indicate any distinctions of the association and dissociation rates in both subpopulations.
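Two of the quoted numbers can be reproduced directly with scipy, assuming a likelihood ratio test with one fixed parameter for M1 and a goodness-of-fit threshold with 38 degrees of freedom for the 38 data points; these degrees-of-freedom choices are inferred from the values reported above.

from scipy.stats import chi2

# 95% rejection threshold for the goodness-of-fit of model M0:
print(chi2.ppf(0.95, df=38))           # about 53.4, as quoted above

# Likelihood ratio test of M1 against M0 (one parameter fixed, df = 1):
print(chi2.sf(36.92 - 15.98, df=1))    # about 4.7e-06, the quoted p-value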

Another interesting outcome is that a model with θ_6 = log10(k_d^{(2)}) − log10(k_d^{(1)}) = −∞ is not rejected (χ² = 17.49, p = 0.35). Therefore, the dissociation rate of the second subpopulation of hepatocytes could potentially be very small, or even zero.

The ODE system (3.27)-(3.29) for a single type of cells is solved by

    I(t) = \frac{B}{2 k_a} + \sqrt{\frac{A}{k_a} + \frac{B^2}{4 k_a^2}}\, \tanh\!\left( \sqrt{k_a A + \frac{B^2}{4}}\; t + \frac{1}{2} \ln \frac{-B + \sqrt{4 k_a A + B^2} + 2 I_0 k_a}{B + \sqrt{4 k_a A + B^2} - 2 I_0 k_a} \right)    (3.67)

with

    A := k_d \left( I(0) + IR(0) \right)    (3.68)
    B := k_a \left( I(0) - R(0) \right) - k_d .    (3.69)

In this case, the time dependency of the unoccupied receptors would be given by

    R(t) = R(0) + I(t) - I(0)    (3.70)

and that of the kinetics of the complexes by

    IR(t) = IR(0) - I(t) + I(0) .    (3.71)

The tanh is a sigmoidal function with codomain between minus one and one. For arguments close to zero, the tanh function is linear with a slope equal to one. A is proportional to the total number of insulin molecules, and B + k_d is proportional to the number of ligands minus the number of receptors. For large insulin amounts, i.e. either for a high concentration or for a large volume of the supernatant, the decrease of insulin in the supernatant due to binding would be small. Then, the competition between the populations' binding sites for insulin ligands would be negligible and the binding dynamics of both subpopulations would decouple. In this case, the analytical solution could be used for both types of hepatocytes.

The analytically solvable approximation has been evaluated as an additional model M_ana. This approximation yields similar shapes of the profile likelihoods for the parameters θ̂_1, ..., θ̂_4, θ̂_7, and θ̂_8. Here, the estimates differ by less than 2% from the estimates of the exact model. For θ̂_5 and θ̂_6, which correspond to the dissociation rates, the ratios of the estimated parameters are 0.76 for θ̂_5 and 1.74 for θ̂_6. The parameter estimates obtained for the analytical approximation are given in Table 3.4.
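The closed-form solution (3.67) is easy to verify against a numerical integration of (3.27)-(3.29). In the sketch below, the parameter values are arbitrary placeholders chosen only to exercise the formula; the logarithm is evaluated with complex arithmetic so that it remains well-defined when its argument becomes negative, which is equivalent to switching the tanh to a coth branch.

import numpy as np
from scipy.integrate import solve_ivp

ka, kd, I0, R0 = 1e5, 0.1, 1e-7, 1e-6   # illustrative values
A = kd * I0                              # (3.68) with IR(0) = 0
B = ka * (I0 - R0) - kd                  # (3.69)

def I_analytic(t):
    """Equation (3.67)."""
    root = np.sqrt(4 * ka * A + B ** 2)
    phase = 0.5 * np.log((-B + root + 2 * I0 * ka) / (B + root - 2 * I0 * ka) + 0j)
    z = np.sqrt(ka * A + B ** 2 / 4) * t + phase
    return (B / (2 * ka) + root / (2 * ka) * np.tanh(z)).real

def rhs(t, x):
    I, R, IR = x
    v = ka * I * R - kd * IR
    return [-v, -v, v]

t = np.linspace(0, 600, 50)
sol = solve_ivp(rhs, (0, 600), [I0, R0, 0.0], t_eval=t, rtol=1e-10, atol=1e-14)
print(np.max(np.abs(sol.y[0] - I_analytic(t))))   # should be close to zero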

3.4 Discussion

Flow cytometry allows for single cell measurements of the abundance of molecular compounds of interest. On the one hand, flow cytometry enables a distinction between different cell types by the


evaluation of the scattered light. On the other hand, the assignment of the cells to the cell types of interest constitutes a major obstacle in the data analysis. In statistics, the issue of finding such an assignment is termed classification. Because in our application no data sets with a correct class assignment have been available, the classification issue had to be solved in an unsupervised manner. For this purpose, a mixture of two bivariate Gaussian distributions has been utilized in combination with the EM-algorithm to determine the posterior distributions and to assign the cells to the classes of interest.

The EM-algorithm is widely applied to unsupervised classification problems, e.g. in medical research [Pickering & Forbes 1984], biochemistry [Craig & Liao 2007], image analysis [Tsai et al 2005] or astronomy [Mukherjee et al 1998], but there is a variety of further well-established unsupervised classification or clustering methods which are related to the mixture model approach [Alpaydin 1998, Finn et al 2008, Hastie et al 2001, Jain & Dubes 1988]. A closely related method is k-means clustering [Hartigan 1973]. Here, a measure for the similarity of two features, e.g. the Euclidean distance, is used to identify similar items. Standard k-means clustering involves an iterative assignment of the items to the classes and estimation of the class centers based on the current assignment. In contrast to the mixture model and the EM-algorithm, the assignment to the classes is exclusive, i.e. the variables indicating the class membership are either zero or one. Another distinction is that the spreads within the different classes are not considered for the similarity measure. Fuzzy C-means clustering [Bezdek 1981, Hathaway & Bezdek 2001] generalizes the concept of k-means clustering by allowing arbitrary values for the class memberships. Here, the membership variable is the posterior probability adjusted by an exponent. This exponent is called the amount of fuzziness. In the case of Gaussian noise, the mixture model corresponds to an exponent equal to one. From this point of view, fuzzy C-means generalizes the EM-algorithm for mixture models. However, standard fuzzy C-means is based on the Euclidean distance and is therefore restricted to a corresponding mixture of Gaussian distributions with diagonal covariance matrices. Other unsupervised classification methods comprise for example self-organizing maps (SOM) [Kohonen 1995] and neural networks [Sanger 1989].

Recently, the applicability of the mixture model approach to flow cytometry data has been shown [Boedigheimer & Ferbas 2008, Lo et al 2008]. A comparison with other data processing strategies, e.g. with the widely used selection of cells by hand, would require a precise definition of the cells of interest as well as an objective function for the assessment of the performance of an approach [Rand 1971]. If a cell type classification approach could be combined with fluorescence activated cell sorting (FACS), the performance of a classification approach could be evaluated by comparing the numbers of correctly sorted cells. A further aspect is that the performances of different classification approaches are preferably compared on the basis of data sets which are independent of the data used for the establishment of a method [Salzberg 1997, Simon et al 2003]. This requisite allows for an unbiased and more objective appraisal.

In this chapter, the data processing, i.e. the parameter estimation of the bivariate Gaussian mixture model, is performed for each of the data sets separately. Another strategy would be a common analysis of all available data sets. Then, a single mixture model could be used for all comparable sets of experiments. Such an approach would allow for a more robust estimation of the parameters


of the bivariate mixture model with respect to outlier data sets and with respect to varying occurrences of the different cell types. However, due to daily varying photomultiplier settings [Kraan et al 2003] or inhomogeneities of the cell preparations, it is a priori ambiguous which data sets could be considered as "comparable". Further drawbacks of a holistic preprocessing strategy are the increased numerical complexity and the fact that the results depend on all utilized measurements. In a holistic approach, even adding or removing a single data set would change the outcomes for all measurements. Then, the results obtained for experimental repetitions cannot be considered as independent, which obviates the calculation of error bars on the basis of experimental replicates. Nevertheless, a comparison of the outcomes of an approach based on analyses of single data sets with a holistic method could allow the establishment of more refined strategies for quality control and outlier detection.

Because each data set consists of more than ten thousand data points, the parameters of the bivariate mixture model could be estimated reliably. The EM-algorithm converged for all data sets in the 2D-analysis. However, especially for weak insulin stimulations leading to an almost unimodal measurement distribution, it has in rare cases been observed that the EM-algorithm did not converge. In our application, these cases coincide with the occurrence of the estimation bias. A threshold for the difference µ̂(2) − µ̂(1) has been determined by a simulation study to identify the experimental conditions allowing for a reliable subpopulation-resolved analysis. Here, an alternative strategy could potentially be an extension of the EM-algorithm allowing for an automatic reduction of the number of mixture distributions.

The kinetic model established in this chapter has been used to evaluate the dose dependency and the behavior of both subpopulations. For this purpose, it was sufficient to describe the binding dynamics with a single type of binding sites on a single cell. In such a description, the insulin receptors have been used as representative terms for the binding sites on the cellular surface. The potential existence of multiple types of binding sites on a single cell, e.g. receptors with different affinities, was not considered here.

3.5 Summary

In this chapter, flow cytometry has been used to analyze insulin binding of primary hepatocytes at the single cell level. Gaussian mixture models have been utilized to automatically detect the viable hepatocytes in the 2D-analysis and to evaluate the amount of insulin binding in the 1D-analysis. The measurements have been analyzed for different transformations, i.e. after logarithmic, asinh, and Box-Cox transformations. In the 2D-analysis, the thresholds for the posterior probabilities allow for an intuitive adjustment of the stringency of the assignment. In addition, the estimated parameters of the mixture model allow for an identification of outlier measurements. In the 1D-analysis, two subpopulations with a distinct behavior concerning insulin binding have been found within the viable hepatocytes. This result has a very high biological impact because hepatocytes are strongly involved in glucose regulation in the body and insensitivity to insulin is known to be a major reason for the development of diabetes type II.

A univariate mixture model has been utilized to analyze the bimodally distributed data, i.e. to


Figure 3.14: Summary of the applied analyses. Mixture models have been applied after different data transformations to select the cells of interest, namely the viable hepatocytes, and to evaluate the insulin binding in these cells. Two distinct subpopulations have been found within the viable hepatocytes. Then, the impact of the data processing parameters on the outcomes has been investigated. Biased estimates have been found for treatments with a small amount of insulin. After the estimation of the time and treatment dependency and the identification of the unbiased estimates, a dynamic model of insulin binding has been established.

estimate the expectations and variances of insulin binding in both subpopulations. The impact of the data processing parameters, e.g. of the chosen transformation or thresholds, has been studied by statistical models. Thereby, the experimental and processing parameters which affect the outcomes significantly could be identified. For the expectations, i.e. the average binding in both subpopulations, a moderate impact of the data processing has been observed. However, an estimation bias has been found for small insulin stimulations. The estimates of the cell-to-cell variability have been found to be sensitive to the data processing. Therefore, and because of the bias for small insulin treatments, a quantitative interpretation of the estimated cell-to-cell variability is restricted.

The outcomes obtained for data processing strategies which did not differ significantly have been averaged in order not to artificially increase the apparent sample size, which would result in under-estimated confidence intervals. The results for the processing strategies which differ significantly have been condensed by a mixed-effects model. Similar strategies are used in the literature for statistical meta-analyses [Brockwell & Gordon 2001, Harris Cooper 1994], i.e. if results from several independent studies are combined. There, the equivalent approach is called random-effects meta-regression [Bellavance et al 2009, Berkey et al 1995].

A dynamic model for insulin binding in both hepatocyte subpopulations has been established. This enabled the estimation of the number of binding sites and of the association and dissociation


rates in both subpopulations of hepatocytes. In addition, it could be shown that flow cytometry is insensitive for quantifying small amounts of fluorescent molecules. In our application, the detection threshold has been determined to be in the range of around 10 nM.

Altogether, 347 data sets have been analyzed under 26 different experimental conditions. Figure 3.14 shows a summary of all steps of the analysis. Steps 1-3 comprise the processing of single data sets. Then, the sources of systematic errors were identified (step 4), and the dependency on time and treatment has been estimated with corresponding confidence intervals (step 5). In step 6, the experimental conditions which allow for subpopulation-specific analyses have been identified. For this purpose, the measured distributions of the amount of bound insulin have to be separated, which requires a sufficiently large insulin dose. Finally, a dynamic model has been established on the basis of the determined time and treatment dependencies to investigate the binding kinetics, i.e. to identify the distinctions of both subpopulations at the level of molecular interactions between the receptors and the insulin molecules. The receptor number has been shown to differ by more than a factor of ten. No significant differences of the association and dissociation rates have been found.

3.6 Conclusions

The data processing strategy introduced in this chapter could serve as a general method for the establishment of dynamic models on the basis of single cell experimental data in systems biology. The issues solved within this chapter in advance of the dynamic modeling, namely the selection of the cells of interest, the identification of sources of systematic noise and outlier data sets, the evaluation of the sensitivity to the data processing strategy, and the averaging over experimental repetitions and competing data processing strategies with a parallel adjustment for the systematic noise, are very general and frequently occurring aspects in quantitative molecular biology.

In this chapter, the standard rate equation approach describing the time dependency of the concentration averaged over a population of cells has been extended to account for the distinct behavior of both hepatocyte subpopulations. This constitutes a first step towards the methodological extensions required in systems biology to account for heterogeneities between several types of cells or even for the cell-to-cell variabilities at the level of the reaction kinetics. For this purpose, the classical ODE based approaches in systems biology have to be extended to model and infer the parameter variations between cell types or single cells on the basis of experimental data.

From the experimental point of view, a major restriction of the flow cytometry technique for dynamic single cell modeling is that single cells can be evaluated only once. This corresponds to a cross-sectional experimental design, i.e. a population of cells can only be evaluated under a single experimental condition. However, for the evaluation of the distribution of the molecule abundances and the kinetic parameters between different cells, longitudinal studies are required, i.e. the cells have to be evaluated over time to more efficiently infer variations of the reaction kinetics.

Because of the physiological and clinical importance of the insulin signaling pathway, the


discovery of the two subpopulations is very relevant from the biological point of view. On the one hand, the insulin pathway in hepatocytes mediates the physiological downregulation of glucose in the blood. Insensitivity of the insulin pathway is the major dysfunction in diabetes type II. This resistance to insulin is not yet understood at the molecular level. The discovery of further regulation processes at the cellular level, e.g. of the number or the affinity of the receptors, could provide new hypotheses concerning the underlying malfunctions in diabetes. On the other hand, the insulin pathway is also known to be essential during the regeneration processes of the liver, i.e. during the process of cell division of hepatocytes.

At the moment, there are several hypotheses to explain the diverse behavior of the two subpopulations of hepatocytes concerning insulin binding. The hypothesis that some cells express another receptor type, the so-called splice variants IR-A and IR-B [McClain 1991], could already be rejected experimentally by measuring the transcribed mRNA sequences by reverse transcription polymerase chain reaction (PCR). Currently, experiments are performed to identify the physiological or morphological differences of both types of hepatocytes. The fact that these two subpopulations are not observed in cell line cultures emphasizes the importance of the cellular environment of the cells in the tissue. Therefore, spatial inhomogeneities could explain the experimental outcome, e.g. cells of the periportal or pericentral regions of the liver could differ. However, immunofluorescent microscopy of tissue sections could not substantiate this conjecture. Another explanation would be that the affinity to insulin changes during the cell cycle, i.e. depends on the state with respect to cell division.

Recently, we found that the DNA content of the hepatocytes differs between both cell populations. It is known that many hepatocytes are polyploid, i.e. they contain more than two copies of the chromosomes. This could occur either by multinucleation, i.e. two or even more nuclei in a cell, and/or by more than two copies of the chromosomes in a single nucleus [Gupta 2000]. The increased amount of DNA in polyploid cells could change the transcription and the abundance of the insulin receptors, which could explain the occurrence of the two hepatocyte subpopulations. Our latest experimental results indicate that the affinity to insulin highly anti-correlates with polyploidy, i.e. cells which bind more insulin have less DNA content. Primarily, the polyploidy of the nuclei seems to explain our experiments. The multinucleation has apparently a minor impact. However, this issue is still under investigation.

For this project, I thank Dr. María Bartolomé-Rodríguez and her group for their great experimental efforts and for the support concerning the biological background of insulin signaling.


4 Experimental design in systems biology

A representative experimental setup is a prerequisite for the generalization of experimental results. Although this demand is usually adequately fulfilled in most applications in physics, complex systems like living cells are often characterized by heterogeneity between the evaluated samples. The diversity of the samples, which leads to systematic variations of the experimental outcomes, requires appropriate experimental planning to allow for valid interpretations.

The major concepts of optimal experimental design theory were already developed at the beginning of the 20th century. At that time, the need to improve the efficiency of industrial and agricultural trials drove the development of enhanced experimental planning. The theory of experimental design developed at that time by Ronald Aylmer Fisher already comprised modern strategies like randomization, replication and blocking. Nowadays, the methodology for dealing with heterogeneity of the experimental samples, e.g. in biomedical research, is well-established. In addition, it is known how to utilize the samples' covariables to account and adjust for systematic errors in the experimental outcomes. However, despite the use of advanced statistical approaches, large studies are usually required to validate hypotheses in such heterogeneous experimental settings. As an example, in clinical trials often thousands of patients are studied over years to confirm interrelations between diseases and pharmaceutical treatments. Because of this circumstance, basic research as performed in molecular and systems biology is usually focused on small biochemical modules and on cell lines, i.e. on clones of a single cell, to gather knowledge without laborious trials.

However, along with the efforts to study and understand cellular responses at the molecular level in an in vivo situation, biological variability also becomes an indispensable issue in basic research. If primary cells of higher organisms like mammals or even humans are investigated, the reproducibility of cellular clones is no longer given and, in extreme cases, the same extent of biological variability could occur as in clinical trials. For a cell preparation obtained ex vivo, an experimental outcome always depends on the sample of cells which is evaluated, because the behavior and state of cells depend on their environment and history. Therefore, conclusions either have to be drawn based on a representative, i.e. large, pool of experimental data, or heterogeneity has to be controlled by an appropriate design to allow for valid conclusions.


Such design considerations are well-established in classical Biostatistics, e.g. for performing clinical studies. Systems biology, however, has developed from more technical disciplines like chemistry, engineering, and physics and is, for example, strongly influenced by methods applied in nonlinear dynamics. In these research areas, the experimental conditions are much more reproducible, but the applied models are much more difficult to analyze because models described, e.g., by differential equations can exhibit a vast variety of kinetic behavior like oscillations or even chaos. In contrast to phenomenological statistical models, dynamic models are usually highly nonlinear, and all design considerations depend on the true underlying system, i.e. on the model structure and its parameters. Therefore, experimental design considerations are focused on robustness with respect to model or parameter uncertainty and on dynamic aspects like uncovering the model's whole kinetic diversity by applying maximally informative inputs or perturbations to the studied system.

In this chapter, experimental design optimization is discussed for applications in systems biology. Classical design principles established in Biostatistics like randomization, replication, confounding, etc. are linked with established design approaches for dynamical systems. Monte-Carlo approaches are introduced which allow for design optimization for model discrimination and parameter estimation. The heterogeneity of the biological samples can either be included in the mathematical model, or its impact has to be controlled by an appropriate design. This aspect is essential in experimental planning because it enables design optimization with respect to technical and especially biological variability. The results of this chapter have been published in [Kreutz & Timmer 2009].

4.1 Introduction

The development of new experimental techniques allowing for quantitative measurements and the increasing level of knowledge in cell biology enable the application of mathematical modeling approaches for testing and validation of hypotheses and for the prediction of new phenomena. Along with the rising relevance of mathematical modeling, the importance of experimental design issues increases. The term experimental design or design of experiments refers to the process of planning the experiments to permit efficient statistical inference. A proper experimental design enables a maximally informative analysis of the experimental data, whereas an improper design cannot be compensated by sophisticated analysis methods.

Learning by experimentation is an iterative process. Prior knowledge about a system based on literature and/or preliminary tests is used for planning. Improvement of the knowledge based on first results is followed by the design and execution of new experiments which are used for refinement of the knowledge (see Figure 4.1, panel (A)). During the process of planning, this sequential character has to be kept in mind. It is more efficient to adapt designs to new insights than to plan a single, large and comprehensive experiment or study. Moreover, it is recommended to spend only a limited amount of the available resources (25%, [Montgomery 1991]) in the first experimental iteration to ensure that enough resources are available for confirmation runs.

Experimental design considerations require that the hypotheses under investigation and the scope of the study are stated clearly. Moreover, the methods intended to be applied for the analysis


[Figure 4.1 appears here. Panel (A) shows the iterative model building loop: hypotheses, appropriate model(s), experimental design, experiments, parameter estimation, identifiability analysis, model discrimination, validation, final model, conclusions and predictions. Panel (B) shows the planning steps: scope, pooling, way of replication, choice of individuals, choice of perturbations, observables and sampling times, allocation of perturbations etc. to individuals, confounding, sample size, design.]

Figure 4.1: (A) Overview of a standard model building process. Both loops, with and without model discrimination, require an experimental planning step, which is highlighted in gray. (B) The most important steps in experimental planning for systems biological applications.

have to be specified [MEAD 1988]. This dependency on the intended analyses is one reason for the wide range of experimental design methodologies in statistics.

In this chapter, a methodological overview of experimental planning in systems biology is presented. The discussed aspects are displayed in panel (B) of Figure 4.1. Monte-Carlo approaches are introduced for design optimization accounting for restrictions of the experimental feasibility and for the available prior knowledge about the studied biological process. Thereby, the major issue in studying the dynamics of biological systems is approached, namely the identification of an appropriate choice of the sampling times, of the pattern of stimulation, and of the observables. Moreover, it is discussed how design aspects are related to the scope of the outcomes of the study. Also, the benefits of pooling cells from different origins, of randomized sampling, and of replication of the experiments are discussed.


4.2 The design problem

In this section, the design problem for data-based modeling in systems biological applications is stated mathematically in a way that allows for the proposal of optimally informative observables, perturbations, and measurement times.

4.2.1 The mathematical models

The dynamics of a biological process is modeled by a system of ordinary differential equations

    \dot{x} = f(x, u, \theta_x)    (4.1)

where θ_x is a vector containing the dynamic parameters of the model and u = u(t) are the externally controlled inputs to the system, such as stimulation by ligands. Typically, the state variables x = x(t) correspond to concentrations. The initial concentrations x(0) usually also have to be considered as system parameters. The level of detail, i.e. the number of equations and parameters, depends on the hypotheses under investigation. The system dynamics, i.e. the function f, is often derived from the underlying biochemical mechanisms. Such models are called mechanistic models. The discussed principles and the mathematical formalism of experimental design also hold for partial differential equations, delay differential equations, and differential algebraic equations. In fact, all the discussed principles hold for any deterministic relationship between the state variables and also for steady states. In contrast, models containing stochastic relations, e.g. described via stochastic differential equations, would require a more general mathematical formalism at some points.

The definition of the dynamics of x in (4.1) is the biologically relevant part of a mathematical model. Statistical inference requires an additional component

    y(t_i) = g(x(t_i), \theta_y) + \varepsilon(t_i) , \quad \varepsilon(t_i) \sim N(0, \sigma_i^2)    (4.2)

linking the dynamical variables x(t_i) to the measurements y(t_i). Here, independently and identically distributed additive Gaussian noise is assumed, although the following discussion is not restricted to this type of observational noise. The vector θ_y contains all parameters of the observational functions g, e.g. scaling parameters for relative data as well as parameters for further effects, e.g. experimental parameters which account for interfering covariates. For simplicity, θ ∈ P is introduced as the parameter vector containing all n_θ model parameters θ_x and θ_y.

An experimental design D specifies the choice of the external perturbations u, the choice of the observables g, and the number and time points t_i of the measurements. The way of stimulation as well as the times of measurement can usually be controlled by the experimenter. Therefore, they are called independent variables. In contrast, the measured variables y are called dependent variables because their realizations depend on the design and on the system behavior. Note that in the models, equations (4.1) and (4.2), only the dependent variables y are affected by noise. It is assumed that the independent variables, e.g. the sampling times, can be controlled exactly.
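To make the formalism concrete, the following minimal sketch implements a model of the form (4.1) together with an observational function of the form (4.2) in Python; the one-dimensional dynamics, the parameter values, and all names are illustrative and not taken from a concrete application:

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(t, x, theta_x, u):
        """Right-hand side of equation (4.1): production driven by the
        input u(t) and linear degradation (illustrative dynamics)."""
        k_prod, k_deg = theta_x
        return [k_prod * u(t) - k_deg * x[0]]

    def g(x, theta_y):
        """Observational function of equation (4.2), here a simple
        scaling as it occurs for relative data."""
        return theta_y[0] * x[0]

    def simulate(t_meas, theta_x, theta_y, u, sigma, rng):
        """Solve the ODE and generate noisy observations y(t_i)."""
        sol = solve_ivp(f, (0.0, t_meas[-1]), [0.0], t_eval=t_meas,
                        args=(theta_x, u))
        y = g(sol.y, theta_y)
        return y + rng.normal(0.0, sigma, size=y.shape)  # additive Gaussian noise

    rng = np.random.default_rng(1)
    u = lambda t: 1.0                    # constant external stimulation
    t_meas = np.linspace(0.5, 10.0, 10)  # the design: ten measurement times
    y = simulate(t_meas, (2.0, 1.0), (1.0,), u, sigma=0.05, rng=rng)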


4.2.2 External perturbations

In systems biology, an important independent variable is the treatment, e.g. stimulation by nutrients, hormones or drugs [CHEN & ASPREY 2003, COONEY & MCDONALD 1995, ESPIE & MACCHIETTO 1989, GALVANIN et al 2007, MAIWALD et al 2007]. Because the stimulation can be time dependent, it is modeled as a continuous input function u(t). Up- or down-regulation of genes, e.g. by constitutive over-expression, siRNAs, or by knockouts, can also be regarded as external perturbations of the studied system. A design can be optimized with respect to the chosen perturbations u ⊂ U. This includes the choice of the applied treatments or treatment combinations as well as the stimulation strength and the temporal pattern, e.g. permanent or pulsatile stimulation. U denotes the set of all experimentally applicable perturbations. For numerical optimization, the input functions have to be parametrized. A common approach is the control vector parametrization (CVP) [BALSA-CANTO et al 1998, BANGA et al 2005] or the use of stepwise constant input functions, as sketched below.
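As an illustration of the second option, the following sketch parametrizes a stepwise constant input by its switching times and levels, so that design optimization acts on a finite-dimensional vector; all names and values are again illustrative:

    import numpy as np

    def stepwise_input(switch_times, levels):
        """Return u(t) that is constant between consecutive switch times."""
        switch_times = np.asarray(switch_times)
        levels = np.asarray(levels)
        def u(t):
            # index of the interval containing t; clip keeps times beyond
            # the last switch at the final level
            idx = np.clip(np.searchsorted(switch_times, t, side="right") - 1,
                          0, len(levels) - 1)
            return levels[idx]
        return u

    u = stepwise_input([0.0, 2.0, 5.0], [1.0, 0.0, 0.5])  # pulse-like stimulation
    print(u(1.0), u(3.0), u(7.0))  # 1.0 0.0 0.5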

4.2.3 Measurement times

The choice of the sampling times, i.e. the times of measurement t ⊂ T, is crucial if the dynamics of a system is studied [KUTALIK et al 2004]. On the one hand, the sampling interval ∆t_i should be small enough to capture the fastest processes. On the other hand, the duration t_max − t_min of the observation should be long enough to capture the long-term behavior of the studied system. Because of limitations in experimental resources, this trade-off has to be resolved reasonably by experimental planning. This requires, however, some knowledge about the time scales of the studied dynamic processes.

4.2.4 Observables

The output of an experiment y is represented in the model by the observational functions g and the noise ε. The experimenter has the freedom of choosing which measurement technique will be applied and which system players, e.g. proteins, will be measured. Thereby, it is possible to select the most informative observables g ⊂ G from the set G of all available observational functions, which is determined by experimental feasibility. In practice, such experimental design considerations are very helpful if, e.g., new antibodies have to be generated or experimental techniques have to be established in a laboratory. Another reason for the importance of the choice of the observables is that this step determines the expected amount of observational noise.

4.2.5 Experimental constraints

In cell biology, there are usually many more experimental restrictions than in more technically orientated disciplines like engineering or physics. Often only a small fraction of the dynamic variables can be measured. The feasible external perturbations are usually very limited, e.g. it is


often impossible to define the stimulation in the frequency domain, which is a natural approach in engineering. Experimental constraints are accounted for by the definition of the design region \mathcal{D}, i.e. the set of all practically applicable designs. During the optimization, \mathcal{D} is considered as the domain, i.e. only designs D ∈ \mathcal{D} are allowed. If there are only separate experimental constraints for the domains U, G and T, then \mathcal{D} corresponds to the set of all combinations

    \mathcal{D} = U \times G \times T    (4.3)

of possible perturbations, observations, and measurement times. Examples of commonly occurring constraints are a lower boundary for the sampling interval ∆t or that only a limited number of measurements can be obtained from one experimental unit. After the definition of a utility (or loss) function V(D), the design can be optimized

    D^* = \arg\max_{D \in \mathcal{D}} V(D)    (4.4)

over the design region to identify the optimal design D^*. The utility function V, also called design criterion, reflects the purpose of the experiments. If, for example, parameters are estimated, the utility function could be a measure of the expected accuracy of the estimated parameters. If the discrimination between competing models for the description of a phenomenon is of interest, the design criterion measures the difference in the model predictions. The most commonly used utility functions are introduced in Section 4.3.
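The following sketch illustrates equation (4.4) as a brute-force search over a discretized design region; the utility V is a placeholder (it would be one of the criteria of Section 4.3), and the constraints mirror those used later in the example of Section 4.4:

    import itertools
    import numpy as np

    def V(design):
        """Placeholder utility; replace by e.g. det(F(D)) for D-optimality."""
        t1, dt = design
        return -(t1 - 0.5) ** 2 - (dt - 0.6) ** 2  # toy surface with one maximum

    # design region: first sampling time t1 and spacing dt, with the
    # constraints t1 > 0, dt >= 0.25 and t1 + 9*dt <= 10
    candidates = [(t1, dt)
                  for t1, dt in itertools.product(np.linspace(0.1, 2.0, 20),
                                                  np.linspace(0.25, 1.0, 16))
                  if t1 + 9 * dt <= 10]

    D_opt = max(candidates, key=V)
    print("optimal design:", D_opt)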

4.2.6 Prior knowledge

In general, besides the dependency on the design, the utility function depends on the true underlying parameters θ and on the realization of the observational noise, V(D) → V(D, θ, ε). Therefore, in the general case the determination of an optimal design requires some prior knowledge about the parameters [DETTE & BIEDERMANN 2003]. The accuracy of the predicted optimal designs is limited by the precision of the provided prior knowledge. Such knowledge, e.g. the order of magnitude or physiologically meaningful ranges, can be obtained from preliminary experiments. The expected utility function

    \bar{V}(D) = \int_P \int_{-\infty}^{\infty} \rho(\varepsilon)\,\rho(\theta)\, V(D, \theta, \varepsilon) \, d\varepsilon \, d\theta    (4.5)

is obtained by averaging over the parameter space P and over all possible realizations of the observational noise. By using a prior distribution ρ(θ), the parameter space is weighted according to its relevance. ρ(ε) denotes the distribution of the observational noise. In the case of an unknown model structure, i.e. for the purpose of model discrimination, an additional weighting with the prior probabilities π(M) of the different reasonable models M is required. Then equation (4.5) becomes

    \bar{V}(D) = \sum_M \pi(M) \int_{P_M} \int_{-\infty}^{\infty} \rho(\varepsilon)\,\rho^{(M)}(\theta)\, V^{(M)}(D, \theta, \varepsilon) \, d\varepsilon \, d\theta ,    (4.6)


where ρ^{(M)}(θ) denotes the parameter prior for model M. After the analysis of new experimental data, the parameter prior as well as the model prior are updated to account for the new insights. Bayes' formula yields the posterior probabilities

    \pi'(M) = \frac{\pi(M) \int \rho(y|\theta^{(M)})\, \rho^{(M)}(\theta)\, d\theta}{\sum_m \pi(M_m) \int \rho(y|\theta^{(M_m)})\, \rho^{(M_m)}(\theta)\, d\theta}    (4.7)

for the considered models and

    \rho^{(M)\prime}(\theta) = \frac{\rho^{(M)}(\theta)\, \rho^{(M)}(y|\theta)}{\int \rho^{(M)}(\theta')\, \rho^{(M)}(y|\theta')\, d\theta'}    (4.8)

for the model parameters. In turn, the posterior can be considered as a new prior for the next experimental planning iteration.
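A simple way to approximate the update (4.7) numerically is to estimate the marginal likelihood of each model by averaging the likelihood over parameter draws from the prior. The following is a minimal sketch assuming Gaussian observational noise; predict and prior_sample are hypothetical placeholders for the model prediction and the prior sampler:

    import numpy as np

    def log_likelihood(y, y_model, sigma):
        """Gaussian log-likelihood of the data y given a model prediction."""
        return -0.5 * np.sum(((y - y_model) / sigma) ** 2
                             + np.log(2.0 * np.pi * sigma ** 2))

    def marginal_likelihood(y, predict, prior_sample, sigma, n_draws, rng):
        """Monte Carlo estimate of  int rho(y|theta) rho(theta) dtheta  by
        averaging over prior draws; for many data points a log-sum-exp
        implementation is numerically preferable."""
        lik = [np.exp(log_likelihood(y, predict(prior_sample(rng)), sigma))
               for _ in range(n_draws)]
        return np.mean(lik)

    def model_posterior(prior_probs, marginal_liks):
        """Equation (4.7): pi'(M) is proportional to pi(M) times the
        marginal likelihood of model M."""
        w = np.asarray(prior_probs) * np.asarray(marginal_liks)
        return w / w.sum()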

4.3 Determination of optimal designs

If the potentially feasible experiments, i.e. the design region, and the prior knowledge are specified, appropriate objective functions can be utilized to assess the expected performance of a design. In this section, Monte-Carlo approaches for the optimization of designs for parameter estimation and model discrimination are introduced on the basis of established objective functions.

The utility or objective function can be used for numerical optimization, yielding e.g. optimal sampling times, observables, and external perturbations. The choice of the design criterion reflects the issues which are to be studied. Therefore, an important prerequisite for experimental design considerations is the exact formulation of the question under investigation [JOHNSON & BESSELSEN 2002].

Usually, the differential equations (4.1), used as a model for the underlying process of interest, cannot be solved analytically. In this case, an optimal design can only be determined by numerical techniques. By means of Monte Carlo simulations [BINDER 1979, HONERKAMP 1993, TARANTOLA 2005], synthetic data sets are generated by drawing from an assumed distribution of the noise. By analyzing the simulated data in exactly the same way as intended for the analysis of the measurements, it is possible to evaluate and compare the possible outcomes, i.e. the utility functions obtained for different designs. Repeated simulations are then used to calculate the expected utility function. This expectation can be used for numerical optimization. The disadvantage of Monte Carlo approaches is the high numerical effort. This drawback can be minimized by introducing reasonable approximations. The benefit of Monte Carlo simulations is their great flexibility. In principle, every source of uncertainty can be included by drawing from a corresponding prior distribution. Furthermore, nonlinear dependencies of the observations on the parameters or on the states do not constitute a limitation for Monte Carlo methods.

In systems biology, the hypotheses are usually addressed by discrimination between different mathematical models [SWAMEYE et al 2003] and/or by the estimation of model parameters


[RODRIGUEZ-FERNANDEZ et al 2006, MENDES & KELL 1998, CHO & WOLKENHAUER 2003]. Therefore, in the following two sections, Monte Carlo procedures for design optimization with respect to parameter estimation and model discrimination are described.

4.3.1 Experimental design for parameter estimation

An important step in the establishment of a mathematical model is the determination of the model parameters. Besides initial protein concentrations and kinetic rate constants, parameters in the observational functions have to be estimated.

In the Maximum Likelihood approach [BANGA et al 2005, HORBELT 2001], the likelihood function, i.e. the probability ρ(y|θ) of the measurements y given a parameter set θ, is maximized to estimate the model parameters θ̂. This probability is calculated from the distribution of the observational noise. In the case of independently normally distributed noise with known variances σ_i^2, the log-likelihood function is proportional to the well-known standardized residual sum of squares \sum_i ((y_i - g_i)/\sigma_i)^2 and therefore maximum likelihood estimation is equivalent to least-squares estimation.

The log-likelihood LL(y|θ) = log ρ(y|θ) as well as its derivatives depend on the noise realization and can therefore be considered as random variables. The Fisher information F of a design D is defined as the variance

    F(D) = \mathrm{Var}\left( \frac{\partial LL(y|\theta)}{\partial\theta} \right)    (4.9)

of the first derivative of the log-likelihood with respect to the parameters [SILVEY 1970]. Because the expectation \langle \partial LL/\partial\theta \rangle is zero, the equality \mathrm{Var}(\partial LL/\partial\theta) = \langle (\partial LL/\partial\theta)^2 \rangle - \langle \partial LL/\partial\theta \rangle^2 can be utilized to express the Fisher information as the expectation

    F(D) = \left\langle \frac{\partial LL(y|\theta)}{\partial\theta} \cdot \frac{\partial LL(y|\theta)}{\partial\theta} \right\rangle \overset{PI}{=} -\left\langle \frac{\partial^2 LL(y|\theta)}{\partial\theta\,\partial\theta} \right\rangle ,    (4.10)

where PI denotes partial integration. For a model g with data

    y = g(D, \theta) + \varepsilon , \quad \varepsilon \sim N(0, \sigma^2)    (4.11)

and additive Gaussian noise ε with known variance σ^2, the likelihood is

    \rho(y|\theta) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - g(t_i, u_i, o_i, \theta))^2}{2\sigma^2} \right)    (4.12)

and the log-likelihood is proportional to the residual sum of squares \sum_i \left( (y_i - g(t_i, u_i, o_i, \theta))/\sigma \right)^2, which is commonly minimized for least-squares estimation. In this case, Eq. (4.10) yields

    F(D) = -\left\langle \frac{\partial}{\partial\theta} \left( \frac{y - g}{\sigma^2} \cdot \frac{\partial g}{\partial\theta} \right) \right\rangle = \frac{1}{\sigma^2} \left\langle \frac{\partial g}{\partial\theta} \cdot \frac{\partial g}{\partial\theta} \right\rangle - \left\langle \frac{y - g}{\sigma^2} \cdot \frac{\partial^2 g}{\partial\theta\,\partial\theta} \right\rangle    (4.13)

         = \frac{1}{\sigma^2} \left\langle \frac{\partial g}{\partial\theta} \cdot \frac{\partial g}{\partial\theta} \right\rangle .    (4.14)

This equation generalizes to

    F(D) = \left(\frac{\partial g}{\partial\theta}\right)^T \Sigma^{-1}\, \frac{\partial g}{\partial\theta}    (4.15)

if the noise is described by a multivariate normal distribution ε ∼ N(0, Σ) with covariance matrix Σ. In practice, the true parameters θ are unknown and the expectation is usually approximated by the so-called observed Fisher information [EFRON & HINKLEY 1978]

    \hat{F}(D) = F(D)\Big|_{\theta=\hat\theta} = \left(\frac{\partial g}{\partial\theta}\right)^T \Sigma^{-1}\, \frac{\partial g}{\partial\theta}\,\Big|_{\theta=\hat\theta} ,    (4.16)

i.e. by evaluating the sensitivities ∂g/∂θ in Eq. (4.15) at the estimated parameters θ̂.
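In practice, the sensitivities in equation (4.16) are often computed numerically. The following sketch approximates them by finite differences and evaluates the observed Fisher information for uncorrelated noise, together with two of the scalar criteria introduced below; g_model is a placeholder for the map from a parameter vector to the predictions at all design points:

    import numpy as np

    def sensitivities(g_model, theta_hat, eps=1e-6):
        """Finite-difference approximation of dg/dtheta at theta_hat."""
        g0 = g_model(theta_hat)
        S = np.empty((g0.size, theta_hat.size))
        for j in range(theta_hat.size):
            theta = theta_hat.copy()
            theta[j] += eps
            S[:, j] = (g_model(theta) - g0) / eps
        return S

    def observed_fisher(g_model, theta_hat, sigma):
        """Equation (4.16) for uncorrelated noise, Sigma = diag(sigma^2);
        sigma is the vector of standard deviations at the design points."""
        S = sensitivities(g_model, np.asarray(theta_hat, dtype=float))
        S = S / np.asarray(sigma)[:, None]   # scale each row by 1/sigma(d)
        return S.T @ S

    # scalar design criteria derived from F, cf. equations (4.17)-(4.19)
    def d_criterion(F):
        return np.linalg.det(F)              # D-optimality
    def e_criterion(F):
        return np.linalg.eigvalsh(F).min()   # E-optimality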

For optimization, a scalar utility function is required. There are several design criteria derived from the Fisher information matrix [DETTE et al 2003, HIDALGO & AYESA 2001]. An alphabetical nomenclature for the different criteria was introduced in [KIEFER 1959]. Often, the determinant

    V(D) = \det(F(D)) = \prod_i \lambda_i(D)    (4.17)

is maximized, where λ_i denote the eigenvalues of F. The obtained optimal design is called D-optimal [JOHN & DRAPER 1975]. Maximization of equation (4.17) corresponds to minimization of the generalized variance of the estimated parameters, i.e. minimization of the volume of the confidence ellipsoid [BALSA-CANTO 2007]. An A-optimal design minimizes the average variance of the estimated parameters, i.e. the trace of F^{-1}, which corresponds to minimizing the sum of the inverse eigenvalues

    V(D) = \sum_i \lambda_i(D)^{-1}    (4.18)

of the Fisher information matrix. Similarly, the E-optimal design is obtained by maximization

    V(D) = \lambda_{\min}(D)    (4.19)

of the smallest eigenvalue. This is equivalent to minimizing the largest confidence interval of the estimated parameters.

In systems biology, the number of unknown parameters is often large in comparison to the available amount of measurements. This raises the problem of non-identifiability [CHAPPELL et al 1990, LJUNG & GLAD 1994, TIMMER et al 1998, HENGL et al 2007]. Structural non-identifiability manifests itself in a redundant parametrization of the model for a given set of experimental observations. Practical non-identifiability is due to a limited amount of experimental information. The above-mentioned criteria are only meaningful if all model parameters are identifiable. Otherwise, the Fisher information matrix is singular. In this situation, a regularization technique can be applied [ATKINSON & DONEV 1992], i.e. a small number is added to the diagonal entries of F. In the case of a diagonal Fisher information matrix, the parameters of the model are called orthogonal. Then, the precision of all parameters can be optimized independently.

In the more general case, not all parameters but only s linear combinations Aθ of the parameters could be of interest. Here, A denotes an s × n_θ matrix. Often, only the kinetic parameters θ_x are


    Initialize model M
    Initialize parameter prior ρ(θ)
    Initialize design D
    Optimize D by evaluation of V̂(D):
        for a parameter set θ drawn from ρ(θ):
            for a noise realization ε:
                y(D) = SIMULATE(D, M, θ, ε)
                [θ̂(D), F(D)] = ESTIMATE_PARAMETERS(y(D), M)
                V(D, θ, ε) = EVALUATE_OPT_CRITERION(F(D))
        V̂(D) = WEIGHTED_AVERAGE(V(D, θ, ε), ρ(θ), ρ(ε))

Figure 4.2: Schematic overview of a Monte Carlo approach to optimize a design for parameter estimation.

of interest, in contrast to the parameters θ_y of the observational functions. The covariance matrix of such linear combinations is A F^{-1}(D) A^T. Its inverse can be interpreted as a new Fisher information matrix which can be used to define new utility functions to optimize the design for the estimation of the linear combinations. The corresponding D-optimal design is called D_A-optimal [TITTERINGTON 1975]. A similar criterion is D_S-optimality [STUDDEN 1980, ATKINSON 1988]. Here, the Fisher information matrix is rearranged and then partitioned

    F = \begin{pmatrix} B_{11} & B_{12} \\ B_{12}^T & B_{22} \end{pmatrix}    (4.20)

into four blocks. Block B_11 contains the second derivatives with respect to the parameters of interest and block B_22 the corresponding derivatives with respect to the unimportant or nuisance parameters. By maximization of

    V(D) = \det\left( B_{11} - B_{12}\, B_{22}^{-1}\, B_{12}^T \right) ,    (4.21)

the variance of the nuisance parameters is only considered if they are correlated with the parameter estimates of interest.

If a model is linear in the parameters, the Fisher information matrix and the optimal design are independent of the true parameters. In contrast, for models which are nonlinear with respect to the parameters, the performance of a design depends on the true parameters and on the noise realization,

    V(D) = V(D, \theta, \varepsilon) .    (4.22)

Then, a design which is proposed on the basis of the expected performance depends on the prior knowledge of the parameters. For D-optimal designs, the number of evaluated experimental conditions is usually equal to the number of model parameters. Such designs are often very sensitive to parameter assumptions. Robustness of the designs with respect to the presumed underlying parameters has been discussed in Section 4.2.6. Further discussions can be found in [DE FEO &


MYERS 1992, GOOS et al 2005, ROJAS et al 2007, SACKS & YLVISAKER 1984, YUE & HICKERNELL 1999]. By applying a Monte Carlo approach, robust designs for parameter estimation are obtained by computing the expected utility function V̄(D) from the parameter prior distribution according to equation (4.5). Figure 4.2 provides an overview of the Monte Carlo approach. In the inner loop, the performance V(D) of a design is evaluated for a noise realization and for a parameter vector drawn from the prior distribution ρ(θ). The expectation V̂ is then used to assess and optimize the design.

The covariance matrix and, equivalently, the Fisher information matrix provide sufficient information about the uncertainty of the parameter estimates only if the likelihood can be approximated by a quadratic function around its maximum. This is fulfilled asymptotically, i.e. for an infinite amount of data. However, in most systems biology applications, this assumption is strongly violated. In such circumstances, alternative methods for the determination of confidence intervals, e.g. the profile likelihood approach [MURPHY & VAART 1998, RAUE et al 2009] or the bootstrap [DAVISON & HINKLEY 1997, DICICCIO & TIBSHIRANI 1987, JOSHI et al 2006], have to be applied. Then, the utility function also has to be adapted to these approaches.
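A direct, if expensive, translation of Figure 4.2 into code could look as follows; simulate, fit, criterion, and prior_sample are placeholders for the problem-specific routines, and the nested loops realize the averaging over the parameter prior and the noise:

    import numpy as np

    def expected_utility(design, prior_sample, simulate, fit, criterion,
                         n_theta=50, n_noise=10, seed=0):
        """Monte Carlo estimate of the expected utility of one design,
        following the scheme of Figure 4.2."""
        rng = np.random.default_rng(seed)
        values = []
        for _ in range(n_theta):              # draw from the parameter prior
            theta = prior_sample(rng)
            for _ in range(n_noise):          # draw noise realizations
                y = simulate(design, theta, rng)    # synthetic data set
                theta_hat, F = fit(design, y)       # maximum likelihood fit
                values.append(criterion(F))         # e.g. det(F)
        return np.mean(values)

    # the optimal design maximizes the expectation, cf. equation (4.4):
    # D_opt = max(candidate_designs,
    #             key=lambda D: expected_utility(D, prior_sample, simulate,
    #                                            fit, criterion))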

4.3.2 Experimental design for model discrimination

The structure of a mathematical model for describing the studied system is initially unknown. Model discrimination or model selection is the statistical procedure to decide, on the basis of experimental data, which model is the most appropriate one [STEWARD et al 1996, 1998, TIMMER et al 2004]. The accordance of the data and the model is examined by evaluation of the maximum likelihood function ρ(y|θ̂^{(M)}) for a model M obtained after parameter estimation. A well-established criterion for model discrimination is the Akaike Information Criterion [AKAIKE 1974, SAKAMOTO et al 1986]

    AIC^{(M)}(D) = -2 \log \rho(y|\hat\theta^{(M)}) + 2\, n_\theta^{(M)} .    (4.23)

A model with a small AIC, i.e. with a low number of parameters n_θ^{(M)} and a large likelihood, is preferable. If two models are compared, the sign of the difference

    \Delta AIC^{(M_m,M_n)}(D) = \log \frac{\rho(y|\hat\theta^{(M_n)})}{\rho(y|\hat\theta^{(M_m)})} + n_\theta^{(M_m)} - n_\theta^{(M_n)}    (4.24)

indicates the superior model. Here, model M_m would be preferred for negative ∆AIC^{(M_m,M_n)}. Besides some further variants of the AIC, there are other related criteria like the Bayes Information Criterion [SCHWARZ 1978] or the Minimum Description Length [RISSANEN 1983] which can also be applied for the purpose of model discrimination. They are mathematically derived under different assumptions. Here, only the application of the AIC is discussed. Nevertheless, the AIC can be replaced if another model assessment criterion is desired. The advantage of these model discrimination criteria is their simple applicability. However, these criteria do not allow any conclusions concerning statistical significance.
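The application of the criterion is straightforward; the following sketch evaluates equations (4.23) and (4.24) for two hypothetical fit results (the log-likelihood values are made up for illustration):

    def aic(log_lik, n_params):
        """Equation (4.23): AIC = -2 log rho(y|theta_hat) + 2 n_theta."""
        return -2.0 * log_lik + 2.0 * n_params

    # hypothetical fit results for two rival models
    aic_m = aic(log_lik=-120.3, n_params=4)   # model M_m
    aic_n = aic(log_lik=-118.9, n_params=6)   # model M_n
    delta = 0.5 * (aic_m - aic_n)             # equals eq. (4.24)
    print("prefer M_m" if delta < 0 else "prefer M_n")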


Statistical significance can be assessed by statistical tests, i.e. by a likelihood ratio test [COX 1961, HONERKAMP 2002]. Here, p-values are computed under the additional assumption that the considered models are nested, i.e. the parameter space of one model is a sub-manifold of the parameter space of the other model. Often, the sub-manifold can be obtained by setting some parameters to zero. The nested model can be considered as a special case of the other, more general model. If M_m denotes the submodel, the relationship ρ(y|θ̂^{(M_m)}) ≤ ρ(y|θ̂^{(M_n)}) holds for the two likelihood functions. If furthermore M_m is appropriate, the advantage of M_n is only due to overfitting. In this case, it can be shown that under standard assumptions [SELF & LIANG 1987] minus two times the log-likelihood ratio

    -2\, LR^{(M_m,M_n)}(D) = -2 \log \left( \frac{\rho(y|\hat\theta^{(M_m)})}{\rho(y|\hat\theta^{(M_n)})} \right)    (4.25)

is χ²_df-distributed. The degree of freedom df is given by the difference in the number of parameters. If the likelihood ratio obtained from the experimental data is larger than one would expect according to the χ² distribution, the small model is rejected. If the observational noise is independently normally distributed, the likelihood ratio in equation (4.25) yields

    \Delta RSS^{(M_m,M_n)}(D) = \sum_{d \in D} \left( \frac{y(d) - g^{(M_m)}(d, \hat\theta^{(M_m)})}{\sigma(d)} \right)^2 - \sum_{d \in D} \left( \frac{y(d) - g^{(M_n)}(d, \hat\theta^{(M_n)})}{\sigma(d)} \right)^2 ,    (4.26)

which is equal to the difference of the two standardized residual sums of squares. Here, d ∈ D denotes the design points, i.e. the set of chosen experimental conditions. For models which are linear in the parameters, the expectation of equation (4.26) is

    V^{(M_m,M_n)}(D) = \sum_{d \in D} \left( \frac{g^{(M_m)}(d, \hat\theta^{(M_m)}) - g^{(M_n)}(d, \hat\theta^{(M_n)})}{\sigma(d)} \right)^2    (4.27)

and is therefore asymptotically, i.e. for large sample sizes, independent of the noise realization [HUNTER & REINER 1965]. Therefore, numerical optimization does not require averaging over the observational noise in an asymptotic setting.

Likelihood-based model discrimination comprises maximum likelihood parameter estimation and a subsequent computation of a statistic for pairs of rival models, e.g. the difference of the AIC, the BIC, or the likelihood ratio. A Monte Carlo approach which imitates exactly these steps is schematically displayed in Figure 4.3. Here, the expectation V̂(D) of a model discrimination criterion V is estimated by drawing numerous realizations from the model and parameter priors, as well as realizations of the observational noise. Each realization of simulated data is analyzed in exactly the same way as intended for the experimental data, yielding a realization of the model discrimination criterion. For design optimization, the expectation, the worst case, or a certain quantile can be utilized. The Monte Carlo approach in Figure 4.3 is very general because there are no restrictive assumptions and every kind of prior knowledge can be included. On the other hand, such an approach is very expensive in terms of computational time.
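For Gaussian noise, the test statistic of equation (4.25) reduces to the difference of the standardized residual sums of squares, cf. equation (4.26). A minimal sketch, with hypothetical RSS values:

    from scipy.stats import chi2

    def lr_test(rss_small, rss_large, df):
        """p-value of the likelihood ratio test (4.25): for Gaussian noise
        the statistic is the difference of the standardized residual sums
        of squares, eq. (4.26), and is chi^2_df distributed."""
        statistic = rss_small - rss_large
        return statistic, chi2.sf(statistic, df)

    stat, p = lr_test(rss_small=58.4, rss_large=49.1, df=2)  # hypothetical values
    print(f"-2 log LR = {stat:.1f}, p = {p:.4f}")  # a small p rejects the submodel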


    for all considered models Mm:
        Initialize model prior π(Mm)
        Initialize parameter prior ρ^(Mm)(θ)
    Initialize design D
    Optimize D by evaluation of V̂(D):
        for a "true" model M′:
            for a parameter set θ^(M′) for M′:
                for a noise realization ε:
                    y(D) = SIMULATE(D, M′, θ^(M′), ε)
                    for all models Mm:
                        θ̂^(Mm) = ESTIMATE_PARAMETERS(y(D), Mm)
                    V(D, M′, θ^(M′), ε) = EVALUATE_OPT_CRITERION(y(D), Mm, θ̂^(Mm))
        V̂(D) = WEIGHTED_AVERAGE(V(D, M′, θ^(M′), ε), π(M′), ρ^(M′)(θ), ρ(ε))

Figure 4.3: Schematic overview of a general Monte Carlo approach to optimize a design for model discrimination.

There are some approaches for the optimization of experimental designs for model discrimination which constitute approximations of the general Monte Carlo approach presented in Figure 4.3. Most algorithms are based on equation (4.27). In [HUNTER & REINER 1965],

    V^{(M_m,M_n)}(D) = \sum_{d \in D} \left( \frac{g^{(M_m)}(d, \langle\theta^{(M_m)}\rangle) - g^{(M_n)}(d, \hat\theta^{(M_n)})}{\sigma(d)} \right)^2    (4.28)

is optimized. Here, the expected response g^{(M_m)}(d, ⟨θ^{(M_m)}⟩) of the "true" model M_m at the design points d is computed for the expected parameters ⟨θ^{(M_m)}⟩ according to the parameter prior. The parameters θ̂^{(M_n)} of the other models are obtained by parameter estimation. A similar approach was used in [ATKINSON & FEDOROV 1975b] to find the optimal design for two rival regression models. The obtained design is called T-optimal. The case of more than two competing models is discussed in [ATKINSON & FEDOROV 1975a]. A criticism of both approaches is that uncertainty in the expected response due to parameter uncertainty is not considered. In [BOX & HILL 1967], an example is provided where this uncertainty depends strongly on the design points. In an improved approach [BUZZI FERRARIS et al 1983, BUZZI FERRARIS & FORZATTI 1984], the covariance matrices of the parameter prior distributions are propagated to the model response after linearization of the model. This leads to optimization of

    V^{(M_m,M_n)}(D) = \sum_{d \in D} \frac{\left( g^{(M_m)}(d, \langle\theta\rangle) - g^{(M_n)}(d, \hat\theta) \right)^2}{n_M\, \sigma^2(d) + \sum_{m'} \sigma^2_{m'}(d)} ,    (4.29)

where σ²_{m'}(d) denote the variances of the model responses due to parameter uncertainty.

In [HSIANG & REILLY 1971], an approach is introduced in which also higher-order moments are propagated. Here, a representative group of parameter sets {θ̃_1^{(M)}, θ̃_2^{(M)}, ...} is drawn from


the prior distribution of the parameters for each model. For these groups of parameters, the models are evaluated. This yields an expected response

    \hat{g}^{(M)}(d) = \sum_i g^{(M)}(d, \tilde\theta_i^{(M)})\, \rho^{(M)}(\tilde\theta_i^{(M)})    (4.30)

for model M and

    V^{(M_m,M_n)}(D) = \sum_{d \in D} \left( \frac{\hat{g}^{(M_m)}(d) - \hat{g}^{(M_n)}(d)}{\sigma(d)} \right)^2    (4.31)

as utility function for the comparison of two models. Here, the linearization of the model is avoided by computing the expectation after evaluation of the model response g. In equations (4.28), (4.29), and (4.31), model M_m is assumed as the true underlying model. Averaging over all pair-wise comparisons of the models, accounting for model uncertainty, yields

    V(D) = \sum_{m,\, n \neq m} \pi(M_m)\, \pi(M_n)\, V^{(M_m,M_n)}(D) .    (4.32)

An alternative is optimization of the worst case, i.e. maximization of the difference

    V(D) = \min_{m,\, n \neq m} V^{(M_m,M_n)}(D)    (4.33)

between the two most similar models. The introduced approaches are reasonable in the case of normally distributed noise. In a more general setting, the expected likelihood ratio

    V_{LR}(D) = \sum_{m,\, n \neq m} \pi(M_m)\, \pi(M_n)\, LR^{(M_m,M_n)}(D)    (4.34)

or, for non-nested models, the expected difference

    V_{AIC}(D) = \sum_{m,\, n \neq m} \pi(M_m)\, \pi(M_n)\, \Delta AIC^{(M_m,M_n)}(D)    (4.35)

of the Akaike Information can be used instead.

A Bayesian methodology for optimal experimental design has been introduced in BOX & HILL [1967], REILLY [1970]. In this exact entropy approach, the entropy

    S = -\sum_m \pi(M_m)\, \ln \pi(M_m)    (4.36)

is used to quantify the amount of information, i.e. the certainty about the true underlying model. A linearization of the model response is used to propagate the covariance matrices of the prior distributions. In this way, the expected change

    V(D) = S'(D) - S    (4.37)

in the entropy is calculated, which has to be optimized in the experimental planning. Equations for the expected entropy S'(D) after a new experiment can be found in BOX & HILL [1967].
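The entropy criterion itself is easily evaluated; a minimal sketch of equations (4.36) and (4.37), with hypothetical prior and posterior model probabilities:

    import numpy as np

    def entropy(pi):
        """Equation (4.36); terms with pi = 0 contribute zero."""
        pi = np.asarray(pi, dtype=float)
        nz = pi > 0
        return -np.sum(pi[nz] * np.log(pi[nz]))

    pi_prior = [0.5, 0.5]   # two equally probable models
    pi_post = [0.9, 0.1]    # hypothetical posterior after an experiment
    print(entropy(pi_post) - entropy(pi_prior))  # V(D) = S'(D) - S, eq. (4.37)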


Figure 4.4: In our example, a substrate S and a protein P are produced with a common rate θ1. The protein is degraded with rate θ2 and promotes the degradation of the substrate with rate θ3.

A comparison of the Bayesian approach and the more frequentist approaches is given in ATKINSON [1981b]. Only slight differences in the proposed designs were found. Another comparison of the published approaches can be found in BURKE et al [1994].

Despite the importance of model selection, there are still few applications of the experimental design procedures discussed here in the field of systems biology. In FENG & RABITZ [2004], a concept called "optimal identification" is introduced to estimate model parameters and to discriminate between different models. Their algorithm is illustrated by a simulation study of a tRNA proofreading mechanism. The criterion in equation (4.29) was used in [CHEN & ASPREY 2003] to calculate optimal inputs for model selection between different dynamical models of a yeast fermentation in a bioreactor. In BURKE et al [1994], computer simulations have been used to check the applicability of model discrimination methods to the modeling of polymerization reactions in organic chemistry. Here, some of the discussed design optimization approaches were also applied and compared. An overview of model selection and design aspects in engineering applications is given in VERHEIJEN [2003].

An appropriate design for model selection is not necessarily advantageous for parameter estimation. An example where the optimal design for the discrimination between two regression models cannot be used to estimate the parameters of the true model is described in ATKINSON & DONEV [1992]. If both parameter estimation and model discrimination are required, different design criteria, i.e. D-optimality and T-optimality, have to be combined [ATKINSON & DONEV 1992].

4.4 Illustrative examples

In this section, design optimization with respect to the sampling times is performed for a simple example model. Analogous strategies could be applied for the optimization of the chosen observables, perturbations, or the total number of measurements.

Figure 4.4 shows, as an example, a substrate S and a protein P which are produced with a common rate θ1. The protein is degraded with rate θ2 and promotes the degradation of the substrate with rate θ3. The time dependencies of the substrate concentration x_S(t) and the


protein concentration x_P(t) are then given in model M1 by

    M1: \quad \dot{x}_P(t) = \theta_1 - \theta_2\, x_P(t)
        \quad\;\, \dot{x}_S(t) = \theta_1 - \theta_3\, x_P(t)\, x_S(t)

with x_S(0) = x_P(0) = 0. Initially, θ1 = 2, θ2 = 1 and θ3 = 1 are assumed as the true underlying parameters. Further, it is assumed that the substrate concentration

    y(t) = x_S(t) + \varepsilon , \quad \varepsilon \sim N(0, 0.05^2)    (4.38)

is measured in absolute concentrations. Because the concentration x_S is of the order of one (compare Figure 4.7), the assumed standard deviation of the observational noise corresponds to a relative noise level of around 5%.

First, the calculation of the optimal sampling times is exemplified for the estimation of the three rates θ1, θ2 and θ3, with an initial measurement at time t_1 followed by nine subsequent equidistant measurements in time. In this case, two design parameters, the time t_1 of the first measurement and the sampling interval ∆t, have to be optimized. For this purpose, the D-optimality criterion according to equation (4.17) is applied. The design region, i.e. the set of feasible and experimentally reasonable values of t_1 and ∆t, can be restricted, as an example, to t_1 > 0 and ∆t > 0.25. Another prerequisite could be that the measurements have to be executed within the first ten minutes, leading to the further constraint t_1 + 9∆t ≤ 10 if the time unit is minutes.

Because the model M1 is nonlinear in the parameters, the performance of a design, i.e. the expected accuracy of the parameter estimates, depends on the true underlying parameters and on the realization of the noise. To examine the impact of the noise realizations, one hundred data sets y(t) = x_S(t) + ε(t), t = t_1, t_1 + ∆t, ..., t_1 + 9∆t, for the same parameter set θ1, θ2, θ3 have been simulated for different t_1 and ∆t. For each realization, the parameters have been estimated and the covariance matrices of the parameter estimates have been calculated to determine V = det(F) = det(Cov(θ̂_i, θ̂_j)^{-1}) according to equation (4.17). Figure 4.5 shows the performance of the designs for parameter estimation for three different noise realizations, indicating a strong dependency of the parameter estimation covariances on the data realization. In addition, the median performance over 100 noise realizations is depicted.

Usually, the impact of different noise realizations is neglected, e.g. in [ASPREY & MACCHIETTO 2002, BALSA-CANTO 2007, BALTES et al 1994], and the performance is optimized for the expected measurements y(t) = x_S(t), t = t_1, t_1 + ∆t, ..., t_1 + 9∆t. Figure 4.6 shows V(t_1, ∆t) for this approximation. The most informative design is obtained for t_1* = 0.52 and ∆t* = 0.56, which is in agreement with Figure 4.5, where the median performance over many noise realizations is displayed.

For nonlinear models, the optimal design, i.e. t_1* and ∆t*, depends on the model parameters, i.e. on the true time scales of the system. This dependency is evaluated in Figure 4.7. The dynamics of x_S(t) and the optimal sampling times for the initial parameter set are plotted as black curves. A sketch of the corresponding computation is given below.
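The following sketch outlines how the noise-free criterion of Figure 4.6 can be approximated: the sensitivities of x_S(t) with respect to the three parameters are computed by finite differences, combined into the Fisher information according to equation (4.15), and the D-criterion is maximized over a grid of feasible (t_1, ∆t) values. It is an illustrative approximation, not the exact computation underlying the figures:

    import numpy as np
    from scipy.integrate import solve_ivp

    def m1(t, x, th1, th2, th3):
        """Right-hand side of model M1."""
        xP, xS = x
        return [th1 - th2 * xP, th1 - th3 * xP * xS]

    def x_s(times, theta):
        """Substrate concentration x_S at the sampling times."""
        sol = solve_ivp(m1, (0.0, times[-1]), [0.0, 0.0], t_eval=times,
                        args=tuple(theta), rtol=1e-8, atol=1e-10)
        return sol.y[1]

    def d_criterion(t1, dt, theta=(2.0, 1.0, 1.0), sigma=0.05, eps=1e-5):
        """det(F) for ten equidistant sampling times, eqs. (4.15), (4.17)."""
        times = t1 + dt * np.arange(10)
        g0 = x_s(times, theta)
        S = np.empty((10, 3))
        for j in range(3):                    # sensitivities dx_S/dtheta_j
            th = np.array(theta, dtype=float)
            th[j] += eps
            S[:, j] = (x_s(times, th) - g0) / eps
        F = (S / sigma).T @ (S / sigma)       # Fisher information matrix
        return np.linalg.det(F)

    grid = [(t1, dt) for t1 in np.linspace(0.1, 1.5, 15)
                     for dt in np.linspace(0.25, 1.0, 16) if t1 + 9 * dt <= 10]
    t1_opt, dt_opt = max(grid, key=lambda d: d_criterion(*d))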


Figure 4.5: For nonlinear models, the performance of a design depends on the realization of the observational noise. The first three panels show the performance for three data realizations, i.e. V(t_1, ∆t) = det(F(t_1, ∆t)). The panel on the bottom right shows the median over 100 noise realizations.

Figure 4.6: The approximate performance of the design obtained for the expectation of the measurements, i.e. the observational noise is neglected. The design is optimal for t_1* = 0.52 and ∆t* = 0.56.


Figure 4.7: The time dependency of the substrate concentration for different parameter values. The corresponding optimal sampling times for the (re-)estimation of the three parameters are indicated by circles.

The respective time courses and the corresponding optimal sampling times are also displayed after changing θ1 (red), θ2 (green) and θ3 (blue) by a factor of two. In general, larger parameters increase the velocity of the dynamics, and therefore optimally informative sampling times cover a shorter time range.

Next, design optimization for model selection is exemplified. For this purpose, the question is raised whether the substrate is degraded independently of the protein, i.e. the model

    M2: \quad \dot{x}_P(t) = \theta_1 - \theta_2\, x_P(t)
        \quad\;\, \dot{x}_S(t) = \theta_1 - \theta_3\, x_S(t)

is compared with M1. In this case, the time dependency of the substrate concentration is given by

    x_S(t) = \frac{\theta_1}{\theta_3} \left( 1 - \exp(-\theta_3 t) \right)    (4.39)

for the case x_S(0) = 0. Again, the simplifying assumption y(t) = x_S(t), t = t_1, t_1 + ∆t, ..., t_1 + 9∆t, is made. Because the number of parameters is equal for both models M1 and M2, the utility functions based on the likelihood ratio (4.34) and on the difference of the Akaike information (4.35) are equivalent.

The left panel in Figure 4.8 shows the performance V^{(M1,M2)}(t_1, ∆t) if model M1 is assumed as the true model. In the middle panel, the performance is displayed if M2 is the correct model. If both models have equal prior probabilities π(M1) = π(M2), V^{(M1,M2)} and V^{(M2,M1)} can be averaged to obtain an expected performance V(t_1, ∆t) according to (4.32), as plotted in the right panel. In this case, however, the average is dominated by V^{(M1,M2)}, because model M1 is hardly discriminated if model M2 is the truth. Therefore, depending on the purpose of the study, it could be more appropriate to optimize the worst-case scenario, i.e. equation (4.33). In this example, this coincides with the result shown in the middle panel.
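A corresponding sketch for the discrimination criterion (4.27) with M1 assumed true: the rival model M2 is fitted to the noise-free M1 response, and the standardized squared difference of the two predictions is summed over the design points (x_s denotes the M1 solution from the previous sketch):

    import numpy as np
    from scipy.optimize import least_squares

    def x_s_m2(times, th1, th3):
        """Analytic substrate concentration of model M2, eq. (4.39)."""
        return th1 / th3 * (1.0 - np.exp(-th3 * times))

    def v_discrimination(t1, dt, theta_true=(2.0, 1.0, 1.0), sigma=0.05):
        times = t1 + dt * np.arange(10)
        y1 = x_s(times, theta_true)                # expected response of M1
        fit = least_squares(lambda p: x_s_m2(times, *p) - y1, x0=[2.0, 1.0])
        y2 = x_s_m2(times, *fit.x)                 # best-fitting M2 response
        return np.sum(((y1 - y2) / sigma) ** 2)    # criterion of eq. (4.27)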


Figure 4.8: The left panel shows the performance of model discrimination depending on the sampling times if model M1 is true. The respective performance for M2 as the underlying truth is depicted in the middle. Note the different vertical axes of the two panels, i.e. the performance is superior if model M1 is true. Therefore, the average performance shown in the right panel is dominated by V^{(M1,M2)}. In this example, the worst-case scenario according to (4.33) is identical to the case where model M2 is correct.

4.5 Limitations

The methodology of numerical design optimization introduced above seems satisfying because it allows for experimental planning with respect to observables, system perturbations, and sampling times. However, there are several practically relevant design issues in cell biological experiments which still cannot be answered because neither technical nor biological data heterogeneities are included in the model. This issue is addressed in this section.

For mechanistic models of biochemical reaction networks, a methodology of design optimization has been introduced in the previous sections of this chapter. Besides the drawback of the numerical complexity, the expected performance of a design can be evaluated and design variables can be optimized by improving the utility function. For experiments in technical disciplines like physics or engineering, this approach would be sufficient in a wide range of applications. However, the methodology is insufficient for scientific disciplines suffering from systematic errors between experimental runs. In cell biology, the experimental conditions are usually not perfectly reproducible due to technical and especially biological variability. Therefore, it turns out that many practically relevant questions of the experimentalists cannot be answered by the methodology introduced above. Here, I provide some examples:

• Should cells from different individuals be pooled or is it more informative to measure each individual separately?

• How should the different measurement times and perturbations be spread over different experimental runs, e.g. over different gels?

• How should the evaluated points in time and the treatments be assigned to individuals or cell preparations?


Response: log(Foreground)

                                      Df   Sum Sq.   Mean Sq.    F value     p-value
Experiment                            13   249.481    19.191    1093.918   < 2.2e-16
Experiment:Gel                        40    28.244     0.706      40.249   < 2.2e-16
Time:Treatment                        51    34.008     0.667      38.011   < 2.2e-16
Experiment:Gel:log(Background)        54    23.344     0.432      24.641   < 2.2e-16
Residuals                            235     4.123     0.018

Table 4.1: ANOVA table obtained for insulin receptor phosphorylation data measured by immunoblotting. The most prominent effect on the experimental data is the biological variability between the cell preparations used for different experiments. Within an experiment, the same cell preparation is measured on different gels. The technical variability between different gels is still larger than the effect of interest, i.e. the response due to treatment and time. As proposed in Chap. 2.5, the last interaction term corresponds to a gel-specific multiplicative background correction.

• How can trends in time, e.g. circadian rhythms or the cell cycle, be accounted for?

• Which fraction of the data points should be used for calibration?

The classical methodology, e.g. the Monte-Carlo approaches for design optimization, is only applicable to such issues if the systematic sources of noise are included in the model, because the experimental noise cannot be treated as uncorrelated. In this thesis, it has already been illustrated for microarrays (Section 1), flow cytometry (Section 3), and immunoblotting data (Section 2) that these experimental techniques suffer from contamination by different sources of systematic noise. Neglecting such systematic errors does not allow appropriate design optimization. Table 4.1 shows, as an example, that for immunoblotting the biological variability between different experiments is the dominating effect, as indicated by the sum of squares and mean square values. Also, the technical variability between different gels has a stronger influence on the data than the treatment-dependent time course of activation.

If the sources of systematic noise are included in the model, the assumption of Gaussian observational noise, i.e. for the residuals, still holds. In addition, numerical design optimization by Monte-Carlo becomes feasible for design issues concerning the technical and biological variability. Then, the questions raised above can also be answered. The error model for immunoblotting data introduced in Chap. 2.5 provides one possibility to extend the classical ODE methodology to account for the specific noise structure of immunoblotting data. This error model can immediately be translated into observational functions, as sketched below. Instead of factorial time and treatment effects, the ODE system would provide the prediction of the true underlying concentrations. Another reasonable extension are preparation- or batch-dependent kinetic constants and initial molecule concentrations. Depending on the experimental setup, this would account for cell-to-cell variability, for varying proportions of cell types in ex vivo studies, or for a varying susceptibility of cells to stimulations due to stress during the preparation procedure.
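As a minimal sketch of such an observational function, the following code adds a gel-specific scaling factor to the model prediction; the structure and names are illustrative and do not reproduce the error model of Chap. 2.5 in detail:

    import numpy as np

    def g_with_batch(x, scale_per_gel, gel_index):
        """Prediction for measurements run on several gels:
        y_i = s_{b(i)} * x_i, with one scaling factor s_b per gel."""
        return np.asarray(scale_per_gel)[np.asarray(gel_index)] * np.asarray(x)

    x_pred = [0.2, 0.8, 1.1, 0.9]   # ODE predictions at four design points
    gels = [0, 0, 1, 1]             # gel on which each sample was run
    print(g_with_batch(x_pred, scale_per_gel=[1.0, 1.4], gel_index=gels))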


In recent years, research in systems biology has mainly concentrated on the development of realistic models for the biological process. However, statistical analyses require a realistic model of the data, and the underlying process is only one part of such a model. Models accounting for sources of systematic noise allow for more accurate statistical inference as well as for a more general experimental planning. In addition, the biological variability can be accounted for, qualitatively, during the experimental setup. In the following sections, sampling strategies accounting for these heterogeneous experimental settings are discussed.

4.6 Sampling strategies

In this section it is discussed how an elaborate selection of the experimental units like cells or individuals allows for controlling and eliminating sources of data heterogeneity, in some cases without further experimental effort. However, the scope of the results is often restricted at the same time.

Any biological experiment is conducted to obtain knowledge about a population of interest, e.g. about cells from a certain tissue of an organism. Sampling refers to the process of selecting the experimental units, i.e. cells, to study a question under consideration. The aim of appropriate sampling is to provide experimental results which are representative of the population of interest, i.e. to avoid systematic errors, and to minimize the variability in the measurements due to inhomogeneities of the experimental units. Adequate sampling is a prerequisite for drawing valid conclusions. Moreover, the finally selected sub-population of studied experimental units and the biochemical environment define the scope of the results. If, as an example, only data from a certain phenotype or from a specific cell culture is examined, then the generalizability of any result to other populations is initially unknown.

In cell biology, there is usually a huge number of potential features or covariates of the experimental units having a noticeable impact on the observations. In principle, each genotype and each environmentally induced varying feature of the cells constitutes a potential source of variation. Further undesired variation can be caused by inhomogeneities of the cells due to cell density, cell viability, or mixtures of measured cell types. Moreover, systematic errors can be caused by changes in the physical experimental conditions like the pH value or the temperature.

The initial issue is to appraise the relevant covariates which should be controlled. As proposed in Section 4.5, these interfering covariates can be included in the model to adjust for their influences. However, this often yields an undesired enlargement of the model (see the example in Figure 4.9, strategy (3)). An alternative to extending the model is controlling the interfering influences by an appropriate sampling [JACOBSEN et al 2006]. One possibility for becoming independent of a source of noise is to choose a single fixed level of the influencing covariate. However, this restricts the scope of the study to the selected level of the covariate. For the demonstration of the trade-off between a decreasing heterogeneity and limitation of


the study's scope, the sampling strategy which was chosen in the HepatoSys project is discussed. HepatoSys is a systems biology initiative for the investigation of the outstanding regenerative potential of primary mouse hepatocytes. The bulk of the projects of this dissertation was part of HepatoSys. In HepatoSys, it was intended to study individual cellular signal transduction pathways under comparable experimental settings. In addition, it was desired to establish the mathematical models for the different pathways for a comparable cellular state, enhancing the possibilities of merging the single-pathway models into larger signaling networks. Therefore, exploratory experiments had been conducted to establish a standard operating procedure (SOP) which seemed to allow an efficient investigation of all relevant pathways [KLINGMÜLLER et al 2006]. In this SOP, besides the mouse strain, sex and age, also the tissue extraction as well as the cell preparation had been specified. As an example, the SOP prescribes, after the extraction, a four-hour "attachment phase", followed by a 12-24 hour "pre-starvation phase" and a 2-8 hour "starvation phase". For each phase, the ingredients of the cell culture medium, i.e. the mixture of anti-inflammatory and growth serum substances provided to the cells, had been defined.

By working according to a common SOP, comparable experimental data could be obtained in several projects. The variability of the data could be reduced in comparison to a less precisely defined cell preparation procedure. However, the price to be paid for this advance is that the scope of the results is uncertain. In preliminary experiments it was ensured that the SOP yields a cellular state which allows activation of the major pathways. However, it is unclear to which extent this cellular state is representative of the in vivo situation. Correspondingly, the generalizability of mathematical models established under SOP conditions is unknown. In addition, it is unknown whether a single cell preparation allows for the investigation of the whole spectrum of relevant cellular responses. Because the cells' behavior seems to depend on the preparation, measurements under varying preparation protocols could be essential for revealing the intracellular signaling network.

An extension of measuring only a single standard condition is the so-called stratified sampling. Here, the samples are grouped or stratified according to the factor levels of the sources of systematic noise. This grouping strategy is called stratification. The obtained groups are called blocks or strata. One attempts to assign the experimental conditions of interest to the strata in a way which ensures that each condition is affected to the same extent by the interfering covariates. Such a blocking strategy is frequently applied when the runs cannot be performed at once or under the same conditions. In a complete block design [KIRK 1989], every treatment is allocated to each block. The experiments and analyses are executed for each block independently (see Figure 4.9, strategy (2a)). Merging the results obtained for the blocks yields more precise estimates because the variability due to the interfering factors is eliminated. Paired tests [GOULDEN 1956] are special cases of such complete block designs. In full factorial designs, all possible combinations of the factor levels are examined. Because the number of combinations rapidly increases with the number of regarded covariates, this strategy requires a large experimental effort.
One possibility to reduce the number of necessary measurements is a subtle combination of the factorial influences. Latin-square-sampling represents such a strategy for two blocking covariates. A prerequisite is that the number of the considered factor


Example: Interfering covariates

The dynamics of protein activation is studied by measurements at different times after stimulation. It is assumed that there is a biological variability between cells obtained from different individuals and that circadian rhythms have an impact. There are three main strategies to account for the impact of these two covariates or factors, "individual" and "circadian state".

(1) Limiting the scope: Here, the impact of both factors is controlled by fixing their levels. Only one individual would be measured and all probes would be obtained at the same time of day. Then, all data points are affected to the same extent by the two covariates. However, this strategy would limit the scope. The generalization to the population of all individuals and to other points in the circadian rhythm is unknown.

(2a) Blocking/Stratification: Here, the measurements are grouped according to both factors, i.e. only measurements from one individual at a certain circadian state are considered as replicates. Each group is analyzed independently and the results are averaged. This yields mean estimates of the model parameters. However, this strategy could lead to a large experimental effort. If n_i denotes the number of regarded individuals, n_c the number of circadian states and n_t the number of time points, at least n_i × n_c × n_t measurements would be required.

(2b) Latin square blocking: Here, the same number of factor levels n_i = n_c = n_t are investigated. The combinations of the three effects are chosen so that each time point is measured once for each individual and once for each circadian state. Then only n_i × n_c measurements are required and the impact of the two covariates is, on average, equal for each point in time. The limitation is that the biological variability as well as the impact of the circadian clock cannot be estimated.

(3) Expanding the model: Here, additional model parameters are introduced to adjust for the individual as well as the circadian impact. However, it has to be known in which way the model has to be expanded, e.g. whether both factors can be accounted for by different offset parameters, by different kinetic constants, etc. In addition, the increase in the number of parameters would require more data points to obtain the same precision as in a non-expanded model.

Figure 4.9: An illustrative example of how the impact of two sources of variation can be accounted for in time course measurements.

levels is equal to the number of regarded experimental conditions. Further, latin-square sampling assumes that there is no interaction between the two blocking covariates, i.e. the influences of the factors on the measurements are independent of each other, e.g. there are no cooperative effects. A latin-square design for the elimination of two interfering factors with three levels is illustrated in Figure 4.10 (cf. strategy (2b) in Figure 4.9). Here, three different conditions, e.g. times t1, t2, t3 after a stimulation, are measured for three individuals A, B, C at three different states c1, c2 and c3 within the circadian rhythm. The obtained results are unbiased with respect to the biological variability due to different individuals and with respect to the circadian effects.

Pooling of samples constitutes a possibility to obtain measurements which are less affected by biological variability between experimental units without an increase in the number of experiments [KENDZIORSKI et al 2005].


                      individual
                      A    B    C
    circadian   c1   t1   t2   t3
    state       c2   t2   t3   t1
                c3   t3   t1   t2

Figure 4.10: Latin-square experimental design for three individuals A, B, C measured at three states c1, c2, c3 of the circadian rhythm. Because each time point t1, t2, t3 is influenced to the same extent by both interfering factors, the average estimates are unbiased.
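The allocation of Figure 4.10 can be generated by a cyclic shift, which guarantees that each time point occurs exactly once per individual and once per circadian state; a minimal sketch:

    import numpy as np

    def latin_square(n):
        """n x n cyclic latin square with entries 0..n-1: entry (i, j) is
        (i + j) mod n, so each value occurs once per row and column."""
        idx = np.arange(n)
        return (idx[:, None] + idx[None, :]) % n

    for c, row in enumerate(latin_square(3), start=1):
        print(f"c{c}:", ["t%d" % (k + 1) for k in row])
    # c1: t1 t2 t3 / c2: t2 t3 t1 / c3: t3 t1 t2, as in Figure 4.10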

Pooling is only reasonable when the interest is not in single individuals or cells but in common patterns across a population. If the interest is in the single experimental unit, e.g. if a mathematical model for an intracellular biochemical network like a signaling pathway has to be developed, pooled measurements obtained from a cell population are only meaningful if the dynamics is sufficiently homogeneous across the population. Otherwise, e.g. if the cells do not respond to a stimulation simultaneously, only the average response can be observed. Then the scope of the mathematical model is limited to the population average of the response and does not cover the single-cell behavior. Pooling can also cause new, unwanted biological effects, e.g. stress responses or pro-apoptotic signals. Therefore, it has to be ensured that these induced effects do not have a limiting impact on the explanatory power of the results. However, if pooling is meaningful, it can clearly decrease the biological variability and the risk of unwanted confounding, especially for a small number of repetitions.

4.7 Confounding

Frequently, covariates with a relevant impact on the measurements are unknown or cannot be controlled experimentally. These covariates are called confounding variables or simply confounders [GREENLAND & MORGENSTERN 2001]. In the presence of confounders, it is likely that ambiguous or even wrong conclusions are drawn. This occurs if some confounders are over-represented within a certain experimental condition of interest. In an extreme case, the same level of a confounding variable would be realized for all samples within a group of replicates. Over-representation of confounders is very likely for small numbers of repetitions. In Figure 4.11, the probability is displayed that there is a confounding variable for which the same level is realized for every repetition in one of two groups. It shows that there is a high risk of over-representation if the number of repetitions is too small.

An adequate amount of replication is a main strategy to avoid unintended confounding. It ensures that significant correlations between the measurements and the chosen experimental conditions are due to a causal relationship. However, especially in studies based on high-throughput screening methods, three or even fewer repetitions are very common.



[Plot: probability of total overrepresentation in a group (vertical axis, 0 to 1) versus the number of confounders (horizontal axis, 1 to 10), with curves for group sizes ng = 2, 3, 4, 5, and 10.]

Figure 4.11: The probability of a totally over-represented confounder, i.e. the chance of the occurrence of a confounding variable for which the same level is realized for all ng repetitions in a group. In this example, confounding variables are assumed to have two levels with equal probabilities.

In systems biology, measurements of the dynamic behavior after a stimulation are very common. Here, additional confounding by systematic trends in time can occur, e.g. caused by the cell cycle or by circadian processes. The design of experiments which are robust against time trends is discussed in [ATKINSON & D ONEV 1996, BAILEY et al 1992]. Another basic strategy to avoid systematic errors is randomization. Randomization means both a random allocation of the experimental material and a random order in which the individual runs of the experiment are performed. It minimizes the risk of unintended confounding because any systematic relationship between the treatments and the individuals is avoided. A non-random assignment of experimental conditions to experimental units can introduce systematic errors, leading to distorted, i.e. biased, results [F ISHER 1950]. If, as an example, the controls are always measured after the treated samples, a bias can be introduced if the cells are not perfectly in homeostasis. For immunoblotting, it has been shown that chronological gel loading causes systematic errors [S CHILLING et al 2005a,b]. A randomized, non-chronological gel loading is recommended to obtain uncorrelated measurement errors.
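As a minimal sketch of such a randomization (the sample names are hypothetical), a non-chronological loading order can be drawn as a random permutation:

    import random

    samples = ([f"control_{i}" for i in range(1, 6)]
               + [f"treated_{i}" for i in range(1, 6)])
    random.seed(1)                 # seed fixed only to make the example reproducible
    loading_order = random.sample(samples, len(samples))  # random gel loading order
    print(loading_order)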

4.8 Conclusions

In systems biology, experimental planning is becoming crucial because the establishment of mathematical models for complex biochemical networks requires huge experimental efforts.



In comparison to classical biochemical studies, the investigation of reaction networks requires a relatively large amount of experimental data to calibrate dynamic models. At first glance, this seems to be a drawback of mathematical modeling approaches in cell biology. The major reason for the need of a large amount of experimental knowledge, however, is not the modeling approach itself, but rather the complexity of the issue: the number of hypotheses, i.e. of possible interactions, increases exponentially with the network size. Establishing a model does not in itself require extra experimental information. If unspecified model components, e.g. non-identifiable parameters, have no impact on a hypothesis or model prediction, there is no need to address them experimentally. Nevertheless, the models have to be established efficiently with respect to the hypothesis under investigation. The modeling framework itself should therefore not be restricted to mechanistic models. The scale of the issue usually determines the scale of the model. If, as an example, the physiological impact of the liver on glucose regulation in the blood were studied, the intracellular insulin signaling pathway would not necessarily have to be modeled at the level of individual molecular interactions. Instead, a phenomenological model of the input/output behavior of the pathway might be sufficient.

In this chapter, an overview of experimental design aspects for systems-biological applications has been provided. Two general numerical approaches for the optimization of designs for parameter estimation and for model discrimination have been introduced and illustrated by examples. Extensions of classical modeling approaches accounting for biological as well as technical noise have been proposed. General principles of experimental planning, i.e. replication and randomized sampling, as well as the problem of confounding have been discussed. In addition, it was emphasized that clear definitions of the investigated hypotheses and of the scope of the study are crucial.

In comparison to classical questions concerning the design of experiments, applications in systems biology are characterized by little prior knowledge. Therefore, experimental design considerations have to be robust against the preceding assumptions. By all means, the sensitivity of a proposed experimental design to its assumptions has to be considered. However, there is a general trade-off between the robustness of the designs and their efficiency for testing the hypotheses under consideration. A related problem is that the models are often large while the available data are very limited. Therefore, experiments have to be planned on the basis of imprecise knowledge. Moreover, relative noise levels of 10% and more are standard for biochemical data, and model identification based on such noisy data is a challenging task. This situation can be improved by efficient experimental designs, but the methods for experimental planning have to deal with the problem of non-identifiable parameters. The models in systems biology are usually nonlinear in the parameters. Therefore, linearized models are only rough approximations and often do not show qualitatively the same behavior as the exact model. Monte Carlo approaches for experimental planning do not require any restrictive assumptions.
However, an automatic and reliable parameter estimation procedure is needed, which has been shown to be very challenging for systems biology applications [BANGA 2008, M OLES et al 2003]. Because the underlying truth is known for the Monte-Carlo approaches, this issue can be resolved in most circumstances.



Another challenge in the optimization of the design is the fact that the expectation of the utility function of different designs can only be approximated by many realizations of the underlying model, its parameters, and the observational noise. Because this approximation of the utility function is not smooth, standard optimization techniques, e.g. those based on gradient descent, which assume a deterministic utility function, are not applicable. For these reasons, mathematical modeling in systems biology is a very challenging task, which also offers the opportunity and the need to develop new methodological approaches. Proper experimental planning can close gaps between model-based predictions, biologically motivated hypotheses, and the experimental validation, and enables the full power of mathematical modeling to be exploited.

As a concluding remark, I would like to emphasize again the fundamental difference between design optimization in technical disciplines and in biomedical research. The experimental setup for studying a technical dynamical system is usually perfectly reproducible. In experimental planning, the major task is therefore to find an informative input for exploring the system's dynamic behavior. In biological applications, however, an experiment can usually not be repeated under identical conditions because each biological sample is unique. Here, experimental planning requires the selection of a representative sample allowing for an unbiased estimation. Otherwise, it is uncertain whether patterns in the experimental data are really due to the effects of interest.


5 Observability analysis and confidence intervals for model predictions

A major goal of mathematical models is prediction, i.e. forecasting a system's behavior in various circumstances. However, mechanistic dynamic models of biochemical networks such as Ordinary Differential Equations (ODEs) contain a large number of unknown parameters like the reaction rate constants and the initial concentrations of the compounds. The high dimensionality of the parameter space as well as the nonlinear impact of the parameters on the model responses hamper the determination of confidence regions for parameter estimates. Consequently, classical approaches translating the uncertainty of the parameters into confidence intervals for model predictions are hardly feasible.

In this chapter it is shown that a so-called 'prediction profile likelihood' yields reliable confidence intervals for model predictions, despite arbitrarily complex and high-dimensional shapes of the confidence regions for the estimated parameters. The theoretical background of the calculation of prediction uncertainties was already established in the statistical literature in the late 1970s, but the proposed calculations were not feasible for realistic applications, e.g. in biochemistry, at that time. In this chapter, an approach is presented which solves this task numerically using constrained optimization. This approach turns the issue of sampling a high-dimensional parameter space into the evaluation of one-dimensional prediction spaces. The resulting prediction confidence intervals of the dynamic states allow a data-based observability analysis. The method is also applicable if there are non-identifiable parameters, yielding insufficiently specified model predictions that can be interpreted as non-observability. Moreover, a 'validation profile likelihood' is introduced that should be applied when noisy validation experiments are to be interpreted. The properties and applicability of the prediction and validation profile likelihood approaches are demonstrated by two examples: a small and instructive ODE model describing two consecutive reactions, and a realistic ODE model of the MAP kinase signal transduction pathway.

The value of a model parameter can also be considered as a prediction. In this special case, the prediction profile likelihood reduces to the parameter profile likelihood, which we introduced in R AUE et al [2009] for the investigation of parameter identifiability and for the calculation of confidence intervals for parameter estimates. The presented approach constitutes a general concept for observability analysis and for generating reliable confidence intervals of model predictions for a broad range of applications.


5.1 Introduction

Computer-aided simulations are a well-established tool to study the behavior of complex systems. The applications range from forecasting climate change [R.L. S MITH 2009] to predicting events in a detector in high-energy physics [A LLISON et al 1992]. To generate realistic predictions, the individual processes of a system of interest have to be translated into a mathematical framework. The task of establishing a realistic mathematical model which is able to reliably predict a system's behavior is to comprehensively use the existing knowledge, e.g. in terms of experimental data, to adjust the model's structure and parameters. The major steps of this mathematical modeling process comprise model discrimination, i.e. the identification of an appropriate model structure, model calibration, i.e. the estimation of unknown model parameters, as well as prediction and model validation. For all these topics it is essential to have appropriate methods assessing the certainty or ambiguity of any result for given experimental information. For parameter estimation, there are several approaches to derive confidence intervals, like standard intervals based on an estimate of the covariance matrix of the parameter estimates [S ACHS 1984], bootstrap-based confidence intervals [DAVISON & H INKLEY 1997, D I C ICCIO & T IBSHIRANI 1987, J OSHI et al 2006], as well as likelihood-based confidence intervals [V ENZON & M OOLGAVKAR 1988, R AUE et al 2009]. For model discrimination, significance statements can be obtained by statistical tests [B OX & H ILL 1967, R EILLY 1970, S TEWARD et al 1996]. For model predictions, however, there is still a demand for methodology that is applicable to mathematical models like ODEs used to describe the dynamics of a system in a variety of scientific fields, e.g. in molecular biology [H LAVACEK 2009, S WAMEYE et al 2003], but also in medical research, chemistry, engineering, and physics.

The mere estimation of parameters is often not the final aim of an investigation. More frequently, it is desired to utilize the parametrized model to generate model predictions such as the dynamic behavior of unobserved components. Classically, one attempts to translate the uncertainty in the model parameters into corresponding prediction confidence intervals. For models that depend linearly on the model parameters, as in classical regression models, this is well studied and known as propagation of uncertainty based on standard errors. This approach is appropriate and sufficient for many applications. However, e.g. for biochemical networks, the model responses depend nonlinearly on the model parameters. Here, the boundaries of the parameter confidence region can exhibit arbitrarily complex shapes and are usually difficult to translate into boundaries for the prediction confidence intervals. Therefore, established approaches aim to scan the entire parameter subspace which is in sufficient agreement with the experimental data in order to propagate the parameter confidence regions into confidence intervals for the model predictions. The major challenge is the complex nonlinear interrelation between parameters and model responses, which requires that the parameter space be densely sampled to capture all scenarios of model predictions. For models with tens to hundreds of parameters this is numerically demanding or even infeasible because high-dimensional spaces cannot be densely sampled.
This issue is often referred to as the curse of dimensionality in the literature [M ARIMONT & S HAPIRO 1979, S COTT & WAND 1991]. There are several methods for an approximate sampling of the parameter space,



e.g. Markov Chain Monte Carlo (MCMC) methods [G ELMAN et al 2003, G IROLAMI 2008, K ASS et al 1998, P. M ARJORAM & TAVARE 2003] and bootstrap-based approaches [J OSHI et al 2006, M OLINARO et al 2005]. However, for the ODE models used to describe interaction networks, these methods are numerically very demanding and provide only very rough approximations. Therefore, it is difficult to control the coverage of the prediction confidence intervals for these approaches. Moreover, non-identifiable parameters are not explicitly considered, hampering the convergence of these sampling techniques and yielding results that are questionable and difficult to interpret [BAYARRI & B ERGER 2004]. Conceptually related to the prediction profile likelihood approach presented here, R AUE et al [2009, 2010] presented an approach for the determination of confidence intervals for the model parameters by sampling the parameter profile likelihood. For that approach, the problems concerning identifiability are resolved and the computation is feasible also for high-dimensional parameter spaces. Nevertheless, the direct translation to confidence intervals of the model trajectories only works as an approximation, yielding a coverage rate that is sometimes lower than desired.

The idea of the prediction profile likelihood presented here is to determine prediction confidence without an explicit sampling strategy for the parameter space. Instead, a certain fixed value for a prediction is used as a nonlinear constraint, and the parameter values are chosen via constrained maximization of the likelihood. This requires neither a unique solution in terms of identifiability nor confidence intervals for the parameter estimates. It is reasonable because point estimation of the parameters is not of primary interest for generating predictions, i.e. the parameters appear as so-called nuisance parameters. The constrained maximum likelihood approach checks the agreement of a predicted value with the experimental data. By repeating this procedure for continuous variations of the predicted value, a prediction profile likelihood is obtained. Thresholding the prediction profile likelihood yields statistically accurate confidence intervals. The desired level of confidence, which coincides with the level of agreement with the experiments, is controlled by the threshold.

The theoretical background of the prediction profile likelihood, also called predictive likelihood, has already been studied in H INKLEY [1979]. The idea has been applied in the context of generalized linear mixed models [B OOTH & H OBERT 1998] and of unobserved data points [B UTLER 1986]. The linear approximation has been applied in nonlinear regression analyses [T HOMAS F. C OOLEY 1989]. A review of prediction profile likelihood approaches and a modification to sufficiency-based predictive likelihood is provided in B JORNSTAD [1990]. In this project, this concept is applied to ODE models occurring as mechanistic dynamic models, e.g. in molecular and systems biology. In this context the approach allows a data-based observability analysis. Moreover, it is extended to obtain confidence intervals for validation experiments.



5.2 Methodology

5.2.1 The prediction profile likelihood

For additive Gaussian noise ε ∼ N(0, σ²) with known variance σ², two times the negative log-likelihood
\[ -2\,\mathrm{LL}(y|\theta) = \sum_i \frac{(y_i - F(t_i, u, \theta))^2}{\sigma^2} + \mathrm{const.} \tag{5.1} \]

of the data y given the parameters θ is, except for a constant offset, identical to the residual sum of squares
\[ \mathrm{RSS}(\theta) = \sum_i (y_i - F(t_i, u, \theta))^2 / \sigma^2 . \tag{5.2} \]

In this case, maximum likelihood estimation is equivalent to standard least-squares estimation,
\[ \hat{\theta} = \arg\max_{\theta}\,\mathrm{LL}(y|\theta) \equiv \arg\min_{\theta}\,\mathrm{RSS}(\theta) , \tag{5.3} \]

i.e. to minimizing the residual sum of squares. F = g(x(t, u, θ), θ) denotes the model response, which is in our case given after integration of a system of differential equations
\[ \dot{x}(t) = f(x(t), u(t), \theta) \tag{5.4} \]

with an externally controlled input function u and a mapping to experimentally observable quantities
\[ y(t) = g(x(t), \theta) + \varepsilon(t) . \tag{5.5} \]
The parameter vector θ comprises the kinetic parameters as well as the initial values, e.g. the initial concentrations, and additional offset or scaling parameters for the observations. It has been shown [R AUE et al 2009] that the profile likelihood
\[ \mathrm{PL}(\theta_i) = \max_{\theta_{j \neq i}} \mathrm{LL}(\theta|y) \tag{5.6} \]

for a parameter θi given a data set y yields reliable confidence intervals
\[ \mathrm{CI}_\alpha(\theta_i|y) = \left\{ \theta_i \;\middle|\; -2\,\mathrm{PL}(\theta_i) \le -2\,\mathrm{LL}^*(y) + \mathrm{icdf}(\chi^2_1, \alpha) \right\} \tag{5.7} \]

for the estimation of a single parameter. Here, α is the confidence level and icdf(χ²₁, α) denotes the α quantile of the chi-square distribution with one degree of freedom, given by the respective inverse cumulative distribution function. LL* is the maximum of the log-likelihood function after all parameters have been optimized. In (5.6), the optimization is performed over all parameters except θi. The analogy of likelihood-based parameter and prediction confidence intervals is discussed in detail in Sec. 5.2.3. The desired coverage
\[ \mathrm{Prob}\left(\theta_i \in \mathrm{CI}_\alpha(\theta_i)\right) = \alpha , \tag{5.8} \]

i.e. the probability that the true parameter value is inside the confidence interval, holds for (5.7) if the magnitude of the decrease of the residual sum of squares due to fitting θi is χ²₁-distributed.



This is given asymptotically as well as for linear parameters, and it is a good approximation under weak assumptions [F EDER 1968, S EBER & W ILD 1989]. If the assumptions are violated, the distribution of the magnitude of the decrease has to be generated empirically, i.e. by Monte-Carlo simulations, as discussed in Sec. 5.2.4.

The experimental design D = {t, g, u} comprises all environmental conditions which can be controlled by the experimenter, like the measurement times t, the observables g, and the input functions u. A prediction z = F(Dpred, θ) is the response of the model F for a prediction condition Dpred = {tpred, gpred, upred} specifying a prediction observable gpred evaluated at time point tpred given the externally controlled stimulation upred. In some cases the observable gpred corresponds to measuring a dynamic variable x(t) directly, i.e. it corresponds to a compound whose concentration dynamics is modeled by the ODEs. In a more general setting, the observable is defined by an observational function gpred(x(t), θ) depending on several dynamic variables x. Therefore, gpred has to coincide neither with a dynamic variable nor with an observational function g of the measurements performed to build the model. In analogy to (5.8), the desired property of a prediction confidence interval PCIα(D|y) derived from an experimental data set y with a given significance level α is that the probability
\[ \mathrm{Prob}\left( F(D_\mathrm{pred}, \theta_\mathrm{true}) \in \mathrm{PCI}_\alpha(D|y) \right) = \alpha \tag{5.9} \]

that the true value of F(Dpred, θtrue) is inside the prediction confidence interval PCIα is equal to α. In other words, the PCI covers the model response for the true parameters for a proportion α of the noise realizations which would yield different data sets y. The prediction profile likelihood
\[ \mathrm{PPL}(z) = \max_{\theta \in \{\theta \,|\, F(D_\mathrm{pred}, \theta) = z\}} \mathrm{LL}(y|\theta) \tag{5.10} \]

is obtained by maximization over the model parameters satisfying the constraint that the model response F(D, θ*) after fitting is equal to the considered value z for the prediction. The prediction confidence interval is, in analogy to (5.7), given by
\[ \mathrm{PCI}_\alpha(D_\mathrm{pred}|y) = \left\{ z \;\middle|\; -2\,\mathrm{PPL}(z) \le -2\,\mathrm{LL}^*(y) + \mathrm{icdf}(\chi^2_1, \alpha) \right\} , \tag{5.11} \]
i.e. the set of predictions z = F(Dpred, θ) for which the PPL is below a threshold given by the χ²₁ distribution. In analogy to likelihood-based confidence intervals for parameters, such PCIs yield the smallest unbiased confidence intervals for predictions for a given coverage α [C OX & H INKLEY 1994].

Instead of sampling a high-dimensional parameter space, the prediction profile likelihood calculation comprises sampling a one-dimensional prediction space by evaluating several predictions z. Evaluating the maximum of the likelihood satisfying the prediction constraint does in general not require an unambiguous point in the parameter space, as in the case of structural non-identifiabilities. In analogy to the profile likelihood for parameter estimates, the significance level determines the threshold for the PPL, which is given asymptotically by the quantiles (5.7) of the χ²₁ distribution [M EEKER & E SCOBAR 1995]. In Sec. 5.2.4, a Monte-Carlo algorithm is presented which can be used to calculate the threshold in cases where the asymptotic assumption is violated.
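To illustrate the constrained maximization (5.10) numerically, the following Python sketch profiles a prediction for a simple exponential-decay model. The model, the data, and the prediction grid are hypothetical stand-ins chosen only to keep the example self-contained; they are not the models analyzed in this chapter:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import chi2

    # hypothetical toy model F(t, theta) = theta[0] * exp(-theta[1] * t)
    t = np.arange(0.0, 10.0)
    sigma = 0.1
    rng = np.random.default_rng(0)
    y = 1.0 * np.exp(-0.3 * t) + sigma * rng.normal(size=t.size)

    def F(tq, th):
        return th[0] * np.exp(-th[1] * tq)

    def m2ll(th):  # -2 log-likelihood up to a constant, cf. eq. (5.1)
        return np.sum((y - F(t, th)) ** 2) / sigma ** 2

    t_pred = 15.0                          # prediction condition (an extrapolation)
    z_grid = np.linspace(0.001, 0.05, 25)
    m2ppl = []
    for z in z_grid:
        # maximize LL subject to the constraint F(t_pred, theta) = z, cf. eq. (5.10)
        con = {"type": "eq", "fun": lambda th, z=z: F(t_pred, th) - z}
        fit = minimize(m2ll, x0=np.array([1.0, 0.3]), method="SLSQP", constraints=[con])
        m2ppl.append(fit.fun)

    # thresholding, cf. eq. (5.11): approximate 90% prediction confidence interval
    inside = z_grid[np.array(m2ppl) <= min(m2ppl) + chi2.ppf(0.9, df=1)]
    print("approximate 90% PCI: [%.4f, %.4f]" % (inside.min(), inside.max()))

The grid over z replaces the sampling of the parameter space: each constrained fit answers how well a particular predicted value can still be reconciled with the data.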



5.2.2 The validation profile likelihood

Likelihood-based confidence intervals like (5.7) or (5.11) correspond to the region where a likelihood ratio test would not reject the model. Having a prediction confidence interval, the question arises whether a model has to be rejected if a validation measurement is outside the predicted interval. This would in fact hold if a "perfect" validation measurement were available, i.e. a data point without measurement noise. For validation experiments, however, the outcome is always noisy and is therefore expected to lie outside the PCI more frequently than the true value. Hence, the prediction confidence interval (5.11) has to be generalized for application to a validation experiment. For a validation experiment, we therefore introduce a validation profile likelihood VPL and a corresponding validation confidence interval VCI^SD_α in the following. In such a setting, a confidence interval should have a coverage
\[ \mathrm{Prob}\left( z \in \mathrm{VCI}^\mathrm{SD}_\alpha(D_\mathrm{vali}|y) \right) = \alpha \tag{5.12} \]

for the validation data point z ∼ N(µ, SD²) with expectation µ = F(Dvali, θtrue) and variance SD². Here, Dvali denotes the design for the validation experiment. A validation confidence interval satisfying (5.12) allows a rejection of the model if a noisy validation measurement with error SD is outside the interval. VCI^SD_α for validation data can be calculated by relaxing the constraint (5.10) used to compute the prediction profile likelihood, because in this case the model prediction does not necessarily have to coincide with the data point z. Instead, the deviation from the validation data point is penalized equivalently to the data y. The agreement of the model with the data y and the validation measurement z is then given by
\[ -2\,\mathrm{LL}(z, y|\theta) = \sum_i \left( \frac{y_i - F(D_i, \theta)}{\sigma} \right)^2 + \left( \frac{z - F(D_\mathrm{vali}, \theta)}{\mathrm{SD}} \right)^2 + \mathrm{const.} \tag{5.13} \]

We now define the validation profile (log-)likelihood
\[ \mathrm{VPL}^\mathrm{SD}(z|y) = \mathrm{LL}^*(z, y) = \mathrm{LL}(z, y|\theta^*) \tag{5.14} \]

with θ* = θ*(z, y) = argmax_θ LL(z, y|θ), i.e. the maximized joint log-likelihood in (5.13) read as a function of z. The corresponding validation confidence interval is given by
\[ \mathrm{VCI}^\mathrm{SD}_\alpha(D_\mathrm{vali}|y) = \left\{ z \;\middle|\; -2\,\mathrm{VPL}^\mathrm{SD}(z|y) \le -2\,\mathrm{LL}^*(z, y) + \mathrm{icdf}(\chi^2_1, \alpha) \right\} . \tag{5.15} \]

Optimization of the likelihood (5.13) minimizes both the contribution of the data, RSS(y), and the mismatch with the fixed prediction value z. The model response F(Dpred, θ*) obtained after this parameter optimization can be interpreted as a prediction z′ satisfying the constrained optimization problem (5.10) considered for the prediction profile likelihood. It holds that
\[ \mathrm{LL}^*(z, y;\, \mathrm{SD} > 0) - \frac{1}{2}\,\frac{(z - F(D_\mathrm{vali}, \theta^*))^2}{\mathrm{SD}^2} = \mathrm{LL}^*(z', y;\, \mathrm{SD} = 0) , \tag{5.16} \]


Figure 5.1: This figure illustrates the outcome of the prediction profile likelihood calculation (gray dashed line) as well as the validation profile likelihood (black dashed line). The respective confidence intervals obtained by thresholding the profiles correspond to a 90% confidence level in this case. The confidence intervals for predictions are smaller than the respective confidence intervals for validation experiments because validation data is noisy.

i.e. the validation profile likelihood LL* can be rescaled to the prediction profile likelihood via
\[ \mathrm{PPL}(z'|y) = \mathrm{VPL}^\mathrm{SD}(z|y) - \frac{1}{2}\,\frac{(z' - z)^2}{\mathrm{SD}^2} , \tag{5.17} \]

where z′ = F(θ*(z, y, SD > 0)) is the model response for θ* estimated from z and y. For SD → 0, validation confidence intervals converge to prediction confidence intervals. Optimization with nonlinear constraints is a numerically challenging issue. Therefore, (5.17) provides a helpful way to avoid constrained optimization: the VPL can be calculated with SD > 0 like a common least-squares minimization and is afterwards rescaled to obtain the PCI for the true value. Figure 5.1 illustrates the idea of the prediction and validation profile likelihood. The horizontal axis is the predicted value or the value of a validation measurement. The prediction or validation profile likelihood plotted in the vertical direction is a measure for the agreement of the prediction/validation with the existing data. Appropriate thresholding yields the respective confidence intervals.
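The rescaling trick (5.17) can be sketched as follows in Python; the exponential-decay model is again a hypothetical stand-in. The VPL is obtained by an ordinary unconstrained fit with the validation point appended to the objective, and the corresponding PPL value then follows by rescaling:

    import numpy as np
    from scipy.optimize import minimize

    t = np.arange(0.0, 10.0)
    sigma, SD = 0.1, 0.1
    rng = np.random.default_rng(0)
    y = 1.0 * np.exp(-0.3 * t) + sigma * rng.normal(size=t.size)
    F = lambda tq, th: th[0] * np.exp(-th[1] * tq)
    t_vali = 15.0

    def profile_point(z):
        # -2 * joint likelihood of data and validation point z, cf. eq. (5.13)
        obj = lambda th: (np.sum((y - F(t, th)) ** 2) / sigma ** 2
                          + (z - F(t_vali, th)) ** 2 / SD ** 2)
        fit = minimize(obj, x0=np.array([1.0, 0.3]), method="Nelder-Mead")
        m2vpl = fit.fun                      # -2 VPL(z)
        z_prime = F(t_vali, fit.x)           # implied constrained prediction z'
        m2ppl = m2vpl - (z_prime - z) ** 2 / SD ** 2   # rescaling, cf. eq. (5.17)
        return m2vpl, z_prime, m2ppl

    for z in np.linspace(0.0, 0.06, 7):
        print(z, *(round(v, 3) for v in profile_point(z)))

Note that only unconstrained optimizations are required here, which is typically much more robust numerically than the equality-constrained fits of (5.10).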



5.2.3 Re-parametrization

Parameter estimation can be seen as a special case of a model prediction, i.e. as the prediction of a parameter value from experimental data. In that case, the parameter profile likelihood coincides with the prediction profile likelihood, and the respective parameter confidence intervals correspond to prediction confidence intervals. In this sense, the prediction profile likelihood generalizes the parameter profile likelihood. In fact, the idea of the prediction profile likelihood and the calculation of prediction confidence intervals, e.g. the choice of the threshold, is very intuitive for this special case. In other situations, an analogous strategy would require a re-parametrization of the model in such a way that the desired model prediction is unambiguously given by the value of a single new parameter. Then, the profile likelihood for the new parameter would again give a confidence interval for the prediction. Without loss of generality, such a parameter can be denoted by θ'1. The re-parametrization would then be a transformation
\[ T: \{\theta_1, \dots, \theta_{n_p}\} \to \{\theta'_1, \dots, \theta'_{n_p}\} \tag{5.18} \]

of the np parameters θ to new parameters θ'1, …, θ'np, where all predictions for the condition Dpred satisfy
\[ F'(D_\mathrm{pred}, \theta') = F'(D_\mathrm{pred}, \theta'_1) . \tag{5.19} \]
Here, F′ = F ∘ T⁻¹ denotes the model for the transformed parameters. For a transformation satisfying (5.19), any change of the parameters θ'2, …, θ'np would not affect F′, because T is chosen in such a way that the effects of θ2, …, θnp are orthogonal to the effect of θ1. However, because ODE systems can only be solved analytically in special cases, such a re-parametrization cannot be found explicitly for most realistic models. This restriction can be resolved numerically by an implicit re-parametrization which is obtained by a constrained nonlinear optimization procedure. This idea yields the prediction profile likelihood (5.10), which is obtained by maximization over the model parameters satisfying the constraint that the model response F(D, θ*) after fitting is equal to the considered value z for the prediction. In this case, the 'new parameter' is the predicted value itself, i.e. z ≡ θ'1, and F′ is the identity function. Equation (5.10) also resolves the formal issue which occurs if there is no unique parameter set θ given by the constraint F(Dpred, θ) = z. If there are several such parameter sets, the ambiguities either vanish by taking the parameter set which maximizes the log-likelihood, or they are not relevant because only the value of the maximized log-likelihood enters the calculation.

5.2.4 Profile likelihood threshold

According to the discussion in the previous section, a suitable parameter transformation makes the prediction profile likelihood equivalent to the parameter profile likelihood. Therefore, the following discussion holds both for parameter and for prediction confidence intervals. In general, fitting a model to experimental data reduces the residual sum of squares RSS. In the asymptotic case, i.e. for a large number of data points, it can be shown that the decrease of RSS due to fitting one parameter is chi-square distributed with one degree of freedom.



Fit the model M to the measurements obtained for the design D; consider this model as the 'truth'.
Define the prediction design Dpred and evaluate the true model prediction ztrue = F(Dpred).
for each noise realization εi:
    yi(D)        = SIMULATE_DATA(M, D, εi)
    PPLi*        = CALIBRATE_MODEL(yi(D), M)
    PPLi(ztrue)  = CALIBRATE_MODEL_CONSTRAINT(yi(D), M, ztrue)
    collect PPLi(ztrue) − PPLi* into the empirical distribution ecdf
threshold(α) = quantile_α(ecdf)

Figure 5.2: A Monte-Carlo algorithm for calculating the profile likelihood threshold empirically. New noise realizations yi(D) are utilized to calculate the distribution of PPLi(ztrue) − PPLi*. The α quantile of this distribution can be used as a threshold for prediction confidence intervals instead of the asymptotic threshold, i.e. instead of the α quantile of the χ²₁ distribution.

This result also holds exactly in the non-asymptotic case for linear parameters and is utilized to define the asymptotic threshold for profile likelihood confidence intervals [R AUE et al 2009]. For nonlinear parameters, the distribution of the decrease of the residual sum of squares due to the parameter estimation procedure has not yet been derived for the general setting. However, since profile-likelihood-based confidence intervals are invariant under bijective transformations of the parameter space [C OX & H INKLEY 1994], the assumption also holds if there is such a transformation which makes the parameter of interest linear, at least within its confidence interval. Such a transformation only has to exist; it is not required to derive it analytically. A situation where such a transformation does not exist occurs if the nonlinearity yields a non-monotone dependency of the profile likelihood on the parameters, i.e. there are several local minima in the confidence interval. In this case, there is a larger decrease of the residual sum of squares, and the standard threshold yields conservative results, i.e. the calculated confidence intervals are too large for the desired confidence level α.

In Fig. 5.2, a procedure for checking the standard threshold is presented. It is a Monte-Carlo analysis of the impact of the nonlinear constraint used to calculate the prediction profile likelihood on the magnitude of overfitting. In Fig. 5.3, the asymptotic thresholds corresponding to α = 0.05, 0.1, …, 0.95, 0.99, i.e. the quantiles of the χ²₁ distribution, are compared with the empirically calculated thresholds for several prediction scenarios. For the model
\[ A \rightarrow B \rightarrow C , \tag{5.20} \]

which is discussed in more detail later, the asymptotic thresholds are slightly too large for predicting A(t) and B(t) on the basis of measurements of C(t). This makes the asymptotic confidence intervals conservative. The impact on the coverage is discussed in the following section.
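A compact Python version of the algorithm in Figure 5.2 might look as follows; the exponential-decay model again serves as a hypothetical stand-in for the fitted 'true' model:

    import numpy as np
    from scipy.optimize import minimize

    t = np.arange(0.0, 10.0)
    sigma = 0.1
    F = lambda tq, th: th[0] * np.exp(-th[1] * tq)
    theta_hat = np.array([1.0, 0.3])     # model considered as the 'truth'
    t_pred = 15.0
    z_true = F(t_pred, theta_hat)

    rng = np.random.default_rng(1)
    deltas = []
    for _ in range(200):                 # noise realizations
        y = F(t, theta_hat) + sigma * rng.normal(size=t.size)
        m2ll = lambda th, y=y: np.sum((y - F(t, th)) ** 2) / sigma ** 2
        free = minimize(m2ll, theta_hat, method="Nelder-Mead").fun
        con = {"type": "eq", "fun": lambda th: F(t_pred, th) - z_true}
        constrained = minimize(m2ll, theta_hat, method="SLSQP",
                               constraints=[con]).fun
        deltas.append(constrained - free)  # sample of the -2 log-likelihood decrease
    # empirical threshold vs. the asymptotic value icdf(chi2_1, 0.9) ~ 2.71
    print("empirical 90% threshold:", np.quantile(deltas, 0.9))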



Figure 5.3: The Monte-Carlo approach allows a comparison of the asymptotic thresholds with the empirically calculated, i.e. the correct thresholds. Here, the thresholds corresponding to 0.05, 0.1, . . . , 0.95, 0.99 confidence levels have been plotted for nine different prediction scenarios. In our example, the asymptotic thresholds are slightly too large for predictions of A(t) and B(t) which makes the asymptotic confidence intervals conservative.

5.2.5 Comparison of prediction and validation confidence intervals

The validation profile likelihood satisfies
\[ \mathrm{VPL}^\mathrm{SD}(z|y) \le \mathrm{PPL}(z_\mathrm{alt}|y) + \frac{1}{2}\,\frac{(z_\mathrm{alt} - z)^2}{\mathrm{SD}^2} \quad \forall z_\mathrm{alt} , \tag{5.21} \]

i.e. VPL^SD(z|y) is smaller than the right-hand side of the inequality for any alternative predicted value zalt. This inequality can be utilized to interpret a difference between the respective confidence intervals. Furthermore, the equation can be utilized for consistency checks, e.g. to verify the numerically calculated VPL and PPL. Small differences between the sizes of the VCI^SD and PCI indicate a flat prediction profile likelihood close to the threshold, whereas deviations of the confidence intervals in the order of magnitude of SD occur if the PPL has a large slope. This aspect is illustrated in the following. For illustration purposes, a quadratic prediction profile likelihood
\[ -2\,\mathrm{PPL}(z) = \frac{z^2}{\mathrm{SE}^2} \tag{5.22} \]

with SE ∈ {0.1, 0.5, 1, 2} has been assumed. These four settings are shown in Fig. 5.4. The prediction profile likelihood is shown as a red line. For several zalt , the quadratic term in (5.21)



Figure 5.4: Comparison of prediction and validation confidence intervals. Panel (A) shows a prediction profile likelihood (red line) with a rather flat shape. Here, the curvature of the prediction profile likelihood corresponds to a prediction standard error SE = 2. In this case, the prediction confidence intervals are large (red shaded) and the increase of the validation confidence intervals (gray) is smaller than indicated by the validation data error SD. If the data are more informative, i.e. SE decreases (panels B-D), the slope of the prediction profile likelihood increases, yielding larger differences between the PCI and VCI.

is plotted by blue curves attached to the PPL. The VPL constitutes the infimum of these curves, which in this special case can be calculated analytically and is given by
\[ -2\,\mathrm{VPL}^\mathrm{SD}(z) = \frac{z^2\,\mathrm{SE}^2}{(\mathrm{SD}^2 + \mathrm{SE}^2)^2} + \left( \frac{z}{\mathrm{SD}} - \frac{z\,\mathrm{SE}^2}{\mathrm{SD}\,(\mathrm{SD}^2 + \mathrm{SE}^2)} \right)^2 . \tag{5.23} \]

Panel (A) shows the comparison for SE > SD. In this case, the boundaries of the VCI and the PCI differ only by a value of around 0.38. In Panel (B), SE is chosen equal to SD. In Panels (C) and (D), SE is further decreased. This corresponds to more informative data for predicting the exact value of z. In these cases, the optimum of the PPL is narrow in comparison to the validation data error SD. Then, when fitting the model, a mismatch z − z* is predominantly explained by the observation error of the validation data point. The differences between the boundaries of the confidence intervals increase and approach the 95% quantile of the Gaussian distribution, i.e. a value icdf(N(0, SD = 1), 0.95) = 1.64, which is the critical value for the one-sided 5% confidence interval for a validation data point for a constant model prediction, i.e. for SE → 0.
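As a numerical cross-check of (5.23) — a sketch with arbitrary example numbers — the analytic expression can be compared with a direct minimization of −2 PPL(zalt) + (zalt − z)²/SD², i.e. inequality (5.21) written in units of −2 LL:

    import numpy as np

    def m2vpl_analytic(z, SE, SD):
        # right-hand side of eq. (5.23); simplifies to z**2 / (SD**2 + SE**2)
        return (z ** 2 * SE ** 2 / (SD ** 2 + SE ** 2) ** 2
                + (z / SD - z * SE ** 2 / (SD * (SD ** 2 + SE ** 2))) ** 2)

    def m2vpl_numeric(z, SE, SD):
        # infimum over z_alt of the curves attached to the quadratic PPL (5.22)
        z_alt = np.linspace(-10.0, 10.0, 200001)
        return np.min(z_alt ** 2 / SE ** 2 + (z_alt - z) ** 2 / SD ** 2)

    for SE in (0.1, 0.5, 1.0, 2.0):
        print(SE, m2vpl_analytic(1.5, SE, 1.0), m2vpl_numeric(1.5, SE, 1.0))

Both expressions agree and collapse to z²/(SD² + SE²), the usual combination of two independent Gaussian uncertainties.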



5.2.6 Prior information

If prior information about parameters is available, e.g. a prior distribution π(θ), maximum likelihood estimation is replaced by maximum a-posteriori (MAP) estimation
\[ \hat{\theta}_\mathrm{MAP} = \arg\max_\theta\, \rho(y|\theta)\,\pi(\theta) \tag{5.24} \]
\[ \hphantom{\hat{\theta}_\mathrm{MAP}} = \arg\max_\theta \left( \mathrm{LL}(y|\theta) + \log(\pi(\theta)) \right) , \tag{5.25} \]

i.e. the parameters are estimated by maximizing the a-posteriori probability. For most common priors, MAP estimation can be performed as MLE using a penalized likelihood. As an example, a log-normal prior with expectation ⟨θ′⟩ for a parameter component θ′ yields
\[ \mathrm{LL}_\mathrm{prior} = \mathrm{LL} - \frac{1}{2}\,\frac{(\log(\theta') - \langle\theta'\rangle)^2}{\mathrm{Var}(\theta')} + \mathrm{const.} \tag{5.26} \]
To incorporate prior knowledge, the presented prediction profile likelihood approach has to be generalized to MAP estimation, and the penalized likelihood (5.26) is used instead of the standard log-likelihood LL.
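A minimal sketch of the resulting penalized objective (all names hypothetical), assuming a log-normal prior on a single parameter component as in (5.26):

    import numpy as np

    def penalized_m2ll(theta, y, F, t, sigma, i_prior, mean_log, var_log):
        # residual sum of squares plus the prior penalty of eq. (5.26),
        # both in units of -2 log-likelihood
        rss = np.sum((y - F(t, theta)) ** 2) / sigma ** 2
        penalty = (np.log(theta[i_prior]) - mean_log) ** 2 / var_log
        return rss + penalty

    # usage with a hypothetical exponential-decay model
    F = lambda t, th: th[0] * np.exp(-th[1] * t)
    t = np.arange(10.0)
    y = F(t, np.array([1.0, 0.3]))
    print(penalized_m2ll(np.array([1.0, 0.3]), y, F, t, 0.1, 0, 0.0, 0.1))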

5.2.7 Availability and implementation

The major tasks which have to be performed to apply the prediction profile likelihood approach are the application-specific implementation of the mathematical model and the optimization for parameter estimation. Because this is highly context-dependent, there is no way to provide the method in a universally applicable manner. Usually, a programming environment like R or MATLAB is utilized to simulate mathematical models. Such environments already provide optimization algorithms which can be used for parameter estimation. Calculating the validation profile likelihood is similar to classical parameter estimation; the optimization simply has to be performed jointly for the experimental and validation data. For calculating the prediction profile likelihood, the optimization has to be performed in a constrained manner as described in Section 5.2, or formula (5.17) has to be utilized to rescale the validation profile likelihood to the prediction profile likelihood.

In the following, details about the implementation of our examples are provided. For this project, the cvodes package from the Suite of Nonlinear Differential/Algebraic Equation Solvers (SUNDIALS) [H INDMARSH et al 2005] has been used for the numerical integration of the ODEs and the sensitivity equations. MATLAB's fmincon optimizer was used to estimate the parameters. The gradient and the Hessian of the objective function have been provided to the optimizer using the sensitivity equations [L EIS & K RAMER 1988] and the approximation
\[ \frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\,\mathrm{LL} \approx \sum_i \frac{1}{\sigma_i^2}\, \frac{\partial F_i}{\partial\theta_j}\, \frac{\partial F_i}{\partial\theta_k} \tag{5.27} \]

of the second derivatives [P RESS et al 1992]. Within a single optimization procedure, the parameters have been optimized alternately on the logarithmic and the common linear scale until the optimizer converged to a common value on both scales.
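In Python, the Gauss-Newton-type approximation (5.27) can be sketched as follows, assuming the residuals and the sensitivities ∂Fi/∂θj have been computed elsewhere, e.g. by integrating the sensitivity equations (this is a sketch, not the MATLAB implementation used in this project):

    import numpy as np

    def gradient_and_approx_hessian(res, S, sigma):
        # res:   residuals y_i - F_i, shape (n,)
        # S:     sensitivities dF_i/dtheta_j, shape (n, p)
        # sigma: measurement errors sigma_i, shape (n,)
        w = 1.0 / sigma ** 2
        grad = -2.0 * S.T @ (w * res)        # gradient of -2 LL
        hess = 2.0 * S.T @ (w[:, None] * S)  # approximation, cf. eq. (5.27),
                                             # here written for -2 LL (hence
                                             # the factor 2 and the sign)
        return grad, hess

    # tiny usage example with random numbers
    rng = np.random.default_rng(0)
    g, H = gradient_and_approx_hessian(rng.normal(size=5),
                                       rng.normal(size=(5, 3)),
                                       0.1 * np.ones(5))
    print(g.shape, H.shape)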



For the calculation of the validation profile likelihood, 30 test predictions z within the reasonable range of predictions given by the model structure and SD have been evaluated to obtain an initial guess of the likelihood shape. Then the grid is iteratively refined and/or enlarged until a smooth validation likelihood covering the whole confidence interval is obtained and local minima have been removed. For a single profile likelihood, around 10² to 10³ optimizations were required. For the prediction profile likelihood, the initial guess is obtained from the VPL by equation (5.17). The gaps in this guess are then filled by nonlinear constrained optimization. If the constrained optimization procedure did not converge, the validation data error SD has been iteratively decreased by factors 10⁰, 10⁻¹, …, 10⁻⁵.

5.3 Results

5.3.1 Small illustration model

First, a small but illustrative model of two consecutive reactions,
\[ A \xrightarrow{\;\theta_1\;} B \xrightarrow{\;\theta_2\;} C , \tag{5.28} \]

with rates θ1 = 0.05, θ2 = 0.1 and initial conditions A(0) = θ3 = 1, B(0) = 0, C(0) = 0, is utilized to illustrate our approach. For this purpose, it is assumed that C(t) is measured at t = 0, 10, …, 100. For the simulated measurements, Gaussian noise ε ∼ N(0, σ²) with σ = 0.1 has been assumed, which corresponds to a typical signal-to-noise ratio for applications in cell biology of around 10%. If an experimental setup did not allow for negative measurements, a log-normal distribution of the observational noise could be more appropriate. Then, the Gaussian setting is obtained after a log-transformation of the data [K REUTZ et al 2007]. Such transformations and preprocessing procedures would have to be performed before the analysis starts. Panel (a) in Figure 5.5 shows the dynamics of A(t), B(t), and C(t) as well as a typical noise realization. Such simulated data realizations are utilized to calculate the prediction and validation profile likelihood for the dynamic states. Panel (b) shows, as an example, the prediction profile likelihood and the validation profile likelihood for the same noise realization for predicting A(t) at time point t = 10. The validation profile likelihood has been calculated for validation data with 10% measurement noise, as was assumed for the measurements. The vertical axis is minus two times the log-likelihood, which corresponds to the residual sum of squares. For illustration purposes, the minimum of the log-likelihood LL* is shifted to zero in all figures. Three thresholds corresponding to 68%, 90%, and 99% confidence levels are plotted as horizontal lines. As explained in Section 5.2, the projections to the horizontal axis yield the respective confidence intervals for a prediction or for a validation experiment. The constrained optimization procedure is infeasible for A(t) ≤ 0, and therefore the PCIs automatically account for strictly positive values of A. The calculation of the prediction and validation confidence intervals has been repeated for t = 0, 10, …, 100 and all three dynamic states A(t), B(t), C(t).
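The simulation setup of this example can be reproduced in a few lines. The following Python sketch (using SciPy's ODE solver rather than the MATLAB/SUNDIALS setup of Sec. 5.2.7) integrates the model (5.28) with the stated rates and draws one noise realization of the measured compound C:

    import numpy as np
    from scipy.integrate import solve_ivp

    theta1, theta2 = 0.05, 0.1          # rates of A -> B -> C

    def rhs(t, x):
        A, B, C = x
        return [-theta1 * A, theta1 * A - theta2 * B, theta2 * B]

    t_obs = np.arange(0, 101, 10)       # measurement times t = 0, 10, ..., 100
    sol = solve_ivp(rhs, (0, 100), [1.0, 0.0, 0.0], t_eval=t_obs, rtol=1e-8)
    rng = np.random.default_rng(0)
    y_C = sol.y[2] + 0.1 * rng.normal(size=t_obs.size)  # C(t) measured, sigma = 0.1
    print(np.round(y_C, 3))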



Figure 5.5: The three plots in panel (a) show the dynamics and a measurement realization for the small model used for illustration purposes. C(t) is measured, and the dynamics of all states, i.e. A(t), B(t), and C(t), is intended to be predicted. Panel (b) shows, as an example, the prediction profile likelihood (gray dashed curve) and the validation profile likelihood (black dashed curve) of A(t=10). Thresholding yields confidence intervals for prediction (gray vertical lines) and validation (black vertical lines). The three thresholds and the respective projections correspond to α = 68%, 90%, and 99% confidence intervals. The VCIs are larger than the PCIs because they account for the measurement error of a validation data point. Panels (c)-(e) show prediction confidence intervals (gray) for the unobserved states A(t) and B(t) as well as for the measured state C(t). The prediction profile likelihood functions are plotted as black curves in the vertical direction. Non-observability is illustrated in panels (f)-(h). Panel (f) shows a realization of the measurements for a design which does not provide sufficient information about the steady state of C. This leads to a flat prediction profile likelihood for A(t) towards infinity, as shown in panel (g), as well as for B(t) for t>0, as plotted in panel (h). A flat prediction profile likelihood in turn yields unbounded prediction and validation confidence intervals and non-observability of A(t) and B(t), as indicated by the gray shaded regions.



In panels (c)-(e), the respective prediction confidence intervals (PCIs) are plotted as well as the prediction profile likelihood. For plotting the confidence intervals along the time axis, the PCIs evaluated at the eleven time points have been interconnected by piecewise cubic interpolation. The displayed confidence intervals constitute the propagation of information from the measurements of C(t) to predictions of the dynamics of the compound concentrations. Because C is the measured compound in our example, the prediction confidence intervals for C are much smaller than those for A and B. However, A and B also yield bounded prediction confidence intervals, which can be interpreted as observability of these dynamic states.

Figure 5.6: The left panel shows the validation confidence intervals for the unobserved state A(t). The validation profile likelihood functions are plotted as curves in the vertical direction. For plotting the confidence intervals along the time axis, the VCIs have been interconnected by piecewise cubic interpolation. Validation confidence intervals and validation profile likelihood functions for the intermediate unobserved state B(t) are shown in the middle and for C(t) in the right panel.

Figure 5.6 shows the corresponding validation profiles and the respective validation confidence intervals for the same noise realization for all dynamic variables A(t), B(t), and C(t). Validation confidence intervals account for the measurement noise in a validation experiment. Therefore, they are larger than the prediction confidence intervals shown in Fig. 5.5, panels (c)-(e). Because Gaussian noise ε ∼ N(µ, SD²) has been assumed, the validation confidence intervals cover negative values if the true model response µ = F(Dpred, θtrue) is close to zero.

Non-observability

To illustrate the effect of non-observability, the assumption about the available experimental information is slightly changed. The measurements are simulated for earlier and more closely spaced time points, i.e. for t = 0, 2, …, 20. Panel (f) of Figure 5.5 shows that these time points sample only the transient increase of C(t). Hence, such a design does not provide sufficient information about the steady-state level of C. In other words, the modification limits the available information about the total amounts of the compounds. This, in turn, renders A(t) and B(t) non-observable. Panel (g) shows the prediction confidence intervals for A(t). In the chosen setting, the predictions are unbounded towards infinity and therefore A(t) is non-observable. Panel (h) shows that B(t) is non-observable as well.



Figure 5.7: Prediction confidence intervals for the extrapolation of C(t) to time points much larger than the measurement times. Because in this example the experimental design does not provide sufficient information about the steady state level of C(t), the prediction confidence intervals diverge and the steady state C(t) for t → ∞ is non-observable.

According to the model definition, B(0) is known to be zero, but for t > 0, unbounded prediction confidence intervals are obtained, which indicates non-observability of B(t). Figure 5.7 shows prediction confidence intervals for C(t) for times much larger than the measurement times t = 0, 2, …, 20. C(t) becomes practically non-observable for times which are much larger than the sampled time interval.

Coverage

The coverage
\[ C = \mathrm{Prob}\left( F(D_\mathrm{pred}, \theta_\mathrm{true}) \in \mathrm{PCI}_\alpha(D|y) \right) \tag{5.29} \]

is the probability that PCIα(z|y) contains the true value F(Dpred, θtrue). A desired property of any confidence interval is that the coverage coincides with the confidence level α. Figure 5.8 shows the estimated coverage of the prediction confidence intervals calculated for nine different prediction scenarios. In these scenarios, A(2), B(2), C(2), A(10), B(10), C(10), A(50), B(50), and C(50) have been predicted, i.e. all three dynamic variables are predicted at an early, an intermediate, and a late point in time. For this analysis, one hundred noise realizations have been analyzed. The error bars plotted in this figure are bootstrap confidence intervals of the mean coverage. The coverage obtained for the asymptotic threshold (red) tends to be conservative, i.e. the true model response is inside the confidence interval more frequently than specified by the confidence level α. This means that there are more false negatives than intended, which does not constitute a serious problem in terms of the validity of conclusions. In contrast, an anti-conservative coverage would constitute an issue because an increased false positive rate could lead to invalid reasoning.



Figure 5.8: Coverage of the prediction confidence intervals for the consecutive-reaction model. The horizontal axis is the confidence level α = 0.05, 0.1, …, 0.95, 0.99, which constitutes the desired coverage of the confidence intervals. The vertical axis is the realized coverage obtained for 100 data realizations. The red error bars are the results obtained for the asymptotic threshold, which yields conservative outcomes for predictions of A(t) and B(t). The black error bars indicate the results for the Monte-Carlo thresholds, which show almost perfect agreement with the confidence level in all prediction scenarios.

The coverages obtained with the adjusted thresholds from the Monte-Carlo algorithm shown in Figure 5.2 are displayed by the black error bars in Figure 5.8. Here, the coverage coincides with the confidence level, which confirms the validity of the prediction profile likelihood based confidence intervals.

Prior Information

To illustrate the incorporation of prior knowledge about parameter values, the initial concentration A(0) = θ3 is assumed to be drawn from a log-normal distribution
\[ \theta_3 \sim \log N(0, 1) \tag{5.30} \]

with expectation ⟨log(θ3)⟩ = 0 and variance Var(log(θ3)) = 0.1. For parameter estimation, this is accounted for by using the penalized likelihood (5.26), i.e. by adding an additional term to the residual sum of squares. As before, the calculation of the prediction and validation confidence intervals has been repeated for t = 0, 10, …, 100 and all three dynamic states A(t), B(t), C(t). In this example, the true value of A(0) ≡ θ3 has been drawn according to the prior from the log-normal distribution (5.30).



Figure 5.9: The curves in vertical direction are the prediction profile likelihood functions for A(t) (left panel), B(t) (middle), and C(t) (right panel) if a log-normal prior for θ3 is assumed. The respective 90% confidence intervals are plotted in dark gray. The light gray regions indicate the 90% confidence intervals if the parameter θ3 is estimated without prior information.

Figure 5.9 shows the prediction profile likelihood functions as curves in the vertical direction as well as the respective 90% prediction confidence intervals as dark gray shaded regions. The prediction confidence intervals plotted in light color are obtained if θ3 is estimated without prior information. As before, the prediction confidence intervals for the measured compound C are much smaller than those for A and B, while A and B still yield bounded intervals, i.e. they remain observable. Omitting the prior information yields larger prediction confidence intervals, especially for the unobserved states A(t) and B(t).

5.3.2 MAP kinase signaling model

An ODE model of cellular signal transduction has been used to illustrate our method in a realistic setting. For this purpose, a model of the mitogen-activated protein (MAP) kinase cascade, one of the most extensively studied signal transduction pathways, is utilized. The chosen model [K HOLODENKO 2000] consists of eight dynamic states describing the time dependency of the MAP kinases Raf, Raf∗, Mek, Mek∗, Mek∗∗, Erk, Erk∗, and Erk∗∗, which play a very prominent role in many cellular processes, e.g. in cell proliferation. Dysregulation of the MAP kinase pathway is known to be one mechanism of cancer development. A star '∗' denotes phosphorylation of the protein, which biologically acts as activation. The left panel in Figure 5.10 provides a summary of the MAP kinase signaling pathway. The right panel in Figure 5.10 shows the long-term dynamics of this model, i.e. the oscillations. In our analysis, only the initial phase, i.e. the first 1000 seconds, has been considered. This time interval is characterized by strong nonlinearity of the model response with respect to the parameters and constitutes a compromise between a transient and an oscillatory dynamics. The enzymatic reactions in the ODE model are described by Michaelis-Menten rate equations, i.e. each reaction is parametrized with a maximal enzymatic rate and a Michaelis constant. As in the original publication, the parameters of the two consecutive phosphorylation and dephosphorylation steps of Mek and Erk are assumed to be identical, and the initial concentrations are assumed to be known.



Figure 5.10: The left panel shows the MAP kinase model according to K HOLODENKO [2000]. It is assumed that the phosphorylated compounds are measured. The dynamics of all compounds is intended to be predicted to illustrate the prediction profile likelihood approach. The long-term dynamics of the MAP kinase model, shown in the right panel, is characterized by regular oscillations. In our analysis, the first 1000 seconds, highlighted by the gray background, have been analyzed as a compromise between a transient and an oscillatory dynamics.

Figure 5.11: In the left panel, the dynamics of the MAP kinase model as well as a simulated data set are plotted. The 90% confidence intervals of the dynamic variables for predictions (dark gray) and for validation experiments (light gray) for this noise realization are plotted in the right panel. The size of the PCIs is plotted as a dash-dotted line. In absolute concentrations, the dynamics of Erk∗∗ has the largest PCI at t=181 seconds, i.e. when the negative feedback is activated. The dynamics of Mek∗ is also only poorly observable in this example. Measurements of both would be very informative for better calibrating the model.



    symbol  description                        value   lower boundary  upper boundary  units
    V1      max. enzyme rate                   2.5     1e-2            1e2             nM s−1
    n       Hill coefficient of the feedback   1       1               1               1
    KI      Michaelis constant                 9       1e-2            1e2             nM
    K1      Michaelis constant                 10      1e-2            1e2             nM
    V2      max. enzyme rate                   0.25    1e-2            1e2             nM s−1
    K2      Michaelis constant                 8       1e-2            1e2             nM
    k3      catalytic rate constant            0.025   1e-2            1e2             s−1
    K3      Michaelis constant                 15      1e-2            1e2             nM
    k4      catalytic rate constant            0.025   1e-2            1e2             s−1
    K4      Michaelis constant                 15      1e-2            1e2             nM
    V5      max. enzyme rate                   0.75    1e-2            1e2             nM s−1
    K5      Michaelis constant                 15      1e-2            1e2             nM
    V6      max. enzyme rate                   0.75    1e-2            1e2             nM s−1
    K6      Michaelis constant                 15      1e-2            1e2             nM
    k7      catalytic rate constant            0.025   1e-2            1e2             s−1
    K7      Michaelis constant                 15      1e-2            1e2             nM
    k8      catalytic rate constant            0.025   1e-2            1e2             s−1
    K8      Michaelis constant                 15      1e-2            1e2             nM
    V9      max. enzyme rate                   0.5     1e-2            1e2             nM s−1
    K9      Michaelis constant                 15      1e-2            1e2             nM
    V10     max. enzyme rate                   0.5     1e-2            1e2             nM s−1
    K10     Michaelis constant                 15      1e-2            1e2             nM

Table 5.1: Parameters of the MAP kinase model as published in K HOLODENKO [2000]. The upper and lower boundaries of the parameters have been chosen to range over four orders of magnitude.

In this setting, 14 parameters are estimated from three times eleven data points. Tab. 5.1 summarizes the model parameters as published in K HOLODENKO [2000]. The Hill coefficient n is assumed to be equal to one. For the observational noise of the validation data, the same noise level as for the experimental data has been assumed, i.e. σ = SD = 10 nM. It is assumed that the total amounts of the phosphorylated forms of each protein, i.e. Raf∗, the sum of Mek∗ and Mek∗∗, as well as the sum of Erk∗ and Erk∗∗, are measured. This observational assumption holds, for example, for phospho-specific antibodies as utilized for Western blotting. The measurement times are set to 0, 100, …, 1000 seconds. Again, additive Gaussian noise is assumed. In the left panel of Figure 5.11, a typical noise realization is displayed. The right panel shows the prediction confidence intervals (dark gray) and the validation confidence intervals (light gray) for this noise realization, calculated for all dynamic states.



Figure 5.12: Size of the prediction confidence intervals for the dynamic states of the MAP kinase model. The left panels show the size of the confidence intervals in absolute units. In the panels on the right, the size is plotted relative to the total concentration of a protein. The upper row shows the results for the prediction confidence intervals, the lower row for the validation confidence intervals.

The prediction confidence intervals show how precisely the dynamics is inferred from the available data. The temporal behavior of Raf and Raf∗ is quite well determined, i.e. the size of the PCI stays below 40 nM. Similarly, the unphosphorylated states of Mek and Erk have narrow prediction confidence intervals. For Mek∗, the concentration dynamics is only predicted within rather large intervals, which for most time points nearly span the range between zero and 100 nM. The largest absolute size of the prediction confidence interval, 176 nM, is obtained for Erk∗∗ after 181 seconds. This is the point in time where the negative feedback is activated. Additional experimental investigation at this time point would be very informative to further specify the dynamic behavior of the MAP kinase cascade in our example.

Experimental design considerations The size of the prediction confidence intervals can be utilized to identify informative experimental designs. If the information content of single data points is to be evaluated, then the validation confidence intervals are appropriate. If many experimental replicates are feasible, the average observation will have a small standard error, and then prediction confidence intervals can be used to assess a design, as illustrated in the sketch below.
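As a small illustration of this design rule, the sketch below ranks candidate measurements by interval width. The width array is a synthetic stand-in, since in practice it would be taken as the difference of the computed upper and lower interval boundaries:

    import numpy as np

    states = ['Raf', 'Raf*', 'Mek', 'Mek*', 'Mek**', 'Erk', 'Erk*', 'Erk**']
    t_obs = np.arange(0, 1001, 100)

    # Stand-in for the computed 90% interval widths (n_states x n_times);
    # in practice this is pci_upper - pci_lower from the profile calculation.
    rng = np.random.default_rng(1)
    width = rng.uniform(5.0, 180.0, size=(len(states), len(t_obs)))

    # Large width: weakly specified -> informative for parameter calibration.
    i, j = np.unravel_index(np.argmax(width), width.shape)
    print(f"calibration candidate: {states[i]} at t = {t_obs[j]} s")

    # Small width: precisely predicted -> powerful for model validation.
    i, j = np.unravel_index(np.argmin(width), width.shape)
    print(f"validation candidate:  {states[i]} at t = {t_obs[j]} s")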


Figure 5.12 shows the size of the 90% prediction confidence intervals (upper row), i.e. the difference between the upper and lower boundaries, and the size of the validation confidence intervals (lower row) along the time axis. The size is plotted in absolute concentrations (left panels) and relative to the total amount of the respective protein (right panels). Independent of the way of assessment, Erk (blue lines) yields the smallest prediction and validation confidence intervals for 300 < t ≤ 1000. Therefore, measurements of Erk in this time interval constitute very informative experimental designs for testing the model. Such a setting is appropriate for validating the whole model, but it is not informative for improving the model parameter estimation, because the existing data already specify the model response reliably. For that purpose, new experimental data has to be generated for designs in which the model behavior is not yet precisely specified, i.e. for settings with large prediction or validation confidence intervals. In these terms, Erk∗∗ (gray lines) is most informative in absolute units between 100 and 200 seconds. Also, absolute measurements of Mek∗ (red lines) and Mek∗∗ (green lines) along the whole time axis are informative. If only the amount of a phosphorylated form relative to the total concentration of the protein is experimentally accessible, then the panels on the right should be evaluated to assess the power of a design. In our example, the outcome for the prediction profile likelihood is very similar to the results obtained for absolute concentrations. Again, Erk∗∗ for t < 200 as well as Mek∗ and Mek∗∗ are most informative. For the validation confidence intervals, Raf and Raf∗ appear informative because the total concentration of Raf is three times smaller than the total amounts of Mek and Erk.

5.4 Discussion

Existing approaches for prediction confidence intervals like MCMC or bootstrap procedures are based on forward evaluations of the model for many parameter values. This works reasonably well for a low-dimensional parameter space and if the target density function, i.e. the parameter space to be sampled, is well-behaved [Bayarri & Berger 2004]. However, sampling nonlinear high-dimensional spaces densely is impractical, and it is almost impossible to ensure that sampling the parameter space covers all prediction scenarios. Especially in biological applications, the target distributions frequently inherit strong and nonlinear functional relations. In the case of non-identifiability, the parameter space to be sampled is not restricted, rendering convergence close to impossible.

In this chapter, a contrary procedure has been established. The model prediction space is sampled directly, and the corresponding model parameters are determined by constrained maximum likelihood to check the agreement of the predictions with the data. This concept yields the prediction profile likelihood, which constitutes the propagation of uncertainty from experimental data to predictions.

If a comprehensive prior, i.e. for all parameters, were available, a Bayesian procedure like MCMC, i.e. marginalization over the nuisance dimensions, would be feasible and could have superior performance. However, in cell biology applications, prior knowledge is very restricted, because kinetic rates and concentrations are highly dependent on the cell type and biological context, e.g. on the cellular environment and biochemical state of a cell. Therefore, there is usually at most some prior information for a few parameters available. Such prior information can be incorporated in our procedure without restricting its applicability by generalizing maximum likelihood estimation to maximum a-posteriori estimation, as discussed in Sec. 5.2.6.

In general, generating prediction confidence intervals given the uncertainty in the high-dimensional nonlinear parameter space requires a large numerical effort. However, this complication primarily originates from the complexity of the issue itself rather than from the methodological choice. In fact, the aim is approached by the prediction profile likelihood in a very efficient manner, because scanning the parameter space by the constrained optimization procedure to explore the data-consistent predictions is more efficient than sampling the parameter space without considering the predictions, as is done for MCMC. Instead of sampling a high-dimensional parameter space, only the prediction space has to be explored for calculating a prediction profile likelihood, i.e. the optimization of the parameters reduces the high-dimensional sampling problem to exploring a single dimension. For the parameter profile likelihood, it has been demonstrated [Raue et al 2009] that the computational effort scales only slightly super-linearly with the number of parameters. Due to the similarity of the computations, this result carries over to the prediction profile likelihood.

The prediction confidence regions introduced here have to be interpreted point-wise. This means that a confidence level α controls the probability that the model response for the true parameters is inside the prediction confidence interval for a single prediction condition, if many realizations of the experimental data and the corresponding prediction confidence intervals are considered. In contrast, if a single data set is utilized to generate many prediction intervals, e.g. predictions for several points in time as performed above, the results are statistically dependent, i.e. the realization of the PCI of a neighboring time point is very similar and therefore correlated. Therefore, the prediction confidence intervals of a compound for two adjacent points in time very likely both contain the true value, or neither does. In such an example, a common prediction confidence region for two statistically dependent predictions would require a two-dimensional prediction profile likelihood. This topic, however, is beyond the scope of this chapter.

The prediction profile likelihood also provides a concept for experimental planning, as discussed in Sec. 5.3.2. Experimental conditions with a very narrow prediction confidence interval are very accurately specified by the available data. On the one hand, new measurements for such a condition do not provide much additional information to better calibrate the model parameters and hence are, from this point of view, a bad choice for additional measurements. On the other hand, such a condition very precisely predicts the model behavior and is therefore a very powerful candidate setting for validating the model structure.
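In the notation used for the parameter profile likelihood [Raue et al 2009], the construction can be stated compactly for a scalar prediction z = g(θ). This is a restatement of the concept described above under the conventions of this chapter, not an additional result:

    \mathrm{PL}(z) = \max_{\theta\,:\,g(\theta)=z} L(y \mid \theta), \qquad
    \mathrm{PCI}_\alpha = \Bigl\{\, z : -2\,\log\frac{\mathrm{PL}(z)}{L(y \mid \hat{\theta})} \le \mathrm{icdf}\bigl(\chi^2_1, \alpha\bigr) \Bigr\}.

For α = 0.90, the point-wise threshold is icdf(χ²₁, 0.90) ≈ 2.71.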
Contrarily, large prediction confidence intervals indicate conditions which are only weakly specified by the existing data and therefore constitute informative experimental designs for better calibrating the model. Because a design optimization on the basis of the prediction profile likelihood does not require any linearity approximation, in contrast to common experimental design techniques, e.g. those based on the Fisher information [Kreutz & Timmer 2009], the presented procedure could be very valuable for ODE models, which are typically highly nonlinear.

Another potential of the prediction profile likelihood shown in this chapter is its interpretation in terms of observability. This term is commonly used in control theory to characterize whether the dynamics of unobserved variables can be inferred from the set of feasible experiments. The theory in this field is based on analytical calculations, i.e. the limited amount and the inaccuracy of the data are usually not considered. In this chapter, it has been shown that the prediction profile likelihood allows for a general data-based approach to check whether there is enough information about unobserved dynamic states in the given experimental design and realization of measurements. Therefore, in analogy to the terminology of practical identifiability [Raue et al 2009], we suggest to term observability for a given data set, i.e. the existence of a restricted prediction confidence interval, practical observability.
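To make the constrained re-optimization at the core of this approach concrete, the following self-contained sketch computes a prediction profile likelihood and the corresponding 90% point-wise interval for a deliberately simple toy model, an exponential decay with two parameters. The model, the simulated data, and the use of scipy's SLSQP solver are illustrative assumptions for this sketch and do not reproduce the implementation used for the MAP kinase example:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import chi2

    # Toy model: y(t) = A * exp(-k t); simulated data with sigma = 0.1
    t = np.linspace(0, 4, 9)
    sigma = 0.1
    rng = np.random.default_rng(2)
    y = 2.0 * np.exp(-0.7 * t) + rng.normal(scale=sigma, size=t.size)

    def chisq(theta):                  # -2 log L up to an additive constant
        A, k = theta
        return np.sum(((y - A * np.exp(-k * t)) / sigma) ** 2)

    def g(theta, t_pred=5.0):          # scalar prediction: model value at t = 5
        A, k = theta
        return A * np.exp(-k * t_pred)

    fit = minimize(chisq, x0=[1.0, 1.0], method="Nelder-Mead")
    z_hat = g(fit.x)

    # Prediction profile: re-optimize under the constraint g(theta) = z
    z_grid = np.linspace(0.2 * z_hat, 3.0 * z_hat, 41)
    prof = []
    for z in z_grid:
        res = minimize(chisq, fit.x, method="SLSQP",
                       constraints=[{"type": "eq",
                                     "fun": lambda th, z=z: g(th) - z}])
        prof.append(res.fun)
    prof = np.array(prof) - fit.fun

    # Point-wise 90% PCI: all z whose profile stays below the chi2(1) threshold
    inside = z_grid[prof <= chi2.ppf(0.90, df=1)]
    print(f"90% PCI for y(5): [{inside.min():.3f}, {inside.max():.3f}]")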

5.5 Summary

Generating model predictions is a major task in mathematical modeling. For dynamic mechanistic models, as they are applied e.g. in molecular and systems biology, the confidence regions from parameter estimation can have arbitrarily complex shapes. Therefore, it is very difficult or even impossible to sample the parameter space appropriately to generate confidence intervals for predictions. This in turn impedes a data-based observability analysis for the dynamic states. In this chapter, the prediction profile likelihood approach is presented as a methodology which directly, i.e. without sampling the parameter space, calculates the set of model predictions which are consistent with the existing measurements. This concept constitutes a powerful tool for assessing model predictions, performing observability analyses, and experimental design. The method is feasible for arbitrary dimensions of the parameter space. It only requires a proper calculation of the maximum likelihood value, i.e. a numerically reliable parameter optimization procedure. Therefore, the idea is applicable to a broad range of applications. The task of sampling a high-dimensional parameter space is reduced to scanning a one-dimensional prediction space. This allows the calculation of confidence intervals for model predictions as well as confidence intervals for the outcome of validation experiments. Finally, it should be noted that a prediction can be any function of the compounds and the parameters. In applications, e.g. a ratio of two compound concentrations can be a characteristic of interest. In principle, also integrals, peak positions, and other functions of the dynamic states can be considered as predictions, which could be targets for observability considerations as well as for the calculation of prediction and validation confidence intervals. This flexibility renders the prediction profile likelihood a promising concept to resolve one bottleneck in computer-aided simulations of complex systems: the generation of reliable confidence intervals for predictions in a broad range of applications. For this project, I acknowledge the very productive support by Andreas Raue in discussing and developing the method.
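As a hedged illustration of this flexibility, any scalar function of the states and parameters can take the role of g in the toy sketch above. The two hypothetical variants below, a ratio of two model values and an integral of the trajectory, reuse that sketch's imports and notation:

    from scipy.integrate import quad

    def g_ratio(theta):                # ratio of two model values (illustrative)
        A, k = theta
        return (A * np.exp(-k * 1.0)) / (A * np.exp(-k * 3.0))

    def g_integral(theta):             # integral of the trajectory over [0, 4]
        A, k = theta
        return quad(lambda s: A * np.exp(-k * s), 0.0, 4.0)[0]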


Summary

"If your experiment needs statistics, you ought to have done a better experiment."
Ernest Rutherford (1871–1937), pioneer in nuclear physics.

"Questions come in a volume and at a pace that demands answers; we [statisticians] simply do not have the luxury to wait until we have final solutions to problems before we get back to the biologists. A major aim [...] is to stimulate other statisticians to work with their local biologists on microarray experiments and to come up with better solutions ..."
Terry Speed, pioneer in statistics for high-throughput data, 2003.

The discrepancy between both statements can be explained by two facts. In contrast to most experiments in more technical disciplines like physics or engineering, research on living cell systems is based on measurements with orders of magnitude larger heterogeneity. Moreover, with some exceptions like high-energy physics, the experimental feasibility is much more restricted in cell biology, although new measurement techniques continuously augment the experimental practicability. These arguments, and the fact that most biochemical assays require application-specific statistical approaches, demand new statistical methodologies for the data analysis of biochemical assays. This requirement is further amplified by the rapid technological progress in molecular biology in recent years. In agreement with Terry Speed's appeal, exactly this need, i.e. the statistical support of molecular biologists, has been pursued in the presented thesis. The efforts concerning the development of new methods were primarily driven by the requirements of projects performed in close collaboration with scientists in medicine and cell biology. The applications were mainly based on three techniques, namely on different variants of DNA microarrays, western/immunoblotting, and flow cytometry. In the following, the obtained results are briefly summarized. Most of the projects have been described in detail in this thesis; some of them are only available in the original scientific publication.

The data analysis performed for the applications based on DNA microarray data, comprising data preprocessing, estimation of effect sizes, as well as testing significance and adjusting for multiple testing, has been summarized in the first sections of Chapter 1. Some methodological issues have been investigated in more detail, like the advantage of effect-size-based rankings over significance-based orderings (Section 1.7). A comparison between a one-color and a two-color technique has been performed in Section 1.8. Here, also the impact of the choice between alternative data preprocessing strategies has been investigated. For the one-color technique five standard approaches, and for the two-color measurements nine alternative analyses, have been compared comprehensively to assess the consistency between the methods. Moreover, a data processing strategy could be developed (Section 1.9) allowing for quantitative analyses of gene expression in tissue samples as they are prepared in histopathology for conservation in tumor banks, i.e. for tissue slices which have been fixed with formalin and embedded in paraffin [Lassmann et al 2009]. From a statistical point of view, the identification of constantly expressed genes, so-called housekeeping genes, is in some respects a converse issue to the determination of regulated genes. This has been discussed in the application presented in Section 1.10. Further applications for DNA microarray data comprise the identification of biomarkers for the diagnosis of a blood disorder called polycythemia vera [Goerttler et al 2005] and the anti-viral immune response at the transcriptional level after hepatitis C infection [Fang et al 2006]; moreover, the effect of tumor necrosis factor-α (TNF-α) on gene transcription has been studied in Lindenmeyer et al [2007]. The transcriptional mechanisms characterizing the development of autosomal dominant polycystic kidney disease (ADPKD) after transplantation have been studied in Schieren et al [2006]. The transcriptional effects of pharmaceutical drugs and drug combinations, as they are administered after kidney transplantation to suppress the immune system, have been investigated in Rumberger et al [2009] to suggest more efficient drug therapies. Another microarray technology, the single nucleotide polymorphism (SNP) chip, has been used to investigate the importance of genomic aberrations like deletions or trisomies in the development of chronic lymphocytic leukemia (CLL) [Pfeifer et al 2007]. In Bartholome et al [2009], a methodology has been established for so-called gene set analysis, i.e. for the identification of functionally related groups of genes from high-throughput studies. The transcriptional regulation during regeneration of the murine kidney following partial nephrectomy has been studied in Rumberger et al [2007]. The corresponding recovery of the murine liver after partial hepatectomy has been discussed in Section 1.6. The establishment of a representative experimental setting for investigating liver regeneration in mice [Klingmüller et al 2006] was also partly based on the microarray measurements. The early transcriptional proliferative response of hepatocytes, as it occurs in cell culture on so-called collagen monolayers, has been investigated in Zellmer et al [2010].

In Chapter 2, an error model has been established for Western blotting, revealing the sources of systematic errors as well as their distribution [Kreutz et al 2007]. Moreover, systematic errors between different experimental runs performed on different gels have been studied in Schilling et al [2005a]. Experimental design considerations concerning this issue and a method adjusting for such systematic errors have been presented in Schilling et al [2005b].

In Chapter 3, a comprehensive data analysis methodology has been established for flow cytometry data, a technique allowing for the quantification of molecule abundances at the single-cell level by fluorescent labeling. The method comprises an automatic, unbiased selection of the cells of interest, the choice of a data transformation aiming to satisfy distributional assumptions, the quantification of insulin binding as well as of the cell-to-cell variability, the identification of and adjustment for systematic errors, the detection of outlier data sets, as well as averaging over different experimental runs. In this project, we found that there are two types of liver cells, i.e. hepatocytes, whose membrane-located receptors are characterized by different affinities for insulin. Because the liver plays a central role in glucose metabolism and also in the development of diabetes type II, this outcome is of very high biological relevance.

In Chapter 4 and in Kreutz & Timmer [2009], a general methodological framework has been established for experimental design issues in molecular and systems biology. Here, the classical methodology applied in dynamic modeling, as performed in disciplines like physics, chemistry, and engineering, has been connected to classical experimental design approaches as utilized in biostatistics. Some parts of the methodology have been applied to dynamic modeling of signal transduction pathways to identify a sequence of experiments which are informative for model discrimination [Maiwald et al 2007]. For dynamic models described by ordinary differential equations, the detection of non-identifiabilities has been studied in Hengl et al [2007]. An improved methodology for non-identifiability analyses, as well as its relationship to likelihood-based confidence intervals, has been derived in Raue et al [2009] and in Raue et al [2011]. In Schilling et al [2009], the methodology has been applied to a dynamic model of signal transduction, i.e. to erythropoietin (EPO) induced MAP kinase signaling, to unravel the impact of extracellular signal-regulated kinase (ERK) isoforms on cell proliferation. Another application is the EPO-induced signaling of the janus kinase (JAK2) and the subsequent effects on signal transducer and activator of transcription (STAT5) proteins. For this signal transduction pathway, the impact of feedback regulation for physiologically meaningful signaling over a wide range of EPO concentrations has been studied in Bachmann et al [2011].

Parameter identifiability considerations and calculations of confidence intervals have been generalized to observability analyses and confidence intervals for arbitrary model predictions in Chapter 5 and in Kreutz et al [2011]. The presented approach allows a reliable calculation of confidence intervals for model predictions despite a high-dimensional parameter space and the nonlinear impact of the parameters. Moreover, a data-driven observability analysis becomes feasible. Although the method has been motivated by examples from biochemistry, it is generic and therefore relevant for a broad range of scientific applications.

When I started my thesis, we were often challenged by the questions of biologists and could approach their issues only heuristically and, at that time, in a sub-optimal manner. In that respect, we could certainly make enormous methodological progress, and the established methods and insights seem to constitute a reliable basis for further developments in the future. Certainly, the interdisciplinary collaboration with biologists turned out to be fruitful and inspiring for both sides, for theoreticians as well as for experimentalists. Despite the promising progress in improving, accelerating, and simplifying experimental techniques, high-throughput measurements do not turn molecular biology into high-throughput science. The potential can only be exploited by appropriate statistical approaches. And in agreement with Rutherford's opinion stated above, a bulk of data can never replace a smart idea.

"What we hope ever to do with ease, we must learn first to do with diligence."
Samuel Johnson (1709–1784), English poet.

Clemens Kreutz
Freiburg, November 7th, 2011.


Acknowledgement

First, I greatly thank my supervisor Prof. Jens Timmer for his excellent support, scientific education, and for the outstanding academic freedom. I also thank the former head of our group, Prof. Josef Honerkamp, for his visionary methodological orientation. I am deeply thankful to my colleagues Andreas Raue and Kilian Bartholomé for the fruitful teamwork, forthcoming discussions, and amusing times. I further thank Dr. Seong-Hwan Rho, Dr. Julia Rausenberger, Dr. Thomas Maiwald, and Dr. Florian Geier for their collaboration and a memorable time. Moreover, I thank all my current and former collaborators in our group, especially Daniel Kaschek, Bernhard Steiert, Raphael Engesser, Max Schelker, Julian Gehring, Stefan Hengl, Sebastian Sonntag, Dr. Stefan Reinker, Dr. Markus Kollmann, and Dr. Stefan Jansen. Finally, I thank all my experimental collaborators, especially Dr. María Bartolomé-Rodríguez and her group for their great experimental efforts, as well as Prof. Ursula Klingmüller and her group, particularly Dr. Sebastian Bohl and Dr. Marcel Schilling. Moreover, I acknowledge the support of and collaboration with Dr. Dietmar Pfeifer, Dr. Thorsten Kurz, Titus Sparna, Dr. Silke Lassmann, Dr. Johannes Donauer, Dr. Brigitta Rumberger, Dr. Marcel Geyer, and Prof. Gerd Walz. Moreover, I thank Prof. Martin Schumacher and his group, especially Prof. Harald Binder and Dr. Thomas Gerds, for some hints guiding me into appropriate methodological directions. My work has been predominantly funded by the BMBF grants 0313074D-Hepatosys and 0315766-VirtualLiver.

Danksagung

At this point, I would like to thank everyone who supported me during the preparation of my doctoral thesis. Special thanks go to my parents, Erika and Ernst Kreutz, since attending university was only possible thanks to their tireless diligence. The same thanks go to my wife, Susanne Kreutz. Without her constant and unconditional support, this dissertation would likewise not have been possible.

Bibliography

Affymetrix [2002]. Statistical algorithms description document. Technical Report, Affymetrix, Inc. URL http://media.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.

Akaike, H. [1974]. A new look at the statistical model identification. IEEE T. Automat. Contr. AC-19, 716–723.

Allison, J., Banks, J., Barlow, R. J., Batley, J. R., Biebel, O., Brun, R., Buijs, A., Bullock, F. W., Chang, C. Y., Conboy, J. E., Cranfield, R., Dallavalle, G. M., Dittmar, M., Dumont, J. J., Fukunaga, C., Gary, J. W., Gascon, J., Geddes, N. I., Gensler, S. W., Gibson, V., Gillies, J. D., Hagemann, J., Hansroul, M., Harrison, P. F., Hart, J., Hattersley, P. M., Hauschild, M., Hemingway, R. J., Heymann, F. F., Hobson, P. R., Hochman, D., Hospes, R., Jones, R. W. L., Kawagoe, K., Kawamoto, T., Kennedy, B. W., Köpke, L., Kowalewski, R. V., Kreutzmann, H., Lafferty, G. D., Layter, J. G., Lellouch, D., Lloyd, S. L., Lorazo, B., Losty, M. J., Luvisetto, M. L., McPherson, A. C., Mashimo, T., Mättig, P., Mildenberger, J. L., Murray, W. J., O'Neale, S. W., Oreglia, M. J., Patrick, G. N., Pawley, S. J., Pfister, P., Possoz, A., Prebys, E. J., Quast, G., Redmond, M. W., Rees, D. L., Riles, K., Roach, C. M., Rossi, A., Routenburg, P., Schaile, A. D., Tysarczyk-Niemeyer, G., van Dalen, G. J., van Kooten, R., Ward, C. P., Ward, D. R., Watkins, P. M., Watson, N. K., Weisz, S., Wilson, G. W., Yaari, R. & Zanarini, P. [1992]. The detector simulation program for the OPAL experiment at LEP. Nuc. Instr. Meth. A 317, 47–74.

Alpaydin, E. [1998]. Soft vector quantization and the EM algorithm. Neural. Netw. 11(3), 467–477.

Andersen, C. L., Jensen, J. L. & Orntoft, T. F. [2004]. Normalization of real-time quantitative reverse transcription-PCR data: a model-based variance estimation approach to identify genes suited for normalization, applied to bladder and colon cancer data sets. Cancer Res. 64(15), 5245–5250. doi:10.1158/0008-5472.CAN-04-0496.

Asprey, S. & Macchietto, S. [2002]. Designing robust optimal dynamic experiments. J. Process Contr. 12, 545–556.

Atkinson, A. [1981a]. Likelihood ratios, posterior odds and information criteria. J. Econometrics 16, 15–20.

Atkinson, A. C. [1981b]. A comparison of two criteria for the design of experiments for discriminating between models. Technometrics 23, 301–305.


ATKINSON , A. C. [1988]. Recent developments in the methods of optimum and related experimental designs. Internat. Stat. Rev. 56, 99–115. ATKINSON , A. C. & D ONEV, A. N. [1992]. Optimum Experimental Designs. Clarendon Press. ATKINSON , A. C. & D ONEV, A. N. [1996]. Experimental designs optimally balanced for trend. Technometrics 38, 333–341. ATKINSON , A. C. & F EDOROV, V. V. [1975a]. The design of experiments for discriminating between several models. Biometrika 62, 289–303. ATKINSON , A. C. & F EDOROV, V. V. [1975b]. Optimal design: Experiments for discriminating between two rival models. Biometrika 62, 57–70. ¨ BACHMANN , J., R AUE , A., S CHILLING , M., B OHM , M. E., K REUTZ , C., K ASCHEK , D., ¨ B USCH , H., G RETZ , N., L EHMANN , W. D., T IMMER , J. & K LINGM ULLER , U. [2011]. Division of labor by dual feedback regulators controls jak2/stat5 signaling over broad ligand range. Mol. Syst. Biol. 7, 516. doi:10.1038/msb.2011.50. BAILEY, R. A., C HENG , C.-S. & K IPNIS , P. [1992]. Construction of trend resistant factorial designs. Stat. Sin. 2, 393–411. BAILEY, T. L. & E LKAN , C. [1994]. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36. BALDI , P. & L ONG , A. D. [2001]. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17(6), 509–519. BALSA -C ANTO , E., A LONSO , A. A. & BANGA , J. R. [1998]. Dynamic optimization of bioprocesses: deterministic and stochastic strategies. Dev. Food. Sci. . BALTES , M., S CHNEIDER , R., S TURM , C. & R EUSS , M. [1994]. Optimal experimental design for parameter estimation in unstructured growth models. Biotechnol. Prog. 10, 480–488. BANGA , J. R. [2008]. Optimization in computational systems biology. BMC Syst. Biol. 2, 47. doi:10.1186/1752-0509-2-47. BANGA , J. R., BALSA -C ANTO , E., M OLES , C. G. & A LONSO , A. A. [2005]. Dynamic optimization of bioprocesses: efficient and robust numerical strategies. J. Biotechnol. 117, 407–419. doi:10.1016/j.jbiotec.2005.02.013. BARNES , M., F REUDENBERG , J., T HOMPSON , S., A RONOW, B. & PAVLIDIS , P. [2005]. Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res. 33(18), 5914–5923. doi:10.1093/nar/gki890. BARTHOLOME , K., K REUTZ , C. & T IMMER , J. [2009]. Estimation of gene induction enables a relevance-based ranking of gene sets. J. Comput. Biol. 16(7), 959–967. doi:10.1089/cmb.2008. 0226.


Bas, A., Forsberg, G., Hammarström, S. & Hammarström, M.-L. [2004]. Utility of the housekeeping genes 18S rRNA, beta-actin and glyceraldehyde-3-phosphate-dehydrogenase for normalization in real-time quantitative reverse transcriptase-polymerase chain reaction analysis of gene expression in human T lymphocytes. Scand. J. Immunol. 59(6), 566–573. doi:10.1111/j.0300-9475.2004.01440.x.

Bauer, D. F. [1972]. Constructing confidence sets using rank statistics. J. Am. Stat. Assoc. 67, 687–690.

Bayarri, M. & Berger, J. [2004]. The interplay of Bayesian and frequentist analysis. Stat. Sci. 19(1), 58–80.

Beillard, E., Pallisgaard, N., van der Velden, V. H. J., Bi, W., Dee, R., van der Schoot, E., Delabesse, E., Macintyre, E., Gottardi, E., Saglio, G., Watzinger, F., Lion, T., van Dongen, J. J. M., Hokland, P. & Gabert, J. [2003]. Evaluation of candidate control genes for diagnosis and residual disease detection in leukemic patients using 'real-time' quantitative reverse-transcriptase polymerase chain reaction (RT-PCR) - a Europe against cancer program. Leukemia 17(12), 2474–2486. doi:10.1038/sj.leu.2403136.

Bellavance, F., Dionne, G. & Lebeau, M. [2009]. The value of a statistical life: a meta-analysis with a mixed effects regression model. J. Health. Econ. 28(2), 444–464. doi:10.1016/j.jhealeco.2008.10.013.

Benjamini, Y. & Hochberg, Y. [1995]. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289–300.

Berger, J. A., Hautaniemi, S., Järvinen, A.-K., Edgren, H., Mitra, S. K. & Astola, J. [2004]. Optimized LOWESS normalization parameter selection for DNA microarray data. BMC Bioinformatics 5, 194. doi:10.1186/1471-2105-5-194.

Berkey, C. S., Hoaglin, D. C., Mosteller, F. & Colditz, G. A. [1995]. A random-effects regression model for meta-analysis. Stat. Med. 14(4), 395–411.

Bezdek, J. C. [1981]. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Kluwer Academic Publishers.

Bibikova, M., Yeakley, J. M., Chudin, E., Chen, J., Wickham, E., Wang-Rodriguez, J. & Fan, J.-B. [2004]. Gene expression profiles in formalin-fixed, paraffin-embedded tissues obtained with a novel assay for microarray analysis. Clin. Chem. 50(12), 2384–2386. doi:10.1373/clinchem.2004.037432.

Bibikova, M., Yeakley, J. M., Wang-Rodriguez, J. & Fan, J.-B. [2008]. Quantitative expression profiling of RNA from formalin-fixed, paraffin-embedded tissues using randomly assembled bead arrays. Meth. Mol. Biol. 439, 159–177. doi:10.1007/978-1-59745-188-8_11.

Binder, K. [1979]. Monte Carlo methods in statistical physics. Berlin: Springer.

Bjornstad, J. F. [1990]. Predictive likelihood: A review. Stat. Sci. 5(2), 242–254. ISSN 08834237.


B OEDIGHEIMER , M. J. & F ERBAS , J. [2008]. Mixture modeling approach to flow cytometry data. Cytometry A 73(5), 421–429. doi:10.1002/cyto.a.20553. B OLSTAD , B., I RIZARRY, R., A STRAND , M. & S PEED , T. [2003]. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193. doi:10.1093/bioinformatics/19.2.185. B ONFERRONI , C. E. [1935]. Il calcolo delle assicurazioni su gruppi di teste. 13-60. Rome, Italy: Studi in Onore del Professore Salvatore Ortu Carboni. B OOTH , J. G. & H OBERT, J. P. [1998]. Standard errors of prediction in generalized linear mixed models. J. Am. Stat. Assoc. 93, 262–267. B OX , G. E. P. & C OX , D. R. [1964]. An analysis of transformations. J. Roy. Stat. Soc. B 26, 211–254. B OX , G. E. P. & H ILL , W. J. [1967]. Discrimination among mechanistic models. Technometrics 9, 57–71. B ROBERG , P. [2005]. A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 6, 199. doi:10.1186/1471-2105-6-199. B ROCKWELL , S. E. & G ORDON , I. R. [2001]. A comparison of statistical methods for metaanalysis. Stat. Med. 20(6), 825–840. doi:10.1002/sim.650. B URKE , A. L., D UEVER , T. A. & P ENLIDIS , A. [1994]. Model discrimination via designed experiments: Discriminating between the terminal and penultimate models on the basis of composition data. Macromolecules 27, 386–399. B URNETTE , W. N. [1981]. Western blotting: electrophoretic transfer of proteins from sodium dodecyl sulfate–polyacrylamide gels to unmodified nitrocellulose and radiographic detection with antibody and radioiodinated protein A. Anal. Biochem. 112(2), 195–203. B URNETTE , W. N. [2009]. Western blotting : remembrance of past things. Meth. Mol. Biol. 536, 5–8. doi:10.1007/978-1-59745-542-8 2. B USTIN , S. A. [2000]. Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J. Mol. Endocrinol. 25(2), 169–193. B UTLER , R. W. [1986]. Predictive likelihood inference with applications. J. Roy. Stat. Soc. B 48(1), 1–38. B UZZI F ERRARIS , G. & F ORZATTI , P. [1984]. Sequential experimental design for model discrimination in the case of multiple responses. Chem. Eng. Sci. 39, 81–85. B UZZI F ERRARIS , G., F ORZATTI , P., E MIG , G. & H OFMANN , H. [1983]. New sequential experimental design procedure for discriminating among rival models. Chem. Eng. Sci. 38, 225–232.


B YKOV, I., J UNNIKKALA , S., P EKNA , M., L INDROS , K. O. & M ERI , S. [2007]. Effect of chronic ethanol consumption on the expression of complement components and acute-phase proteins in liver. Clin. Immunol. 124(2), 213–220. doi:10.1016/j.clim.2007.05.008. C AIN , K. & G RIFFITHS , B. L. [1984]. A comparison of Isometallothionein synthesis in rat liver after partial hepatectomy and parenteral zinc injection. Biochem. J. 217(1), 85–92. C HAMBERS , J. M. & H ASTIE , T. J. [1992]. Statistical Models in S. Chapman & Hall/CRC. C HAPPELL , M., G ODFREY, K. & VAJDA , S. [1990]. Global identifiability of the parameters of nonlinear systems with specified inputs: a comparison of methods. Math. Biosciences 102, 41–73. C HEN , B. H. & A SPREY, S. P. [2003]. On the design of optimally informative experiments for model discrimination among dynamic crystallization process models. Proc. Found. Comp. Aid. Proc. Op. ( 455–458). C HEN , J., B YRNE , G. E. & L OSSOS , I. S. [2007a]. Optimization of RNA extraction from formalin-fixed, paraffin-embedded lymphoid tissues. Diagn. Mol. Pathol. 16(2), 61–72. doi: 10.1097/PDM.0b013e31802f0804. C HEN , J. J., H SUEH , H.-M., D ELONGCHAMP, R. R., L IN , C.-J. & T SAI , C.-A. [2007b]. Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data. BMC Bioinformatics 8, 412–426. doi:10.1186/1471-2105-8-412. C HEN , Y., D OUGHERTY, E. R. & B ITTNER , M. L. [1997]. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt. 2(4), 364–374. C HERIAN , M. G. & K ANG , Y. J. [2006]. Metallothionein and liver cell regeneration. Exp. Biol. Med. 231(2), 138–144. C HO , K.-H. & W OLKENHAUER , O. [2003]. Analysis and modeling of signal transduction pathways in systems biology. Biochem. Soc. Trans. 31, 1503–1509. doi:10.1042/. C LEVELAND , W. [1979]. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74, 829–836. C ONOVER , W. J. [1971]. Practical Nonparametric Statistics. John Wiley & Sons, New York. C OONEY, M. J. & M C D ONALD , K. [1995]. Optimal dynamic experiments for bioreactor model discrimination. Appl. Microbiol. Biotechnol. 43, 826–837. C OUDRY, R. A., M EIRELES , S. I., S TOYANOVA , R., C OOPER , H. S., C ARPINO , A., WANG , X., E NGSTROM , P. F. & C LAPPER , M. L. [2007]. Successful application of microarray technology to microdissected formalin-fixed, paraffin-embedded tissue. J. Mol. Diagn. 9(1), 70–79. doi: 10.2353/jmoldx.2007.060004. C OX , D. [1961]. Tests of separate families of hypotheses. In Proceeding of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, ( 105–123). University of California Press.


Cox, D. & Hinkley, D. [1994]. Theoretical Statistics. London: Chapman & Hall.

Craig, R. A. & Liao, L. [2007]. Transductive learning with EM algorithm to classify proteins based on phylogenetic profiles. Int. J. Data Min. Bioinform. 1(4), 337–351.

Cui, X., Hwang, J. T. G., Qiu, J., Blades, N. J. & Churchill, G. A. [2005]. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 6(1), 59–75. doi:10.1093/biostatistics/kxh018.

Das, M. K. & Dai, H.-K. [2007]. A survey of DNA motif finding algorithms. BMC Bioinformatics 8, S21. doi:10.1186/1471-2105-8-S7-S21.

Davison, A. & Hinkley, D. [1997]. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.

De, S. K., McMaster, M. T. & Andrews, G. K. [1990]. Endotoxin induction of murine metallothionein gene expression. J. Biol. Chem. 265(25), 15.267–15.274.

de Reyniès, A., Geromin, D., Cayuela, J.-M., Petel, F., Dessen, P., Sigaux, F. & Rickman, D. S. [2006]. Comparison of the latest commercial short and long oligonucleotide microarray technologies. BMC Genomics 7, 51. doi:10.1186/1471-2164-7-51.

De Feo, P. & Myers, R. H. [1992]. A new look at experimental design robustness. Biometrika 79, 375–380.

Dempster, A., Laird, N. & Rubin, D. [1977]. Maximum likelihood from incomplete data via EM algorithm. J. Roy. Stat. Soc. 39, 1–38.

Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., Lane, H. C. & Lempicki, R. A. [2003]. DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol. 4(5), P3.

Dette, H. & Biedermann, S. [2003]. Robust and efficient designs for the Michaelis-Menten model. J. Amer. Stat. Assoc. 98, 679–686.

Dette, H., Melas, V. B. & Pepelyshev, A. [2003]. Standardized maximum E-optimal designs for the Michaelis-Menten model. Stat. Sin. 13, 1147–1167.

Dice, L. R. [1945]. Measures of the amount of ecologic association between species. Ecology 3, 297–302.

DiCiccio, T. & Tibshirani, R. [1987]. Bootstrap confidence intervals and bootstrap approximations. J. Am. Stat. Assoc. 82(397), 163–170.

Do, J. H. & Choi, D.-K. [2007]. cDNA labeling strategies for microarrays using fluorescent dyes. Eng. Life Sci. 7(1), 26–34.

Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., Bittner, M. & Trent, J. M. [2002]. Inference from clustering with application to gene-expression microarrays. J. Comput. Biol. 9, 105–126. doi:10.1089/10665270252833217.


Dudoit, S., Shaffer, J. P. & Boldrick, J. C. [2003]. Multiple hypothesis testing in microarray experiments. Stat. Sci. 18(1), 71–103.

Durbin, B. & Rocke, D. M. [2003]. Estimation of transformation parameters for microarray data. Bioinformatics 19, 1360–1367.

Limpert, E., Stahel, W. A. & Abbt, M. [2001]. Log-normal distributions across the sciences: keys and clues. Bioscience 51, 341–352.

Edwards, B. S., Oprea, T., Prossnitz, E. R. & Sklar, L. A. [2004]. Flow cytometry for high-throughput, high-content screening. Curr. Opin Chem. Biol. 8(4), 392–398. doi:10.1016/j.cbpa.2004.06.007.

Efron, B. [1987]. The Jackknife, the Bootstrap, and other Resampling Plans. Society for Industrial & Applied Mathematics.

Efron, B. & Hinkley, D. V. [1978]. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65(3), 457–482.

Espie, D. & Macchietto, S. [1989]. The optimal design of dynamic experiments. AIChE Journal 35, 223–229.

Balsa-Canto, E., Rodriguez-Fernandez, M. & Banga, J. R. [2007]. Optimal design of dynamic experiments for improved estimation of kinetic parameters of thermal degradation. J. Food Eng. 82, 178–188.

Fang, X., Zeisel, M. B., Wilpert, J., Gissler, B., Thimme, R., Kreutz, C., Maiwald, T., Timmer, J., Kern, W. V., Donauer, J., Geyer, M., Walz, G., Depla, E., von Weizsäcker, F., Blum, H. E. & Baumert, T. F. [2006]. Host cell responses induced by hepatitis C virus binding. Hepatology 43, 1326–1336. doi:10.1002/hep.21191.

Feder, P. I. [1968]. On the distribution of the log likelihood ratio test statistic when the true parameter is "near" the boundaries of the hypothesis regions. Annals of Mathematical Statistics 39(6), 2044–2055.

Feng, X.-J. & Rabitz, H. [2004]. Optimal identification of biochemical reaction networks. Biophys. J. 86, 1270–1281.

Finn, W. G., Carter, K. M., Raich, R., Stoolman, L. M. & Hero, A. O. [2008]. Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: treating flow cytometry data as high-dimensional objects. Cytometry B Clin. Cytom. 76B(1), 1–7. doi:10.1002/cyto.b.20435.

Fisher, R. A. [1950]. Statistical Methods for Research Workers. 11th edition. Edinburgh: Oliver and Boyd.

Flo, T. H., Smith, K. D., Sato, S., Rodriguez, D. J., Holmes, M. A., Strong, R. K., Akira, S. & Aderem, A. [2004]. Lipocalin 2 mediates an innate immune response to bacterial infection by sequestrating iron. Nature 432(7019), 917–921. doi:10.1038/nature03104.


Frank, M., Döring, C., Metzler, D., Eckerle, S. & Hansmann, M.-L. [2007]. Global gene expression profiling of formalin-fixed paraffin-embedded tumor samples: a comparison to snap-frozen material using oligonucleotide microarrays. Virchows Arch. 450(6), 699–711. doi:10.1007/s00428-007-0412-9.

Galvanin, F., Macchietto, S. & Bezzo, F. [2007]. Model-based design of parallel experiments. Ind. Eng. Chem. Res. 46, 871–882.

Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. [2003]. Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). second edition. Chapman and Hall/CRC. ISBN 9781584883883.

Gibbons, R. D., Dorus, E., Ostrow, D. G., Pandey, G. N., Davis, J. M. & Levy, D. L. [1984]. Mixture distributions in psychiatric research. Biol. Psychiat. 19(7), 935–961.

Girolami, M. [2008]. Bayesian inference for differential equations. Theor. Comput. Sci. 408(1), 4–16.

Goerttler, P. S., Kreutz, C., Donauer, J., Faller, D., Maiwald, T., März, E., Rumberger, B., Sparna, T., Schmitt-Gräff, A., Wilpert, J., Timmer, J., Walz, G. & Pahl, H. L. [2005]. Gene expression profiling in polycythaemia vera: overexpression of transcription factor NF-E2. Br. J. Haematol. 129, 138–150. doi:10.1111/j.1365-2141.2005.05416.x.

Goos, P., Kobilinsky, A. & O'Brien, T. E. [2005]. Model-robust and model-sensitive designs. Comput. Stat. Data Ana. 49, 201–216.

Goryachev, A., MacGregor, P. & Edwards, A. [2001]. Unfolding of microarray data. J. Comput. Biol. 8, 443–461. doi:10.1089/106652701752236232.

Goulden, C. H. [1956]. Methods of Statistical Analysis. New York: Wiley.

Greenland, S. & Morgenstern, H. [2001]. Confounding in health research. Annu. Rev. Publ. Health 22, 189–212. doi:10.1146/annurev.publhealth.22.1.189.

Guo, L., Lobenhofer, E. K., Wang, C., Shippy, R., Harris, S. C., Zhang, L., Mei, N., Chen, T., Herman, D., Goodsaid, F. M., Hurban, P., Phillips, K. L., Xu, J., Deng, X., Sun, Y. A., Tong, W., Dragan, Y. P. & Shi, L. [2006]. Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nat. Biotechnol. 24(9), 1162–1169.

Gupta, S. [2000]. Hepatic polyploidy and liver growth control. Semin. Cancer Biol. 10(3), 161–171. doi:10.1006/scbi.2000.0317.

Haller, F., Kulle, B., Schwager, S., Gunawan, B., von Heydebreck, A., Sültmann, H. & Füzesi, L. [2004]. Equivalence test in quantitative reverse transcription polymerase chain reaction: confirmation of reference genes suitable for normalization. Anal. Biochem. 335(1), 1–9. doi:10.1016/j.ab.2004.08.024.


H ARRIS , M. A., C LARK , J., I RELAND , A., L OMAX , J., A SHBURNER , M., F OULGER , R., E IL BECK , K., L EWIS , S., M ARSHALL , B., M UNGALL , C., R ICHTER , J., RUBIN , G. M., B LAKE , J. A., B ULT, C., D OLAN , M., D RABKIN , H., E PPIG , J. T., H ILL , D. P., N I , L., R INGWALD , M., BALAKRISHNAN , R., C HERRY, J. M., C HRISTIE , K. R., C OSTANZO , M. C., DWIGHT, S. S., E NGEL , S., F ISK , D. G., H IRSCHMAN , J. E., H ONG , E. L., NASH , R. S., S ETHU RAMAN , A., T HEESFELD , C. L., B OTSTEIN , D., D OLINSKI , K., F EIERBACH , B., B ERAR DINI , T., M UNDODI , S., R HEE , S. Y., A PWEILER , R., BARRELL , D., C AMON , E., D IMMER , E., L EE , V., C HISHOLM , R., G AUDET, P., K IBBE , W., K ISHORE , R., S CHWARZ , E. M., S TERNBERG , P., G WINN , M., H ANNICK , L., W ORTMAN , J., B ERRIMAN , M., W OOD , V., DE LA C RUZ , N., T ONELLATO , P., JAISWAL , P., S EIGFRIED , T., W HITE , R. & C ONSOR TIUM , G. O. [2004]. The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 32(Database issue), D258–D261. H ARRIS C OOPER , J. C. V., L ARRY V. H EDGES [1994]. The Handbook of Research Synthesis. first edition. Russell Sage Foundation Publications. ISBN 0871542269. H ARTIGAN , J. A. [1973]. Clustering. Annu. Rev. Biophys. Bioeng. 2, 81–101. doi:10.1146/ annurev.bb.02.060173.000501. H ARVILLE , D. A. [1977]. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 72, 320338. doi:10.2307/2286796. H ASTIE , T., T IBSHIRANI , R. & F RIEDMAN , J. [2001]. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag. H ATHAWAY, R. J. & B EZDEK , J. C. [2001]. Fuzzy c-means clustering of incomplete data. IEEE Trans. Syst. Man. Cybern. B Cybern 31(5), 735–744. doi:10.1109/3477.956035. H ENGL , S., K REUTZ , C., T IMMER , J. & M AIWALD , T. [2007]. Data-based identifiability analysis of non-linear dynamical models. Bioinformatics 23, 2612–2618. doi:10.1093/ bioinformatics/btm382. H ERNANDEZ , J., C ARRASCO , J., B ELLOSO , E., G IRALT, M., B LUETHMANN , H., L EE , D. K., A NDREWS , G. K. & H IDALGO , J. [2000]. Metallothionein induction by restraint stress: Role of glucocorticoids and IL-6. Cytokine 12(6), 791–796. doi:10.1006/cyto.1999.0629. H ERZENBERG , L. A., PARKS , D., S AHAF, B., P EREZ , O., ROEDERER , M. & H ERZENBERG , L. A. [2002]. The history and future of the fluorescence activated cell sorter and flow cytometry: a view from stanford. Clin. Chem. 48(10), 1819–1827. H ERZENBERG , L. A., T UNG , J., M OORE , W. A., H ERZENBERG , L. A. & PARKS , D. R. [2006]. Interpreting flow cytometry data: a guide for the perplexed. Nat. Immunol. 7(7), 681–685. doi: 10.1038/ni0706-681. H IDALGO , M. E. & AYESA , E. [2001]. Numerical and graphical description of the information matrix in calibration experiments for state-space models. Water Res. 35(13), 3206–3214.


H INDMARSH , A. C., B ROWN , P. N., G RANT, K. E., L EE , S. L., S ERBAN , R., S HUMAKER , D. E. & W OODWARD , C. S. [2005]. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM T. Math. Softw. 31, 363–396. ISSN 0098-3500. doi:http://doi.acm.org/ 10.1145/1089014.1089020. H INKLEY, D. [1979]. Predictive likelihood. The Ann. Stat. 7(4), 718–728. H LAVACEK , W. S. [2009]. How to deal with large models? Mol. Syst. Biol. 5, 240–242. H OCHBERG , Y. [1988]. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803. H OCKLEY, S. L., M ATHIJS , K., S TAAL , Y. C. M., B REWER , D., G IDDINGS , I., VAN D ELFT, J. H. M. & P HILLIPS , D. H. [2009]. Interlaboratory and interplatform comparison of microarray gene expression analysis of HepG2 cells exposed to benzo(a)pyrene. OMICS 13(2), 115–125. doi:10.1089/omi.2008.0060. H OLLANDER , M. & W OLFE , D. A. [1973]. Nonparametric Statistical Methods. New York: John Wiley & Sons. H OLM , S. [1979]. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70. H OLODNIY, M. [1999]. Quantitative PCR protocols: effects of collection, processing, and storage on RNA detection and quantification. Humana Press. H OMMEL , G. [1988]. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75, 383–386. H ONERKAMP, J. [1993]. Stochastic Dynamical Systems. New York: VCH. H ONERKAMP, J. [2002]. Statistical Physics. An Advanced Approach with Applications. SpringerVerlag, Heidelberg. H ORBELT, W. [2001]. Maximum likelihood estimation in dynamical systems. Ph.D. thesis, University of Freiburg. doi:http://www.freidok.uni-freiburg.de/volltexte/213/. H SIANG , T. & R EILLY, P. M. [1971]. A practical method for discriminating among mechanistic models. Can. J. Chem. Eng. 38, 225. H SIEH , H.-C., C HEN , Y.-T., L I , J.-M., C HOU , T.-Y., C HANG , M.-F., H UANG , S.-C., T SENG , T.-L., L IU , C.-C. & C HEN , S.-F. [2009]. Protein profilings in mouse liver regeneration after partial hepatectomy using iTRAQ technology. J. Proteome Res. 8(2), 1004–1013. doi:10.1021/ pr800696m. H UANG , D. W., S HERMAN , B. T., TAN , Q., C OLLINS , J. R., A LVORD , W. G., ROAYAEI , J., S TEPHENS , R., BASELER , M. W., L ANE , H. C. & L EMPICKI , R. A. [2007]. The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 8(9), R183. doi:10.1186/gb-2007-8-9-r183.


¨ H UBER , W., VON H EYDEBRECK , A., S ULTMANN , H., P OUSTKA , A. & V INGRON , M. [2002]. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl 1, S96–104. H UBER , W., VON H EYDEBRECK , A. & V INGRON , M. [2005]. An introduction to low-level analysis methods of DNA microarray data. Bioconductor Project Working Papers 9, 1–30. doi:http://www.ebi.ac.uk/huber/docs/microarraybasic 051017.pdf. H UNTER , W. G. & R EINER , A. M. [1965]. Designs for discriminating between two rival models. Technometrics 7, 307–323. I DEKER , T., T HORSSON , V., S IEGEL , A. F. & H OOD , L. E. [2000]. Testing for differentiallyexpressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol. 7, 805– 817. doi:10.1089/10665270050514945. I RIZARRY, R. A., B OLSTAD , B. M., C OLLIN , F., C OPE , L. M., H OBBS , B. & S PEED , T. P. [2003a]. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31(4), e15. I RIZARRY, R. A., H OBBS , B., C OLLIN , F., B EAZER -BARCLAY, Y. D., A NTONELLIS , K. J., S CHERF, U. & S PEED , T. P. [2003b]. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264. doi:.2.249. I RIZARRY, R. A., WARREN , D., S PENCER , F., K IM , I. F., B ISWAL , S., F RANK , B. C., G ABRIELSON , E., G ARCIA , J. G. N., G EOGHEGAN , J., G ERMINO , G., G RIFFIN , C., H ILMER , S. C., H OFFMAN , E., J EDLICKA , A. E., K AWASAKI , E., M ARTNEZ -M URILLO , F., M ORSBERGER , L., L EE , H., P ETERSEN , D., Q UACKENBUSH , J., S COTT, A., W ILSON , M., YANG , Y., Y E , S. Q. & Y U , W. [2005]. Multiple-laboratory comparison of microarray platforms. Nat. Methods 2(5), 345–350. doi:10.1038/nmeth756. JACOBSEN , M., R EPSILBER , D., G UTSCHMIDT, A., N EHER , A., F ELDMANN , K., M OL LENKOPF, H. J., K AUFMANN , S. H. E. & Z IEGLER , A. [2006]. Deconfounding microarray analysis - independent measurements of cell type proportions used in a regression model to resolve tissue heterogeneity bias. Methods Inf. Med. 45, 557–563. JAIN , A. & D UBES , R. [1988]. Algorithms for Clustering Data. Englewood Cliffs, NJ: PrenticeHall. J IANG , D., TANG , C. & Z HANG , A. [2004]. Cluster analysis for gene expression data: a survey. IEEE T. Knowl. Data En. 16(11), 1370–1386. doi:10.1109/TKDE.2004.68. J OHN , R. C. S. & D RAPER , N. R. [1975]. D-optimality for regression designs: a review. Technometrics 17, 15–23. J OHNSON , P. D. & B ESSELSEN , D. G. [2002]. Practical aspects of experimental design in animal research. ILAR Journal 43, 202–206. J OSHI , M., S EIDEL -M ORGENSTERN , A. & K REMLING , A. [2006]. Exploiting the bootstrap method for quantifying parameter confidence intervals in dynamical systems. Metab. Eng. 8(5), 447–455. doi:10.1016/j.ymben.2006.04.003.


J RVINEN , A.-K., H AUTANIEMI , S., E DGREN , H., AUVINEN , P., S AARELA , J., K ALLIONIEMI , O.-P. & M ONNI , O. [2004]. Are data from different gene expression microarray platforms comparable? Genomics 83(6), 1164–1168. doi:10.1016/j.ygeno.2004.01.004. K ANEHISA , M. & G OTO , S. [2000]. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30. K ARSTEN , S. L., D EERLIN , V. M. D. V., S ABATTI , C., G ILL , L. H. & G ESCHWIND , D. H. [2002]. An evaluation of tyramide signal amplification and archived fixed and frozen tissue in microarray gene expression analysis. Nucleic Acids Res. 30(2), E4. K ASS , R., C ARLIN , B., G ELMAN , A. & N EAL , R. [1998]. Markov Chain Monte Carlo in practice: a roundtable diskussion. Am. Stat. 52, 93–100. K ASS , R. & R AFTERY, A. [1994]. Bayes factors. Technical Report, Dep. of Statistics, University of Washington. doi:www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf. K ELADA , S. N., E ATON , D. L., WANG , S. S., ROTHMAN , N. R. & K HOURY, M. J. [2003]. The role of genetic polymorphisms in environmental health. Environ. Health Perspect. 111(8), 1055–1064. K ENDZIORSKI , C., I RIZARRY, R. A., C HEN , K.-S., H AAG , J. D. & G OULD , M. N. [2005]. On the utility of pooling biological samples in microarray experiments. PNAS 102, 4252–4257. K HOLODENKO , B. N. [2000]. Negative feedback and ultrasensitivity can bring about oscillations in the mitogen-activated protein kinase cascades. Eur. J. Biochem. 267(6), 1583–1588. K IEFER , J. [1959]. Optimum experimental designs. J. Roy. Stat. Soc. B 21, 272–319. K IM , R. D. & PARK , P. J. [2004]. Improving identification of differentially expressed genes in microarray studies using information from public databases. Genome Biol. 5(9), R70. doi: 10.1186/gb-2004-5-9-r70. K IRK , R. [1989]. Experimental Design: Procedures for the behavioral Science. Brooks / Cole Publishing Company. ¨ K LINGM ULLER , U., BAUER , A., B OHL , S., N ICKEL , P. J., B REITKOPF, K., D OOLEY, S., Z ELLMER , S., K ERN , C., M ERFORT, I., S PARNA , T., D ONAUER , J., WALZ , G., G EYER , ¨ M., K REUTZ , C., H ERMES , M., G OTSCHEL , F., H ECHT, A., WALTER , D., E GGER , L., N EU BERT, K., B ORNER , C., B RULPORT, M., S CHORMANN , W., S AUER , C., BAUMANN , F., P REISS , R., M AC N ELLY, S., G ODOY, P., W IERCINSKA , E., C IUCLAN , L., E DELMANN , J., Z EILINGER , K., H EINRICH , M., Z ANGER , U. M., G EBHARDT, R., M AIWALD , T., H EIN RICH , R., T IMMER , J., VON W EIZSCKER , F. & H ENGSTLER , J. G. [2006]. Primary mouse hepatocytes for Systems Biology approaches: a standardized in vitro system for modeling of signal transduction pathways. Syst. Biol. (Stevenage) 153, 433–447. KOHONEN , T. [1995]. Self-organizing maps. Berlin: Springer. KONISHI , S. & K ITAGAWA , G. [1996]. Generalized information criteria in model selection. Biometrica 83(4), 875–890.


K RAAN , J., G RATAMA , J. W., K EENEY, M. & D’H AUTCOURT, J. L. [2003]. Setting up and calibration of a flow cytometer for multicolor immunophenotyping. J. Biol. Regul. Homeost. Agents 17(3), 223–233. K REUTZ , C., R AUE , A. & T IMMER , J. [2011]. Likelihood based observability analysis and confidence intervals for predictions of dynamic models. Submitted 1107.0013. K REUTZ , C., RODRIGUEZ , M. M. B., M AIWALD , T., S EIDL , M., B LUM , H. E., M OHR , L. & T IMMER , J. [2007]. An error model for protein quantification. Bioinformatics 23(20), 2747– 2753. doi:10.1093/bioinformatics/btm397. K REUTZ , C. & T IMMER , J. [2009]. Systems biology: experimental design. FEBS J. 276(4), 923–942. doi:10.1111/j.1742-4658.2008.06843.x. K URIEN , B. T. & S COFIELD , R. H. [2006]. Western blotting. Methods 38, 283–293. doi: 10.1016/j.ymeth.2005.11.007. K UTALIK , Z., C HO , K.-H. & W OLKENHAUER , O. [2004]. Optimal sampling time selection for parameter estimation in dynamic pathway modeling. Biosystems 75, 43–55. doi:10.1016/j. biosystems.2004.03.007. L ARKIN , J. E., F RANK , B. C., G AVRAS , H., S ULTANA , R. & Q UACKENBUSH , J. [2005]. Independence and reproducibility across microarray platforms. Nat. Methods 2(5), 337–344. doi:10.1038/nmeth757. L ASSMANN , S., K REUTZ , C., S CHOEPFLIN , A., H OPT, U., T IMMER , J. & W ERNER , M. [2009]. A novel approach for reliable microarray analysis of microdissected tumor cells from formalinfixed and paraffin-embedded colorectal cancer resection specimens. J. Mol. Med. 87(2), 211– 224. doi:10.1007/s00109-008-0419-y. L EE , S., J O , M., L EE , J., KOH , S. S. & K IM , S. [2007]. Identification of novel universal housekeeping genes by statistical analysis of microarray data. J. Biochem. Mol. Biol. 40(2), 226–231. L EIS , J. & K RAMER , M. [1988]. The simultaneous solution and sensitivity analysis of systems described by ordinary differential equations. ACM T. Math. Software 14(1), 45–60. L EISCH , F. [2004]. FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11(8), 1–18. L I , Z., DAI , J., Z HENG , H., L IU , B. & C AUDILL , M. [2002]. An integrated view of the roles and mechanisms of heat shock protein gp96-peptide complex in eliciting immune response. Front. Biosci. 7, d731–d751. L INDENMEYER , M. T., K ERN , C., S PARNA , T., D ONAUER , J., W ILPERT, J., S CHWAGER , J., P ORATH , D., K REUTZ , C., T IMMER , J. & M ERFORT, I. [2007]. Microarray analysis reveals influence of the sesquiterpene lactone parthenolide on gene transcription profiles in human epithelial cells. Life Sci. 80(17), 1608–1618. doi:10.1016/j.lfs.2007.01.036.


Linton, K. M., Hey, Y., Saunders, E., Jeziorska, M., Denton, J., Wilson, C. L., Swindell, R., Dibben, S., Miller, C. J., Pepper, S. D., Radford, J. A. & Freemont, A. J. [2008]. Acquisition of biologically relevant gene expression data by Affymetrix microarray analysis of archival formalin-fixed paraffin-embedded tumours. Br. J. Cancer 98(8), 1403–1414. doi:10.1038/sj.bjc.6604316.
Liu, Q. & Nilsen-Hamilton, M. [1995]. Identification of a new acute phase protein. J. Biol. Chem. 270(38), 22565–22570.
Ljung, L. & Glad, T. [1994]. On global identifiability for arbitrary model parameterizations. Automatica 30, 265–276.
Lo, K., Brinkman, R. R. & Gottardo, R. [2008]. Automated gating of flow cytometry data via robust model-based clustering. Cytometry A 73(4), 321–332. doi:10.1002/cyto.a.20531.
Loudig, O., Milova, E., Brandwein-Gensler, M., Massimi, A., Belbin, T. J., Childs, G., Singer, R. H., Rohan, T. & Prystowsky, M. B. [2007]. Molecular restoration of archived transcriptional profiles by complementary-template reverse-transcription (CT-RT). Nucleic Acids Res. 35(15), e94. doi:10.1093/nar/gkm510.
Love, B., Rocke, D., Penn, S., Jenkins, D. & Thomas, R. [2002]. A conditional density error model for the statistical analysis of microarray data. Bioinformatics 18, 1064–1072.
Mah, N., Thelin, A., Lu, T., Nikolaus, S., Kühbacher, T., Gurbuz, Y., Eickhoff, H., Klöppel, G., Lehrach, H., Mellgard, B., Costello, C. M. & Schreiber, S. [2004]. A comparison of oligonucleotide and cDNA-based microarray systems. Physiol. Genomics 16(3), 361–370.
Maiwald, T., Kreutz, C., Pfeifer, A. C., Bohl, S., Klingmüller, U. & Timmer, J. [2007]. Feasibility analysis and optimal experimental design. Ann. N.Y. Acad. Sci. 1115, 212–220.
Malo, N., Hanley, J. A., Cerquozzi, S., Pelletier, J. & Nadon, R. [2006]. Statistical practice in high-throughput screening data analysis. Nat. Biotechnol. 24, 167–175. doi:10.1038/nbt1186.
Maouche, S., Poirier, O., Godefroy, T., Olaso, R., Gut, I., Collet, J.-P., Montalescot, G. & Cambien, F. [2008]. Performance comparison of two microarray platforms to assess differential gene expression in human monocyte and macrophage cells. BMC Genomics 9, 302. doi:10.1186/1471-2164-9-302.
Margeli, A. P., Theocharis, S. E., Yannacou, N. N., Spiliopoulou, C. & Koutselinis, A. [1994]. Metallothionein expression during liver regeneration after partial hepatectomy in cadmium-pretreated rats. Arch. Toxicol. 68(10), 637–642.
Marimont, R. B. & Shapiro, M. B. [1979]. Nearest neighbour searches and the curse of dimensionality. IMA J. Appl. Math. 24(1), 59–70. doi:10.1093/imamat/24.1.59.
Markovitz, B. P. [2005]. The principle of multicollinearity. Pediatr. Crit. Care Med. 6, 94–95.


McClain, D. A. [1991]. Different ligand affinities of the two human insulin receptor splice variants are reflected in parallel changes in sensitivity for insulin action. Mol. Endocrinol. 5(5), 734–739.
McClintick, J. N. & Edenberg, H. J. [2006]. Effects of filtering by present call on analysis of microarray experiments. BMC Bioinformatics 7, 49. doi:10.1186/1471-2105-7-49.
Mead, R. [1988]. The Design of Experiments: Statistical Principles for Practical Applications. Cambridge University Press.
Meeker, W. & Escobar, L. [1995]. Teaching about approximate confidence regions based on maximum likelihood estimation. Am. Stat. 49(1), 48–53.
Melke, P., Jönsson, H., Pardali, E., ten Dijke, P. & Peterson, C. [2006]. A rate equation approach to elucidate the kinetics and robustness of the TGF-beta pathway. Biophys. J. 91(12), 4368–4380. doi:10.1529/biophysj.105.080408.
Mendes, P. & Kell, D. [1998]. Non-linear optimization of biochemical pathways: application to metabolic engineering and parameter estimation. Bioinformatics 14, 869–883.
Miller, C. J. & Hubbell, E. [2007]. Guide to Probe Logarithmic Intensity Error (PLIER) estimation. Technical Report, Affymetrix, Inc. URL www.affymetrix.com.
Moles, C. G., Mendes, P. & Banga, J. R. [2003]. Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res. 13(11), 2467–2474. doi:10.1101/gr.1262503.
Molinaro, A. M., Simon, R. & Pfeiffer, R. M. [2005]. Prediction error estimation: a comparison of resampling methods. Bioinformatics 21(15), 3301–3307. doi:10.1093/bioinformatics/bti499.
Montgomery, D. C. [1991]. Design and Analysis of Experiments. Third edition. New York: John Wiley & Sons.
Mukherjee, S., Feigelson, E. D., Babu, G. J., Murtagh, F., Fraley, C. & Raftery, A. [1998]. Three types of gamma-ray bursts. Astrophys. J. 508, 314–327.
Murphy, S. A. & van der Vaart, A. W. [1998]. On profile likelihood. J. Amer. Statist. Assoc. 95, 449–485.
Neyman, J. & Pearson, E. [1936]. Contributions to the theory of testing statistical hypotheses. Statist. Res. Mem. 1(1), 1–37.
Neyman, J. & Pearson, E. [1938]. Contributions to the theory of testing statistical hypotheses. Statist. Res. Mem. 1(2), 25–57.
Och, F. J. & Ney, H. [2003]. A systematic comparison of various statistical alignment models. Comput. Linguist. 29, 19–51.


Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. [1999]. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27(1), 29–34.
Marjoram, P., Molitor, J., Plagnol, V. & Tavaré, S. [2003]. Markov chain Monte Carlo without likelihoods. PNAS 100(26), 15324–15328.
Pahlavan, P. S., Feldmann, R. E., Zavos, C. & Kountouras, J. [2006]. Prometheus' challenge: molecular, cellular and systemic aspects of liver regeneration. J. Surg. Res. 134(2), 238–251. doi:10.1016/j.jss.2005.12.011.
Patterson, T. A., Lobenhofer, E. K., Fulmer-Smentek, S. B., Collins, P. J., Chu, T.-M., Bao, W., Fang, H., Kawasaki, E. S., Hager, J., Tikhonova, I. R., Walker, S. J., Zhang, L., Hurban, P., de Longueville, F., Fuscoe, J. C., Tong, W., Shi, L. & Wolfinger, R. D. [2006]. Performance comparison of one-color and two-color platforms within the microarray quality control (MAQC) project. Nat. Biotechnol. 24(9), 1140–1150. doi:10.1038/nbt1242.
Pavelka, N., Pelizzola, M., Vizzardelli, C., Capozzoli, M., Splendiani, A., Granucci, F. & Ricciardi-Castagnoli, P. [2004]. A power law global error model for the identification of differentially expressed genes in microarray data. BMC Bioinformatics 5, 203. doi:10.1186/1471-2105-5-203.
Pfaffl, M. W. [2001]. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 29, e45.
Pfeifer, D., Pantic, M., Skatulla, I., Rawluk, J., Kreutz, C., Martens, U. M., Fisch, P., Timmer, J. & Veelken, H. [2007]. Genome-wide analysis of DNA copy number changes and LOH in CLL using high-density SNP arrays. Blood 109(3), 1202–1210. doi:10.1182/blood-2006-07-034256.
Picard, D. [2002]. Heat-shock protein 90, a chaperone for folding and regulation. Cell Mol. Life Sci. 59, 1640–1648.
Pickering, R. M. & Forbes, J. F. [1984]. A classification of Scottish infants using latent class analysis. Stat. Med. 3(3), 249–259.
Pinheiro, J. C. & Bates, D. M. [2000]. Mixed-Effects Models in S and S-PLUS. Statistics and Computing. New York: Springer.
Pounds, S. & Cheng, C. [2005]. Statistical development and evaluation of microarray gene expression data filters. J. Comput. Biol. 12, 482–495. doi:10.1089/cmb.2005.12.482.
Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. [1992]. Numerical Recipes. Cambridge: Cambridge University Press.
Quackenbush, J. [2002]. Microarray data normalization and transformation. Nat. Genet. 32 Suppl, 496–501. doi:10.1038/ng1032.
Rahnenführer, J. [2005]. Clustering algorithms and other exploratory methods for microarray data analysis. Method. Inform. Med. 44(3), 444–448.


Rand, W. [1971]. Objective criteria for the evaluation of clustering methods. J. Amer. Stat. Assoc. 66, 846–850.
Raue, A., Becker, V., Klingmüller, U. & Timmer, J. [2010]. Identifiability and observability analysis for experimental design in non-linear dynamical models. Chaos 20(4), 045105.
Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U. & Timmer, J. [2009]. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25, 1923–1929. doi:10.1093/bioinformatics/btp358.
Raue, A., Kreutz, C., Maiwald, T., Klingmüller, U. & Timmer, J. [2011]. Addressing parameter identifiability by model-based experimentation. IET Syst. Biol. 5(2), 120–130. doi:10.1049/iet-syb.2010.0061.
Ravo, M., Mutarelli, M., Ferraro, L., Grober, O. M. V., Paris, O., Tarallo, R., Vigilante, A., Cimino, D., De Bortoli, M., Nola, E., Cicatiello, L. & Weisz, A. [2008]. Quantitative expression profiling of highly degraded RNA from formalin-fixed, paraffin-embedded breast tumor biopsies by oligonucleotide microarrays. Lab. Invest. 88(4), 430–440. doi:10.1038/labinvest.2008.11.
Reilly, P. M. [1970]. Statistical methods in model discrimination. Can. J. Chem. Eng. 48, 168–173.
Rissanen, J. [1983]. A universal prior for integers and estimation by minimum description length. Ann. Stat. 11, 416–431.
Smith, R. L., Tebaldi, C., Nychka, D. & Mearns, L. O. [2009]. Bayesian modeling of uncertainty in ensembles of climate models. J. Am. Stat. Assoc. 104(485), 97–116.
Rocke, D. M. & Durbin, B. [2003]. Approximate variance-stabilizing transformations for gene-expression microarray data. Bioinformatics 19, 966–972.
Rocke, D. M. & Lorenzato, S. [1995]. A two-component model for measurement error in analytical chemistry. Technometrics 37, 176–185.
Rodriguez-Fernandez, M., Mendes, P. & Banga, J. R. [2006]. A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83, 248–265. doi:10.1016/j.biosystems.2005.06.016.
Rojas, C. R., Welsh, J. S., Goodwin, G. C. & Feuer, A. [2007]. Robust optimal experiment design for system identification. Automatica 43, 993–1008.
Rumberger, B., Kreutz, C., Nickel, C., Klein, M., Lagoutte, S., Teschner, S., Timmer, J., Gerke, P., Walz, G. & Donauer, J. [2009]. Combination of immunosuppressive drugs leaves specific fingerprint on gene expression in vitro. Immunopharm. Immunot., 1–10. doi:10.1080/08923970802626268.


Rumberger, B., Vonend, O., Kreutz, C., Wilpert, J., Donauer, J., Amann, K., Rohrbach, R., Timmer, J., Walz, G. & Gerke, P. [2007]. cDNA microarray analysis of adaptive changes after renal ablation in a sclerosis-resistant mouse strain. Kidney Blood Press. Res. 30, 377–387. doi:10.1159/000108624.
Sachs, L. [1984]. Applied Statistics. New York: Springer.
Sacks, J. & Ylvisaker, D. [1984]. Some model robust designs in regression. Ann. Stat. 12(4), 1324–1348.
Sakamoto, Y., Ishiguro, M. & Kitagawa, G. [1986]. Akaike Information Criterion Statistics. D. Reidel Publishing Company.
Salzberg, S. L. [1997]. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Disc. 1, 317–327.
Kotz, S. & Johnson, N. L. (Editors) [1985]. Encyclopedia of Statistical Sciences. First edition. Wiley-Interscience.
Sanger, T. [1989]. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459–473.
Schieren, G., Rumberger, B., Klein, M., Kreutz, C., Wilpert, J., Geyer, M., Faller, D., Timmer, J., Quack, I., Rump, L. C., Walz, G. & Donauer, J. [2006]. Gene profiling of polycystic kidneys. Nephrol. Dial. Transplant. 21(7), 1816–1824. doi:10.1093/ndt/gfl071.
Schilling, M., Maiwald, T., Bohl, S., Kollmann, M., Kreutz, C., Timmer, J. & Klingmüller, U. [2005a]. Computational processing and error reduction strategies for standardized quantitative data in biological networks. FEBS Journal 272, 6400–6411. doi:10.1111/j.1742-4658.2005.05037.x.
Schilling, M., Maiwald, T., Bohl, S., Kollmann, M., Kreutz, C., Timmer, J. & Klingmüller, U. [2005b]. Quantitative data generation for Systems Biology: the impact of randomization, calibrators and normalizers. IEE Proc. Sys. Biol. 152, 193–200.
Schilling, M., Maiwald, T., Hengl, S., Winter, D., Kreutz, C., Kolch, W., Lehmann, W. D., Timmer, J. & Klingmüller, U. [2009]. Theoretical and experimental analysis links isoform-specific ERK signalling to cell fate decisions. Mol. Syst. Biol. 5, 334. doi:10.1038/msb.2009.91.
Schlingemann, J., Habtemichael, N., Ittrich, C., Toedt, G., Kramer, H., Hambek, M., Knecht, R., Lichter, P., Stauber, R. & Hahn, M. [2005]. Patient-based cross-platform comparison of oligonucleotide microarray expression profiles. Lab. Invest. 85(8), 1024–1039. doi:10.1038/labinvest.3700293.
Schmittgen, T. D. & Zakrajsek, B. A. [2000]. Effect of experimental treatment on housekeeping gene expression: validation by real-time, quantitative RT-PCR. J. Biochem. Biophys. Methods 46(1-2), 69–81.


Schröder, A., Müller, O., Stocker, S., Salowsky, R., Leiber, M., Gassmann, M., Lightfoot, S., Menzel, W., Granzow, M. & Ragg, T. [2006]. The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol. Biol. 7, 3. doi:10.1186/1471-2199-7-3.
Schwarz, G. [1978]. Estimating the dimension of a model. Ann. Stat. 6, 461–464.
Scicchitano, M. S., Dalmas, D. A., Bertiaux, M. A., Anderson, S. M., Turner, L. R., Thomas, R. A., Mirable, R. & Boyce, R. W. [2006]. Preliminary comparison of quantity, quality, and microarray performance of RNA extracted from formalin-fixed, paraffin-embedded, and unfixed frozen tissue samples. J. Histochem. Cytochem. 54(11), 1229–1237. doi:10.1369/jhc.6A6999.2006.
Scott, D. W. & Wand, M. P. [1991]. Feasibility of multivariate density estimates. Biometrika 78(1), 197–205.
Seber, G. & Wild, C. [1989]. Nonlinear Regression. New York: Wiley.
Self, S. G. & Liang, K. Y. [1987]. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Ass. 82, 605–610.
Selvey, S., Thompson, E. W., Matthaei, K., Lea, R. A., Irving, M. G. & Griffiths, L. R. [2001]. Beta-actin: an unsuitable internal control for RT-PCR. Mol. Cell. Probes 15(5), 307–311. doi:10.1006/mcpr.2001.0376.
Severgnini, M., Bicciato, S., Mangano, E., Scarlatti, F., Mezzelani, A., Mattioli, M., Ghidoni, R., Peano, C., Bonnal, R., Viti, F., Milanesi, L., Bellis, G. D. & Battaglia, C. [2006]. Strategies for comparing gene expression profiles from different microarray platforms: application to a case-control experiment. Anal. Biochem. 353(1), 43–56. doi:10.1016/j.ab.2006.03.023.
Shapiro, H. M. [1994]. Practical Flow Cytometry. Third edition. New York: Wiley-Liss.
Shi, L., Reid, L. H., Jones, W. D., Shippy, R., Warrington, J. A., Baker, S. C., Collins, P. J., de Longueville, F., Kawasaki, E. S., Lee, K. Y., Luo, Y., Sun, Y. A., Willey, J. C., Setterquist, R. A., Fischer, G. M., Tong, W., Dragan, Y. P., Dix, D. J., Frueh, F. W., Goodsaid, F. M., Herman, D., Jensen, R. V., Johnson, C. D., Lobenhofer, E. K., Puri, R. K., Scherf, U., Thierry-Mieg, J., Wang, C., Wilson, M., Wolber, P. K., Zhang, L., Amur, S., Bao, W., Barbacioru, C. C., Lucas, A. B., Bertholet, V., Boysen, C., Bromley, B., Brown, D., Brunner, A., Canales, R., Cao, X. M., Cebula, T. A., Chen, J. J., Cheng, J., Chu, T.-M., Chudin, E., Corson, J., Corton, J. C., Croner, L. J., Davies, C., Davison, T. S., Delenstarr, G., Deng, X., Dorris, D., Eklund, A. C., Hui Fan, X., Fang, H., Fulmer-Smentek, S., Fuscoe, J. C., Gallagher, K., Ge, W., Guo, L., Guo, X., Hager, J., Haje, P. K., Han, J., Han, T., Harbottle, H. C., Harris, S. C., Hatchwell, E., Hauser, C. A., Hester, S., Hong, H., Hurban, P., Jackson, S. A., Ji, H., Knight, C. R., Kuo, W. P., LeClerc, J. E., Levy, S., Li, Q.-Z., Liu, C., Liu, Y., Lombardi, M. J., Ma, Y., Magnuson, S. R., Maqsodi, B., McDaniel, T., Mei, N., Myklebost, O., Ning, B., Novoradovskaya, N., Orr, M. S., Osborn, T. W., Papallo, A., Patterson, T. A., Perkins, R. G., Peters, E. H., Peterson, R., Philips, K. L., Pine, P. S., Pusztai, L., Qian, F., Ren, H., Rosen, M., Rosenzweig, B. A., Samaha, R. R., Schena, M., Schroth, G. P., Shchegrova, S., Smith, D. D., Staedtler, F., Su, Z., Sun, H., Szallasi, Z., Tezak, Z., Thierry-Mieg, D., Thompson, K. L., Tikhonova, I., Turpaz, Y., Vallanat, B., Van, C., Walker, S. J., Wang, S. J., Wang, Y., Wolfinger, R., Wong, A., Wu, J., Xiao, C., Xie, Q., Xu, J., Yang, W., Zhang, L., Zhong, S., Zong, Y. & Slikker, W. [2006]. The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24(9), 1151–1161. doi:10.1038/nbt1239.
Shippy, R., Fulmer-Smentek, S., Jensen, R. V., Jones, W. D., Wolber, P. K., Johnson, C. D., Pine, P. S., Boysen, C., Guo, X., Chudin, E., Sun, Y. A., Willey, J. C., Thierry-Mieg, J., Thierry-Mieg, D., Setterquist, R. A., Wilson, M., Lucas, A. B., Novoradovskaya, N., Papallo, A., Turpaz, Y., Baker, S. C., Warrington, J. A., Shi, L. & Herman, D. [2006]. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat. Biotechnol. 24(9), 1123–1131. doi:10.1038/nbt1241.
Silvey, S. D. [1970]. Statistical Inference. Harmondsworth, Middlesex, England: Penguin Books Ltd.
Simon, R., Radmacher, M. D., Dobbin, K. & McShane, L. M. [2003]. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 95, 14–18.
Smyth, G. K. [2004]. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article 3. doi:10.2202/1544-6115.1027.
Speed, T. [2003]. Statistical Analysis of Gene Expression Microarray Data. Boca Raton: Chapman & Hall/CRC.
Srivastava, P. K., Küffer, S., Brors, B., Shahi, P., Li, L., Kenzelmann, M., Gretz, N. & Gröne, H.-J. [2008]. A cut-off based approach for gene expression analysis of formalin-fixed and paraffin-embedded tissue samples. Genomics 91(6), 522–529. doi:10.1016/j.ygeno.2008.03.003.
Stafford, P. & Brun, M. [2007]. Three methods for optimization of cross-laboratory and cross-platform microarray expression data. Nucleic Acids Res. 35(10), e72. doi:10.1093/nar/gkl1133.
Stewart, W. E., Henson, T. L. & Box, G. E. P. [1996]. Model discrimination and criticism with single-response data. AIChE Journal 42, 3055–3062.
Stewart, W. E., Shon, Y. & Box, G. E. P. [1998]. Discrimination and goodness of fit of multiresponse mechanistic models. AIChE Journal 44, 1404–1412.


Storey, J. D. & Tibshirani, R. [2003]. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. 100(16), 9440–9445. doi:10.1073/pnas.1530509100.
Strauss, E. [2006]. Arrays of hope. Cell 127(4), 657–659.
Studden, W. J. [1980]. Ds-optimal designs for polynomial regression using continued fractions. Ann. Stat. 8, 1132–1141.
Su, L.-J., Chang, C.-W., Wu, Y.-C., Chen, K.-C., Lin, C.-J., Liang, S.-C., Lin, C.-H., Whang-Peng, J., Hsu, S.-L., Chen, C.-H. & Huang, C.-Y. F. [2007]. Selection of DDX5 as a novel internal control for q-RT-PCR from microarray data using a block bootstrap re-sampling scheme. BMC Genomics 8, 140. doi:10.1186/1471-2164-8-140.
Sunil, V. R., Patel, K. J., Nilsen-Hamilton, M., Heck, D. E., Laskin, J. D. & Laskin, D. L. [2007]. Acute endotoxemia is associated with upregulation of lipocalin 24p3/lcn2 in lung and liver. Exp. Mol. Pathol. 83(2), 177–187. doi:10.1016/j.yexmp.2007.03.004.
Suzuki, T., Higgins, P. J. & Crawford, D. R. [2000]. Control selection for RNA quantitation. Biotechniques 29, 332–337.
Swameye, I., Müller, T., Timmer, J., Sandra, O. & Klingmüller, U. [2003]. Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by data-based modeling. Proc. Natl. Acad. Sci. 100(3), 1028–1033.
Szabo, A., Perou, C. M., Karaca, M., Perreard, L., Palais, R., Quackenbush, J. F. & Bernard, P. S. [2004]. Statistical modeling for selecting housekeeper genes. Genome Biol. 5(8), R59. doi:10.1186/gb-2004-5-8-r59.
Tarantola, A. [2005]. Inverse Problem Theory and Methods for Model Parameter Estimation. Society for Industrial and Applied Mathematics.
Tatsumi, K., Ohashi, K., Taminishi, S., Okano, T., Yoshioka, A. & Shima, M. [2008]. Reference gene selection for real-time RT-PCR in regenerating mouse livers. Biochem. Biophys. Res. Commun. 374(1), 106–110. doi:10.1016/j.bbrc.2008.06.103.
The Gene Ontology Consortium [2001]. Creating the Gene Ontology resource: design and implementation. Genome Res. 11(8), 1425–1433.
Therneau, T. M. & Ballman, K. V. [2008]. What does PLIER really do? Cancer Inform. 6, 423–431.
Cooley, T. F., Chib, S. & Parke, W. R. [1989]. Predictive efficiency for simple non-linear models. Journal of Econometrics 40(1), 33–44.
Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D. & Brown, P. [1999]. Clustering methods for the analysis of DNA microarray data. Technical Report, Department of Health Research and Policy, Stanford University. URL citeseer.ist.psu.edu/tibshirani99clustering.html.


Timmer, J., Müller, T. & Melzer, W. [1998]. Numerical methods to determine calcium release flux from calcium transients in muscle cells. Biophys. J. 74, 1694–1707.
Timmer, J., Müller, T., Sandra, O., Swameye, I. & Klingmüller, U. [2004]. Modeling the non-linear dynamics of cellular signal transduction. Int. J. Bif. Chaos 14, 2069–2079.
Titterington, D. M. [1975]. Optimal design: some geometrical aspects of D-optimality. Biometrika 62(2), 313–320.
Tsai, C.-A., Lee, T.-C., Ho, I.-C., Yang, U.-C., Chen, C.-H. & Chen, J. J. [2005]. Multi-class clustering and prediction in the analysis of microarray data. Math. Biosci. 193, 79–100. doi:10.1016/j.mbs.2004.07.002.
Udvardi, M. K., Czechowski, T. & Scheible, W.-R. [2008]. Eleven golden rules of quantitative RT-PCR. Plant Cell 20(7), 1736–1737. doi:10.1105/tpc.108.061143.
Vandesompele, J., De Preter, K., Pattyn, F., Poppe, B., Van Roy, N., De Paepe, A. & Speleman, F. [2002]. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3(7).
Venzon, D. J. & Moolgavkar, S. H. [1988]. A method for computing profile-likelihood-based confidence intervals. Appl. Statist. 37(1), 87–94.
Verheijen, P. J. [2003]. Model selection: an overview of practices in chemical engineering. Computer-Aided Chemical Engineering 16, 85–104.
Webb, M. [1987]. Metallothionein in regeneration, reproduction and development. Experientia Suppl. 52, 483–498.
Weng, L., Dai, H., Zhan, Y., He, Y., Stepaniants, S. B. & Bassett, D. E. [2006]. Rosetta error model for gene expression analysis. Bioinformatics 22, 1111–1121. doi:10.1093/bioinformatics/btl045.
Wolkenhauer, O. [2006]. Systems Biology - Dynamic pathway modelling.
Woo, Y., Affourtit, J., Daigle, S., Viale, A., Johnson, K., Naggert, J. & Churchill, G. [2004]. A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. J. Biomol. Tech. 15(4), 276–284.
Wright, S. P. [1992]. Adjusted p-values for simultaneous inference. Biometrics 48, 1005–1013.
Wu, C. F. J. [1983]. On the convergence properties of the EM algorithm. Ann. Stat. 11(1), 95–103. doi:10.2307/2240463.
Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F. & Spencer, F. [2004]. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 99, 909–917.
Xu, L. & Jordan, M. I. [1995]. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput. 8, 129–151.


Yang, Y., Dudoit, S., Luu, P., Lin, D., Peng, V., Ngai, J. & Speed, T. [2002]. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15. doi:10.1093/nar/30.4.e15.
Yue, R.-X. & Hickernell, F. J. [1999]. Robust designs for fitting linear models with misspecification. Stat. Sin. 9, 1053–1069.
Yuen, T., Wurmbach, E., Pfeffer, R. L., Ebersole, B. J. & Sealfon, S. C. [2002]. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res. 30(10), e48.
Zellmer, S., Schmidt-Heck, W., Godoy, P., Weng, H., Meyer, C., Lehmann, T., Sparna, T., Schormann, W., Hammad, S., Kreutz, C., Timmer, J., von Weizsäcker, F., Thürmann, P. A., Merfort, I., Guthke, R., Dooley, S., Hengstler, J. G. & Gebhardt, R. [2010]. Transcription factors ETF, E2F, and SP-1 are involved in cytokine-independent proliferation of murine hepatocytes. Hepatology 52(6), 2127–2136. doi:10.1002/hep.23930.
