Systematic characterization of protein complexes by MS

Systematic Characterization of Mammalian Protein Complexes by Shotgun Liquid Chromatography Tandem Mass Spectrometry

DISSERTATION for the award of the academic degree Doctor rerum naturalium (Dr. rer. nat.)

submitted to the Faculty of Mathematics and Natural Sciences of the Technische Universität Dresden by Magno Rodrigues Junqueira, BSc (Hons), MSc, born on 23 October 1980 in Volta Grande, Brazil

Reviewers:

Prof. Dr. Michael Göttfert
Prof. Dr. Elly Tanaka

Submitted on: 30 September 2009


Part of this work was published in:

Chapter 3.4: Junqueira, M., Spirin, V., Santana Balbuena, T., Waridel, P., Surendranath, V., Kryukov, G., Adzhubei, I., Thomas, H., Sunyaev, S., and Shevchenko, A. (2008b). Separating the wheat from the chaff: unbiased filtering of background tandem mass spectra improves protein identification. Journal of Proteome Research 7, 3382-3395.

Chapter 3.5: Junqueira, M., Spirin, V., Balbuena, T.S., Thomas, H., Adzhubei, I., Sunyaev, S., and Shevchenko, A. (2008a). Protein identification pipeline for the homology-driven proteomics. Journal of Proteomics 71, 346-356.

Charneau, S.*, Junqueira, M.*, Costa, C., Pires, D., Fernandes, E., Bussacos, A., Sousa, M., Ricart, C., Shevchenko, A., and Teixeira, A. (2007). The saliva proteome of the blood-feeding insect Triatoma infestans is rich in platelet-aggregation inhibitors. International Journal of Mass Spectrometry 268, 265-276. (* Equal contribution)

Shevchenko, A., Valcu, C.M., and Junqueira, M. (2009). Tools for exploring the proteomosphere. Journal of Proteomics 72, 137-144.

The method for protein complex analysis described in Chapters 3.1, 3.2 and 3.3 was applied in collaboration projects with the groups of Prof. Dr. Antony Hyman (MPI-CBG), Dr. Frank Buchholz (MPI-CBG) and Dr. Wolfgang Zachariae (MPI-CBG) and published in the following papers:

Theis, M., Slabicki, M., Junqueira, M., Paszkowski-Rogacz, M., Sontheimer, J., Kittler, R., Heninger, A.K., Glatter, T., Kruusmaa, K., Poser, I., et al. (2009). Comparative profiling identifies C13orf3 as a component of the Ska complex required for mammalian cell division. EMBO Journal 28, 1453-1465.


Ding, L., Paszkowski-Rogacz, M., Nitzsche, A., Slabicki, M.M., Heninger, A.K., de Vries, I., Kittler, R., Junqueira, M., Shevchenko, A., Schulz, H., et al. (2009). A genome-scale RNAi screen for Oct4 modulators defines a role of the Paf1 complex for embryonic stem cell identity. Cell Stem Cell 4, 403-415.

Matos, J., Lipp, J.J., Bogdanova, A., Guillot, S., Okaz, E., Junqueira, M., Shevchenko, A., and Zachariae, W. (2008). Dbf4-dependent CDC7 kinase links DNA replication to the segregation of homologous chromosomes in meiosis I. Cell 135, 662-678.

Maffini, S., Maia, A.R., Manning, A.L., Maliga, Z., Pereira, A.L., Junqueira, M., Shevchenko, A., Hyman, A., Yates, J.R., 3rd, Galjart, N., Compton, D.A., and Maiato, H. (2009). CENPE targets CLASP1 to kinetochores to regulate the dynamics of attached microtubules. Current Biology, 2 September 2009 (published ahead of print).

Slabicki, M., Theis, M., Krastev, D.B., Teyra, J., Mundwiller, E., Junqueira, M., Paszkowski-Rogacz, M., Samsonov, S., Heninger, A.K., Poser, I., Prieur, F., Truchetto, J., Durr, A., Laurent, B., Brice, A., Shevchenko, A., Pisabarro, M.T., Stevanin, G., and Buchholz, F. A genome-scale DNA repair RNAi screen identifies SPG47 a novel gene associated with hereditary spastic paraplegia. Submitted to Nature Genetics.

The pipeline for homology-driven search described in Chapters 3.4 and 3.5 was applied in collaboration projects with Dr. Cristina-Maria Valcu (Technical University Munich) and Dr. Tiago Santana (University of São Paulo, São Paulo, Brazil) and published in the following papers:

Valcu, C.M., Junqueira, M., Shevchenko, A., and Schlink, K. (2009). Comparative proteomic analysis of responses to pathogen infection and wounding in Fagus sylvatica. Journal of Proteome Research 8(8), 4077-4091.

Balbuena, T.S., Silveira, V., Junqueira, M., Dias, L.L., Santa-Catarina, C., Shevchenko, A., and Floh, E.I. (2009). Changes in the 2-DE protein profile during zygotic embryogenesis in the Brazilian Pine (Araucaria angustifolia). Journal of Proteomics 72, 337-352.

Editorial Note

Figure 22 and Figure 23 were obtained in collaboration with Dr. Charles Bradshaw (MPI-CBG). Figure 27 was acquired by Dr. Toyoda Yusuke (MPI-CBG) in a collaboration project. Figure 28 was acquired by Dr. Mirko Theis (MPI-CBG) in a collaboration project. Statistical modeling of the EagleEye software presented in Chapter 3.4.1 was performed in collaboration with Dr. Shamil Sunyaev and Dr. Victor Spirin (Harvard Medical School). Protein samples from the bug Triatoma infestans and the Brazilian pine Araucaria angustifolia (Chapter 3.4.2) were obtained from collaboration projects with Dr. Sebastien Charneau (Laboratory of Biochemistry and Protein Chemistry, University of Brasilia) and Dr. Tiago Santana Balbuena (Plant Cell Biology Laboratory, University of Sao Paulo), respectively. The new web interface of the MS BLAST server (described in Chapter 3.5) was developed in collaboration with Dr. Shamil Sunyaev and Dr. Ivan Adzhubei from Harvard Medical School.


Abbreviations

AP-MS: Affinity Purification and Mass Spectrometry
APC: Anaphase Promoting Complex
BAC: Bacterial artificial chromosome
BLAST: Basic Local Alignment Search Tool
BSA: Bovine serum albumin
C-terminus: Carboxyl terminus
CFP: Cyan Fluorescent Protein
CID: Collision-Induced Dissociation
DNA: Deoxyribonucleic acid
eGFP: enhanced Green Fluorescent Protein
EGTA: Ethylene glycol-bis(2-aminoethylether)-N,N,N′,N′-tetraacetic acid
ER: Endoplasmic Reticulum
ESI: Electrospray Ionization
EST: Expressed Sequence Tag
FDR: False Discovery Rate
FRET: Fluorescence Resonance Energy Transfer
FT ICR: Fourier Transform Ion Cyclotron Resonance
geLC-MS/MS: in-gel digestion of SDS-PAGE separated proteins followed by LC-MS/MS
GFP: Green Fluorescent Protein
GS: Streptavidin-binding peptide
GST: Glutathione S-transferase
HILIC: Hydrophilic Interaction Chromatography
His-tag: Polyhistidine-tag
HPLC: High performance liquid chromatography
HSP: High Scoring Pair
i.d.: internal diameter
iCAT: Isotope-coded affinity tags
IgG: Immunoglobulin G
IP: Immunopurification
kDa: kiloDalton
LAP-tag: Localization and Affinity Purification tag
LC-MS/MS: Liquid Chromatography Tandem Mass Spectrometry
m/z: mass to charge ratio
MALDI-TOF/TOF: MALDI-time-of-flight/time-of-flight
MALDI: Matrix-Assisted Laser Desorption/Ionization
MGF: MASCOT Generic Format
MS BLAST: Mass Spectrometry driven BLAST
MS/MS: Tandem Mass Spectrometry
MS: Mass Spectrometry
MudPIT: MultiDimensional Protein Identification Technique
MW: Molecular Weight
N-terminus: Amino terminus
nano-HPLC: nano flow high performance liquid chromatography
nanoES: Nanoelectrospray
NCBI: National Center for Biotechnology Information
PMF: Peptide Mass Fingerprint
ppb: parts-per-billion
PPI: Protein-Protein Interaction
PPIN: Protein-Protein Interaction Networks
ppm: parts-per-million
psi: pound per square inch
PTM: Post Translational Modification
RL: Renilla Luciferase
RNA: Ribonucleic Acid
RNAi: RNA interference
RP: Reverse Phase
SDS-PAGE: Sodium Dodecylsulfate Polyacrylamide Gel Electrophoresis
SEAM: Sequential rounds of Epitope tagging, Affinity isolation, and Mass spectrometry
SILAC: Stable Isotope Labeling with Amino acids in Cell culture
SPE: Solid Phase Extraction
TAP: Tandem Affinity Purification
TEV: Tobacco Etch Virus
TFA: Trifluoroacetic Acid
TOF: Time-Of-Flight
Tris: Tris(hydroxymethyl)aminomethane
XIC: Extracted Ion Chromatogram
Y2H: Yeast two hybrid
YFP: Yellow Fluorescent Protein

Amino acids (one-letter code)

A alanine

C cysteine

D aspartic acid

E glutamic acid

F phenylalanine

G glycine

H histidine

I isoleucine

K lysine

L leucine

M methionine

N asparagine

P proline

Q glutamine

R arginine

S serine

T threonine

V valine

W tryptophan

Y tyrosine


Table of Contents

Summary
1 INTRODUCTION
1.1 Protein complexes are the machines of the cells
1.2 Methods of charting protein–protein interaction
1.2.1 Yeast two-hybrid screens
1.2.2 Affinity purification of protein assemblies
1.3 LC-MS/MS platforms for protein identification
1.4 Bottom-up approaches for the characterization of protein complexes
1.4.1 Gel-based protein identification
1.4.2 Shotgun-LC-LC-MS/MS
1.4.3 Pre-processing of data prior to database search
1.4.4 Protein identification by database search
1.4.5 Sorting out background proteins from genuine protein interactions
1.5 From individual proteins to protein interaction networks in mammals
2 Motivations for the thesis work
3 RESULTS AND DISCUSSION
3.1 Establishing the method of protein complex analysis
3.2 Shotgun LC-MS/MS outperforms geLC-MS/MS
3.3 Large scale shotgun LC-MS/MS for characterization of mammalian protein complexes
3.3.1 Filtering of common background proteins in HeLa cells
3.3.2 Functional annotation of uncharacterized genes
3.3.3 Conclusion
3.4 Computational post-processing of LC-MS/MS datasets
3.4.1 Unbiased database-independent filtering of tandem mass spectra
3.4.2 Filtering improves the performance of sequence similarity searches
3.4.3 Filtering improves the identification of low abundant gel separated proteins
3.5 A pipeline for the homology-driven proteomics
3.5.1 Homology-driven pipeline applied to affinity-purified complexes
3.5.2 Application of homology-driven proteomics pipeline: Proteomic map of T. infestans saliva
3.5.3 Conclusion
4 GENERAL CONCLUSIONS AND PERSPECTIVES
5 Materials and methods
6 Supplementary data
7 References
8 Acknowledgments


Summary

More than three quarters of proteins in mammalian cells function in the context of multi-protein complexes. The characterization of native complexes under different biological conditions is critical for understanding the molecular machinery underlying biological functions. Affinity purification of protein complexes and the identification of co-purified proteins by mass spectrometry is a powerful method for the discovery of biologically relevant protein-protein interactions. However, current methods for the characterization of protein complexes are laborious and time-consuming and are therefore not compatible with large-scale analyses that aim to cover multiple cellular conditions and obtain a network-based understanding of biological systems. To understand cellular processes in the context of these systems, it is necessary to develop proteomic assays that are compatible with large-scale studies. Therefore, the major goal of this thesis work was to develop a high-throughput method to characterize protein complexes and overcome the limitations of the conventional geLC-MS/MS or MudPIT protocols. The work improved throughput by more than 10-fold, with an associated 5- to 10-fold improvement in sensitivity, when compared to the "gold standard" geLC approach. Additionally, we show new approaches for solving a major challenge in the interpretation of protein interaction results from MS: distinguishing bona fide from non-specific interactors. This was achieved by developing filtering strategies that were applied at both the peptide and the protein level. These methods have proven to be robust in large-scale proteomics screens. Collectively, they were applied to the analysis of more than 500 immunoprecipitation experiments representing more than 100 unique baits.

Moreover, this work presents new methods to remove background tandem mass spectra from LC-MS/MS data sets, which improved both stringent and sequence-similarity identification of proteins. This approach was particularly useful when combined with error-tolerant searches for the proteomic analysis of organisms without a sequenced genome, and was applied to identify proteins from organisms of pharmacological or conservational relevance. Overall, the methods developed during this thesis were successfully employed for the discovery and characterization of novel protein complexes in cultured mammalian cells, mice and yeast, and in the proteomes of unsequenced organisms, and therefore represent a step forward towards the application of proteomic analysis in combination with large-scale studies.


Introduction

1 INTRODUCTION

1.1 Protein complexes are the machines of the cells

It is estimated that 80% of human proteins function as components of complexes or are regulated by protein-protein interactions (Rhodes et al., 2005). The assembly, stability and activity of protein complexes are regulated both by protein translation and by post-translational modifications in order to fulfill multiple roles within the cell. Protein translation, counterbalanced by protein degradation, controls the level of protein present in the cell and consequently the amount of protein available for the formation of complexes. Post-translational modifications such as phosphorylation and ubiquitination, on the other hand, can lead to conformational changes that affect the affinity and kinetic parameters of protein interactions, thereby acting as a regulatory mechanism. Several protein complexes have been described as components of larger cellular structures involved in different cellular activities, such as:

Signaling (e.g. TOR complex, voltage-gated channels)
Metabolism (e.g. cytochrome c oxidase (COX), cellulosome (degradation of cellulose))
Intracellular sorting (e.g. Rabs/Rab effectors, SNAP/SNARE, kinesins/myosins)
Molecular processing (e.g. proteasome, RNA/DNA polymerase, Ccr4-Not complex (regulator of mRNA metabolism))
Structural (e.g. cytoskeleton, chromatin structure, telomeres, centromeres)

The involvement of protein complexes in a variety of cellular processes highlights the importance of understanding in detail how such complexes are assembled from multiple subunits. Therefore, increasing effort has been put into the development of techniques for the global characterization of protein complexes under different biological conditions, such as distinct phases of the cell cycle and different drug treatments.
1.2 Methods of charting protein–protein interaction

Two major approaches are used for large-scale discovery of protein-protein interactions: 1) yeast two-hybrid screens, which detect binary interactions in vivo, and 2) affinity purification of complexes followed by protein identification by mass spectrometry, which detects native protein complexes.

1.2.1 Yeast two-hybrid screens

The yeast two-hybrid assay detects a binary interaction between proteins inside the nuclei of yeast cells. Each of the two candidate proteins is fused to a separate fragment of a transcription factor. Interaction of the candidate proteins reconstitutes a functional transcription factor, thereby inducing expression of a reporter gene (usually the lacZ gene). The first interactome map of an organism was obtained in this way for bacteriophage T7 (Bartel et al., 1996). Later, the method was applied to budding yeast (Ito et al., 2001; Uetz et al., 2000), Caenorhabditis elegans (Li et al., 2004), Plasmodium falciparum (LaCount et al., 2005), Drosophila melanogaster (Giot et al., 2003) and, recently, to the generation of the human interactome (Rual et al., 2005; Stelzl et al., 2005). However, yeast two-hybrid screening has significant limitations (Gentleman and Huber, 2007; Kocher and Superti-Furga, 2007; von Mering et al., 2002), most notably incomplete coverage (false negatives), detection of non-biological interactions (false positives) and low accuracy. For mammalian interactome mapping, the efficiency of the method decreases because proteins are often not correctly folded or modified in yeast (e.g. missing post-translational modifications, such as phosphorylation). Additionally, this approach only detects binary interactions; consequently, complexes that are stabilized by more than two partners are missed. Therefore, the potential interactions identified by two-hybrid screens normally require validation by other methods (Selbach and Mann, 2006).

1.2.2 Affinity purification of protein assemblies

As an alternative to Y2H, affinity purification can be applied to the purification of endogenous protein complexes.
This approach has several distinct advantages, including the recovery of the protein from its native environment and the potential to identify multiprotein complexes in the same experiment. Ideally, endogenous proteins would serve as baits (the bait being the protein used to pull down the protein complex), provided that antibodies or other reagents are available that allow the specific isolation of the protein along with its interaction partners. Unfortunately, specific antibodies that would enable efficient immunoprecipitation are not frequently available. A more general strategy is the expression of genetically engineered versions of the 'bait' fused with tags, which are amenable to isolation procedures. The tags are sequences coding for epitopes that are specifically recognized by different methods of affinity purification. Common recombinant tags used in affinity purification (Table 1) include the FLAG, 6 × His and glutathione-S-transferase (GST) systems (Hopp et al., 1988). FLAG, along with HA (Field et al., 1988) and c-myc (Evan et al., 1985), is a protein epitope that can be isolated with the use of antibodies; the 6 × His tag is used for the purification of recombinant proteins by means of metal chelate chromatography (Hochuli et al., 1988); and GST-tagged proteins are purified using glutathione agarose (Smith and Johnson, 1988). In addition to the traditional antibody and epitope combinations, other types of affinity tags were developed, such as the protein A tag, which binds to IgG (Uhlen et al., 1983). Several tags rely on the interaction with streptavidin (Schmidt and Skerra, 2007) or biotin (Grisshammer et al., 1993); others pair the maltose-binding peptide (MBP tag) with maltose (di Guan et al., 1988), the chitin-binding domain (CBD) with chitin (Chong et al., 1997), the calmodulin-binding peptide with calmodulin (Rigaut et al., 1999), or the S-peptide tag with the S-protein derived from pancreatic RNase A (Hackbarth et al., 2004).


Table 1 – List of commonly used epitopes.

Name | Affinity element | Detection | Purification
TAP | Calmodulin and IgG-binding domains | Anti-CBP | Calmodulin and IgG
CBP | CBP peptide | Anti-CBP | Calmodulin
Avitag | GLNDIFEAQKIEWHE | Avidin | Avidin
S-tag | S-peptide | Anti-S peptide | S-peptide
CBD | Chitin-binding domain | Anti-CBD | Chitin
MBP | Maltose-binding domain | Anti-MBP | Maltose
Strep-tag | WSAPQFEK | Strep-Tactin | Strep-Tactin
CD | 18 aa exon | 12CA5 | Immunoaffinity
Protein A | IgG-binding domain | IgG | IgG
GST | 220 aa GST | Anti-GST | Glutathione
c-myc | EQKLISEEDL | 9E10 | Immunoaffinity
HA | YPYDVPDYA | 12CA5 | Immunoaffinity
6 x His | HHHHHH | Anti-His | Metal affinity
FLAG | DYKDDDDK | M1, M2, M5 | Immunoaffinity
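Because several of the tags in Table 1 are defined by a short linear peptide sequence, their presence in a construct can be checked by plain substring search. The following is a minimal Python sketch under stated assumptions: the tag sequences are copied from Table 1, whereas the helper name find_tags and the example construct are hypothetical and serve only as an illustration.

```python
# Peptide tag sequences as listed in Table 1.
TAG_SEQUENCES = {
    "Avitag": "GLNDIFEAQKIEWHE",
    "Strep-tag": "WSAPQFEK",
    "c-myc": "EQKLISEEDL",
    "HA": "YPYDVPDYA",
    "6 x His": "HHHHHH",
    "FLAG": "DYKDDDDK",
}


def find_tags(protein_sequence):
    """Return {tag name: 0-based start index} for every tag found in the sequence.

    Hypothetical helper: only the first occurrence of each tag is reported.
    """
    sequence = protein_sequence.upper()
    hits = {}
    for name, tag in TAG_SEQUENCES.items():
        position = sequence.find(tag)
        if position != -1:
            hits[name] = position
    return hits


# Made-up construct: Met, FLAG tag, an arbitrary spacer, then a 6 x His tag.
example_construct = "MDYKDDDDKGSAGSAAPLTKHHHHHH"
print(find_tags(example_construct))
```

Larger affinity elements such as GST or the IgG-binding domain of Protein A are whole protein domains and would require a sequence alignment rather than an exact match.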

The most widely applied procedure for isolating protein complexes is tandem affinity purification (TAP) (Rigaut et al., 1999). The standard TAP tag consists of a calmodulin-binding domain, followed by a tobacco etch virus (TEV) protease cleavage site and Protein A (Figure 1). It is used for a two-step purification. The first step is based on the interaction between IgG and Protein A (Figure 1-A), followed by elution of the complex from the affinity resin by cleavage with TEV protease (Figure 1-B). The second step is based on the calmodulin-binding peptide, which interacts with calmodulin beads in the presence of calcium (Figure 1-C). The complex is then eluted by addition of a buffer containing EGTA, which chelates calcium (Figure 1-D). While TAP normally results in a relatively clean purification, a large number of nonspecific proteins are sometimes co-isolated during this process. Consequently, affinity purifications performed by TAP can still result in a mixture of a few hundred proteins (Shevchenko et al., 2008).

Figure 1- Overview of the TAP purification strategy. From (Puig et al., 2001)

Although the TAP tag was particularly effective in yeast, where endogenous genes can be tagged by homologous recombination, the conventional TAP tag has several limitations for use in higher eukaryotes. These include: the large initial amount of cells required (typically more than 5 × 10^9 cells); interaction of endogenous IgG and calmodulin with the tag; inefficient cleavage by TEV protease at 4 °C; and inefficient elution of complexes from the calmodulin beads (Barth et al., 1998; Peersen et al., 1997). Therefore, different tandem tags were developed for mammalian cells. One example utilizes green fluorescent protein (GFP) and the S-peptide, which improved on the classical configuration for use in higher eukaryotes. This tag allows for the localization of the protein in living cells along with subsequent immunoprecipitation and co-localization studies, and is termed the Localization and Affinity Purification tag (LAP-tag) (Cheeseman and Desai, 2005). One major limitation for tagging proteins in higher eukaryotes is that recombinant tagging of endogenous genes is less efficient (Burckstummer et al., 2006). To overcome this, stable integration of the construct (Bouwmeester et al., 2004) or transient transfection with a recombinant cDNA transgene (Westermarck et al., 2002) were employed as alternatives. However, this typically results in protein expression exceeding endogenous levels, thereby potentially disturbing the dynamic equilibrium of native interactions, enhancing non-specific interactions and, in some cases, leading to cellular toxicity (Jarvik and Telmer, 1998). Stably integrated bacterial artificial chromosome (BAC) transgenes offer a better alternative, in which the tagged gene is surrounded by its native regulatory elements, making it possible to select cells expressing endogenous levels of the protein (Poser et al., 2008). Because this method allows for simple, rapid and inexpensive high-throughput generation of large DNA constructs, the systematic exploration of protein complexes in higher eukaryotes has become possible. Automation along with parallel purification scales it up to a genomic level, which, in turn, requires a sensitive protein identification technology with comparable throughput.

1.3 LC-MS/MS platforms for protein identification

Mass spectrometry (MS) is a superior analytical technology for the analysis of purified protein complexes due to its ability to identify the components of protein mixtures. The major approach used today is based on the proteolytic fragmentation of proteins ("bottom-up" strategy) prior to MS characterization. The power of MS relies on the high information content of mass spectra of fragmented peptides, which are sufficiently specific to allow matching to database sequences.
However, if the peptide of interest is a minor component of a mixture, its identification can be compromised. To address this analytical demand, two-dimensional gel electrophoresis has been used to separate proteins, followed by in-gel digestion of excised spots and MS analysis to identify these proteins. However, this approach can only analyze a limited range of proteins and therefore suffers from a low dynamic range and low throughput (Rappsilber et al., 2002). Liquid chromatography-MS (LC-MS) platforms utilizing microcapillary columns have brought several improvements to protein analysis. LC-MS results in a two-dimensional separation, with mass/charge on one axis and retention time on the other. The use of LC columns with a smaller inner diameter improves the sensitivity of the system: sensitivity increases as the mobile phase flow rate drops, demonstrating the advantages of very low liquid flow rates in electrospray ionization (Wilm and Mann, 1994). Current microcapillary columns rely on the use of high pressure (5–20 kpsi), small porous particles (3-5 µm) and small inner diameters (30-100 µm i.d.), and work at low flow rates (nL/min) (Shen et al., 2005; Shen et al., 2002; Tolley et al., 2001). As a consequence, nano-LC-MS/MS is a superior platform for protein identification, providing both sensitivity and reliability. The high sensitivity of nano-LC-MS/MS is achieved mainly by concentrating peptides before MS detection. For example, a 75 µm i.d. reversed-phase C18 column can effectively concentrate peptides 100-300 fold into a volume of a few nL that can be dynamically injected into the ion source of the mass spectrometer. Since MS detection is concentration-dependent, such an increase in concentration translates effectively into an amplified MS signal. Consequently, it can dramatically increase the real sensitivity and therefore the identification of peptides. An improved acquisition rate of MS/MS spectra allows for better use of the narrow peak widths provided by highly efficient separations. The linear ion trap instrument (LTQ) (Schwartz et al., 2002) is able to acquire ∼10 MS/MS spectra in 1-2 s with ∼5× higher sensitivity compared to the previous 3D ion trap instruments (Douglas et al., 2005), which leads to more reliable acquisitions in a shorter time period.
Recently, the high sensitivity and fast acquisition of the linear ion trap were further combined with high-resolution analyzers within hybrid instruments, such as the LTQ-Fourier transform (FT) and the LTQ-Orbitrap (Zubarev and Mann, 2007). The Orbitrap is a novel mass analyzer that has been commercially available since 2005. In the Orbitrap (Figure 2, Panel B), ions are injected tangentially into the electric field between the electrodes and are trapped because their electrostatic attraction to the inner electrode is balanced by centrifugal forces arising from the injection (similar to an orbital trajectory). Thus, ions cycle around the central electrode along ring trajectories. In addition, the ions also move back and forth along the axis of the central electrode. Therefore, ions of a specific mass-to-charge ratio move in rings that oscillate along the central spindle. The frequency of these harmonic oscillations is independent of the ion velocity and is inversely proportional to the square root of the mass-to-charge ratio (m/z) (Makarov, 2000). By sensing the ion oscillation in a manner similar to FTICR-MS, this trap can be used as a mass analyzer. Orbitraps have a high mass accuracy (better than 1-2 ppm), a high resolving power (up to 200,000) and a high dynamic range (around 5,000). Compared to other instruments, such as the FTICR, the Orbitrap has an improved dynamic range related to its enhanced trapping capacity and lower space-charge-induced frequency shifts (Makarov, 2000). Consequently, more ions can be stored and detected simultaneously, which increases the sensitivity of the analysis. Additionally, the hybrid LTQ-Orbitrap offers the advantage of parallel acquisition, in which the survey scan takes place in the Orbitrap at high resolution while, at the same time, MS/MS events take place in the LTQ at high sensitivity and speed (but low resolution), maximizing the duty cycle.
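Two quantitative statements in this paragraph, the inverse square-root dependence of the axial oscillation frequency on m/z and the expression of mass accuracy in ppm, can be illustrated with a minimal Python sketch. The function names and the 2 ppm default tolerance are illustrative assumptions and do not describe any instrument software; the ppm formula is the usual relative mass error multiplied by 10^6.

```python
import math


def relative_axial_frequency(mz, reference_mz):
    """Axial frequency of an ion relative to a reference ion.

    Uses only the proportionality f ~ 1 / sqrt(m/z) stated in the text,
    so the instrument-specific field constant cancels out.
    """
    return math.sqrt(reference_mz / mz)


def mass_error_ppm(measured_mz, theoretical_mz):
    """Relative mass error in parts-per-million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(measured_mz, theoretical_mz, tolerance_ppm=2.0):
    """True if the measurement lies within the given ppm window (assumed default: 2 ppm)."""
    return abs(mass_error_ppm(measured_mz, theoretical_mz)) <= tolerance_ppm


# An ion with four times the m/z oscillates at half the frequency.
print(relative_axial_frequency(400.0, 100.0))
# A deviation of 0.002 at m/z 1000 corresponds to about 2 ppm.
print(round(mass_error_ppm(1000.002, 1000.0), 3))
```

Because the frequency depends on the square root of m/z, a given relative frequency error corresponds to twice that relative error in m/z.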

Figure 2 – Panel A – Design of the LTQ Orbitrap. The front part of the instrument is a linear ion trap mass spectrometer with high sensitivity and fast MS/MS acquisition, but low mass resolution and accuracy. The core part comprises the C-trap and the Orbitrap, in addition to the HCD cell. For high-resolution and accurate acquisition within the Orbitrap, accumulated ion populations are moved from the linear ion trap into the C-trap via an octopole ion guide. From the C-trap, ions are then injected into the Orbitrap in a short pulse and begin to circle the central electrode. The signal is detected via a differential amplifier between the outer Orbitrap electrodes. Panel B – Electrode structure of the Orbitrap mass analyzer showing the trajectory of trapped ions.


Typically, the mass spectrometer cannot isolate and acquire MS/MS spectra fast enough to identify all peptides as they elute from the LC column, so the limited time available for MS/MS acquisition of co-eluting peptides is a challenge. This problem was, to some extent, reduced by data-dependent acquisition (DDA), which controls the acquisition process of the mass spectrometer. In this strategy, the software selects ions with predefined features in the full scan (MS1) for fragmentation (Table 2). The most intense ions observed in the survey scan (normally 3-10 precursors, depending on the settings) are selected and subjected to fragmentation in a cyclic way. Each survey scan is followed by 3-10 MS/MS scans, and then a new cycle starts with a survey scan and the selection of another set of precursors for subsequent fragmentation (Figure 3). This DDA process takes place along the whole LC gradient and consequently generates a large number of MS/MS spectra (Figure 4).

Table 2 – Data-dependent acquisition has many features to maximize the amount of useful data acquired from each run.

Feature | Benefit
Selection of the most abundant parents | Increases the number of unique precursor ions from which data are acquired. Can be set separately for MS>2. Especially helpful for co-eluting peaks.
Isotopic exclusion | Prevents acquisition of unnecessary MS/MS data from isotopes (C13 etc.).
Static include list | Ensures data is acquired from target ions, even if they are present at low abundance.
Static exclude list | Eliminates acquisition of MS/MS data from solvent ions and other background components.
Static preferred list | Ensures data is acquired from target ions, if present, but allows acquisition of data from other ions if target ions are not present.
Preferred charge state selection | Ensures selection of doubly charged peptide ions for higher quality MS/MS data.
Active exclusion – repeat count | Increases the amount of unique data acquired by preventing acquisition of MS/MS data from the same ion more than a user-specified number of times.
Active exclusion – exclude time | Removes ions from the active exclusion list after a user-specified time to ensure reacquisition if an ion appears in more than one chromatographic peak.
Threshold – absolute | Ensures data is acquired only from ions of user-specified abundance.
Threshold – relative | Ensures data is acquired only from ions of user-specified abundance relative to the most abundant ion in an MS scan.
Neutral loss detection | Identifies neutral losses in MS/MS and triggers MS3 or pseudo-MS3 for additional structural information. Especially useful for analysis of phosphopeptides.


Figure 3 - Data-dependent acquisition of the "top four" most intense ions per MS scan applied to LC-MS/MS. The cycle starts with a survey scan that detects all ionized peptides eluting from the LC column at a given time. Then, several precursors are selected for fragmentation based on intensity and charge state. Afterwards, a new cycle starts with a survey scan and the selection of another set of precursors for subsequent fragmentation. The duration of a complete cycle depends on the acquisition speed of the instrument and is normally on the order of a few seconds.

Figure 4 – Total ion chromatogram of 6 digested proteins acquired by nano-LC-MS/MS on an LTQ Orbitrap during a gradient of 45 min. Peptides were separated by nano-LC (200 nL/min flow rate) before being sprayed into the mass spectrometer. MS (~500) and MS/MS (~2000) spectra were acquired in data-dependent acquisition mode as shown in Figure 3. The black bars in the chromatogram represent the MS/MS events obtained.

1.4 Bottom-up approaches for the characterization of protein complexes

There are two main approaches for the characterization of protein complexes based on proteolytically digested proteins (bottom-up): the 'sort-then-break' and the 'break-then-sort' (shotgun) approaches. 'Sort-then-break' consists of off-line protein fractionation performed by SDS-PAGE followed by in-gel digestion, so that separation is obtained before protein digestion (gel-based approach) (Shevchenko et al., 1996). This is followed by peptide separation by reversed phase LC interfaced to a tandem mass spectrometer (also known as geLC-MS/MS). 'Break-then-sort' relies on protein digestion without any prefractionation or separation (known as shotgun); afterwards, the peptides are separated by multidimensional chromatography followed by tandem mass spectrometric analysis.

1.4.1 Gel-based protein identification

The most commonly used method for fractionation of affinity purified complexes is SDS-PAGE and staining with MS-compatible dyes, followed by in-gel digestion. The separation by SDS-PAGE eliminates common contaminants (such as detergents and salts) from the affinity purification. At the same time, it decreases the complexity of the protein mixture by separating individual proteins according to their molecular weight. Additionally, staining of the gel gives a semi-quantitative assessment of the protein yield. Individual protein bands within the gel can be excised, or the entire gel lane can be cut into slices that are independently in-gel digested. After in-gel digestion, usually by trypsin, the resulting peptides are extracted and separated on a microcapillary column according to their hydrophobicity by reversed phase LC coupled on-line to the mass spectrometer. These two steps of fractionation (SDS-PAGE and reversed phase LC) improve the resolving capacity of the method. The combination of SDS-PAGE and LC-MS/MS analysis is termed geLC-MS/MS. This method has been applied in several genome-scale analyses of protein complexes (Ewing et al., 2007; Gavin et al., 2002; Ho et al., 2002; Krogan et al., 2006; Uetz et al., 2000) and is considered the gold standard in the field.


Figure 5 – Schematic representation of the geLC-MS/MS approach. Affinity purified proteins are first concentrated and loaded onto the SDS-PAGE gel, followed by in-gel digestion and analysis by nanoLC-MS/MS.

However, the gel-based strategy has important limitations: 1- a higher concentration of the protease is required for in-gel digestion than for in-solution digestion (Havlis and Shevchenko, 2004; Shevchenko et al., 2006), which yields an abundant background of trypsin autolysis products; 2- it is laborious and therefore low-throughput (Chen and Pramanik, 2009); 3- it involves excessive handling, which increases the risk of contamination with keratins and enhances chemical noise (Shevchenko et al., 2006); 4- sample losses occur during the process of concentration (sometimes including precipitation), which is necessary for loading the large sample volume from affinity purification (50-2000 µL depending on the method); 5- peptide recovery is reduced during in-gel digestion (Richert et al., 2004) and displays a peptide sequence-dependent variability of 20-40% when compared to in-solution digestion (Havlis and Shevchenko, 2004).

1.4.2 Shotgun-LC-LC-MS/MS

In this gel-free approach, purified complexes are digested in-solution, followed by multidimensional chromatographic separation of the resulting peptide mixture. Usually, two or more steps of LC fractionation are applied in order to resolve complex peptide mixtures. Chromatographic resolution is improved because the method combines the peak capacities of each orthogonal peptide separation (Giddings, 1984). Most commonly, strong cation exchange chromatography (SCX) as the first dimension is combined with reversed phase (RP) chromatography as the second dimension. In this way, each fraction of peptides eluted by the salt gradient from SCX is further resolved by RP column chromatography in a gradient of organic solvent. New orthogonal strategies in the first dimension, such as anion exchange, high pH reversed phase (Delmotte et al., 2007), isoelectric focusing (Hubner et al., 2008) and affinity chromatography, have also been described. However, the second dimension is always RP because a water/methanol or water/acetonitrile eluent is compatible with on-line MS.

Figure 6 - Schematic representation of multidimensional chromatography separation. The purified complex is digested in-solution without prior separation of proteins followed by multidimensional chromatography separation of the resulting peptides.

Initially, multidimensional chromatography was designed such that each step of the salt gradient was collected offline and then analyzed by RP LC. Online coupling, in which the eluent from the SCX column is directly injected into the RP column, was introduced later. Because no fractions are collected, sample losses are minimized and throughput is improved (Link et al., 1999). Different online methods can be used, such as separate columns for SCX and RP connected by switching valves (Mitulovic et al., 2004; Taylor et al., 2009), or multidimensional protein identification technology (MudPIT) (Wolters et al., 2001). MudPIT consists of an SCX stationary phase packed upstream of the RP sorbent inside the same fused silica capillary column. Peptides are first collected on the SCX phase and successively eluted, depending on their isoelectric point, in steps of increasing salt concentration, and captured by the second-dimension RP phase. Between salt steps, the reversed-phase column is eluted with an increasing gradient of organic solvent to displace the peptides into the ion source of the mass spectrometer. A typical MudPIT experiment consists of a 4-12-cycle run in which a nanoRP-LC gradient is run for each cycle of salt concentration (Table 3).

Table 3 - A twelve-cycle MudPIT acquisition. The concentration is given as a percentage of 500 mM ammonium acetate used for each step of salt concentration.

Cycle   Duration (min)   Concentration (%)
1       90               0
2       112              4
3       112              8
4       112              10
5       112              12
6       112              15
7       112              20
8       112              30
9       112              40
10      112              50
11      140              75
12      140              100

The MudPIT approach was applied to characterize complex protein mixtures (Florens et al., 2002; Washburn et al., 2001) and isolated protein complexes (Link et al., 1999). However, the major disadvantage of the multidimensional separation of peptides is that the method is very slow and unreliable in large-scale operations. Considering the number of salt gradient steps (between 4 and 12) in the first SCX dimension, an LC-LC experiment for a single sample might take more than 24 hours of MS acquisition (Washburn et al., 2001).
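The time cost can be checked directly from the cycle durations listed in Table 3. The short Python sketch below only sums the gradient times; it is an illustration that ignores column re-equilibration, sample loading and any other overhead between cycles, which come on top of this figure.

```python
# Durations (min) of the twelve cycles listed in Table 3.
durations_min = [90] + [112] * 9 + [140] * 2

total_min = sum(durations_min)
total_h = total_min / 60

print(f"{len(durations_min)} cycles, {total_min} min = {total_h:.1f} h")
```

The gradients alone add up to roughly one day of instrument time for a single sample, which is consistent with the acquisition times discussed above.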

1.4.3 Pre-processing of data prior to database search

LC-MS/MS analysis of protein complexes by the bottom-up approach under data-dependent MS/MS acquisition control produces thousands of tandem mass spectra of varying quality and information content. Therefore, successful protein identification by database search requires pre-processing of the mass spectrometric data. The main goal of this step is to increase the specificity, sensitivity and accuracy of database searches. It includes tools for peak detection and noise reduction (Du et al., 2006); de-isotoping of isotopic clusters (Hoopmann et al., 2007; Mujezinovic et al., 2006); charge state determination (Colinge et al., 2003; Sadygov et al., 2002); recognition of spectra generated from co-isolated and co-fragmented peptides (Carvalho et al., 2009; Zhang et al., 2005); merging and clustering of redundant mass spectra acquired from the same precursor (Beer et al., 2004; Tabb et al., 2005); and removal of low quality spectra (Bern et al., 2004; Flikka et al., 2006; Savitski et al., 2005). Another challenge in data pre-processing is that analyzed samples contain a large, diverse and poorly defined group of spectra that are often termed peptide background. They originate from ubiquitous human and sheep keratin contaminants; from autolysis products of proteolytic enzymes, such as trypsin (Parker et al., 1998), which are especially abundant in in-gel protein digests (Shevchenko et al., 2006); and from preparation-specific protein contaminants (Roguev et al., 2004; Shevchenko et al., 2002), such as proteins from the cell media or the expression host organism, antibodies, GST, TEV and PreScission proteases, etc. Many of these sequences are either not present in a given database or scattered through a large number of partially redundant database entries. When abundant, they also give rise to a large pool of polymorphic sequences, orifice fragmentation products, sodium adducts, etc. Another group of tandem mass spectra that confuse the identification of proteins are those acquired from precursors of non-peptide origin (detergents, plasticizers, etc.) (Schlosser and Volkmer-Engert, 2003).
They are mostly detected as singly charged ions and, assuming the employed mass spectrometer offers adequate mass resolution, are readily recognized in survey scans and, if required, excluded from subsequent MS/MS experiments. However, they are often co-selected with genuine multiply charged peptide precursors and might contaminate MS/MS spectra since, for better sensitivity, the width of the isolation window is kept at 2 to 4 Da (Loboda et al., 2000; Medzihradszky et al., 2000). To address this problem, computational methods have been developed to recognize background spectra by comparison with a reference library (Gentzel et al., 2003; Yates et al., 1998). Although these are robust solutions, they are computationally intensive and do not provide a statistically transparent cross-platform framework, since they derive empirical similarity thresholds directly from the acquired data. Therefore, new solutions have to be implemented in order to obtain a reliable method for filtering background spectra for an efficient database search.

1.4.4 Protein identification by database search

A number of computational methods have been developed to assign peptide sequences to acquired tandem mass spectra in bottom-up proteomics. Generally, there are three ways to identify a protein:
- Identification using unprocessed MS/MS data.
- Identification by Sequence Tag.
- Identification by de novo sequencing.

Identification using unprocessed MS/MS – Despite the exponential increase in the number of sequenced genomes, efficient matching of MS/MS spectra with the sequences of proteins in databases is possible. A pioneering work by the Yates group (Eng et al., 1994) proposed a method to correlate uninterpreted tandem mass spectra of peptides produced under low collision energy with peptide sequences in a protein database, which has since become the most powerful approach for protein identification. This strategy dramatically improved the throughput of analysis, given that a database search with fragmentation patterns is a less demanding process than the manual de novo sequencing used previously. An MS/MS spectrum usually does not contain sufficient information to derive the complete and unambiguous sequence of amino acids. However, it contains sufficient information to uniquely match the spectrum to an in-silico pattern of peptide fragments by comparing the m/z of the precursor and of the fragments (Figure 7).
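The in-silico fragment pattern mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the scoring scheme of any particular search engine: it computes only singly charged b and y ions from standard monoisotopic residue masses and counts how many of them are found in a list of observed fragment m/z values. The function names and the 0.5 Da tolerance are assumptions made for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[a] for a in seq) + WATER


def fragment_ions(seq):
    """Singly charged b ions (b1 first) and y ions (largest y ion first)."""
    b, y = [], []
    for i in range(1, len(seq)):
        b.append(sum(RESIDUE_MASS[a] for a in seq[:i]) + PROTON)
        y.append(sum(RESIDUE_MASS[a] for a in seq[i:]) + WATER + PROTON)
    return b, y


def count_matches(observed, theoretical, tol=0.5):
    """Number of theoretical fragments with an observed peak within tol (Da)."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


# Usage with a tryptic peptide of BSA; the "observed" list is made up.
b_ions, y_ions = fragment_ions("AEFVEVTK")
observed = [y_ions[0], y_ions[2], b_ions[1]]
print(round(peptide_mass("AEFVEVTK"), 4), count_matches(observed, b_ions + y_ions))
```

A real search engine first restricts the candidates to database peptides whose mass agrees with the precursor mass and then converts such fragment matches into a probability-based score.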


Figure 7 – Protein identification using MS/MS spectra. Precursors representing peptides obtained by digestion are selected for fragmentation (MS/MS), and the resulting spectra are compared with the fragment patterns of all peptides in a database that match the precursor m/z. The quality of the match between the experimental and theoretical spectra is assessed by a probability-based score that discriminates genuine matches from random events.

Several algorithms are available for database searching with unprocessed MS/MS spectra (Table 4). These programs apply probability-based scores to establish the confidence of identifications.

Table 4 - List of common MS/MS database search engines (adapted from Lu et al., 2009).


The automated database search is widely applied in the proteomics field and works efficiently for proteins from model organisms. However, it requires a very high specificity of spectrum-to-sequence correlation, and any discrepancy between the actual data and the database sequences might prevent peptide identification, regardless of whether this variability is due to an amino acid substitution, an unexpected post-translational modification, a peculiar fragmentation pathway, or simply the absence of the protein sequence from the database. As a consequence, a large number of MS/MS spectra acquired in large-scale proteomics remain unmatched or match peptides with low scores, resulting in the identification of proteins that were not actually in the sample (false positives) or in missing proteins that were in fact in the sample (false negatives) (Johnson et al., 2005). Furthermore, background proteins originating from exogenous species are commonly found in samples. Examples are keratins, autolysis products of enzymes, antibodies, and fragments of expression vectors. They increase the false positive rate, especially if the search is performed against species-specific databases. Thus, several statistical strategies are currently used to evaluate the confidence of putative peptide identifications with marginal database scores (Elias and Gygi, 2007; Fenyo, 2000; Keller et al., 2002; MacCoss et al., 2002), also reviewed in (Fitzgibbon et al., 2008). The most commonly used strategy for the assessment of the false discovery rate (FDR) of large proteome datasets is searching a randomized (decoy) database in parallel to the target database (Elias and Gygi, 2007). This approach assumes that matches to decoy peptide sequences and false-positive matches follow the same distribution.
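The target-decoy idea can be made concrete with a minimal Python sketch. It uses one simple estimator, the number of decoy matches divided by the number of target matches above a score threshold; other estimators are in use, and the function names and the default 1% limit are assumptions made for this illustration.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among matches scoring at or above the threshold.

    psms is a list of (score, is_decoy) pairs, one per spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR does not exceed max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best
```

With only a handful of matches, a single decoy hit changes the estimate substantially, which illustrates why the estimate becomes unstable for small datasets.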
However, the decoy strategy has limitations: FDR is estimated incorrectly in small datasets generated from less complex samples, and the database search time is doubled (reviewed in Nesvizhskii et al., 2007).

Identification by Sequence Tag – The "Sequence Tag" protein identification strategy was developed by the Matthias Mann group in the mid-1990s (Mann and Wilm, 1994). A peptide sequence tag consists of a sequence stretch of, typically, 2 to 4 amino acid residues deduced from the MS/MS spectrum, the pair of masses flanking the tag at its two ends, and the precursor mass (Figure 8).

Figure 8 – Assignment of sequence tags from an MS/MS spectrum. (A) Spectrum collected on an ion trap mass spectrometer. The deduced short amino acid sequence tag is shown in blue. (B) Simplified diagram of the spectrum illustrating how the "Sequence Tag" is generated at a precise distance from the peptide termini. The sequence tag in this case is defined as . Adapted from the www.ionsource.com web page.

In the standard settings of a sequence tag database search, both masses and the sequence are required to match. However, a sequence tag can also allow error-tolerant searches by assuming that one of its regions (and, consequently, the intact mass) can mismatch, whereas the rest of its sequence is identical to the database peptide. Thus, unexpected post-translational modifications or sequence polymorphisms can be tolerated (Mann and Wilm, 1994). In this way, sequence tags can be used to validate and improve conventional searches of large MS/MS datasets (Tabb et al., 2003) or to help genome annotation (Frank, 2009; Wright et al., 2009). Sequence tags have also been successfully applied to the identification of proteins that are not present in the databases (e.g. from unsequenced organisms) via cross-species identification to related proteins in the database (Shevchenko et al., 1997). However, the loose matching requirements of error-tolerant searches result in a dramatic loss of search specificity. Search specificity can be improved, and manual inspection alleviated, if hit lists with several sequence tags produced from each of the fragmented peptides are combined and evaluated together using a dedicated statistical model (Sunyaev et al., 2003).

Identification by de novo sequencing and sequence similarity searches – Database algorithms using unprocessed MS/MS data imply that the genome of the organism is accurately sequenced and all ORFs are correctly annotated. Both conditions are rarely met due to the challenges of genome assembly and the alternative splicing of gene products, many of which are not adequately represented in the databases. Recent efforts have focused on developing alternative strategies for the identification of peptides based on de novo sequencing methods (Dancik et al., 1999; Frank and Pevzner, 2005a; Ma et al., 2003; Taylor and Johnson, 1997). In de novo sequencing, the original peptide sequence is reconstructed without knowledge of the protein's sequence. De novo candidate sequences can be generated by automatic interpretation of MS/MS spectra using algorithms such as Lutefisk (Taylor and Johnson, 1997), Sherenga (Dancik et al., 1999), Peaks (Ma et al., 2003) and PepNovo (Frank and Pevzner, 2005a). Nevertheless, de novo sequencing is still difficult and error-prone and often produces ambiguous sequences. These imperfections are mainly related to difficulties in differentiating by mass spectrometry between isobaric amino acids, such as leucine and isoleucine, or amino acids of nearly identical masses (e.g. glutamine/lysine and phenylalanine/oxidized methionine), which could be solved in part by high accuracy MS. In addition, MS/MS interpretation results in relatively short sequences (usually 6-12 amino acids), which are highly redundant and contain gaps.
As a result, algorithms for comparing accurate and long protein sequences, such as BLAST or FASTA, are not efficient for such database searches, since they strongly penalize amino acid permutations. To address this problem, sequence alignment algorithms have been modified in order to allow matching of candidate sequences generated from mass spectrometry to protein databases. For instance, MS BLAST (Shevchenko et al., 2001), FASTS (Mackey et al., 2002), and ProBLAST (Arif et al., 2004) are programs that can be used to align de novo proposed sequences to a database using efficient sequence alignment algorithms (Figure 9).
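The mass ambiguities that complicate de novo interpretation, as discussed above, can be quantified with a small Python sketch. It uses standard monoisotopic residue masses; the reference m/z of 1000 used to express the differences in ppm is an arbitrary value chosen for illustration.

```python
# Monoisotopic residue masses in Da.
LEU = ILE = 113.08406
GLN, LYS = 128.05858, 128.09496
PHE = 147.06841
MET_OX = 131.04049 + 15.99491  # methionine residue plus one oxygen atom


def ppm(delta_da, mz):
    """Mass difference expressed in parts per million of a given m/z."""
    return abs(delta_da) / mz * 1e6


pairs = [("Leu/Ile", LEU, ILE), ("Gln/Lys", GLN, LYS), ("Phe/Met(ox)", PHE, MET_OX)]
for name, first, second in pairs:
    delta = abs(first - second)
    print(f"{name}: {delta:.5f} Da = {ppm(delta, 1000.0):.1f} ppm at m/z 1000")
```

Leucine and isoleucine are strictly isobaric and cannot be resolved by mass at any accuracy, whereas the other two pairs differ by a few hundredths of a dalton and become distinguishable once the mass accuracy of the fragment measurement is better than that difference.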

[Figure 9 panels: MS/MS spectrum (intensity versus m/z); spectral graph construction; de novo interpretation yielding semi-optimal candidate sequences (QGDFVLESTK, QGDGVLGFK, QGIVFKLSTK, …) as a list of possible sequences for each spectrum; Mass Spectrometry driven BLAST (MS BLAST).]

Figure 9 - Approach to peptide identification based on automatic de novo sequencing using spectral graph reconstruction and protein database search by MS BLAST.

Sequence similarity searches are efficient in the identification of proteins that are not in databases, by comparing their sequences to those of proteins already present in databases (Habermann et al., 2004a; Liska and Shevchenko, 2003). Given that sequence similarity searches allow multiple mismatches between the compared sequences (Figure 10), the chance of correct cross-species identification is increased compared to conventional strategies. Computational simulations suggested that MS BLAST allows an efficient cross-species identification of peptides down to 50% sequence identity (Habermann et al., 2004a).


Figure 10 - Matching of peptide sequences via cross-species identification. Stringent search only allows cross-species identification when a full match between the protein in the database and the unknown protein is present. Sequence tag and stringent search methods allow error-tolerant search either by point mismatch or partial match. However, the most efficient approach when the similarity between the protein in the database and the unknown protein is not high is by sequence similarity searches based on de novo sequencing, where several mismatches are allowed between the experimental protein and protein in the database.

Sequence similarity searches employ peptide sequence candidates, rather than raw MS/MS spectra as in the conventional database search. Consequently, this method can be considered an orthogonal strategy for the identification of proteins. It can be used for error-tolerant searches on top of conventional searches or to validate statistically borderline hits obtained by conventional database searches (Wielsch et al., 2006). Among the thousands of spectra acquired in a typical LC-MS/MS run, a large fraction belongs to peptides originating from background proteins such as trypsin and keratins, and even more of them can be matched via error-tolerant searches (Junqueira et al., 2008a; Junqueira et al., 2008b; Waridel et al., 2007). However, peptides from background proteins disturb sequence similarity searches by hitting a very large number of unrelated proteins to which they might have some local similarity. For example, trypsin autolysis products hit many serine proteases of diverse functional specificity and species of origin. Due to low complexity sequence stretches, keratin peptides could match almost any protein in a database with varying degrees of statistical confidence. Therefore, it would be advantageous to remove these spectra prior to database searches. However, little effort has been made to reduce the amount of data in order to avoid possible false positives (Gentzel et al., 2003; Waridel et al., 2007).

1.4.5 Sorting out background proteins from genuine protein interactions

One of the major challenges of protein complex analysis by MS is distinguishing bait-specific interacting proteins within the list of identified proteins. Since LC-MS/MS is sensitive, it produces a large number of confident protein identifications in each pull-down, most of them being unspecific co-isolated proteins. In general, the background proteins can be grouped into three categories: 1- highly expressed proteins such as ribosomal proteins, metabolic enzymes, stable constellations of proteins (e.g. cytoskeleton) or proteins from organelles (e.g. ER) that, despite weak non-specific binding to the bait, are found in considerably large quantities (Schirle et al., 2003); 2- hydrophobic proteins that interact with the matrix (beads used for purification) (Trinkle-Mulcahy et al., 2008); 3- proteins that interact with unfolded tagged proteins (such as chaperones). However, the origin of a large number of common background proteins remains unclear and also seems to be related to the proteomic environment of the bait used for complex isolation.

Several strategies have been presented in order to identify unspecifically bound proteins in protein-protein interaction experiments. In a more focused assay, where the goal is to characterize a small set of complexes, it is crucial to have negative controls, with which one can exclude proteins that bind to the matrix, tag and bait in an unspecific manner. Three approaches can be applied to generate the negative control: 1- expressing the cassette containing just the tag (without the bait) and performing the IP in parallel; 2- using another antibody that does not target the tag; 3- using the wild type cell (without the tag) to perform the negative control IP. Another possible strategy to sort out common background proteins in high-throughput data is the semi-quantitative comparison of their abundance against core proteomes (Schirle et al., 2003), or monitoring the relative abundance profiles of co-purified proteins (Andersen et al., 2003; Mueller et al., 2007; Rinner et al., 2007; Shevchenko et al., 2008). A further approach is based on the subtraction of promiscuous proteins isolated in a large number of purifications using a specific cut-off value (Krogan et al., 2004), or on the elimination of proteins present in affinity purifications of unrelated proteins (Bouwmeester et al., 2004; Shevchenko et al., 2008).

Recently, isotope labeling was combined with control experiments to distinguish unspecific from specific binding proteins in isolated complexes (reviewed in Vermeulen et al., 2008). Metabolic and chemical labeling strategies can be used for introducing heavy isotopes into control samples for relative quantification. For example, in quantitative proteomics based on stable isotope labeling with amino acids in cell culture (SILAC) (Ong et al., 2002), cells containing an affinity-tagged protein are grown in light isotopic medium, while wild-type cells are grown in heavy isotopic medium. Equal quantities of these two cell preparations are mixed and the affinity-tagged protein complex is isolated. After isolation of the complex, specific protein interactions are identified by mass spectrometry as isotopically light, while nonspecific interactions appear as a mixture of isotopically light and heavy (Tackett et al., 2005). Other examples of quantitative MS strategies applied to facilitate the identification of bona fide interactors in protein complexes are iTRAQ (Zieske, 2006), iCAT (Shiio and Aebersold, 2006), O18 stable isotope labeling (Ye et al., 2009) and methods based on label-free quantification (Rinner et al., 2007).
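The logic of the SILAC-based control described above can be sketched in Python. The sketch assumes that summed light and heavy intensities are already available for each protein; the function names, the protein names in the usage example and the 0.9 cut-off are hypothetical and serve only to illustrate the principle that specific interactors are almost exclusively light, whereas nonspecific binders approach an even mixture of light and heavy.

```python
def light_fraction(light, heavy):
    """Fraction of the total signal that carries the light isotope label."""
    return light / (light + heavy)


def classify(proteins, cutoff=0.9):
    """Label each protein as 'specific' or 'nonspecific'.

    proteins maps a protein name to (light_intensity, heavy_intensity).
    The cutoff on the light fraction is an illustrative assumption.
    """
    labels = {}
    for name, (light, heavy) in proteins.items():
        is_specific = light_fraction(light, heavy) >= cutoff
        labels[name] = "specific" if is_specific else "nonspecific"
    return labels


# Made-up intensities for two hypothetical proteins.
example = {"complex_subunit": (9.8e6, 0.2e6), "ribosomal_protein": (5.0e6, 5.0e6)}
print(classify(example))
```

In practice the cut-off would have to be derived from the distribution of light fractions observed for known contaminants in the same experiment.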

1.5 From individual proteins to protein interaction networks in mammals

While considerable progress has been made in mapping the protein interaction network in yeast (Gavin et al., 2006; Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Krogan et al., 2006; Uetz et al., 2000), our understanding of protein networks in higher eukaryotes is limited. One of the reasons is that a much larger number of interactions is expected: from the 20,000–25,000 human genes, a network of roughly 1–400,000 interactions is anticipated (Hart et al., 2006). Therefore, the field is limited mainly by the current technological throughput, reproducibility and reliability (Kocher and Superti-Furga, 2007; Shevchenko et al., 2002). Accordingly, major efforts are underway to support more reliable analysis of PPI networks in mammalian cells for functional genomics studies (Bader and Hogue, 2003; Ewing et al., 2007; Ramani et al., 2008). Nevertheless, even after including previously known human protein interactions, the human protein interaction map may be only 10–30% complete (Hart et al., 2006).


2 Motivations for the thesis work

Affinity purifications result in a mixture of several hundred proteins, including a large number of unspecific proteins of diverse origin and abundance. They come in a wide variety of volumes, concentrations and buffers. Typically, at least 5-10 affinity purifications have to be performed in order to deduce a robust composition of a complex consisting of multiple subunits. However, analytical throughput is compromised by the methods currently used for protein complex characterization (MudPIT and geLC-MS/MS), which are very laborious and slow. Therefore, a reliable analytical approach offering high throughput, scalability and robustness at the low femtomole range is required. The major aim of this work was to overcome the practical limitations of conventional analysis by developing a fast and reliable method to dissect mammalian protein complexes by MS. The work involved:
- Development of Shotgun-LC-MS/MS analysis for the characterization of protein complexes.
- Development of computational processing methods to identify bona fide interactors in purified complexes.
- Development and validation of an algorithm that removes background tandem mass spectra from LC-MS/MS data to improve protein identification.
- Optimization of an automated pipeline that combines the high sensitivity and dynamic range of LC-MS/MS analysis with sequence-similarity searches.


3 RESULTS AND DISCUSSION

3.1 Establishing the method of protein complex analysis

Extensive sample manipulation leads to sample loss and reduced reproducibility, thereby compromising the throughput and confidence of protein identification by MS. Here, I propose a simple analytical strategy, in which isolated protein complexes are digested in-solution with trypsin directly in large volumes of the eluate from affinity purification, followed by off-line cleanup of the tryptic peptides and their analysis by LC-MS/MS on an LTQ Orbitrap. Our working concept was that the resolving power offered by nano-LC and the high spectra acquisition rate of an LTQ Orbitrap alleviate the necessity of multidimensional LC or gel separation. I have optimized the resolving power of the nano-LC-MS to improve the separation of tryptic peptides obtained by digesting complex mixtures of proteins. To minimize sample carryover and consequently speed up the analysis, the nano-LC was operated using a parallel set of two pre-columns employed for simultaneous separation and washing procedures (Schaefer et al., 2004). While one of the two pre-columns was used for desalting and concentrating the loaded sample, the other was extensively washed using a separate gradient. Different gradient lengths ranging from 75 to 155 min were tested, considering the peak capacity, which represents the maximum theoretical number of components that can be separated on a column within a given gradient time. In each case, peak capacities were calculated from the average peak width w of 5 peptides, measured at 13% of peak height, and the gradient (separation) time tg according to Equation (1) (Gilar et al., 2004):

P = 1 + tg/w (1)

By applying a 155 min gradient, an improvement of 40% in peak capacity was achieved compared to a 75 min gradient (Figure 11). From that we concluded that the optimized method would lead to better sensitivity and dynamic range because of the higher peak capacity and the absence of carry-over between the analyzed samples.
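Equation (1) translates directly into a short Python function. The peak widths in the usage example are invented placeholders, not the measured values behind Figure 11; they only show how the calculation is applied to two gradient lengths.

```python
from statistics import mean


def peak_capacity(gradient_min, peak_widths_min):
    """Peak capacity P = 1 + tg / w, with w the average peak width (min)."""
    return 1 + gradient_min / mean(peak_widths_min)


# Hypothetical widths (min) of five peptides for two gradient lengths.
widths_75 = [0.38, 0.40, 0.41, 0.39, 0.42]
widths_155 = [0.57, 0.60, 0.62, 0.59, 0.61]

p_short = peak_capacity(75, widths_75)
p_long = peak_capacity(155, widths_155)
print(f"relative gain in peak capacity: {100 * (p_long / p_short - 1):.0f} %")
```

Because peaks broaden in longer gradients, the peak capacity grows more slowly than the gradient time, so doubling the gradient length does not double the number of resolvable components.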


Figure 11 – Peak capacity calculated from Eq. (1) using the average peak width of 5 different peptides for different lengths of gradient separation.

How to trap peptides from diluted samples? In order to establish a suitable method for concentration/desalting of peptides after in-solution digestion of affinity purified proteins, we optimized total peptide loading, peptide concentrations, loaded volumes, and the buffer composition. Two methods to cleanup peptides were tested: hydrophilic interaction (HILIC) and reversed phase chromatography. HILIC was chosen because of its ability to remove detergents (Boersema et al., 2007; Hagglund et al., 2004) (review in (Boersema et al., 2008), whereas

reversed

phase

was

tested

because

it

is

widely

applied

for

concentration/desalting of peptides (Ishihama et al., 2006; Rappsilber et al., 2007). To compare the peptide recovery between the two methods I used a label-free quantification. Mass spectrometric signal intensity of peptide (ion current) were extracted from the ion chromatogram of nanoLC-MS acquisition and directly compared between independent runs (Bondarenko et al., 2002; Chelius and Bondarenko, 2002) reviewed in (America and Cordewener, 2008). Using a complex mixture of proteins we showed that the peak area of the extracted ion chromatogram (XIC) from the same precursor can be compared over independent injections considering the precise m/z, charge and retention time (Figure 12). Peak areas were linear over a broad range of concentration (Figure 13). 42

We conclude that it is possible to use the extracted peak areas to calculate the relative amount of peptides in different samples.

Figure 12 – Extracted ion chromatogram (XIC) for the same peptide (m/z 908.9749) eluting at 45 min in two different LC-MS acquisitions (A and B). Because the XIC area of peptides responds linearly over a broad range of peptide concentrations (Figure 13), this method can be used for relative quantification, as previously reported (Bondarenko et al., 2002).
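As a minimal sketch of how an XIC peak area can be obtained, the function below sums the intensity found within a ppm tolerance of the precursor m/z in each scan of a retention time window and integrates the trace with the trapezoidal rule. The data layout (a list of scans, each with a retention time and a peak list), the 5 ppm tolerance and the example scans are assumptions made for illustration; charge state matching is omitted for brevity, and the software actually used in this work may have operated differently.

```python
def xic_peak_area(scans, target_mz, ppm_tol, rt_window):
    """Integrate the extracted ion chromatogram of one precursor.

    scans: list of (rt_min, [(mz, intensity), ...]) tuples (assumed layout).
    """
    rt_lo, rt_hi = rt_window
    tol = target_mz * ppm_tol * 1e-6  # absolute m/z tolerance
    trace = []
    for rt, peaks in scans:
        if rt_lo <= rt <= rt_hi:
            intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
            trace.append((rt, intensity))
    trace.sort()
    area = 0.0
    # Trapezoidal integration over consecutive scans.
    for (rt0, i0), (rt1, i1) in zip(trace, trace[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area

# Two hypothetical acquisitions of the precursor at m/z 908.9749 near 45 min.
run_a = [
    (44.9, [(908.9750, 0.0)]),
    (45.0, [(908.9751, 100.0), (909.5000, 900.0)]),  # second peak is off-target
    (45.1, [(908.9748, 0.0)]),
]
run_b = [(rt, [(mz, 2.0 * i) for mz, i in peaks]) for rt, peaks in run_a]

area_a = xic_peak_area(run_a, 908.9749, ppm_tol=5.0, rt_window=(44.5, 45.5))
area_b = xic_peak_area(run_b, 908.9749, ppm_tol=5.0, rt_window=(44.5, 45.5))
print(area_b / area_a)
```

The ratio of the two areas is the relative amount of the peptide in run B versus run A.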

[Figure 13 plot: peak area (y-axis, 0 to 9×10^8) versus peptide amount (x-axis, 0 to 1000 fmol); R² = 0.986]

Figure 13 – Correlation between the peak area of the extracted ion chromatogram and the amount of analyzed peptide, obtained for the BSA peptide K.AEFVEVTK.L (m/z 921.4802) in a dilution series of BSA digest.
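The linearity check behind Figure 13 amounts to an ordinary least-squares fit of peak area against peptide amount and the corresponding coefficient of determination. The sketch below shows the calculation; the dilution series values are hypothetical and are not the data plotted in Figure 13.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical dilution series: amount (fmol) and XIC peak area (arbitrary units).
amounts = [10, 50, 100, 500, 1000]
areas = [9.0e6, 4.2e7, 8.8e7, 4.1e8, 8.5e8]

slope, intercept, r2 = linear_fit_r2(amounts, areas)
print(f"slope = {slope:.3g} per fmol, R^2 = {r2:.3f}")
```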

Using the relative quantification approach described above, the reversed phase UltraMicroSpin C-18 (The Nest Group) and ZIC-HILIC (SeQuant AB) spin-columns were compared for their ability to trap and recover 100 fmol of peptides loaded from a very dilute sample (10 fmol/µL). 100 fmol of trypsinized commercial bovine serum albumin (BSA) was added to the TAP buffer, loaded onto the spin column, washed and eluted, and the recovery of peptides was quantified by nanoLC-MS. Figure 14 shows that UltraMicroSpin C-18 provides higher peptide recovery than HILIC, judged by the average recovery of 6 individual peptides.

[Figure 14 bar chart: average recovery of peptides (%), y-axis 0 to 140, for conditions 1 and 2]

Figure 14 – Peptide recovery from UltraMicroSpin C-18 and HILIC based on the relative quantification of 6 different BSA peptides loaded at 10 fmol/µL (total amount loaded 100 fmol): 1 – HILIC, TAP elution buffer with 10 ng NP40 (diluted to 90% ACN); 2 – UltraMicroSpin C-18, TAP elution buffer with 10 ng NP40. Recovery was calculated from the XIC peak areas of 6 different peptides in the LC-MS profiles. Error bars reflect the differences in recovery of individual peptides.
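The recovery values and error bars of such a comparison can be computed as sketched below. Two points are assumptions rather than statements from the text: the reference is taken to be the XIC area of the same peptide from an equal amount of digest injected without cleanup, and the error bar is taken to be the standard deviation across peptides. Peptide names and areas are hypothetical.

```python
from statistics import mean, stdev

def peptide_recoveries(area_cleanup, area_reference):
    """Per-peptide recovery (%) = 100 * area after cleanup / reference area."""
    return {
        pep: 100.0 * area_cleanup[pep] / area_reference[pep]
        for pep in area_reference
        if pep in area_cleanup
    }

def summarize(recoveries):
    """Mean recovery and spread across peptides (assumed error bar)."""
    values = list(recoveries.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread

# Hypothetical XIC peak areas (arbitrary units).
reference = {"pep1": 2.0e7, "pep2": 5.0e7, "pep3": 1.0e7}
after_c18 = {"pep1": 1.4e7, "pep2": 3.0e7, "pep3": 0.7e7}

recoveries = peptide_recoveries(after_c18, reference)
avg, spread = summarize(recoveries)
print(f"average recovery = {avg:.1f}% (+/- {spread:.1f})")
```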

Performance of different reversed phase cartridges

Given that reversed phase outperformed HILIC, we decided to test the performance of three reversed phase SPE spin columns with different stationary phases: (1) UltraMicroSpin C-18, (2) Vivapure C-18 micro and (3) an in-house packed cartridge using C18-R2 Poros. The columns were tested for their ability to recover 8 different tryptic BSA peptides in a standard analytical buffer (50 mM ammonium bicarbonate, 0.1% TFA) at a loading concentration of 10 fmol/µL. Figure 15 indicates that UltraMicroSpin C-18 and the in-house packed SPE performed similarly, with an average recovery of ~65% for the 8 peptides. Since UltraMicroSpin columns are commercially available and provided recovery comparable to in-house packed columns, we purchased these for further work in order to assure batch-to-batch reproducibility.

[Figure 15 bar chart: average recovery of peptides (%), y-axis 0 to 100, for No-cleanup, UltraMicroSpin C-18, Vivapure C-18 micro and C18-R2 Poros]

Figure 15 – Peptide recovery using different C18 stationary phases in cartridge configuration. The recovery was calculated based on the relative quantification of 8 different peptides from BSA (133 fmol loaded) using the XIC peak areas from LC-MS profiles. Error bars reflect the differences in recovery of individual peptides.

Effect of the elution volume on the recovery of peptides

Different elution methods produce protein samples of varying volume and concentration, which might affect the peptide recovery of C-18 spin-columns. As a test, equimolar amounts of tryptic BSA peptides (100 fmol) were loaded in volumes of 5 µL, 50 µL, 250 µL and 500 µL, and the peptide recovery at each volume/concentration was quantified. Figure 16 shows that the yields of individual peptides did not vary significantly with loading concentration, which makes the spin-column cleanup compatible with any method of immunoaffinity isolation.

[Figure 16 bar chart: average recovery of peptides (%), y-axis 0 to 120, for loaded volumes of 5 µL, 50 µL, 250 µL and 500 µL]

Figure 16 – Effect of the loaded volume on the recovery of peptides on an UltraMicroSpin C-18 cartridge. 100 fmol of BSA digest was loaded onto the cartridges in 5 µL, 50 µL, 250 µL and 500 µL of 50 mM ammonium bicarbonate, 0.1% TFA, followed by cleanup, elution and LC-MS/MS analysis. Recovery was calculated based on the XICs of 6 different peptides from LC-MS profiles.
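For orientation, the loading concentrations implied by this experiment follow directly from the fixed amount (100 fmol) and the four loaded volumes. The short calculation below uses only those numbers from the text; the variable names are illustrative.

```python
amount_fmol = 100.0
volumes_ul = [5, 50, 250, 500]

# Loading concentration (fmol/µL) for each loaded volume.
concentration = {v: amount_fmol / v for v in volumes_ul}
fold_range = max(concentration.values()) / min(concentration.values())

for v in volumes_ul:
    print(f"{v:>3} µL -> {concentration[v]:g} fmol/µL")
print(f"concentration range covered: {fold_range:g}-fold")
```

The test therefore covers a 100-fold range of loading concentrations, from 20 fmol/µL down to 0.2 fmol/µL.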

A pipeline for protein complex analysis

We hypothesized that this simple analytical pipeline, in which proteins eluted from affinity purifications are digested in solution and the resulting peptides are concentrated offline prior to LC-MS/MS analysis, could be a robust and sensitive approach for the reliable characterization of protein complexes (Figure 17). The strategy, which we term Shotgun-LC-MS/MS, takes advantage of reduced manipulation and parallelization, consequently avoiding sample loss and improving throughput.

Figure 17 – Workflow for protein complex analysis by Shotgun-LC-MS/MS. Samples are digested directly in the eluate of the affinity purification, followed by spin column cleanup and LC-MS/MS analysis.

3.2 Shotgun LC-MS/MS outperforms geLC-MS/MS

We compared Shotgun-LC-MS/MS against conventional geLC-MS/MS in terms of their performance in characterizing a well-studied biological complex. As a test, we used the Anaphase Promoting Complex (APC) from S. cerevisiae, purified by TAP using the CDC16 subunit as bait, as described in (Schwickart et al., 2004). However, the calmodulin elution buffer that we used did not contain detergent, glycerol or protease inhibitors, which affect subsequent MS analysis. Affinity purified proteins in elution buffer were split into two equal parts. One half was digested in-solution and concentrated on a C18 UltraMicroSpin cartridge, followed by analysis using a 155 min gradient on an LTQ Orbitrap (Figure 18-b). In the other half, proteins were precipitated with methanol-chloroform as described in (Wessel and Flugge, 1984) to allow the large volume of 700 µL to be loaded onto a 4-12% precast polyacrylamide gel. The SDS-PAGE gel was stained with colloidal Coomassie (Figure 18-a) and the complete lane was cut into 15 slices; each slice was in-gel digested as described in (Shevchenko et al., 2006) and analyzed by LC-MS/MS using a 36 min gradient on an LTQ Orbitrap as described in (Junqueira et al., 2008b) (details in Materials and Methods). All MS/MS spectra from the 15 gel slices were combined, and both the shotgun and gel datasets were searched by MASCOT against the nrNCBI database restricted to S. cerevisiae taxonomy (see further details in Materials and Methods). In order to take into account additional hits from common contaminant proteins (trypsin and keratins), the datasets from both preparations were also searched against the comprehensive nrNCBI database.

Figure 18 – A sample of TAP-purified APC was split into two equal parts and analyzed by geLC-MS/MS and Shotgun-LC-MS/MS. (a) SDS-PAGE obtained after precipitation of APC. (b) Base peak chromatogram of LC-MS/MS run of in-solution digest of APC. (c) Venn diagrams showing overlap in protein identifications and unique peptides between in-gel and in-solution against nrNCBI- taxonomy S. cerevisiae. (d) Pie chart diagrams show the distribution of matched MS/MS spectra. 1- Spectra identified as trypsin and keratin by MASCOT; 2- spectra without match in the database; 3- spectra matched to protein hits. (e) Graph showing the sequence coverage for all subunits of APC identified by both methods.

Taking into account the precipitation step, running the gel, staining, in-gel digestion and LC-MS/MS acquisition, the whole geLC-MS/MS process took 10 times longer than Shotgun-LC-MS/MS. As shown in Figure 18-e, both strategies identified all predicted subunits of APC (reviewed in (Zachariae and Nasmyth, 1999)), and also Cdh1, a transient co-activator protein that regulates APC-dependent proteolysis (Visintin et al., 1997). In total, 142 and 158 proteins were identified by Shotgun-LC-MS/MS and geLC-MS/MS, respectively (Figure 18-c), most of them predicted to be unspecific interactors (Shevchenko et al., 2008). The in-gel analysis produced 31688 MS/MS spectra, of which 19% were protein hits in the nrNCBI S. cerevisiae database and 12% were trypsin and keratin contaminants. The single LC-MS/MS acquisition for the shotgun analysis produced 9295 spectra, of which 30% were matched to S. cerevisiae proteins and 5% to trypsin and keratin (Figure 18-d). Remarkably, Shotgun-LC-MS/MS delivered better sequence coverage for all APC subunits (Figure 18-e).

Why did geLC-MS/MS, which acquired 31688 MS/MS spectra, provide almost the same number of protein identifications as Shotgun-LC-MS/MS, in which a total of 9295 MS/MS spectra were acquired? To understand this, I plotted the cumulative distribution of unique peptides identified in each fraction of the gel (Figure 19). Although a total of 6046 spectra were confidently identified in the in-gel analyses, they matched only 1327 unique peptides in the S. cerevisiae database. Therefore, each peptide was identified on average more than 4 times over all slices. In contrast, the shotgun analysis resulted in a smaller number of matched spectra (2799), yet nearly the same number of unique peptides (1227) as in-gel (1327). This suggests that proteins in SDS-PAGE are not concentrated in a narrow mass range band but instead spread among different fractions of the gel, a phenomenon that has recently been characterized (Gao et al., 2008). Consequently, the same peptides are repeatedly identified in different slices.
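The redundancy argument can be made explicit with the spectrum and peptide counts quoted in the text, and the cumulative plot of unique peptides can be sketched as a running set union over gel slices. The counts are those reported here; the per-slice peptide sets in the second part are hypothetical and serve only to show the bookkeeping.

```python
def redundancy(matched_spectra, unique_peptides):
    """Average number of matched spectra per unique peptide."""
    return matched_spectra / unique_peptides

gel_redundancy = redundancy(6046, 1327)      # geLC-MS/MS, 15 slices combined
shotgun_redundancy = redundancy(2799, 1227)  # single Shotgun-LC-MS/MS run
print(f"in-gel: {gel_redundancy:.2f}, shotgun: {shotgun_redundancy:.2f}")

def cumulative_unique(slices):
    """Cumulative count of unique peptides after each gel slice."""
    seen = set()
    counts = []
    for peptides in slices:
        seen.update(peptides)
        counts.append(len(seen))
    return counts

# Hypothetical peptide identifications in three consecutive slices.
example_slices = [{"A", "B"}, {"B", "C"}, {"A"}]
print(cumulative_unique(example_slices))
```

A slice that only re-identifies peptides already seen adds spectra but leaves the cumulative count unchanged, which is why the in-gel analysis needs far more spectra to reach a similar number of unique peptides.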

[Figure 19 bar chart: cumulative distribution of unique peptides per gel fraction, y-axis 0 to 1400]

Figure 19 – Cumulative plot of unique peptides identified by geLC-MS/MS for each fraction (blue bars represent the gel fractions (slices)) in comparison to Shotgun-LC-MS/MS. Although 6046 in-gel spectra were matched to the database, they only hit 1327 unique peptides. Shotgun analysis (black bar) produced 1227 unique peptide hits compared to 1517 peptides identified in-gel. The gray bar represents the number of unique peptides identified in both strategies.

Table 5 – Physicochemical properties of APC subunits associated with sequence coverage in-gel and in-solution.

The subunits of low molecular weight were better represented in Shotgun-LC-MS/MS (Table 5). I further used label-free relative quantification to assess the peptide recovery of APC subunits in both methods. To do so, I first calculated the in-gel XIC peak areas of 3 different peptides for each subunit of the APC complex. Since peptide signals are dispersed over multiple slices, the XIC area of each peptide was determined as the sum of its area in the most intense slice and in the two adjacent slices. Next, I compared this value to the XIC peak area of the same peptide obtained by Shotgun-LC-MS/MS analysis. The corresponding ratios of XIC peak areas are reported in Figure 20. Peptides of all APC subunits were better recovered by shotgun analysis than in-gel (Figure 20), on average 5-fold for the ten large subunits and 24-fold for the 3 small subunits of APC (Swm1, Apc11, Cdc26).

[Figure 20 bar chart: ratio of XIC in-solution/in-gel (y-axis 0 to 50) for each APC subunit]

Figure 20 - Relative quantification of APC subunits obtained for in-solution against in-gel. Fold changes were calculated as the average ratios of XIC peak areas of 3 independent peptides for each subunit. For in-gel digestion, the total XIC peak area for each peptide was obtained by summing the area of the same precursor peaks detected in 3 adjacent slices.
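The fold-change calculation described in the caption can be sketched as follows: for each peptide, the in-gel area is the sum over the most intense slice and its two neighbours, and the subunit fold change is the mean of the in-solution/in-gel ratios over its peptides. The function names and the example areas are hypothetical; the handling of a maximum located in the first or last slice (fewer than three slices summed) is an assumption, since the text does not address that case.

```python
def gel_peptide_area(slice_areas):
    """Sum the XIC area of the most intense slice and its two adjacent slices."""
    top = max(range(len(slice_areas)), key=slice_areas.__getitem__)
    lo = max(0, top - 1)
    hi = min(len(slice_areas), top + 2)  # slice end is exclusive
    return sum(slice_areas[lo:hi])

def subunit_fold_change(solution_areas, gel_slice_areas):
    """Mean in-solution/in-gel XIC ratio over the peptides of one subunit."""
    ratios = [
        sol / gel_peptide_area(gel)
        for sol, gel in zip(solution_areas, gel_slice_areas)
    ]
    return sum(ratios) / len(ratios)

# Hypothetical areas for three peptides of one subunit (arbitrary units).
in_solution = [40.0, 24.0, 16.0]
in_gel = [
    [0.0, 1.0, 5.0, 2.0, 0.0],  # summed area 8.0  -> ratio 5.0
    [0.0, 0.0, 3.0, 1.0, 0.0],  # summed area 4.0  -> ratio 6.0
    [2.0, 2.0, 0.0, 0.0, 0.0],  # maximum at the edge, summed area 4.0 -> ratio 4.0
]
print(subunit_fold_change(in_solution, in_gel))
```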

The identification bias towards proteins of high MW in in-gel digests and underrepresentation of proteins with MW