UNDERSTANDING NAÏVE PLURIPOTENCY USING QUANTITATIVE MASS SPECTROMETRY
PhD Thesis
Ana Martínez del Val
DOCTORAL PROGRAMME IN MOLECULAR BIOSCIENCES
FACULTY OF SCIENCES
DEPARTMENT OF MOLECULAR BIOLOGY
Universidad Autónoma de Madrid
Madrid, 2019
Ana Martínez del Val
Degree (Licenciatura) in Biotechnology, Universidad de Salamanca
Thesis supervisor:
Javier Muñoz Peralta, PhD
Academic tutor:
Marta Izquierdo Rojo, PhD
Madrid, 2019
Thesis carried out at the Proteomics Unit, Centro Nacional de Investigaciones Oncológicas
Javier Muñoz Peralta, Doctor of Science from the Universidad de Navarra and currently head of the Proteomics Unit of the Centro Nacional de Investigaciones Oncológicas (CNIO) in Madrid,
CERTIFIES that
Ms. Ana Martínez del Val, holder of a degree in Biotechnology from the Universidad de Salamanca and a Master's degree in Bioinformatics from the Universidad Autónoma de Barcelona, has carried out under his supervision the research work entitled:
“UNDERSTANDING NAÏVE PLURIPOTENCY USING QUANTITATIVE MASS SPECTROMETRY”
and considers that the work meets all the conditions required by current legislation, as well as the scientific originality and quality necessary to be presented and defended for the degree of Doctor from the Universidad Autónoma de Madrid.
In witness whereof, and for all appropriate purposes, I sign this certificate in Madrid on 21 January 2019.
Signature of the thesis supervisor
……….
Dr. Javier Muñoz Peralta
This thesis was funded by the FPI research training fellowship BES-2014-070098 (MINECO).
ACKNOWLEDGEMENTS
No achievement is individual, least of all in science. So although it is my name that signs it, this thesis is the result of the help of many people, both inside and outside the laboratory.
To all the members of the proteomics unit, past and present, thank you. Thanks to Javier for giving me the opportunity to come to the CNIO to do this thesis. Of course, thanks to Nuria, Pilar, Fernando and Isa for welcoming me so warmly, teaching me so much and making this laboratory a second home. I do not forget all those who have shared the “fishbowl” with me, especially Cris and Ailyn: it has been wonderful to share these nine square metres with you. Of course, Cian, thanks so much for your contribution and support during this project.
I suppose it is not easy to have an aspiring scientist in the family and to put up with my speeches over Sunday lunches, so thank you to my father, my mother and my siblings. Miguel, you know well how grateful I am for all your help in the final stretch of the thesis.
It is impossible to list everyone who has been part of these four years. So, to all of you who have shared a moment with me within or beyond the walls of the CNIO…
THANK YOU!
“Science is a bit like the joke about the drunk who is looking under a lamppost for a key that he has lost on the other side of the street, because that's where the light is. It has no other choice.”
- Noam Chomsky
ARN Polimerasa (RNA Polymerase). Acrylic on canvas. Felisa del Val, 2018
TABLE OF CONTENTS
ABBREVIATIONS
ABSTRACT & RESUMEN
ABSTRACT
RESUMEN
1. INTRODUCTION
1.1. CAPTURING PLURIPOTENCY
Epigenomic regulation in Pluripotent Stem Cells
Metabolism in the Ground State of Pluripotency
Post-transcriptional and post-translational regulation
Mediator complex and enhancer regulation in naïve pluripotency through inhibition of CDK8
1.2. PROTEOMICS
Sample preparation for shot-gun proteomics
Multidimensional analysis of Protein Identification Technology
Liquid Chromatography-tandem Mass spectrometry
Electrospray ionization
Tandem Mass Spectrometry Analysis
Mass Spectrometry Platform Architecture
Protein identification
Quantitative proteomics
Label-free quantitative proteomics
Label-based quantitative proteomics
2. PROJECT AIMS
3. EXPERIMENTAL PROCEDURES
Full proteome Time Course
Sample preparation
Isobaric labelling
High pH reverse phase fractionation
LC-MS/MS
Data analysis
Metabolomics
Sample preparation
LC-MS/MS
Data extraction and Compound Identification
Metabolite quantification, data normalization and statistical analysis
Phosphoproteome Time Course
Sample preparation
Isobaric labeling
Phosphopeptide enrichment
Micro high pH reverse phase fractionation
LC-MS/MS
Data analysis
Affinity-purification coupled to mass spectrometry
Sample preparation and immune-precipitation
LC-MS/MS
Data analysis
Z-scoring and Temporal trend classification
Functional annotation
ClueGO
Motif annotation in Perseus
Icelogo
Re-analysis of published transcriptomics data
4. RESULTS
4.1. PROTEIN QUANTIFICATION: ADDRESSING THE RATIO COMPRESSION PROBLEM
The two proteomes approach
The degree of ratio compression depends on the mass analyzer used for precursor isolation
Relationship between Precision and Accuracy
The impact of precision, accuracy and replicates on the statistical significance
Importance of isobaric labeling quantification for the scope of the project
4.2. CHARACTERIZATION OF NAÏVE PLURIPOTENCY USING MASS SPECTROMETRY
Background and description of the project
4.2.1. DYNAMICS OF THE PROTEOME DURING STABILIZATION OF NAÏVE PLURIPOTENCY IN VITRO
Experimental design
CDK8i recapitulates the proteomic changes induced during 2i-based stabilization of mESCs
Temporal trends during stabilization of naïve pluripotency
Stabilization of naïve pluripotency by 2i and CDK8i is a coordinated process that increases mitochondrial capacity
4.2.2. METABOLOMIC CHARACTERIZATION OF NAÏVE PLURIPOTENCY
Differential metabolites in each pluripotent state define unique metabolic patterns
One carbon metabolism explains key differences between 2i and CDK8i
4.2.3. TIME-COURSE CHARACTERIZATION OF EARLY PHOSPHORYLATION EVENTS
Experimental design
Characterization of the phosphorylation landscape induced by 2i and CDK8i
Phosphoproteomics confirms the specificity of 2i and CDK8i, and reveals novel potential substrates
2i and CDK8i regulate common kinases
4.2.4. INTERACTOME OF THE TRANSCRIPTIONAL MACHINERY IN MOUSE EMBRYONIC STEM CELLS
RNA Polymerase II interactome of mESCs
CDK8 inhibition alters the post-transcriptional regulation of mRNA
2i affects the affinity of transcription factors towards RNA Polymerase II
Mediator and CDK8 Interactome
5. DISCUSSION
Addressing the ratio compression problem
Mass-spectrometry roadmap to naïve pluripotency
Phosphoproteomics data can be used to explain the specificity of kinase inhibitors
Mitochondrial function as a hallmark of the stabilization of naïve state of pluripotency
Cell cycle control is altered in the ground state of pluripotency
2i and CDK8i differ in the phosphorylation of the transcriptional machinery
Further perspectives for a system biology understanding of pluripotency
6. CONCLUSIONS
CONCLUSIONS
CONCLUSIONES
BIBLIOGRAPHY
APPENDICES
ABBREVIATIONS
ACN Acetonitrile
CID Collision induced dissociation
CTD Carboxy-terminal domain
DDA Data-dependent acquisition
DHB 2,5-Dihydroxybenzoic acid
ECD Electron capture dissociation
ESI Electrospray ionization
ETC Electron transport chain
ETD Electron transfer dissociation
FA Formic acid
FASP Filter Aided Sample Preparation
FC Fold change
FDR False discovery rate
GO Gene Ontology
GSEA Gene Set Enrichment Analysis
H3K27me3 Histone 3 lysine 27 trimethylation
H3K36me3 Histone 3 lysine 36 trimethylation
H3K4me3 Histone 3 lysine 4 trimethylation
H3K9me3 Histone 3 lysine 9 trimethylation
HCD Higher-energy collisional dissociation
HPLC High-performance liquid chromatography
ICM Inner cell mass
IMAC Immobilized metal ion affinity chromatography
IP Immuno-purification
iPSC Induced pluripotent stem cell
IT Ion trap
iTRAQ Isobaric tags for relative and absolute quantification
LC-MS/MS Liquid chromatography coupled to tandem mass spectrometry
LIF Leukemia inhibitory factor
m/z Mass to charge ratio
mESC Mouse embryonic stem cell
MS Mass spectrometry
MS1 Full or survey scan
MS2 Fragmentation spectrum
OCM One carbon metabolism
OT Orbitrap
PCA Principal component analysis
PSC Pluripotent stem cell
PTM Post-translational modification
Q Quadrupole
ROS Reactive oxygen species
SAM S-adenosyl methionine
SDC Sodium deoxycholate
SDS Sodium dodecyl sulfate
SILAC Stable isotope labeling by amino acids in cell culture
SL Serum/LIF
STDEV Standard deviation
TEAB Triethylammonium bicarbonate
TFA Trifluoroacetic acid
TMT Tandem mass tags
TOF Time of flight
TPR True positive rate
ABSTRACT & RESUMEN
ABSTRACT
During embryonic development, stem cells progress through different degrees of developmental potential, spanning from naïve pluripotency at the early stages to primed pluripotency after implantation. Whilst primed stem cells are prone to differentiation, naïve stem cells show higher self-renewal capacity and are considered to represent the ground state of pluripotency. Dual inhibition of GSK3 and MEK, a combination of drugs known as 2i, can capture that ground state of pluripotency in vitro. More recently, inhibition of CDK8 (CDK8i), which acts as a negative regulator of the Mediator complex at enhancers, was also shown to stabilize this condition. 2i- and CDK8i-treated cells form dome-shaped colonies, show homogeneous expression of pluripotency markers and contribute very efficiently to the formation of chimeras. Both treatments activate a similar transcriptional program that resembles the transcriptional profile of pluripotent cells from the preimplantation epiblast. However, the mechanisms responsible for these effects are not fully characterized. Here, we used quantitative mass spectrometry to explore these two transitions from four complementary angles: proteome, metabolome, phosphoproteome and interactome.
First, we profiled proteome dynamics comprehensively across seven time points in four different cell lines. These analyses revealed that 2i and CDK8i induce a similar and synchronized proteome response characteristic of the pre-implantation epiblast. Among many others, we found several proteins involved in mitochondrial metabolism to be consistently up-regulated. To further investigate the implications of this finding, we analyzed the metabolomes of mESCs adapted long-term to 2i and CDK8i and found that their metabolic signatures explain key differences between the two treatments, such as epigenetic status. Moreover, given that these events are initiated by kinase inhibition, we sought to delineate the phosphorylation cascades triggered in the early phases of the process and monitored ~14,000 phosphosites within the first 6 hours of treatment. We found that GSK3i/MEKi and CDK8i induce a rapid alteration of phosphorylation networks, mainly affecting pluripotency transcription factors and the transcriptional machinery. Finally, we studied the interactome of key proteins involved in transcriptional control (i.e., Pol II, Mediator and CDK8). As a result, we obtained a comprehensive map of the protein network that interacts with these transcriptional complexes.
Our results demonstrate that stabilization of naïve pluripotency through two different routes, proliferation/self-renewal (2i) and enhancer function (CDK8i), proceeds through similar mechanisms, suggesting that these pathways are highly interconnected in pluripotent cells.
RESUMEN (SUMMARY)
During embryonic development, stem cells display different degrees of pluripotency, ranging from the “naïve” state at the earliest stages to the “primed” state after implantation. Whereas cells in the “primed” state are more prone to differentiate, “naïve” cells have a greater potential for cell proliferation and represent the ground state of pluripotency. Inhibition of two kinases, GSK3 and MEK (a treatment known as 2i), can stabilize mouse stem cells in vitro, capturing that ground state of pluripotency. It has recently been shown that this stabilization can also be achieved by inhibiting CDK8 (CDK8i), a kinase that acts as a negative regulator of the activity of the Mediator complex at promoter regions. Treatment with 2i or CDK8i induces the formation of compact, rounded cell colonies, as well as the homogeneous expression of pluripotency markers. These cells also contribute very efficiently to the formation of chimeras. Both treatments, 2i and CDK8i, activate a similar transcriptional program that reproduces the expression profile of pluripotent cells of the pre-implantation epiblast. However, the mechanisms responsible for these processes are not fully characterized. In this thesis we used quantitative mass spectrometry techniques to explore these two transitions from four complementary perspectives: proteome, metabolome, phosphoproteome and interactome.
First, we studied proteome remodeling in four different cell lines treated with 2i or CDK8i. These analyses showed that both 2i and CDK8i induce a similar and coordinated proteome response characteristic of the pre-implantation epiblast. The proteomic analysis also revealed a clear increase in mitochondrial metabolism proteins in response to 2i and CDK8i. Prompted by this evidence of altered mitochondrial machinery, we analyzed the metabolome of these cells. The metabolomic analysis identified metabolites that differ between 2i and CDK8i and that help explain key differences between the two treatments, such as changes in DNA methylation levels. In addition, because both treatments act through kinase inhibition, we studied the changes in the phosphoproteome at the earliest stages. We monitored more than 14,000 phosphorylation sites during the first 6 hours of treatment and observed that both 2i and CDK8i induce a very rapid change in the phosphoproteome that mainly affects pluripotency markers and the transcriptional machinery. Finally, we investigated the interactome of transcriptional complexes (RNA polymerase II, Mediator and CDK8) in order to elucidate the role of transcriptional regulation in “naïve” pluripotency.
Our results demonstrate that stabilization of “naïve” pluripotency, whether by stimulating the cell proliferation and self-renewal pathways (2i) or by activating promoters (CDK8i), converges on similar mechanisms, which suggests that both processes are highly interconnected in pluripotent cells.
1. INTRODUCTION
1.1. CAPTURING PLURIPOTENCY
Pluripotency is the capacity of a cell to differentiate into all three germ layers of the body and therefore to give rise to all the specialized cell types of an organism. Pluripotency is not a steady state; on the contrary, it exists transiently during the early stages of embryogenesis, as a continuum of sequential degrees of developmental potential. Before implantation (E4.0 in the mouse), certain Nanog-expressing cells from the inner cell mass of the blastocyst will give rise to all future embryonic lineages (Silva et al. 2009). Due to their unbiased developmental potential, these cells are described as being in a “naïve” state of pluripotency. At this point of development, preimplantation epiblast cells also represent the “ground state”, a cellular condition that is free of developmental and epigenetic constraints, as manifested by homogeneous expression of key pluripotency factors, activation of both X chromosomes and global DNA hypomethylation. In contrast, upon implantation (E5.0), cells receive powerful stimuli that shift them to a “primed” state of pluripotency, in which they become epigenetically restricted and are poised towards lineage specification (Hackett and Surani 2014; Weinberger et al. 2016; Nichols and Smith 2009) (Figure 1). Some degree of developmental potential is still preserved in the adult organism in the form of multipotent adult stem cells (Reynolds and Weiss 1992; Pittenger et al. 1999; Gage 2000); however, these cells can only generate a limited number of cell types within a given lineage.
Importantly, although pluripotency is a transient condition in vivo, it can be captured indefinitely in vitro. In mouse, embryonic stem cells (mESCs) can be derived from the inner cell mass (ICM) of the preimplantation blastocyst (E4.5) (Evans and Kaufman 1981). Under defined medium conditions, mESCs preserve the unrestricted developmental potential of the pre-implantation epiblast as shown by their capacity to efficiently contribute to chimaeras and colonize the germline (Bradley et al. 1984). Another type of pluripotent stem cells, known as epiblast stem cells (EpiSCs), can be derived from the post-implantation epiblast (E5.5 to E6.5), hence capturing the “primed” and more developmentally restricted state (Brons et al. 2007; Tesar et al. 2007). Additionally, pluripotent stem cells (PSCs) can be derived from adult cells by two alternative strategies: somatic cell nuclear transfer (SCNT) (Gurdon 1962) or gene expression reprogramming (Takahashi and Yamanaka 2006) (Figure 1). The possibility to model different pluripotent states in vitro offers a unique biological system to understand cellular plasticity, which holds important implications for clinical applications.
In order to maintain self-renewal and developmental potential, mouse ESCs are cultured in the presence of fetal bovine serum (FBS) and leukemia inhibitory factor (LIF) (A. G. Smith et al. 1988; Williams et al. 1988). The Bmp4 present in serum activates Smad signaling, which induces the expression of the Id (‘inhibition of differentiation’) proteins to block pro-differentiation signals and enhance the pluripotency network. In addition, LIF stimulates self-renewal by JAK-mediated phosphorylation of Stat3. Under these conditions, mESCs are denoted as “conventional” or “Serum/LIF” ESCs. However, Serum/LIF ESCs are still sensitive to conflicting signaling via autocrine stimuli and exogenous cues that stimulate differentiation. As a consequence of this continuous reception of competing signals, conventional ESCs manifest a significant degree of morphological and molecular heterogeneity among the cell population (Hayashi et al. 2008; Toyooka et al. 2008; Chambers et al. 2007). Although Serum/LIF ESCs can give rise to all embryonic lineages following blastocyst injection, due to this metastability they cannot fully recapitulate the transcriptional and epigenomic profile characteristic of the “ground state”. Thus, conventional ESCs are defined as “naïve” pluripotent ESCs.
Figure 1. Schematic representation of the different developmental stages during embryonic development in mouse. Below, the corresponding PSCs that can be derived in vitro from each stage with the experimental conditions (either medium composition or ectopic gene expression) required to maintain their identity.
The main trigger of the metastability in conventional ESCs is Fgf4, which through autocrine secretion feeds back into the mitogen-activated protein kinase (MAPK) cascade and thereby induces differentiation programs (Kunath et al. 2007). Importantly, Ying et al. showed that dual inhibition of the MAPK/ERK kinase (Mek) and glycogen synthase kinase 3 (Gsk3) using a combination of two inhibitors (i.e., PD0325901 and CHIR99021) results in the stabilization of mESCs by suppressing differentiation signals and stimulating self-renewal (Ying et al. 2008). Inhibition of Mek kinase activity abrogates downstream activation of differentiation cues. On the other hand, inhibition of Gsk3 stimulates self-renewal by stabilization of β-catenin, which is translocated to the nucleus and abrogates Tcf3 repressor activity (Wray et al. 2011). The combination of these two kinase inhibitors is known as 2i, and stabilizes mESCs towards a homogeneous population that shows reduced expression of lineage-associated genes and a permissive epigenetic landscape that closely resembles the ICM (Marks et al. 2012; Boroviak et al. 2014). Owing to these attributes, mESCs cultured in 2i are considered to be an in vitro surrogate of the “ground” state of pluripotency. On the contrary, EpiSCs cannot proliferate in the presence of 2i since they are dependent on FGF and Activin/Nodal pathways for derivation and maintenance of the “primed” state of pluripotency (Figure 1).
Epigenomic regulation in Pluripotent Stem Cells
Epigenetic regulation is considered a hallmark mechanism for the establishment of pluripotency (Ahmed et al. 2010). Ground state ESCs are characterized by a derestricted epigenomic environment defined by a hypomethylated genome (Leitch et al. 2013) and a permissive chromatin architecture (Azuara et al. 2006; Gaspar-Maia et al. 2009, 2011).
DNA methylation is a repressive epigenetic modification that is associated with transcriptional silencing (Jaenisch and Bird 2003). During early stages of embryogenesis, DNA methylation levels are dynamically controlled, leading to a globally hypomethylated state in the ICM. In contrast, after implantation, the genome undergoes a global re-methylation wave, which contributes to lineage restriction and subsequent loss of developmental potency (Z. D. Smith et al. 2012). Thus, global hypomethylation seems to be an essential prerequisite for the establishment of a pristine genomic landscape that allows full developmental potential. Although cells derived from the ICM are initially hypomethylated, mESCs become methylated when cultured in Serum and LIF. However, 2i erases global methylation and therefore phenocopies the characteristic hypomethylated genome of the ICM (Habibi et al. 2013; Leitch et al. 2013). The mechanism responsible for this dynamic regulation of DNA methylation has been recently elucidated. Methylation of DNA is controlled by three complementary mechanisms: de novo methylation, maintenance methylation and active demethylation. Although the de novo DNA methylation machinery (i.e., Dnmt3a, Dnmt3b and Dnmt3l) is rapidly downregulated in 2i conditions (Habibi et al. 2013), von Meyenn et al. demonstrated that it is the impairment of DNA methylation maintenance that mainly causes the global erasure of methylation (von Meyenn et al. 2016). In Serum/LIF ESCs, Dnmt1 is recruited to replication foci through Uhrf1, which itself is recruited by histone 3 lysine 9 dimethylation chromatin marks (H3K9me2). Switching to 2i reduces both Uhrf1 levels and H3K9me2, which impairs the recruitment of Dnmt1 to methylate the nascent DNA strand, leading to the passive loss of DNA methylation.
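The passive loss of methylation described above can be illustrated with a short numerical sketch. It assumes a deliberately simplified model that is not taken from this thesis: at every replication the parental strand keeps its methylation, while the nascent strand is methylated only with a probability `maintenance_efficiency`. The function name and all parameter values are hypothetical and chosen for illustration only.

```python
def methylation_after_divisions(initial_level, maintenance_efficiency, divisions):
    """Return the expected fraction of methylated CpG strands after cell divisions.

    Toy model (an assumption, not the method used in the thesis): at each
    replication the parental strand keeps its methylation, while the nascent
    strand is methylated only with probability `maintenance_efficiency`.
    """
    if not 0.0 <= initial_level <= 1.0:
        raise ValueError("initial_level must be between 0 and 1")
    if not 0.0 <= maintenance_efficiency <= 1.0:
        raise ValueError("maintenance_efficiency must be between 0 and 1")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")

    level = initial_level
    for _ in range(divisions):
        # Average of the parental strand (level) and the nascent strand
        # (level * maintenance_efficiency).
        level = level * (1.0 + maintenance_efficiency) / 2.0
    return level


if __name__ == "__main__":
    # Hypothetical values: full, partial and absent maintenance methylation.
    for efficiency in (1.0, 0.5, 0.0):
        trajectory = [
            round(methylation_after_divisions(0.8, efficiency, n), 3)
            for n in range(6)
        ]
        print(f"maintenance efficiency {efficiency}: {trajectory}")
```

Under this assumption, perfect maintenance keeps the methylation level constant, whereas a complete loss of maintenance halves it at every division, which is the sense in which the demethylation is passive: it depends on replication and not on enzymatic removal of the mark.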
Furthermore, many transcription factors rely on their accessibility to DNA to regulate gene expression, and this accessibility depends on chromatin structure which, when compacted, can inhibit gene transcription. Histones, which are the proteins responsible for the compaction of the DNA, can be dynamically modified to regulate the epigenome either to activate gene expression (acetylation of histones 3 and 4, as well as H3K4me3 and H3K36me3) or to repress it (H3K27me3 and H3K9me3) (Mikkelsen et al. 2007). Interestingly, in conventional mESCs, the promoters of several developmental genes are co-occupied by both repressive (H3K27me3) and active marks (H3K4me3). These bivalent domains are thought to repress gene expression whilst poising genes for rapid activation in response to lineage-specifying signals (Bernstein et al. 2006; Sharov and Ko 2007). Recently, an integrative proteomic study revealed that PRC2-dependent H3K27me3 is widespread across euchromatin and heterochromatin in mESCs grown with 2i, which was proposed as a mechanism to preserve the genome from DNA methylation (van Mierlo et al. 2019).
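The definition of a bivalent domain given in this paragraph (co-occupancy of a promoter by the active mark H3K4me3 and the repressive mark H3K27me3) can be written as a simple decision rule. The sketch below is only an illustration: the function name, the label strings and the example promoters are hypothetical, and real analyses call mark presence from ChIP-seq signal, which is not modeled here.

```python
def classify_promoter(has_h3k4me3, has_h3k27me3):
    """Classify a promoter from the presence of two histone marks.

    H3K4me3 is treated as the active mark and H3K27me3 as the repressive
    mark; a promoter carrying both is called bivalent.
    """
    if has_h3k4me3 and has_h3k27me3:
        return "bivalent"
    if has_h3k4me3:
        return "active"
    if has_h3k27me3:
        return "repressed"
    return "unmarked"


if __name__ == "__main__":
    # Hypothetical promoters: (H3K4me3 present, H3K27me3 present).
    example_promoters = {
        "promoter_A": (True, False),
        "promoter_B": (True, True),
        "promoter_C": (False, True),
        "promoter_D": (False, False),
    }
    for name, (k4, k27) in example_promoters.items():
        print(name, classify_promoter(k4, k27))
```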
Metabolism in the Ground State of Pluripotency
Recently, metabolism has emerged as an important mechanism in the regulation of stem cell fate, not only for its role in the synthesis of precursors and the generation of the energy required to maintain continuous proliferation, but also because of the participation of metabolites in signaling and epigenetic regulation (Shyh-Chang et al. 2013; Sperber et al. 2015). Indeed, several transcriptomic studies have pointed to an important role of metabolism during the establishment of the ground state of pluripotency (Marks et al. 2012; Kalkan et al. 2017).
Regarding energy metabolism in ESCs, very clear distinctions between naïve and primed pluripotency have been described. Primed stem cells (EpiSCs) depend on glycolysis for energy production, whilst naïve stem cells (ESCs) show a bivalent metabolism that allows them to use either mitochondrial oxidation or glycolysis (W. Zhou et al. 2012). This difference in their energy production system may reflect their embryonic origin. In early stages, most ATP is produced by oxidative phosphorylation of pyruvate, lactate, amino acids and fatty acids. However, upon implantation, there is a switch towards a glycolytic metabolism, probably as a consequence of the anaerobic conditions of the embryo (Houghton et al. 1996; Folmes and Terzic 2015). The decreased oxidative potential of EpiSCs can be attributed to a repression of the enzymes that constitute the Electron Transport Chain (ETC) (W. Zhou et al. 2012; Kalkan et al. 2017). On the other hand, activation of the mitochondrial respiratory machinery during the in vitro primed-to-naïve transition has been linked to the down-regulation of Lin28 (Zhang et al. 2016) and the activation of Stat3 (Carbognin et al. 2016). Recently, Chandrasekaran et al. mapped pluripotent stem cell metabolism using Genome-Scale Network modeling. In their work, one-carbon metabolism was marked as a key differential pathway between naïve and primed states. Moreover, by using metabolomics data from Lin28-deficient cells, they identified Lin28 as the main regulator of this pathway, thus establishing an interplay between control of mitochondrial respiration and one-carbon metabolism (Chandrasekaran et al. 2017).
Moreover, as mentioned previously, metabolism is not only responsible for energy production, but it can also be a key player in epigenetic regulation. For instance, S-adenosylmethionine (SAM) metabolism is tightly linked to DNA and histone methylation in pluripotent stem cells (Shyh-Chang et al. 2013; Sperber et al. 2015). In mESCs, SAM is generated by uptake of threonine, which is converted to glycine by the threonine dehydrogenase (Tdh). Cleavage of glycine leads to the conversion of methionine plus ATP into SAM. Threonine is the only amino acid critically required for pluripotency of mESCs, since its depletion from the medium or repression of Tdh results in differentiation and loss of self-renewal capacity of mESCs (J. Wang et al. 2009). Shyh-Chang et al. showed that threonine-dependent levels of SAM correlate with H3K4me3, thus revealing a possible mechanistic link between cellular metabolism and epigenetic regulation (Shyh-Chang et al. 2013). In a complementary study, Carey et al. showed that cellular reprogramming using 2i conditions has an impact on glucose and glutamine metabolism, which in turn induces naïve cells to preferentially use glucose to maintain an elevated α-ketoglutarate to succinate ratio, which promotes chromatin demethylation (Carey et al. 2014).
Post-transcriptional and post-translational regulation
A main focus in stem cell research has been the understanding of transcriptional and epigenetic mechanisms. However, other mechanisms that may play a key role in stem cell maintenance and fate remain largely unexplored. Protein levels can be post-transcriptionally controlled through non-coding RNAs and RNA binding proteins, among others. Indeed, there is evidence that ESC fate is modulated by such mechanisms (Lu et al. 2009). For instance, the RNA binding protein Lin28 is essential for maintaining the low mitochondrial function associated with primed pluripotency. It has been shown that this effect occurs by binding of Lin28 to mRNAs coding for oxidative phosphorylation enzymes, consequently repressing their translation (Zhang et al. 2016).
Messenger RNA stability, turnover, localization and translation efficiency can also be modulated by reversible chemical modifications, such as N6-methyl-adenosine (m6A). This modification can act as a molecular switch in murine ESCs. Geula et al. showed that ablation of the RNA methylation machinery in naïve pluripotent stem cells not only preserves their identity but also hampers their differentiation competence, enhancing their naïve phenotype (Geula et al. 2015).
Moreover, post-translational modifications constitute a pivotal mechanism of protein function regulation. Indeed, 2i treatment relies on the inhibition of Mek and Gsk3 kinase activities, thereby rewiring their downstream phosphorylation network. For instance, Gsk3 phosphorylates β-catenin, which marks the protein for subsequent degradation. Inhibition of Gsk3 results in the translocation of β-catenin to the nucleus, where it activates its target genes (MacDonald, Tamai, and He 2009). On the other hand, some members of the pluripotency network are closely controlled by PTMs. This is the case of Klf2, whose degradation by Erk-mediated phosphorylation is prevented in 2i conditions (Yeo et al. 2014).
Mediator complex and enhancer regulation in naïve pluripotency through inhibition of CDK8
Transcription of all protein-coding genes, and of most non-coding RNAs, is carried out by RNA polymerase II, hereafter Pol II. The regulation of gene expression is tightly governed by transcription factors that bind to promoter or enhancer regions. These transcription factors interact with a large protein complex known as Mediator, which communicates regulatory signals to Pol II. In mammals, the Mediator complex comprises 26 subunits, and different transcription factors can interact with different subunits of Mediator, allowing thorough regulation of gene expression. Proliferating cells, such as stem cells, generally express all 26 core Mediator subunits in addition to the CDK8-kinase module. This module comprises four subunits: Cdk8, or its less-known paralog Cdk19, cyclin C (Ccnc), Med12 and Med13, and it is the only Mediator domain with enzymatic activity. The CDK8 module can dynamically interact with the Mediator complex, thus altering its structure and function. During transcription, the CDK8 module interacts with the core Mediator and triggers the release of Pol II from a paused state, priming it for transcriptional elongation (Elmlund et al. 2006) (Figure 2). This release occurs because the interaction of the CDK8 module with Mediator sterically hinders the approach of Pol II. CDK8 kinase activity is fundamental to the regulation of Mediator. However, its targets have not been extensively studied. Recently, Poss et al. identified several substrates of CDK8, among which there were several transcription-associated proteins, such as AFF4 or STAT1, and other components of the CDK8 module: CCNC, MED12 and MED13 (Poss et al. 2016). Most relevant signaling pathways related to proliferation and survival of stem cells are ultimately linked to Mediator, and some specifically to CDK8 function, such as Wnt signaling. The CDK8 module can enhance Wnt signaling by phosphorylation of E2f1. When not phosphorylated, E2f1 promotes β-catenin degradation, which represses the Wnt pathway. Therefore, CDK8 kinase activity facilitates Wnt signaling by repressing E2f1-driven β-catenin degradation (Zhao, Ramos, and Demma 2013).
Moreover, CDK8 negatively regulates super-enhancer-associated gene expression (Pelish et al. 2015). Super-enhancers are large clusters of enhancers densely loaded with Mediator, transcription factors and chromatin regulators. These regulatory regions are of paramount importance for pluripotency since they drive the expression of key cell identity genes (Whyte et al. 2013). CDK8 inhibition increases Mediator-driven recruitment of Pol II to super-enhancers. Additionally, Lynch et al. showed that during embryonic development CDK8 levels are reduced in the preimplantation epiblast. They therefore hypothesized, and went on to show, that chemical inhibition of the CDK8 and CDK19 kinases (CDK8i) stabilizes the naïve transcriptional program of mESCs. CDK8 inhibition induces the formation of homogeneous dome-shaped colonies and a transcriptional program that significantly resembles that of 2i (Lynch et al. 2019).
Most importantly, CDK8i-adapted stem cells can contribute very efficiently to chimaera formation. Hence, naïve ESCs can alternatively be stabilized by global hyper-activation of enhancers by means of CDK8 inhibition.
Figure 2: Simplified model for Mediator-dependent regulation of RNA polymerase II (Pol II) driven transcription. First, Mediator recruits Pol II and other transcription factors to the promoter. Next, Pol II starts transcription but stops after approximately 60 nucleotides, and remains “paused”. The Cdk8 module may afterwards associate with Mediator, which allows Pol II to be released from the paused state. The Cdk8 module can also interact with the Super Elongation Complex (SEC) to further regulate Pol II pause release and elongation.
1.2. PROTEOMICS
Proteins are the true molecular effectors of the information encoded in the genes. Whilst the genome is generally constant or stable, the regulation of protein presence, state and amount in a given cell state is highly dynamic. Proteomics is the integrative study of proteins in a biological system (i.e., cell, tissue or organism). Notably, proteomics faces many challenges related to the higher complexity of the proteome when compared to the genome. For instance, several protein isoforms can be encoded by a single gene, protein concentrations can vary dramatically between tissues, proteins are regulated by a wide range of post-translational modifications (PTMs) and they can also dynamically interact with each other and form complexes.
Although other proteomics approaches are currently available (e.g., protein array based methods (Poetz et al. 2005)), system-wide qualitative and quantitative characterization of the proteome currently relies on mass spectrometry (MS) technology (Aebersold and Mann 2003; Domon and Aebersold 2006; Ahmad and Lamond 2014). MS-based proteomics has grown exponentially in the last decade, mainly, but not exclusively, due to the overwhelming improvements in MS instrumentation regarding speed, sensitivity and accuracy. The advent of new MS technology has had a huge impact, for instance, on the time required to identify a full proteome. In 2013 the yeast proteome was successfully mapped in just one hour of MS analysis (Hebert et al. 2014), whilst years before this took over 144 hours of analysis (de Godoy et al. 2008). Similar improvements have been achieved for more complex proteomes, such as mammalian systems.
Nagaraj et al. achieved a depth of coverage of the HeLa proteome similar to next-generation RNA sequencing, although 12 days of MS acquisition were required to identify 9,207 protein-coding genes (Nagaraj et al. 2011). In contrast, in 2017, Bekker-Jensen et al. were able to identify 11,292 unique proteins (166,620 unique peptides) in only 34.5 hours of analysis, providing the most comprehensive map of a human cell line published to date (Bekker-Jensen et al. 2017).
There are different experimental approaches to explore the proteome using MS technology, including top-down proteomics, targeted proteomics and bottom-up proteomics. Top-down proteomics measures intact proteins to determine, for instance, splicing variants (Anderson et al. 2017) or particular PTMs (Lanucara and Eyers 2013). On the other hand, targeted approaches are directed towards analysis of pre-selected proteins, which is relevant for biomarker identification (Picotti and Aebersold 2012; Ebhardt et al. 2015). In contrast, bottom-up proteomics, also referred to as shot-gun proteomics, is the most used approach to explore the proteome at a system-wide level (Gillet, Leitner, and Aebersold 2016). Shot-gun proteomics is a strategy based on the combination of protein digestion with MS-based peptide sequencing that provides a high-throughput approach to analyze the complexity of the proteome. Shot-gun relies on four main pillars: i) sample preparation, ii) data acquisition by MS, iii) data analysis for protein identification and iv) quantification.
Sample preparation for shot-gun proteomics
The first step in most shot-gun workflows consists of the efficient extraction of proteins in a denaturing buffer. For cell lysis and denaturing purposes, detergents (e.g., SDS, SDC, RapiGest) or chaotropic agents, such as urea or guanidinium chloride, are often used (León et al. 2013). Most importantly, the identification of proteins in shot-gun proteomics relies on the identification of peptides. Peptides are chemically better suited for MS analysis (with regard to ionization and fragmentation). Thus, proteins present in the sample are first digested using a proteolytic enzyme, most often trypsin. Trypsin cleaves specifically at the carboxyl side of lysine (K) and arginine (R) residues (except when followed by proline). Tryptic peptides contain, therefore, a basic residue at the C terminus, which favors their MS analysis. Moreover, due to the homogeneous distribution of K and R throughout proteins, trypsin generates peptides that provide a reasonably complete coverage of the proteome. Still, it is common practice to complement tryptic digestion with a prior digestion using Lys-C. Indeed, tandem Lys-C/trypsin proteolysis has proven to generate fully cleaved peptides more efficiently than trypsin alone (Glatter et al. 2012), hence improving proteome coverage. The presence of salts or detergents, such as those used during the lysis and denaturing steps, can hamper digestion efficiency and may be deleterious for LC-MS analysis.
However, several experimental approaches have been developed to solve this issue. Probably the most popular is the FASP (Filter Aided Sample Preparation) method (Wiśniewski et al. 2009), in which all steps are carried out on top of a molecular-weight cut-off membrane that retains proteins but allows salts to pass through in several centrifugation steps. A more recent approach is S-trap digestion, in which proteins are retained in a micellar structure within a matrix that not only allows salts and detergents to be removed, but also accelerates the digestion (HaileMariam et al. 2018).
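The cleavage rule described above (C-terminal to K or R, except when the next residue is proline) can be illustrated with a minimal sketch. The function name `tryptic_digest` and the `missed_cleavages` parameter are hypothetical and introduced here only for illustration; real digestion is less ideal than this rule suggests.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Illustrative in silico trypsin digestion (hypothetical helper).

    Cleaves after K or R unless the following residue is P, then
    optionally joins adjacent pieces to model missed cleavages.
    """
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal piece that does not end in K/R
        pieces.append(sequence[start:])

    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


print(tryptic_digest("AKRPGKLLR"))
```

For the toy sequence AKRPGKLLR this yields AK, RPGK and LLR; the R followed by P is not cleaved.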
Shot-gun proteomics is not only powerful for characterizing the proteins of a system, but also for analysing and quantifying the dynamic regulation of the proteome by post-translational modifications (PTMs) (e.g., phosphorylation, acetylation, ubiquitination or methylation). MS analysis of PTMs can also provide site-specific information on the modification. However, modified residues are present at very low stoichiometric abundance when compared with their unmodified counterparts, which makes additional enrichment steps necessary to reduce sample complexity and achieve a comprehensive map of the PTM of interest. Phosphorylation is the best studied PTM, and it is of particular interest in the current project since the drugs used to stabilize the ground state of pluripotency are all kinase inhibitors. The most widespread technique to purify phosphorylated peptides is based on the chemical affinity of negatively charged phosphorylated peptides towards metal cations, either in the form of metal oxides (TiO2, FeO2, ZrO2) (Pinkse et al. 2004) or immobilized metal ion affinity chromatography (IMAC) (Villén and Gygi 2008; Ruprecht et al. 2015). The phosphate group adds a negative charge to the tryptic peptide, which, due to its very low pKa, is preserved even in highly acidic conditions and allows phosphopeptides to interact specifically with the positive charge of the metal oxides. The field of phosphoproteomics has also benefited from the increased sensitivity and speed of the latest generation of MS instruments, which allow the identification of over 50,000 unique phosphorylation sites in a single experiment (Sharma et al. 2014).
In addition, the recent implementation of approaches based on immunoaffinity purification using pan-specific monoclonal antibodies has allowed the study of other PTMs, such as methylation (A. Guo et al. 2014), acetylation (Svinkina et al. 2015) or ubiquitination (Udeshi et al. 2013). In this regard, when combined with shot-gun proteomics, immunopurification has enabled the identification of more than 63,000 ubiquitination sites (Akimov et al. 2018), 21,000 acetylation sites (Weinert et al. 2018) and 8,000 arginine methylation sites (Larsen et al. 2016).
Multidimensional Protein Identification Technology
The peptide mixture obtained from proteolytic digestion is analyzed by liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS). This approach consists of a chromatographic separation that reduces sample complexity prior to mass spectrometry analysis.
Commonly, the chromatography is performed by reverse phase at acidic pH, which separates peptides by their hydrophobicity. Although MS instruments are increasing their sensitivity and speed at an overwhelming pace (Meier, Brunner, et al. 2018; Hebert et al. 2014), direct analysis (i.e., single shot) of samples by LC-MS/MS is usually not sufficient to address the entire complexity of a sample, such as a cancer cell line or a tissue (Shishkova, Hebert, and Coon 2016). Hence, several chromatographic steps are combined, an approach originally known as MudPIT (Multidimensional Protein Identification Technology), in order to further reduce the complexity of samples and increase proteome coverage (Washburn, Wolters, and Yates 2001). This approach resorts to a fractionation step prior to LC-MS/MS analysis, and it has proven to provide deep coverage of a complex proteome (Bekker-Jensen et al. 2017). This pre-fractionation step needs to be complementary to the acidic pH reverse phase chromatography (Gilar et al. 2005), which can be achieved by using basic pH reverse phase (Y. Wang et al. 2011) or charge-based separations, such as strong cation exchange (Yufeng Shen et al. 2004).
Liquid Chromatography-tandem Mass spectrometry
Electrospray ionization
After eluting from the analytical column, peptides are ionized and brought into the gas phase in order to be analyzed in the MS instrument. Electrospray ionization (ESI) is the most widespread ionization approach since it can be easily coupled to HPLC separation (Whitehouse et al. 1985) and it is able to ionize large and fragile molecules without fragmentation (Fenn et al. 1989). In ESI, the eluate from the LC passes through a thin capillary, where a voltage is applied. The high voltage applied in the capillary induces the positively charged peptides to concentrate and form the so-called Taylor cone, which emits a fine spray of charged droplets. These droplets are moved through a gradient of pressure and potential, during which the solvent starts to evaporate, reducing their volume. Finally, the evaporation of the solvent produces an increase of the surface charge density of the droplets which results in an electric field strong enough to desorb solute ions and eject them into the MS.
Tandem Mass Spectrometry Analysis
Peptide identification in shot-gun proteomics is done by tandem mass spectrometry analysis.
Tandem MS analysis or MS/MS analysis consists of two rounds of mass spectrometry analysis. In the first MS step (MS1), the mass-to-charge ratio (m/z) of the peptides is measured. However, the m/z alone does not provide enough information to deduce the primary amino acid sequence of a candidate peptide. Therefore, a subsequent MS analysis step (MS2) is performed, in which selected peptides are fragmented and the resulting ions are measured generating a tandem MS spectrum (Nesvizhskii 2007). The data-dependent approach (DDA), which is employed throughout this thesis, consists of selecting the most abundant species detected in the MS1, which will be sequentially isolated and fragmented. As a result, DDA provides a list of m/z for the most abundant peptides together with their fragmentation spectra.
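As a minimal sketch of the two ideas in this paragraph, the snippet below computes the m/z of a protonated peptide from its neutral mass and picks the most intense MS1 peaks for fragmentation. The function names and the peak list are hypothetical; actual DDA methods also apply charge-state filters and dynamic exclusion, which are omitted here.

```python
PROTON_MASS = 1.007276  # Da, mass of one proton


def mz(neutral_mass, charge):
    """m/z of a peptide carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def select_top_n(ms1_peaks, n):
    """Return the n most intense (m/z, intensity) peaks, most intense first.

    Simplified stand-in for data-dependent precursor selection.
    """
    return sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)[:n]


# Hypothetical MS1 scan: (m/z, intensity) pairs
scan = [(400.2, 1e5), (500.3, 5e6), (650.4, 2e6)]
print(round(mz(1000.0, 2), 6))   # 501.007276
print(select_top_n(scan, 2))     # the two most intense precursors
```

The same neutral mass therefore appears at a different m/z for each charge state, which is why the precursor charge has to be determined before the peptide mass can be inferred.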
Roepstorff and Fohlman, and afterwards Johnson et al., defined a nomenclature to annotate the ions resulting from peptide fragmentation (Figure 3) (Roepstorff and Fohlman 1984; Johnson et al. 1987). The type of fragment depends on the fragmentation method employed. Ions a, b and c are produced if the charge is retained on the N-terminal fragment. Likewise, if the charge is retained on the C-terminal fragment, the resulting ions are named x, y and z. Collision-induced dissociation (CID) is a standard fragmentation technique in which protonated peptides collide with a neutral gas in the fragmentation cell. These collisions increase the vibrational energy of the bonds, which ultimately leads to dissociation at the amide bond. The rupture of NH-CO bonds leads to ions of the y and b series (Schroeder et al. 2004). Higher-energy C-trap dissociation (HCD) is similar to CID, but employs higher energy and shorter activation times (Olsen et al. 2007). On the other hand, electron-capture dissociation (ECD) (Zubarev, Kelleher, and McLafferty 1998) or electron transfer dissociation (ETD) (Syka et al. 2004) lead to the cleavage of the NH-Cα backbone bonds and generate c- and z-type fragment ions. ETD fragmentation has been shown to improve site assignment of CID-labile post-translational modifications (PTMs), such as phosphorylation (Riley et al. 2017; Chi et al. 2007) or arginine methylation (H. Wang et al. 2009).
Figure 3. Schematic representation of a peptide with the possible ions that can result after fragmentation. C-terminal fragments are of type x, y or z; whilst their complementary N-terminal fragments are of type a, b and c, respectively.
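To make the b and y series concrete, the following sketch computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. The helper `by_ions` is hypothetical and ignores modifications, neutral losses and higher charge states.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # Da
PROTON = 1.007276   # Da


def by_ions(peptide):
    """Singly charged b and y fragment m/z values (hypothetical helper).

    b_i holds the first i residues; y_i holds the last i residues plus water.
    """
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b, y = by_ions("GAK")
print([round(m, 4) for m in b])
print([round(m, 4) for m in y])
```

For a tryptic peptide ending in lysine, y1 is expected at m/z 147.1128, and each complementary pair satisfies b_i + y_(n-i) = M + 2 protons, where M is the neutral peptide mass.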
Mass Spectrometry Platform Architecture
MS instruments include different variants of a capillary or transfer tube to direct the ions ejected from the source into the high vacuum of the MS. Subsequently, different configurations of ion optics are employed to focus ions while discarding neutrals. An example is the ion funnel, which consists of a series of cylindrical ring electrodes in which radio frequency voltages of opposite polarity are applied on adjacent electrodes. This arrangement focuses ions and disperses non-charged molecules. A further implementation that improves ion focusing is the bent flatapole, implemented in the Q Exactive instruments. This flatapole is bent at 90° from the source and, because of this arc, neutral particles cannot follow the bent path and are discarded (Figure 4a).
Different types of MS analyzers are combined in the instruments that are employed in shot-gun proteomics. These analyzers use different physical properties to measure the m/z of the ions.
Some of the most common are the time of flight (TOF), the ion trap (IT), the quadrupole (Q) and the Orbitrap (OT).
IT: The ion trap is a device that stores ions using an oscillating electric field and measures their m/z by ‘resonant ejection’. This is achieved using an RF quadrupolar field in which only ions of a determined m/z are stable; the remaining ions are expelled. Ion traps possess high sensitivity, speed and robustness. In addition, they have the advantage over other MS analyzers that they can also fragment ions. However, the resolution that can be achieved with this type of analyzer is much lower than with current OT or TOF instruments. Moreover, ion traps are limited by the “1/3 rule”, which restricts the lowest fragment m/z that can be measured to one third of the m/z of the precursor.
Q: the mass-to-charge ratio of an ion affects the way it travels through a magnetic or electric field. This is the main principle of quadrupole instruments. A quadrupole consists of 4 parallel metal rods in which each opposite pair is connected electrically, so a combination of DC and RF voltage can be applied on each pair. Only the ions of a given m/z ratio have a stable trajectory for a given RF and DC voltages. Also, the quadrupole can be used in hybrid MS architectures to guide ions when the DC voltage is switched off.
OT: an Orbitrap mass analyzer operates by radially trapping ions around a central spindle electrode (Makarov 2000). As the ions rotate around the inner electrode, they oscillate along its axis with a frequency characteristic of their m/z ratio. Acquisition of transients and Fourier transformation of that signal yield frequencies and their intensities. The Orbitrap is characterized by its high resolution, which has improved further through successive generations. In 2012, the Orbitrap Elite (Michalski et al. 2012) presented a new Orbitrap with twice the resolving power of the previous one for the same transient length. The latest generation of Orbitrap instruments can provide a resolving power of up to 500,000 FWHM.
TOF: In contrast to the other mass analyzers described here, time-of-flight instruments measure the mass-to-charge ratio of ions as a function of their velocity when accelerated equally by an electric field. Ions with different mass-to-charge ratios have different velocities and, therefore, arrive at the detector at different times. Modern TOF instruments include an electrostatic reflector, called a reflectron, that improves resolution without increasing the dimensions of the instrument. The reflectron acts as a retarding field that corrects the kinetic energy dispersion of ions with the same m/z. Ions with higher kinetic energy penetrate more deeply into the reflector, so they spend more time there than ions with lower energy.
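The time-of-flight principle can be written out as a short derivation for an idealized linear instrument. The symbols are assumptions introduced for illustration: U is the accelerating potential, L the length of the field-free flight path, e the elementary charge, m the ion mass and z its charge number.

```latex
% Idealized linear TOF: all ions gain the same kinetic energy per charge
\begin{align}
  z e U &= \tfrac{1}{2} m v^{2}
    && \text{(energy gained in the accelerating field)} \\
  v &= \sqrt{\frac{2 z e U}{m}}
    && \text{(velocity in the field-free region)} \\
  t &= \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
    && \text{(flight time)} \\
  \frac{m}{z} &= \frac{2 e U\, t^{2}}{L^{2}}
    && \text{(mass-to-charge ratio from the measured time)}
\end{align}
```

The flight time therefore scales with the square root of m/z. Under this idealization, any spread in initial kinetic energy among ions of the same m/z translates into a spread in arrival time, which is the dispersion the reflectron is designed to correct.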
Figure 4: (a) Schematic representation of a Q-Exactive HF-X (Kelstrup et al. 2018). (b) Schematic representation of a QTOF Impact (Beck et al. 2015).
Modern mass spectrometers are built using a combination of these analyzers. Currently, popular instruments in the field of proteomics consist of a quadrupole coupled to a high-resolution MS detector, such as an Orbitrap (Figure 4a) or a TOF (Figure 4b). The quadrupole can be used to transmit the entire mass range and act as an ion beam guide for MS1 analysis, or it can serve as a mass filter, transmitting only a defined mass window around the selected precursor for MS2 analysis. Other hybrid MS architectures, such as the LTQ Orbitrap Velos (Olsen et al. 2009) or the Orbitrap Fusion (Senko et al. 2013), employ an ion trap which can act both as a mass filter and as a mass analyzer. MS devices can have a collision cell dedicated to ion fragmentation, such
as Q Exactive instruments that implement a cell to perform HCD fragmentation. Also, in the LTQ Orbitrap Velos CID fragmentation is performed in a modified ion trap called “Dual pressure ion trap” that separates the events of ion capture and fragmentation by CID. On the other hand, specific ionization sources are required to perform ETD fragmentation, such as the one implemented in the Orbitrap Elite and Orbitrap Fusion Lumos.
Protein identification
Tandem MS analysis provides information about the m/z of the peptides and the fragments of the most abundant species selected in each MS1 scan (Figure 5a). These data contain the information needed for peptide identification and, therefore, for inferring the proteins present in the sample.
Protein identification is performed by algorithms called “search engines”. Some popular search engines are Sequest (Wolters, Washburn, and Yates 2001), Andromeda (Cox et al. 2011) and MS Amanda (Dorfer et al. 2014). Peptide sequences are identified by comparing each tandem MS spectrum against spectra derived from a theoretical database. This database normally comprises all the proteins that can be expected in the sample, which can be retrieved from public repositories, such as UniProt. Search engines digest the proteins of the database in silico with the same enzyme used experimentally in order to recreate all the possible peptides present in the sample. Afterwards, for each peptide, fragment ions are calculated to generate theoretical tandem MS spectra. This information is matched against every MS1 and MS2 spectrum.
Peptides are identified when their m/z coincides, within a certain mass tolerance window, with the theoretical m/z of a peptide and the experimental MS2 spectrum matches, with a certain degree of confidence, the theoretical spectrum (Figure 5b) (Nesvizhskii and Aebersold 2005; Nesvizhskii 2007). Each identified spectrum is designated as a Peptide Spectrum Match (PSM) and, finally, identified peptides are assembled into proteins. Importantly, a protein can only be confidently identified if a peptide that unambiguously belongs to this protein is identified. In contrast, if several proteins are identified by shared peptides and, consequently, cannot be distinguished from each other, they are reported as a “protein group”.
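The first matching criterion, agreement of the precursor m/z within a mass tolerance window, is usually expressed in parts per million (ppm). The sketch below shows this filter only; the function names, the 10 ppm default and the peptide labels are hypothetical, and the subsequent MS2 scoring that search engines perform is not represented.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def candidates_within_tolerance(observed_mz, theoretical, tol_ppm=10.0):
    """Peptides whose theoretical m/z lies within tol_ppm of the precursor.

    `theoretical` maps a peptide label to its theoretical m/z.
    """
    return [
        peptide
        for peptide, t_mz in theoretical.items()
        if abs(ppm_error(observed_mz, t_mz)) <= tol_ppm
    ]


# Hypothetical candidates from an in silico digest
database = {"PEPTIDE_A": 500.000, "PEPTIDE_B": 500.020}
print(candidates_within_tolerance(500.002, database))  # ['PEPTIDE_A']
```

Here the first candidate deviates by 4 ppm and is retained, whereas the second deviates by about 36 ppm and is discarded before any fragment comparison takes place.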
Figure 5: (a) Schematic diagram of the experimental workflow for a typical shot-gun proteomics experiment. First, proteins are extracted and digested using trypsin. The resulting peptide mixture is analyzed by LC-MS/MS. The data-dependent acquisition method consists of two steps: MS1 and MS2. During MS1, the mass-to-charge ratios of all eluting peptides are measured. In the next step (MS2), the most abundant precursors are isolated and fragmented sequentially to measure their fragment ions. (b) Schematic diagram of the search engine workflow used to identify the proteins from an MS analysis. The search engine uses a strategy known as Target-Decoy database search, in which it takes the database of the organism of choice (target) and generates a decoy database by reversing the sequences. Then, the target and decoy proteins are digested in silico and their theoretical fragment ions are calculated. Finally, each precursor selected in the DDA analysis is compared with the peptides from the Target-Decoy database that match its m/z, and the fragments are used to select the peptide that best fits the experimental spectrum.
Importantly, spectra can be wrongly assigned to peptides and, therefore proteins can be misidentified. This problem is magnified considering that search engines process thousands and even millions of spectra in each experiment. This “multiple testing problem” demands the implementation of appropriate methods to estimate the rate of incorrect peptide assignments and subsequent protein misidentifications. Search engines implement algorithms, such as Percolator (Käll et al. 2007), that estimate the proportion of wrongfully identified spectra by using a strategy known as target-decoy database search (Elias and Gygi 2010; Nesvizhskii 2010). This strategy is based on the premise that obvious, necessarily incorrect “decoy” sequences added to the search space will correspond with incorrect identifications. Technically, it consists of searching the experimental spectra against two databases: the target database that contains the expected protein sequences and the decoy database that contains “false” protein sequences. There are many alternatives to generate the decoy database, for instance, by reversing the sequences in the
“forward” database. The decoy database is processed in parallel to the target database as explained previously (i.e., in silico digestion and generation of fragments). During the search, each spectrum is compared against the sequences from both databases. If a spectrum is matched with a sequence from the “decoy” database, it is deemed a false positive. Afterwards, all PSMs are sorted by their identification confidence score in order to determine a threshold that limits the proportion of false positives allowed in the results. The proportion of decoy assignments passing the selection criteria is referred to as the False Discovery Rate (FDR), which is normally fixed below 1% at the PSM, peptide and protein level.
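The thresholding step can be sketched as follows, assuming that the FDR is estimated as the number of decoy PSMs divided by the number of target PSMs above a score cutoff (one common convention; search engines and tools such as Percolator use more elaborate estimates). The function name and the toy scores are hypothetical.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Lowest score cutoff at which decoys/targets stays within max_fdr.

    `psms` is a list of (score, is_decoy) pairs; higher scores are better.
    Returns None if no cutoff satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least `score`
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Toy example: four target PSMs and one decoy PSM
toy = [(10, False), (9, False), (8, False), (7, True), (6, False)]
print(score_threshold_at_fdr(toy, max_fdr=0.25))  # 6
print(score_threshold_at_fdr(toy, max_fdr=0.0))   # 8
```

With a permissive 25% limit all five PSMs pass (one decoy per four targets), whereas a limit of zero stops just above the first decoy hit.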
Quantitative proteomics
A complete understanding of biological processes cannot be limited to the qualitative assessment of the proteins present in a sample. Biological effects normally depend on the dynamics of protein expression. Therefore, over the past years there has been extensive development of tools to measure and quantify the dynamics of the proteome. The approaches used to quantify relative changes of protein expression between experimental states can be classified as label-free and label-based.
Label-free quantitative proteomics
Label-free is the most straightforward approach since it does not require extra steps during sample preparation. Although spectral counts were commonly used in the past, currently, most algorithms devoted to label free quantification use the area of the chromatographic peak of a peptide to compare the relative abundance of proteins between different experimental conditions (Figure 6a).
MaxQuant is often used to perform label-free quantification (Cox and Mann 2008). MaxQuant implements, among many functions, feature recognition and ‘match between runs’. A ‘feature’ in an MS spectrum is defined by the mass and the intensity of a peptide peak. MaxQuant detects peaks in each MS scan by fitting a Gaussian peak shape to each isotope of the peptide envelope, and then assembles them into three-dimensional peaks over the m/z-retention time plane. Secondly, ‘match between runs’ allows peptide identifications to be transferred from one sample to non-sequenced or unidentified peptides in other samples. This highlights the importance of a reproducible LC-MS system, since the matching works best when the retention times between runs are made comparable by alignment. Finally, MaxQuant also implements the MaxLFQ approach to determine protein intensity and normalize abundance between runs (Cox et al. 2014).
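The basic quantity behind intensity-based label-free quantification, the area of a chromatographic peak, can be illustrated with a simple trapezoidal integration. This is a deliberately simplified sketch and not the MaxQuant or MaxLFQ algorithm; the function name and the chromatogram values are hypothetical.

```python
def peak_area(retention_times, intensities):
    """Area under an extracted ion chromatogram by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(retention_times)):
        dt = retention_times[i] - retention_times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * dt
    return area


# Hypothetical elution profiles of one peptide in two conditions
rt = [0.0, 1.0, 2.0]
control = [0.0, 2.0, 0.0]
treated = [0.0, 4.0, 0.0]
print(peak_area(rt, treated) / peak_area(rt, control))  # 2.0
```

The ratio of areas between runs gives the relative abundance of the peptide, provided that the same feature has been matched across runs and intensities have been normalized.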
Label-based quantitative proteomics
Label-based quantitative approaches are based on labeling each sample chemically or metabolically so that all samples can be mixed together and analyzed simultaneously by MS. An important advantage of label-based approaches is that samples can be pre-fractionated to increase the depth of the analysis without affecting, a priori, the accuracy and precision of the quantitative measurements. Depending on the method used to label the samples, we can differentiate between metabolic and chemical labeling (Figure 6).
Metabolic labeling methods, such as SILAC, are based on the biological incorporation of isotope labels into proteins in living cells. This is normally achieved with stable isotope-labeled amino acids, such as arginine and lysine, that are added to the culture medium of growing cells (Ong et al. 2002). This allows samples to be mixed immediately after collection and processed simultaneously, which minimizes experimental variability due to sample preparation.
Nevertheless, SILAC approaches are limited to the comparison of two or three cellular states (Figure 6b).
Although there is a wide range of chemical labeling techniques (e.g., ICAT, dimethyl, 18O labeling), in this project we will focus on isobaric labeling. Isobaric labeling stands out among other quantitative approaches for its high multiplexing capacity. Currently, two types of isobaric labeling reagents are commercially available: iTRAQ (‘isobaric tags for relative and absolute quantification’) (Ross et al. 2004) and TMT (‘tandem mass tag’) (Thompson et al. 2003). iTRAQ reagents are available as iTRAQ 4plex, which allows the comparison of four experimental conditions, and iTRAQ 8plex, for up to eight experimental conditions. On the other hand, TMT multiplexing capacity ranges from TMT 6plex, for up to six conditions (Dayon et al. 2008), to TMT 11plex, for up to eleven conditions (McAlister et al. 2012).
In contrast to SILAC, in which proteins are labeled, isobaric reagents label peptides from distinct experimental conditions with tags that have identical overall mass but vary in the distribution of heavy isotopes. Isobaric reagents consist of three parts: an amine-reactive group, a ‘mass normalizer’ and a ‘mass reporter’. The amine-reactive group links the reagent to the amine group of the lysine side chain and of the peptide N-terminus. The total mass of the normalizer and reporter is the same for all reagents; however, they are linked by a cleavable bond that releases the ‘mass reporter’ after fragmentation. The mass reporter is unique for each tag, which allows each experimental condition to be quantified (Ross et al. 2004) (Figure 6c).
Despite the great benefits that come with multiplexing, isobaric labeling quantification is affected by an issue that undermines its accuracy. During precursor selection for fragmentation, the desired peptide is co-isolated with other peptides whose m/z falls within the isolation window. As a consequence, the precursor is co-fragmented with undesired peptides and, therefore, the reporter ions of all species are mixed, leading to a biased measurement. Since the vast majority of the proteome remains unaltered upon perturbation of a biological system, differential peptides are commonly co-isolated with peptides from non-changing proteins. As a result, the constant background contribution of the non-changing peptides compresses the ratios of the differential one towards unity. This issue, commonly known as ‘ratio compression’, can have a strong impact on accuracy, as actual fold-changes of 10 can result in a measurement of 3.5 (Ting et al. 2011).
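The arithmetic behind ratio compression can be illustrated with a minimal model in which co-isolated, non-changing peptides add the same reporter signal to both channels. The function name is hypothetical, and the background value used below is chosen only so that the example reproduces the 10 to 3.5 compression quoted above; it is not taken from Ting et al.

```python
def observed_ratio(true_ratio, background):
    """Reporter ratio measured when equal background is added to both channels.

    The target peptide contributes `true_ratio` to one channel and 1 to the
    reference channel; `background` is the interfering signal, expressed
    relative to the target's reference-channel intensity.
    """
    return (true_ratio + background) / (1.0 + background)


print(observed_ratio(10.0, 0.0))            # 10.0, no interference
print(round(observed_ratio(10.0, 2.6), 2))  # 3.5
```

In this model the measured ratio tends towards 1 as the background grows, regardless of the true fold-change, which is why narrower isolation windows or additional purification of the precursor reduce the bias.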
Figure 6: Schematic representation of the quantification approaches most commonly used in shot-gun proteomics. (a) Label-free quantification consists of processing several samples in parallel and acquiring the MS data independently for each one. (b) Metabolic labeling, such as SILAC, consists of labeling the proteins in living cells. After labeling, samples from different experimental conditions are mixed, subjected to sample preparation and analyzed. (c) Isobaric labeling: samples are digested in parallel and peptides from each experimental condition are labeled with an isobaric reagent. Afterwards, samples are mixed. Peptides from different conditions elute simultaneously, providing a single MS1 signal. After the peptide is selected for fragmentation, the reporter ion from each isobaric reagent is released.