3. PROPOSAL
3.1. Procedure design
3.1.1. Reception and distribution of correspondence
It was hypothesised that previous studies of the CSF endopeptidome had only scratched the surface and that further discoveries could be made, especially if peptides in the lower concentration ranges could be made accessible for detection. As previously noted, one of the major issues in MS-based studies of biological materials is the sheer number of analytes present, and in particular the fact that a few analytes are present in vast excess [9, 93, 118, 235, 360]. To attempt to overcome this hurdle, extensive sample preparation protocols for pre-treatment, cleaning/de-salting and deconvolution of endogenous CSF peptides for analysis by LC-MS/MS were developed, together with software-based strategies for identifying peptides from the resulting data.
A protocol previously developed in-house by Mikko Hölttä et al. (2012) [228] for selective purification of endogenous peptides from CSF, based on MWCO-filtration, was further developed and optimised. Primarily, the chemical pre-treatment step was modified, and off-line RP-HPLC pre-fractionation, based on the work of Tanveer Batth and colleagues (2014) [262], was introduced following peptide purification (MWCO-filtration). On-line RP-HPLC was carried out over a 180 min gradient on an Ultimate 3000 RSLC nano-flow system, allowing for a high degree of peptide separation prior to MS/MS analysis performed on an Orbitrap Fusion Tribrid mass spectrometer (both from Thermo Scientific). Finally, the MS/MS data were analysed using a combination of three different peptide identification algorithms. A schematic representation of the workflow is shown in figure 13.
Figure 13: The workflow developed in paper I for attempting identification of as large a section as possible of the CSF endopeptidome. Brief step-by-step explanation: 1) sample extraction by means of lumbar puncture followed by removal of insoluble CSF components by means of sedimentation through centrifugation, 2) chemical sample pre-treatment meant to denature higher protein structures, optional addition of standards or isobaric labelling, 3) removal of proteins through ultrafiltration (MWCO), 4) removal of soluble CSF components and excess reagents as well as small hydrophobic CSF components such as lipids, 5) off-line peptide pre-fractionation and concatenation over an alkaline mobile phase gradient, 6) on-line LC-MS/MS analysis of each concatenated fraction, 7) peptide identification through processing of raw MS/MS data by three peptide identification algorithms, comparison of IDs and evaluation of results.
Since the amount of sample material that can be analysed at a time by nano-HPLC MS/MS is limited (this, rather than the sensitivity of the instrument, being one of the defining caps on detection of low-abundant peptide species), splitting the total peptide content into sub-fractions allowed for an increase in the total amount of initial sample material used in the protocol [257, 353]. Thus, employing pre-fractionation increased the relative concentration of each individual peptide species, aiding in the detection of low-abundant peptides [257, 258, 262, 361]. Another effect of fractionating the sample prior to analysis is that the complexity (in practice defined as the number of individual peptide species eluting from the HPLC column at any given gradient time-point) is reduced, giving the MS a better opportunity to detect each analyte [251, 256, 257].
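The concatenation step of the pre-fractionation can be illustrated with a minimal sketch. It assumes a round-robin pooling scheme, in which collected fractions that elute far apart in the off-line gradient are combined into the same pool; the numbers of collected fractions and pools used here are hypothetical and are not taken from the protocol.

```python
def concatenate_fractions(n_collected: int, n_pools: int) -> list[list[int]]:
    """Assign collected fraction indices (0-based) to pools in round-robin order.

    Assumption: early, middle and late eluting fractions are combined so that
    each pool spans the whole off-line gradient.
    """
    pools: list[list[int]] = [[] for _ in range(n_pools)]
    for fraction in range(n_collected):
        pools[fraction % n_pools].append(fraction)
    return pools


# Hypothetical example: 12 collected fractions concatenated into 4 pools.
pools = concatenate_fractions(n_collected=12, n_pools=4)
for pool_index, members in enumerate(pools):
    print(f"pool {pool_index}: fractions {members}")
```

Each pool is then analysed in a separate on-line LC-MS/MS run, so the number of pools sets the instrument time required per sample.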
The last alteration made to the original protocol involved identification of peptides from MS/MS data. Compared to peptides generated from proteins by means of proteolytic degradation in vitro (proteomics), endogenous peptides tend to receive a low identity score when proteomic software is employed for MS/MS data analysis [173, 206, 285, 352]. This issue was addressed partly by engaging a machine learning feature, Percolator™, available in ion-fingerprinting-based software (Mascot, SequestHT), and partly by analysing the data with a third software package based on de novo sequencing (PEAKS).
Percolator adapts the scoring algorithm iteratively, based on the common features of the subsets of most and least confident (highest/lowest scoring) peptide-spectrum matches (PSMs) [362, 363]. Since the proteomics software packages employed were originally developed for identifying peptides generated through proteolytic activity in vitro, they may be ill-adapted for endogenous peptides, particularly if said peptides contain PTMs [24]. The Percolator feature seemed to alleviate, to some degree, the inherent problems/incompatibilities that arise when performing peptidomics with tools developed for proteomics (see figure 14).
Figure 14: Comparison of the number of identified endogenous (A, C) and tryptic (B, D) peptides with Mascot (A, B) and SequestHT (C, D). PSMs were scored using the default scoring algorithm of the respective software (blue) as well as the machine learning-based algorithm, Percolator (yellow).
A third algorithm, employed by the de novo sequencing software PEAKS, was also engaged to evaluate how well this strategy identifies endogenous peptides. Very briefly, de novo sequencing algorithms go through individual MS/MS spectra and, by measuring the distance (in m/z) between detected peaks of fragment ions, sequentially "build" the sequence one (or a few) amino acids at a time, adding a so-called "sequence tag" to each spectrum, including a confidence value for the particular peptide sequence [364]. Following the initial sequencing, the peptides can be compared to a database. However, the basic concept of de novo sequencing allows for unbiased identification of peptide sequences⁵ and is especially useful for identifying peptides containing PTMs and for pinpointing their most likely position [365, 366].
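The principle of reading a sequence from the spacing between fragment ion peaks can be illustrated with a minimal sketch. It is not the PEAKS algorithm: it assumes singly charged fragments from a single ion series, a noise-free peak list and a fixed mass tolerance, and the peak list in the example is synthetic. The residue masses are standard monoisotopic values; leucine and isoleucine are isobaric and are therefore reported together.

```python
# Monoisotopic residue masses (Da) of the standard amino acids.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_gap(gap: float, tol: float = 0.01) -> list[str]:
    """Return every residue whose mass matches the gap within the tolerance."""
    return [aa for aa, mass in RESIDUE_MASSES.items() if abs(mass - gap) <= tol]


def read_ladder(peaks: list[float], tol: float = 0.01) -> list[list[str]]:
    """Translate the gaps between consecutive fragment peaks into candidate residues.

    Assumption: all peaks are singly charged and belong to one ion series.
    """
    ordered = sorted(peaks)
    return [residues_for_gap(high - low, tol) for low, high in zip(ordered, ordered[1:])]


# Synthetic ladder: an arbitrary starting peak followed by gaps of G, A and S.
ladder = read_ladder([100.0, 157.02146, 228.05857, 315.09060])
print(ladder)
```

An empty list for a gap would indicate that no single unmodified residue explains the spacing, which is the situation in which a real algorithm would consider PTMs or combinations of residues.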
Further developing the protocol for peptide purification and acquisition, and employing pre-fractionation of the sample, resulted in a near 10-fold increase in acquired MS/MS spectra during analysis compared to the previous protocol. Depending on which individual proteomics software was employed (see figure 15), the 10-fold increase in MS/MS spectra translated to a similar increase in PSMs and a 5- to 8-fold increase in actual peptide identifications. Importantly, we were also able to show that it was possible to combine the identification results of the software packages employed without running into multiple testing issues (more precisely, that the extent of the multiple testing issues was smaller than the set FDR level of 1%).
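How an FDR threshold of 1% limits the number of accepted PSMs can be illustrated with a minimal sketch. It assumes a target-decoy estimate, in which the FDR at a given score threshold is approximated by the ratio of decoy to target PSMs above that threshold; the search engines used here apply their own implementations, and the scores below are synthetic.

```python
def accepted_targets(psms: list[tuple[float, bool]], max_fdr: float = 0.01) -> int:
    """Return the largest number of target PSMs that can be accepted at max_fdr.

    psms: (score, is_decoy) pairs; a higher score means a more confident match.
    Assumption: FDR at a threshold is estimated as decoys / targets above it.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


# Synthetic example: 200 high-scoring targets, then 3 decoys, then 50 weaker targets.
synthetic = (
    [(1000.0 - i, False) for i in range(200)]
    + [(800.0 - i, True) for i in range(3)]
    + [(700.0 - i, False) for i in range(50)]
)
n_accepted = accepted_targets(synthetic, max_fdr=0.01)
print(n_accepted)
```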
Because the total identification overlap, or consensus, between search algorithms was small (less than 15%), we concluded that not combining the resulting peptide identifications would result in substantial information loss. Hence, we found it reasonable to consider the identification results complementary rather than comparative, and combined all unique peptide IDs into a single library of endogenous CSF peptides containing 18,031 entries, resulting from three separate analyses of pooled CSF from two different sources (all data available via ProteomeXchange under identifier PXD004863). The approach has been suggested previously by, among others, Shteynberg et al. (2013) as an option to optimise data utilisation in shotgun proteomics [367]. However, we were the first to show the substantial benefit when studying endogenous peptides.
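The distinction between the consensus and the combined library can be illustrated with a minimal sketch. It assumes that each search engine reports a set of unique peptide identifiers and that the consensus is expressed as a fraction of the combined set; the identifiers below are hypothetical placeholders and not results from the study.

```python
def combine_ids(results: dict[str, set[str]]) -> tuple[set[str], set[str], float]:
    """Return the combined library, the consensus set and the consensus fraction.

    results: unique peptide identifiers reported by each search engine.
    """
    if not results:
        return set(), set(), 0.0
    id_sets = list(results.values())
    library = set().union(*id_sets)          # IDs found by at least one engine
    consensus = set.intersection(*id_sets)   # IDs found by every engine
    fraction = len(consensus) / len(library) if library else 0.0
    return library, consensus, fraction


# Hypothetical identifiers standing in for peptide sequences.
example = {
    "Mascot": {"PEP_A", "PEP_B", "PEP_C"},
    "SequestHT": {"PEP_B", "PEP_C", "PEP_D"},
    "PEAKS": {"PEP_C", "PEP_E", "PEP_F"},
}
library, consensus, fraction = combine_ids(example)
print(len(library), sorted(consensus), round(fraction, 3))
```

In this toy example, keeping only the consensus would discard five of the six identifiers, which is the kind of information loss that motivates treating the results as complementary.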
Figure 15: Comparison of proteomics software for identification of endogenous CSF peptides.
Over 1900 proteins were represented by endogenous peptides in the sample set of two CSF pools. Among these were microtubule-associated protein tau (6 peptides), amyloid precursor protein (213 peptides, 58 of which span the amyloid beta sequence), and NfL and NfH (1 unique peptide each); a further 60 proteins with known or suspected involvement in neurodegeneration were represented by nearly 3000 peptides [9, 221, 368, 369].
5 Ion-fingerprinting algorithms predicate peptide identification on comparison to in silico-generated fragment ion
We concluded that the protocol employed here was successful for the purpose of identifying previously undetected sections of the human CSF peptidome. The workflow is labour-intensive, and the substantial sample handling increases the risk of introducing contaminants. However, this and subsequent studies employing the workflow (see paper III) indicate that investing time and resources in a large-scale initial trial may result in a long list of potential candidate markers for further studies. Finally, with the work presented in this paper we were able to show the scale of the human CSF endopeptidome; in the opinion of this researcher, the sheer amount of information it contains warrants further study.