
It is well known that many tandem mass spectra acquired in proteomics experiments from complex mixtures are actually chimeric spectra composed of two or more peptide species [118]. In their seminal work describing the Percolator algorithm, Käll et al. [75] showed an example spectrum that could be aligned with two different peptide sequences (shown in Fig. 4.8 (A) and (B)).

Figure 4.8: Chimeric spectrum composed of two different peptide sequences. Sequences (A) and (B) differ in mass by 1.1 Da, but because the isolation window in MS1 is 3 Da, both peptides are co-fragmented in MS2.
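The co-isolation condition behind Fig. 4.8 can be illustrated with a minimal sketch. It assumes the 3 Da isolation window is centred on the selected precursor m/z (so it extends 1.5 on either side), and the function names and the target m/z of 650.0 are hypothetical values chosen for illustration only; none of this is taken from the Digger implementation.

```python
def mz_shift(mass_difference_da: float, charge: int) -> float:
    """Convert a neutral-mass difference (Da) into an m/z difference at a given charge."""
    return mass_difference_da / charge


def is_co_isolated(target_mz: float, other_mz: float, window_width: float = 3.0) -> bool:
    """Return True if other_mz lies inside a window of total width
    window_width centred on target_mz (assumed symmetric window)."""
    return abs(other_mz - target_mz) <= window_width / 2.0


if __name__ == "__main__":
    target_mz = 650.0  # hypothetical precursor m/z, for illustration only
    for charge in (1, 2, 3):
        other_mz = target_mz + mz_shift(1.1, charge)
        print(charge, round(other_mz, 3), is_co_isolated(target_mz, other_mz))
```

Under these assumptions a 1.1 Da mass difference always falls inside the window, whatever the charge state, which is why both precursors are fragmented together.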

The peak list for the chimeric spectrum was submitted to the Digger and Mascot search algorithms and searched against the LudwigNR_Q113 sequence database with “yeast” as the taxonomy filter, using low-resolution mass accuracy settings (±3 Da and ±0.5 Da precursor and fragment ion tolerance, respectively). Both peptide sequences were identified with high scores relative to all other hits. For Mascot (data not shown), the Homology ion score threshold was slightly higher than the score of the second peptide hit, hence erroneously precluding the identification of both peptide sequences from the one spectrum. Digger, however, identified both sequences with high scores relative to the other peptide hits (Fig. 4.9 (A) and (B)). Since the default scoring mode in Digger has intensity normalisation enabled (Eq. 4.3), both correct peptide sequences have lower scores (Fig. 4.9 (A)) than when intensity normalisation is disabled (Fig. 4.9 (B)). Of course, the background scores of incorrect peptide hits are also elevated when intensity normalisation is disabled, suggesting that a more sophisticated intensity normalisation routine could be applied when multiplexing strategies become routine.

Figure 4.9: Two peptides identified with confidence by Digger from a chimeric spectrum. Digger default scoring with intensity normalisation enabled (A), and with intensity normalisation disabled (B).

4.4 Summary

Given the high diversity of peptides in biological systems and their involvement in key regulatory processes, including development and immunity, there is a perceived need for improved experimental methods and software tools for their discovery. Detection of these molecules, as well as their modified forms, using an MS-based peptidomic workflow is well established; however, freely available or open-source software tools for analysing these data sets are currently limited. Commercial software tools such as PeaksDB [119] and the ProteinPilot [120] search engine currently fill this niche, but these programs were primarily designed for the analysis of proteomics data sets and have been shoe-horned into the analysis of peptidomics data sets.

In this work, we have developed a search engine and scoring function capable of identifying peptides with confidence within a large search space without resorting to multiple passes or refinement steps (which could invalidate the target-decoy approach to false discovery rate estimation). Based on comparisons with a widely used search algorithm (Mascot), we demonstrate that the Digger search algorithm performs extremely well, especially for tandem mass spectrometry data acquired at high mass accuracy. This bodes well for the future of the Digger search algorithm, as it is envisaged that the majority of peptidomics or proteomics data acquired in the coming years will be of high mass accuracy.

Although not explored in this chapter, the Digger search engine was designed as a stand-alone search engine capable of taking advantage of unlimited compute resources. It is currently designed to run on multiple cores within a symmetric multiprocessing (SMP) environment, but could easily be extended to run on cloud infrastructure or within a client-server compute cluster.

An issue with many widely used search algorithms is their handling of the diverse array of potential modifications that researchers are interested in; a common problem is the inability to apply multiple different modifications to the same amino acid residue. There are no such limitations in the current algorithm, other than a user-settable maximum number of simultaneous modifications allowed per peptide.
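To make the size of this search space concrete, the following is a minimal sketch of enumerating modified forms of a peptide when a residue type may carry several alternative variable modifications, capped by a maximum number of simultaneous modifications. It is not the Digger implementation: the modification table, the function name `modified_forms`, the example peptide and the rule that each position carries at most one modification at a time are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical variable-modification table (illustrative only):
# residue -> list of (name, monoisotopic mass shift in Da)
VARIABLE_MODS = {
    "K": [("Acetyl", 42.010565), ("Methyl", 14.015650)],
    "M": [("Oxidation", 15.994915)],
}


def modified_forms(peptide, mods=VARIABLE_MODS, max_mods=2):
    """Enumerate modified forms of a peptide.

    Each form is (placements, total_mass_shift), where placements is a tuple
    of (position, modification_name, mass_shift). Assumes at most one
    modification per position and at most max_mods per peptide.
    """
    # Every (position, modification) pair that could be applied.
    sites = [
        (pos, name, delta)
        for pos, residue in enumerate(peptide)
        for name, delta in mods.get(residue, [])
    ]
    forms = []
    for k in range(max_mods + 1):
        for combo in combinations(sites, k):
            positions = {pos for pos, _, _ in combo}
            if len(positions) == k:  # skip two modifications on one position
                total_shift = sum(delta for _, _, delta in combo)
                forms.append((combo, total_shift))
    return forms


if __name__ == "__main__":
    for placements, shift in modified_forms("MKK", max_mods=2):
        print(round(shift, 6), [(pos, name) for pos, name, _ in placements])
```

Even for this three-residue example the number of candidate forms grows quickly with `max_mods`, which illustrates why a user-settable cap is a practical necessity.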

The Digger scoring function has been shown to be robust to increases in search space, and because its score is based on empirically observed match statistics, the score reflects any changes in search parameters or in the number of sequence entries searched. The advantages of this approach become obvious when high mass accuracy data are analysed and fragment ion mass errors are extremely low (sub-ppm or low ppm). Similarly, the presence or absence of different ion-series types (because of amino acid composition statistics) is easily modelled in this context, since individual ion-series types and/or combinations of them can be tested on a per-spectrum basis. Extending the scoring to different fragmentation techniques such as ETD or high-energy CID is therefore trivial, and is simply a matter of adding new ion-series types. The major assumption in this work is of course that the decoy candidate peptides constitute an adequate null model. This should be true as long as data are searched against reasonably comprehensive sequence databases appropriate for the experiment.

The score computed by Digger for each peptide-spectrum match (PSM) is essentially a p-value-like quantity and can therefore stand alone for single or small data sets. Similar to Mascot, an Identity threshold could be computed from the output, because the number of candidate peptides per spectrum is written to the search results file. Further, compatibility with post-processing tools, such as Percolator, is ensured because the top ten peptide hits per tandem mass spectrum for both target and decoy searches are output. This allows re-scoring and statistical treatment of the data, especially for large-scale data sets where it is appropriate to compute false discovery rates based on, for example, target-decoy approaches.
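As an illustration of the target-decoy treatment mentioned above, the following minimal sketch estimates a false discovery rate from the best-hit scores of separate target and decoy searches. It assumes that higher scores are better (for example, a negative-log-transformed p-value-like score) and uses the simple decoys/targets estimator; the function names and the score values are hypothetical and are not taken from Digger or Percolator.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoy hits >= threshold) / (#target hits >= threshold).

    Assumes separate target and decoy searches of equal size and that
    higher scores indicate better matches.
    """
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def lowest_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest observed target score whose estimated FDR is within
    max_fdr, or None if no threshold qualifies."""
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        # The estimate is not monotonic in the threshold, so scan all values.
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    target_scores = [50, 45, 40, 30, 20, 10]
    decoy_scores = [25, 15, 5]
    print(lowest_threshold(target_scores, decoy_scores, max_fdr=0.25))
```

The estimator depends on the assumption stated in the text, namely that decoy candidates are an adequate null model; post-processing tools such as Percolator apply more refined statistics than this sketch.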

Data-dependent tandem mass spectrometry acquisition (DDA) strategies are gradually being replaced with data-independent acquisition (DIA) strategies, particularly for targeted quantitative workflows. These same DIA experiments allow a mass spectrometer to essentially dig deeper into a sample, with a concomitant increase in the number of multiplexed spectra and hence the requirement for search algorithms capable of de-multiplexing. The Andromeda search algorithm employs a two-stage search strategy to accomplish this de-multiplexing. We have shown in this work that the Digger search algorithm is capable of scoring multiple co-eluting peptides in a robust manner and, together with a post-processing tool such as Percolator, is able to assign significance to these co-eluting peptides.

The development of the Digger search engine and scoring function, whilst only a single component, begins to address the bioinformatics requirements for the analysis of MS-based peptidomics data sets. We have also shown that this search algorithm is uniquely adapted to undertake the analysis of proteomics data where increased search space (e.g. PTMs) poses a challenge for current search tools.

Chapter 5. Re-scoring, re-ranking, assisted PTM