Comparison between numerical modeling results and database for a


5. Analysis of the benign tissue immunopeptidome

The MS raw data can be processed with commercial programs (e.g., Proteome Discoverer by Thermo Fisher) or with open-source toolboxes (e.g., OpenMS [106]). Both types of software have advantages and disadvantages. Proteome Discoverer is specialized in processing raw files from mass spectrometers of the same vendor, which allows user-friendly analysis. However, the software is a black box: neither the source code nor full access to the integrated algorithms is available. In contrast, open-source software like OpenMS allows full access to the algorithms and supports more than one MS manufacturer. OpenMS has only a limited user interface, which makes setting up a processing pipeline for the data more difficult, but because the source code is open, almost every parameter and algorithm can be optimized for the specific machine and experimental set-up.

Here, we provide a benchmark of four identification pipelines (Figure 5.10). We implemented the first two in Proteome Discoverer 1.4, using either Sequest HT or Sequest HT + Percolator (version 2.04) [123]. We implemented the other two in OpenMS, using either Comet [38,39] or Comet + Percolator (Comet: 2017.01 rev. 2; Percolator: 3.1). Only runs in which Comet identified more than 300 peptides were used in the benchmark (122 HLA class I / 62 class II runs). Excluding runs with very few peptides removes outliers that would otherwise cause a large standard deviation. As in all analyses in this thesis, the benchmark was performed with a run-level FDR of 5%. Our benchmark compares, first, the number of identified HLA class I and II peptides per sample and, second, the ratio of binding HLA class I peptides to identified peptides (binding predicted with netMHCpan-3.0). These two parameters provide a good basis for assessing which identification pipeline and algorithm performs best.
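The run selection and the first benchmark metric described above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the run names and peptide counts are hypothetical, and only the threshold of more than 300 Comet identifications comes from the text.

```python
from statistics import median

# Hypothetical per-run peptide counts; run names and numbers are
# illustrative only and are not data from the benchmark.
comet_counts = {"run_a": 1500, "run_b": 120, "run_c": 800, "run_d": 300}
sequest_counts = {"run_a": 700, "run_b": 90, "run_c": 450, "run_d": 200}

MIN_COMET_PEPTIDES = 300  # threshold stated in the text ("more than 300")


def select_runs(comet, min_peptides=MIN_COMET_PEPTIDES):
    """Return the sorted names of runs in which Comet found more than min_peptides."""
    return sorted(run for run, n in comet.items() if n > min_peptides)


# Only the selected runs enter the comparison of the two search engines.
kept_runs = select_runs(comet_counts)
median_comet = median(comet_counts[r] for r in kept_runs)
median_sequest = median(sequest_counts[r] for r in kept_runs)
```

Because the selection is based on the Comet result alone, the same set of runs is used for every pipeline, so the per-run comparison stays paired.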

Figure 5.10: Workflow of the benchmark of the identification algorithms. First, Comet and Sequest HT are used for identification. Next, the data are either FDR filtered directly or advanced identification statistics are calculated using Percolator. After filtering, the binding affinity is predicted using netMHCpan-3.0. Last, the data are evaluated and compared.
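The direct FDR filtering step of the workflow can be illustrated with a small sketch. The text does not state how the FDR is estimated, so the target-decoy estimate used here (decoys divided by targets above a score cutoff) is an assumption, as are the function name `filter_at_fdr` and the `(score, is_decoy)` input format. Only the 5% run-level threshold comes from the text.

```python
def filter_at_fdr(psms, max_fdr=0.05):
    """Keep target PSMs above the lowest score cutoff whose estimated FDR <= max_fdr.

    psms: list of (score, is_decoy) tuples for ONE run; higher score is better.
    Assumption: FDR at a cutoff is estimated as #decoys / #targets among the
    PSMs scoring at or above that cutoff (a simple target-decoy estimate).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    n_accept = 0  # number of top-ranked PSMs that pass the filter
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            n_accept = i
    # Report only target hits; decoys are used for the estimate alone.
    return [p for p in ranked[:n_accept] if not p[1]]


# Hypothetical run: 20 high-scoring targets, then alternating decoys and targets.
example = [(100 - i, False) for i in range(20)]
example += [(80, True), (79, False), (78, True), (77, False)]
accepted = filter_at_fdr(example)
```

Applying the filter per run, rather than on pooled data, is what makes the threshold a run-level FDR.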

Figure 5.11 shows an overview of the number of identified HLA class I and II peptides per sample. The results indicate that the choice of identification software has a large influence on the number of identifications: the median number of identified peptides is twice as large with Comet as with Sequest HT. We were also interested in the potential gain in identified peptides when Percolator is used, so we ran Comet and Sequest HT with and without Percolator. Figure 5.12 shows a median gain of 50% in identified peptides when Percolator is used. The comparison for HLA class II also shows a gain in identified peptides with Comet; however, the median gain is only 35% without Percolator and 12.5% with Percolator. We did not perform any binding prediction for HLA class II because no reliable peptide binding predictors were available.
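As a worked example of how such a percentage can be obtained, the sketch below computes the relative change between two medians. The counts are hypothetical and chosen only to make the arithmetic easy to follow. It also assumes that "median gain" means the gain of the median; the text could equally mean the median of per-run gains.

```python
from statistics import median


def median_gain_percent(baseline, improved):
    """Relative change of the median, in percent, from baseline to improved."""
    base = median(baseline)
    return 100.0 * (median(improved) - base) / base


# Hypothetical peptide counts per run (not data from the benchmark).
without_percolator = [800, 1000, 1200]   # median 1000
with_percolator = [1300, 1500, 1600]     # median 1500

gain = median_gain_percent(without_percolator, with_percolator)
```

With these numbers the median rises from 1000 to 1500 peptides, which corresponds to a gain of 50%.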

With Sequest HT we had to use an old version of Percolator (2.04) because no newer version is available for Proteome Discoverer. With Comet, we used Percolator 3.1. In the Comet + Percolator setting, we used OpenMS and set the digestion enzyme to none. Proteome Discoverer always uses trypsin as the digestion enzyme for Percolator and does not provide an option to change it. Since our peptides are not tryptic digests, this gives the machine learning algorithm incorrect prior knowledge, which could result in incorrect learning of the peptide properties and a wrong significance calculation. Nonetheless, Percolator was robust and assigned more significant peptide identifications even when the wrong enzymatic cleavage was set by Proteome Discoverer.

[Figure 5.11: boxplots of peptide counts per run (axis "Count", 0 to 6000) for Sequest and Comet, each with and without Percolator. Panels: HLA Class I (Identified Peptides, Binder) and HLA Class II (Identified Peptides).]

Figure 5.11: Results of the benchmark of Sequest HT, Sequest HT + Percolator, Comet, and Comet + Percolator. The number of identified HLA class I peptides per MS run (122 runs) is shown on the left. The number of binding peptides (binders) is on the right. The binding was predicted with netMHCpan-3.0 and the individuals' HLA type. Because of the poor prediction performance for HLA class II, no binding prediction was performed for it. The line inside the boxplots represents the median, the box borders are the first and third quartiles (25th and 75th percentiles), and the whiskers extend to the most extreme value no further than 1.5 x IQR from the hinge (IQR is the inter-quartile range, or distance between the first and third quartiles) [144].
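The boxplot conventions given in the caption of Figure 5.11 (median, quartile hinges, whiskers limited to 1.5 x IQR) can be reproduced with the standard library. This is a sketch under one assumption: quartiles are computed with linear interpolation (`method="inclusive"`), which may differ slightly from the plotting software actually used. The input values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return median, hinges, whisker ends, and outliers for one boxplot."""
    # Assumption: linear-interpolation quartiles (the "inclusive" method).
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical values with one extreme point.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this example the value 100 lies beyond the upper fence (7.75 + 1.5 x 4.5 = 14.5), so the upper whisker stops at 9 and 100 is drawn as an outlier.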

In addition to the number of identified peptides, we were also interested in the number of putative binders in our dataset. The number of binders can be used to verify that the identified peptides are plausible identifications. We calculated the binding affinity with netMHCpan-3.0 and the individual's HLA type. Figure 5.11 shows that we also gain more binding peptides when we use Comet + Percolator than with Sequest HT + Percolator. Furthermore, we calculated the ratio of binders to identified peptides. The percentage of binders was slightly higher with Sequest HT + Percolator (2.2% on average).
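The ratio of binders to identified peptides can be sketched as below. The text does not state which cutoff defines a binder, so the percentile-rank threshold of 2.0 is an assumption. The peptide labels, allele set, and rank values are also hypothetical. A peptide is counted as a binder if its best rank over the individual's alleles is at or below the threshold.

```python
RANK_THRESHOLD = 2.0  # assumption: percentile-rank cutoff for calling a binder


def binder_fraction(best_ranks, threshold=RANK_THRESHOLD):
    """Percentage of peptides whose best predicted rank is <= threshold."""
    if not best_ranks:
        raise ValueError("no identified peptides in this run")
    binders = sum(1 for rank in best_ranks if rank <= threshold)
    return 100.0 * binders / len(best_ranks)


# Hypothetical predictions for one run: peptide -> {allele: percentile rank}.
predictions = {
    "peptide_1": {"HLA-A*02:01": 0.3, "HLA-B*07:02": 40.0},
    "peptide_2": {"HLA-A*02:01": 25.0, "HLA-B*07:02": 1.5},
    "peptide_3": {"HLA-A*02:01": 12.0, "HLA-B*07:02": 30.0},
    "peptide_4": {"HLA-A*02:01": 55.0, "HLA-B*07:02": 60.0},
}

# Use the best (lowest) rank across the individual's HLA alleles.
best = [min(per_allele.values()) for per_allele in predictions.values()]
fraction = binder_fraction(best)
```

Here two of the four peptides pass the cutoff, giving a binder fraction of 50%.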

The benchmark between Sequest HT and Comet showed that Comet is superior to Sequest HT. Although both algorithms are based on the Sequest algorithm, Comet has been shown to be superior. Because Sequest HT is commercial software, no detailed description of the algorithm is available. Therefore, we can only assume that the slight changes in the base algorithm made by Comet and two decades of development led to the massive gain in identifications [38]. As a consequence of the benchmark, we used Comet + Percolator to identify our peptides. There are many other identification algorithms, such as Mascot [94], OMSSA [45], MaxQuant [30], or MSGF+ [63], which we did not consider in this benchmark. The results of multiple identification algorithms could also be combined, which might result in even better identifications. However, a complete benchmark of all identification algorithms would go beyond the scope of this thesis.

[Figure 5.12: boxplots of Binders / Identified Peptides [%] (axis 0 to 100) for Sequest and Comet, each with and without Percolator.]

Figure 5.12: Results of the benchmark of Sequest HT, Sequest HT + Percolator, Comet, and Comet + Percolator. The ratio of binding HLA class I peptides to identified peptides in 122 MS runs is shown as a quality measure. The binding was predicted with netMHCpan-3.0 and the individuals' HLA type. The line inside the boxplots represents the median, the box borders are the first and third quartiles (25th and 75th percentiles), and the whiskers extend to the most extreme value no further than 1.5 x IQR from the hinge (IQR is the inter-quartile range, or distance between the first and third quartiles) [144].
