Bioinformatic tool for MS/MS quality assessment for identification

UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA AGRONÓMICA, ALIMENTARIA Y DE BIOSISTEMAS

DEGREE IN BIOTECHNOLOGY

DEPARTMENT OF BIOTECHNOLOGY-PLANT BIOLOGY

BACHELOR'S THESIS

Author: Manuel Fernández Martínez

Supervisor: Alberto Gil de la Fuente

Co-supervisor: Manuel Martínez Muñoz

July 2021


UNIVERSIDAD POLITÉCNICA DE MADRID

Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas

DEGREE IN BIOTECHNOLOGY

BIOINFORMATIC TOOL FOR MS/MS QUALITY ASSESSMENT FOR IDENTIFICATION

HERRAMIENTA BIOINFORMÁTICA PARA LA EVALUACIÓN DE LA CALIDAD DE EM/EM PARA LA IDENTIFICACIÓN

BACHELOR'S THESIS

Manuel Fernández Martínez

MADRID, 2021

Supervisor: Alberto Gil de la Fuente

Collaborating Professor (Prof. Colaborador Doctor), PhD in Health Sciences and Technologies, Department of Biomedical Engineering, Universidad San Pablo CEU

Co-supervisor: Manuel Martínez Muñoz, Tenured Professor (Prof. Titular de Universidad)

Department of Biotechnology-Plant Biology, ETSIAAB, Universidad Politécnica de Madrid


BIOINFORMATIC TOOL FOR MS/MS QUALITY ASSESSMENT FOR IDENTIFICATION

Thesis submitted by Manuel Fernández Martínez to obtain the Degree in Biotechnology from the Universidad Politécnica de Madrid

Signed: Manuel Fernández Martínez

Approved by the supervisor and director of the thesis: Alberto Gil de la Fuente

Collaborating Professor (Prof. Colaborador Doctor), PhD in Health Sciences and Technologies, Department of Biomedical Engineering

UNIVERSIDAD SAN PABLO CEU

Approved by the co-supervisor:

Manuel Martínez Muñoz, Tenured Professor (Prof. Titular de Universidad)

Department of Biotechnology-Plant Biology, ETSIAAB, Universidad Politécnica de Madrid

Madrid, 7 July 2021


General Index

Figure Index

List of Abbreviations

Abstract

Resumen

CHAPTER 1: Introduction and aims

1.1. Introduction to metabolomics

1.1.1. What is metabolomics?

1.1.2. First appearance and characteristics

1.1.3. Targeted/Untargeted metabolic experiments

1.2. Metabolomic workflow

1.2.1. Sample preparation

1.2.2. Data acquisition

1.2.3. Data processing and statistical analysis

1.2.4. Metabolite identification

1.2.5. Biological interpretation

1.3. Motivation and state of the art

1.4. Objectives

CHAPTER 2: Materials and Methods

2.1. Experimental set-up

2.1.1. Chemicals

2.1.2. Analytical setup

2.2. Algorithm design

2.2.1. Signal to noise ratio measuring

2.2.2. Coelution of isobaric species detection

2.2.3. Coelution of reference masses detection

2.2.4. Crosstalk detection

2.2.5. Designed classes

2.2.6. Quality evaluation

2.3. Implementation

2.3.1. Logical approach to spectrum quality evaluation

2.3.2. Scoring approach to spectrum quality evaluation

2.3.3. Mixed approach to spectrum quality evaluation

2.3.4. Graphical User Interface

CHAPTER 3: Results and discussion

3.1. S/N measurement

3.2. Comparison of quality evaluation methods

CHAPTER 4: Conclusions and Future work

4.1. Conclusions

4.2. Future work

References


Figure Index

Figure 1: Coelution of isobaric species example in MS. Centroided MS spectra from MSMS_AFX3_n, retention time 6.746 min, scan id 404762. Precursor ion for next MS² is labeled with a black diamond, an isotope M+1 with a green triangle and a coeluting ion with a red circle.

Figure 2: Coelution of reference masses example. Centroided MS² spectra from MSMS_AFX3_n, retention time 9.924 min, scan id 595429. Precursor ion in green and coeluting reference mass with a red circle.

Figure 3: Crosstalk example. Centroided MS² spectra from MSMS_AFX15, retention time 5.981 min, scan id 358865. Precursor ion in green and crosstalk-induced peaks in red.

Figure 4: Coelution of isobaric species detection above 10% precursor ion intensity.

Figure 5: Coelution of reference masses detection above 10% of the most intense peak.

Figure 6: Crosstalk detection above 10% of the most intense peak.

Figure 7: UML of Spectrum_object and Peak_object classes.

Figure 8: Flowchart of logic approach to evaluate spectrum quality.

Figure 9: Flowchart of scoring approach to evaluate spectrum quality.

Figure 10: Flowchart of mixed approach to evaluate spectrum quality.

Figure 11: Logical approach to evaluate spectrum quality pseudocode.

Figure 12: S/N score pseudocode.

Figure 13: S/N score for S/N values 1-110.

Figure 14: Inaccuracies score pseudocode.

Figure 15: Inaccuracy score for relative intensity values 1-100.

Figure 16: Scoring approach to evaluate spectrum quality pseudocode.

Figure 17: Mixed approach to evaluate spectrum quality pseudocode.

Figure 18: Run tab of main window of graphic interface, unfilled (left) and filled (right).

Figure 19: Input file example for graphic interface.

Figure 20: Output file example for graphic interface.

Figure 21: Options tab of main window of graphic interface.

Figure 22: Results window and detailed information window of graphic interface.

Figure 23: Precision and recall of direct increment approach in grass level calculation.

Figure 24: Scoring method assessed quality and gold standard quality of spectra.

Figure 25: Mixed and logical methods assessed quality and gold standard quality of spectra.

Figure 26: Mixed and logical methods assessed quality and gold standard quality of spectra with three categories.


List of Abbreviations

mRNA Messenger ribonucleic acid

NMR Nuclear magnetic resonance

MS Mass spectrometry

GC Gas chromatography

LC Liquid chromatography

HPLC High-performance liquid chromatography

UPLC Ultra-performance liquid chromatography

HILIC Hydrophilic interaction liquid chromatography

RPLC Reversed-phase liquid chromatography

IPLC Ion-pairing liquid chromatography

CE Capillary electrophoresis

IM Ion mobility

LC-MS Liquid chromatography coupled with mass spectrometry

RT Retention time

EI Electron ionization

ESI Electrospray ionization

MALDI Matrix-assisted laser desorption ionization

IT Ion trap

Q Quadrupole

QQQ Triple quadrupole

TOF Time of flight

QTOF Quadrupole time of flight

m/z Mass to charge ratio

PCA Principal component analysis

DFA Discriminant function analysis

MS² or MS/MS Tandem mass spectrometry

CID Collision-induced dissociation

SID Surface-induced dissociation

LC-MS² Liquid chromatography-tandem mass spectrometry

CEMBIO Center of Metabolomics and Bioanalysis

S/N or SNR Signal to noise ratio

UML Unified Modeling Language

GUI Graphical user interface


Abstract

Metabolite annotation and identification is a major bottleneck in metabolomic studies, and liquid chromatography coupled to mass spectrometry (LC-MS) is the most common instrumental technique in these studies. Annotation usually follows the statistical analysis that selects features of interest. It consists of searching for candidate metabolites with a similar mass, followed by analysis of tandem mass spectrometry (MS/MS or MS²) spectra. Some tools automate this process; however, their performance depends on the quality of the MS² spectra, which may contain several inaccuracies that hinder or even prevent annotation of the fragmented metabolite. Experienced researchers find these problems only after inspecting each MS² spectrum individually, which takes a considerable amount of time. After a literature review of the state of the art, we found no relevant tool that performs automatic quality assessment of MS² spectra.

In this work, we have developed and implemented a tool that qualitatively assesses MS² spectra according to the likelihood of successfully annotating the metabolites. The tool assesses the spectra based on the presence of coelution signals in the collision cell, the existence of crosstalk, and the signal to noise ratio (S/N). Coelution signals can arise from ions with a different mass to charge ratio (m/z) within the same isolation window or from the presence of reference masses in the collision cell. Three algorithms were proposed to analyse these characteristics: a logical, a scoring and a mixed algorithm. Two schemes of spectral quality levels were also designed, one with 5 levels and the other with 3. We tested these approaches on 35 gold standard annotated spectra. The most effective method was the mixed approach, which achieved 85% success in quality assessment and allowed features to be sorted by score.

A graphical interface was developed with Tkinter, while the tool itself was developed using Python 3.7 and the package pyOpenMS (Röst, et al., 2014). The input to the tool is the raw data from both MS and MS² analysis in mzXML or mzML format. Output is presented both in a file and in the graphical interface. The tool sorts the features of interest according to spectral quality; therefore, researchers can consider the likelihood of success in identifying the features, along with their biological interest, when deciding whether to invest time in their annotation.


Resumen (Summary)

Metabolite annotation and identification is a major obstacle in metabolomic studies, and liquid chromatography coupled to mass spectrometry (LC-MS) is the most common instrumental technique in these studies. Metabolite annotation is usually carried out after statistical analysis to select features of interest. Annotation consists of searching for candidate metabolites with a similar mass, followed by a tandem mass spectrometry (MS/MS) analysis. Some tools exist to automate this process, but their performance depends on the quality of the MS/MS spectrum, which can have several defects that could hinder or completely prevent annotation of the fragmented metabolite. Experienced researchers find these problems only after inspecting each MS/MS spectrum manually, which takes considerable time. After a literature review of the state of the art, we found no relevant tool that qualitatively evaluates MS/MS spectra automatically.

In this work, we have developed and implemented a tool to qualitatively evaluate MS/MS spectra according to the probability of annotation. The tool rates the spectra according to the presence of coelution signals in the collision cell, the existence of crosstalk and the signal to noise ratio. Coelution signals come from compounds with a different mass to charge ratio within the same window or from the presence of reference masses in the collision cell. Three algorithms were proposed to analyse these characteristics: a logical algorithm, a scoring algorithm and a mixed one. Two ways of classifying spectra into quality levels were also designed, one with 5 categories and another with 3.

We tested these approaches against 35 annotated spectra. The most effective method was the mixed one with 3 categories, with 85% accuracy in classifying spectral quality; it also allowed the spectra to be sorted by score.

A graphical interface was developed and implemented using the Tkinter library, and the tool itself using Python 3.7 and the pyOpenMS library (Röst, et al., 2014). The input to the tool is the raw data of an MS and MS/MS analysis in mzXML or mzML format. The output is produced both as a text file and in the windows of the graphical interface. The tool sorts features of interest according to spectral quality, so researchers can consider the probability of success in identifying the features, together with their biological interest, when deciding whether to invest time in their annotation.


CHAPTER 1: Introduction and aims

1.1. Introduction to metabolomics

1.1.1. What is metabolomics?

Omics branches of science are characterized by their holistic view of biology and are directed at the collective detection of genes, messenger ribonucleic acid (mRNA), proteins and metabolites (Vailati-Riboni, et al., 2017). The suffix -ome refers to the subject of study of each branch: genomics studies the genome, transcriptomics the transcriptome (mRNA), proteomics the proteome and metabolomics the metabolome. Metabolites are the end products of cellular regulatory processes, and their levels can be regarded as the ultimate response of biological systems to genetic or environmental changes. The metabolome is the set of metabolites synthesized by an organism (Fiehn, 2002).

1.1.2. First appearance and characteristics

The term metabolome was first introduced in 1998, in the context of measuring the change in relative concentrations of metabolites resulting from the deletion or overexpression of a gene (Oliver, et al., 1998). Metabolomics was first proposed by Nicholson et al. as “the quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” (Nicholson, et al., 1999).

Metabolomics is a rapidly developing research field that focuses on the comprehensive analysis of metabolites with molecular masses lower than 2000 Da present in biological samples (Hollywood, et al., 2006). Metabolomics provides a snapshot of the metabolic state of the cell. This allows the researcher to detect and quantify changes in multiple organisms, with applications in fields such as food and plant sciences, drug development, toxicology, environmental science and medicine (Yang, et al., 2019).

The advantage of metabolomics over the other omics is that it is closer to the physiological state of the cell. For example, genomics and proteomics indicate which proteins are present in the cell and may give a general understanding of the metabolic flux, but small differences in protein levels can lead to changes in the levels of single metabolites. Metabolomics can therefore allow earlier discovery of biomarkers that cannot be detected by the other omic sciences.


1.1.3. Targeted/Untargeted metabolic experiments

Targeted metabolomics studies aim to analyze a set of previously selected metabolites of interest. In these studies, a reference standard must be available for each metabolite, and the hypothesis has to be established in advance. The number of metabolites analyzed is relatively small compared to untargeted approaches, and the studies usually try to quantify each compound of interest. An example of targeted metabolomics is metabolic profiling, which was first described as the analysis of a predefined set of metabolites that can be studied under pathogenic circumstances and after drug therapy (Horning & Horning, 1971).

Untargeted metabolomic studies focus on maximizing the amount of biological information obtained in order to analyze the composition of the samples. No prior hypothesis or knowledge is necessary, since the biological interpretation is performed once all the compounds in the sample have been detected and, ideally, identified and quantified. Thus, these studies are suited to hypothesis generation, discovery and detection of unexpected changes in the metabolic state of the sample. Untargeted metabolomics includes fingerprinting and footprinting metabolomics. Metabolic fingerprinting is defined as “a rapid classification of samples according to their origin or their biological relevance” (Fiehn, 2001). Metabolic footprinting refers to the monitoring of metabolites consumed from and secreted into the medium (Kell, et al., 2005).

1.2. Metabolomic workflow

The metabolomic workflow comprises the steps needed to carry out targeted and untargeted metabolomics analyses: biological question, experimental design, sample preparation, data acquisition, data processing, statistical analysis, metabolite identification, and biological interpretation and validation.

The workflow starts with a biological question to be answered. This is an important step because it determines the experimental design that follows and the type of approach to the analysis (untargeted/targeted).

1.2.1. Sample preparation

Sample preparation includes the collection and storage of the samples, followed by isolation of the analytes of interest and removal of all contaminating molecules such as genes, proteins, and salts. Sample preparation starts with quenching of the metabolism with cold. For unicellular organisms or biofluids, this is done with cold-buffered methanol, while animal and plant tissue is usually quenched using liquid nitrogen (Hollywood, et al., 2006).

1.2.2. Data acquisition

The main data acquisition techniques used in the metabolomics workflow are nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS). NMR provides information with minimal sample treatment; its shortcomings are low sensitivity and less clear identification. MS is commonly preceded by a separation step using chromatographic or electrophoretic methods. The advantages of this approach are higher sensitivity and clearer identification, due to decreased ion suppression effects and sample complexity.

Chromatography separates complex samples by carrying them in a mobile phase through a column that contains a stationary phase; metabolites are separated because of their different affinities for the stationary phase. A greater affinity for the stationary phase increases the time that the metabolite spends in the column, which is called the retention time. Thus, metabolites leave the column at different retention times. Chromatographic techniques are classified by the state of the mobile phase into two categories, gas chromatography (GC) and liquid chromatography (LC). GC separates volatile compounds; other compounds require a further sample treatment called derivatization, which makes the molecules volatile. GC is not suited to the fast analysis of a small set of defined compounds because of its time-consuming pre-treatment, but it can non-selectively scan the volatile part of the metabolome. LC is a widely used separation method that achieves high specificity and sensitivity with short analysis times. Popular LC separation techniques used in metabolomics include high-performance liquid chromatography (HPLC) and ultra-performance liquid chromatography (UPLC). There are various types of LC for different polarities and groups of metabolites, including hydrophilic interaction liquid chromatography (HILIC), reversed-phase liquid chromatography (RPLC) and ion-pairing liquid chromatography (IPLC).

Capillary electrophoresis (CE) is a separation method that uses the differences in ionic mobility of the metabolites in a liquid to separate them. It is used for the assessment of highly polar or charged metabolites. Ion mobility (IM) also separates metabolites by mobility, but in a buffer gas, and does so more rapidly.

The abbreviations of the separation technique and of the detection technique are commonly combined to describe the whole process; for example, liquid chromatography coupled with mass spectrometry is LC-MS. In this work, we focus on MS approaches, particularly LC-MS.


After separation, the portion of sample eluting within a given retention time (RT) window enters the mass spectrometer through an inlet. RT is the time that the sample or compound has been retained in the separation column. Mass spectrometers are composed of an ion source, a mass analyzer and a detector. Before being sorted by the mass analyzer, metabolites must be converted into gas-phase ions through procedures such as electron ionization (EI), electrospray ionization (ESI) or matrix-assisted laser desorption ionization (MALDI). These ions are then transported to the mass analyzer, which sorts them by mass to charge ratio before they are detected. Types of mass analyzers include ion trap (IT), quadrupole (Q), triple quadrupole (QQQ), Orbitrap, time of flight (TOF) and quadrupole time of flight (QTOF). Detection in MS platforms provides the mass to charge ratio (m/z) and the intensity of each ionized metabolite, which gives information about its quantity.

1.2.3. Data processing and statistical analysis

In MS-based metabolomics, each sample returns raw data composed of several scans. Each scan is a histogram that represents the measurement of ions in the mass analyzer during a small retention time window. Ions in the scan appear as peaks with an m/z and an intensity value. Data processing transforms the raw data to aid the visualization of the characteristics of each ion.

Mass spectrometers typically save raw data files in a format specific to the company that produces them, so the first step in data processing is sometimes to convert the files into an open data format. Common open formats include NetCDF, the widely used mzXML and, more recently, mzML.
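As a minimal sketch of the scan structure described above, a centroided scan can be modelled as a retention time, an MS level and a list of (m/z, intensity) peaks. The names `Peak`, `Scan`, `base_peak` and `split_by_level` are illustrative assumptions; they are neither pyOpenMS types nor the classes of the tool presented in this work.

```python
from dataclasses import dataclass
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical alias


@dataclass
class Scan:
    """One scan: the ions measured during a small retention time window."""
    rt_min: float       # retention time in minutes
    ms_level: int       # 1 for MS, 2 for MS²
    peaks: List[Peak]   # centroided peaks

    def base_peak(self) -> Peak:
        """Return the most intense peak of the scan."""
        return max(self.peaks, key=lambda p: p[1])


def split_by_level(scans: List[Scan]) -> Tuple[List[Scan], List[Scan]]:
    """Separate MS scans from MS² scans, preserving acquisition order."""
    ms1 = [s for s in scans if s.ms_level == 1]
    ms2 = [s for s in scans if s.ms_level == 2]
    return ms1, ms2
```

In practice the scans would be read from an mzML or mzXML file; the sketch only fixes a vocabulary for the peak lists that the rest of the processing operates on.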

The processing pipeline in MS methods usually includes filtering, feature detection, alignment and normalization (Katajamaa & Oresic, 2007). A frequent first step is data centroiding, which converts each profile peak into a single data point. Centroiding can be done at the same time as data acquisition, which makes the data easier to handle because of the smaller file size. LC-MS data contain chemical and random noise. Chemical noise is normally caused by buffers and solvents, particularly at the beginning and at the end of an elution, and random noise is caused by the mass analyzer itself (Katajamaa & Oresic, 2007). Feature detection aims to find all true ion signals while avoiding false positives. Alignment procedures cluster features across multiple samples, while normalization removes systematic bias in intensities while preserving relevant biological differences.
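Centroiding can be illustrated with a small sketch. It assumes that a profile spectrum is a list of (m/z, intensity) samples, that each run of consecutive samples above a threshold forms one peak, and that the centroid is the intensity-weighted mean m/z carrying the summed intensity. Instrument software uses more elaborate peak picking; the function names here are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (m/z, intensity)


def centroid_profile(profile: List[Point], min_intensity: float = 0.0) -> List[Point]:
    """Collapse each run of samples above min_intensity into one centroid."""
    centroids: List[Point] = []
    run: List[Point] = []
    for mz, intensity in profile:
        if intensity > min_intensity:
            run.append((mz, intensity))
        elif run:
            centroids.append(_collapse(run))
            run = []
    if run:  # close a run that reaches the end of the profile
        centroids.append(_collapse(run))
    return centroids


def _collapse(run: List[Point]) -> Point:
    """Intensity-weighted mean m/z and summed intensity of one run."""
    total = sum(i for _, i in run)
    mean_mz = sum(mz * i for mz, i in run) / total
    return mean_mz, total
```

Each profile peak of several samples becomes one (m/z, intensity) pair, which is why centroided files are much smaller than profile files.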

Statistical analysis is performed between control and test samples to identify statistically significant differences between the metabolites present. Examples of univariate statistical tests performed in metabolomics are Student’s t-test, the Mann-Whitney U-test and ANOVA.


Multivariate analysis methods can be unsupervised, such as principal component analysis (PCA) or supervised, such as discriminant function analysis (DFA). These tests are used to reveal ions with significant differences in intensities between the control and the samples, as they tend to be biomarkers or an important characteristic to be studied further.

1.2.4. Metabolite identification

In untargeted studies, the mass spectral features acquired by instrumentation must be identified to continue with the biological interpretation. Sometimes the researchers are only interested in the features that have been detected as statistically significant, called relevant from now on, but ideally the researchers identify both relevant and non-relevant features detected. This does not apply to targeted studies, as the metabolites are already known. The aim of this part of the study is to identify all the relevant features to provide biological significance.

However, metabolite identification is an important bottleneck in untargeted metabolomics, and measured ions usually do not match metabolites in databases using common adduct forms (Uppal, et al., 2016). Therefore, after a presumptive metabolite has been identified, a step beyond MS is required for confirmation. The most common technique for achieving confirmation in untargeted metabolomics is comparing the fragmentation pattern of the metabolite with either that of a reference standard or a pattern previously stored in a database (Uppal, et al., 2016).

Fragmentation patterns are obtained via tandem mass spectrometry (MS² or MS/MS). A tandem mass spectrometer is essentially two mass spectrometers connected by a collision cell. First, the sample volume eluting within a retention time window (from the chromatographic column) is ionized and introduced into the first mass analyzer. Ions are sorted and filtered so that, ideally, only the selected ones pass into the collision cell; these are called precursor ions. Ions are selected within an isolation window, which is the m/z interval that is allowed into the collision cell and differs between MS² instruments. In the collision cell, precursor ions are fragmented into a pattern of ions called product ions. Fragmentation in the collision cell is commonly achieved with collision-induced dissociation (CID), which increases ion energy and induces fragmentation through collisions of the ions with neutral molecules (inert gas). Other fragmentation methods include electron transfer methods, photodissociation and surface-induced dissociation (SID). Product ions are then sorted by m/z in the second mass analyzer before being detected.

There are several inaccuracies related to liquid chromatography-tandem mass spectrometry (LC-MS²) (Vogeser & Seger, 2010) that hinder identification because they increase the complexity of the spectrum, adding m/z peaks that are not products of the precursor ion. These include coelution of isobaric species, coelution of reference masses and crosstalk.

Coelution of isobaric and/or isomeric species occurs when ions other than the precursor enter the collision cell of the MS² spectrometer because they have an m/z comparable to that of the precursor ion. This happens when the isolation window in the first mass analyzer is wide enough (low resolution) to include the m/z of the coeluting ion. Coelution produces fragment peaks that cannot be directly recognized in the resulting MS² spectrum, as their m/z is not known. The presence of peaks that are not products of precursor ion fragmentation reduces the confidence in the spectrum and lowers the a priori likelihood of identification. Additionally, coelution of isobaric species cannot be detected from the MS² spectrum alone, as the coeluting compounds may have fragmented and left no identifiable peak with an m/z within the isolation window. Researchers detect coelution by looking for coeluting peaks in the MS spectrum acquired before the MS² spectrum, using visualization software such as MassHunter Profinder. If a peak is present in the MS spectrum after the precursor peak and within the isolation window, that peak is likely to enter the collision cell and is identified as a coelution. Adducts of the precursor ion are not considered coelution. A coelution example is presented in Figure 1. The precursor ion is labeled with a black diamond and the M+1 isotope with a green triangle. Here, the isolation window is 1.3 m/z, so the next peak after the isotope (labeled with a red circle) represents a coelution into the collision cell.

Figure 1: Coelution of isobaric species example in MS. Centroided MS spectra from MSMS_AFX3_n, retention time 6.746 min, scan id 404762. Precursor ion for next MS² is labeled with a black diamond, an isotope M+1 with a green triangle and a coeluting ion with a red circle.
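The manual check described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the algorithm of this work: the isolation window is taken to extend `window_width` above the precursor m/z, the precursor is assumed to be singly charged (isotopes spaced by about 1.003355 m/z), a 10% relative intensity threshold is borrowed from the title of Figure 4, and adducts are not handled. The function name and its parameters are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)
C13_SPACING = 1.003355      # m/z gap between isotopes of a singly charged ion


def find_coeluting_peaks(
    ms1_peaks: List[Peak],
    precursor_mz: float,
    window_width: float = 1.3,
    min_rel_intensity: float = 0.10,
    mz_tol: float = 0.01,
) -> List[Peak]:
    """Return MS peaks that may enter the collision cell with the precursor."""
    # Precursor intensity: the most intense peak within mz_tol of precursor_mz.
    candidates = [i for mz, i in ms1_peaks if abs(mz - precursor_mz) <= mz_tol]
    if not candidates:
        return []
    precursor_intensity = max(candidates)

    coeluting = []
    for mz, intensity in ms1_peaks:
        offset = mz - precursor_mz
        if offset <= mz_tol or offset > window_width:
            continue  # the precursor itself, a lower m/z, or outside the window
        # Skip isotopes (M+1, M+2, ...) of the precursor.
        n = round(offset / C13_SPACING)
        if n >= 1 and abs(offset - n * C13_SPACING) <= mz_tol:
            continue
        if intensity >= min_rel_intensity * precursor_intensity:
            coeluting.append((mz, intensity))
    return coeluting
```

The input is the peak list of the MS scan acquired just before the MS² scan, since the coeluting ion may leave no recognizable peak in the MS² spectrum itself.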

Reference masses are present in the sample and are used to calibrate the MS² instrument. Ions of these reference masses can sometimes enter the collision cell because of their abundance. The m/z values of the reference masses are known to the researchers, so the peaks they produce in the MS² spectrum are recognizable and are detected manually with visualization software. Although their m/z is known, they can fragment and contaminate the spectrum, leading to greater complexity. The fragmentation pattern of these reference compounds is also usually known to the researcher from experience, so they do not hinder manual identification as much. However, automatic annotation software may fail to detect and ignore these compounds, or human error might occur. An example of coelution of reference masses is shown in Figure 2. The coeluting reference mass 112.99 has not fragmented and is labeled with a red circle; the precursor ion is in green.

Figure 2: Coelution of reference masses example. Centroided MS² spectra from MSMS_AFX3_n, retention time 9.924 min, scan id 595429. Precursor ion in green and coeluting reference mass with a red circle.
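Because the reference m/z values are known in advance, this check reduces to a tolerance match against the MS² peak list. The sketch below is illustrative only: the function name, the absolute tolerance `mz_tol` and the 10% threshold relative to the most intense peak (taken from the title of Figure 5) are assumptions, and the list of reference masses must be supplied by the user for their own instrument setup.

```python
from typing import Iterable, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def find_reference_mass_peaks(
    ms2_peaks: List[Peak],
    reference_mzs: Iterable[float],
    mz_tol: float = 0.01,
    min_rel_intensity: float = 0.10,
) -> List[Peak]:
    """Return MS² peaks that match a known reference mass."""
    if not ms2_peaks:
        return []
    refs = list(reference_mzs)
    # Only peaks above a fraction of the most intense peak are reported.
    threshold = min_rel_intensity * max(i for _, i in ms2_peaks)
    return [
        (mz, intensity)
        for mz, intensity in ms2_peaks
        if intensity >= threshold
        and any(abs(mz - ref) <= mz_tol for ref in refs)
    ]
```

With the reference mass of Figure 2, a call such as `find_reference_mass_peaks(peaks, [112.99])` would flag an unfragmented reference ion; fragments of the reference compound would need their own entries in the list.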

Although uncommon, crosstalk is a mass spectrometric effect that occurs when the collision cell is not completely emptied between scans and ions from previous fragmentations are detected. Crosstalk leads to spurious peaks throughout the spectrum that reduce the likelihood of identification. It is detected by recognizing peaks with an m/z greater than that of the precursor ion, because the precursor ion or an adduct should have the greatest m/z in the spectrum, all the other peaks being fragments. A crosstalk example is given in Figure 3. Crosstalk-induced peaks are highlighted in red.

Figure 3: Crosstalk example. Centroided MS² spectra from MSMS_AFX15, retention time 5.981 min, scan id 358865. Precursor ion in green and crosstalk-induced peaks in red.
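The rule that no fragment should lie above the precursor translates directly into a filter on the MS² peak list. The following is a sketch under assumptions: the function name is hypothetical, the 10% threshold relative to the most intense peak comes from the title of Figure 6, and `mz_margin` is an assumed allowance (set here to the 1.3 m/z isolation window) so that isotopes of the precursor are not flagged. Adducts of the precursor, which the text also allows above the precursor m/z, are not handled.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def find_crosstalk_peaks(
    ms2_peaks: List[Peak],
    precursor_mz: float,
    mz_margin: float = 1.3,
    min_rel_intensity: float = 0.10,
) -> List[Peak]:
    """Return MS² peaks with an m/z clearly above that of the precursor ion."""
    if not ms2_peaks:
        return []
    # Only peaks above a fraction of the most intense peak are reported.
    threshold = min_rel_intensity * max(i for _, i in ms2_peaks)
    return [
        (mz, intensity)
        for mz, intensity in ms2_peaks
        if mz > precursor_mz + mz_margin and intensity >= threshold
    ]
```

A non-empty result suggests that ions from a previous fragmentation were still in the collision cell, so the spectrum deserves less confidence.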


MS² spectral quality is important for identification via fragmentation patterns (Yates, 1998). Scoring fragmentation patterns against reference MS² database entries is highly dependent on the peaks present in the spectrum (Neumann & Böcker, 2010).

1.2.5. Biological interpretation

Biological interpretation answers the biological question that prompted the metabolomic study. It is carried out by finding the metabolic pathways of the identified metabolites. Different steady-state levels of metabolites associated with metabolic pathways can be used, for example, to reveal alterations in enzymatic reactions when studying regulatory processes or drug targets in metabolism (Weckwerth & Morgenthal, 2005).

1.3. Motivation and state of the art

As metabolite annotation is a major bottleneck in metabolomic studies, various approaches to automate this process have been created. A straightforward approach is matching sample MS² spectra with fragmentation patterns found in databases. These fragmentation patterns can be experimentally obtained or predicted. Examples of this type of tool are XCMS² (Benton, et al., 2008), NIST MS Search, MS-DIAL (Tsugawa, et al., 2015) and CEU Mass Mediator 3.0 (Gil-de-la-Fuente, et al., 2019). However, the cost of generating standard MS² spectra for spectral libraries has promoted in silico fragmentation tools for compound identification. In silico fragmentation tools include CFM-ID 3.0 (Djoumbu-Feunang, et al., 2019), MetFrag (Ruttkies, et al., 2016) and MAGMa (Ridder, et al., 2012).

Assuming that reference spectra from databases are accurate, the performance of these tools depends on the quality of the MS² spectra obtained from biological samples, where low quality spectra may lead to wrong annotations. Recognizing these problems depends on the experience of the researcher, who has to find the issues manually, which takes a large amount of time and increases the likelihood of human error. Although machine learning tools have been created to assess the quality of MS² spectra in proteomics (Flikka, et al., 2006) (Koenig, et al., 2008) (Salmi, et al., 2006), no relevant tool that uses expert knowledge rather than machine learning for spectral quality assessment in metabolomics is available.

1.4. Objectives

The general aim of this project is the development of a software tool that qualitatively assesses MS² spectra in relation to the likelihood of success in identification. To reach this aim, the following specific objectives were proposed:


1. Literature review of metabolomics, MS and MS² spectra quality evaluation.

2. Design and implementation of algorithms for MS² spectra quality evaluation with the pyOpenMS library (Röst, et al., 2014).

3. Validation of the implemented algorithms using a “gold standard” to verify that spectra can be automatically evaluated in relation to the likelihood of success in identification.

4. Implementation of a graphical tool for spectra analysis and subsequent evaluation.

CHAPTER 2: Materials and Methods

2.1. Experimental set-up

In this work, a gold standard of 35 spectra with annotated features for validation of the software was provided by the analytical chemists on the project, who belong to the Centre for Metabolomics and Bioanalysis (CEMBIO¹) of CEU-San Pablo University.

2.1.1. Chemicals

The organic solvent used was acetonitrile of MS grade. Reference mass solutions for LC-MS were from Agilent Technologies. Ultrapure water was used in the preparation of mobile phase A.

2.1.2. Analytical setup

The analysis of the samples was performed using a UHPLC system (1290 Infinity II system, Agilent Technologies) coupled to a G6545B LC/QTOF (Agilent Technologies) with a Dual AJS ESI ion source. Data were stored in centroid mode.

2.2. Algorithm design

After a literature review of the state of the art in spectral quality control, we discussed with the team the relevant features that define quality for identification. A first proposal was devised to detect characteristics that hinder identification. The main characteristics of MS² spectra that were defined as important barriers to metabolite identification were a low signal to noise ratio and the presence of coelution of isobaric species, coelution of reference masses and crosstalk.

2.2.1. Signal to noise ratio measuring

To measure the signal to noise ratio (S/N or SNR) in MS² spectra, two components are needed: the grass level (also noise level or baseline) and the most intense peak. The grass level is the imaginary intensity line that separates the products of precursor ion fragmentation from chemical and random noise peaks. The most intense peak of each spectrum was obtained by simply setting the greatest intensity to the intensity of the first peak and scanning all the other peaks for a greater intensity (substituting the previous value whenever a greater one was found). Peaks at the inputted m/z of the reference masses were ignored in the calculation of the most intense peak.

1 Centre for Metabolomics and Bioanalysis (CEMBIO), Department of Chemistry and Biochemistry, Facultad de Farmacia, Universidad San Pablo-CEU, CEU Universities, Urbanización Montepríncipe, Boadilla del Monte, 28660 Madrid, Spain

Three approaches to grass level measurement were tested in this work. The first consisted in finding the top 1% of peaks, calculating their median intensity and defining that as the grass level. The second was finding the lowest 20% of peaks and calculating the median. The third was selecting the peaks in the top and bottom 20% of m/z signals and finding their mean intensity. We decided to use the third approach because it was the most effective at finding an adequate grass level. This method returned grass level intensity values below the observed ones, so we directly multiplied the grass level by a constant. S/N is then calculated by dividing the intensity of the most intense peak by the calculated grass level.
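The third approach can be sketched as follows. This is a minimal illustration and not the tool's actual code: the function names, the `mz_tolerance` parameter and the reading of "top and low 20% of m/z signals" as the peaks at both ends of the m/z-sorted list are assumptions. The default multiplier of 11 is the constant reported in Section 3.1.

```python
from statistics import mean


def grass_level(peaks, fraction=0.2, multiplier=11.0):
    """Estimate the grass (noise) level of a spectrum.

    peaks is a list of (mz, intensity) tuples. Assumption: the peaks used
    are those at both ends of the m/z-sorted peak list.
    """
    ordered = sorted(peaks)                    # sort by m/z
    k = max(1, int(len(ordered) * fraction))   # peaks taken from each end
    edge_peaks = ordered[:k] + ordered[-k:]
    return multiplier * mean(intensity for _, intensity in edge_peaks)


def signal_to_noise(peaks, reference_mzs=(), mz_tolerance=0.01, **grass_kwargs):
    """Most intense non-reference peak divided by the grass level."""
    def is_reference(mz):
        return any(abs(mz - ref) <= mz_tolerance for ref in reference_mzs)

    # Reference masses are ignored when looking for the most intense peak.
    top = max(intensity for mz, intensity in peaks if not is_reference(mz))
    return top / grass_level(peaks, **grass_kwargs)
```

Keeping the multiplier as a parameter makes it easy to repeat the search for the best constant on other annotated spectra.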

2.2.2. Coelution of isobaric species detection

The tool follows the same procedure as researchers to detect coelution of isobaric species in an MS² spectrum. Given a specific MS² spectrum, the algorithm finds its precursor in the preceding MS scan. Then, it looks for another peak with an m/z within the isolation window (also given) and labels it as a coelution (unless it is an M+1 isotope).

In discussions with the team, it was found that coelutions with low intensity compared to the precursor ion were not hindering the identification. This is because they cause spurious peaks with low intensity compared to product ions. A first approach for detection of this type of coelution was to distinguish peaks above a level of intensity. This level of intensity was defined as a percentage of the precursor ion intensity (standard value of this percentage was fixed to 10%). This is illustrated in Figure 4, where the black line represents 10% of the intensity of the precursor ion (black diamond), the adduct is ignored (green triangle) and the coelution is detected (red circle).

Figure 4: Coelution of isobaric species detection above 10% precursor ion intensity.
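The detection described in this section could look roughly like the sketch below. It assumes that the isolation window is a total width centred on the precursor, that ions are singly charged (so the M+1 isotope lies about 1.00335 m/z above the precursor) and that peaks are (m/z, intensity) tuples. The function name and the `mz_tolerance` parameter are hypothetical.

```python
C13_C12_DIFF = 1.00335  # mass difference between 13C and 12C, in Da


def find_isobaric_coelutions(ms1_peaks, precursor_mz, isolation_window,
                             min_percentage=10.0, mz_tolerance=0.01):
    """Return MS peaks inside the isolation window that may coelute.

    ms1_peaks: (mz, intensity) tuples of the MS scan preceding the MS2 scan.
    """
    # The precursor is taken as the MS peak closest to the reported m/z.
    precursor = min(ms1_peaks, key=lambda peak: abs(peak[0] - precursor_mz))
    threshold = precursor[1] * min_percentage / 100
    half_window = isolation_window / 2

    coelutions = []
    for mz, intensity in ms1_peaks:
        if (mz, intensity) == precursor:
            continue                      # the precursor itself
        if abs(mz - precursor[0]) > half_window:
            continue                      # outside the isolation window
        if abs(mz - (precursor[0] + C13_C12_DIFF)) <= mz_tolerance:
            continue                      # M+1 isotope of the precursor
        if intensity > threshold:
            coelutions.append((mz, intensity))
    return coelutions
```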

2.2.3. Coelution of reference masses detection

Reference masses are detected directly by m/z in the MS² spectra because they are known compounds (see introduction for more details) and their m/z values are defined as a parameter.

Analogously to the coelution of isobaric species, we found that low intensity reference masses do not hinder identification. The first approach was to detect reference masses above a level of intensity defined as a percentage of the most intense peak. Default value of the percentage was 10%. This is illustrated in Figure 5, where the black line represents 10% of the intensity of the most intense peak (black square) and the red circle marks a reference mass coelution.

Figure 5: Coelution of reference masses detection above 10% of the most intense peak.
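A minimal sketch of this check is given below, assuming peaks are (m/z, intensity) tuples and that a reference mass matches a peak within a small m/z tolerance. The function name and the tolerance value are hypothetical. Following Section 2.2.1, reference masses are left out when finding the most intense peak.

```python
def find_reference_mass_coelutions(ms2_peaks, reference_mzs,
                                   min_percentage=10.0, mz_tolerance=0.01):
    """Return reference-mass peaks above a percentage of the most intense peak."""
    def is_reference(mz):
        return any(abs(mz - ref) <= mz_tolerance for ref in reference_mzs)

    # Most intense peak, ignoring the reference masses themselves.
    base_intensity = max(i for mz, i in ms2_peaks if not is_reference(mz))
    threshold = base_intensity * min_percentage / 100
    return [(mz, i) for mz, i in ms2_peaks if is_reference(mz) and i > threshold]


# Illustrative values only; they are not taken from the gold standard.
ms2 = [(85.03, 400.0), (121.0509, 250.0), (163.06, 1000.0), (922.0098, 20.0)]
flagged = find_reference_mass_coelutions(ms2, [121.0509, 922.0098])
```

In this example only the peak at m/z 121.0509 is flagged, because the peak at m/z 922.0098 stays below 10% of the most intense peak.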

2.2.4. Crosstalk detection

Crosstalk detection is achieved by looking for peaks with an m/z greater than the precursor ion. As in coelution, we found that low intensity crosstalk peaks did not hinder identification. The first approach was to detect crosstalk peaks above a level of intensity defined as a percentage of the most intense peak. Default value of the percentage was 10%. This is illustrated in Figure 6, where the green peak is the most intense and the precursor ion, and the black line represents 10% of its intensity. The red circles mark the detected crosstalk.

Figure 6: Crosstalk detection above 10% of the most intense peak.
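The crosstalk check can be sketched in the same way. The function name and the `mz_tolerance` margin, which keeps the precursor peak itself from being flagged, are assumptions, and peaks are again (m/z, intensity) tuples.

```python
def find_crosstalk(ms2_peaks, precursor_mz, min_percentage=10.0, mz_tolerance=0.01):
    """Return MS2 peaks heavier than the precursor and above the intensity level."""
    base_intensity = max(intensity for _, intensity in ms2_peaks)
    threshold = base_intensity * min_percentage / 100
    return [
        (mz, intensity)
        for mz, intensity in ms2_peaks
        if mz > precursor_mz + mz_tolerance and intensity > threshold
    ]


# Illustrative values only.
ms2 = [(85.03, 300.0), (180.06, 1000.0), (250.10, 150.0), (300.20, 50.0)]
crosstalk = find_crosstalk(ms2, precursor_mz=180.06)
```

Here the peak at m/z 250.10 is reported, while the one at m/z 300.20 is below the 10% level and is ignored.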

2.2.5. Designed classes

Two classes were designed to store spectral information, Spectrum_object and Peak_object. Each Peak_object stores the m/z and intensity of a single data point (peak) in the spectrum. A Spectrum_object stores all the Peak_object instances of a spectrum. It also stores information about the spectrum such as retention time, spectrum id, number of peaks (size), precursor intensity and precursor m/z. In addition, a function to get the most intense peak in the spectrum was added. Figure 7 shows the Unified Modeling Language (UML) diagram of both classes.

Figure 7: UML of Spectrum_object and Peak_object classes.
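As a rough illustration of the two classes, they could be written with dataclasses as below. The class names come from the text, but the attribute and method names are assumptions based on the description above, since they may differ from those in the UML diagram.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peak_object:
    mz: float
    intensity: float


@dataclass
class Spectrum_object:
    spectrum_id: str
    retention_time: float
    precursor_mz: float
    precursor_intensity: float
    peaks: List[Peak_object] = field(default_factory=list)

    @property
    def size(self) -> int:
        """Number of peaks stored in the spectrum."""
        return len(self.peaks)

    def get_most_intense_peak(self) -> Peak_object:
        """Return the peak with the greatest intensity."""
        return max(self.peaks, key=lambda peak: peak.intensity)


spectrum = Spectrum_object(
    spectrum_id="scan=42", retention_time=312.5,
    precursor_mz=180.06, precursor_intensity=5.0e4,
    peaks=[Peak_object(85.03, 300.0), Peak_object(163.06, 1000.0)],
)
```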

2.2.6. Quality evaluation

Considering the previously mentioned factors, various approaches were designed to evaluate MS² spectra quality. The first approach was logical and reduced quality based on the presence of inaccuracies (coelution and crosstalk) and deficient S/N. It considers the intensity of the inaccuracies by classifying them as weak or strong. These distinctions are made with criteria similar to those used for detection but with different percentage values. Weak coelution of isobaric species is defined as one that has an intensity lower than 40% of the precursor ion (in the MS). Weak crosstalk or reference mass peaks have an intensity lower than 40% of the most intense peak. Low S/N is lower than 10. We defined two different scales to qualify spectrum quality: 5 levels (very good, good, regular, bad and very bad) and 3 levels (good, regular and bad). The flowchart for quality evaluation using the 5-level scale and the logical algorithm is represented in Figure 8.

Figure 8: Flowchart of logic approach to evaluate spectrum quality.

A second approach with a scoring system was designed. The scoring method produces a score from 0 to 1 and assigns spectra to the quality categories as shown in Figure 9.

Figure 9: Flowchart of scoring approach to evaluate spectrum quality

A third, mixed method was designed to evaluate spectral quality. This method first classifies spectra using the logical approach. Each classification is assigned a logical score that is then added to the weighted result of the scoring approach. The final score ranges between 0 and 1. This is shown in Figure 10.

Figure 10: Flowchart of mixed approach to evaluate spectrum quality.

2.3. Implementation

All software was implemented using Python 3.0. The library pyOpenMS (Röst, et al., 2014) was used for reading mzXML and mzML files. The graphic interface was created with Tkinter², a built-in Python package, as it is a simple graphic interface toolkit and has a large amount of documentation.

2.3.1. Logical approach to spectrum quality evaluation

The logical algorithm was implemented in Python in the function “determine_quality_logical”. The implementation is presented in Figure 11.

lower_signal_noise_ratio = 6
medium_signal_noise_ratio = 50

if number_strong_coelution > 1:
    quality_out = 1  # Very bad
# Has_coelution: Yes
elif number_strong_coelution > 0:
    if (number_weak_coelution > 0 or number_strong_crosstalk > 0
            or number_weak_crosstalk > 0 or number_strong_reference_masses > 0
            or number_weak_reference_masses > 0
            or signal_noise_ratio < lower_signal_noise_ratio):
        quality_out = 1  # Very bad
    else:
        quality_out = 2  # Bad
# Has_coelution: Weak
elif number_weak_coelution > 0:
    if number_strong_crosstalk > 1 or number_strong_reference_masses > 1:
        quality_out = 1  # Very bad
    elif (number_weak_coelution > 1 or number_weak_reference_masses > 0
            or number_weak_crosstalk > 0):
        quality_out = 2  # Bad
    else:
        quality_out = 3  # Regular
# Has_coelution: No
else:
    if (number_strong_reference_masses + number_weak_reference_masses
            + number_weak_crosstalk + number_strong_crosstalk > 0):
        quality_out = 3  # Regular
    elif signal_noise_ratio < medium_signal_noise_ratio:
        quality_out = 4  # Good
    else:
        quality_out = 5  # Very good

Figure 11: Logical approach to evaluate spectrum quality pseudocode.

2 https://docs.python.org/3/library/tkinter.html

2.3.2. Scoring approach to spectrum quality evaluation

The scoring approach defined quality with a weighted score. The best score was defined as 1 and the worst as 0. The S/N score was computed using the method in Figure 12.

gradient = 1/100
threshold = 100

if signal_noise_ratio >= threshold:
    score = 1
else:
    n = 1 - (threshold * gradient)
    score = signal_noise_ratio * gradient + n

Figure 12: S/N score pseudocode

This method makes the S/N score range between 0 and 1, where a greater S/N corresponds to a greater score. The way we measure noise ensures that S/N will not be smaller than 1, so negative values are not observed. Values of S/N greater than the threshold receive a score of 1. Figure 13 represents the linear function used, for S/N values from 1 to 110.

Figure 13: S/N score for S/N values 1-110.
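For reference, the S/N scoring rule of Figure 12 can be wrapped in a small function and checked with a few values. The function name is hypothetical. With the default values the intercept `n` is 0, so the score is simply S/N divided by 100 up to the threshold.

```python
def signal_noise_score(signal_noise_ratio, threshold=100.0, gradient=1 / 100):
    """Linear S/N score that saturates at 1 above the threshold."""
    if signal_noise_ratio >= threshold:
        return 1.0
    n = 1 - threshold * gradient          # intercept of the line (0 by default)
    return signal_noise_ratio * gradient + n


for value in (1, 50, 100, 110):
    print(value, signal_noise_score(value))
```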

Inaccuracies’ score was defined using the source code shown in Figure 14:

# Inaccuracies score
medium_gradient = -1/100

# We define thresholds
lower_threshold_percentage = 10
medium_threshold_percentage = 40
upper_threshold_percentage = 50

# We define parameters for the medium line equation
n_medium = 1 - lower_threshold_percentage * medium_gradient

# We define parameters for the upper line equation
medium_threshold_score = medium_gradient * medium_threshold_percentage + n_medium
upper_gradient = (0 - medium_threshold_score) / (
    upper_threshold_percentage - medium_threshold_percentage)
n_upper = -upper_threshold_percentage * upper_gradient

# Percentage of most intense peak
coelution_intensity_percentage = (
    coelution_peak_intensity / most_intense_peak_intensity) * 100

# We obtain score with line equations
if coelution_intensity_percentage < lower_threshold_percentage:
    score = 1
elif coelution_intensity_percentage < medium_threshold_percentage:
    score = coelution_intensity_percentage * medium_gradient + n_medium
elif coelution_intensity_percentage < upper_threshold_percentage:
    score = coelution_intensity_percentage * upper_gradient + n_upper
else:
    score = 0

Figure 14: Inaccuracies score pseudocode

In Figure 14, “coelution_peak_intensity” is the inaccuracy peak intensity. The variable “most_intense_peak_intensity” refers to the intensity of the peak against which the inaccuracy peak intensity is expressed. In the case of crosstalk and reference masses, this value is the intensity of the most intense peak in the MS² spectrum, while for coelution of isobaric species it is the precursor ion intensity in the MS spectrum. This gives a relative intensity value for each inaccuracy peak (“coelution_intensity_percentage”) that is assessed through thresholds. The variable “lower_threshold_percentage” represents the threshold below which relative inaccuracy intensity is not significant, thus returning a score of 1. If relative inaccuracy intensity is between the lower and the medium threshold, the score decreases linearly with a gradient of -1/100 (“medium_gradient”). If relative inaccuracy intensity is above the medium threshold, the score decreases more sharply until it reaches the upper threshold. Relative inaccuracy intensity above the upper threshold returns the lowest score, 0. This is more clearly described in Figure 15, where the inaccuracy score is represented for relative inaccuracy values from 1 to 100.

Figure 15: Inaccuracy score for relative intensity values 1-100.
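The piecewise rule of Figure 14 can be checked with a short function. The function name is hypothetical, but the thresholds (10, 40 and 50%) and the medium gradient (-1/100) are the values given in Figure 14. With them the score is 1 below 10%, falls to 0.7 at 40% and reaches 0 at 50%.

```python
def inaccuracy_score(relative_intensity, lower=10.0, medium=40.0, upper=50.0,
                     medium_gradient=-1 / 100):
    """Score of one inaccuracy peak from its relative intensity (in %)."""
    n_medium = 1 - lower * medium_gradient               # medium line intercept
    medium_score = medium_gradient * medium + n_medium   # score at the medium threshold
    upper_gradient = (0 - medium_score) / (upper - medium)
    n_upper = -upper * upper_gradient                    # upper line intercept

    if relative_intensity < lower:
        return 1.0
    if relative_intensity < medium:
        return relative_intensity * medium_gradient + n_medium
    if relative_intensity < upper:
        return relative_intensity * upper_gradient + n_upper
    return 0.0


for percentage in (5, 25, 40, 45, 60):
    print(percentage, round(inaccuracy_score(percentage), 2))
```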

There may be several inaccuracies in a spectrum and each one produces an individual score. These scores, along with the S/N score, are then weighted following the next formula, where “i” stands for inaccuracy; the weight for coelution of isobaric species and crosstalk is 0.8 and the weight for coelution of reference masses is 0.7.

$$\mathrm{score} = i_{\mathrm{weight}} \cdot i_{\mathrm{score}} + (1 - i_{\mathrm{weight}}) \cdot S/N_{\mathrm{score}}$$

The implementation of the scoring method for spectral quality evaluation is shown in Figure 16.

# Weight of coelution of isobaric species and crosstalk scores
weight_coelution = 0.8
# Weight of coelution of reference masses
weight_reference_masses = 0.7

# Coelution of isobaric species and crosstalk scores
crosstalk_coelution_scores = score_coelution_peaks + score_crosstalk_peaks
if len(crosstalk_coelution_scores) > 0:
    min_crosstalk_coelution = min(crosstalk_coelution_scores)
    weighted_coelution_score = weight_coelution * min_crosstalk_coelution
else:
    weighted_coelution_score = weight_coelution * 1

# Coelution of reference masses scores
if len(score_reference_masses_peaks) > 0:
    min_reference_masses = min(score_reference_masses_peaks)
    weighted_reference_masses_score = weight_reference_masses * min_reference_masses
else:
    weighted_reference_masses_score = weight_reference_masses * 1

# Calculate final score: the worse of the two weighted scores
final_score_coelution = (weighted_coelution_score
                         + signal_noise_ratio_score * (1 - weight_coelution))
final_score_reference_mass = (weighted_reference_masses_score
                              + signal_noise_ratio_score * (1 - weight_reference_masses))
final_score = min(final_score_coelution, final_score_reference_mass)

Figure 16: Scoring approach to evaluate spectrum quality pseudocode
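A compact version of the weighting, with a worked example, is sketched below. The function and argument names are hypothetical, while the weights are the ones stated in the text. The worst (minimum) inaccuracy score of each group is combined with the S/N score and the lower of the two results is kept.

```python
def scoring_evaluation(coelution_scores, crosstalk_scores, reference_scores,
                       signal_noise_score, weight_coelution=0.8,
                       weight_reference_masses=0.7):
    """Weighted spectrum score in the range 0 to 1."""
    # A group without inaccuracies counts as a perfect score of 1.
    worst_coelution = min(list(coelution_scores) + list(crosstalk_scores), default=1.0)
    worst_reference = min(reference_scores, default=1.0)

    final_coelution = (weight_coelution * worst_coelution
                       + (1 - weight_coelution) * signal_noise_score)
    final_reference = (weight_reference_masses * worst_reference
                       + (1 - weight_reference_masses) * signal_noise_score)
    return min(final_coelution, final_reference)


# One coelution scored 0.85, one reference mass scored 0.35, S/N score 0.5:
# 0.8 * 0.85 + 0.2 * 0.5 = 0.78 and 0.7 * 0.35 + 0.3 * 0.5 = 0.395
example = scoring_evaluation([0.85], [], [0.35], 0.5)
```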

2.3.3. Mixed approach to spectrum quality evaluation

The mixed approach to spectrum quality evaluation combines both the logical and the scoring approaches. The implementation of this method is straightforward, simply converting the logical output into a score, weighting the scoring evaluation and adding them. The source code is shown in Figure 17.

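logical_evaluation_score = (logical_evaluation_output - 1) * 0.2
weighted_scoring_evaluation = scoring_evaluation_output * 0.2
final_score = logical_evaluation_score + weighted_scoring_evaluation

if final_score > 0.8:
    quality = 5  # Very good
elif final_score > 0.6:
    quality = 4  # Good
elif final_score > 0.4:  # condition missing in the source; 0.4 inferred from the 0.2 spacing
    quality = 3  # Regular
elif final_score > 0.2:  # condition missing in the source; 0.2 inferred from the 0.2 spacing
    quality = 2  # Bad
else:
    quality = 1  # Very bad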

Figure 17: Mixed approach to evaluate spectrum quality pseudocode
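The mixed method can be exercised with a small function. The function name is hypothetical, and the 0.4 and 0.2 cut-offs for “regular” and “bad” are an assumption that continues the 0.2 spacing of the 0.8 and 0.6 cut-offs. The logical level contributes 0 to 0.8 in steps of 0.2 and the scoring result contributes up to 0.2, so the score mainly orders spectra within a logical level.

```python
def mixed_evaluation(logical_evaluation_output, scoring_evaluation_output):
    """Combine the logical level (1 to 5) and the scoring result (0 to 1).

    Returns (final_score, quality), where quality is again a level from 1 to 5.
    """
    logical_score = (logical_evaluation_output - 1) * 0.2
    final_score = logical_score + scoring_evaluation_output * 0.2

    # Cut-offs below 0.6 are assumed (see the note above).
    for cutoff, quality in ((0.8, 5), (0.6, 4), (0.4, 3), (0.2, 2)):
        if final_score > cutoff:
            return final_score, quality
    return final_score, 1


# Two spectra in the same logical level can still be ordered by their score.
score_a, quality_a = mixed_evaluation(3, 0.9)
score_b, quality_b = mixed_evaluation(3, 0.5)
```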

2.3.4. Graphical User Interface

For the graphical user interface (GUI), we decided that it would be clearer for the user to reduce our quality categories from 5 to 3. Since quality levels are not standardized and there are no clear guidelines to distinguish between them, 3 levels indicating good, regular and bad are clearer.

The GUI was implemented with two main components: (1) a main window to select input, output and m/z of reference masses and to execute the tool in the run tab; and (2) an options tab to select variables in the mixed method algorithm. The results are presented in a new window that is shown when the tool finishes processing the spectra.

The run tab in the main window has four entry options: spectra folder, input file, output file and reference masses file. Paths of these parameters can be copied into the entry and saved using the “Save” button or browsed in the device utilizing the “browse” button. Information can be read from common plain text data formats such as csv and txt. An image of the main window is shown without filling (left) and filled (right) in Figure 18.

Figure 18: Run tab of main window of graphic interface, unfilled (left) and filled (right).

Spectra folder is the directory in which metabolomics raw data experiments are saved in the device. Input file is the file in which information about the specific spectra is detailed. MS² spectrum information must include the name of the experiment file, RT window (lower RT and upper RT) and m/z window (precursor m/z and delta m/z) in which it is found. Information is presented in csv format. An example of input file is illustrated in Figure 19.

Figure 19: Input file example for graphic interface.

Output file is the file in which results are written in csv format; it can already exist or be created at the inputted path. Output information includes experiment raw data file name, assessed quality (as string and integer), calculated score, S/N ratio value and number of inaccuracies present. An example of output file is given in Figure 20.

Figure 20: Output file example for graphic interface.

Reference mass file refers to the file in which the m/z of reference masses that the user wants to input into the tool are saved. Saving or browsing this file inserts the m/z into the ScrolledText widget below (white box below “Reference masses file” label in Figure 18). The m/z of reference masses can also be written or pasted into the ScrolledText widget. The m/z must be written in csv format (separated by semicolon). Only the m/z of reference masses present in the ScrolledText widget are the ones utilized by the tool.
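Reading the semicolon-separated m/z values could be done along these lines. This is only a sketch: the function name is hypothetical and the handling of line breaks and empty entries is an assumption, as the text only states that values are separated by semicolons.

```python
def parse_reference_masses(text):
    """Parse semicolon-separated m/z values into a list of floats."""
    values = []
    # Line breaks are treated like separators so pasted multi-line text works.
    for token in text.replace("\n", ";").split(";"):
        token = token.strip()
        if token:                      # skip empty entries such as a trailing ';'
            values.append(float(token))
    return values


# Illustrative input; a malformed entry would raise ValueError.
reference_mzs = parse_reference_masses("121.0509; 922.0098;\n")
```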

The options tab in the main window has all the algorithm parameters that can be changed in the tool. When an option field loses focus, its format (all values are positive integers or decimals) is evaluated and, if wrong, an error is reported. The options tab, filled with default values, is showcased in Figure 21.

Figure 21: Options tab of main window of graphic interface.

The results window presents a summary of the tool output. It presents the score, assessed quality and basic information of each spectrum. Every spectrum has a button below the “Number” label that can be pressed to access more detailed information, such as S/N and the m/z and intensity of inaccuracies, in a new window. The results window (left) and the details of a “Bad” quality spectrum (right) can be seen in Figure 22.

Figure 22: Results window and detailed information window of graphic interface.

CHAPTER 3: Results and discussion

3.1. S/N measurement

Using the third approach for noise measurement, we calculated the precision and recall of the detected peaks compared to the annotated peaks from the gold standard. The best multiplier constant that we found was 11, with 65% precision and 93% recall. The average precision and recall over all gold standard spectra resulting from the direct increment strategy are represented in Figure 23.

Figure 23: Precision and recall of direct increment approach in grass level calculation.
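Precision and recall for one spectrum could be computed as in the sketch below. It assumes that a peak is predicted as signal when its intensity is above the grass level and that the gold standard marks each peak as signal or noise. The function name and the data layout are hypothetical.

```python
def precision_recall(intensities, annotated_as_signal, grass_level):
    """Precision and recall of 'intensity above grass level' as a signal test."""
    predicted = [intensity > grass_level for intensity in intensities]
    pairs = list(zip(predicted, annotated_as_signal))

    true_positives = sum(1 for p, a in pairs if p and a)
    false_positives = sum(1 for p, a in pairs if p and not a)
    false_negatives = sum(1 for p, a in pairs if not p and a)

    detected = true_positives + false_positives
    annotated = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / annotated if annotated else 0.0
    return precision, recall


# Illustrative values only.
precision, recall = precision_recall(
    [5, 20, 50, 100, 8], [False, False, True, True, True], grass_level=10)
```

Raising the multiplier constant raises the grass level, which tends to trade recall for precision; averaging both values over all spectra for each constant gives curves of the kind described for Figure 23.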

3.2. Comparison of quality evaluation methods

The quality assessed by the scoring method and the annotated quality of the gold standard spectra are shown in Figure 24, ordered by annotated quality. We can observe that the assessed quality differs from the annotated quality. Some “bad” (quality level 2) and “very bad” (quality level 1) spectra were classified as “good” (quality level 4). Square distance between gold standard and assessed quality vectors was 45, while Euclidean distance was 6.71. Error rate was 54.29% and success rate was 45.71%.

Figure 24: Scoring method assessed quality and gold standard quality of spectra.

The quality assessed by the logical and mixed approaches and the annotated quality of the gold standard spectra are shown in Figure 25, ordered by annotated quality. There are differences between annotated and assessed quality. However, these differences are of just one quality level. There are no instances of “good” (quality level 4) or “very good” (quality level 5) spectra classified as “bad” (quality level 2) or “very bad” (quality level 1), or vice versa. Square distance between gold standard and assessed quality vectors was 17, while Euclidean distance was 4.12. Error rate was 40% and success rate was 60%.

Figure 25: Mixed and logical methods assessed quality and gold standard quality of spectra.
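The comparison metrics used in this section can be reproduced with a few lines. The function name and the example vectors are hypothetical and do not correspond to the gold standard data.

```python
import math


def compare_quality(gold_standard, assessed):
    """Square distance, Euclidean distance, error rate (%) and success rate (%)."""
    square_distance = sum((g - a) ** 2 for g, a in zip(gold_standard, assessed))
    euclidean_distance = math.sqrt(square_distance)
    errors = sum(1 for g, a in zip(gold_standard, assessed) if g != a)
    error_rate = 100 * errors / len(gold_standard)
    return square_distance, euclidean_distance, error_rate, 100 - error_rate


# Illustrative quality vectors on the 5-level scale.
gold = [1, 2, 3, 4, 5]
assessed = [1, 3, 3, 2, 5]
square, euclidean, error_rate, success_rate = compare_quality(gold, assessed)
```

Since the Euclidean distance is the square root of the square distance, the reported pairs agree with each other: the square root of 45 is about 6.71 and the square root of 17 is about 4.12.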

These results show that the logical and mixed approaches are more effective at assessing spectral quality. The problem with the logical method is that spectra within the same category cannot be distinguished from one another. Thus, the mixed method was selected for this tool because it allows ordering the spectra.

In the graphic interface, we decided to change the number of categories from 5 to 3. The new categories are “good” (quality level 3), “regular” (quality level 2) and “bad” (quality level 1). The results are illustrated in Figure 26. The differences between annotated and assessed quality decrease. There are no instances of “good” spectra classified as “bad” (quality level 1) or vice versa. Square distance between gold standard and assessed quality vectors was reduced to 5, while Euclidean distance decreased to 2.24. Error rate was 14.29% and success rate was 85.71%.

Figure 26: Mixed and logical methods assessed quality and gold standard quality of spectra with three categories.

Reducing the categories from 5 to 3 gives more accurate results. The downside is that information is lost as the number of categories is reduced, but this is less relevant when a score is available because spectra can still be ordered. In addition, categorization into fewer categories makes results clearer to users.

CHAPTER 4: Conclusions and Future work

4.1. Conclusions

• After a literature review of the state of the art, it was found that there is no relevant tool to qualitatively assess MS² spectra for metabolomics experiments.

• Three algorithms were designed and implemented with logical, scoring and mixed approaches. Methods to automatically detect inaccuracies in MS² spectra and to measure the S/N ratio were implemented. Noise measuring was completed with a 65% precision and a 95% recall.

• When comparing with the gold standard, the mixed and logical approaches proved to be more effective at MS² spectra quality assessment, with a success rate of 60% in 5 categories. A reduction in the number of categories proved to be more efficacious, with an 85% success rate. Therefore, this new system using 3 levels was implemented for the default algorithm in the graphical interface.

• A GUI was completed using Tkinter². The tool was uploaded to GitHub³ so that it can be parametrized. We expect the tool to be useful for researchers who want to extract identifiable features from an MS² metabolomic experiment.

4.2. Future work

• Testing of the tool by real users is the next step. The formal validation is key to ensure the success of the software developments.

• Further development in the validation of the tool could be using a larger number of annotated (gold standard) spectra for a more robust result.

• A machine learning approach could be developed when enough annotated spectra are available.

3 https://github.com/Manuel-Felipe-Fernandez/MS2_evaluator

References

Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G., 2008. XCMS2: Processing Tandem Mass Spectrometry Data for Metabolite Identification and Structural Characterization. Analytical Chemistry, 80(16), pp. 6382-6389.

Djoumbou-Feunang, Y. et al., 2019. CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification. Metabolites, 9(4), p. 72.

Fiehn, O., 2001. Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks. Comparative and Functional Genomics, Volume 2, pp. 155-168.

Fiehn, O., 2002. Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology, Volume 48, pp. 155-171.

Flikka, K. et al., 2006. Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics, Volume 6, pp. 2086-2094.

Gil-de-la-Fuente, A. et al., 2019. CEU Mass Mediator 3.0: a metabolite annotation tool. Journal of Proteome Research, Volume 18, pp. 797-802.

Hollywood, K., Brison, D. R. & Goodacre, R., 2006. Metabolomics: Current technologies and future trends. Proteomics, Volume 6, pp. 4716-4723.

Horning, E. C. & Horning, M. G., 1971. Human metabolic profiles obtained by GC and GC/MS. Journal of Chromatographic Science, Volume 9, pp. 128-140.

Katajamaa, M. & Oresic, M., 2007. Data processing for mass spectrometry-based metabolomics. Journal of Chromatography A, Volume 1158, pp. 318-328.

Kell, D. B. et al., 2005. Metabolic footprinting and systems biology: the medium is the message. Nature Reviews Microbiology, Volume 3, pp. 557-565.

Koenig, T. et al., 2008. Robust Prediction of the MASCOT Score for an Improved Quality Assessment in Mass Spectrometric Proteomics. Journal of Proteome Research, Volume 7, pp. 3708-3717.

Neumann, S. & Böcker, S., 2010. Computational mass spectrometry for metabolomics: Identification of metabolites and small molecules. Analytical and Bioanalytical Chemistry, Volume 398, pp. 2779-2788.