aiming for maximal growth, the objective is biomass production, which is the rate at which metabolic compounds are converted into biomass constituents, such as nucleic acids, proteins and lipids. Mathematically, an “objective function” is used to quantitatively define how much each reaction contributes to the phenotype. The resulting problem is formulated as a system of linear equations, which flux balance analysis (FBA) solves using linear programming (see Section 2.4.4). By simulating the whole reconstructed metabolic network of an organism of interest, we obtain the wildtype growth rate under specific flux bounds and metabolic constraints (or conditions). In a knockout simulation, a single gene or reaction is deleted under the same conditions by constraining its corresponding fluxes to zero, and the mutant’s growth rate is computed and compared to the wildtype’s. A knocked-out gene or reaction is predicted as essential under the given condition if the mutant model yields much lower biomass production than the wildtype. Flux balance analysis is a widely used and well-established method for assessing the essentiality of genes [20, 49, 55, 117]. For example, an FBA using the COBRA toolbox [20] and a newly reconstructed metabolic network of E. coli yielded 92% accuracy when predicting the essentiality of genes [55] under aerobic glucose conditions (modeled by limiting the glucose uptake rate) and 88% accuracy under rich nutrient conditions. However, FBA approaches need clear definitions of nutrient availability and biomass production under specifically given environmental conditions, and it is difficult to characterize the uptake rates for each compound of a rich medium, especially in environments such as the host gut colonized by intestinal pathogens (for a good overview of these aspects see [56, 144]).
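As a sketch of the linear program behind FBA, the standard textbook form can be written as follows. The symbols are notation assumed here for illustration and are not taken from this thesis: S is the stoichiometric matrix, v the flux vector, c the objective coefficients (selecting the biomass reaction), v^min and v^max the flux bounds, mu the optimal biomass flux, and tau an illustrative essentiality threshold. Section 2.4.4 describes the formulation actually used.

```latex
\begin{align}
  \max_{v}\quad & c^{\top} v \\
  \text{s.t.}\quad & S\,v = 0, \\
  & v^{\min}_j \le v_j \le v^{\max}_j \quad \text{for all reactions } j .
\end{align}
% Knockout of reaction k: set v^{\min}_k = v^{\max}_k = 0 and solve again.
% Essentiality call (the threshold \tau is an assumed, illustrative parameter):
\[
  \frac{\mu_{\text{mutant}}}{\mu_{\text{wildtype}}} < \tau
  \;\Longrightarrow\; \text{reaction } k \text{ is predicted essential.}
\]
```

In this notation, a growth condition such as aerobic glucose corresponds to a particular choice of bounds on the uptake fluxes, which is why FBA needs the uptake rates of the medium to be specified.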
1.6 Main contributions of this thesis
In the following, I summarize the main contributions of this thesis:
• Analysis of metabolic networks
We developed an algorithm to examine the ability of the metabolic network to obtain the products of a knocked-out reaction from its substrates via alternative pathways. Each reaction in the network was deleted (knocked out in silico) in turn. A breadth-first search algorithm then tested whether the neighboring compounds of the knocked-out reaction could still be produced by other reactions and pathways of reactions. With this approach, we tested whether deviations in the network could replace the knocked-out reaction (see the method in Section 2.4.3 and the results in Section 3.1).
This was successfully applied to detect potential drug targets for Plasmodium falciparum [54] and used as one of the descriptors in our other investigations. We devised this method and reported it for the first time in our article [54]. Furthermore, other descriptors based on metabolic networks, genomic data and transcriptomic data have been analyzed and examined for their potential to identify drug targets; these are described in Sections 2.4 and 2.5.
• Machine learning based approach to integrate the descriptors
In this thesis, we developed a machine learning workflow that integrates a large variety of descriptors to identify drug targets. First, the metabolic network was constructed using various qualitative and quantitative information from public databases and the literature (see Sections 2.2 and 2.3). Using machine learning (explained in Section 2.7), a large set of features (explained in Sections 2.4 and 2.5) was then integrated and used to classify gene essentiality. The results showed that our methods can be used to detect potential drug targets in pathogens and that they are feasible for validating experimental knockout data (see Sections 3.2.3 and 3.2.4). This integrated approach achieved 79% sensitivity and 97% specificity, which matched or exceeded the values achieved by flux balance analysis (sensitivity: 51%, specificity: 97%, see Section 3.2.2). It is worth noting that, in contrast to FBA, our approach does not depend on any experimental information beyond the essentiality data serving as the gold standard, nor on an elaborate literature study. Furthermore, we show that the method can be used to predict the essential genes of a query organism using the experimental information about essentiality from a related bacterial reference organism (see Section 3.3.1).
The results of our research have been published in a peer-reviewed conference proceedings article [128] and two original journal articles [125, 127] in a journal with a good reputation in our field of systems biology. Additionally, we described our approach in a book chapter [126]. The developed approach was also used in other related projects [53, 54].
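The deviation test from the first contribution above can be illustrated with a minimal sketch. It is not the implementation from Section 2.4.3: the data layout (a dictionary mapping a reaction name to its substrate and product sets), the function names, and the toy network are all hypothetical. It also makes a simplifying assumption, namely that a product counts as reachable as soon as any one substrate of a producing reaction is reachable.

```python
from collections import deque


def reachable_compounds(reactions, start, knocked_out):
    """Breadth-first search over compounds, skipping the knocked-out reaction."""
    # Build a substrate -> products adjacency map from all remaining reactions.
    adjacency = {}
    for name, (substrates, products) in reactions.items():
        if name == knocked_out:
            continue
        for substrate in substrates:
            adjacency.setdefault(substrate, set()).update(products)

    seen = set(start)
    queue = deque(start)
    while queue:
        compound = queue.popleft()
        for nxt in adjacency.get(compound, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def has_deviation(reactions, knocked_out):
    """True if every product of the knocked-out reaction is still reachable
    from its substrates via the remaining reactions."""
    substrates, products = reactions[knocked_out]
    reached = reachable_compounds(reactions, substrates, knocked_out)
    return set(products) <= reached


# Hypothetical toy network: R1 (A -> B) can be bypassed via R2 and R3.
toy_network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"C"}),
    "R3": ({"C"}, {"B"}),
    "R4": ({"B"}, {"D"}),
}

# Knock out each reaction in turn and record whether a deviation exists.
results = {name: has_deviation(toy_network, name) for name in toy_network}
```

In this toy network only R1 has a deviation, so R2, R3 and R4 would be the candidates flagged by this descriptor.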
Chapter 2
Methods
To integrate a variety of information for the purpose of gaining insight into the essentiality of a gene or protein, topology descriptors of metabolic networks, genomic data and transcriptomic data have been assembled for a machine learning approach. Our approach is based on a collection of methods from the areas of network analysis and machine learning. This chapter first summarizes the general workflow in Section 2.1. An explanation of the data, including the metabolic networks and knockout screens that we used, is given in Section 2.2. The construction of the network is addressed in Section 2.3, followed by the extraction of network descriptors, such as deviations and flux balance analysis features, in Section 2.4. Genomic and transcriptomic analysis features, including homology analysis and gene expression analysis, are explained in Section 2.5. Preprocessing and feature evaluation are explained in Section 2.6. Our classification method and learning techniques are described in Section 2.7. Finally, performance measures are explained in Section 2.8.
2.1 General workflow
An overview of our workflow is shown in Figure 2.1. First, the metabolic networks of the investigated organisms were constructed from biochemical reactions in public databases. For each gene, features of the gene or the corresponding reaction were calculated to describe its topology in the metabolic network and its genomic and transcriptomic relations. These features were then normalized and statistically analyzed by comparing them to essentiality classes (used as a gold standard) taken from experimental genome-wide knockout screens. Next, Support Vector Machines (SVMs) were trained on the features to distinguish between essential and non-essential genes. The trained machines were evaluated and then used as a prediction model for gene essentiality. This model was then applied to identify potential drug targets and to predict new query genes.

Figure 2.1: The workflow. The workflow for the prediction of essential genes by integrating network and genomic information using Support Vector Machines.
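Two steps of the workflow above, feature normalization and evaluation against the gold standard, can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of Sections 2.6 and 2.8: z-score scaling is assumed here because the text does not name the normalization method, and the function names and toy values are hypothetical. SVM training itself is omitted.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale each feature column to zero mean and unit variance.

    Assumption: z-score scaling; constant columns are mapped to 0.0.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(value - mu) / sigma if sigma > 0 else 0.0
         for value, (mu, sigma) in zip(row, stats)]
        for row in rows
    ]


def sensitivity_specificity(y_true, y_pred):
    """Compare predictions with gold-standard labels (1 = essential).

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: two genes with two features each.
scaled = zscore_columns([[1.0, 10.0], [3.0, 10.0]])

# Hypothetical gold-standard labels and classifier predictions for five genes.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

Sensitivity is the fraction of essential genes that are recovered, and specificity is the fraction of non-essential genes that are correctly rejected; these are the two measures quoted for the integrated approach and for FBA in Section 1.6.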