
This study has been published [82]:  Publication V

Bauer T, Eils R, and König R: RIP: The regulatory interaction predictor - a machine learning based approach for predicting target genes of transcription factors. Bioinformatics. 2011 Aug 15;27(16):2239-47. Epub 2011 Jun 20

6.1. Motivation

When I started my PhD, I was excited by the idea of reconstructing a genome-scale human regulatory network that would elucidate the means by which cell signaling drives the dynamics of gene expression. The applications of such a network would be enormous. It would be possible to trace observed changes in mRNA levels back to their source, i.e. the controlling elements (TFs and upstream signaling pathways). In pathogenesis, causative molecular mechanisms could be extrapolated and their elements targeted in therapy, to name only one exciting application. Throughout my cooperation projects I worked on gene expression profiling and follow-up analyses with the aim of understanding the molecular alterations behind carcinogenesis. I learned how large-scale promoter analyses can identify potential TFs in control of transcriptional changes. However, the PWM scan technique, even though applicable with good results as demonstrated in previous chapters, tends to produce large numbers of false positives, which reduces the precision of the predictions considerably.

The core elements of regulatory networks are TFs, target genes, and regulatory interactions (RIs) between them. Several approaches have been developed to reconstruct regulatory networks on different scales and model organisms (reviewed in [18,19]), but essentially they have not achieved satisfactory results in the attempt to realize the idea I described above. Major issues of most present methods are:

a) Statistics for inferring RIs based on questionable assumptions of gene regulation and/or missing validation of the assumptions in the used data (proof of principle).

b) Improper transfer of gene regulatory principles from prokaryotes to eukaryotes.

c) High computational demands of the models that drastically limit the number of included network components.

d) Over-simplification, i.e. the use of gene expression data only instead of including data representing different aspects of gene regulation.

e) Lack of an objective true positive set (True RIs) to estimate the performance or to validate the findings.

f) Insufficient precision (high false positive rate) or insufficient recall (low re-discovery of known RIs).

To date, a lot of attention and effort is still focused on providing solutions to this unresolved major task of systems biology. In this chapter, I will describe the method I have developed to contribute to the realization of this idea. I will illustrate the achieved improvements and provide examples of successful application of my method.

Most current algorithms for large-scale RI inference are based solely on gene expression data and assume a direct relationship between the gradients of TF mRNA and its target genes. While this assumption may be true for a large number of TFs in prokaryotes, it is not met by many TFs in human, where post-translational modifications affect TF activity or degradation kinetics [33]. Unfortunately, techniques to measure protein quantity and kinetics in high throughput are at best in a developmental stage, and data are available only in insufficient numbers. I therefore developed an in silico method to compensate for this limitation.

A long-standing biological concept is the principle that genes that share biological functionality are co-expressed, and that this co-expression is achieved by co-regulation. So instead of considering statistics between TF mRNA gradients and potential target genes, I analyzed statistics of co-regulated target gene sets and subsequently deduced their regulatory TFs from known RIs, thereby overcoming shortcomings of conventional methods and the lack of protein data. Human gene expression data covering a large spectrum of biological conditions are available in abundance, and thus I conducted a correlation meta-analysis of thousands of gene expression profiles to identify co-expressed genes in a large number of primary human tissues. Additionally, I analyzed gene promoters employing comprehensive PWM scans to acquire putative TF binding data that, unlike e.g. ChIP data, are not biased by experimental conditions. Finally, I extracted a considerable number of RIs identified in published experiments that were assembled in the Transfac database [2]. Our concept was to have a machine learning classifier learn the trends of correlation and TFBS enrichments within RIs known to be co-regulated and then predict RIs on a genome-wide scale to discover new RIs. For this purpose, we defined 10 elaborate features (quantifiable characteristics) that combined the results of correlation and PWM analyses of known RIs. I trained numerous SVMs with the features of defined training sets and performed cross-validations to estimate the quality of the predictions. I eventually combined all SVMs into one master classifier termed “regulatory interaction predictor” (RIP) that achieved good recall and precision. RIP was then used to predict RIs between 303 TFs and 13 069 genes. The predictions were validated by pathway analysis and with an independent RI database, and were further applied to a (published) in vivo study on interferon α (IFNα) signaling in monocytes to identify key TFs affected by IFNα induction.

6.2. Main Results

6.2.1. Training machine learning classifiers to predict TF target genes – the workflow

The algorithm we developed for our supervised machine learning approach to predict RIs between TFs and target genes is depicted in Figure 6.1. Defining sets of true positives (TP) and true negatives (TN) of sufficient sizes was an essential prerequisite for training the SVMs. I extracted 2896 RIs between 303 TFs and 949 target genes from the Transfac database, which were defined as the TP set (True RIs). Conversely, all other possible combinations of the 303 TFs and 949 genes (= 284 651 unknown RIs) were defined as the TN set (True non-RIs). There may be a number of True RIs within the set of unknown RIs, but even if one assumed that at present only 10% of all RIs had been discovered, the defined True non-RIs would only contain ~26 000 wrongly labeled RIs. Compared to the much larger number of ~258 000 remaining True non-RIs, this would still be acceptable.
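The set sizes quoted above can be re-derived with a few lines of arithmetic. The following is only a sanity check, not part of the published method; the variable names are illustrative, and the 10% figure is the hypothetical discovery rate assumed in the text.

```python
# Counts taken from the text.
n_tfs = 303
n_genes = 949
n_true_ris = 2896

n_pairs = n_tfs * n_genes                # all possible TF-gene combinations
n_true_non_ris = n_pairs - n_true_ris    # combinations without a known RI

# Hypothetical assumption from the text: only 10% of all real RIs are known.
known_percent = 10
n_real_ris_total = n_true_ris * 100 // known_percent
n_mislabeled = n_real_ris_total - n_true_ris       # real RIs hidden in the TN set
n_clean_negatives = n_true_non_ris - n_mislabeled  # correctly labeled non-RIs
mislabel_rate = n_mislabeled / n_true_non_ris

print(n_pairs, n_true_non_ris, n_mislabeled, n_clean_negatives)
print(f"label noise in the TN set: {mislabel_rate:.1%}")
```

Under this assumption the negative set consists of 284 651 pairs (the number also shown in Figure 6.1), of which roughly 26 000, i.e. about 9%, would be mislabeled.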

[Figure 6.1: workflow diagram. Inputs: microarray data (correlation meta-analysis), promoter scan (transcription factor binding sites), and TRANSFAC (database knowledge, network topology of TFs and genes), combined into 10 features. Gold standard: 2896 True RIs and 284 651 True non-RIs, split 20 x by random sampling into training (¾: 2172 and 213 488) and validation (¼: 724 and 71 163). SVM training: 100 x bootstrap sampling yields an ensemble classifier (SVM 1 to SVM 100). 20 x performance estimation. Combined master classifier for prediction of new candidate RIs.]

Figure 6.1 | General workflow of RIP. Features for inferring regulatory interactions (RIs) between TFs and genes were derived from three different aspects: tightly correlated genes identified by meta-analysis of gene expression profiles, TF binding site predictions, and database content of co-regulated genes from the training set (gold standard). The information of the gold standard was also used to define True RIs and True non-RIs. For training of Support Vector Machines (SVMs), True RIs and True non-RIs were divided into a training set and a validation set. An equal number of True RIs and True non-RIs were randomly drawn (by bootstrapping) 100 times and used to train 100 different SVMs, yielding one ensemble classifier. Each ensemble classifier was evaluated with its validation set. This procedure was repeated 20 times, yielding an averaged estimate of the classifiers' performance. The classifiers were combined into one master classifier (RIP) containing 2000 SVMs, which was applied to predict new RIs.
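The sampling scheme in the caption can be sketched in plain Python. This is a minimal illustration of the ¾ / ¼ split and the balanced bootstrap only; it does not train any SVM. The function names are hypothetical, and the size of each bootstrap draw (as many True RIs as the training set contains, with replacement) is an assumption, since the caption only states that equal numbers of both classes were drawn.

```python
import random

def split_indices(n, train_fraction, rng):
    """Shuffle range(n) and split it into a training and a validation part."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * train_fraction)
    return idx[:cut], idx[cut:]

def balanced_bootstrap(pos, neg, rng):
    """Draw len(pos) positives and len(pos) negatives, each with replacement.
    Assumption: the draw size is not specified in the text."""
    k = len(pos)
    return rng.choices(pos, k=k), rng.choices(neg, k=k)

def sampling_plan(n_pos, n_neg, n_repeats=20, n_svms=100, seed=0):
    """Yield, per repeat, the bootstrap draws for one ensemble classifier
    together with the held-out validation indices."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        pos_train, pos_val = split_indices(n_pos, 0.75, rng)
        neg_train, neg_val = split_indices(n_neg, 0.75, rng)
        # One balanced draw per SVM; an SVM would be trained on each draw.
        draws = [balanced_bootstrap(pos_train, neg_train, rng)
                 for _ in range(n_svms)]
        yield draws, (pos_val, neg_val)

# One repeat with the set sizes from Figure 6.1.
for draws, (pos_val, neg_val) in sampling_plan(2896, 284651, n_repeats=1):
    pos_draw, neg_draw = draws[0]
    print(len(draws), len(pos_draw), len(neg_draw), len(pos_val), len(neg_val))
```

With 20 repeats of 100 draws each, this plan accounts for the 2000 SVMs of the master classifier. Balancing each draw keeps the roughly 100-fold excess of True non-RIs from dominating the training of any single SVM.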

We then needed to describe RIs by quantifiable features that reflect characteristics of regulatory influence of TFs on target genes. We based these features on two assumptions:

1) Gene sets that are involved in a common biological process are co-regulated. Common TFs should thus control these gene sets under specific conditions, and these genes should frequently show correlation on the mRNA level.

2) Gene sets directly regulated by a common TF (TF-modules) ought to possess (enrichments of) corresponding TFBSs in their promoter sequences.

We deduced 10 features from these assumptions by a) analyzing correlation of gene pairs in 4064 human gene expression profiles from 76 biological conditions (e.g. tumor type, tissue type, etc.), b) conducting genome-wide PWM scans, and c) using statistical descriptors of network structure arising from the True RIs. Before training the SVMs, I tested if the assumed principles underlying our features were reflected by the data.
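As an illustration of step a), the sketch below applies the two filters that Section 6.2.2 defines (CC, a minimum absolute correlation coefficient, and FoC, the minimum fraction of conditions in which it must be reached) and computes the Functional Similarity score described there. It is a toy example under stated assumptions: the function names, expression values, and term labels are invented; correlations are assumed to be computed separately within each condition; and a constant profile is treated as uncorrelated.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def passes_filter(profiles_a, profiles_b, cc, foc):
    """True if |r| >= cc in at least a fraction foc of the conditions.
    profiles_a / profiles_b hold one expression vector per condition."""
    hits = sum(abs(pearson(a, b)) >= cc
               for a, b in zip(profiles_a, profiles_b))
    return hits / len(profiles_a) >= foc

def fs_score(pairs, go_terms):
    """Percentage of gene pairs sharing at least one selected GO term."""
    if not pairs:
        return 0.0
    shared = sum(bool(go_terms.get(g1, set()) & go_terms.get(g2, set()))
                 for g1, g2 in pairs)
    return 100.0 * shared / len(pairs)

# Invented toy data: three genes, four conditions, three samples each.
expr = {
    "A": [[1, 2, 3], [2, 4, 6], [1, 3, 2], [5, 4, 3]],
    "B": [[2, 4, 6], [1, 2, 3], [3, 1, 2], [3, 4, 5]],
    "C": [[3, 1, 2], [1, 3, 2], [2, 2, 5], [1, 5, 1]],
}
go_terms = {"A": {"term1"}, "B": {"term1", "term2"}, "C": {"term3"}}

candidates = [("A", "B"), ("A", "C"), ("B", "C")]
selected = [(g1, g2) for g1, g2 in candidates
            if passes_filter(expr[g1], expr[g2], cc=0.8, foc=0.35)]
print(selected, fs_score(selected, go_terms))
```

Because the filter uses the absolute value of the coefficient, gene pairs that are consistently anti-correlated pass as well as positively correlated ones.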

6.2.2. Genes with correlated gene expression share biological processes

I conducted a correlation meta-analysis by calculating Pearson correlation coefficients for all possible gene pairs within 13 069 genes (all genes represented on the microarray platform Affymetrix HGU133A) in 76 biological conditions. The correlation coefficients were used to select gene pairs at different stringency by applying two filters, CC and FoC. CC was the minimum (absolute) correlation coefficient that was required in a minimum fraction of conditions, FoC. Thus, CC controlled correlation intensity and FoC controlled correlation frequency, and both were applied at different stringency levels. The functional relation of the filtered gene pairs was estimated using selected Gene Ontology annotations (GO, [83]) and a method adapted from [84]. In brief, I selected 81 GO terms that represented a broad range of biological functions and that were still sufficiently specific. The functional relatedness of gene pairs was quantified by the Functional Similarity score (FS-score), which is the percentage of gene pairs sharing at least one selected GO term. Figure 6.2 shows the results for gene pairs filtered at various stringency levels. FS-scores between 14.8% and 58.3% were achieved (stringency parameters CC=0.6 to 0.9, FoC=0.25 to 0.5). For a wide range of cutoffs (selecting ≤5000 genes, see Figure 6.2), the FS-scores increased with higher stringency (up to CC=0.8, FoC=0.35) from 14.8% to 57.3%, which was what we expected assuming that genes sharing biological functions tend to correlate (assumption 1). Interestingly, the FS-score of filtered gene pairs fluctuated to some extent towards the highest stringency levels (<300 selected gene pairs) before recovering and reaching its summit. This behavior resulted from an increased proportion of constitutively expressed gene families (e.g. hemoglobins, histones, immunoglobulins) that

