
In this chapter we have explained our aims and the associated challenges, and proposed a combination of steps to overcome them. The pipeline developed is called UNIP (Unique Network Identification Pipeline) and consists of a sequence of steps, each designed to deal with particular characteristics of microarray data. To verify that UNIP robustly and reliably generates unique networks, we tested it on multiple independent synthetic datasets downloaded from a publicly available repository.

We selected three networks with comparable numbers of nodes, chosen so that the total number of variables stays below 100 when the datasets are integrated. This keeps the input dataset computationally tractable for Bayesian network learning.

Figure 4.12: Groups of samples (a) and variables (b) obtained using the biclustering method QuestMotif (Murali & Kasif 2003). Each bar represents a sample-group, indicated by a number on the x-axis. The colours indicate the original network to which the samples in the sample-group truly belong. The y-axis indicates the number of samples in panel (a) and the number of variables in panel (b).

To simulate different conditions and the noise typical of real microarrays, we merge the data together, adding random values. We also perturb the original data to simulate increasing levels of noise, from no noise (0%) through 10% up to 90%. For each level of noise, a GRN is built for each study using Bayesian networks. Given the graphical structure obtained, the similarity measure is calculated and the studies are grouped into study-clusters. Finally, for each study-cluster a consensus network is built first and a unique-network afterwards, and the intra- and inter-cluster prediction accuracy is measured.
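The noise perturbation described above can be illustrated with a minimal sketch. The exact perturbation scheme is not specified here, so the code assumes that a given fraction of the entries in an expression matrix is replaced by values drawn uniformly from the observed data range. The function name `perturb_with_noise` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def perturb_with_noise(data, noise_fraction, seed=0):
    """Return a copy of `data` with a fraction of its entries randomised.

    `data` is a list of rows (samples) of numeric values (variables).
    ASSUMPTION: noise is modelled by overwriting randomly chosen entries
    with uniform draws from the observed [min, max] range; the original
    study may have used a different scheme.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])

    # Observed value range, used as the support of the random replacements.
    flat = [value for row in data for value in row]
    low, high = min(flat), max(flat)

    # Choose which cells to overwrite, without repetition.
    n_noisy = round(noise_fraction * n_rows * n_cols)
    positions = rng.sample(range(n_rows * n_cols), n_noisy)

    noisy = [list(row) for row in data]  # leave the input untouched
    for position in positions:
        row, col = divmod(position, n_cols)
        noisy[row][col] = rng.uniform(low, high)
    return noisy
```

Calling `perturb_with_noise(data, 0.5)` would randomise half of the entries. A full experiment would repeat the call for each noise level of interest and feed every perturbed dataset to the network-learning step.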

The simulated data study indicates that our pipeline works almost perfectly when the input data contains no noise (0%), and behaves in the same way when the noise level increases only slightly, to 10%. It also proved reasonably resilient to noise until 50% of the data is affected. As expected, much of the power is lost when 90% or more of the data is random and therefore carries little information.

Both the network clustering process and the detection of variables that truly belong to the original networks appear robust and fail only at higher levels of noise.

In conclusion we can state that our pipeline appears robust and reliable enough to explore real microarray data.

In the following chapter we apply our method to two sets of real microarray studies: wheat and Fusarium. Unlike synthetic datasets, real data requires a pre-processing step, which may affect the subsequent results. In addition to the prediction accuracy, two tools, Mapman (Thimm et al. 2004) and AIC-MICA (Lysenko et al. 2011), are used to support the biological validation. We will show that the wheat datasets behave similarly to the case of zero or very low noise, while the Fusarium datasets appear noisier, which we attribute to the wheat conditions being more clearly defined.

Analysis of Real Data

5.1 Introduction

In the previous chapter we developed a pipeline called UNIP to semi-automatically identify subnetworks that are specific to a set of conditions. The pipeline takes as input a set of raw, independent microarray datasets (studies) obtained using the same platform, to avoid bias and extra pre-processing. The data is downloaded from public databases such as ArrayExpress (Rustici et al. 2013, Parkinson et al. 2007) and NCBI GEO (Edgar et al. 2002). For each study, the pipeline builds a GRN and uses a network similarity measure to group the studies into study-clusters, with the aim of clustering together studies that belong to similar generic conditions. For example, ‘salt stress’ and ‘drought stress’ both belong to the generic category ‘stress-enriched’ and are therefore clustered together. After a consensus network has been calculated for each study-cluster, the unique-networks (study-specific subnetworks) are derived. Finally, intra- and inter-cluster prediction accuracies are calculated to refine the results.
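The similarity, consensus and unique-network steps summarised above can be sketched as follows. This is only an illustration under stated assumptions, not the UNIP implementation: the similarity measure is assumed to be the Jaccard index on edge sets, a consensus network is assumed to keep edges present in at least a fraction `min_support` of the studies in a cluster, and a unique-network is assumed to be the consensus edges of one cluster that appear in no other cluster's consensus. All function names and parameters are hypothetical.

```python
from collections import Counter


def jaccard(edges_a, edges_b):
    """Jaccard index between two edge sets (ASSUMED similarity measure)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


def consensus_network(networks, min_support=0.5):
    """Edges found in at least `min_support` of the networks (ASSUMED rule)."""
    counts = Counter(edge for network in networks for edge in network)
    return {e for e, n in counts.items() if n / len(networks) >= min_support}


def unique_networks(clusters, min_support=0.5):
    """For each study-cluster, keep the consensus edges absent from the
    consensus networks of all other study-clusters (ASSUMED definition)."""
    consensus = {
        name: consensus_network(networks, min_support)
        for name, networks in clusters.items()
    }
    unique = {}
    for name, edges in consensus.items():
        other_edges = set().union(
            *(e for other, e in consensus.items() if other != name)
        )
        unique[name] = edges - other_edges
    return unique


# Toy example: each study is a set of directed edges (parent, child).
clusters = {
    "stress": [{("A", "B"), ("B", "C")}, {("A", "B"), ("C", "D")}],
    "control": [{("B", "C"), ("E", "F")}, {("B", "C"), ("E", "F")}],
}
result = unique_networks(clusters)
```

In this toy example the edge `("B", "C")` appears in the consensus of both clusters, so it is excluded from both unique-networks. In practice the pairwise `jaccard` values would be passed to a clustering method to form the study-clusters before this step; that choice is left open here.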

The first step in developing this pipeline was to test it on synthetic data with well-known characteristics, so that the results could be evaluated. The findings showed that the pipeline can reliably identify subnetworks specific to a set of studies and is robust to fairly high levels of noise. Microarray data generated from organisms subjected to different conditions, even under well-standardised experimental conditions, involves considerable bias and noise. We now apply UNIP to real datasets, explore the findings and statistically evaluate the results. In the analysis we have to take into account experimental variation, bias, and both human and machine errors, as well as the structure of microarray data, which involves thousands of genes but only a few tens of samples. In this chapter we focus on two organisms: wheat and Fusarium.

This chapter is organized as follows. Section 5.2 describes the adaptations made to the UNIP pipeline to work with real datasets. Section 5.3 explains the structure of the real datasets and the results obtained from our pipeline. Section 5.4 compares the findings with other popular techniques. Section 5.5 presents the evidence we found in the literature in support of our findings. Section 5.6 explores the results on Fusarium. Finally, Section 5.7 discusses the findings.
