
Biological networks show some characteristics that seem to fit the scale-free model of organization. For example, these networks show a high degree of internal order, rather than a random structure, that governs the cell's molecular organization (Barabasi & Oltvai 2004), and the growth criterion can be satisfied by their evolutionary history. Although the scale-free framework seems to explain the complexity of biological networks easily, Khanin & Wit (2006) and Stumpf et al. (2005) advise some caution.

In this work we explore building networks beyond pairwise correlation but, given its success, we use WGCNA as a benchmark for the abilities of our pipeline. The first performance study of the pipeline on real data uses multiple wheat studies. We focus on 16 independent studies downloaded from the ArrayExpress database (Rustici et al. 2013, Parkinson et al. 2007), covering stress-enriched and non-stress conditions, each containing 61290 genes. Table 3.1 lists the studies with their number of samples and descriptions.

We first check whether the entire system can be described by a scale-free network. After the studies are merged, we calculate the connectivity k and plot k against p(k) to explore the nature of the datasets. Figure 3.3 shows, on the left, the histogram of the connectivity, which reveals a large number of nodes with low connectivity and a smaller but still present number of hubs, and, on the right, the relation between k and p(k) on a logarithmic scale. As highlighted in the figure title, R² is equal to 0.83, which we consider close enough to 1, as is the absolute value of the slope. From these first results we deduce that the general underlying mechanism of these studies can be described by a scale-free network.
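The scale-free check described above can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names, the equal-width binning of k and the choice of 10 bins are assumptions made for the example, and the fit is an ordinary least-squares line through log10 p(k) against log10 k.

```python
import numpy as np


def connectivity(adjacency):
    """Connectivity k_i: sum of the edge weights of node i, self-loops excluded."""
    a = np.asarray(adjacency, dtype=float)
    return a.sum(axis=1) - np.diag(a)


def degree_distribution(k, n_bins=10):
    """Bin the connectivities; return bin centres and the fraction p(k) per bin.

    Equal-width bins are an assumption of this sketch.
    """
    counts, edges = np.histogram(np.asarray(k, dtype=float), bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, counts / counts.sum()


def loglog_fit(k, p):
    """Least-squares line through (log10 k, log10 p); return (slope, R^2)."""
    k = np.asarray(k, dtype=float)
    p = np.asarray(p, dtype=float)
    keep = (k > 0) & (p > 0)  # logarithms need strictly positive values
    x, y = np.log10(k[keep]), np.log10(p[keep])
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot
```

Given an adjacency matrix `adj` (hypothetical name), `loglog_fit(*degree_distribution(connectivity(adj)))` returns the slope and R² of the kind reported in Figure 3.3. The values obtained depend on the binning, so they would not necessarily reproduce the figure exactly.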

Given that, we now build weighted co-expression networks, one for each wheat dataset, and afterwards compare them with the results of our pipeline. For computational reasons, we first need to reduce the number of variables. First, the genes that are not part of the Gene Ontology database (Ashburner et al. 2000), and are therefore not (yet) biologically characterised, are discarded. Then, the standard deviation (sd) of each gene across all samples is calculated in each study, and only the genes with sd ≥ 2 in at least 4 of the 16 studies are selected for the rest of the analysis. The sd threshold is defined by the user, based on the number of genes that the user believes can reasonably be analysed. The first step reduces the genes from 61290 to 21487, and the second step reduces them to the final number of 67 genes. More details can be found in Chapters 4 and 5. For each study, once we build the co-expression similarity matrix, we need to transform it into the adjacency matrix that defines the final study network. Since we are interested in directed networks, we choose to apply the soft-thresholding procedure, which requires the selection of the parameter β (power). Common practice is to set β = 6 for unsigned networks and β = 12 for signed ones. However, different studies imply different underlying mechanisms and possibly a different value of β. We therefore explore values of β from 1 to 30 and analyse the effects.
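The standard-deviation filter described above can be sketched as follows. This is an illustrative sketch only: the function name, the layout of the input (one genes × samples array per study, with genes in the same order in every study) and the use of the sample standard deviation (`ddof=1`) are assumptions, not details taken from this work.

```python
import numpy as np


def select_variable_genes(studies, sd_min=2.0, min_studies=4):
    """Return a boolean mask of the genes kept by the sd filter.

    studies     : list of 2-D arrays (genes x samples), same gene order in each
    sd_min      : a gene "passes" in a study if its sd across samples >= sd_min
    min_studies : a gene is kept if it passes in at least this many studies
    """
    # One column of per-gene standard deviations per study.
    # ddof=1 (sample sd) is an assumption of this sketch.
    sds = np.column_stack([np.std(x, axis=1, ddof=1) for x in studies])
    n_passing = (sds >= sd_min).sum(axis=1)
    return n_passing >= min_studies


# Toy data: 3 genes, 5 studies of 4 samples each.
flat = [1.0, 1.0, 1.0, 1.0]        # sd = 0
varying = [0.0, 10.0, 0.0, 10.0]   # sd well above 2
toy_studies = [
    np.array([flat, varying, varying if i < 3 else flat]) for i in range(5)
]
mask = select_variable_genes(toy_studies, sd_min=2.0, min_studies=4)
```

In the toy data the first gene never varies, the second varies in all five studies and the third varies in only three, so only the second gene is kept with `min_studies=4`. Lowering `sd_min` or `min_studies` keeps more genes, which is the trade-off the user-defined threshold controls.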

Wheat Studies

Study  Label          Samples  Description
1      E-MEXP-971     60       Salt stress
2      E-MEXP-1415    36       S and N deficient conditions
3      E-MEXP-1193    32       Heat and Drought Stress
4      E-MEXP-1694    6        Re-supply of sulfate
5      E-MEXP-1523    30       Heat stress
6      E-MEXP-1669    72       Different nitrogen fertiliser levels
7      E-GEOD-4929    4        Study parental genotypes 2
8      E-GEOD-4935    78       Study 39 genotypes 2
9      E-GEOD-6027    21       Meiosis and microsporogenesis in hexaploid bread wheat
10     E-GEOD-9767    16       Genotypic differences in water soluble carbohydrate metabolism
11     E-GEOD-12508   39       Wheat development
12     E-GEOD-12936   12       Effect of silicon
13     E-GEOD-11774   42       Cold treatment
14     E-GEOD-5937    4        Parental genotypes 2 biological replicates from SB location
15     E-GEOD-5939    72       36 genotypes 2 biological replicates from SB location
16     E-GEOD-5942    76       Parental and progenies from SB location

Table 3.1: Study numbers, labels, number of samples and descriptions of the wheat microarray dataset.

Figure 3.3: Scale-free plot. The panel on the left-hand side shows the distribution of the connectivity (k), while the one on the right represents the relation between k and p(k) on a logarithmic scale, highlighting that the slope is close to -1.

Figure 3.4 shows how R² varies with the value of β. As previously explained, the closer R² gets to 1, the better the fit. Therefore, in each study, we select the first value of β for which R² ≥ 0.8. However, as the figure shows, many studies never reach the 0.8 threshold, leaving us with no β to select. This may be due to the low number of samples available per study, or to the reduced set of genes. Once the values of β are chosen, we calculate the adjacency matrices that define the study networks, one per study, which in turn serve as the basis for the unique networks that we compare with those obtained by our pipeline. The unique networks resulting from the WGCNA procedure can be seen in Chapter 5.
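The selection of β described above can be sketched as follows. This is a simplified illustration and not the WGCNA implementation: all function names are hypothetical, the similarity is taken as |cor| (unsigned) or (1 + cor)/2 (signed), the scale-free R² is computed from an equal-width histogram of the connectivity, and the expression matrix is assumed to contain no constant genes.

```python
import numpy as np


def soft_threshold_adjacency(expr, beta, signed=False):
    """Adjacency a_ij = s_ij ** beta from a genes x samples expression matrix."""
    cor = np.corrcoef(expr)  # gene-by-gene Pearson correlation
    sim = (1.0 + cor) / 2.0 if signed else np.abs(cor)
    adj = sim ** beta
    np.fill_diagonal(adj, 0.0)  # no self-connections
    return adj


def scale_free_r2(adj, n_bins=10):
    """R^2 of the line through (log10 k, log10 p(k)); NaN if too few bins."""
    k = adj.sum(axis=1)
    counts, edges = np.histogram(k, bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centres > 0)
    if keep.sum() < 3:
        return float("nan")
    x = np.log10(centres[keep])
    y = np.log10(counts[keep] / counts.sum())
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def scan_beta(expr, betas=range(1, 31), signed=False):
    """Map each candidate beta to its scale-free R^2."""
    return {
        b: scale_free_r2(soft_threshold_adjacency(expr, b, signed)) for b in betas
    }


def first_beta_reaching(r2_by_beta, threshold=0.8):
    """Smallest beta with R^2 >= threshold, or None if no beta reaches it."""
    for beta in sorted(r2_by_beta):
        if r2_by_beta[beta] >= threshold:  # NaN compares False and is skipped
            return beta
    return None
```

For a study whose expression matrix is `expr` (hypothetical name), `first_beta_reaching(scan_beta(expr))` returns the chosen β, or `None` for the studies that never reach the threshold. Returning `None` rather than a default power keeps those studies explicit instead of silently assigning them a value.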

Figure 3.4: Scale independence. Each plot shows the variation of R² for different values of β (power) for each single study under analysis. The red horizontal line marks the threshold set at 0.8, above which R² satisfies the scale-free criterion; therefore the corresponding value of β
