For the first simulated data set, we randomly chose a list of one thousand genes such that the variance component estimates for all three random factors were nonzero for each gene in the list. We began by generating data for these genes from the random part of the linear mixed model (3.1). Random effects for a gene were simulated by sampling from normal distributions with mean zero and variance equal to the variance component estimate of the corresponding random factor for that gene in the original barley data. Residual errors were generated in the same way, using the estimates of the residual error variance. After simulating from the random part of model (3.1) for each of the one thousand genes, ten percent of the genes were set to have fixed treatment effects: a set of one hundred genes was selected at random from the one thousand previously chosen genes. For those one hundred genes, intensities were changed by adding fixed treatment effects, namely the twelve genotype-by-time interaction means of the corresponding gene from the original barley data. After simulation, the data set was analyzed by two methods: fitting the generating linear mixed model to each gene independently, and fitting the hierarchical Bayesian model described in the previous section.
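The simulation scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the factor names `r1`, `r2`, `r3`, the three-replicate layout in `design`, and the uniformly drawn variance components and treatment means are all hypothetical placeholders standing in for model (3.1) and for the per-gene estimates from the original barley data.

```python
import random


def simulate_gene(var_components, residual_var, design, fixed_means=None, rng=random):
    """Simulate one gene's intensities from the random part of a mixed model.

    var_components: dict mapping random-factor name -> variance estimate
    residual_var:   residual error variance for this gene
    design:         list of observations; each is a dict holding a level for
                    every random factor plus a 'trt' index (genotype-by-time cell)
    fixed_means:    optional list of treatment means, added only for non-null genes
    """
    # One random effect per level of each random factor, drawn from N(0, variance).
    effects = {}
    for factor, var in var_components.items():
        levels = sorted({obs[factor] for obs in design})
        effects[factor] = {lev: rng.gauss(0.0, var ** 0.5) for lev in levels}

    values = []
    for obs in design:
        y = sum(effects[f][obs[f]] for f in var_components)  # random part
        y += rng.gauss(0.0, residual_var ** 0.5)             # residual error
        if fixed_means is not None:
            y += fixed_means[obs["trt"]]                     # fixed treatment effect
        values.append(y)
    return values


rng = random.Random(0)
n_genes, n_trt, n_rep = 1000, 12, 3  # n_rep and the layout below are assumptions

# Hypothetical layout: 12 treatment cells (2 genotypes x 6 times) in 3 replicates.
design = [
    {"trt": t, "r1": rep, "r2": (rep, t // 6), "r3": (rep, t % 6)}
    for rep in range(n_rep)
    for t in range(n_trt)
]

# Ten percent of the genes receive fixed treatment effects.
non_null = set(rng.sample(range(n_genes), n_genes // 10))

data = []
for g in range(n_genes):
    # Placeholders: in the study these come from the original data's estimates.
    vc = {f: rng.uniform(0.01, 0.2) for f in ("r1", "r2", "r3")}
    res_var = rng.uniform(0.01, 0.1)
    means = [rng.gauss(0.0, 0.5) for _ in range(n_trt)] if g in non_null else None
    data.append(simulate_gene(vc, res_var, design, means, rng))
```

Because the null genes contain only random effects and residual noise, any treatment signal detected for them is by construction a false positive, which is what makes the later comparison of the two analyses possible.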

We used the SAS PROC MIXED procedure to fit the full linear mixed model (3.1) to each gene independently. Variance components were estimated by restricted maximum likelihood (REML), and the Kenward-Roger (KR) method was used to determine the denominator degrees of freedom and F-statistics for the tests of fixed effects. Applying the method of Storey and Tibshirani (2003), we report in Table 3.1 the number of genes that are significant with respect to the fixed effects at four nominal levels of False Discovery Rate (FDR) control. For example, when the FDR is controlled at 0.1, twenty-five genes are declared to have a significant genotype effect; sixteen of these genes are correctly identified (Table 3.2). Controlling the FDR at the same nominal level, fifty-seven genes are declared to have expression levels that change significantly over time; fifty-six of them are correctly identified (Table 3.2). Twenty-one genes are declared to have significantly different time patterns in the two genotypes; nineteen of them are correctly identified as having a genotype-by-time interaction (Table 3.2).
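To make the FDR step concrete, the following sketch converts a list of p-values into q-values and counts how many declared genes are truly non-null. It is a simplification, not the exact procedure of Storey and Tibshirani (2003): the proportion of null genes is estimated at a single fixed tuning value `lam = 0.5` rather than by smoothing over a grid of values, and the function names and toy p-values are hypothetical.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified q-values with a fixed-lambda estimate of the null proportion."""
    m = len(pvalues)
    # Estimated proportion of null genes: p-values above lam are mostly nulls.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))

    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues


def count_declared(qvalues, truth, level):
    """Return (number declared significant, number of those truly non-null)."""
    declared = [i for i, q in enumerate(qvalues) if q <= level]
    return len(declared), sum(1 for i in declared if truth[i])


# Toy illustration with made-up p-values; truth marks the truly non-null genes.
toy_p = [0.001, 0.002, 0.003, 0.6, 0.7, 0.8, 0.9, 0.4, 0.55, 0.95]
toy_truth = [True, True, False, False, False, False, False, False, False, False]
toy_q = storey_qvalues(toy_p)
n_declared, n_correct = count_declared(toy_q, toy_truth, level=0.1)
```

Applied to the per-gene p-values for one fixed effect, `count_declared` yields the kind of "declared" and "correctly identified" pair that the text cites from Tables 3.1 and 3.2.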

The first simulated data set was also analyzed by fitting the hierarchical Bayesian model described in Section 3.3. Four statistics (3.4, 3.5, 3.6, and 3.7) were calculated for each gene from the posterior distribution of each treatment effect. Genes were ordered for significance with respect to genotype, time, and genotype-by-time effects by using the values of F1, F2, F3, or F4. Table 3.2 reports the number of correctly identified genes under the hierarchical Bayesian analysis when F4 is used for ranking the genes. For example, all of the 25 most significant genes with respect to the genotype effect were correctly identified. Similarly, 55 of the 57 most significant genes with respect to the time effect have a true time effect, and 20 of the 21 genes with the highest F4 values for the interaction effect were correctly identified as having truly different time patterns in the two genotypes. If we look at the 100 most significant genes under both analyses, the hierarchical Bayesian method correctly identifies many more genes than the classical linear mixed model analysis (Table 3.2). For example, when testing for the genotype effect, the classical linear mixed model analysis identified 30 genes correctly among the 100 most significant genes, whereas the hierarchical Bayesian analysis identified 65. These 65 genes include all of the 30 genes that were correctly identified by the classical linear mixed model analysis. These observations suggest that the hierarchical Bayesian approach is superior to a traditional full linear mixed model analysis for identifying differentially expressed genes.

Formally, we can use receiver operating characteristic (ROC) curves to compare the effectiveness of the methods at separating differentially and non-differentially expressed genes. The p-values produced by each method provide a rank order of genes from most evidence for differential expression to least evidence. The quality of the ranking varies from method to method and can be judged by comparing ROC curves. The greater the area under an ROC curve, the better the corresponding method is at distinguishing null genes from non-null genes. In Figure 3.7 and Figure 3.7, we observe that the hierarchical Bayesian analysis with the F4 statistic has the highest power for separating differentially and non-differentially expressed genes with respect to the genotype effect and the genotype-by-time interaction, respectively. For genotype, time, [...] the same rank order of genes.

As a last observation, we note that the simulated data set was analyzed under the classical approach by fitting the true generating linear mixed model to each gene. Even so, the hierarchical Bayesian analysis provided a much better rank order of the genes with respect to genotype, time, and interaction effects (Figures 3.7, 3.7, and 3.7).
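The two ranking comparisons described above, the number of truly non-null genes among the top k and the area under the ROC curve, can be computed directly from a score per gene and the known simulation truth. The sketch below is illustrative only: the function names are hypothetical, larger scores are assumed to mean stronger evidence (for p-values one would pass their negatives), and tied scores are not given any special treatment.

```python
def correct_in_top_k(scores, truth, k):
    """Count truly non-null genes among the k genes with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in order[:k] if truth[i])


def roc_points(scores, truth):
    """Return ROC points (false positive rate, true positive rate) as the
    cutoff moves down the ranked list one gene at a time."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if truth[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as a list of (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy illustration with made-up scores for four genes, two of them non-null.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_truth = [True, False, True, False]
toy_auc = area_under_curve(roc_points(toy_scores, toy_truth))
toy_top2 = correct_in_top_k(toy_scores, toy_truth, k=2)
```

An area of 1 corresponds to a ranking that places every non-null gene ahead of every null gene, while an area near 0.5 corresponds to a ranking no better than random ordering.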