The goal of many microarray experiments is to find genes whose expression is altered, or differentially expressed, as a result of some experimental factor, such as disease state, drug treatment, or time. The dynamics of these changes can also be observed by measuring changes in gene-expression at different time points, making up a time course of gene-expression. In the earliest microarray analyses, so-called differential expression was calculated by looking at the average change in expression of genes between experimental conditions, often termed the fold change (Schena et al., 1995; DeRisi et al., 1996; Schena et al., 1996; Eisen et al., 1998). However, this approach fails to take into account the variability seen between the replicates and so may be unreliable (Chen et al., 1997). The average expression value for a gene may appear to be higher for one subset of samples as compared to another, but if the samples used to estimate this average show high variability amongst themselves, this difference in expression may be unreliable.
1.5.3.1 Hypothesis testing
The statistical significance of a test statistic is usually expressed in terms of the p-value: the probability, under the null hypothesis, of observing a test statistic value equal to or more extreme than that observed. A statistically significant difference is said to exist between two groups if the p-value is below some significance level 𝛼, often taken as 0.05. In this way, the amount of evidence required to reject the null hypothesis is defined in advance. The significance level 𝛼 can also be regarded as the probability of rejecting the null hypothesis when it is actually true, that is, of observing an effect when in fact there is none. This is known as a Type I error, or false positive. Thus, if 𝛼 = 0.05, we expect to see false positive results in 5% of the tests for which the null hypothesis is true. On the other hand, if the null hypothesis is not rejected on the evidence of the samples, but is in fact false, this is known as a Type II error, or false negative, that is, not observing an effect when there is one. The probability of this type of error is denoted 𝛽. Typically, the reliability of such tests is measured on the following criteria:
1. Specificity – The ability of a statistical test to correctly determine true negative outcomes (1 − 𝛼). Can be increased by reducing the significance level 𝛼, although this may result in an increase in false negatives.
2. Sensitivity/Power – A measure of a test's ability to correctly reject the null hypothesis when it is false (1 − 𝛽). Can be increased by increasing the sample size.
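The relationship between 𝛼, power and sample size can be illustrated with a small simulation. The following is a minimal sketch only, assuming a one-sided z-test on normally distributed data with known unit variance; the function names, effect size (0.5) and sample sizes are hypothetical choices made for illustration and do not come from any microarray data set.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, sigma=1.0):
    """One-sided p-value for H0: mean = 0 against H1: mean > 0 (sigma assumed known)."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 1.0 - NormalDist().cdf(z)

def rejection_rate(true_mean, n, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / trials

# H0 true: the rejection rate estimates the Type I error rate (close to alpha).
type_i_rate = rejection_rate(true_mean=0.0, n=10)
# H0 false: the rejection rate estimates the power (1 - beta).
power_small_n = rejection_rate(true_mean=0.5, n=10)
power_large_n = rejection_rate(true_mean=0.5, n=40)

print(f"Type I error rate (n=10): {type_i_rate:.3f}")
print(f"Power (n=10): {power_small_n:.3f}")
print(f"Power (n=40): {power_large_n:.3f}")
```

Under these assumptions the estimated Type I error rate should lie close to 𝛼, and the estimated power should rise as the sample size is increased.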
1.5.3.2 ANOVA and the t-test
Statistical hypothesis testing can be used to quantify the significance of observed differential expression, to determine whether observed changes in expression across groups are likely related to a biological effect or purely due to chance. Hypothesis tests can be parametric or non-parametric. For parametric tests, the distribution of the data is assumed a priori, whilst for non-parametric tests no such assumptions are made. An assumption often made of log-transformed microarray data, and for many parametric tests, is that errors follow a normal distribution. Whilst this has yet to be conclusively tested, evidence suggests that in some instances it may be a valid assumption (Giles and Kipling, 2003), particularly for in vitro and transgenic studies where within-group variation would be expected to be small (Olson, 2006).
Replicate microarray data are collected for each condition of interest and the variance in the signal between replicates for a particular gene is compared with the variance between the conditions to give some idea of the reliability of differentially expressed genes (Lee et al., 2000; Kerr and Churchill, 2001a; Kerr and Churchill, 2001b; Nadon and Shoemaker, 2002; Kerr, 2003). Many studies focus on the conditions of a single experimental variable, such as drug treatment, disease state, treatment time, etc. For such experimental designs, significance analysis is routinely performed by calculating a t-test statistic (two classes) or one-way ANOVA F-statistic (multiple classes) (Wolfinger et al., 2001; Cui and Churchill, 2003; Pavlidis, 2003; Churchill, 2004). The role of these test statistics is to identify significant differences between the means of the different sample groups – that is, differences in the means that would not be expected by chance alone. These tests therefore consider the variance between group means relative to the pooled variance of observations within each group.
ANOVA can be extended to test for significant changes in gene-expression in response to more than one experimental variable. Experiments comparing the effects of two or more variables on gene-expression will often be designed with a factorial treatment structure, such that all combinations of experimental conditions are represented (e.g. male & treated; female & treated; male & untreated; female & untreated) (Fisher, 1926). Such analyses consider not just the main effects of the experimental variables, but also their interactions, that is, modifications of the combined main effects caused by interdependencies between the variables. For instance, we may see that a drug under study imposes a stronger effect on males than on females.
Whilst t-test statistics can be calculated for cases when the variance in the two groups is not equal, the ANOVA F-test statistic assumes equal variance across the groups and relies on the parametric assumption of a normal distribution of error terms. As previously discussed, such an assumption may not hold for gene-expression data, so resulting p-values should be treated cautiously. In addition, this procedure performs tens of thousands of hypothesis tests simultaneously across the genes on the arrays. Each test is assumed to be independent; however, given the complex interactions between genes within the cell due to co-expression (simultaneous expression due to related function), this assumption is likely false. Nevertheless, the ANOVA F-test is often used as a starting point for analyses in order to gain an understanding of data structure and dependencies before progressing to more sophisticated techniques that consider the relatedness of the individual genes, such as empirical Bayes approaches (Baldi and Long, 2001; Efron and Tibshirani, 2002), or to reduce the dimensionality prior to higher order analyses such as hierarchical clustering (Section 1.5.4).
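The two-class comparison described in this section can be sketched for a single gene as follows. This is an illustrative sketch rather than a recommended pipeline: the expression values are invented log-scale numbers, the function names are hypothetical, and only the test statistics (not p-values, which require the t distribution) are computed. It contrasts the pooled-variance t-statistic, which assumes equal variance in both groups, with the unequal-variance (Welch) form.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t-statistic assuming equal variance in both groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def welch_t(a, b):
    """Two-sample t-statistic and approximate degrees of freedom without
    assuming equal variance (Welch-Satterthwaite approximation)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log-expression values for one gene in two conditions.
treated = [8.2, 7.9, 8.4, 8.1]
control = [7.1, 6.9, 7.3, 7.0]

t_pooled = pooled_t(treated, control)
t_welch, df_welch = welch_t(treated, control)
print(f"pooled t = {t_pooled:.2f}, Welch t = {t_welch:.2f}, Welch DF = {df_welch:.2f}")
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances they can diverge.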
1.5.3.3 Analysis of gene-expression using linear models
Linear model analysis is used to fit a relevant model that can be used to identify statistically significant effects. The variable of interest, or response variable, is related to the predictor variables through a linear model. For some transcript $g \in \{1, \ldots, G\}$ with gene-expression $Y_g = (y_{g1}, \ldots, y_{gn})$ over $n$ samples, a linear model can be applied to $y_{gi}$ with experiment variables $x_1 = (x_{11}, \ldots, x_{1n})$, $x_2 = (x_{21}, \ldots, x_{2n})$, etc. as predictor variables:

$$y_{gi} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_{gi} \tag{1-13}$$

where $\beta_0$ is the intercept term for transcript $g$ across all samples $i \in \{1, \ldots, n\}$, $\beta = (\beta_1, \ldots, \beta_p)$ are the regression coefficients (see below) for the predictor variables $x_1, \ldots, x_p$, $X = (x_1^T, \ldots, x_p^T)$ is the design matrix of observed values for the predictor variables $x_1, \ldots, x_p$ for each observation $i$, and $\epsilon_{gi}$ is some error term assumed to be IID $\sim N(0, \sigma^2)$.
It is assumed that the response variable $Y_g = (y_{g1}, \ldots, y_{gn})$ for transcript $g$ is made up of $n$ independently observed values, and that each value of the response variable is observed for some designed value of the predictor variables. These are typically considered as valid assumptions for microarray analyses. The relationship of the response variable $y_{gi}$ to the predictor variables can also be written using the R-specific notation $y_{gi} \sim x_1 + x_2 + \cdots + x_p$, where the ~ symbol implies that “$y_{gi}$ is modelled by the additive main effects of $x_1$, $x_2$, etc.”.
If interaction terms are of interest (for instance, we may suspect that the effect of drug treatment depends on age), these terms can also be included in the model:

$$y_{gi} \sim x_1 + x_2 + \cdots + x_p + x_1{:}x_2 + \cdots + x_{p-1}{:}x_p \tag{1-14}$$

where $x_n{:}x_m$ indicates the 1st order interaction between terms $x_n$ and $x_m$ for $n, m \in \{1, \ldots, p\}$. Here we consider only 1st order interaction terms, although higher order interactions may also be of interest. For 2-colour microarray experiments, the gene-expression data $y_{gi}$ are typically normalised log-ratios, and for 1-colour experiments are typically normalised log-signals.
The explanatory variables $x_k$ for $k \in \{1, \ldots, p\}$ can take many forms, including both categorical variables, which take one of a finite number of levels, and numerical variables, which take any value within a continuous range. These variables can represent a wide range of experimental conditions. The ANOVA model is a special case of the linear model in which all model terms are taken from a restricted set of designed factor levels. Analysis of covariance (ANCOVA) is an extension of ANOVA including both factors and continuous explanatory variables showing a linear relationship to the response variable (covariates). Such covariates may influence the effect of the factor terms on the response variable, and ANCOVA models allow the removal of such nuisance covariate effects. The error terms, or residuals, $\epsilon_{gi}$ are assumed to be IID such that $\epsilon_{gi} \sim N(0, \sigma^2)$.
Whilst the assumption of normality in microarray data is contentious, it is generally accepted to hold after transformation of the data to the log scale, and it has been suggested that such a transformation may in fact be unnecessary to ensure normality (Giles and Kipling, 2003).
For numerical variables $x_1, \ldots, x_p$, the coefficients $\beta = (\beta_1, \ldots, \beta_p)$ can be estimated using regression analysis, or curve fitting. One such method is ordinary least squares fitting, whereby the residual sum of squares (RSS), $\sum_{i=1}^{n} \epsilon_i^2$, is minimised. The linear model described in Equation 1-13 can be considered the equation of a surface within a $(p + 1)$-dimensional space. Thus regression analysis aims to fit a $p$-dimensional surface such that the residuals are minimised. For instance, consider the simplest case of $p = 1$, e.g. analysing the effect of the numerical variable age on the expression of a single gene (the subscript $g$ is removed for convenience). This would be modelled by the equation:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \tag{1-15}$$
All values $y_i$ and $x_i$ ($i \in \{1, \ldots, n\}$) can be plotted on Cartesian co-ordinates, with the response variable $y$ on the vertical axis and the explanatory variable $x$ on the horizontal axis. Assuming a linear relationship between $x$ and $y$, one such model is a straight line with y-intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$, chosen such that the RSS is minimised. Note, however, that “linear model” does not imply a linear relationship. A linear model is defined as a model in which the explanatory variables are related to the response variable through a linear combination of terms (for instance, $y \sim 3.1 + 2.7x + 1.4x^2$ is a linear model despite the second order term $x^2$, with explanatory variables $x$ and $x^2$). The least squares solution to the simple linear regression of Equation 1-15 is given by:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{1-16}$$
where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ variables respectively. Fitting this model to a random sample of the global population results in estimates $\hat{\beta}$ of the model parameters. The residual error $\epsilon_i$ for each observation $i$ is defined as the difference between the fitted value ($\mu$) and the observed value ($y_i$), and the residuals are assumed to be IID $\sim N(0, \sigma^2)$. However, this imposes constraints on the number of measured values that are free to vary. That is, if 100 measurements are randomly sampled from the population with residuals $\epsilon_1, \ldots, \epsilon_{100}$, the final residual is necessarily determined by the others as $\epsilon_{100} = -\sum_{i=1}^{99} \epsilon_i$. In this case, we say that estimation of this statistic has 99 degrees of freedom (DF).
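The closed-form estimates of Equation 1-16 can be sketched directly in code. The example below is a minimal illustration, assuming a single gene and a single numerical predictor; the age and expression values are invented for the purpose of the example, and `fit_simple_ols` is a hypothetical helper name rather than a function from any package.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least squares intercept and slope for y = beta0 + beta1 * x (Equation 1-16)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data: age (years) and log-expression of one gene.
age = [20, 30, 40, 50, 60]
expression = [5.1, 5.9, 7.2, 7.8, 9.0]

beta0, beta1 = fit_simple_ols(age, expression)

# Residuals taken here as observed minus fitted; the RSS is the quantity minimised.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(age, expression)]
rss = sum(e ** 2 for e in residuals)

print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}, RSS = {rss:.3f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```

Because the model includes an intercept, the residuals sum to zero (up to floating-point error), which is the constraint on the final residual described above.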
For factorial variables (such as with ANOVA), regression coefficients can be calculated for each level of the factor by assigning dummy variables to the factor levels. If the resulting coefficient estimate $\hat{\beta}_1$ is large in magnitude, this suggests an effect on the response variable. For a higher number of explanatory variables, the procedure is the same over a higher dimensional space, with the size of the regression coefficients relating to the size of the effect.
An F-test statistic can be used to test for significant differences between nested models to judge improvements in model fit, by giving a measure of the significance of the resulting change in RSS. For nested models $Y_1 \sim t_1 + \cdots + t_{p_1}$ and $Y_2 \sim s_1 + \cdots + s_{p_2}$, where $p_2 < p_1$, the terms $t_i \in \{x_1, \ldots, x_p, x_1{:}x_2, \ldots, x_{p-1}{:}x_p\}$ for $i \in \{1, \ldots, p_1\}$, and the terms $s_j \in \{t_1, \ldots, t_{p_1}\}$ for $j \in \{1, \ldots, p_2\}$, the F-test statistic is given by:

$$F = \frac{(RSS_2 - RSS_1) / (p_1 - p_2)}{RSS_1 / (n - p_1)} \tag{1-17}$$
The F-statistic is used to test the null hypothesis that the $p_1 - p_2$ term(s) that differ between $Y_1$ and $Y_2$ have no effect on the response variable. This approach can be used in an iterative manner to test the significance of each coefficient term in the model by comparing the model fit with and without each term. Thus for two models $U$ and $R$ with $n$ observations, where model $U$ has $k$ unrestricted coefficients and model $R$ restricts $m$ of the coefficients to zero, the F-test statistic is defined as:

$$F = \frac{(n - k)(RSS_R - RSS_U)}{m \cdot RSS_U} \tag{1-18}$$
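The nested-model comparison can be sketched for the simplest case: an intercept-only model (restricted) against a simple linear regression (unrestricted). This is a hedged illustration on invented data, with hypothetical function names. It assumes that the intercept is counted among the coefficients, so that $k = p_1 = 2$ and $m = p_1 - p_2 = 1$; under that assumption Equations 1-17 and 1-18 give the same value.

```python
from statistics import mean

def rss_intercept_only(y):
    """RSS of the restricted model y = beta0."""
    y_bar = mean(y)
    return sum((yi - y_bar) ** 2 for yi in y)

def rss_simple_regression(x, y):
    """RSS of the unrestricted model y = beta0 + beta1 * x, fitted by least squares."""
    x_bar, y_bar = mean(x), mean(y)
    beta1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    beta0 = y_bar - beta1 * x_bar
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

def f_nested(rss_small, rss_large, n, p_large, p_small):
    """Equation 1-17: the larger model has p_large coefficients, the smaller p_small."""
    return ((rss_small - rss_large) / (p_large - p_small)) / (rss_large / (n - p_large))

def f_restricted(rss_r, rss_u, n, k, m):
    """Equation 1-18: k unrestricted coefficients, m of them set to zero in model R."""
    return (n - k) * (rss_r - rss_u) / (m * rss_u)

# Hypothetical data: age (years) and log-expression of one gene.
age = [20, 30, 40, 50, 60]
expression = [5.1, 5.9, 7.2, 7.8, 9.0]
n = len(expression)

rss_r = rss_intercept_only(expression)
rss_u = rss_simple_regression(age, expression)

f_17 = f_nested(rss_r, rss_u, n, p_large=2, p_small=1)
f_18 = f_restricted(rss_r, rss_u, n, k=2, m=1)
print(f"RSS_R = {rss_r:.3f}, RSS_U = {rss_u:.3f}, F = {f_17:.2f}")
```

A p-value would be obtained by comparing the statistic with an F distribution on $m$ and $n - k$ degrees of freedom, which is left out here because it requires a statistics library.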
The fraction of the total sum of squares (SS) explained by each of the terms in the model is calculated sequentially to account for the inclusion of previous terms in the model. Traditionally, one of three methods can be used to determine the explained SS for each model term (Yates, 1934; Speed et al., 1978; Herr, 1986; Langsrud, 2003). In a Type I SS method, the significance of each term in the model is calculated by sequentially adding a term, recalculating the SS, and comparing the models before and after (Overall and Spiegel, 1969). This is repeated until the model becomes saturated. If the design of the experiment is not orthogonal or is unbalanced, the resulting SS can be greatly influenced by the order in which terms appear in the model, and different permutations can give vastly different results (Langsrud, 2003). In contrast, Type II and Type III SS are not reliant on the order of the terms, and are therefore better suited for unbalanced designs. In a Type II SS method, each term under consideration is adjusted for the terms in the model that do not contain the term of interest. In a Type III SS method, each term is adjusted for all other terms in the model. Many statistical packages offer the Type III SS as a default, although this has been criticised because it can lead to the inclusion of interaction terms without the inclusion of corresponding main effect terms, and Type II SS has been found to have higher power when analysing unbalanced designs (Langsrud, 2003).
A common approach in significance analysis of gene-expression changes across multiple variables is to fit a saturated model for each gene incorporating all variables and their interactions. However, it is preferable and more relevant to fit a single model for each gene to account for per-gene variability (Jin et al., 2001; Wolfinger et al., 2001; Smyth, 2004). Several approaches to significance analysis of gene-expression data using linear models have been previously described (Kerr et al., 2000; Jin et al., 2001; Wolfinger et al., 2001; Chu et al., 2002; Smyth, 2004), and one of the most widely used is the limma (linear models for microarray data) package in R (Smyth, 2004; Smyth, 2005).
1.5.3.4 Multiple testing
The use of hypothesis testing for gene-expression analysis is popular due to its simplicity and the large number of pre-existing methods available. However, problems may arise due to the large number of tests performed at any one time. The cutoff for significance in many experimental procedures is 𝛼 = 0.05, which indicates that we can expect to see a false positive for 1 out of every 20 tests in which the null hypothesis is true. Thus the simultaneous testing of tens of thousands of genes may result in hundreds, or even thousands, of false positive results. The multiplicative nature of probabilities indicates that the p-values cannot be considered in isolation for multiple tests: if $m$ independent true null hypotheses are each tested at level 𝛼, the probability of at least one false positive is $1 - (1 - \alpha)^m$, which approaches 1 as $m$ grows. This is termed the multiple testing problem. To account for this when testing multiple hypotheses, multiple testing corrections (MTC) must be applied to correct the p-values for the number of concurrent tests performed (Dudoit et al., 2003; Reiner et al., 2003).
One class of MTC is the family-wise error rate (FWER) corrections, which control the probability of making one or more false discoveries across a family of tests. The p-values for each gene are adjusted based on the number of individual tests performed, which accounts for the multiplicity of hypothesis testing. One of the best known FWER MTCs is the Bonferroni procedure (Bonferroni, 1936), which is often found to be very conservative in its p-value estimates. A second class of MTCs is the false discovery rate (FDR) procedures, which control the expected proportion of incorrectly rejected null hypotheses among all rejected hypotheses, and are generally considered to be less conservative than FWER procedures. The adjusted p-value gives the proportion of false positives that can be expected among the set of tests declared significant at a given threshold.
FDR and FWER MTCs reduce the number of false positive results, but also reduce the power of the statistical test for individual genes. FDR procedures give fewer false negatives than FWER procedures, gaining power at the cost of specificity, and are therefore often seen as less stringent (Reiner et al., 2003). One of the most widely used FDR corrections is that of Benjamini and Hochberg (1995). An improvement on this correction, the q-value of Storey et al. (Storey, 2002; Storey and Tibshirani, 2003), improves the power of the test and eliminates the need to set the error rate beforehand. Since FDR corrections such as these require test statistics for each gene to be independent, or at most weakly dependent, there remains criticism as to the relevance of their application to the field of microarrays, where dependence exists between many genes (Jung and Jang, 2006; Gordon et al., 2007), although further MTCs that account for positive regression dependency between hypothesis test statistics are available (Benjamini and Yekutieli, 2001). Regardless, given the huge number of tests performed simultaneously when observing differential expression using hypothesis testing, the use of some form of correction is required to account for and minimise false positive results.
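The difference in stringency between the two classes of correction can be sketched on a toy example. The code below is a minimal illustration, not a substitute for an established implementation: the five p-values are invented, the function names are hypothetical, and the Benjamini and Hochberg adjustment is written in its usual step-up form (each sorted p-value is scaled by the number of tests divided by its rank, then made monotone from the largest p-value downwards).

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: step-up adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
p_raw = [0.001, 0.008, 0.02, 0.03, 0.20]
alpha = 0.05

p_bonf = bonferroni(p_raw)
p_bh = benjamini_hochberg(p_raw)

n_bonf = sum(p < alpha for p in p_bonf)
n_bh = sum(p < alpha for p in p_bh)
print(f"Bonferroni: {n_bonf} significant; Benjamini-Hochberg: {n_bh} significant")
```

On these invented values the Benjamini and Hochberg adjustment retains more genes than the Bonferroni adjustment at the same threshold, consistent with the FDR procedures being the less conservative of the two.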