The image processing described in the previous section results in a file containing information on the signal strength for each probe on the array. Further low-level data manipulation is required to organise the data into a structure whereby each probe set on the array is represented by a single value relating to the relative abundance of the corresponding mRNA transcript such that comparisons can be made between different arrays (i.e. comparing abundance of a particular transcript between conditions) and between probe sets (e.g. looking for similarly expressed transcripts).
The first step in this procedure is to remove the NSB signal to ensure that background signal bias is removed and that measured signal values relate specifically to the transcript of interest. Probe-level signal values are then summarised across the probe set to give a single value. Finally, to ensure that comparisons can be made both across arrays and across probe sets, a normalisation procedure is used to remove systematic variation in the data by scaling signal values across samples and probe sets such that they are comparable on the same scale. This is often done such that all probe sets have mean 1 across the samples. A number of methods are available for performing these transformations, and several of the most common algorithms are discussed in this section.
1.4.2.1 MAS 4.0
Early results suggested that the difference between PM and MM signal was linear with RNA concentration (Lockhart et al., 1996). The earliest editions of Microarray Suite (MAS 4.0), the Affymetrix-supplied software for analysis of GeneChip microarray data, used a simple average difference method to subtract the signal of the MM probes from the 'real' signal of the PM probes and summarise the probe set (Affymetrix, 1999). For a given probe set $n = 1, \dots, N$ on array $i$:
$$\mathit{AvDiff}_{in} = \frac{1}{|A_{in}|} \sum_{j \in A_{in}} \left( PM_{ijn} - MM_{ijn} \right) \tag{1-1}$$
Where $j = 1, \dots, J$ is the physical position of the probe pair within the probe set, and $A_{in}$ is the subset of probe pairs for which $d_{ijn} = PM_{ijn} - MM_{ijn}$ is within 3 standard deviations of the average of $d_{i(2)n}, \dots, d_{i(J-1)n}$, the ordered differences with the smallest and largest values excluded. This calculation is based on the underlying model for probe-level correction:
$$PM_{ijn} - MM_{ijn} = \theta_{in} + \varepsilon_{ijn} \tag{1-2}$$
$\theta_{in}$ represents the mean expression of the target transcript $n$ on array $i$, and $\varepsilon_{ijn}$ represents the probe-level error. The summary described in Equation 1-1 assumes that the error terms $\varepsilon_{ijn}$ have equal variance for all probes in the probe set.
However, it has been shown that this assumption does not hold for GeneChip data since probes with a higher mean-intensity also have a larger variance in their errors (Irizarry et al., 2003b). Another problem that arises with this method for background subtraction is that often (1/3 of all probes in some cases) the signal for the MM probes is higher than that of the PM probes, indicating that the MM probes are sensitive to targets of the PM probes (Affymetrix, 2002a; Irizarry et al., 2003b). This may result in the loss of real signal and not just background. More worryingly, the correction 𝑃𝑀 − 𝑀𝑀 produces negative values for these probe pairs, precluding the use of a log transformation to account for the multiplicative errors, and producing negative expression values for roughly 5 % of probe sets (Wu et al., 2004). The loss of signal by subtracting MM probe signal can also result in a large amount of noise, particularly at lower intensity levels, reducing accuracy and making prediction of differential expression difficult.
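The average difference of Equation 1-1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_difference` is hypothetical, the reference mean is taken over the ordered differences with the smallest and largest removed, and the standard deviation is assumed to be computed from that same trimmed set (the text does not say which standard deviation is used). The second call shows how a probe set whose MM signals exceed its PM signals yields a negative value, which cannot be log transformed.

```python
import numpy as np

def average_difference(pm, mm, n_sd=3.0):
    """Average difference for one probe set on one array (sketch of Equation 1-1)."""
    d = np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float)
    # Reference statistics from the ordered differences, dropping min and max.
    # ASSUMPTION: the SD is taken over the same trimmed set as the mean.
    trimmed = np.sort(d)[1:-1]
    centre = trimmed.mean()
    spread = trimmed.std(ddof=1)
    # A_in: probe pairs whose difference lies within n_sd SDs of the centre.
    keep = np.abs(d - centre) <= n_sd * spread
    return d[keep].mean()

# Made-up intensities for a six-pair probe set; the fourth pair is an outlier.
well_behaved = average_difference([120, 150, 90, 200, 110, 95], [100] * 6)

# Made-up probe set in which every MM exceeds its PM: the summary is negative.
mm_dominated = average_difference([50, 60, 55, 40], [80, 95, 70, 75])
```

The sketch needs at least four probe pairs so that the trimmed set has a defined sample standard deviation.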
1.4.2.2 MAS 5.0
To avoid the problems of noise seen at lower intensity levels using the MAS 4.0 algorithm, a log transformation was used to reduce the dependence of the variance of the error terms on the mean (Hubbell et al., 2002), and a robust estimator, the Tukey biweight (Hoaglin et al., 2000), was introduced to down-weight the effects of outlying probes on the summary signal over the probe set. For some cutoff value $c$ chosen in advance, the Tukey biweight function is defined as:
$$\psi(x) = \begin{cases} x \left( 1 - \dfrac{x^2}{c^2} \right)^2 & \text{for } |x| < c \\ 0 & \text{for } |x| \geq c \end{cases} \tag{1-3}$$
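As a minimal sketch, the biweight function of Equation 1-3 can be written with NumPy as follows. The name `tukey_biweight_psi` is hypothetical, and the condition is assumed to apply to the absolute value of $x$, as in the standard definition of the biweight.

```python
import numpy as np

def tukey_biweight_psi(x, c):
    """Tukey biweight psi function: x * (1 - x^2 / c^2)^2 inside the cutoff, else 0."""
    x = np.asarray(x, dtype=float)
    inside = np.abs(x) < c
    return np.where(inside, x * (1.0 - (x / c) ** 2) ** 2, 0.0)

# Values beyond the cutoff c contribute nothing; values near zero pass almost unchanged.
values = tukey_biweight_psi([-10.0, -2.5, 0.0, 2.5, 10.0], c=5.0)
```

Because the function returns to zero at the cutoff, an outlying probe has no influence on the summary rather than a bounded one.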
To minimise the introduction of noise due to removal of MM signal, the concept of the ideal mismatch (IM) was introduced. If the MM probe signal is lower than the PM signal for a particular probe pair, the MM signal is assumed to be informative for NSB with no cross hybridisation, and the MM value is taken as the ideal mismatch value. If MM probe values are generally lower than PM values across the probe set, except for a small number of probe pairs, then the IM for these uninformative pairs is imputed from the biweight mean of the PM to MM ratio. If, however, the MM probe signals are generally higher than the PM probe signals across the probe set, the IM value is taken as a value slightly below that of the corresponding PM signal (Affymetrix, 2002b). Therefore, over the probe pairs $j$ of probe set $n$ on array $i$, the MAS 5.0 signal is computed as:
$$\text{MAS 5.0 signal}_{in} = \psi\left( \log_2 \left( PM_{ijn} - IM_{ijn} \right) \right) \tag{1-4}$$
This summary method is currently employed in the GeneChip Operating System (GCOS) supplied by Affymetrix. However, despite the addition of the robust Tukey biweight estimator, data calculated using the MAS 5.0 algorithm are still noisy, particularly at lower intensity levels (for instance, see Figure 1.4.3) (Irizarry et al., 2003b). A strong probe effect, additive on the log scale, is detected even after removal of MM signal (Li and Wong, 2001; Irizarry et al., 2003b). This indicates that subtraction of MM signal alone is insufficient to remove probe-specific effects.
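The ideal mismatch rules and the biweight summary described for MAS 5.0 can be sketched as below. This is an illustration, not the Affymetrix implementation: the function names are hypothetical, the one-step biweight mean (median centre, MAD scale) is one common way of turning the biweight into an average, and the constants `c=5.0`, `eps=1e-4`, `contrast_tau=0.03` and `scale_tau=10.0` are assumed defaults that do not come from the text.

```python
import numpy as np

def biweight_mean(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean (ASSUMED form: median centre, MAD scale)."""
    x = np.asarray(x, dtype=float)
    centre = np.median(x)
    mad = np.median(np.abs(x - centre))
    u = (x - centre) / (c * mad + eps)
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)  # outliers get weight 0
    return np.sum(w * x) / np.sum(w)

def ideal_mismatch(pm, mm, contrast_tau=0.03, scale_tau=10.0):
    """Ideal mismatch for each probe pair of one probe set (hypothetical constants)."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    # Probe-set-level biweight mean of the log2 PM/MM ratio.
    sb = biweight_mean(np.log2(pm / mm))
    if sb > contrast_tau:
        # MM is mostly below PM: impute from the typical PM/MM ratio.
        fallback = pm / 2.0 ** sb
    else:
        # MM is mostly above PM: use a value slightly below PM.
        fallback = pm / 2.0 ** (contrast_tau / (1.0 + (contrast_tau - sb) / scale_tau))
    # Informative pairs (MM < PM) keep their measured MM.
    return np.where(mm < pm, mm, fallback)

def mas5_signal(pm, mm):
    """Log2-scale probe-set summary: biweight mean of log2(PM - IM)."""
    pm = np.asarray(pm, dtype=float)
    return biweight_mean(np.log2(pm - ideal_mismatch(pm, mm)))
```

Because the imputed IM is always below the PM value, the difference PM − IM stays positive and the log transformation is always defined, which is the property that the plain PM − MM subtraction lacks.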
1.4.2.3 Model based expression index
Due to the reproducibility of arrays produced using photolithographic and inkjet techniques, individual probe affinities can be modelled well. Li and Wong (2001) suggested a multiplicative model-based approach to estimate expression for each probe set using probe-specific affinities. This approach is termed the model based expression index (MBEI), and is implemented in the analysis package DNA-Chip Analyser (dChip) (Li and Wong, 2001):
$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij} \tag{1-5}$$
Where $PM_{ij}$ and $MM_{ij}$ represent the detected PM and MM signal for the probe pair in the $j$th ($j = 1, \dots, J$) position of the probe set on array $i = 1, \dots, I$, $\phi_j$ represents the probe-specific affinity of the $j$th probe pair, which can be estimated from the multiple arrays in the analysis, $\theta_i$ is the expression of the probe set on array $i$, and the $\varepsilon_{ij}$ are error terms assumed to be independent and identically distributed (IID) across the arrays. Estimates for $\theta_i$ are calculated by iteratively fitting the model with variable $\phi_j$, aiming to minimise the sum of the squared residuals.
This process corrects expression estimates for the effects of individual probe affinities, improving precision. However, since this procedure still removes MM signal for NSB correction, the problems of noise are still present, albeit reduced. It was also found that this procedure underestimates expression at higher concentrations of RNA in spike-in studies (Irizarry et al., 2003b).
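The iterative fit of Equation 1-5 can be sketched as an alternating least squares procedure: with the affinities held fixed, each expression value has a closed-form least squares solution, and vice versa. The function name `fit_mbei` is hypothetical, and the scaling constraint (sum of squared affinities equal to the number of probes) is an assumption added here to make the product identifiable. The outlier detection that dChip performs is omitted.

```python
import numpy as np

def fit_mbei(diff, n_iter=100, tol=1e-10):
    """Fit diff[i, j] ~ theta[i] * phi[j] for one probe set by alternating least squares.

    diff: (arrays x probes) matrix of PM - MM differences.
    """
    diff = np.asarray(diff, dtype=float)
    n_probes = diff.shape[1]
    phi = np.ones(n_probes)
    theta = diff @ phi / (phi @ phi)
    for _ in range(n_iter):
        # Least squares update of the probe affinities with theta fixed.
        phi = theta @ diff / (theta @ theta)
        # ASSUMED identifiability constraint: sum(phi**2) == number of probes.
        phi *= np.sqrt(n_probes / (phi @ phi))
        # Least squares update of the expression values with phi fixed.
        new_theta = diff @ phi / (phi @ phi)
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta, phi

# Noise-free toy data built from made-up expression values and affinities.
toy = np.outer([100.0, 200.0, 400.0], [0.5, 1.0, 1.5, 1.0])
theta_hat, phi_hat = fit_mbei(toy)
```

On noise-free data the fitted product reproduces the input exactly; with real data the residuals indicate probes or arrays that fit the model poorly.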
1.4.2.4 Robust multi-chip averaging
By performing extensive statistical analyses on a spike-in study using known concentrations of 16 probe sets on the Affymetrix HGU95A GeneChip (Affymetrix, 2002c), Irizarry et al. (2003b) concluded that the probe signal strength increases linearly on the original (untransformed) scale, but not on the log scale. This indicates that NSB is additive and not multiplicative as suggested by Li and Wong (2001). Given that probe effects appear to be additive on the log scale, this led several researchers to suggest the need for a method that modelled background in an additive fashion, and the error in a multiplicative fashion (additive on the log scale) (Durbin et al., 2002; Huber et al., 2002; Cui et al., 2003). Given the problems seen with removing MM signal in NSB correction, an improved method for probe-level normalisation was suggested based on multivariate linear models estimated using only PM signal (Irizarry et al., 2003a). This measure was termed Robust Multi-chip Averaging (RMA).
Model-based estimates of the NSB probability density function negate the need for including the MM signal in the NSB estimation. Assuming the additive background model $PM_{ijn} = s_{ijn} + bg_{ijn}$, the background-corrected signal is defined as:
$$B(PM_{ijn}) \equiv E(s_{ijn} \mid PM_{ijn}) \tag{1-6}$$
Computation of the background-adjusted signal is performed by using a kernel density estimate over the detected PM signals to produce a smooth probability density curve, allowing estimation of the expected signal given that the PM signal $PM_{ijn}$ is detected. Background-adjusted values are log transformed (typically base 2) and are normalised using quantile normalisation (Bolstad et al., 2003) to remove systematic differences between arrays and ensure that the distribution of the log-transformed values more closely approximates a normal distribution ($\sim N(0, \delta^2)$). A linear additive model is fitted to the background-adjusted, normalised and log-transformed PM signal, $Y_{ijn}$, for array $i = 1, \dots, I$, probe $j = 1, \dots, J$, and probe set $n = 1, \dots, N$:
$$Y_{ijn} = \mu_{in} + \alpha_{jn} + \varepsilon_{ijn} \tag{1-7}$$
Where $\alpha_{jn}$ represents the individual probe affinity effect, $\mu_{in}$ represents the real log scale expression of probe set $n$ on array $i$, and $\varepsilon_{ijn}$ represents the error term, assumed to be IID with normal distribution $\sim N(0, \sigma^2)$. It is also assumed that the probes on the array were designed such that the probe intensities are on average representative of the corresponding gene expression, such that $\sum_j \alpha_{jn} = 0$. Finally, the model is fitted by median polish (Holder et al., 2001), giving estimates of the log scale expression levels $\mu_{in}$ for each array $i$ that are protected against the effect of outlying probes.
The main benefits of using RMA stem from the fact that background correction is not reliant on removal of MM data, which may measure actual signal as well as NSB. Figure 1.4.3 shows a comparison of the GC-RMA summary method with the Affymetrix standard MAS 5.0 method. This figure shows a clear reduction in variance using GC-RMA as compared to MAS 5.0, particularly at lower expression levels, indicating increased precision in the expression estimates. GC-RMA also results in improved sensitivity and specificity for fold change estimation, reducing the number of false positives (Irizarry et al., 2003a). It is also interesting to note that this figure indicates that the signal intensity of each probe set appears to be higher when using GC-RMA than when using MAS 5.0. This may be due to lower levels of background signal detected for all probes using GC-RMA than MAS 5.0.
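Returning to the normalisation and summary steps of RMA described above, the two can be sketched as follows. Both functions are illustrative and their names are hypothetical: `quantile_normalise` gives every array the same empirical distribution by replacing each value with the mean of the values of equal rank (ties are not treated specially here), and `median_polish_expression` fits the additive array-plus-probe model for a single probe set by alternately sweeping out row and column medians. The kernel-density background correction is not shown.

```python
import numpy as np

def quantile_normalise(x):
    """x: (probes x arrays). Returns a copy in which all columns share one distribution."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)
    # Reference distribution: mean across arrays of the values at each rank.
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x)
    for col in range(x.shape[1]):
        out[order[:, col], col] = mean_quantiles
    return out

def median_polish_expression(y, n_iter=10, tol=1e-6):
    """y: (arrays x probes) log2 values for one probe set.

    Returns (expression per array, probe effects, residuals).
    """
    resid = np.array(y, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # array effects
    col = np.zeros(resid.shape[1])   # probe effects
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)
        col -= shift
        overall += shift

        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)
        row -= shift
        overall += shift

        # Stop once a full sweep removes (almost) nothing.
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall + row, col, resid
```

Medians rather than means are used in the sweeps so that a single aberrant probe on one array ends up in the residuals instead of shifting that array's expression estimate.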
One problem with the RMA probe-level normalisation procedure is that the use of only a global background adjustment does not adjust well for NSB. Although RMA reduces the number of false positives, robust estimation of the expression of some genes can result in an increase in the number of false negatives during analysis for differential expression, particularly for lower abundance targets (Wu et al., 2004), indicating that accuracy of fold change estimates is sacrificed for precision.
The GeneChip Robust Multi-chip Averaging (GC-RMA) method of probe-level normalisation proposed by Wu et al. (2004) improves upon the background correction portion of the RMA algorithm by using a probe-specific weighting of the MM probe signal that is dependent on the content and position of higher affinity guanine and cytosine nucleotides within the oligonucleotide sequence (Naef and Magnasco, 2003). This avoids discarding the 50 % of the data held in the MM probes by using a more sophisticated MM subtraction method that is dependent on MM probe sequence. The probe affinity is modelled as the sum of the individual effects of the bases in the probe sequence:
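$$\alpha = \sum_{k=1}^{25} \sum_{j \in \{G, C, A, T\}} \mu_{j,k} \, 1_{b_k = j} \tag{1-8}$$
$$\mu_{j,k} = \sum_{l=0}^{3} \beta_{j,l} \, k^l \tag{1-9}$$
$$1_{b_k = j} = \begin{cases} 1 & \text{if } b_k = j \\ 0 & \text{if } b_k \neq j \end{cases} \tag{1-10}$$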
Where $k = 1, \dots, 25$ is the position along the 25-mer oligonucleotide probe, $b_k$ is the base at the $k$th nucleotide position, $\mu_{j,k}$ is the contribution that base $j$ makes to the overall affinity when in position $k$, and the $\beta_{j,l}$ are the coefficients of the cubic polynomial in $k$ that describes this contribution. This estimate is used to correct the MM values for their individual affinities to estimate NSB. These estimates were found to model NSB almost as well as the MM signal, with the advantage that computed estimates do not detect real signal. This process retains the benefits in the precision of the results as compared to MAS 5.0, but does not suffer from the loss in accuracy that is seen in RMA.
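The position-dependent affinity of Equations 1-8 to 1-10 can be sketched as below. The function names and the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow; in practice the coefficients are estimated from data. The coefficient matrix `beta` is assumed to hold one row per base in the order A, C, G, T and one column per polynomial power.

```python
import numpy as np

BASES = "ACGT"  # ASSUMED row order of the coefficient matrix

def position_effects(beta, length=25):
    """mu[j, k-1] = sum over l of beta[j, l] * k**l, for positions k = 1..length."""
    k = np.arange(1, length + 1)
    powers = k[None, :] ** np.arange(4)[:, None]   # rows hold k^0, k^1, k^2, k^3
    return np.asarray(beta, dtype=float) @ powers  # shape (4 bases, length)

def probe_affinity(sequence, beta):
    """Sum the effect of the base observed at each position of the probe."""
    mu = position_effects(beta, len(sequence))
    # The indicator in Equation 1-10 simply selects the row of the observed base.
    return float(sum(mu[BASES.index(base), pos]
                     for pos, base in enumerate(sequence.upper())))

# Hypothetical coefficients: C and G carry a constant effect, and G also a linear one.
beta = np.zeros((4, 4))
beta[1, 0] = 0.1    # C, constant term
beta[2, 0] = 0.2    # G, constant term
beta[2, 1] = 0.01   # G, linear term in position k

all_g = probe_affinity("G" * 25, beta)
all_a = probe_affinity("A" * 25, beta)
```

With these made-up coefficients a probe consisting only of G has affinity 25 × 0.2 + 0.01 × (1 + 2 + … + 25) = 8.25, while a probe consisting only of A has affinity 0.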
Figure 1.4.3: Comparison of MAS 5.0 and GC-RMA probe-level summary methods
Summary and normalisation of probe-level data is required to provide a single intensity signal for each probe set. Two widely used algorithms are MAS 5.0 and GC-RMA. These procedures improve concordance between replicate samples, as can be seen by the scatterplots shown here. The signal for each probe set for two replicate samples from the main experiment (Panc T 4hr (1) and (2)) was plotted against each other on a simple Cartesian plot. Perfect similarity between the two replicates would be identified by points lying along the 45° identity line. a) While application of the MAS 5.0 algorithm to the data resulted in relatively high similarity between the two replicates, a large amount of variability was detected for probe sets with lower signals. This may result in a large number of probe sets called as false positives. b) This region of high variability was not present after application of the GC-RMA algorithm, resulting in a tighter fit along the identity line. This 'squashing' of the highly variable region greatly reduces the number of false positive calls, but may also inadvertently reduce the fold change of real low abundance biological variation, resulting in an increase in the false negative rate.