Combining the copy number ratios of several adjacent genomic fragments decreases the resolution but enhances the reliability of the result. Single gains and losses without support from adjacent genomic fragments can be discarded, as they often represent experimental errors or genomic fragments with a wrongly assigned chromosomal position.
Ideas for a multi-resolutional preprocessing of genomic profiles include
• Moving average,
• Identification of segments with similar copy number ratios and
• Wavelets.
Alternative solutions to deal with the curse of dimensionality would have been:
• Increasing the number of samples and
• Incorporation of previous knowledge.
Previous knowledge was used to exclude genomic fragments with a high error rate (e.g., genomic fragments with cross-hybridisation problems). The number of samples was always limited by the availability of tumour specimens and by financial constraints. For future applications of this workflow, increased sample numbers will permit higher genomic resolutions.
Moving average
The simplest algorithm is the combination of overlapping or adjacent measurements using a sliding window approach. I used this method successfully for the LOH data. The drawback of a moving average is that it does not preserve the edges of aberrant regions.
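The loss of edges can be illustrated with a minimal sketch. The profile values, the window width and the function name `moving_average` are assumptions chosen for illustration and are not taken from the LOH analysis.

```python
def moving_average(values, window):
    """Average each position over a centred window (truncated at both ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical profile: a balanced region (ratio 1.0) followed by a gain (ratio 2.0).
profile = [1.0] * 5 + [2.0] * 5
smoothed = moving_average(profile, window=3)

# The sharp step between positions 4 and 5 is spread over two positions.
print(smoothed[3:7])
```

Away from the step the values are unchanged, while the two positions next to the step take intermediate values, so the exact breakpoint can no longer be read off the smoothed profile.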
Identification of segments with similar copy number ratios
More advanced methods detect chromosomal regions with similar copy number ratios and assign the same (estimated) copy number ratio to all measurements inside a region. These methods were discussed in section 4.2.2. In the end, for continuous genomic profiles, I used the GLAD algorithm, which searches for piecewise constant functions. For each detected region, it assigns the copy number ratio of an estimated piecewise constant to all measurements inside this region.
4.2. DATA PREPROCESSING AND FEATURE SELECTION 75
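The assignment step can be sketched as follows. This is not the GLAD algorithm: the breakpoints are assumed to be given, and the use of the segment median as the estimated level, the function name `piecewise_constant` and the profile values are illustrative assumptions.

```python
from statistics import median


def piecewise_constant(values, breakpoints):
    """Replace every measurement by the median of its segment.

    `breakpoints` holds the start indices of new segments; they are assumed
    to lie strictly inside the profile (hypothetical input, not GLAD output).
    """
    edges = [0] + sorted(breakpoints) + [len(values)]
    fitted = []
    for start, end in zip(edges, edges[1:]):
        level = median(values[start:end])
        fitted.extend([level] * (end - start))
    return fitted


# Hypothetical noisy profile with a gain starting at position 4.
profile = [1.0, 1.1, 0.9, 1.0, 2.1, 1.9, 2.0, 2.0]
fitted = piecewise_constant(profile, breakpoints=[4])
```

In contrast to a sliding window, the step between the two regions stays exactly at the breakpoint, and the noise inside each region is removed.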
Wavelets
One idea for a multi-resolutional preprocessing is based on a wavelet transformation. Wavelets are mathematical functions that divide genomic profiles into different frequency components, each with a resolution matched to its scale. The wavelet transformation is a refinement of the Fourier transformation. The underlying idea of all transformations is that the transformed representation of the data facilitates the analysis of the data.
Wavelets have been successfully applied to the denoising of images, to image compression (JPEG2000), and to EEG (electroencephalogram) and ECG (electrocardiogram) signals. Recently, wavelets have also been used for denoising of continuous genomic profiles [HSG+05].
Mathematically, wavelets decompose a signal (a genomic profile) into a set of basis functions [Mal99]:
$$ W f(u, s) = \int_{-\infty}^{\infty} f(x)\, \psi^{*}_{u,s}(x)\, dx. \qquad (4.2.1) $$
The $\psi^{*}_{u,s}(x)$ are the complex conjugated basis functions. Each basis function is called a wavelet, has a mean value of zero,
$$ \int_{-\infty}^{\infty} \psi_{u,s}(x)\, dx = 0, \qquad (4.2.2) $$
and is generated from a basic wavelet (mother wavelet) by scaling (parameter $s$) and translation (parameter $u$):
$$ \psi_{u,s}(x) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{x-u}{s}\right). \qquad (4.2.3) $$
The simplest example of a (mother) wavelet function is a Haar wavelet:
$$ \psi(x) = \begin{cases} 1 & 0 \le x < \tfrac{1}{2}, \\ -1 & \tfrac{1}{2} \le x \le 1, \\ 0 & \text{else.} \end{cases} \qquad (4.2.4) $$
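The Haar wavelet and its scaled and translated versions can be checked numerically with a short sketch. The values of u and s, the helper `integrate` (a plain midpoint rule) and the function names are assumptions made for illustration.

```python
import math


def haar(x):
    """Haar mother wavelet as defined in equation (4.2.4)."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x <= 1:
        return -1.0
    return 0.0


def haar_us(x, u, s):
    """Scaled (s) and translated (u) wavelet as in equation (4.2.3)."""
    return haar((x - u) / s) / math.sqrt(s)


def integrate(func, a, b, n=10000):
    """Midpoint rule on [a, b] with n sub-intervals (illustrative helper)."""
    h = (b - a) / n
    return sum(func(a + (k + 0.5) * h) for k in range(n)) * h


u, s = 3.0, 4.0  # hypothetical translation and scale
mean = integrate(lambda x: haar_us(x, u, s), u, u + s)
energy = integrate(lambda x: haar_us(x, u, s) ** 2, u, u + s)
```

The mean stays at zero for any choice of u and s, in agreement with equation (4.2.2), and the factor 1/√s keeps the integral of the squared wavelet equal to one at every scale.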
[Figure 4.5, panels A and B: copy number ratio plotted against chromosomal position.]
Figure 4.5: Chromosomal aberrations of high and low frequency are plotted against an ideogram (schematic chromosomal map). A) Distinct chromosomal regions: high frequency B) Large chromosomal regions: low frequency
However, I used the Daubechies wavelet family [Mal99] because of its compact support. A closed-form representation of this family does not exist; its members are calculated iteratively by a filter bank approach.
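A single analysis step of such a filter bank can be sketched as below. The text does not state which member of the Daubechies family was used, so the four-tap filter, the periodic boundary treatment, the profile values and the function name `filter_bank_step` are assumptions for illustration.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Low-pass coefficients of the four-tap Daubechies filter.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM,
       (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Matching high-pass filter: reversed low-pass with alternating signs.
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def filter_bank_step(signal):
    """One analysis step: filter and downsample by two (periodic boundary).

    The signal length is assumed to be even and at least four.
    """
    n = len(signal)
    approx, detail = [], []
    for i in range(0, n, 2):
        window = [signal[(i + k) % n] for k in range(4)]
        approx.append(sum(h * x for h, x in zip(LOW, window)))
        detail.append(sum(g * x for g, x in zip(HIGH, window)))
    return approx, detail


# Hypothetical profile with a short gain at positions 4 and 5.
profile = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
approx, detail = filter_bank_step(profile)
```

Applying the same step again to `approx` yields the next coarser scale. The detail coefficients are zero wherever the filter window covers a constant stretch and non-zero only where it overlaps an edge of the gain.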
The signal-processing terminology is based on the frequency of signals. How is such a frequency defined for genomic profiles?
The frequency of a genomic aberration is determined by its length. Large genomic aberrations correspond to low frequencies, whereas short, distinct genomic aberrations have high frequencies (Fig. 4.5). Each aberration can thus be described uniquely by its frequency (length) and location (in kb).
Fig. 4.6 shows a wavelet analysis of the cancer cell line HL60. The large aberration on chromosome 5 is reflected by a wavelet coefficient of low frequency, whereas the distinct aberration on chromosome 8 leads to a wavelet coefficient of high frequency. Note that the spatial resolution of a wavelet coefficient with a higher frequency is much better than that of a wavelet coefficient with a lower frequency.
Because the genomic fragments on the array-CGH chip are not strictly equidistant, copy number ratios are computed on 128 virtual equidistant genomic fragments using an interpolation algorithm (splines; see the algorithm described in section 4.2.1). This is a prerequisite for the application of wavelets.
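The resampling step can be sketched as follows. For brevity the sketch uses linear interpolation instead of the splines used in this work; the positions, the ratios and the function name `resample_equidistant` are hypothetical.

```python
from bisect import bisect_right


def resample_equidistant(positions, ratios, n_points=128):
    """Interpolate ratios linearly onto n_points equally spaced positions.

    `positions` must be sorted in ascending order. Linear interpolation is a
    stand-in here for the spline interpolation described in the text.
    """
    start, end = positions[0], positions[-1]
    step = (end - start) / (n_points - 1)
    grid, values = [], []
    for k in range(n_points):
        x = start + k * step
        # Index of the measured fragment to the left of x, kept inside the data.
        j = min(max(bisect_right(positions, x) - 1, 0), len(positions) - 2)
        x0, x1 = positions[j], positions[j + 1]
        t = min(max((x - x0) / (x1 - x0), 0.0), 1.0)
        grid.append(x)
        values.append(ratios[j] + t * (ratios[j + 1] - ratios[j]))
    return grid, values


# Hypothetical fragments at irregular positions (arbitrary units).
positions = [0.0, 1.0, 4.0, 10.0]
ratios = [1.0, 1.0, 2.0, 2.0]
grid, values = resample_equidistant(positions, ratios)
```

A grid length of 128 is a power of two, which dyadic wavelet transforms typically require as input length.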
My idea was to use all wavelet coefficients above a given threshold as input to a classifier. In retrospect, this approach did not outperform the use of the original features (copy number ratios after a multi-resolutional preprocessing).
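The feature selection idea can be sketched with a complete wavelet decomposition followed by thresholding. The sketch uses the Haar wavelet instead of the Daubechies family for brevity; the threshold, the profile and the function names `haar_dwt` and `select_features` are assumptions for illustration.

```python
import math


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns the detail coefficients per level (finest level first) and the
    final approximation coefficient.
    """
    approx = list(signal)
    levels = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels, approx[0]


def select_features(levels, threshold):
    """Keep (level, position, coefficient) where |coefficient| > threshold."""
    return [(level, pos, coeff)
            for level, coeffs in enumerate(levels)
            for pos, coeff in enumerate(coeffs)
            if abs(coeff) > threshold]


# Hypothetical profile with a single-fragment gain at position 3.
profile = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0]
levels, overall = haar_dwt(profile)
features = select_features(levels, threshold=0.5)
```

Each retained coefficient carries a level (scale) and a position (translation), so the feature vector passed to a classifier would encode both the length and the location of an aberration.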
[Figure 4.6, panels A and B: copy number ratios plotted against chromosomal position; panels C to F: wavelet coefficients plotted by scaling against translation.]
Figure 4.6: Wavelet analysis of the cancer cell line HL60. Genomic profiles of chromosomes 5 (A) and 8 (B), their most important (Daubechies) wavelet components on chromosome 5 (C) and 8 (D), and all wavelet coefficients on chromosome 5 (E) and 8 (F). The underlying matrix-CGH measurements were kindly provided by Bernhard Radlwimmer. All calculations were performed using the statistical software package R and the wavethresh library.