• No se han encontrado resultados

Classifi

er

(number of probe sets used)

NC (50)

0.01(0.02)

0.02(0.02)

0.01(0.02)

0.02(0.03)

PAM (20)

0.01(0.02)

0.01(0.02)

0.01(0.02)

0.01(0.02)

LIKNON

(10)

0.02(0.03)

0.02(0.03)

0.02(0.02)

0.02(0.02)

k-NN

(50)

0.01(0.02)

0.01(0.02)

0.01(0.02)

0.01(0.02)

SVC/P

(10)

0.02(0.03)

0.02(0.02)

0.02(0.03)

0.02(0.03)

SVC/RBF

(10)

0.03(0.02)

0.02(0.02)

0.02(0.02)

0.02(0.02)

Table 5A. Performance of different classifi cation algorithms: AML dataset, inv(16) problem. Mean test set error (standard deviation) over 100 random splits of the original data into a training set (90%) and a test set (10%). Error is defi ned as average error per class, i.e. corresponding to assuming a prior probability of occurrence of a class of 50%. Classifi ers were trained for 10, 20, 50, 100, 250, 500 and 1000 probe sets selected by the variation fi lter; results shown here are for the number of probe sets resulting in the smallest average test set error over the four methods, indicated between brackets after the classifi er name.

MAS dChip RMA GCRMA

Classifi

er

(number of probe sets used)

NC (10) 0.16(0.06) 0.19(0.09) 0.16(0.08) 0.26(0.08) PAM (20) 0.14(0.06) 0.14(0.06) 0.14(0.06) 0.14(0.06) LIKNON (20) 0.12(0.05) 0.11(0.05) 0.12(0.06) 0.13(0.05) k-NN (200) 0.10(0.05) 0.11(0.05) 0.11(0.06) 0.13(0.06) SVC/P (20) 0.12(0.05) 0.11(0.05) 0.13(0.06) 0.13(0.05) SVC/RBF (100) 0.09(0.05) 0.10(0.05) 0.10(0.05) 0.14(0.05)

MAS dChip RMA GCRMA

Classifi

er

(number of probe sets used)

NC (1000) 0.34(0.09) 0.34(0.08) 0.33(0.08) 0.35(0.08) PAM (10) 0.29(0.06) 0.30(0.06) 0.30(0.05) 0.30(0.06) LIKNON (20) 0.27(0.08) 0.31(0.07) 0.31(0.08) 0.28(0.07) k-NN (20) 0.28(0.07) 0.29(0.06) 0.30(0.06) 0.29(0.07) SVC/P (20) 0.28(0.08) 0.31(0.07) 0.31(0.07) 0.28(0.06) SVC/RBF (20) 0.28(0.05) 0.30(0.04) 0.30(0.04) 0.29(0.04)

Table 5C. AML dataset, CCR problem.

MAS dChip RMA GCRMA

Classifi

er

(number of probe sets used)

NC (1000) 0.04(0.09) 0.03(0.09) 0.03(0.08) 0.07(0.13) PAM (1000) 0.05(0.10) 0.07(0.13) 0.09(0.14) 0.06(0.13) LIKNON (1000) 0.10(0.12) 0.06(0.11) 0.10(0.13) 0.18(0.18) k-NN (500) 0.06(0.11) 0.06(0.12) 0.08(0.12) 0.04(0.10) SVC/P (1000) 0.09(0.12) 0.05(0.11) 0.04(0.09) 0.07(0.12) SVC/RBF (10) 0.17(0.13) 0.16(0.13) 0.17(0.14) 0.14(0.14)

Table 5D. CNS dataset, MED problem.

separated from the effect of applying different models for summarization. And, although the dChip and RMA summarization models are more related to each other than the MAS and RMA ones, MAS pre-processed data shows more similarity to RMA than does dChip.

In practical terms, the question of which method will give expression value estimates closest to the actual data is still to be answered; this study has not attempted to answer it, because we have not used data with accompanying ground truth. We showed that results of various analyses are not always dependent on the choice of pre-processing method. Analyses such as calculating expression levels or

assessing differential expression are reasonably susceptible to differences between pre-processing methods; clustering as well, except when looking for clearly present structure (that is, using a small number of clusters); but classifi cation far less so. The message is that while care should be taken in assigning biological meaning to individual probe set measurements, this holds less for global statements about the data.

Several other studies have been performed to assess the level of concordance in differential gene sets between pre-processing methods and noted that the choice of the method was of major infl uence, with different studies favoring different pre-processing methods (9, 11, 13). Our results do not conclusively confi rm one or more studies, although results partially overlap. One major difference with other studies is the size of the used datasets, where one of the datasets used in this study is considerably larger. It is to be expected that with the evolution of the array technology, the number of profi led samples in any single patient-cohort study is likely to increase.

The effects of the choice of pre-processing method are far more profound in the CNS dataset than in the AML dataset. Several possible explanations can be given for this, but it is not possible to single any of them out based only on the two datasets used in this study. The AML dataset contains more samples, which allows for better parameter estimates in the analysis methods presented in this work. Furthermore, Affymetrix technology has evolved over time, resulting in a more stable platform for the AML dataset (HG-U133A) than the CNS dataset (HuGeneFL). Biological differences also play a role in the two datasets. The amount of viable cells obtained from bone marrow is also likely to be higher compared to solid tumors, which often show necrotic areas, leading to difference in RNA-quality and -degradation. Also, tumor cells can be purifi ed from bone marrow samples using Ficoll-centrifugation, a technique which is not available for the solid tumors which were hybridized in the CNS dataset, resulting in less contamination with other cell types in hybridized samples, which is known to be an important factor (34). We recommend that the emphasis in setting up a large microarray-based study should therefore be on the quality of the biological sample and the quality of RNA rather than on the choice of the pre-processing procedure. However, we do believe that an inverse relation exists, with the importance of the method of normalization and expression summarization increasing when the quality of the biological sample and the number of studied samples decrease. Although we base this on a limited number of pre-processing methods and data sets, we think that taking into account more available methods will have no effect on our conclusion.

List of Abbreviations

AML Acute myeloid leukemia CNS Central nervous system

PNET Primitive neuro-ectodermal tumors MED Medullablastoma

GLIO Malignant glioma RHAB Rhabdoid tumors

SAM Signifi cance analysis of microarrays CDF Cumulative distribution function

CCR Continuous complete remission PAM Prediction analysis of microarrays k-NN k-Nearest neighbour

NC Nearest centroid

SVC-P Support vector classifi er with polynomial kernel of degree d SVC-R Support vector classifi er with radial basis function kernel of width σ

Acknowledgements

The authors would like to thank Sahar Barjesteh van Waalwijk van Doorn – Khosrovani, Renee Beekman, Claudia Erpelinck , Judith Gits and Antoinette Van Hoven – Beijen for providing RQ-PCR data and Ruud Delwel for helpful discussions. The authors are grateful to the anonymous reviewers for their helpful remarks, which improved the manuscript.

References

1. Affymetrix: Microarray Suite User Guide. In., vol. Version 5; 2001.

2. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001, 98(1):31-36.

3. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.

4. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15.

5. Naef F, Magnasco MO. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 68(1 Pt 1):011906.

6. Wu Z, Irizarry RA, Gentleman R, Murillo F, Spencer F. A model based background adjustment for oligonucleotide expression arrays. In. Baltimore: John Hopkins University; 2004.

7. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantifi cation of differential expression. Bioinformatics 2002, 18 Suppl 1:S96-104.

8. Affymetrix: Guide to probe logarithmic intensity error (PLIER) estimation. In.; 2005. 9. Shedden K, Chen W, Kuick R, et al. Comparison of seven methods for producing Affymetrix

expression scores based on False Discovery Rates in disease profi ling data. BMC Bioinformatics 2005, 6(1):26.

10. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264.

11. Hoffmann R, Seidl T, Dugas M. Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biol 2002, 3(7): RESEARCH0033.

12. Liu WM, Mei R, Di X, et al. Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 2002, 18(12):1593-1599.

13. Rajagopalan D. A comparison of statistical methods for analysis of high density oligonucleotide array data. Bioinformatics 2003, 19(12):1469-1476.

14. Freudenberg J, Boriss H, Hasenclever D. Comparison of preprocessing procedures for oligo- nucleotide micro-arrays by parametric bootstrap simulation of spike-in experiments. Methods Inf Med 2004, 43(5):434-438.

15. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20(3):323-331.

16. Bullinger L, Dohner K, Bair E, et al. Use of gene-expression profi ling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med 2004, 350(16):1605-1616.

17. Valk PJ, Verhaak RG, Beijen MA, et al. Prognostically useful gene-expression profi les in acute myeloid leukemia. N Engl J Med 2004, 350(16):1617-1628.

18. Yeoh EJ, Ross ME, Shurtleff SA, et al. Classifi cation, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profi ling. Cancer Cell 2002, 1(2):133-143.

19. Su AI, Welsh JB, Sapinoso LM, et al. Molecular classifi cation of human carcinomas by use of gene expression signatures. Cancer Res 2001, 61(20):7388-7393.

20. Pomeroy SL, Tamayo P, Gaasenbeek M, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415(6870):436-442.

21. Van der Reijden BA, de Wit L, van der Poel S, et al. Identifi cation of a novel CBFB-MYH11 transcript: implications for RQ-PCR diagnosis. Hematol J 2001, 2(3):206-209.

22. Barjesteh van Waalwijk van Doorn-Khosrovani S, Erpelinck C, van Putten WL, et al. High EVI1 expression predicts poor survival in acute myeloid leukemia: a study of 319 de novo AML patients. Blood 2003, 101(3):837-845.

23. van Waalwijk van Doorn-Khosrovani SB, Erpelinck C, Meijer J, van Oosterhoud S, et al. Biallelic mutations in the CEBPA gene and low CEBPA expression levels as prognostic markers in intermediate-risk AML. Hematol J 2003, 4(1):31-40.

24. Mulloy JC, Jankovic V, Wunderlich M, et al. AML1-ETO fusion protein up-regulates TRKA mRNA expression in human CD34+ cells, allowing nerve growth factor-induced expansion. Proc Natl Acad Sci U S A 2005, 102(11):4016-4021.

25. Bengtsson M, Stahlberg A, Rorsman P, Kubista M. Gene expression profi ling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels. Genome Res 2005, 15(10):1388-1392.

26. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures., Third edition edn. Boca Raton, FL: Chapman & Hall/CRC; 2004.

27. Ge U, Dudoit S, Speed TP. Resampling-based multiple testing for microarray analysis. In.; 2003.

28. Tusher VG, Tibshirani R, Chu G: Signifi cance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9).5116-5121.

29. Lange T, Roth V, Braun ML, Buhmann JM. Stability-based validation of clustering solutions. Neural Comput 2004, 16(6):1299-1323.

30. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 2002, 99(10):6567-6572.

31. Bhattacharyya C, Grate LR, Rizki A, et al. Simultaneous relevant feature identifi cation and classifi cation in high-dimensional spaces: Application to molecular profi ling data. Signal Processing 2003, 83:729-743.

32. Duda RO, Hart PE, Stork DG. Pattern classifi cation, second edition edn. Hoboken, NY: Wiley Interscience; 2003.

33. Jain KJ, Dubes RC. Algorithms for clustering data. Upper Saddle River, NJ: Prentice Hall Inc.; 1988.

34. de Ridder D, van der Linden CE, Schonewille T, et al. Purity for clarity: the need for purifi cation of tumor cells in DNA microarray studies. Leukemia 2005, 19(4):618-627.

Mutations in nucleophosmin NPM1 in acute myeloid

Documento similar