School of Medicine and Health Sciences


INSTITUTO TECNOLÓGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY

Campus Monterrey

School of Medicine and Health Sciences

Transcriptomic meta-analysis of sorted CD133⁺ stem cells and their analogues in cancer yields a set of common differentially expressed genes and overrepresented functional categories

A thesis presented by

Jocelyn Nikita Campa Carranza

Submitted to the

School of Medicine and Health Sciences

in partial fulfillment of the requirements for the degree of Master of Science

in Biomedical Sciences

Monterrey, Nuevo León, México, June 9th, 2020

PREFACE

The present dissertation was conducted at the School of Medicine and Health Sciences of Tecnológico de Monterrey, Campus Monterrey, under the supervision of Raquel Cuevas Díaz Durán, PhD, and Mirna González González, PhD, with the financial support of Tecnológico de Monterrey and a CONACyT fellowship (CVU. 923850).

PREFACE (Spanish version)

This thesis was carried out at the School of Medicine and Health Sciences of Tecnológico de Monterrey, Campus Monterrey, under the supervision of Dr. Raquel Cuevas Díaz Durán and Dr. Mirna González González, with the financial support of Tecnológico de Monterrey and of CONACyT (CVU. 923850).


Dedication

To my parents Julieta and Vicente, I am because of you; I hope you look at me and think all your sacrifices have been worth it.

To my brother Dylan,

keep thriving, we are in this together.

Dedication (Spanish version)

To my parents Julieta and Vicente, everything I am is thanks to you; I hope I can keep making you proud, just as you motivate me.

To my brother Dylan,

never give up, we are together

in this adventure called life.


Acknowledgements

I would like to express my deepest appreciation to all the kind people around me, who have been instrumental and supportive while I brought this master’s thesis to completion after having to change the focus of my project due to external reasons.

I am very grateful to my principal advisor, Professor Raquel Cuevas, who continually motivated me to learn new concepts. Without her guidance, patience and persistent help, this dissertation would not have been possible in such a reduced amount of time.

I place on record my sincere gratitude to my co-advisor, Professor Mirna González, for all her support and for her constant concern that things be done in the best possible way.

I would like to thank my committee members, Professor Emmanuel Martínez, for all his support and advice with the programming code in R, and Professor Víctor Treviño. Both of them introduced me to Bioinformatics, and their enthusiasm for teaching awakened my curiosity for the field.

I would like to acknowledge the group of friends I have made while completing this program: Alonso, Mayela, Karo, Romeo, Bianca, Mariana, and Itzel, for their support in hard times and for the fun times as well.

In addition, I would like to thank in a very special way my non-blood brothers Jesús and Pato, simply for appearing in my life and being there for me.

Finally, I would like to acknowledge the financial support of Tecnológico de Monterrey, and CONACyT for my fellowship (CVU. 923850).

All of you have been consistently excellent at providing guidance and comfort during this step of my journey to become a clinician-scientist.

Acknowledgements (Spanish version)

I would like to express my greatest gratitude to all the special people around me, who have motivated and supported me throughout my master's studies, even after the focus of my project changed due to external causes.

I am very grateful to my principal advisor, Dr. Raquel Cuevas, who continually motivated me to keep learning new things. Without her support and patience, this thesis could definitely not have been finished in so little time.

Likewise, I thank my co-advisor, Dr. Mirna González, for all her support and for always showing her concern that things turn out in the best possible way.

I would also like to thank the other members of my committee, Dr. Emmanuel Martínez, for all his support and advice with the programming code in R, and Dr. Víctor Treviño. Both introduced me to the field of Bioinformatics, and their enthusiasm for teaching awakened in me an interest in delving into that field.

I also want to thank the group of friends I made during my time in this master's program: Alonso, Mayela, Karo, Romeo, Bianca, Mariana, and Itzel, for their support in difficult times and for the fun moments as well.

Additionally, I want to thank in a very special way my adoptive brothers Jesús and Pato, simply for appearing in my life and being there for me.

Finally, I am grateful for the financial support of Tecnológico de Monterrey, and of CONACyT for the fellowship granted during my studies (CVU. 923850).

All of you have been an essential pillar during this stage of my career.


Transcriptomic meta-analysis of sorted CD133⁺ stem cells and their analogues in cancer yields a set of common differentially expressed genes and over-represented functional categories

by

Jocelyn Nikita Campa Carranza

ABSTRACT

Stem cells have been exploited for their potential application in regenerative medicine due to their properties of self-renewal, proliferation, differentiation, and immunomodulation. The isolation of primitive stem cells focuses on the presence of surface biomarkers, prominin-1/CD133 among them, on account of the potential therapeutic applications that have been reported for CD133⁺ stem cells in preclinical studies. However, CD133 is also one of the most prominent and commonly reported surface biomarkers for cancer stem cells (CSCs). Prominin-1 has also been associated with proliferation, cell survival, and autophagy in precursor and mature cells. Accordingly, prominin-1 appears to be a good candidate for targeting, but its biological implications remain to be determined.

Here, we used publicly available gene expression data of sorted CD133⁺ cells from normal and cancerous sources to perform an integrated meta-analysis, identify a set of differentially expressed (DE) genes, and search for functional relationships among them. A subset of statistically significant genes was further validated in silico. The identification of representative genes and a co-expression network aimed to better elucidate the underlying biological function of prominin-1/CD133. The present project melds biomedical knowledge with the use of bioinformatics, exploiting the availability of large databases of genomic information. Moreover, our methodology is a cost-effective approach to extract knowledge from biological data in a fast and accurate way.

Keywords: stem cells, prominin-1, CD133, cancer stem cells, differentially expressed genes, meta-analysis, functional relationships, co-expression network, bioinformatics


List of tables

Table 2. 1 Well-known stem cell surface markers

Table 2. 2 CD133 as a biomarker

Table 3. 1 Methodological summaries of the analyzed microarray experiments

Table 3. 2 Methodological summaries of the analyzed RNA-seq experiments

Table 4. 1 Number of significantly regulated genes per microarray study.

Table 4. 2 Number of significantly regulated genes per RNA-seq study.

Table 4. 3 Genes with a positive correlation to PROM1 expression pattern in glioblastoma.

Table 4. 4 Genes with a negative correlation to PROM1 expression pattern in glioblastoma.

Table 4. 5 Expression of the set from differential up-regulated genes in colon carcinoma.

Table 4. 6 Expression of the set from differential down-regulated genes in colon carcinoma.

Table 4. 7 Expression of the set from differential up-regulated genes in hESCs.

Table 4. 8 Expression of the set from differential down-regulated genes in hESCs.

Table A 1. Acronyms

Table A 2. Abbreviations

Table A 3. Microarray study IDs

Table A 4. RNA-seq study IDs


List of figures

Figure 3. 1 Schematic representation of microarray data processing.

Figure 3. 2 Schematic representation of RNA-sequencing data processing.

Figure 4. 1 PCA for the microarray studies data using z-scores of the normalized expression values.

Figure 4. 2 Summary data from the meta-analysis of independent microarray studies.

Figure 4. 3 Filtered lists of regulated genes considering only the statistically significant DE genes.

Figure 4. 4 PCA performed for the RNA-seq studies data using z-scores of the normalized counts.

Figure 4. 5 Intersection of lists of DE genes from the RNA-seq experiments.

Figure 4. 6 Intersection of lists of DE genes with the genes present in all studies.

Figure 4. 7 Heatmap showing z-scores of expression values of all samples and their hierarchical clustering and relationship grouping of samples.

Figure 4. 8 Results from overrepresentation analysis for DE genes.

Figure 4. 9 Results from the overrepresentation analysis for DE genes correlated with the expression of PROM1 in non-cancer studies.

Figure 4. 10 Results from the overrepresentation analysis for DE genes correlated with the expression of PROM1 in cancer studies.

Figure 4. 11 Results from the overrepresentation analysis for DE genes in glioblastoma.

Figure 4. 12 Results of PPI network analysis for positive correlated DE genes with the expression of PROM1 in normal stem cells.

Figure 4. 13 Results of PPI network analysis for positive correlated DE genes with the expression of PROM1 in GBM stem cells.

Figure 4. 14 PROM1 expression profile graphs per experiment.


In recent years, the use of stem cells and cell therapy has had a major impact on regenerative medicine. Stem cells possess unique abilities that make them exploitable for the treatment of a wide variety of diseases and injuries. Stem cells can be identified and isolated with cell-sorting techniques based on the expression of cell surface markers. Isolation of primitive stem cells focuses on the presence of surface biomarkers related to stemness, such as CD133. This well-known stem cell surface marker is also one of the most commonly reported biomarkers for cancer stem cells (CSCs).

The exact function of CD133 is still not entirely understood, but its prevalent expression in various tissues, stem cells and their analogues in cancer indicates an important role. Different molecular mechanisms have been investigated to better understand the implications of CD133 in normal and cancer stem cells, due to associations that have been reported with the presence of the biomarker.

For instance, potential therapeutic applications have been reported for CD133⁺ stem cells, including cell differentiation (Torrente et al., 2004; Uchida et al., 2000) and treatment of hematopoietic and cardiovascular disorders (Bornhäuser et al., 2005; Steinhoff et al., 2017). Additionally, the widespread expression of CD133 in CSCs and its proposed relation to cell proliferation signaling pathways such as Wnt/β-catenin (R. Wang et al., 2016) and PI3K-Akt (Wei et al., 2013) have prompted interest in elucidating the biological function of this biomarker.

In this regard, there is an opportunity to better understand CD133 expression by studying the gene expression profiles of cells that share the presence of the biomarker. Compiling transcriptional profiles of CD133⁺ stem cells can unveil important related genes, help find functional relationships among them, and shed further light on the molecular mechanisms influenced by the expression of CD133.

In this research, a systematic assessment of publicly available gene expression data of CD133⁺ cells was implemented as a strategy to provide a more precise estimate of the underlying biological implications of the biomarker. Integrating several studies contributes more to the analysis than relying on a single study and can establish a proof of principle for further confirmatory studies, narrowing the research focus prior to reaching clinical settings. Moreover, the methodology used in this project is a cost-effective approach to extract knowledge from biological data, exploiting the availability of large genomic databases.

1. 2 Research question

What biologically related sets of genes are enriched in stem cells expressing CD133 and their analogues in cancer that could help better elucidate its underlying biological function?

1. 3 Justification

Prominin-1/CD133 is a known cell surface biomarker of primitive stem cells, but also one of the most prominent and commonly reported cancer stem cell markers. Therefore, it is used as an effective tool for the identification and isolation of stem cells. However, the biological function of CD133 remains elusive. Considering that gene expression profiles of sorted CD133⁺ cells from normal and cancerous sources are available in public databases, the identification of related differentially expressed genes and functional categories can help to unravel potential biological functions of this marker, yielding potential therapeutic targets.

1. 4 Hypothesis

Finding differentially expressed and co-expressed genes in sorted CD133⁺ cells that engage in related signaling pathways could help better elucidate the underlying biological function of CD133.

1. 5 General objective

To perform a meta-analysis using transcriptomic data sets from cells expressing the multipotent stem cell marker prominin-1/CD133 in different sources to identify differentially expressed genes and their related functional categories.

1. 6 Specific objectives

1. To conduct a search in the GEO DataSets database (NCBI) for microarray and RNA-sequencing experiments.

2. To run differential gene expression, correlation and gene set enrichment algorithms.

3. To identify representative gene signatures of CD133⁺ normal stem cells and CD133⁺ cancer stem cells.

4. To validate the most statistically significant genes in silico.

1. 7 Thesis structure

The present document is divided into the following chapters:

Chapter 1. Introduction.

A general overview of the work is presented, the hypothesis, justification and objectives are established.

Chapter 2. Theoretical Framework.

General concepts and theoretical background of the multipotent stem cell marker CD133 are presented. The previously reported functions of the marker and its association with normal stem cells and cancer stem cells, are also addressed.

Chapter 3. Methods.

The methodology used for data preprocessing and downstream processing is described in detail.

Chapter 4. Results.

The main findings of the project are addressed.

Chapter 5. Discussion.

The obtained results are analyzed, discussed and compared with published work.

Chapter 6. Conclusions and Perspectives.

Conclusions and future applications of the results are presented. The proposed research question is answered.

Chapter 7. Appendix.

Abbreviations, acronyms, study IDs, conference presentations and complementary research work are presented.

Chapter 8. Bibliography.

References used in this document are listed in APA format.

Curriculum vitae.

2. THEORETICAL FRAMEWORK

2. 1 Stem cells and their application in regenerative medicine

The use of stem cells (SCs) in regenerative medicine holds promise for treating a wide variety of diseases and injuries due to their unique properties of self-renewal, proliferation, differentiation, and immunomodulation. For this reason, in the last 20 years there has been a significant investment in basic and clinical research in the stem cell field, which includes the isolation, generation, expansion and clinical application of diverse types of SCs (Ratcliffe et al., 2013).

Stem cell research has increased since the first reported isolation of human embryonic stem cells (hESCs) by Thomson et al. (1998). Nevertheless, after the ethical conflicts generated by SCs obtained from human embryos, the focus of stem cell research shifted to stem cells derived from adult tissues and to the more recently described induced pluripotent stem cells (iPSCs), generated from various cell types through reprogramming technology (Robinton & Daley, 2012). Adult stem cells were described by Pittenger et al. (1999), who demonstrated the ability of human bone marrow-derived adult mesenchymal stem cells (BM-MSCs) to differentiate into diverse cell types, unveiling the multipotency of adult stem cells. iPSCs were first described by K. Takahashi & Yamanaka (2006), who derived iPSCs directly from mouse somatic cells through ectopic co-expression of four reprogramming transcription factors, thus providing an alternative source of pluripotent SCs for further research. Moreover, mesenchymal stem cells (MSCs) derived from adult tissues, including bone marrow and other connective tissues, are currently a topic of focus in regenerative medicine due to their high proliferation capacity and their ability to differentiate into diverse cell types, such as osteoblasts, chondrocytes and adipocytes (Pittenger et al., 1999), in addition to hepatocytes (Lee et al., 2004), neurons (Sanchez-Ramos et al., 2000) and glial cells (Lee et al., 2004).

The identification, isolation, and characterization of stem cells are essential for their further application in regenerative medicine. This can be achieved through cell-sorting techniques, which exploit the expression of cell surface markers, among other techniques. Well-known cell surface markers have been described for different cell types, including stem cells (Sousa et al., 2014). All cells are coated with specialized proteins that can selectively bind or adhere to other signaling molecules; however, some of these proteins are uniquely present in specific cell types and can therefore act as cell markers. A number of molecules have been defined as surface markers for stem cells (Table 2. 1). Stem cell markers include carbohydrate-associated molecules and cluster of differentiation (CD) antigens, among other surface antigens (Zhao et al., 2012). Some of these molecules relate to stemness and aid in the identification of a more primitive and multipotent subset of cells. The isolation of primitive stem cells focuses on the presence of these “stemness” surface biomarkers, such as CD133. Potential therapeutic applications have been reported for CD133⁺ stem cells, including differentiation into myogenic cells (Torrente et al., 2004), neurons and glial cells (Uchida et al., 2000), and treatment of hematopoietic (Bornhäuser et al., 2005; Lang et al., 2004) and cardiovascular disorders (Stamm et al., 2007; Steinhoff et al., 2017).

Table 2. 1 Well-known stem cell surface markers

Cell type (human) | Markers | References
Umbilical cord mesenchymal stem cells (UC-MSCs) | CD29, CD44, CD49b, CD105 (SH2), CD166, HLA-ABC | (H.-S. Wang et al., 2004)
Bone marrow mesenchymal stem cells (BM-MSCs) | CD29, CD44, CD73 (SH3), CD90 (Thy-1), Stro-1, CD106 (VCAM-1), CD105, CD166, HLA-ABC | (Choong et al., 2007; Gronthos et al., 2003)
Hematopoietic stem cells (HSCs) | CD133, CD34, CD38, CD59 | (Drake et al., 2011)
Neural progenitor/stem cells | CD133, SSEA4, CD15, CD44, CD117 (c-KIT), Sox2 | (Barraud et al., 2007; Vinci et al., 2016)
Embryonic stem cells (ESCs) | CD133, SSEA3, SSEA4, TRA-1-60, TRA-1-81, CD90 (Thy-1), CD324 (E-Cadherin), CD117 (c-KIT), CD326, CD29 (β1 integrin), CD24 (HSA), CD59 (Protectin), CD31 (PECAM-1), CD49f (Integrin α6), Cripto (TDGF-1), CD26 (DPP-4) | (W.-T. Kim & Ryu, 2017; Zhao et al., 2012)
Adipose-derived stem cells (ASCs) | CD90, CD44, CD29, CD105, CD13, CD34, CD73, CD166, CD10, CD49e, CD59 | (Mildmay-White & Khan, 2017)

CD: Cluster of differentiation; HLA: human leukocyte antigen; SH3: Src homology 3 domain; SH2: Src homology 2 domain; Thy-1: thymocyte antigen; STRO-1: stromal precursor antigen-1; VCAM-1: vascular cell adhesion molecule-1; SSEA: stage specific embryonic antigens; TDGF-1: teratocarcinoma-derived growth factor-1.

2. 2 CD133: more than a stem cell marker

CD133 (prominin-1) is a 97 kDa transmembrane glycoprotein and a member of the prominin family of pentaspan membrane proteins, which are specifically localized in cellular protrusions (Glumac & LeBeau, 2018). It has been established as a cell surface biomarker for the identification and isolation of stem cells from different sources. CD133 was initially discovered by Yin et al. (1997) as the target of the AC133 monoclonal antibody (mAb), specific for the CD34⁺ population of hematopoietic stem cells (HSCs). In the same year, a study of murine neuroepithelial cells reported the identification of prominin using an antibody (mAb 13A4) that was found to specifically label the apical plasma membrane of neuroepithelial cells (Weigmann et al., 1997). The 13A4 antigen within the apical plasma membrane was found to be selectively present within microvilli and absent from the rest of the plasma membrane. On the basis of this specific localization, and after molecular cloning, the novel membrane protein was named “prominin”.

The expression of prominin-1 is not restricted to stem cells; it has been reported in epithelial cells of adult tissues, including mammary and salivary glands, testis, digestive tract, trachea and placenta (Fagerberg et al., 2014). CD133 is also found in non-epithelial cells such as rod photoreceptor cells (Jászai et al., 2011), where it plays a role in the formation of the photoreceptor discs. CD133, alone or in combination with other biomarkers, has been used to identify different types of stem cells.

CD133 is also one of the most prominent and commonly reported surface biomarkers for cancer stem cells (CSCs). CSCs are defined as a subpopulation of cells within the heterogeneous tumor bulk. This subset of cells has the ability to self-renew and to differentiate into the non-CSCs that make up the rest of the tumor. Moreover, CSCs have been shown to initiate tumors upon transplantation. Human CSCs have been studied in immunocompromised mice, whereas mouse CSCs have been studied in syngeneic animals, to evaluate their tumor-initiating properties (Duarte et al., 2012; M. Zhang et al., 2008). CD133 has been postulated to identify the CSC populations of various tumor types, including several brain cancers (Singh et al., 2004), prostate cancer (Collins et al., 2005), colorectal carcinoma (O’Brien et al., 2007), breast cancer (Al-Hajj et al., 2003), hepatocellular carcinoma (Suetsugu et al., 2006), ovarian cancer (Cioffi et al., 2015) and lung cancer (Eramo et al., 2008). These studies have shown that CD133⁺ CSCs possess self-renewal capacity and the ability to regenerate a histologically similar tumor after transplantation into immunodeficient mice. The identification and isolation of CSCs using CD133 is widespread, but it is also controversial because CD133 expression is not exclusive to CSCs. Based on the studies mentioned above, CD133 has been adopted as a common biomarker to identify diverse stem cells (Table 2. 2). However, the physiologic function of CD133 in normal biology and in the progression of cancer remains elusive.

Table 2. 2 CD133 as a biomarker

Origin and function | Expression in normal SCs | Expression in adult tissues/cells | Expression in CSCs
Marker for HSCs and neuroepithelial cells. | hESCs, hematopoietic, neural | Epithelial cells (mammary and salivary glands, testis, digestive tract, trachea, placenta) and photoreceptor cells | Breast, colorectal, gastric, GBM, hepatic, lung, ovarian, pancreatic, prostate

HSCs: hematopoietic stem cells; hESCs: human embryonic stem cells; SCs: stem cells; CSCs: cancer stem cells; GBM: glioblastoma

Different molecular mechanisms have been investigated to better understand the implication of CD133 in normal and cancer stem cells. Studies of both types of stem cells have suggested that CD133 may play an important role in several cellular pathways and mechanisms. There has also been progress in understanding the predictive and prognostic power of CD133 presence in solid tumors. However, the different culture conditions, animal models, and assays employed yield some conflicting results. A potential approach to overcome this is to study the gene expression profiles of cells sharing the presence of the biomarker prior to any in vitro or in vivo experimental studies, which could aid in the elucidation of the underlying biological function of the protein. By compiling transcriptional profiles of sorted CD133⁺ cells from different sources readily available in public databases, we can unveil expression signatures as an initial approach to help investigate and shed further light on the molecular mechanisms implicated by the expression of CD133.

2. 3 Gene expression profiling

Gene expression profiling refers to the simultaneous measurement of expression levels of a large number of genes, typically in multiple experiments spanning a variety of cell types, treatments, or environmental conditions. Expression profiling involves the assay of messenger RNA (mRNA) levels with microarrays or high-throughput sequencing (RNA-seq), followed by normalization and analysis of the resulting data. The majority of gene expression studies have compared the expression profiles of cases and controls in order to understand a specific disease by pinpointing genes and molecular pathways whose expression diverges between the two studied groups.

Expression profiling studies are typically employed, specifically in systems biology, for the identification of genetic variants involved in gene regulation and in mediating disease pathogenesis (Gilad et al., 2008), to identify polymorphisms related to complex diseases that could be used in the development of new therapeutic targets (Manolio & Collins, 2009), and for the identification of molecular markers that could be used as tools for disease diagnosis and prognosis (K. Kim et al., 2008).

Identification of differentially expressed (DE) genes in gene expression profiling studies is fundamental for the determination of key genes, biological pathways and gene ontologies associated with several conditions and pathologies. The derived list of genes allows further, complementary analyses to understand and interpret the meaning of the changes in gene expression. These analyses annotate genes to groups based on particular encoded protein motifs, on gene ontology (GO) categories, including molecular function, biological process and cellular component, and on a number of other sources, including Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Gene co-expression networks can also be constructed to identify similarly behaving genes and infer common functions that can later be analyzed to identify overarching biological features (van Dam et al., 2017).
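As a rough illustration of how similarly behaving genes can be linked in a co-expression network, the following minimal sketch connects two genes when the absolute Pearson correlation of their expression profiles exceeds a cutoff. The gene names other than PROM1, the expression values, and the 0.8 threshold are hypothetical and chosen only for illustration; the use of Pearson correlation is an assumption, not necessarily the measure used in this work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    genes = sorted(expression)
    edges = []
    for i, gene1 in enumerate(genes):
        for gene2 in genes[i + 1:]:
            r = pearson(expression[gene1], expression[gene2])
            if abs(r) >= threshold:
                edges.append((gene1, gene2, r))
    return edges


# Hypothetical expression values across four samples (illustrative only).
toy_expression = {
    "PROM1": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.1, 5.9, 8.0],   # rises with PROM1
    "GENE_B": [4.0, 3.0, 2.0, 1.0],   # falls as PROM1 rises
    "GENE_C": [1.0, 3.0, 1.0, 3.0],   # weakly related to the others
}
edges = coexpression_edges(toy_expression)
for gene1, gene2, r in edges:
    print(f"{gene1} - {gene2}: r = {r:.3f}")
```

In this toy example GENE_C is left out of the network because its correlation with every other gene falls below the cutoff, while positively and negatively correlated pairs are both retained as edges.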

These studies are often combined with additional information and complementary analyses. Still, they suffer from several limitations because of low reproducibility and variance between the platforms used. Nevertheless, they can establish a proof of principle and provide the basis for later confirmatory studies. Moreover, robustness can be increased by integrating multiple gene profiling datasets that share a common denominator.

2. 4 Microarray technologies

Microarray technologies have been widely used in the biomedical science field. These methods have made it possible to make sense out of gene expression patterns to study multiple physiological and pathological cellular states.

The basis of array technologies is the use of specific single-stranded deoxyribonucleic acid (ssDNA) sequences to probe for their complementary sequences, forming hybrids. The aim of microarray technologies is to detect and measure the expression levels of thousands of genes simultaneously in the same experiment, for the identification of potential biomarkers for a specific biological sample. There are mainly two types of DNA arrays, depending on the type of probe that is spotted. One type uses single-stranded oligonucleotides and the other uses complementary DNA (cDNA) (Trevino et al., 2007). Briefly, the general process of DNA arrays consists of converting the ribonucleic acid (RNA) extracted from the tissue or cells to labeled cDNA or mRNA, with a subsequent hybridization to the chip. If a particular transcript present in the original RNA contains a sequence that matches one on the chip, the labeled mRNA or cDNA will hybridize to that fragment or oligo. The position of the attached fluorescent dyes identifies the transcript sequence, and the fluorescence intensity is proportional to its abundance. Image analysis and data processing begin by transforming the image of each spot into a numerical reading. This is achieved through the identification of each spot, a subsequent integration of intensities in the defined spot, and a final estimation and subtraction of the surrounding background noise. The result is an integer value assumed to be proportional to the abundance of the target sequence of that specific probe. At the end of the experiment comes the statistical analysis, which assesses the real significance of the results and must account for the large number of tests performed (Trevino et al., 2007).
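Because thousands of probes are tested in the same experiment, raw p-values are usually adjusted for multiple testing. The sketch below implements the Benjamini-Hochberg false discovery rate adjustment as one common choice; the source does not state which correction was applied, so both the method and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four probes (illustrative only).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adj_p):
    print(f"raw p = {raw:.2f} -> adjusted p = {adj:.4f}")
```

Each adjusted value is at least as large as the raw p-value, so a probe that looks significant on its own can lose significance once the number of simultaneous tests is taken into account.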

The amount of published and publicly available transcriptional profile or gene expression data enables integrated analysis of several studies in order to detect generalities and particularities of gene expression and biological functions in certain conditions. With the advent of public repositories for microarray data, such as the Gene Expression Omnibus (GEO) (Clough & Barrett, 2016) developed by the National Center for Biotechnology Information (NCBI), many experiment results are now available for independent evaluation and re-analysis. The GEO repository is an international public database for microarray experiments, next-generation sequencing (NGS) and other forms of high-throughput functional genomic data (Barrett et al., 2013). The data available in GEO represents original research deposited by the scientific community and in compliance with the Minimum Information About a Microarray Experiment (MIAME) guidelines (Brazma et al., 2001) for the description of microarray experiments.

2. 5 High-throughput sequencing

In recent years, high-throughput sequencing technologies, often referred to as next-generation sequencing (NGS), have become a very attractive and competitive alternative to microarray technologies.

RNA sequencing (RNA-seq) has become the gold standard for whole-transcriptome high-throughput data generation since its introduction in 2008 (Mortazavi et al., 2008). RNA-seq data have been shown to be highly reproducible, with few systematic differences among replicates (Marioni et al., 2008), and RNA-seq is now regarded as a superior tool for measuring mRNA compared with array-based approaches.

Compared to microarray technology, RNA-seq has several distinct advantages: it (1) produces data with low background noise, allowing transcript detection at low expression levels, (2) can detect novel transcripts, allele-specific expression and alternative splicing isoforms, and (3) does not require prior probe selection, thus avoiding biases (Wilhelm et al., 2008; Shahjaman et al., 2020).

The primary interest of many RNA-seq experiments is differential gene expression analysis between groups. In general, this analysis follows a series of five steps. First, the RNA in the samples is converted into cDNA fragments, which are sequenced on a high-throughput platform. Second, the generated sequences are mapped to a reference genome or transcriptome. Third, the number of reads mapped to each gene is counted based on the outcome of the alignment. Fourth, with this estimate of expression levels, statistical methods are applied to test the significance of differences between groups. Finally, the DE genes are identified and their relevance is evaluated (Costa-Silva et al., 2017; Z. H. Zhang et al., 2014). As the popularity and applications of RNA-seq have increased in recent years, several software tools have been developed for differential gene expression analysis of the resulting data.
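The counting step described above can be sketched as follows, assuming that an upstream aligner has already assigned each read to a gene (or to no gene). The read assignments are hypothetical, and the counts-per-million scaling is only one simple way of making libraries of different depth comparable; it is not necessarily the normalization used in this work.

```python
from collections import Counter


def count_reads(assignments):
    """Count reads per gene, ignoring reads not assigned to any gene."""
    return Counter(gene for gene in assignments if gene is not None)


def counts_per_million(counts):
    """Scale raw counts so that each library sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no assigned reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical gene assignment for 11 mapped reads; None marks a read
# that did not overlap any annotated gene.
assigned_reads = ["PROM1"] * 3 + ["ACTB"] * 6 + [None] + ["GAPDH"]

raw_counts = count_reads(assigned_reads)
cpm = counts_per_million(raw_counts)
for gene in sorted(cpm):
    print(f"{gene}: {raw_counts[gene]} reads, {cpm[gene]:.0f} CPM")
```

The scaled values, rather than the raw counts, are what would then be passed to the statistical testing step when samples differ in sequencing depth.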

2.6 Differential gene expression analysis

One of the most essential types of analysis in gene expression profiling is the identification of genes whose expression is downregulated or upregulated in comparison with another group of samples. This allows gene expression to be related to physiology according to the direction of change between two or more groups of samples. Such genes are termed differentially expressed and are detected through several statistical approaches. The goal of differential gene expression analysis is to identify genes with significant changes in expression levels depending on the type of sample or experimental conditions. The most common approach, and the first method used to identify differentially expressed genes, is the fold change in fluorescence intensity (Mutch et al., 2001). The fold change is calculated by dividing the two measured intensities to obtain a ratio, which is then transformed to logarithm base 2 (log2).

On this scale, a log2 ratio of zero means that the expression level has not changed, and the symmetry of the data distribution is improved, whereas a two-fold change in gene expression corresponds to a log2 ratio of +1 (upregulation) or -1 (downregulation).
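The symmetry of the log2 ratio can be illustrated with a minimal Python sketch. The function name and the intensity values are hypothetical and serve only as an example; they do not come from the analyzed data sets.

```python
import math


def log2_fold_change(case_intensity: float, control_intensity: float) -> float:
    """Return log2(case / control) for two positive intensities."""
    if case_intensity <= 0 or control_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(case_intensity / control_intensity)


# Hypothetical intensities, chosen only to illustrate the symmetry.
unchanged = log2_fold_change(500.0, 500.0)  # ratio 1
doubled = log2_fold_change(1000.0, 500.0)   # ratio 2
halved = log2_fold_change(250.0, 500.0)     # ratio 0.5

print(unchanged, doubled, halved)  # 0.0 1.0 -1.0
```

A doubling and a halving are equally distant from zero on the log2 scale, whereas the raw ratios (2 and 0.5) are not equally distant from 1.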

Fold change is widely used to select differentially expressed genes because of its simplicity: larger fold changes are taken to indicate greater biological significance. Nonetheless, this is not always the case, because the fold-change method does not take into account the variance of the measured expression values. Therefore, this method should only be used in combination with other statistical methods (Tarca et al., 2006).
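The limitation of fold change can be sketched with two hypothetical genes that share the same log2 fold change but differ in variance. The values below are invented log2 expression values, and the Welch t-statistic is used here only as one simple example of a variance-aware statistic, not as the method applied in this work.

```python
import math
import statistics


def log2_fold_change(case, control):
    """Difference of mean log2 expression (values are already on the log2 scale)."""
    return statistics.mean(case) - statistics.mean(control)


def welch_t(case, control):
    """Welch t-statistic: mean difference scaled by its standard error."""
    se = math.sqrt(
        statistics.variance(case) / len(case)
        + statistics.variance(control) / len(control)
    )
    return (statistics.mean(case) - statistics.mean(control)) / se


# Hypothetical log2 expression values for two genes.
stable_case, stable_control = [8.9, 9.0, 9.1], [7.9, 8.0, 8.1]
noisy_case, noisy_control = [7.0, 9.0, 11.0], [6.0, 8.0, 10.0]

fc_stable = log2_fold_change(stable_case, stable_control)
fc_noisy = log2_fold_change(noisy_case, noisy_control)
t_stable = welch_t(stable_case, stable_control)
t_noisy = welch_t(noisy_case, noisy_control)

print(round(fc_stable, 3), round(fc_noisy, 3))  # same fold change
print(round(t_stable, 2), round(t_noisy, 2))    # very different t-statistics
```

Both genes show a two-fold change, but only the low-variance gene yields a large test statistic, which is why fold change alone is insufficient.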

The complex nature of gene expression gives rise to challenging statistical problems that require a number of specialized statistical techniques. To address this, the limma software has been developed over the past decade (Ritchie et al., 2015) to analyze gene expression experiments in a flexible and statistically rigorous way. The limma package is a core component of Bioconductor (Gentleman et al., 2004), an R-based open-source software development project in statistical genomics. Limma supports the analysis of gene expression data arising from microarray or RNA-sequencing technologies, and the same analysis pipeline can be used for data from all technologies after initial pre-processing and normalization of the raw data. Moreover, differential expression (Phipson et al., 2016) and differential splicing analyses of RNA-sequencing data (SEQC/MAQC-III Consortium, 2014) can be performed.

The hallmark of the limma approach is the use of linear models to analyze entire experiments as an integrated whole rather than as piecemeal comparisons between pairs of treatments. This sharing of information between samples makes it possible to model correlations that may exist between samples due to repeated measures or other causes. Linear models permit general analyses and allow adjustment for the effects of multiple experimental factors or for batch effects. The linear model may even include expression values of one or more genes as covariates, which allows testing for inter-gene dependencies. Flexible hypotheses can therefore be tested: not only simple comparisons between groups but also interaction effects or more complex customized comparisons (Ritchie et al., 2015).
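The idea of adjusting for a batch effect within a linear model can be sketched for a single gene as follows. This is a plain ordinary least-squares fit written in Python for illustration; it is not limma itself, which is an R package and additionally moderates the gene-wise variances. The design matrix, the expression values and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(design, y):
    """Least-squares coefficients obtained from the normal equations."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Columns: intercept, group (1 = sorted positive cells), batch (1 = second study).
design = [
    [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0],
    [1, 0, 1], [1, 0, 1], [1, 1, 1], [1, 1, 1],
]
# Hypothetical noise-free log2 expression: baseline 5, group effect 2, batch effect 1.
y = [5.0, 5.0, 7.0, 7.0, 6.0, 6.0, 8.0, 8.0]

intercept, group_effect, batch_effect = fit_least_squares(design, y)
print(round(intercept, 6), round(group_effect, 6), round(batch_effect, 6))
```

Because the batch column is part of the model, the group coefficient estimates the difference between groups after the shift between studies has been accounted for.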

For RNA-seq differential gene expression analysis, there is a variety of software tools to choose from. To date, there is no consensus about which methodology ensures the validity of the results in terms of precision, robustness and reproducibility. The main mapping tools used in differential gene expression analysis of RNA-seq data include Bowtie2 (Langmead & Salzberg, 2012), TopHat2 (D. Kim et al., 2013), BWA (H. Li & Durbin, 2009), and STAR (Dobin et al., 2013). For the differential expression analysis itself, the state-of-the-art tools include DESeq2 (Love et al., 2014), baySeq (Hardcastle & Kelly, 2010), EBSeq (Leng et al., 2013), edgeR (Robinson et al., 2010), limma+voom (Law et al., 2014), NOISeq (Tarazona et al., 2015) and SAMSeq (J. Li & Tibshirani, 2013). Of these, DESeq2, NOISeq, and limma+voom have been shown to be the most balanced tools in terms of precision, accuracy and sensitivity (Costa-Silva et al., 2017).

2.7 Gene co-expression analysis

With the recent advances in transcriptomics and NGS, and in their respective methods of analysis, the functional status of unknown genes can now be identified from a systematic perspective (van Dam et al., 2015). Co-expression network analysis is a method to infer gene function from genome-wide gene expression data. This approach constructs networks of genes with a tendency to co-activate across a group of samples and then analyzes the resulting network (van Dam et al., 2017).

One of the main approaches for the assignment of biological functions to specific genes, with the use of microarrays, is the differential approach method, where two or more sets of sample results from different conditions are compared and genes are identified by their differential expression levels. Co-expression analyses may combine many transcriptomic profiles from various tissues and different conditions, in order to group genes that show correlated expression patterns, further implying that those genes are involved in the same biological processes.

Co-expression analyses can be performed with gene expression data from microarray and RNA-seq technologies. To identify genes with a coordinated expression pattern and then perform downstream analyses, the methodology requires three main steps. First, individual relationships between genes or with a specific gene of interest are defined based on correlation measures, showing the similarity in expression patterns across samples. The correlation measures that are usually implemented include Pearson’s or Spearman’s coefficient (Ala et al., 2008), least absolute error regression (van Someren et al., 2006), or a Bayesian approach (Friedman et al., 2000). Second, these relationships are used to construct a network representing the connections between genes and the strength of these co-expression associations. Finally, the network is used to identify groups of genes through clustering techniques, which are subsequently interpreted by functional enrichment analyses in order to pinpoint overrepresented functional categories in the lists of genes.
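The first two steps described above, followed by a very simple grouping, can be sketched in Python. The gene labels, expression values and the correlation threshold of 0.9 are hypothetical, and grouping genes by connected components is only a stand-in for the clustering techniques used in practice.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Step 1 and 2: keep gene pairs whose |r| reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


def connected_modules(genes, edges):
    """Step 3 (simplified): group genes into connected components."""
    neighbors = {g: set() for g in genes}
    for g1, g2 in edges:
        neighbors[g1].add(g2)
        neighbors[g2].add(g1)
    seen, modules = set(), []
    for gene in sorted(genes):
        if gene in seen:
            continue
        stack, module = [gene], set()
        while stack:
            current = stack.pop()
            if current not in module:
                module.add(current)
                stack.extend(neighbors[current] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


# Hypothetical expression profiles across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.1, 6.0, 8.2, 10.0],
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
    "GENE_D": [3.0, 5.0, 1.0, 4.0, 2.0],
}
edges = coexpression_edges(expression)
modules = connected_modules(expression, edges)
print(list(edges), modules)
```

In this example only GENE_A and GENE_B are strongly correlated, so they form one module, while the remaining genes stay unconnected.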

These tools help classify human genes according to their co-expression levels with other genes of interest. Moreover, when analyzing heterogeneous sample groups from different conditions, differential co-expression analyses must be performed to reveal genes that have different co-expression partners between homeostatic and pathologic states and that can therefore explain differences between groups (van Dam et al., 2017).

2.8 Gene set analysis

After defining a set of genes of interest that were considered differentially expressed, the next step in the analysis is to find functional relationships among those genes that might help elucidate the underlying biology. The methods commonly used for this purpose are referred to as gene-set analysis (GSA) or pathway analysis. In GSA, gene sets are defined by a reference knowledge database, which aggregates genes based on their biological or functional properties. Knowledge databases contain collections of molecular knowledge, which usually include molecular interactions, regulation, molecular products and functional associations. The gene sets they provide can be compared against the differentially expressed genes.

There are a number of these databases, including Gene Ontology (GO) (Ashburner et al., 2000), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2014), MetaCyc (Caspi et al., 2014), and Molecular Signatures Database (MSigDB) (Subramanian et al., 2005). MSigDB is a compendium of numerous databases, with a total of 22596 gene sets divided into 8 major collections and several sub-collections. The eight collections are positional (c1, 299 sets), literature curated (c2, 5501 sets), motif (c3, 831 sets), computational (c4, 858 sets), GO (c5, 9996 sets), oncogenic (c6, 189 sets), immunologic (c7, 4872 sets), and hallmark (h, 50 sets), as of the latest update to the collections, MSigDB 7.0, released in August 2019. MSigDB is thus one of the most widely used and comprehensive databases of gene sets for gene set enrichment methods.

Pathway-based analysis methods have significantly advanced the capacity to explore large-scale data, with the potential to provide a comprehensive understanding of the underlying molecular mechanisms. Pathway analyses can be classified into overrepresentation analysis (ORA), functional class scoring (FCS), and pathway topology-based approaches (PTA). ORA corresponds to the first generation of pathway analyses: DE genes are first determined individually, and only the resulting list of gene names is taken into account. The second-generation FCS, on the other hand, does not determine DE genes but uses the fold-change values of the individual genes. The third-generation PTA also considers pathway structures in addition to fold change.

ORA, also called functional enrichment analysis, is the earliest pathway-based approach to identify overrepresented pathways within a predetermined list of genes. It is based on hypergeometric testing (HGT). Consider an urn containing one ball for each gene in the universe, where the interesting genes are colored black and all others white. Under the null hypothesis that there is no relationship between being interesting and belonging to a given GO category, the number of interesting genes in the category follows a hypergeometric distribution. If the GO category contains k genes, s of which are interesting, the probability of seeing s or more black balls in k draws without replacement from the urn is computed (Falcon & Gentleman, 2008). The selection of the universe has an important impact on the observed p-values; therefore, it should include only those genes that could have been selected as interesting, that is, the genes represented by probes on the arrays explored. Furthermore, hypergeometric testing is a competitive method: it tests whether the pathway of interest contains more significant genes, relative to the genes outside the pathway, than expected by chance (Evangelou et al., 2012).

Some issues naturally arise, such as multiple testing, where thousands of pathways may be individually tested for enrichment, so that significant enrichment p-values may appear by chance alone. Therefore, a multiple testing correction should always be performed to control the false positives obtained from large-scale data analysis; the most commonly used method is the False Discovery Rate (Benjamini et al., 2001). ORA is the most widely used functional analysis method because it is easily performed, but it assumes an equal weight for all the genes included in a gene set.

FCS approaches cover a range of additional methods that extend ORA, in which each individual gene is ranked with a p-value based on its importance. The earliest application of FCS, or gene-set based scoring, is gene set enrichment analysis (GSEA). This method is based on the idea that if a set of functionally related genes is correlated with a phenotype, the members of the set tend to concentrate in a particular region of the gene list ranked by differential expression between the sample classes (Subramanian et al., 2005). GSEA thus associates significant genes with functional categories; however, the hypergeometric test is a commonly used competitive test that typically produces more rigorous results.
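The urn model and the subsequent multiple testing correction can be sketched in Python as follows. The universe size, the gene-set labels and their counts are hypothetical. The correction shown is the Benjamini-Hochberg step-up procedure, a common way of controlling the False Discovery Rate; whether it matches the exact variant of the cited reference is an assumption.

```python
from math import comb


def hypergeom_upper_tail(universe_size, n_interesting, set_size, overlap):
    """P(X >= overlap): probability of at least `overlap` black balls in
    `set_size` draws without replacement from an urn with `universe_size`
    balls, `n_interesting` of which are black."""
    total = comb(universe_size, set_size)
    upper = min(set_size, n_interesting)
    favorable = sum(
        comb(n_interesting, i) * comb(universe_size - n_interesting, set_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical universe of 20 genes, 5 of them interesting (black balls).
universe_size, n_interesting = 20, 5
# Hypothetical gene sets: (number of genes k, interesting genes s).
gene_sets = {"SET_A": (4, 3), "SET_B": (6, 1), "SET_C": (5, 2)}

p_values = {
    name: hypergeom_upper_tail(universe_size, n_interesting, k, s)
    for name, (k, s) in gene_sets.items()
}
adjusted = dict(zip(p_values, benjamini_hochberg(list(p_values.values()))))

for name in gene_sets:
    print(name, round(p_values[name], 4), round(adjusted[name], 4))
```

For SET_A the raw p-value is 155/4845, about 0.032, and the adjusted values are never smaller than the raw ones, which reflects the penalty for testing several gene sets at once.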

2.9 Meta-analysis studies

Meta-analysis is a study design that integrates the experimental data of several independent studies and plays a central role in the biomedical field. Previous research studies are systematically assessed in order to derive conclusions about a certain topic; the outcome of a meta-analysis may therefore include a more precise estimate of the effect under comparison than any individual study contributing to the analysis. Meta-analyses are typically conducted to evaluate the strength of the evidence on a certain disease and treatment and to obtain an estimate of the effect. Besides improving the precision of the effect estimate, meta-analysis results can also answer questions not originally raised by the individual studies, resolve controversies arising from conflicting studies, and generate new hypotheses (Haidich, 2010). Moreover, in terms of citation impact, meta-analyses receive more citations than any other type of study design (Patsopoulos et al., 2005).

The studies for a meta-analysis are chosen based on inclusion criteria, which are ideally defined at the beginning of the study protocol. It is not possible to find every relevant study for the analysis, because some studies might not have been published. In that case, useful sources of unpublished data include public repositories for raw experimental data, such as GEO (NCBI). Not even meta-analytic studies provide a definitive understanding of the underlying biological factors; however, this approach has proven to be more valuable than any single study contributing to the analysis. It can provide the basis for more confirmatory studies, narrowing the research focus.

2.10 Background

2.10.1 Reported mechanisms and functions for prominin-1/CD133

The exact physiological function of CD133 is not known, but its prevalence of expression in various tissues indicates its importance. As mentioned before, CD133 is found in plasma membrane protrusions with cholesterol-rich microdomains (Corbeil et al., 2001; Weigmann et al., 1997) and is a biomarker used to identify stem cells in both homeostatic and cancerous conditions.

CD133 expression is detected in 22 out of 44 human tissue types, with membranous and cytoplasmic localization (http://www.proteinatlas.org); detection is derived from antibody-based protein profiling using immunohistochemistry. Moreover, antibody staining is consistent with mRNA expression levels, and CD133 shows tissue specificity or enhancement for retina and salivary gland tissues (http://www.proteinatlas.org).

The interaction of CD133 with membrane cholesterol and its presence in cholesterol-rich membrane microdomains (Corbeil et al., 2001; Röper et al., 2000) has been proposed to be necessary for maintaining stem cell properties (Karbanová et al., 2017). The widespread expression of CD133 within lipid complexes in cellular protrusions has been associated with a role in the regulation of cellular processes related to remodeling of the plasma membrane, particularly proliferation, differentiation, cell migration, autophagy and carcinogenesis (Thamm et al., 2019). The mechanism by which CD133 is involved in photoreceptor disc morphogenesis and synthesis is not yet clear, but it appears to be related to the presence of sufficient amounts of cholesterol (Zacchigna et al., 2009).

Moreover, CD133 has been proposed as a key regulator of the appropriate response of stem cells to extracellular signaling, development, regeneration and pathological processes (Singer et al., 2019). The control of cellular differentiation by CD133 is supported by the exosome-mediated release of small membrane vesicles containing the molecule during stem cell differentiation. This supports the concept that CD133 is necessary to maintain stem cell properties and that its loss results in cell differentiation. Moreover, this release of CD133-containing membrane vesicles has been shown to have an additional function in intercellular communication (Bauer et al., 2011; Marzesco et al., 2005). The role of this endocytic–exocytic pathway in the release of CD133, which occurs concomitantly with cellular differentiation and intercellular communication, has also been associated with other signaling pathways.

The Wnt/β-catenin signaling pathway plays an important role in morphogenesis, embryogenesis and proliferation. Recent studies have shown that the Wnt/β-catenin pathway plays a role in the development of glioblastomas (Denysenko et al., 2016; Jiang et al., 2017) and in increasing the stemness characteristics of CD133⁺ liver cancer SCs in hepatocellular carcinoma (HCC) (R. Wang et al., 2016); however, its physiological role is not completely understood. Moreover, CD133 has been reported to interact with histone deacetylase 6 (HDAC6) and β-catenin (Mak et al., 2012). This association stabilizes β-catenin via HDAC6 activity, which leads to activation of the signaling pathway and prevention of CSC differentiation. Conversely, downregulation of CD133 results in increased β-catenin acetylation and degradation, and consequently in decreased proliferation. Therefore, CD133 can accelerate cancer cell growth by activating this pathway, and targeting CD133 could be a potential approach to prevent cancer cell proliferation.

The PI3K-Akt pathway involves a series of kinase activations that result in signal transduction cascades with subsequent effects on cell metabolism, growth, proliferation and survival (Hemmings & Restuccia, 2012). Recent findings implicate CD133 as an activator of this pathway, owing to the higher level of phosphorylated Akt in CD133⁺ cancer cells, both in glioma stem cells (Wei et al., 2013) and in HCC stem cells (S. Ma et al., 2008). In CD133⁺ cancer stem cells, Src kinase phosphorylates tyrosine-828 (Y828) in the C-terminal cytoplasmic domain of CD133. The phosphorylated Y828 residue then interacts with p85, leading to activation of the p110 catalytic subunit of PI3K. PI3K then converts phosphatidylinositol 4,5-bisphosphate (PIP2) into phosphatidylinositol 3,4,5-trisphosphate (PIP3). PIP3 in turn leads to the activation of Akt (also named protein kinase B), resulting in increased self-renewal of glioma stem cells (GSCs). Inhibition of this pathway was also demonstrated following CD133 knockdown, with a consequent reduction in the self-renewal and tumorigenicity of GSCs.

2.10.2 Association of cancer stem cells with CD133

Cancer stem cells (CSCs) are a subpopulation of cells within the heterogeneous tumor bulk, with capabilities reminiscent of their homeostatic analogues, such as self-renewal and differentiation. This gives CSCs the ability to maintain the tumor’s proliferation, and they are also considered responsible for metastatic spreading and chemoresistance (Chen et al., 2012). CSCs were identified for the first time in 1997 in acute myeloid leukemia (AML) (Bonnet & Dick, 1997), and since then they have been proposed to be the tumor-initiating cells responsible for disease recurrence.

Moreover, CSCs have been identified in diverse solid tumors, including breast (Al-Hajj et al., 2003), brain (Hemmati et al., 2003; Singh et al., 2004), thyroid (Todaro et al., 2010), melanoma (Boiko et al., 2010), colon (Todaro et al., 2007), liver (S. Ma et al., 2007), prostate (Collins et al., 2005), lung (Ho et al., 2007), ovarian (Hu et al., 2010), and gastric (Fukuda et al., 2009), and are capable of tumor initiation when transplanted into non-obese diabetic, severe combined immunodeficiency (NOD-SCID) mice. In addition, several biomarkers have been postulated to distinguish CSC populations from various tumor types. Among them, CD133 is a well-known CSC marker used for the identification and isolation of this subpopulation of cells from various tumor types, including hepatocellular, colorectal, lung and prostate tumors and glioblastomas. On that account, CD133 has become a reasonable target for immunotherapy, the new frontier of cancer treatment.

Chimeric antigen receptor-modified T-cell (CART) immunotherapy offers the opportunity to specifically eliminate the CSC subpopulation. A recent case report showed the efficacy of an infusion of anti-CD133 CART cells in a patient with cholangiocarcinoma (Feng et al., 2017), suggesting that it is a feasible treatment for other solid tumors. However, endothelial toxicities were also induced, probably due to an off-tumor effect on the CD133 antigen expressed in normal cells. Moreover, anti-CD133 CAR vector-transduced T cells are currently in clinical trials (NCT02541370) for the treatment of patients with relapsed and/or chemotherapy-refractory advanced malignancies.

Although CD133 has been reported to be expressed in normal and cancer stem cells and to be downregulated upon differentiation, which would indicate that its expression is restricted to primitive stem cells, it has been shown that CD133 expression does not change upon differentiation (Kemper et al., 2010). It appears that CD133 is expressed on both CSCs and differentiated tumor cells, but that tertiary conformational changes in the protein resulting from glycosylation mask specific epitopes and block the binding of antibodies, especially AC133. Thus, antibodies can be used to detect stem cells, but the results should be interpreted carefully and examined in greater depth.

In spite of the fact that the physiological role of CD133 is not entirely understood, the previously reported implications of the protein, its association with cell proliferation signaling pathways, and its high expression in CSCs, make it a good candidate for targeting. Nevertheless, more studies need to be performed to better elucidate the biological function of CD133 prior to reaching clinical stages. Therefore, a systematic assessment of previous research studies involving CD133 with a meta-analysis could provide a more precise estimate of the biological implications of the biomarker.

3. METHODS

3.1 Acquisition of data from gene expression profiles

A manually curated search of the GEO DataSets database was performed in June 2019 and updated in December 2019 to identify published gene expression data sets from microarray and RNA-seq experiments using cells sorted via fluorescence-activated cell sorting (FACS) or magnetic-activated cell sorting (MACS) with the CD133/1 antibody. These data sets were combined into matrices, where each column represented a sample and each row a gene. The combined matrix thus contained the expression values for the sorted cells and their respective control groups.
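The layout of such a combined matrix, with one row per gene and one column per sample, can be sketched in Python. The study contents, sample labels and gene labels are hypothetical, and keeping only the genes shared by all data sets is an assumption made for this illustration rather than a description of the exact procedure followed.

```python
def combine_datasets(datasets):
    """Combine per-study expression tables into one gene-by-sample matrix.

    Each data set is a dict with a list of sample names under "samples" and,
    under "values", a mapping from gene label to one value per sample.
    Only genes present in every data set are kept (illustrative assumption).
    """
    if not datasets:
        raise ValueError("at least one data set is required")
    common_genes = sorted(set.intersection(*(set(d["values"]) for d in datasets)))
    samples = [name for d in datasets for name in d["samples"]]
    matrix = [
        [value for d in datasets for value in d["values"][gene]]
        for gene in common_genes
    ]
    return common_genes, samples, matrix


# Hypothetical studies, each with one positive and one negative sample.
study_a = {
    "samples": ["A_pos_1", "A_neg_1"],
    "values": {"GENE_1": [8.1, 6.0], "GENE_2": [5.5, 5.4], "GENE_3": [7.0, 7.2]},
}
study_b = {
    "samples": ["B_pos_1", "B_neg_1"],
    "values": {"GENE_1": [9.0, 6.5], "GENE_3": [6.8, 7.1]},
}

genes, samples, matrix = combine_datasets([study_a, study_b])
print(samples)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Each row of the result holds the values of one gene across all samples of all studies, which is the shape required by the downstream differential expression analysis.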

All publicly available series of data sets meeting the following inclusion criteria were included in the analysis: (1) the experiment type must be expression profiling by array (microarray-based) or by high throughput sequencing (RNA-seq); (2) the experiment must describe sorting cells by FACS or MACS with the CD133/1 antibody; (3) the title and characteristics of the study were searched for keywords such as “CD133”, “stem cells” or “hESCs”, “RNA”; (4) the sample organism must be “Homo sapiens”; (5) the transcriptome data should include only protein coding genes; and (6) array platforms must be from Affymetrix or Illumina. The methodological summaries of the microarray and RNA-seq experiments considered are shown in Tables 3.1 and 3.2, respectively. The initial query returned 22 microarray data sets and 8 RNA-seq data sets, which were then manually curated to include only those experiments with a sufficient number of samples, a corresponding control group, and compliance with all inclusion criteria, yielding 9 microarray data sets and 5 RNA-seq data sets for the subsequent analysis.
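Part of this screening can be expressed as a simple filter, sketched below in Python. The record structure, field names and example entries are hypothetical, and only criteria (1), (2) and (4) plus the control-group requirement are encoded; the actual curation was performed manually.

```python
from dataclasses import dataclass

# Experiment types accepted under criterion (1).
ALLOWED_TYPES = {
    "Expression profiling by array",
    "Expression profiling by high throughput sequencing",
}
# Sorting methods accepted under criterion (2).
ALLOWED_SORTING = {"FACS", "MACS"}


@dataclass(frozen=True)
class SeriesRecord:
    """Hypothetical summary of one candidate series."""
    accession: str
    experiment_type: str
    organism: str
    sorting_method: str
    antibody: str
    has_control_group: bool


def meets_inclusion_criteria(record: SeriesRecord) -> bool:
    """Check criteria (1), (2) and (4) and the presence of a control group."""
    return (
        record.experiment_type in ALLOWED_TYPES
        and record.sorting_method in ALLOWED_SORTING
        and record.antibody == "CD133/1"
        and record.organism == "Homo sapiens"
        and record.has_control_group
    )


# Invented candidates; the accessions are placeholders, not real GEO series.
candidates = [
    SeriesRecord("EXAMPLE_1", "Expression profiling by array",
                 "Homo sapiens", "FACS", "CD133/1", True),
    SeriesRecord("EXAMPLE_2", "Expression profiling by high throughput sequencing",
                 "Mus musculus", "MACS", "CD133/1", True),
    SeriesRecord("EXAMPLE_3", "Expression profiling by array",
                 "Homo sapiens", "MACS", "CD133/1", False),
]

selected = [r.accession for r in candidates if meets_inclusion_criteria(r)]
print(selected)  # ['EXAMPLE_1']
```

The second candidate is excluded because of the organism and the third because it lacks a control group.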

Table 3.1 Methodological summaries of the analyzed microarray experiments

| GEO accession | Year | CD133⁺ cells | CD133⁺ samples | CD133⁻ cells (control) | CD133⁻ samples | Tissue/Condition | Platform |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSE25979 | 2012 | Non-adherent EPCs | 3 | HUVECs | 4 | Umbilical cord blood | Affymetrix Human Exon 1.0 ST Array |
| GSE25979 | 2012 | Non-adherent EPCs | 3 | HUVECs | 4 | Umbilical cord blood | Affymetrix Human Exon 1.0 ST Array |
| GSE34152 | 2013 | GBM | 2 | Negative fraction | 2 | Glioblastoma | Affymetrix Human Genome U133 Plus 2.0 Array |
| GSE7181 | 2007 | GBM neurosphere-like cells | 3 | Adherent GBM from neg fraction | 3 | Glioblastoma | Affymetrix Human Genome U133 Plus 2.0 Array |
| GSE34152 | 2013 | NSCs | 2 | Negative fraction | 2 | Normal neural cells | |
| GSE62600 | 2014 | ESC-derived neurosphere NSCs | 3 | NPCs | 3 | ESC-derived neurospheres | Affymetrix Human Genome U133A Array |
| GSE16694 | 2009 | Cord blood progenitor cells | 2 | ESCs | 2 | Umbilical cord blood | |
| GSE90628 | 2017 | Human infant kidney-derived cells | 2 | Negative fraction | 2 | Kidney healthy donor | Affymetrix Human Gene 2.0 ST Array |
| GSE24759 | 2010 | HSCs | 10 | CD34⁺ fraction | 4 | Umbilical cord blood | Affymetrix GeneChip HT-HG_U133A Early Access Array |

Table 3.2 Methodological summaries of the analyzed RNA-seq experiments

| GEO accession | Year | CD133⁺ cells | CD133⁺ samples | CD133⁻ cells (control) | CD133⁻ samples | Tissue/Condition | Platform |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSE62905 | 2014 | Liver cancer stem cells | 2 | Negative non-cancer fraction | 2 | HCC cell lines Huh7 and PLC8024 | Illumina Genome Analyzer IIx |
| GSE86237 | 2016 | Positive cells from intracranial xenografts | 19 | Primary tumor (bulk) | 7 | Intracranial xenografts of patient derived GBM stem cells | Illumina HiSeq 2500 |
| GSE72202 | 2015 | Sorted cells from the tumor | 2 | Normal, non-neoplastic cells | 2 | GBM | Illumina HiSeq 2500 |
| GSE99385 | 2017 | U87MG cells with ectopic expression of CD133 | 3 | Not transfected | 3 | GBM | Complete Genomics |
| GSE85297 | 2016 | Sorted primary tumorsphere cells | 4 | Negative fraction | 4 | GBM | Illumina HiSeq 2000 |

3.2 Preprocessing and integration of raw data

3.2.1 Microarray experiments

The annotations of the individual samples from the series experiments were taken into account and only samples with a description including isolation of CD133 positive cells by FACS or MACS were kept. Samples describing the negative fraction for CD133 isolated cells or cells not expressing the surface marker were kept as control. Each sample was manually classified according to the cell type and expression of CD133.

After downloading the raw intensity files of the chosen samples from each microarray experiment GEO Series and their corresponding platform annotation file (GPL), data matrices were generated including the expression values for the samples, the ID_REF number, and the gene name or symbol for each probe. Some experiments already provided values with normalized signal intensity, otherwise
