

Three different practical classification data sets will be examined in this thesis, namely the colon cancer, leukemia and SRBCT data sets. All of these are wide microarray data sets, with the number of genes, i.e. the number of predictor variables, exceeding the number of observations (𝑝 ≫ 𝑁). The first two data sets contain binary classification data and the third represents a multi-class classification problem. Details on the data sets are given in the sections which follow.

5.3.1 Colon cancer data set

The first practical data set considered in this thesis is the colon cancer microarray data set (henceforth referred to as the colon data set), which was originally studied by Alon et al. (1999) and subsequently investigated by Guyon et al. (2002), Weston et al. (2003), and Rakotomamonjy (2003). The data set contains gene expression samples that were analysed with an Affymetrix oligonucleotide array. It should be noted that these authors apply a special pre-processing to the colon cancer data set, which entails log-transforming the input variables, standardizing the resulting values and applying a so-called “squashing function”, 𝑓(𝑥) = 𝑐 arctan(𝑥/𝑐), to diminish the effect of outliers. This thesis also applies this pre-processing to the colon cancer data set.
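The three pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, natural logarithms are used, standardisation is applied per gene with the sample standard deviation, and the constant c = 1 is an arbitrary choice, since none of these details is fixed by the description above.

```python
import math
from statistics import mean, stdev


def squash(x, c=1.0):
    """Squashing function f(x) = c * arctan(x / c).

    Roughly linear near zero and bounded by +/- c * pi / 2, so extreme
    values are pulled in. The value of c is a hypothetical choice here.
    """
    return c * math.atan(x / c)


def preprocess_gene(values, c=1.0):
    """Log-transform, standardise and squash one gene's expression values.

    Assumes strictly positive inputs, natural logarithms and
    standardisation with the sample standard deviation.
    """
    logged = [math.log(v) for v in values]
    mu, sd = mean(logged), stdev(logged)
    return [squash((v - mu) / sd, c) for v in logged]


# Toy expression values for a single gene (not taken from the colon data set).
example = preprocess_gene([1.0, 10.0, 100.0, 100000.0])
```

Because arctan is bounded, every transformed value lies strictly between −𝑐π/2 and 𝑐π/2, which is how the function limits the influence of outlying expression values.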

Colon tissues were biopsied from patients, and then tested to determine if adenocarcinoma was present in the colon tissue. The American Cancer Society (2014) defines adenocarcinoma as a type of cancerous tumour that forms in the cells that produce mucus to lubricate the inside of the colon and rectum, and is the most common type of colon cancer. The colon data set was formed using 62 samples of biopsied colon epithelial cells from potential colon cancer patients, which were reported as tumorous colon tissue if adenocarcinoma was present in the tissue samples; otherwise they were reported as normal colon tissue.

Table 5.1: Summary of the colon cancer microarray data set

Tumorous colon tissue | Normal colon tissue | Total number of colon tissue samples
40 | 22 | 62

Source: Alon et al., 1999

The colon data set consists of 𝑁 = 62 colon tissue samples and 𝑝 = 2 000 gene expression levels analysed in the colon tissue samples. The response variable, 𝑌, in the colon data set represents the normal colon tissues and tumorous colon tissue, and originally consisted of 1’s and -1’s; however, for the purpose of this thesis the response variable was transformed to be made up of 0’s and 1’s. There are therefore two population groups, i.e. 𝐺 = 2, with 22 observations in Group 1, which is the normal colon tissue represented by 0’s in 𝒚 and 40 observations in Group 2 (tumorous colon tissue represented by 1’s in 𝒚). Classification procedures will be applied to the colon data set using the gene expression levels to classify future colon tissue as being cancerous or non-cancerous. The colon cancer data set is considered a high error rate data set (Boulesteix, 2004:16).
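The recoding of the response variable can be illustrated as follows. This is a minimal sketch, not the code used in the thesis: the function name is hypothetical, the label vector is synthetic, and it is assumed that −1 marks normal tissue and +1 marks tumorous tissue in the original coding.

```python
from collections import Counter


def recode_labels(y):
    """Map labels coded as -1 / +1 to 0 / 1.

    Assumption: -1 marks normal tissue and +1 marks tumorous tissue.
    """
    mapping = {-1: 0, 1: 1}
    return [mapping[label] for label in y]


# Synthetic label vector with the class sizes reported in Table 5.1.
y_original = [-1] * 22 + [1] * 40
y = recode_labels(y_original)
counts = Counter(y)
```

A dictionary lookup is used rather than arithmetic so that any unexpected label raises a KeyError instead of being silently recoded.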

5.3.2 Leukemia data set

The second data set considered is the leukemia data set, which was introduced by Golub et al. (1999). This data set contains the gene expression levels for 𝑁 = 72 leukemia patients with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The samples of cells were obtained from bone marrow or peripheral blood of 72 individuals with leukemia, and the gene expression levels for each sample were measured using Affymetrix (one-colour) high-density oligonucleotide arrays containing probes for 6 817 human genes (Dudoit et al., 2002:80). The arrays had 7 129 locations, but only 6 817 contained probes for human genes; the rest were controls and replicates.

It should be noted that there is confusion across papers regarding the exact number of input variables as some authors quote that there are 6 817 gene-expression measurements, while others quote 7 129 gene-expression measurements. However, in this thesis the leukemia data set is used as obtained in the varbvs package in R as described below.

According to Golub et al. (1999:531) it is important to distinguish ALL from AML in order to successfully treat leukemia patients. This is the case since chemotherapy regimens for ALL generally contain corticosteroids, vincristine, methotrexate, and L-asparaginase, whereas most AML regimens rely on a backbone of daunorubicin and cytarabine. While ALL patients treated with AML therapy can achieve remissions (and vice versa), the probability of the patients being cured decreases and the patient is exposed to unwarranted toxicities resulting from application of the incorrect treatment regimen (Golub et al., 1999:531).

The data set consists of 47 ALL-leukemia samples which form Group 1 (denoted by 𝑌 = 0) and 25 cases in Group 2, which is the AML-leukemia samples (represented by 𝑌 = 1). Note that in the original leukemia data set, the ALL class is further divided into T-cell or B-cell, but this is not considered in this thesis.

The original leukemia data set undergoes a pre-processing procedure described in Dudoit et al. (2002) to find the genes with the most variability and exclude low variance genes, resulting in only 𝑝 = 3 571 genes remaining in the reduced data set. The gene expression levels in the leukemia data set were normalised before undergoing pre-processing to ensure comparability across the samples.

Dudoit et al. (2002) describe three pre-processing steps which were applied to the normalized matrix of intensity values available on the website (after pooling the 38 mRNA samples from the learning set and the 34 mRNA samples from the test set):

(b) filtering: exclusion of genes with max/min ≤ 5 or (max − min) ≤ 500, where max and min refer to the maximum and minimum intensities for a particular gene across the 72 mRNA samples;

(c) base 10 logarithmic transformation and standardisation.
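Steps (b) and (c) as listed above can be sketched in Python as follows. This is an illustration only: the function names and the dictionary layout are hypothetical, intensities are assumed to be strictly positive, and the standardisation is applied to each gene across samples, which is an assumption because the axis of standardisation is not stated above.

```python
import math
from statistics import mean, stdev


def keep_gene(intensities, ratio_cut=5.0, diff_cut=500.0):
    """Step (b): a gene is excluded if max/min <= 5 or (max - min) <= 500.

    Equivalently, it is kept only if both quantities exceed their cut-offs.
    Assumes strictly positive intensities so that the ratio is defined.
    """
    hi, lo = max(intensities), min(intensities)
    return hi / lo > ratio_cut and (hi - lo) > diff_cut


def log10_standardise(intensities):
    """Step (c): base 10 log transform, then centre and scale the values.

    Assumption: standardisation is done per gene across samples.
    """
    logged = [math.log10(v) for v in intensities]
    mu, sd = mean(logged), stdev(logged)
    return [(v - mu) / sd for v in logged]


def preprocess(genes):
    """Apply steps (b) and (c) to a mapping of gene name -> intensities."""
    return {
        name: log10_standardise(values)
        for name, values in genes.items()
        if keep_gene(values)
    }


# Toy intensities for three samples (not taken from the leukemia data set).
toy = {
    "g1": [100.0, 1000.0, 10000.0],  # ratio 100, range 9900: kept
    "g2": [100.0, 200.0, 300.0],     # ratio 3: excluded
    "g3": [50.0, 300.0, 400.0],      # ratio 8 but range 350: excluded
}
reduced = preprocess(toy)
```

In the toy example only the first gene passes both conditions, which shows that the ratio and the range criteria each remove a gene on their own.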

Step (b) of the pre-processing eliminates genes that show insufficient variation across the samples. This thesis follows the protocol in Dudoit et al. (2002) exactly, and the reduced leukemia data set was obtained from the varbvs package in R.

Golub et al. (1999) state that it is possible to achieve excellent classification accuracy on this data set even with quite trivial methods and it is therefore considered a low error rate data set.

5.3.3 SRBCT data set

The SRBCT data set consists of 𝑝 = 2 308 gene expression levels for 𝑁 = 83 small round blue cell tumours (SRBCT) found in children. This data set is presented in Khan et al. (2001). The expression levels for the genes were obtained from glass-slide complementary DNA (cDNA) microarrays (Tibshirani et al., 2010:6567). Poplack and Pizzo (1997) explain that SRBCTs of childhood, which include neuroblastoma, rhabdomyosarcoma, non-Hodgkin’s lymphoma, and the Ewing’s family of tumours, are so called because of their similar appearance on routine histology. The types of SRBCT all look very similar but belong to four distinct categories, which makes correct clinical diagnosis using light microscopy extremely challenging. However, it is crucial to diagnose the SRBCT accurately, as treatment options, responses to therapy and prognoses vary widely depending on the diagnosis (Khan et al., 2001).

The SRBCT data set is a multiclass data set and SRBCT patients belong to one of the following four childhood tumour classes: BL (Burkitt lymphoma), EWS (Ewing Family of Tumors), NB (Neuroblastoma) and RMS (Rhabdomyosarcoma), as shown in Table 5.2.

Table 5.2: Summary of the SRBCT microarray data set

BL | EWS | NB | RMS
11 | 29 | 18 | 25

The response variable, 𝑌 ∈ {1,2,3,4}, corresponds to the four tumour classes, namely BL, EWS, NB and RMS respectively. The original data set consists of 63 training samples and 25 test samples and is available at https://statweb.stanford.edu/~tibs/ElemStatLearn/data.html. However, the response values of five of the 25 test samples were missing, as these samples were not SRBCT. These five cases were removed for the purpose of this thesis.

There is some confusion regarding the group names and their respective labels across the papers which investigate the SRBCT data set; however, in this thesis the names and labels for the SRBCT data set follow the generally agreed convention.

The training and test sets were combined to create one data set consisting of 83 SRBCT patients. The classification models built on the SRBCT data set attempt to distinguish between these four tumours based on gene expression values.
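The sample count of 83 follows from pooling the 63 training samples with the 25 test samples and dropping the five unlabelled test cases (63 + 25 − 5 = 83). A minimal Python sketch of this pooling is given below; the function name and the (features, label) representation are hypothetical, and the data are synthetic placeholders in which only the sample counts mirror the text.

```python
def pool_labelled(train, test):
    """Pool training and test samples, dropping samples without a class label.

    Each sample is a (features, label) pair; a missing label is None.
    """
    return [sample for sample in train + test if sample[1] is not None]


# Synthetic stand-ins: features are empty and labels are placeholders,
# so only the sample counts mirror the text.
train = [([], 1 + i % 4) for i in range(63)]
test = [([], 1 + i % 4) for i in range(20)] + [([], None)] * 5
pooled = pool_labelled(train, test)
```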
