
Instituto Tecnológico y de Estudios Superiores de Monterrey

Campus Estado de México

School of Engineering and Sciences

Predicting Drug Responses in Cancer Cells Using Genomic Features and Machine Learning

A thesis presented by

Cody Eduardo Evans Trejo

Submitted to the

School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of Master of Science

in

Computer Science

Monterrey, Nuevo León, May 2020


Instituto Tecnológico y de Estudios Superiores de Monterrey

Campus Estado de México
School of Engineering and Sciences

The committee members hereby certify that they have read the thesis presented by Cody Eduardo Evans Trejo and that it is fully adequate in scope and quality as a partial requirement for the degree of Master of Science in Computer Science.

Dr. Victor Manuel Treviño Alvarado
Tecnológico de Monterrey
Principal Advisor

Dr. Juan Emmanuel Martínez Ledesma
Tecnológico de Monterrey
Principal Advisor

Dr. José Tamez Peña
Tecnológico de Monterrey
Committee Member

Dr. Antonio Martínez Torteya
Universidad de Monterrey
Committee Member

Dr. Raúl Monroy Borja
Director of the Program in Computer Science
School of Engineering and Sciences
Monterrey, Nuevo León, May 2020


Declaration of Authorship

I, Cody Eduardo Evans Trejo, declare that this thesis titled "Predicting Drug Responses in Cancer Cells Using Genomic Features and Machine Learning" and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Cody Eduardo Evans Trejo
Monterrey, Nuevo León, May 2020

© 2020 by Cody Eduardo Evans Trejo. All Rights Reserved.


Dedication

I would like to dedicate this thesis to my mom for giving me encouragement and love until her last days. During the course of my Master's degree, many things have happened, and I would not have been able to make it without the help of my family, girlfriend, and friends.


Acknowledgements

I would like to express my deepest gratitude to Dr. Emmanuel Martinez for all his patience, flexibility, support, and dedication. To Dr. Victor Treviño for his advice, knowledge, and support throughout this two-year period. To Tecnologico de Monterrey for its tuition support, and to CONACyT for its scholarship grant. Thanks to all my professors and colleagues who were instrumental during these last two years of my life.


Predicting Drug Responses in Cancer Cells Using Genomic Features and Machine Learning

by

Cody Eduardo Evans Trejo

Abstract

This document presents an analysis of the prediction of drug responses in cancer cells using cancer genomic features, for the Master of Science in Computer Science at Instituto Tecnológico y de Estudios Superiores de Monterrey. Cancer is a genetic disease characterized by the progressive accumulation of mutations. There are several genomic features involved in oncogenesis, such as gene mutation, copy number, expression, and epigenetic alterations.

These features vary depending on the person and the type of cancer, making it difficult to determine whether a drug will produce a successful response in each specific case. Recently, two large-scale pharmacogenomic studies screened multiple anticancer drugs on over 1000 cell lines in an effort to elucidate the response mechanisms of anticancer drugs. Based on these data, we propose a drug-response prediction framework that uses gene expression, methylation, copy number, mutation, and protein expression features together with drug sensitivity data from the Cancer Cell Line Encyclopedia (CCLE) database. For this, we compare the performance of several algorithms, such as Random Forest, Support Vector Machine, Elastic Net, and Extreme Gradient Boosting Tree (XGBoost). The robustness of our model was validated by cross-validation. The RNAseq dataset with XGBoost obtained the highest average accuracy among individual datasets. Our unified model achieved good cross-validation performance for most drugs in the Cancer Cell Line Encyclopedia (85% accuracy). These results suggest that drug response can be effectively predicted from genomic features using a battery of machine learning algorithms. Our model could be applied to predict drug response for certain drugs and could potentially play a complementary role in personalized medicine.


Contents

Abstract v

List of Figures viii

List of Tables ix

1 Introduction 1

1.1 Problem Statement . . . 2

1.2 Objectives . . . 3

1.3 Research Questions . . . 4

1.4 Solution overview . . . 4

1.5 Scope of the thesis . . . 5

2 Data 6
2.1 Biological information . . . 6

2.1.1 Cancer . . . 6

2.1.2 Cell Lines and Drug Response . . . 7

2.2 CCLE Data . . . 8

2.3 Features . . . 11

2.4 Data preparation . . . 13

2.4.1 Genomics Data distribution . . . 13

2.4.2 Drug Data distribution . . . 14

2.4.3 Clustering IC50 . . . 16

2.4.4 Filtering Genomic Data . . . 17

3 Machine Learning 19
3.1 Learning Algorithm . . . 19

3.1.1 Regularized regression . . . 20

3.1.2 Kernel Based Methods . . . 21

3.1.3 Ensemble Methods based on Decision Trees . . . 21

3.2 Related work . . . 23

3.3 Caret Library . . . 25

3.3.1 Caret Model . . . 25

3.3.2 Model Selection . . . 27


4 Predicting Drug Responses in Cancer Cells 29

4.1 Preliminary methods comparison . . . 29

4.2 Results . . . 30

4.3 Discussion . . . 33

4.4 Biological Relevance . . . 34

5 Conclusion 37
5.1 Future Work . . . 38

Bibliography 43


List of Figures

2.1 The drug response curve and the effective concentration. From Feher, J. 2017 8

2.2 microRNA dataset . . . 9

2.3 Methylation 1kb dataset . . . 10

2.4 Methylation CpG dataset . . . 10

2.5 Gene alteration dataset . . . 11

2.6 Drug response dataset . . . 11

2.7 Density plot of gene expression for cell line DMS53. A. Raw miRNA data. B. Transformed miRNA data . . . 14

2.8 Cancer type frequency in drug AEW541 . . . 15

2.9 Density plot of IC50 in drug response database . . . 16

2.10 Cluster plot of IC50 with k-means and 2 clusters . . . 17

3.1 Possible hyperplanes. From Ghandi, R. 2018 . . . 21

3.2 Code for predictive model glmnet in CARET . . . 25

3.3 Performance of predictive model glmnet in CARET . . . 26

4.1 Graphs showing the performance metrics used to assess the predictive power of the XGBoost model on different data sets for the 19 drugs. (a) Accuracy values of the 19 drugs for each individual dataset. (b) Accuracy values of the 6 datasets for each individual drug. . . . 31

4.2 Beeswarm boxplot of the four predictive models' performance for individual data sets across the 19 drugs . . . 32

4.3 Graphical representation of the mean performance of the four predictive models for individual data sets and for the unified data set . . . 33

4.4 For each cancer, color-coded bars indicate the percentage of patients (maximum 12 patients) with high and medium CAB012994 expression level. Low or not detected protein expression results in a white bar. . . . 35

4.5 Gene network of SPRY4, DUSP6, ETV4, FYB and MAP2K1 . . . 36


List of Tables

2.1 Distributions of samples per dataset . . . 9
2.2 Raw data vs. clean data . . . 13
2.3 Section of the drug response data . . . 15
3.1 Preliminary predictive models and their respective tuning parameters . . . 28
4.1 Accuracy, run time, and number of features of predictive models in the microRNA dataset . . . 30


Chapter 1

Introduction

Cancer is a generic term for a large group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. According to the World Health Organization, cancer is the second leading cause of death in the world and was responsible for 9.6 million deaths in 2018. Almost one in every six deaths in the world is due to this disease [48]. Cancer is a genetic disease characterized by the progressive accumulation of genomic alterations. There are several genomic features involved in oncogenesis, such as gene mutation, copy number, expression, and epigenetic changes. In this area, the Cancer Cell Line Encyclopedia (CCLE) has registered over 1600 mutated genes and 392 recurrent mutations affecting 33 known cancer genes [6].

Clinical trials are complex and expensive, but pre-clinical data, such as cell lines, can dramatically increase the likelihood of success during clinical development [47]. Cell lines are good experimental models and are widely used in the pre-clinical stage of drug development.

They are one of the main tools in medical and biological research. Compared with multicellular animal models, cell culture is a simpler system for studying biochemical or molecular processes [17]. The use of cell lines has improved the knowledge of many diseases, helped analyze the response to drugs, and determined the effect of specific mutations in the genome [3]. Research with cell lines has helped to characterize some of the typical features of cancer, as well as a diversity of therapeutic responses [28].

Three important projects, the Genomics of Drug Sensitivity in Cancer (GDSC) [58], the Cancer Cell Line Encyclopedia (CCLE) [6], and the Cancer Genome Project (CGP) [20], performed genomic profiling (somatic mutation, copy number alterations, and gene expression levels) for hundreds of cancer cell lines and treated them with multiple established compounds. The databases of these three projects are publicly available and have become essential resources for the development and testing of methods for predicting drug response.

The fast advances of second-generation DNA sequencing are allowing a continuous flow of new information and knowledge about cancer. In recent years, it has become possible to sequence the transcriptome, whole exome, or whole genome of cancer samples [46]. As a result, bioinformatics has an important role in personalized medicine research, due to the need for techniques that can integrate and analyze such extensive data to improve our understanding of disease and potential therapeutic options [52].

However, the number of cell lines and compounds makes data analysis cumbersome. Furthermore, additional complexity exists given the disparity of responses to drugs. Though having quite similar clinical symptoms, different patients may have different responses to the same drug or therapy [34]. So personalized medicine, which makes medical decisions based on patients' genetic content, is becoming the main direction of future cancer therapeutic approaches. Therefore, accurate prediction methods are necessary to facilitate and speed up drug discovery, the repositioning process, and biomarker discovery.

Drug response prediction is a well-studied problem in which the molecular profile of a given sample is used to predict the effect of a given drug on that sample. For any prediction, it is necessary to establish a relationship between a feature and a result or response. In cancer research, genomic data from cell lines are often utilized as features and drug response is the result. Large-scale drug sensitivity tests in cancer cell lines have been used to identify genetically significant interactions between drugs and genes using linear regression models [42]. These machine learning models primarily compute an optimal coefficient vector to minimize an objective function that measures the coherence between the model and the training data. For example, Dong et al. [16] propose a support vector machine (SVM) classification model to accurately predict drug sensitivity according to gene expression profiles in the CCLE dataset. This implies that the attached weight corresponds to the importance of each characteristic (feature selection) for predicting the response of the cell line, typically represented by the IC50 value (the drug concentration for a 50% inhibition of cell growth). Gupta et al. used a genomic-feature-based model to predict anticancer drug response [25]. This kind of early approach employing cell line genomic features typically achieves a relatively good classification accuracy (AUC > 0.6) [31].

New features, strategies, and predictive models for the optimization of cancer drug response prediction are constantly being proposed from many angles. However, these models are still far from giving an optimal or clinically usable result. This chapter outlines the main motivations of the research, the specific problem to be addressed, the method we are proposing, the scope of the thesis, and an outline of the dissertation.

1.1 Problem Statement

The systematic translation of cancer genomic data into oncogenesis knowledge and clinical treatments remains challenging. Such efforts should be greatly assisted by robust preclinical model systems that reflect the genomic diversity of human cancers and for which detailed genetic and pharmacological annotations are available. Although linear regression models can help determine certain relationships between genes and drug response, the limits of the data do not allow us to obtain an accurate predictive model. Clinical data sets in cancer research are high-dimensional data that reduce the predictive power of a classifier or regressor. Classical regression methods (least squares, logistic, and Cox-PH regressions) will fail on high-dimensional data. In machine learning, this is known as the curse of dimensionality problem [33].

For example, the CCLE data have 947 human cancer cell lines, but there are more than 20,000 human genes that can be studied [20]. The number of genes profiled is orders of magnitude larger than the number of samples; this type of data set is termed high-dimensional. If we have more features than observations, we run the risk of massively overfitting our model, which would generally result in undesirable test performance. Too many features can also confuse certain machine learning algorithms, such as clustering algorithms, because they make each observation in the dataset appear equidistant from all the others. Since clustering uses a distance measure, such as the Euclidean distance, to quantify the similarity between observations, all observations then appear similar and distinctive clusters cannot be formed. Another problem is that the effective use of mutation or copy number information in drug response prediction is challenging, because it is more difficult to learn from binary or discrete-valued features [13].

Generally, there are three parts of a predictive model that must be considered in order to overcome the challenges associated with applying machine learning to drug response prediction. The first is the high-dimensionality problem described above. The second is the pre-processing and filtering of the data; often the data need to be preprocessed before they can be used with a specific machine learning method [26]. The third is the choice of the machine learning algorithm: not all machine learning methods are applicable to all situations. Some are better for certain kinds of problems than others, and it is important to recognize this; some methods may have assumptions or data requirements that render them inapplicable to the problem being analyzed [26]. Thus, from a computational perspective, predicting drug-response data in cancer cell lines using genetic expression, protein expression, and genomic alterations provides a good computational challenge. It poses a difficult problem for feature engineering and selection, as well as for avoiding prediction overfitting with machine learning algorithms. The design and implementation of strategies in these three general parts may obtain prediction results that are better than, or at least similar to, those obtained by previous models. Although different machine learning algorithms have been used separately [6] [16] [18] [19] [25] [37], the integration of a library of filtering methods, machine learning algorithms, and feature engineering methods, able to use different types of algorithms depending on the type of variable to obtain a more accurate prediction, has not been considered.

1.2 Objectives

The general objective of this project is to evaluate and develop a drug-response prediction system using an integrated library of machine learning algorithms and feature engineering methods.

The system should be able to use different types of algorithms, depending on the type of variable, to obtain a more accurate prediction. To improve prediction, we will also incorporate genetic expression, proteomic, and epigenomic profiling data. To achieve this general objective, the following particular objectives are considered:


• Assess the performance of different machine learning algorithms for predicting responses to cancer therapies.

• Evaluate the performance of machine learning algorithms while reducing the search space.

• Assess the predictive potential of the different genomic information data sets in independent predictive models.

• Design and implement filter selection methods capable of processing and simplifying various types of variables (continuous and discrete).

• Design and implement feature engineering strategies able to find new types of biomarkers to determine the responses to different types of drugs.

• Compare the performance of the prediction system against those presented in the literature.

1.3 Research Questions

Through the use of a library of machine learning algorithms that includes regularized regression, kernel-based methods, and ensemble methods based on decision trees for drug-response prediction, we can improve the predictive level or find better biomarkers for drug-response prediction in cancer cell lines than those reported in the literature. We propose to include genomic, proteomic, and epigenomic profiling data as genomic features. The research questions that we will be answering are:

• Can unified features from all the machine learning methods improve the general performance of independent predictive models?

• What are the advantages of each machine learning method, and which variable types does each handle better than the rest?

• Does a predictive model perform better when including certain genomic features?

• Can better feature engineering strategies find new types of biomarkers to determine the responses to different types of drugs?

• What are the advantages of predictive models developed following this approach?

1.4 Solution overview

In order to carry out this project successfully, the actions to be carried out have been divided into three categories: data, algorithms, and tests. In the case of data, the most reliable, clear, and reproducible datasets are necessary to improve overall model accuracy; we therefore collected, curated, and preprocessed 6 data sets (microRNA, RNAseq, Gene Alteration, Met1kb, MetCpG, and Protein expression) from the Cancer Cell Line Encyclopedia. Filter strategies are constructed to reduce the informative genes during the pre-processing part by filtering features with little or no variance at all. For algorithms, we evaluate the usage of regression, kernel-based methods, and ensemble methods based on decision trees. A comparison is made for choosing the best model for each type of method. The last part is the training and validation of the models and the documentation and demonstration of results. To assess the predictive potential of the models, as well as that of the 6 genomic information data sets, four independent predictive models were built on the 6 data sets for the 19 drugs, giving us a total of 456 predictive models.

1.5 Scope of the thesis

The following chapters are organized as follows: Chapter 2 focuses on the description of the elements involved in the data; the genomic characteristics of cancer and the important data that are part of the genomic features are detailed. Chapter 3 gives an overview of the theoretical framework of machine learning and related work, and describes the solution models. Chapter 4 shows the results and discussion of the models. Finally, Chapter 5 discusses the conclusions of the thesis and establishes the future work.


Chapter 2

Data

Data preparation is the first step in data analytics projects and can include many discrete tasks such as loading data, data analysis, data cleaning, data exploration, and transformation.

This chapter focuses on the description of the elements involved in the data. The genomic characteristics of cancer and the important data that are part of the genomic features are detailed. The response to drugs and the elements used in the correlation with the genomic features for prediction are also explained, as well as the origin, behavior, and distribution of the data. In addition, the definition, importance, process, and tools used for data preparation are mentioned. Finally, the actual changes made to the different datasets used in this thesis are described.

2.1 Biological information

2.1.1 Cancer

Although it is commonly seen as a single disease, cancer is actually the name given to the collection of diseases that involve abnormal cell growth with the potential to invade or spread to other parts of the body. Normally, human cells grow and divide to form new cells as the body needs them. When cells grow old or become damaged, they die, and new cells take their place. In the case of cancer cells, this orderly process is damaged by the accumulation of mutations or exposure to damage [14]. Those damaged cells develop a resistance to apoptosis (programmed cell death) and begin to reproduce when and where it is not necessary. These extra cells can divide without stopping and may form growths called tumors.

A tumor may be generated for reasons unrelated to cancer, in which case it is considered benign, while those produced by cancer are called malignant tumors. The difference between a malignant and a benign tumor is that the former grows disproportionately and aggressively, invading adjacent tissues and metastasizing (spreading to distant tissues or organs) [29] [48].

Cancer is a genetic disease characterized by the progressive accumulation of genomic alterations [14]. That means it is caused by changes to genes that control the way our cells function, especially how they grow and divide. These genetic changes may be hereditary but can also arise during the lifetime as a result of cell replication errors or damage to DNA (environmental exposure) [29]. The genes whose changes contribute to cancer are called cancer driver genes, and they fall into three main types: proto-oncogenes, tumor suppressor genes, and DNA repair genes [6]. Proto-oncogenes are involved in normal cell growth and division, and some changes in them allow cells to grow and survive when they should not. Tumor suppressor genes, as the name implies, control cell growth and division. DNA repair genes are involved in fixing damaged DNA. Mutations in these genes can increase the number of errors in cell replication, thus increasing the probability of mutations in the two types of genes above [29].

The genomic alterations involved in oncogenesis can be generalized into three genomic features: somatic mutation, copy number variation (CNV), and alterations in methylation profiles. Somatic mutation refers to those genetic alterations acquired by any cell that is not involved in sexual reproduction; this means they are never transmitted to descendants. Somatic mutations are frequently caused by environmental factors [24]. Copy number is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals of the same population [44]. CNV is then the increase or decrease of these sections of genes. Focal recurrently aberrant copy number segments (RACSs) are particularly important in cancer studies because they are believed to encode key genes driving cancer growth. Recurrent focal gains are associated with oncogenes and focal losses with tumor suppressor genes [61]. Methylation is a gene expression control mechanism important for gene inactivation in cancer cells. Hypermethylation of CpG islands has been described in almost every type of tumor, showing its relevance in tumorigenesis [50].

2.1.2 Cell Lines and Drug Response

A cell line is a permanently established cell culture that will proliferate indefinitely given appropriate fresh medium and space. Lines differ from cell strains in that they become immortalized. They allow the examination of stepwise alterations in the structure of complex tissues, where in vivo examination of individual cells is difficult, if not impossible [55]. Cell lines are easy experimental models and are widely used in the pre-clinical stage of drug development. They are one of the main tools in medical and biological research. Compared with multicellular animal models, cell culture is a simpler system for studying biochemical or molecular processes [17], which has improved the knowledge of many diseases, helped analyze the response to drugs, and determined the effect of specific mutations in the genome [3]. Research with cell lines has helped to characterize some of the typical features of cancer, as well as a diversity of therapeutic responses [28].

The drug response describes the change in effect on an organism caused by differing dose levels of a drug after a certain exposure time. One of the values used to describe this interaction is the half maximal effective concentration (EC50). EC50 refers to the concentration of a drug that induces a response halfway between the baseline and the maximum after a specific exposure time. For this reason, it is commonly used as a measure of a drug's potency. This drug response is measured in molar units (M). As shown in Figure 2.1, EC50 represents the concentration of a compound at which 50% of the population exhibits a response, after a specified exposure duration; the concentration axis relates to the potency, while the response axis relates to the efficacy. On the same drug-response curve, another important drug-response value can be represented: the half maximal inhibitory concentration (IC50). IC50 is a quantitative measure that indicates how much of a particular inhibitory substance (e.g., a drug) is needed to inhibit, in vitro, a given biological process or biological component by 50%. For competition binding assays and functional antagonist assays, this value is the most common summary measure of the dose-response curve [22].

Figure 2.1: The drug response curve and the effective concentration. From Feher, J. 2017

2.2 CCLE Data

The Cancer Cell Line Encyclopedia (CCLE) is a public project that contains large Open Access files. The CCLE is made possible through a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation, to perform detailed genetic and pharmacologic characterization of a large number of human cancer models. The CCLE public project contains Open Access sequencing data (in the form of reads aligned to the hg19 broad variant reference genome) for nearly 1457 cancer cell line samples. This information is distributed throughout 68 databases.

From this information we consider the use of 6 datasets with genomic information, presented in Table 2.1, plus a dataset with drug-response information for the cell lines. As can be seen in Table 2.1, despite having information on 1457 cell lines, genomic information is not registered for all of them, and it varies depending on the database. In addition, it is important to consider that we cannot use information from cell lines that do not have drug-response information. Due to this, data cleaning is imperative before we can work with the data.

Each database has information specific to its type of genomic characteristics. Starting with microRNA (miRNA), the first two columns give the name and description, and the other columns are the cell lines, as shown in Figure 2.2.


Data set             Genes   Cell Lines
RNAseq               56300   1156
Methylation CpG      54531   848
Gene alteration      48271   1048
Methylation 1kb      20192   850
microRNA             734     956
Protein Expression   215     702

Table 2.1: Distributions of samples per dataset.

The name column is the CCLE nomenclature for miRNA (i.e., nmiR00001.1, nmiR00001.2). The description column presents the general name used in the literature (e.g., hsa-let-7a, hsa-let-7b). In the RNAseq dataset, the name column is the CCLE nomenclature (i.e., ENSG00000223972.4) and the literature name is in the description column (e.g., DDX11L1, WASH7P). The cell line columns have the cell line name followed by an underscore and the type of cancer (e.g., 22RV1_PROSTATE). This cell line nomenclature is the same for all 7 datasets, with the exception of RNAseq, because it also has the DepMap ID in parentheses (e.g., 22RV1_PROSTATE (ACH-000956)).

Figure 2.2: microRNA dataset

Methylation modifies the function of DNA when it is found in the gene promoter. CCLE therefore divided the methylation data into two databases, the first for promoters 1 kb upstream of the transcription start site (TSS) and the second for promoter CpG (5'-C-phosphate-G-3') clusters. Because of this we name them methylation 1kb and methylation CpG. For methylation 1kb, the columns are TSS id, gene, chr, fpos, tpos, strand, and avg coverage; the rest of the columns are the cell lines (Fig. 2.3). TSS id holds the compressed information from gene, chr, fpos, and tpos: the gene, the number of the chromosome on which the gene is located, and the starting and ending positions in the gene. The strand column refers to the DNA sense and takes the value + or -. A double-stranded DNA molecule is composed of two strands with sequences that are complements of each other. To help molecular biologists identify each strand individually, the two strands are usually differentiated as the "sense" strand and the "antisense" strand. An individual strand of DNA is referred to as positive-sense (also positive (+) or simply sense) if its nucleotide sequence corresponds directly to the sequence of an RNA transcript which is translated or translatable into a sequence of amino acids. The other strand of the double-stranded DNA molecule is referred to as negative-sense (also negative (-) or antisense). The last column, avg coverage, is the average base-pair coverage.

Figure 2.3: Methylation 1kb dataset

In the case of CpG methylation, the columns are: cluster id, gene name, RefSeq id, CpG sites hg19, and avg coverage. The RefSeq id is the NCBI Reference Sequence id for the gene, while gene name is the general name in the literature. CpG sites hg19 gives the CpG positions in the hg19 human genome build. Because there can be different CpG positions, the same gene can have different expression readings, so the cluster id represents the gene name plus a number (e.g., SGIP_1) to differentiate the same gene at a different CpG site. The rest of the columns in the methylation CpG dataset are the cell lines.

Figure 2.4: Methylation CpG dataset

For gene alteration, the first two columns are informative and the rest are cell lines (Fig. 2.5). The informative column is duplicated as name and description. It includes the gene followed by the type of genomic alteration, which can be mutation (MUT), copy number alteration (CNA), amplification (AMP), or deletion (DEL). Thus CNTN1_MUT is a mutation in the CNTN1 gene.

Figure 2.5: Gene alteration dataset

For the protein expression dataset, we have the cell line column and the rest of the columns are proteins. Finally, for the drug-response database, the following columns are included: CCLE Cell Line Name, Primary Cell Line Name, Compound, Target, Doses, Activity Data, Activity, Num Data, FitType, EC50, IC50, Amax, and ActArea. The most important characteristics were considered: CCLE Cell Line Name, Compound, IC50, and ActArea. CCLE Cell Line Name uses the same cell line nomenclature as miRNA, Compound describes the drug, IC50 is in micromolar (µM), and ActArea describes the area under the curve.

Figure 2.6: Drug response dataset

2.3 Features

The features in the data directly influence the predictive models used and the results that can be achieved: the better the features are prepared and selected, the better the results. Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. When feature engineering is done correctly, it improves the overall model accuracy (predictive power) of machine learning algorithms by creating features from raw data that help facilitate the machine learning process [63].

Not all features are equally important. Some features will be more important than others to the model accuracy; those irrelevant to the problem need to be removed, as do those that are redundant in the context of other features. But it is necessary to be careful when discarding features, which could expose us to the risk of over-fitting our model.

For better data processing, there are several strategies that we can group into three feature engineering processes: processing, normalization, and weighting. Feature processing is characterized by modifying the values; these modifications depend on the objective and type of data. We can divide it into three methods used to convert the data: feature binarisation, feature discretization, and feature value transformation. Feature binarisation refers to the process in which numeric values are transformed to Boolean ones. This type of process is frequently used to indicate the presence or absence of a characteristic. Feature discretization converts continuous data to discrete data and considers whether to use equal-sized partitions or equal-interval partitions. Feature value transformation is a method used to standardize the range of independent variables or feature data [63] [57].

The second feature engineering process is normalization. Feature normalization is the process of adjusting the values of features to a common scale. This process is very important, since the range of values of raw data varies widely; in some machine learning algorithms, objective functions will not work properly without normalization. The last one is feature weighting, a technique used to approximate the optimal degree of influence of individual features using a training set. When successfully applied, relevant features are attributed a high weight value, whereas irrelevant features are given a weight value close to zero [57].

Feature weighting, better known as feature selection, can be used not only to improve classification accuracy but also to discard features with weights below a certain threshold value and thereby increase the resource efficiency of the classifier. Feature selection can use different methods to rank and choose features, but there are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods [49] [41] [63].

Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable. Examples of filter methods include the Chi-squared test, information gain, and correlation coefficient scores [49].

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes, to add and remove features. An example of a wrapper method is the recursive feature elimination algorithm [49].

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods, also called penalization methods, which introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients). Examples of regularization algorithms are LASSO, Elastic Net, and Ridge Regression [49].

2.4 Data preparation

From the 6 genomic datasets we require the gene information and the cell lines. Furthermore, among the cell lines we are interested in those that have an IC50 value in the drug database. So in the miRNA data we changed the description column to Gene, and since of the 956 cell lines we only have drug-response information for 470, we filtered out those without this information. For RNAseq the procedure is the same, using the description column and the 432 cell lines that have drug-response information. In the case of CpG methylation, we only consider cluster id and 381 of the 848 cell lines. The same was done for the protein expression, methylation 1kb, and gene alteration datasets. For the drug-response database, the following important characteristics were considered: cell lines, chemical compound, IC50, and AUC. In addition, the cancer type is added as one of its columns. Finally, the column names were made uniform, as in the case of the genes, whose name varied in each database (Symbol, Hugo Symbol, name of the gene, gene, description, etc.), or the case of the cell lines (1321-N-1, 131321N1 CENTRAL NERVOUS SYSTEM, 131321N1). In the end, as can be seen in Table 2.2, we only use 42.51% of the initial data. A short sketch of this cleaning step is given after Table 2.2.

Data set             Raw data       Clean data
RNAseq               56300 x 1156   56300 x 472
Gene Alteration      48271 x 1048   48271 x 481
Methylation CpG      54531 x 848    54531 x 381
Methylation 1kb      20192 x 850    20192 x 391
Protein Expression   215 x 702      215 x 422
miRNA                734 x 956      734 x 432

Table 2.2: Raw data vs. clean data
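To illustrate this cleaning step, the following R sketch keeps only the genomic columns whose cell line also appears in the drug-response table. The file names and column labels used here (CCLE_miRNA.csv, Description, CCLE Cell Line Name) are hypothetical placeholders, not the exact CCLE file layout.

    # Hypothetical file names; the real CCLE downloads are named differently
    mirna   <- read.csv("CCLE_miRNA.csv", check.names = FALSE)
    drug_df <- read.csv("CCLE_drug_response.csv", check.names = FALSE)

    # Rename the gene column uniformly, as described above
    colnames(mirna)[colnames(mirna) == "Description"] <- "Gene"

    # Keep only cell lines that also have drug-response information
    shared      <- intersect(colnames(mirna), unique(drug_df[["CCLE Cell Line Name"]]))
    mirna_clean <- mirna[, c("Gene", shared)]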

2.4.1 Genomics Data distribution

The genetic expression in the RNAseq and miRNA databases is reported in RPKM (Reads Per Kilobase Million) to normalize for sequencing depth and gene length. Using the density plot of Figure 2.7a, we can see that for cell line DMS53 the values range from 0 to 20000, with the majority between 0 and 100. The distribution of RPKM values is skewed, and by log-transforming it we can bring it closer to a normal distribution. A log-transform (log2(x + 1)) of the RNAseq and miRNA datasets therefore makes variation comparable across orders of magnitude.

In the case of the rest of the genomic databases, they show a normal distribution. In methylation, both data sets have values between 0 and 1. On the other hand, protein expression is in a range of -5 to 5. Finally, the Gene Alteration dataset has binary values (presence or absence of a genetic alteration).
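A minimal R sketch of this transformation, assuming expr is an RPKM matrix with genes as rows and cell lines as columns (the column name used below is hypothetical):

    # log2(x + 1) transform described above
    expr_log <- log2(expr + 1)

    # Before/after comparison for one cell line, as in Figure 2.7
    par(mfrow = c(1, 2))
    plot(density(expr[, "DMS53_LUNG"]),     main = "Raw RPKM")
    plot(density(expr_log[, "DMS53_LUNG"]), main = "log2(RPKM + 1)")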

Figure 2.7: Density plot of gene expression for cell line DMS53. A. Raw miRNA data. B. Transformed miRNA data.

2.4.2 Drug Data distribution

As mentioned briefly in Section 2.2, the drug database is a 5-column by 11670-row data frame. These columns are: Type, Cell Line, Drug name, IC50, and AUC. The Type column is composed of the 23 types of cancer shown in Figure 2.8. It can be seen that the number of cell lines for each type is not uniformly distributed. Lung cancer in this case has 93 cell lines, while salivary gland and biliary tract have 1 cell line each. Although for each drug the number of cell lines in each type of cancer changes slightly, the tendency is similar. There are a total of 24 different drugs in the Drug name column: AEW541, Nilotinib, 17-AAG, PHA-665752, Lapatinib, Nutlin-3, AZD0530, PF2341066, L-685458, ZD-6474, Panobinostat, Sorafenib, Irinotecan, Topotecan, LBW242, PD-0325901, PD-0332991, Paclitaxel, AZD6244, PLX4720, RAF265, TAE684, TKI258, and Erlotinib.

The IC50 values of the drug-response database are provided in µM with a range from 0 to 8. It is important to note that 8 µM is not the maximum value that can be reached; it was fixed as a cap, meaning the half maximal inhibitory concentration may be reached at 8 µM or more. In this respect, the IC50 values are already pseudo-discretized. The complete discretization of the IC50 into two groups (Sensitive and Resistant) is extensively described in Section 2.4.3.

As can be seen in Figure 2.9, the density is bimodal: 23.4% of the values lie between 0 and 2 and 59.4% between 6 and 8, with the remaining 17% between 2 and 6. Even though for each drug the values 1 and 8 are always the highest peaks, which of the two dominates depends on the drug; in the case of the drug 17-AAG, for example, 88% of the values are between 0 and 1 and only 5% are between 7 and 8, while the drug Panobinostat has 100% of its values between 0 and 1.2.


Figure 2.8: Cancer type frequency in drug AEW541

To avoid using an unbalanced database that would subsequently affect the predictions of the machine learning algorithms, a drug filter was created to clean the drug database. First, a table like the one shown in Table 2.3 was created, counting how many cell lines for a given drug and type of cancer are Resistant or Sensitive. In this way, the drugs Nutlin-3, Panobinostat, Irinotecan, RAF265, and TAE684 were discarded, and within this thesis only 19 drugs were evaluated. Sensitive is defined as IC50 values from 0 to 4 and Resistant as values from 4.5 to 8; for more information on the discretization of IC50, see Section 2.4.3. The table shows that there are some types of cancer that are only resistant, only sensitive, or that have very little data to compare. The table was then filtered so that each row has at least 3 resistant and at least 3 sensitive cell lines; any row with fewer than 10 cell lines was also removed. Finally, the filtered drug table, with dimensions of 156 rows and 5 columns, was obtained. This table refines the drug database and is also used to filter the genomic data in Section 2.4.4. In this way the drug database is reduced to 6 columns by 5846 rows; the sixth column is called class and takes the value S (sensitive) or R (resistant). A code sketch of this filter is given after Table 2.3.

Cancer Type     Drug name   Resistant   Sensitive   Total Sum
BILIARY TRACT   AEW541      1           0           1
BONE            AEW541      5           6           11
ENDOMETRIUM     AEW541      19          1           20
PLEURA          AEW541      7           0           7
PROSTATE        AEW541      2           1           3
LUNG            AZD6244     77          15          92

Table 2.3: Section of the drug response data
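The filter described above can be sketched in R with dplyr as follows; the column names (Type, Drug, IC50) are assumptions about the cleaned drug table, not verified CCLE labels.

    library(dplyr)

    counts <- drug_df %>%
      mutate(class = ifelse(IC50 <= 4, "S", "R")) %>%   # Sensitive: 0-4, Resistant: 4.5-8
      group_by(Type, Drug) %>%
      summarise(Resistant = sum(class == "R"),
                Sensitive = sum(class == "S"),
                Total     = n(), .groups = "drop")

    # Keep rows with at least 3 cell lines of each class and at least 10 in total
    drug_filtered <- filter(counts, Resistant >= 3, Sensitive >= 3, Total >= 10)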


Figure 2.9: Density plot of IC50 in drug response database

2.4.3 Clustering IC50

There are several clustering methods, such as Otsu, k-means, and Lloyd, for discretizing our IC50 values. The Otsu method is one of the most successful methods for image thresholding. However, the Otsu method is an exhaustive algorithm that searches for the globally optimal threshold, while k-means is a locally optimal method. Moreover, k-means does not require computing a gray-level histogram before running, whereas the Otsu method needs to compute a gray-level histogram first. Therefore, k-means can be extended to a multilevel thresholding method more efficiently than the Otsu method [39].

To discretize the values of IC50, k-means clustering was used. It is a method of vector quantization that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. For determining the number of clusters k, the NbClust package was used [43].

The NbClust package provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. The method used was k-means and the distance was Euclidean. The indices are: "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw".

Among all indices, 11 proposed 2 as the best number of clusters, 7 proposed 3, 2 proposed 4, one index each proposed 5, 12, 13, and 14, and 3 proposed 15. In conclusion, according to the majority rule, the best number of clusters is 2.


Figure 2.10 graphically shows the distribution of IC50 in two groups using the k-means method. The cut between the two clusters lies in the range of 4 to 4.5. In this way, values from 0 to 4 are considered sensitive and values from 4.5 to 8 resistant. A code sketch of this discretization follows Figure 2.10.

Figure 2.10: Cluster plot of IC50 with k-means and 2 clusters
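The discretization step can be sketched in R as below, where ic50 is a hypothetical numeric vector of IC50 values taken from the drug table.

    library(NbClust)

    # Let the 30 NbClust indices vote on the number of clusters (majority rule)
    nb <- NbClust(data = matrix(ic50, ncol = 1), distance = "euclidean",
                  min.nc = 2, max.nc = 15, method = "kmeans")

    # Split into two clusters and label the low-IC50 cluster as Sensitive
    km    <- kmeans(ic50, centers = 2)
    sens  <- which.min(km$centers)
    class <- ifelse(km$cluster == sens, "S", "R")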

2.4.4 Filtering Genomic Data

As mentioned in Section 2.2, there are a total of 6 datasets (microRNA, Methylation 1kb, Methylation CpG, Gene Alteration, Protein Expression, and RNAseq) that contain genomic data and that together comprise a total of 174564 features. MiRNA and protein expression comprise less than 1% of that total, so the remaining datasets are considered for attribute selection. Feature selection is an inherent problem in any data mining process. Seen as a whole, feature selection techniques are preprocessing algorithms that help us reduce the number of attributes that make up the raw collected information in order to improve the performance of the learning algorithm.

An algorithm was generated that takes the genomic dataset, the refined drug table, and the drug dataset as input and outputs the filtered genomic dataset. Take for example the RNAseq dataset: the algorithm takes the cell lines from the first row of the refined drug table; in the case of Table 2.3 that would be 92 cell lines of lung cancer, and then we take the respective features. In this way we would have a matrix of 54354 x 94. For each row, the respective filtering rule of each genomic dataset is applied. In the case of RNAseq, every row with a standard deviation below 1, and every row whose values never reach 1, is eliminated; what this filtering rule does is eliminate any row that has little variation between cell lines. Then, the non-deleted genes are stored in a general list and the process is repeated 156 times (once for each row of the filtered drug table). The selected genes are taken from the list and kept in the genomic dataset. In this way we went from 54354 genes to 11256. Initially a standard deviation of 0.5 was considered, which gave a filtered set of 23837 genes, but when comparing both sets of genes in XGBoost, the standard deviation of 1 obtained slightly greater accuracy in half the processing time.

For each dataset, the filtering rule was configured. For the two methylation datasets, any row whose values are all below 0.3, as well as any row whose values are all above 0.7, was eliminated. In the case of 1kb methylation the features were reduced from 16494 to 8570. For CpG methylation, it went from 54531 features to 14399 features.
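A base-R sketch of these per-dataset rules is given below. Because the original wording of the rules is ambiguous, the reading used here (drop rows with little variation across cell lines) is an assumption; rnaseq and met are hypothetical matrices with genes as rows.

    # RNAseq rule: drop rows whose standard deviation across cell lines is below 1
    rnaseq_f <- rnaseq[apply(rnaseq, 1, sd) >= 1, ]

    # Methylation rule: drop rows that are uniformly low (< 0.3) or uniformly high (> 0.7)
    low   <- apply(met, 1, max) < 0.3
    high  <- apply(met, 1, min) > 0.7
    met_f <- met[!(low | high), ]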


Chapter 3

Machine Learning

Algorithms are sequences of instructions developed by programmers to solve a problem and to instruct computers in new tasks. Instead of programming the computer every step of the way, the machine learning approach gives the computer instructions that allow it to learn from data without specific new step-by-step instructions written by a programmer. This means computers can be used for new, complicated tasks that could not be manually programmed. The basic process of machine learning is to give training data to a learning algorithm. The learning algorithm then generates a new set of rules based on inferences from the data. This is in essence generating a new algorithm, formally referred to as the machine learning model. By using different training data, the same learning algorithm could be used to generate different models. For example, the same type of learning algorithm could be used to teach the computer how to translate languages or to predict the stock market.

3.1 Learning Algorithm

Learning algorithms have the ability to extrapolate from test or training data to make projections or build models in the real world. Think of these algorithms as tools for "pulling data points together" from a raw data mass or a relatively unlabeled background. Learning algorithms can be grouped, depending on the nature of the feedback available to the learning system, into three types: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning consists of a target that is to be predicted from a given set of features. Unsupervised learning is used for clustering a population into different groups, and reinforcement learning trains an agent to make specific decisions. Thus, this thesis falls into the category of supervised learning.


For supervised learning, the algorithms can be grouped into an extensive number of model types, but the most commonly used machine learning algorithms are: regression, decision trees, kernel-based methods, Naive Bayes, k-nearest neighbors, dimensionality reduction algorithms, neural networks, and ensemble methods. In this thesis we evaluate the usage of regression, kernel-based methods, and ensemble methods based on decision trees.

3.1.1 Regularized regression

Regression is used to estimate real values based on continuous variables. Here, we establish a relationship between independent and dependent variables by fitting a best line, but classical regression methods will fail on high-dimensional data. In high-dimensional data, the presence of predictors with very small contributions to predictive power is likely. Keeping these predictors in the model may generate noise, leading to overfitting and lowering the prediction performance when the true vector of parameters is sparse. Regularized regression uses convex penalty terms on the coefficients: the magnitude (size) of the coefficients, as well as the magnitude of the error term, is penalized. Complex models are discouraged, primarily to avoid overfitting [5]. Coefficients that the regularization shrinks to zero can be eliminated, providing a mechanism for feature selection.

A model that belongs to this group is LASSO (least absolute shrinkage and selection operator). LASSO performs feature selection by keeping only the essential variables and giving a coefficient of zero to the others. Another widely used model is ridge regression [21] [31]. Ridge regression reduces the standard errors by adding a degree of bias to the regression estimates. The elastic net is very popular and used in cancer genomics data analysis [31] [40] [45] [16] [20]. The elastic net method overcomes the limitations of the LASSO method, which uses the penalty function in Equation 3.1.

$$L_{lasso}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - X_i \hat{\beta} \right)^2 + \lambda \sum_{j=1}^{m} |\hat{\beta}_j| \qquad (3.1)$$

The use of this penalty function has several limitations. In the case of high-dimensional data with few examples, the LASSO selects at most n variables before it saturates. Also, if there is a group of highly correlated variables, the LASSO tends to select one variable from the group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty (the squared norm $\|\hat{\beta}\|^2$), which when used alone corresponds to ridge regression. The elastic net aims at minimizing the loss function defined by Equation 3.2.

$$L_{enet}(\hat{\beta}) = \frac{\sum_{i=1}^{n} \left( y_i - X_i \hat{\beta} \right)^2}{2n} + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^{m} \hat{\beta}_j^2 + \alpha \sum_{j=1}^{m} |\hat{\beta}_j| \right) \qquad (3.2)$$

The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum [27]. That means elastic net regression combines the advantages of LASSO and ridge regression by optimizing a linear combination of the objective function and the two penalties [32].
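In practice, this model is fitted through caret's glmnet wrapper, in the spirit of Figure 3.2. The sketch below is a minimal example, where x (a feature matrix) and y (the S/R class factor) are hypothetical objects taken from the prepared datasets.

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)
    fit  <- train(x, y, method = "glmnet", trControl = ctrl,
                  tuneGrid = expand.grid(alpha  = seq(0, 1, by = 0.25),
                                         lambda = 10^seq(-4, 0, length.out = 10)))
    fit$bestTune   # the selected mixing parameter (alpha) and penalty (lambda)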


3.1.2 Kernel Based Methods

Kernel-based methods are a class of algorithms for pattern analysis. Even though many algorithms solve these tasks, most need the data in feature vector representations via a user-specified feature map, while kernel methods require only a user-specified kernel [30]. In drug prediction there have been several kernel-based methods, such as DEMKL, cwKBMF, and KRL, but the best-known member is the SVM [4]. The SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N being the number of features). Hyperplanes are decision boundaries that help classify the data points: data points falling on either side of the hyperplane can be attributed to different classes. As we can see in Figure 3.1, there are many possible hyperplanes that could be chosen to separate the two classes of data points. The SVM objective is to find the plane that has the maximum margin, since maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence [4]. A code sketch follows Figure 3.1.

Figure 3.1: Possible hyperplanes. From Ghandi, R. 2018
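As a concrete illustration, a maximum-margin classifier can be fitted in R with kernlab, the engine behind caret's svmRadial method; x, y, and x_test are hypothetical feature matrices and labels.

    library(kernlab)

    svm_fit <- ksvm(as.matrix(x), y, type = "C-svc", kernel = "rbfdot", C = 1)
    pred    <- predict(svm_fit, as.matrix(x_test))   # side of the hyperplane = class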

3.1.3 Ensemble Methods based on Decision Trees

A decision tree is one way to display an algorithm that only contains conditional control statements. Its splitting criterion can be described as

$$G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta))$$

where G is the information gain, n_left and n_right are the numbers of data points on the left and right sides of the threshold, N_m is the total number of data points, H is the chosen splitting criterion function, Q is the data at node m, and θ is the feature and threshold being evaluated. This means that the information gain measures how well a given split separates the data, so that the target values of the data points in each child node are as homogeneous as possible.


Tree-based learning algorithms have an advantage over "black-box" models, such as neural nets, in terms of comprehensibility. The logical rules followed by a decision tree are much easier to interpret, and decision trees map non-linear relationships quite well because they are non-parametric; that means they do not make any assumptions about how the data are distributed. Numerous decision tree algorithms have been developed over the years: C4.5, CART, SPRINT, SLIQ, Rainforest, among others [35]. But decision trees also have some problems: they are locally optimized, so splits are made to minimize or maximize the chosen splitting criterion. Because of the greedy nature of splitting, imbalanced classes also pose a major issue for decision trees in classification: many minority-class points can get lost in the majority-class nodes, and prediction of the minority class then becomes even less likely than it should be. Finally, tree complexity is explicitly correlated with over-fitting when stopping criteria or pruning are used inadequately [38].

Ensemble methods are a fantastic way to capitalize on the benefits of decision trees while reducing their tendency to overfit. Significant improvements in classification accuracy have resulted from growing an ensemble of trees and letting them vote for the most popular class. Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one [1]. Some of the most popular ensemble methods based on decision trees are: Random Forest, Extremely Randomized Trees, Bagging, Adaptive Boosting, Gradient Boosting, and XGBoost [35].

Random Forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. That is, a Random Forest grows many classification trees and runs the input vector down each tree in the forest; each tree gives a classification, and the forest chooses the classification having the most votes. Given an ensemble of classifiers h_1(x), h_2(x), ..., h_K(x), and with the training set drawn at random from the distribution of the random vector (Y, X), the margin function is defined as

    mg(X, Y) = av_k\, I(h_k(X) = Y) \; - \; \max_{j \ne Y} av_k\, I(h_k(X) = j)

where I(\cdot) is the indicator function and av_k denotes the average over the K classifiers. The margin measures the extent to which the average number of votes at (X, Y) for the right class exceeds the average vote for any other class [9]. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them [51].
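As a minimal sketch of this voting scheme, using the randomForest R package on a built-in dataset (the parameter values are illustrative assumptions, not the settings used in this work):

    library(randomForest)

    # Grow 500 classification trees; each tree votes and the forest
    # predicts the class with the most votes
    set.seed(42)
    rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

    # Out-of-bag vote fractions per class; the margin at a sample is the
    # vote for its true class minus the largest vote for any other class
    head(predict(rf_model, type = "vote"))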

Extreme Gradient Boosting (XGBoost) is an algorithm that has recently been widely applied in machine learning and Kaggle competitions. XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance. Training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees, which are then combined with the previous trees to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. The speed and performance of XGBoost are due to three important system optimizations and three algorithmic enhancements [54].


The three system optimizations are: parallelization, tree pruning and hardware optimization. XGBoost approaches the process of sequential tree building using a parallelized implementation. This is possible due to the interchangeable nature of the loops used for building base learners: the outer loop that enumerates the leaf nodes of a tree, and the inner loop that calculates the features. This switch improves algorithmic performance by offsetting any parallelization overheads in computation. For pruning, XGBoost grows trees up to the specified max_depth parameter, instead of stopping at a splitting criterion first, and then prunes them backward. This ‘depth-first’ approach improves computational performance significantly. Finally, the algorithm has been designed to make efficient use of hardware resources. This is accomplished through cache awareness, by allocating internal buffers in each thread to store gradient statistics [54].

The three algorithmic enhancements are: regularization, sparsity awareness and the weighted Quantile Sketch. XGBoost penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting. It naturally admits sparse features as inputs by automatically ‘learning’ the best missing value depending on the training loss, and handles different types of sparsity patterns in the data efficiently. Finally, it employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets [54].
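A minimal sketch of gradient-boosted trees with these regularization knobs, using the xgboost R package (the toy data and the parameter values are illustrative assumptions, not the ones used in this work):

    library(xgboost)

    # Toy binary classification data (hypothetical)
    set.seed(1)
    x <- matrix(rnorm(200 * 10), nrow = 200)
    y <- as.numeric(x[, 1] + rnorm(200) > 0)

    model <- xgboost(data = x, label = y,
                     objective = "binary:logistic",
                     nrounds = 50,    # boosting iterations (trees added)
                     max_depth = 4,   # depth-first growth, pruned backward
                     eta = 0.1,       # learning rate of the gradient step
                     alpha = 0.5,     # L1 (LASSO) penalty on leaf weights
                     lambda = 1.0,    # L2 (Ridge) penalty on leaf weights
                     verbose = 0)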

3.2 Related work

In the case of drug prediction, a common strategy is to model the relationships between gene features and drug efficacy outcomes using linear regression models. For example, Dong et al. [16] propose an SVM classification model to accurately predict drug sensitivity according to gene expression profiles in the CCLE dataset. Gupta et al. use a genomic-feature-based model to predict anticancer drug response [25]. These types of models are frequently used to find cancer biomarkers that can link the somatic mutation or expression level of a gene with the predicted therapy outcome. But the specific model depends on the type of response variable. When the drug response is a vector of continuous values, least squares regression is commonly used, as was done by Li et al. [37]. For a vector of binary responder status, K. El Emam et al. [19] applied logistic regression. And when the drug response is a vector of patient survival with censorship to remove patients after a follow-up time, C. Bellera et al. [7] applied Cox proportional hazards (Cox-PH) regression to identify essential features [32]. These early approaches employing cell line genomic features typically achieve a relatively good classification accuracy (AUC > 0.6) [31].

Although linear regression models can help determine certain relationships between genes and drug response, the limits of the data do not allow us to obtain an accurate predictive model. Take for example the CCLE data: it contains 947 human cancer cell lines, but there are more than 20,000 human genes that can be studied [20]. The number of genes profiled is orders of magnitude larger than the number of samples; this type of data set is termed high-dimensional. Each classical regression method computes an optimal coefficient vector to minimize an objective function that measures the coherence between the model and the training data. However, a unique optimal coefficient vector does not exist in a high-dimensional setting, because many sets of coefficients can make the model fit the training data perfectly, even when the variables are completely unrelated to the response. This brings us to what is known as overfitting, where the fitted models may not have reliable prediction performance on an independent test data set [13] [32].

To overcome these challenges, there are various machine learning methods that use variable selection to work with high-dimensional data [51]. For instance, Barretina et al. [6] proposed an Elastic-Net model to select anti-cancer DRA markers, including gene mutation, copy number variation and gene expression, and built a drug-response prediction model using the selected biomarkers. Menden et al. [45] used neural networks on the cell lines' genomic features and the drugs' chemical structure properties. In summary, we can group the machine learning algorithms that use variable selection to work with high-dimensional data into: regularized regression, neural networks, kernel-based and ensemble learning methods [32].

Another approach to applying machine learning to drug response prediction utilizes network and pathway information. Some of the feature and heterogeneity relationships that can be used are: cell line genomic alterations, cell line-drug sensitivity, drug chemical structure and protein-protein interactions. In a recent study, Fei Zhang et al. [62] propose a heterogeneous network-based method for drug response prediction that incorporates cell line genomic profiles, drug chemical structure, drug-target and protein-protein interaction information, where the novel results are validated by literature evidence. The idea behind this approach is to provide a functional context for the features that are being used in classification, thereby improving the biological relevance, robustness and reproducibility of the resulting models [52]. Multi-task learning schemes have come into prominence as a consequence of their capability of exploiting inter-drug relationships during training. While multi-task learning for drug activity prediction was largely limited to regularized linear models before, more sophisticated approaches have emerged recently. An approach called Kernelized Bayesian Multi-task Learning (KBMTL), designed by Gonen and Margolin [23] and relying on kernel-based dimension reduction and multi-task learning, demonstrated notable performance for drug response prediction. A similar work by Yuan et al. [59] also demonstrates the improved predictive power of multi-task learning with trace-norm regularization.

The usage of ensemble methods was also found promising in the drug response prediction task, as in many other problems. Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In a DREAM challenge on predicting human responses to toxic compounds, the winners of both sub-tasks utilized an ensemble of random forest models where each forest was devoted to a single cluster of cell lines [18]. None of these methods consider using both multi-task and single-task methods together in an ensemble model.


3.3 Caret Library

The caret package is a set of functions that attempt to standardize the predictive modeling process for regression and classification problems. A total of 238 predictive models and a set of different tools are currently available in caret, such as data preprocessing, model tuning through resampling, feature selection and estimation of variable importance, among other functions. In the R programming language there are many different libraries and functions for machine learning algorithms, and the syntax for model training or prediction varies among these libraries. The caret package is designed to provide a smooth, uniform interface to the underlying functions, so common tasks like parameter tuning and variable importance follow a standard syntax. Because 30 R packages are used within caret, it does not load all of them at package start, which dramatically reduces the package start time. Caret loads the packages as needed and assumes they are installed; if a modeling package is missing, a message prompts the user to install it.

3.3.1 Caret Model

The model structure in caret is made up of two parts: train and train control. The train control specifies the type of resampling. By default, simple bootstrap resampling is used, but others are available: repeated k-fold cross-validation (once or repeated), leave-one-out cross-validation, out-of-bag estimates and other bootstrap resampling methods can be used by train. The last option, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. The number of folds in k-fold cross-validation, or the number of resampling iterations for bootstrapping and leave-group-out cross-validation, can also be specified. For train, on the other hand, the first two arguments are the predictor and outcome data objects, respectively. The third argument, method, specifies the type of model (among the 238 available models). Thus, a predictive algorithm using the caret package can be described in 7 lines of code, as shown in the example in figure 3.2.

Figure 3.2: Code for predictive model glmnet in CARET
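As a minimal sketch of this kind of call (the object names x and y and the tuning values are illustrative assumptions, not necessarily the exact code shown in the figure):

    library(caret)

    # Resampling setup: 10-fold cross-validation repeated 3 times
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

    # x: predictor matrix, y: outcome factor (assumed already prepared)
    model <- train(x, y,
                   method    = "glmnet",   # Elastic-Net model
                   trControl = ctrl,
                   tuneGrid  = expand.grid(alpha  = c(0.1, 0.5, 1.0),
                                           lambda = 10^seq(-3, 0, length = 4)))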


Once the model parameter values and the type of resampling have been defined, the training process produces a profile of performance measures that guides the user as to which tuning parameter values should be chosen. By default, the function automatically chooses the tuning parameters associated with the best performance value. In figure 3.3 we can see a result with four columns. The first two are the tuning parameters specific to the glmnet model; the number of such columns depends on the number of parameters in the model used. The column labeled "Accuracy" is the overall agreement rate averaged over the cross-validation iterations; the agreement standard deviation is also calculated from the cross-validation results. The column “Kappa” is Cohen’s (unweighted) Kappa statistic averaged across the resampling results. For these models, train can automatically create a grid of tuning parameters: by default, if p is the number of tuning parameters, the grid size is 3^p. In general, the train function can be used to evaluate, using resampling, the effect of model tuning parameters on performance, to choose the “optimal” model across these parameters, and to estimate model performance from a training set.

Figure 3.3: Performance of predictive model glmnet in CARET
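Continuing the sketch above (assuming the hypothetical model object from the previous example), the tuning profile and the automatically selected parameters of a fitted train object can be inspected as follows:

    # Accuracy and Kappa averaged over the resampling iterations,
    # one row per (alpha, lambda) combination in the tuning grid
    model$results

    # Tuning parameter values chosen for the best performance value
    model$bestTune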


3.3.2 Model Selection

Not all machine learning methods are applicable to all situations. Some are better suited to certain kinds of problems than others, and their assumptions or data requirements can render them inapplicable to the problem being analyzed [26]. It was decided to use four types of machine learning algorithms: Random Forest, Support Vector Machine, Elastic-Net and Extreme Gradient Boosting. Keep in mind that caret implements a large number of algorithms, so it is normal to find that SVM has more than 17 variations and Elastic-Net has three different variations.

There is thus the necessity to filter and compare the variations to select the best ones. Three factors are considered when choosing which variation of an algorithm to use. The accuracy result is taken into account first: the variation with the highest accuracy is chosen. If the accuracies are very similar, the time it takes to run each algorithm is considered, and the one that takes the least time is chosen. Finally, when both the accuracy and the time are similar between the algorithms, the one that uses the fewest features is preferred; a sketch of how such a comparison can be made in caret follows.
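In caret, this kind of comparison can be made with the resamples() helper, which collects the resampling results of several fitted models. The model objects below (svm_fit, rf_fit, glmnet_fit) are hypothetical train() results, assumed to have been fitted with the same resampling indices:

    # Collect cross-validation results from several candidate models
    comparison <- resamples(list(SVM    = svm_fit,
                                 RF     = rf_fit,
                                 GLMNET = glmnet_fit))

    summary(comparison)   # accuracy and Kappa distributions per model
    comparison$timings    # training time taken by each model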

The pre-selected algorithms for a preliminary comparison are shown in table 3.1. The table provides the type of predictive model, its caret method name, as well as the parameters to be tuned for each method.


Model                                        | method            | Type                       | Tuning Parameters
---------------------------------------------|-------------------|----------------------------|------------------
Elastic Net Generalized Linear Model         | glmnet            | Classification, Regression | alpha, lambda
Elastic Net                                  | enet              | Regression                 | fraction, lambda
Multi-Step Adaptive MCP-Net                  | msaenet           | Classification, Regression | alphas, nsteps, scale
eXtreme Gradient Boosting                    | xgbTree           | Classification, Regression | nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample
eXtreme Gradient Boosting                    | xgbDART           | Classification, Regression | nrounds, max_depth, eta, gamma, subsample, colsample_bytree, min_child_weight, rate_drop, skip_drop
eXtreme Gradient Boosting                    | xgbLinear         | Classification, Regression | nrounds, lambda, alpha, eta
Random Forest                                | rf                | Classification, Regression | mtry
Conditional Inference Random Forest          | cforest           | Classification, Regression | mtry
Rotation Forest                              | rotationForest    | Classification             | K, L
L2 Regularized SVM with Linear Kernel        | svmLinear3        | Classification, Regression | cost, Loss
L2 Regularized Linear SVM with Class Weights | svmLinearWeights2 | Classification             | cost, Loss, weight
SVM with Linear Kernel                       | svmLinear         | Classification, Regression | C
Least Squares SVM                            | lssvmLinear       | Classification             | tau

Table 3.1: Preliminary predictive models and their respective tuning parameters
