Most machine learning (ML) methods are designed to build a model that performs a specific task without being explicitly programmed, by learning the underlying patterns from training data. The training data is expected to convey the true patterns, which might be previously unknown, and the ML methods are expected to capture them. Supervised learning methods are used to develop predictive models from training data that have class labels. They aim to learn a decision surface that separates samples of different classes as well as possible. Such methods include the classification methods used to predict categorical outcomes (e.g., Naïve Bayes, Decision Tree [18], Random Forest [19]), and the regression methods used to predict numeric outcomes (e.g., Logistic Regression [20], Linear Regression [21]). Unsupervised learning methods mainly deal with grouping/clustering uncategorized data (e.g., K-means [22], Hierarchical clustering methods [23]), or with finding a structure/pattern in such data for interpretation purposes (e.g., Apriori [24], FP-Growth [25]). ML has been shown to be effective for many problems in breast cancer studies. Such problems include breast cancer diagnosis (classifying whether breast cancer is present), subtype diagnosis (classifying patients into subtypes), recurrence/metastasis risk prediction, and survivability prediction (classifying whether a patient will be disease-free or survive longer than a certain amount of time, or estimating the patient's survival time). Figure 1.1 shows some types of biomarkers that can be identified by ML methods.

Figure 1.1: Some types of biomarkers.

Some studies have used only clinical information to construct ML models and have still achieved high accuracy [26–28]. For example, the authors of [27] learned prediction models from a Wisconsin Diagnostic Breast Cancer dataset obtained from the UCI machine learning repository (available at http://archive.ics.uci.edu/ml) to classify benign vs. malignant breast cancer. The dataset contains 32 clinical attributes of 458 benign patients and 241 malignant patients. They achieved more than 90% accuracy using the K-Nearest Neighbor (KNN), Naïve Bayes, Decision Tree, and Support Vector Machine (SVM) algorithms.
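To make the classification setting concrete, the following is a minimal sketch of a K-Nearest Neighbor classifier of the kind named above. It is not the pipeline of [27]: the function name `knn_predict`, the choice of k = 3, and the six two-feature samples are hypothetical toy values invented for illustration, not records from the UCI dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training samples nearest to `query`."""
    # Rank training samples by Euclidean distance to the query sample.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Majority vote over the labels of the k closest samples.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up features per sample (NOT the UCI dataset).
train_X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (1.0, 1.0)))  # benign
print(knn_predict(train_X, train_y, (3.0, 3.0)))  # malignant
```

On real clinical attributes the features would normally be rescaled first, since Euclidean distance is dominated by attributes with large numeric ranges.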

Others have relied on genomics data such as DNA methylation, gene expression (GE), mutations, copy number variations (CNVs), or copy number alterations (CNAs) to build ML models. These types of data provide more detailed information than traditional clinical and histological factors and might be more useful for discovering the underlying mechanism of the disease. For example, in [29], a cluster analysis on integrated CNV and GE data showed that breast cancers can be regrouped into a new ten-class categorization, in which each group has distinct copy number profiles and clinical outcomes. Patricio et al. developed models using SVM to predict the presence of breast cancer in women. The model based on personal information (Age, Body mass index) and clinical features computed from blood analysis (Glucose, Resistin) performed better than other combinations of personal and clinical features [30]. By applying an unsupervised method to microarray data of about 5,000 differentially expressed genes from 98 patients, Laura J. van 't Veer et al. [31] identified a set of 70 genes that can accurately separate patients into two groups: good prognosis (patients who remained disease-free for at least 5 years from the initial diagnosis) and poor prognosis (patients who developed distant metastases within 5 years).

One common challenge in using genomics data is that the number of features is usually far larger than the number of samples, and many features are redundant. This curse of dimensionality can lead to over-fitting or prevent even state-of-the-art algorithms from learning good models. In this regard, the feature selection methods in ML can be employed to extract subsets of informative features for model construction while maintaining the generalization ability of the model. The selected features can then, in turn, be considered as potential biomarkers. The feature selection methods include filter methods, wrapper methods, and hybrid methods.

Figure 1.2: Typical framework for wrapper feature selection.

Filter methods evaluate one feature at a time using a particular statistical criterion such as Information Gain [18], Chi-Squared [32], or Relief [33], so that the features can be ranked and the top ones selected. Some criteria, such as mRMR [34] and MRMD [35], can evaluate a subset of features. As they are not tailored to any specific type of classifier, filter methods are often used as a pre-processing step in ML applications. Meanwhile, wrapper approaches search for an optimal subset of features using a search algorithm along with a classification method [37–39]. The typical framework for wrapper approaches is shown in Figure 1.2. For example, in a linear forward selection (or greedy forward selection), we initialize the current optimal subset of features as an empty set and try to add new features one by one. In each forward step, the classification method (e.g., Decision Tree) evaluates the combination of the current optimal subset and each new feature, and the feature that gives the best improvement in performance (e.g., accuracy) is chosen for further extension. Wrapper methods usually perform better than filter methods because they take combinations of features into account. However, they require a lot of computation when the number of available features is large, especially with expensive classifiers such as SVM or Neural Network. In hybrid approaches, candidate features are first selected by one or more filter criteria to shrink the search space, and a wrapper method is then applied to find the final subset of features. This is especially useful when there are too many features compared to the number of samples.
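The greedy forward selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a method taken from [37–39]: the wrapped classifier is a 1-nearest-neighbor model scored by leave-one-out accuracy (swapped in for the Decision Tree example to keep the code dependency-free), the names `loo_accuracy` and `forward_selection` are hypothetical, and the data are made-up toy values.

```python
import math


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier using only `features`."""
    correct = 0
    for i in range(len(X)):
        xi = [X[i][f] for f in features]
        # Nearest other sample, measured on the selected features only.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, [X[j][f] for f in features]),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def forward_selection(X, y):
    """Greedily add the feature that most improves accuracy; stop when none does."""
    n_features = len(X[0])
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        # Evaluate the current subset extended by each remaining feature.
        score, feature = max(
            ((loo_accuracy(X, y, selected + [f]), f) for f in candidates),
            key=lambda pair: pair[0],
        )
        if score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(feature)
        best_score = score
    return selected, best_score


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [(0.1, 5.0, 1.0), (0.2, 1.0, 1.2), (0.3, 4.0, 0.9),
     (0.9, 4.5, 1.1), (1.0, 1.5, 1.0), (1.1, 5.2, 0.8)]
y = [0, 0, 0, 1, 1, 1]

print(forward_selection(X, y))  # ([0], 1.0)
```

Each forward step retrains and re-evaluates the classifier once per remaining feature, which is where the computational cost of wrapper methods comes from.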

1.3 Integrative Machine Learning Approaches

Analyzing only one type of data might lead to misleading findings in breast cancer studies for many reasons, including lack of data, noise, and heterogeneity in the data. The effective integration of many types of data and knowledge is necessary to compensate for such drawbacks and to better uncover the mechanisms of diseases. Integrative approaches have recently gained a lot of interest in the research community. In these approaches, multiple types of data are examined to provide different views of a problem, and they may be integrated into one unified view before an algorithm is applied to solve the problem. By integrating useful knowledge from different sources into the machine learning process, we can obtain more accurate models, more robust/stable findings, or biomarkers with stronger support [29,40–45]. (A classifier or biomarker is robust if its testing error is close to its training error when the training and testing data differ only by sufficiently small perturbations of the data samples [46,47].) Drier and Domany argued that gene signatures derived without reference to the underlying mechanisms of chemotherapy response do not capture meaningful biological results [48].

Thus, in [49], Dorman et al. analyzed the correlation of gene copy number, mutation, and expression with the growth inhibitory concentrations (GI50) of paclitaxel and gemcitabine in breast cancer cell lines and then in patients, using multiple factor analysis (MFA). An SVM built on the expression of 15 genes predicted paclitaxel resistance with 82% accuracy, and an SVM built on the copy number profiles of 3 genes and the expression of 7 genes achieved 85% accuracy for gemcitabine resistance prediction. Curtis et al. also suggested that integrating multiple types of genomics information may help to derive more robust patient classifiers [29].

Network-based prediction approaches are also considered integrative approaches, in which the existing relations among genes or molecules are taken into account when identifying relevant biomarkers. These methods often integrate primary data (e.g., GE, DNA methylation, CNV) with one or more secondary network data expressing the functional relationships among genes, such as gene co-expression networks, gene regulatory networks, protein-protein interaction networks (PINs), or other "omics" data. Protein-protein interaction networks capture physical interactions determined by experiments as well as computationally derived interactions. Well-maintained public databases for PINs include HPRD [50], StringDB [51], and PathwaysCommons [52]. A transcriptional regulatory network is a directed graph in which nodes are transcription factors/microRNAs or genes and edges connect a regulator to its targets. Available databases for transcriptional regulatory networks include RegNetwork [53] and TRRUST [54]. Meanwhile, a gene co-expression network is often a weighted undirected graph, and its construction depends on the problem of interest [43]. Edge weights are usually computed from GE data using a co-expression measure such as Pearson's correlation coefficient or mutual information. In [55], the authors construct a gene co-expression network based on expression quantitative trait loci (eQTLs), where the gene-gene co-regulation coefficient is calculated from the eQTLs that regulate the two genes.
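As an illustration of the Pearson-based construction mentioned above, the sketch below builds a weighted undirected co-expression network from a small expression table. It makes several assumptions that do not come from the cited works: the helper names `pearson` and `coexpression_network`, the use of the absolute correlation as edge weight, the cut-off of 0.8, and the gene identifiers and expression values, which are hypothetical toy data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Note: a constant profile has zero variance and would divide by zero here.
    return cov / (norm_x * norm_y)


def coexpression_network(expression, threshold=0.8):
    """Return {(gene_a, gene_b): weight} for gene pairs with |r| >= threshold."""
    edges = {}
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges[(gene_a, gene_b)] = abs(r)  # undirected: one entry per pair
    return edges


# Hypothetical toy expression profiles (one value per sample), not real data.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.5],
    "gene_c": [4.0, 1.0, 3.0, 2.0],
}

network = coexpression_network(expression)
print(sorted(network))  # [('gene_a', 'gene_b')]
```

Taking the absolute value treats strongly anti-correlated genes as connected too; a signed network would keep `r` itself as the weight.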

Most network-based approaches aim to identify sub-networks of interacting molecules that can serve as potential signatures for certain conditions of interest. Such signatures are called sub-network biomarkers. The rationale is that molecules tend to interact with each other to perform their functions, so functionally related genes tend to be located near each other in molecular networks [56]. In an extensive study of millions of gene-prognostic biomarkers and sub-network-prognostic biomarkers across 20 different training-testing partitions of 4960 breast cancer patients, Grzadkowski et al. [44] found that sub-network biomarkers produced higher overall performance and concordance across partitions. They also reported that integrating functional information, such as pathway data, improves biomarker performance and replicability, and that smaller biomarkers are more robust across patient cohorts. Researchers have tried different ways of assigning scores/weights to the nodes, edges, and even subnets of the networks in order to apply existing algorithms to find sub-network biomarkers that can accurately separate the classes. The search strategies include greedy search, simulated annealing, genetic algorithms, random walk with restart, and integer linear programming [41–43,55,57]. For example, in [55], after constructing a weighted undirected graph in which each edge weight expresses the gene-gene co-regulation coefficient, the authors adopted random walk with restart to find the disease genes. A node in an interaction network can be involved in thousands of edges; thus, searching for such subnet biomarkers remains a computational challenge.
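A random walk with restart on a weighted undirected graph can be sketched as below. This is a generic textbook formulation rather than the exact procedure of [55]: the function name `random_walk_with_restart`, the restart probability of 0.3, the convergence tolerance, and the four-node toy graph with its weights are all assumptions made for illustration.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at the seed nodes.

    graph: {node: {neighbor: weight}}, with every undirected edge stored in both
    directions. seeds: non-empty set of starting nodes (e.g., known disease genes).
    """
    nodes = list(graph)
    start = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    prob = dict(start)
    strength = {v: sum(graph[v].values()) for v in nodes}  # total edge weight per node
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed node.
        nxt = {v: restart * start[v] for v in nodes}
        for u in nodes:
            if strength[u] == 0:
                nxt[u] += (1 - restart) * prob[u]  # isolated node: mass stays put
                continue
            # Otherwise it moves to a neighbor in proportion to the edge weight.
            for v, w in graph[u].items():
                nxt[v] += (1 - restart) * prob[u] * w / strength[u]
        change = sum(abs(nxt[v] - prob[v]) for v in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Hypothetical toy network: a weighted path gene_a - gene_b - gene_c - gene_d.
graph = {
    "gene_a": {"gene_b": 0.9},
    "gene_b": {"gene_a": 0.9, "gene_c": 0.5},
    "gene_c": {"gene_b": 0.5, "gene_d": 0.7},
    "gene_d": {"gene_c": 0.7},
}

scores = random_walk_with_restart(graph, seeds={"gene_a"})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

Nodes are then ranked by their stationary probability, so genes that are close to the seeds through strongly weighted edges receive higher scores. Each iteration touches every edge once, which is why very dense interaction networks make such searches expensive.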