• No se han encontrado resultados

CONCLUSIONES PERSONALES

Decision trees can easily be converted into a set of logical formulae. A property that is very desirebale since it allows the user to understand how examples are classied, which is nearly impossible for many other classiers [146]. Their explanatory power and the fact that they derive rules that can be expressed as logical formulae is the main reason why decision trees are used. They are often not competitive with other classication approaches in terms of their classifcation accuracy. This is especially true for high dimensional data such as microarray or mass spectrometry measurements where a single tree is not able to reect the inherent complexity. Another disadvantage is the unstable topology of decision trees. A minor change to the training set might result in a substantially dierent tree model [151]. To overcome those limitations ensemble learning methods generate many classiers and aggregate their results [152]. Two well- known methods are Boosting [153] and Bagging [154] of classication trees. In Boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In Bagging successive trees do not depend on earlier trees. Each tree is independently constructed using a bootstrap sample of the data set. A classication tree Tk is grown for each perturbation Λk of the learning set, k = 1, 2, . . . , K; a test observation x is dropped down each tree; and the classier predicts the class of that observation by that class that enjoys the largest number of total votes over all of the trees. Meaning that in the end a simple majority vote is taken for prediction. An improved variant of the Bagging approach, called random forest [155], was introduced by Leo Breiman [147,156,157].

1.8.1 Random Forest

Like bagging the random forest[155] tree ensemble constructs each tree using B bootstrap samples Λb of the learning set Λ. In addtion to this well known randomization scheme it also introduces a randomization component into tree construction itself, such that for the tree Tb each node is split using the best split among a randomly chosen subset of m variables. These are the only two tuning parameters for a random forest: the number m of variables randomly chosen as a subset at each node and the number B of bootstrap samples. The procedure is relatively insensitive to a wide range of of values of m and B. If the class of interest contains only a small proportion of the observations, as is often the case in a clinical setting, each bootstrap sample will contain only very few of the minority class observations. To prevent the resulting poor class prediction for the minority class, the majority class is undersampled (balanced random forest (BRF)) for each bootstrap sample [158,159]. After classier construction, a random forest predictor score (RF- score) based on the percentage of trees voting for a specic class is calculated. Most of the time a simple major vote (RF-score ≥ 50%) is used for class assignment. While the random forest algorithm seems to apply a counterintuitive strategy, it turns out to perform very well compared to other classiers [160,161] and is robust against overtting [155]. The random forest methodology framework can even be applied to methods other than decision trees [162]. Another advantage is an internal unbiased estimation of

classier performance similar to cross-validation. At each bootstrap iteration the data not in the bootstrap sample (out of bag (OOB) data) is used to estimate the error rate. The mean error estimation over all bootstrap iterations is referred to as the OOB error.

1.8.2 Random Survival Forest

An extension of the random forest method to right censored data is given by random survival forests [163]. It follows the same principles of constructing an ensemble of base learners but uses survival trees [163] instead of standard decision trees [146]. Those survival trees are a special type of decision trees, which are constructed using an obser- vation data set of time, status, event [164] and associated predictor covariates [165]. For this we denote survival time and censorship information for each of the n individuals within h by

(t1, δ1), . . . , (tn, δn) (1.12) An individual with no event at time tl is considered censored (δl = 0), while an event at time tlis denoted by δl = 1. The splits at a node h are of the form x ≤ c or x > c. Given the value xl of x for an individual l = 1, . . . , n, the number of observations assigned to each of the two daughter nodes created after a split is given by:

n1 = #(l : x1 ≤ c) (1.13)

n2 = #(l : x1 > c) (1.14)

The splits at each node are chosen such that the resulting daughter nodes show a max- imum survival dierence. One survival based measure of node seperation, which can be used to choose the best split, is given by the log-rank score test. The best split can now be chosen according to the log-rank score test [166], which serves as a measure of node separation. For this we assume the predictor x has been ordered such a way that x1 ≤ x2 ≤ . . . , ≤ xn. We can now compute a ranking for each survival time tl,

al = δl− δl X k=1 δk n − δk+ 1 (1.15)

where Γk = #{t : Tt≤ Tk}. The log rank score is dened as

S(x, c) = P

k<=cal− n1¯a pn1(1 −nn1)s2a

(1.16)

with the sample mean and sample variance of {al : l = 1, . . . , n} are given by ¯a and a2

s respectively. The resulting measure of node seperation |S(x, c)| is maximized over x and c, yielding the best split. The tree reaches a saturation point when a terminal node contains at least 3 events with an unique event free time.

Ensemble Estimation

Similar to the random forest method, where the predicted class is dened by a majority vote over all trees, the ensemble cumultative hazard function is produced by the average over all trees. For each tree the cumultative hazard estimates are calculated terminal nodewise. For a node h with distinct event times let dl,hand Yl,h be equal to the number of events and individuals at risk at time tl,h. The cumultative hazard for the node h can now be dened as:

ˆ Hh(t) = X tl,h<t dl,h Yl,h (1.17)

The survival rate at a time (t) can then be calculated by the product limit (PL) method [167]: ˆ Sh(t) = Y tl,h<t (1 − dl,h Yl,h ) (1.18)

The rules making up the tree assign each individual i with the predictor xi to a specic terminal node, from which the individuals cumultative hazard ( ˆH(t|xi)) is determined. The estimate described so far is based on one tree only, while the nal ensemble cumul- tative hazard is averaged over all trees (b = 1, ..., ntree) in the ensemble. This ensemble hazard ( ˆHe(tx, i) is calculated by:

ˆ He(t|xi) = 1 ntree ntree X b=1 Hb(t, xi) (1.19)

An OOB estimate of the ensemble cumultative hazard ( ˆHe∗) can be obtained for each individual by averaging only over those trees for which the individual in question was part of the OOB set. Ensemble survival rates can be calculated analogously to ensemble cumultative hazard.

Concordance error rate

The quality of a predictive model is generally reected by its prediction error on a test set or by cross-validation. In the case of random forests or random survival forests this can be done using the OOB principle. An important aspect of this procedure is the measure used to rate the prediction error. In the case of a predictive survival model this is provided by the concordance index (C-index). The concordance index reects the probability that, given a randomly selected pair of individuals, the individual that suers from an event rst also had the worse predicted outcome. The concordance error rate is calculated by 1 minus the C-index and takes values between 0 and 1, with 0 reecting perfect concordance. A concordance error of 0.5 represents a prediction not better than random guessing. To compute the concordance index we rst have to dene what constitutes a worse prediction outcome. This can be done using the following

approach. Let t∗ 1, . . . , t

N denote all unique event times in the data. Individual i is said to have a worse outcome than j if

N X k=1 ˆ He∗(t∗k|xi) > N X k=1 ˆ He∗(t∗k|xj). (1.20)

The concordance error is calculated as follows [163]: 1: Create all pairs of observed responses over all of the data.

2: Omit those pairs where the shorter event time is censored. Also omit pairs i and j if Ti = Tj unless γi = 1 and γj = 0 or γi = 0 and γj = 1. The total number of permissible pairs is denoted by Permissible-Pairs

3: For each permissible pair in which the shorter event time had the worse predicted outcome, add 1 to the running sum p. If the predictions are tied count add 0.5 to the sum. Concordance = p now denotes the total sum over all permissible comparissons.

4: Dene the concordance index (C-index) as

C = Concordance

P ermissible − P airs (1.21) and the concordance error by 1 - C

The concordance index can also be applied to uncensored data where it estimates the Mann-Whitney parameter [168]. In this case it also follos the same idea as the Kendall rank correlation coecient [169]. A possible C-index application on uncensored data is the comparisson of microarray and qRT-PCR results.