
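To estimate the piecewise constant profile of the copy number events, first we use the estimators for $k^0$ and $t^0$ of the last version of the mBPCR method (see Chapter 3):

\[
\widehat{K}_{01} = \arg\max_{k \in \mathcal{K}} \; p(k \mid Y, cn), \tag{4.2}
\]

\[
\widehat{T}_{BinErrAk} = \arg\max_{t' \in \mathcal{T}_{\hat{k},n}} \; E\left[ \sum_{q=1}^{\hat{k}-1} \sum_{p=1}^{k^0-1} \delta_{t'_q,\, t^0_p} \;\middle|\; Y, cn \right]. \tag{4.3}
\]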

As we previously saw, $\widehat{T}_{BinErrAk}$ corresponds to the $\hat{k}_{01}$ positions which have the highest posterior probability to be a breakpoint. One difference with respect to mBPCR lies in the level estimation. While in the copy number model the levels were continuous random variables, now they assume categorical values. Hence, they are still estimated separately (as before), but with the MAP estimator instead of the posterior expected value,

\[
\widehat{Z}_p = \arg\max_{z = -2,-1,0,2} \; P(Z_p = z \mid Y, \hat{t}, \hat{k}, cn), \tag{4.4}
\]

where $\hat{t}$ and $\hat{k}$ are any estimates of $t^0$ and $k^0$, respectively.

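Let us define $y_{i,j} = (y_{i+1}, \ldots, y_j)$, representing the LOH data points in the interval $[i+1, j]$, and $K_{i,j}$ as the random variable which represents the number of segments in the interval $[i+1, j]$. Using Bayes' theorem and the independence of the LOH data points belonging to different segments, the probability in Equation (4.4), given the LOH data $y$, can be written as

\[
\begin{aligned}
P(Z_p = z \mid y, \hat{t}, \hat{k}, cn)
&= P\big(Z_p = z \mid y_{\hat{t}_{p-1},\hat{t}_p},\, \hat{t}_{p-1},\, \hat{t}_p,\, \widehat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1,\, cn\big) \\
&= \frac{P\big(y_{\hat{t}_{p-1},\hat{t}_p} \mid Z_p = z,\, \hat{t}_{p-1},\, \hat{t}_p,\, \widehat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1\big)\; P\big(Z_p = z \mid \hat{t}_{p-1},\, \hat{t}_p,\, \widehat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1,\, cn\big)}{P\big(y_{\hat{t}_{p-1},\hat{t}_p} \mid \hat{t}_{p-1},\, \hat{t}_p,\, \widehat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1,\, cn\big)}.
\end{aligned}
\tag{4.5}
\]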

Therefore, if the boundary estimator misses a clear boundary between $\hat{t}_{p-1}$ and $\hat{t}_p$, then the probability in the denominator of Equation (4.5) could be zero, because the data in that interval may be impossible under the assumption of a single segment, and thus the level would not be estimated. The only way to prevent this event is to use a good estimator for the boundaries.
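The role of the denominator in Equation (4.5) can be illustrated with a minimal Python sketch. The likelihood and prior values below are invented for illustration only, and `map_level` is a hypothetical helper rather than part of mBPCR; the sketch simply shows that the MAP level is undefined when the evidence vanishes.

```python
LEVELS = (-2, -1, 0, 2)  # categorical levels appearing in Equation (4.4)

def map_level(likelihood, prior):
    """Return (MAP level, posterior) for one segment.

    Returns (None, {}) when the evidence, i.e. the denominator of
    Bayes' theorem, is zero and the posterior cannot be normalised.
    """
    evidence = sum(likelihood[z] * prior[z] for z in LEVELS)
    if evidence == 0.0:
        return None, {}
    posterior = {z: likelihood[z] * prior[z] / evidence for z in LEVELS}
    return max(posterior, key=posterior.get), posterior

# Hypothetical uniform prior over the levels (an assumption of this sketch).
prior = {z: 0.25 for z in LEVELS}

# Segment whose data are compatible with at least one level.
lik_ok = {-2: 0.0, -1: 0.02, 0: 0.001, 2: 0.004}
z_hat, post = map_level(lik_ok, prior)

# Segment spanning a missed boundary: no single level can explain the data.
lik_missed = {z: 0.0 for z in LEVELS}
z_missed, _ = map_level(lik_missed, prior)
```

In the first case the sketch selects the level −1; in the second it returns `None`, which mirrors the situation in which the level cannot be estimated.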

Fig. 4.3 Example of estimated posterior probabilities to be a breakpoint. The graph shows, for each probe, the estimated posterior probability to be a breakpoint on a sample of dataset B.

Previously, in Subsection 3.1.6, we found that the boundary estimator $\widehat{T}_{BinErrAk}$ has a high sensitivity, but a medium FDR. The problem with this estimator is the following. The vector $p$ of the posterior probabilities to be a breakpoint at each point of the sample usually represents a multimodal function with maxima at the breakpoint positions, but often, in a neighborhood of each maximum, there are other points with high probability because of the uncertainty (see Figure 4.3). If we take the $k^0$ points with the highest probability (according to the definition of $\widehat{T}_{BinErrAk}$), we could select points in the neighborhood of the higher maxima and miss some maxima with a lower probability (see Figure 4.3). Thus, if $k^0$ were estimated with its exact value, the sensitivity of $\widehat{T}_{BinErrAk}$ would be lower. In this case, we could lose important breakpoints, so that the denominator in Equation (4.5) would become zero. In practice, $\widehat{K}_{01}$ often slightly overestimates $k^0$ because of the high noise of the data, and thus this phenomenon should not happen. Nevertheless, to prevent even this rare case, we searched for a way to improve the estimation of the boundaries.
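The effect described above can be reproduced on a toy example. The following Python sketch uses invented probabilities (not taken from dataset B) and a hypothetical helper `top_k_positions`; positions are 0-based indices. It shows that selecting the highest-probability points can pick neighbours of a high maximum and skip a lower one, unless the number of selected points is overestimated.

```python
def top_k_positions(p, k):
    """Return the k positions with the highest probability, sorted by position."""
    ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    return sorted(ranked[:k])

# Toy posterior breakpoint probabilities for 12 probes (made-up values).
# Maxima are at positions 3 and 9; the first peak is high and broad,
# the second one is lower.
p = [0.00, 0.01, 0.30, 0.60, 0.35, 0.02, 0.00, 0.01, 0.05, 0.25, 0.04, 0.00]

exact_k = top_k_positions(p, 2)  # exact number of points: [3, 4], position 9 is missed
over_k = top_k_positions(p, 4)   # overestimated number: [2, 3, 4, 9], position 9 is found
```

With the exact number of points, the second maximum is lost to a neighbour of the first one; with an overestimated number, it is recovered at the cost of two extra points next to position 3.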

4.1 Model 1: relationship between LOH and copy number data 129

Since the vector of the posterior probabilities usually shows the position of the breakpoints clearly in correspondence to the maxima, we estimate the number of the segments and the breakpoints with the number of peaks and the locations of their maxima, respectively (see the next subsubsection). After applying a kernel method to reduce the noise of the function, the algorithm for the determination of the peaks uses two thresholds: one for the determination of the peaks ($thr_1$) and one for the definition of the values close to zero ($thr_2$). We will denote the corresponding estimators by $\widehat{K}_{Peaks,thr_1,thr_2}$ and $\widehat{T}_{Peaks,thr_1,thr_2}$.

In Subsection 4.6.1, we will consider several pairs of thresholds and we will apply the corresponding estimators to simulated data, in order to determine the best pair of thresholds and to compare their performance with $\widehat{T}_{BinErrAk}$. We will also compare $\widehat{T}_{BinErrAk}$ with $\widehat{T}_{Joint}$, another boundary estimator described in Subsection 3.1.4.

Algorithm to determine maxima of a multimodal function

We have just introduced the paired estimators $(\widehat{K}_{Peaks,thr_1,thr_2}, \widehat{T}_{Peaks,thr_1,thr_2})$ for the number of segments and the breakpoints. They correspond to the number and the locations of the peaks of the vector $p$ of the posterior probabilities to be a breakpoint at each SNP location. To compute them, we derived an algorithm for the determination of the maxima of a multimodal function.

Let us assume that we have to determine the positions of the maxima of a multimodal function $f$ and that we know its values at the positions $\{1, \ldots, n\}$ (denoted by $f = \{f_1, \ldots, f_n\}$). Moreover, the values of $f$ are affected by noise (in fact, in our case $f$ is the posterior probability to be a breakpoint at each position, which depends on the estimates of the parameters).

In this framework, we have derived an algorithm to determine the positions of the maxima of $f$:

1. Denoising of $f$. In order to denoise the function, we use a regression method with kernel basis, obtaining $\hat{f}$.

2. Selection of only one position per peak. We identify the positions which belong to the same peak through a threshold $thr_1$ (i.e. an interval $A$ corresponds to a peak if all elements of $\hat{f}_A$ are greater than $thr_1$). Then, among the positions belonging to the same peak, we select the one with the highest value of $\hat{f}$. The vector of candidate locations is called $q_0$.

3. Final selection of the peak locations. Lastly, we choose all locations $i \in q_0$ such that $\hat{f}_i > thr_2$. The new vector of locations is denoted by $q_1$.

The use of a second threshold is necessary because the function $f$ (i.e. the estimated posterior probabilities, in our case) can have small peaks even where it assumes values very close to zero (due to the noise). Moreover, since we cannot estimate more than $k_{max}$ breakpoints (because of the definition of the prior of $K$), if more than $k_{max}$ peaks are selected, then the algorithm chooses the ones corresponding to the $k_{max}$ highest values of the set $\{\hat{f}_i \mid i \in q_1\}$.
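The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the text only says that a regression method with kernel basis is used, so the Nadaraya-Watson smoother with a Gaussian kernel, the `bandwidth` parameter, the function names and the toy input are all choices made here. Positions are 0-based indices.

```python
import math

def kernel_smooth(f, bandwidth=1.0):
    """Step 1: Nadaraya-Watson regression with a Gaussian kernel (assumed choice)."""
    n = len(f)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        smoothed.append(sum(w * v for w, v in zip(weights, f)) / sum(weights))
    return smoothed

def find_peaks(f, thr1, thr2, k_max, bandwidth=1.0):
    """Return (q0, q1): one candidate per peak, then the retained peak locations."""
    f_hat = kernel_smooth(f, bandwidth)

    # Step 2: each maximal run of positions with f_hat > thr1 is one peak;
    # keep the position with the highest smoothed value in each run.
    q0, run = [], []
    for i, value in enumerate(f_hat):
        if value > thr1:
            run.append(i)
        elif run:
            q0.append(max(run, key=lambda j: f_hat[j]))
            run = []
    if run:
        q0.append(max(run, key=lambda j: f_hat[j]))

    # Step 3: discard candidates whose smoothed value is not above thr2.
    q1 = [i for i in q0 if f_hat[i] > thr2]

    # At most k_max locations: keep those with the highest smoothed values.
    if len(q1) > k_max:
        best = sorted(q1, key=lambda i: f_hat[i], reverse=True)[:k_max]
        q1 = sorted(best)
    return q0, q1

# Toy input (made-up values): two real peaks and a small bump near zero.
f = [0.0] * 20
f[4], f[5], f[6] = 0.30, 0.60, 0.35
f[12] = 0.03
f[16] = 0.25

q0, q1 = find_peaks(f, thr1=0.01, thr2=0.05, k_max=5)       # q0 = [5, 12, 16], q1 = [5, 16]
_, q1_capped = find_peaks(f, thr1=0.01, thr2=0.05, k_max=1)  # q1_capped = [5]
```

In this toy case the small bump at position 12 passes the first threshold but is removed by the second one, which is the situation that motivates the use of $thr_2$.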

The described algorithm depends on the value of the thresholds $thr_1$ and $thr_2$. In the simulations in Section 4.6, we will try several pairs of the following types of thresholds:

• $thr_{005} = \max(0.005, \text{quantile of } \hat{p} \text{ at } 0.95)$

• $thr_{01} = \max(0.01, \text{quantile of } \hat{p} \text{ at } 0.95)$

• $thr_{01\_90} = \max(0.01, \text{quantile of } \hat{p} \text{ at } 0.90)$

• $thr_{mad} = \mathrm{median}(\hat{p}) + 3 \cdot \mathrm{mad}(\hat{p})$

where mad is the median absolute deviation. All these thresholds derive from different definitions of which probability values are to be considered significant.
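The four thresholds listed above can be computed as in the following Python sketch. Two details are assumptions of this sketch, since the text does not specify them: the empirical quantile uses linear interpolation between order statistics, and the median absolute deviation is taken without a scaling constant. The function names and the input vectors are hypothetical.

```python
import statistics

def quantile(values, prob):
    """Empirical quantile with linear interpolation (assumed definition)."""
    xs = sorted(values)
    pos = prob * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def mad(values):
    """Median absolute deviation, unscaled (assumed definition)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def thresholds(p_hat):
    """Compute the four threshold types from the vector of probabilities."""
    q95 = quantile(p_hat, 0.95)
    q90 = quantile(p_hat, 0.90)
    return {
        "thr005": max(0.005, q95),
        "thr01": max(0.01, q95),
        "thr01_90": max(0.01, q90),
        "thrmad": statistics.median(p_hat) + 3 * mad(p_hat),
    }

# Made-up probabilities: mostly near zero, with one clear peak value.
p_hat = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.5]
thr = thresholds(p_hat)

# Flat vector: the quantiles fall below the floors 0.005 and 0.01.
thr_flat = thresholds([0.001] * 10)
```

The second call illustrates the purpose of the `max`: when the quantile is very small, the fixed floor (0.005 or 0.01) is used instead.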