To estimate the piecewise constant profile of the copy number events, we first use the estimators of $k_0$ and $t_0$ from the latest version of the mBPCR method (see Chapter 3):

\[ \hat{K}_{01} = \arg\max_{k \in \mathcal{K}} \, p(k \mid Y, \mathit{cn}), \qquad (4.2) \]

\[ \hat{T}_{BinErrAk} = \arg\max_{t' \in \mathcal{T}_{\hat{k},n}} E\left[ \sum_{q=1}^{\hat{k}-1} \sum_{p=1}^{k_0-1} \delta_{t'_q,\, t_{0,p}} \,\middle|\, Y, \mathit{cn} \right]. \qquad (4.3) \]

As we previously saw, $\hat{T}_{BinErrAk}$ corresponds to the $\hat{k}_{01}$ positions which have the highest posterior probability to be a breakpoint. One difference from mBPCR lies in the level estimation. In the copy number model the levels were continuous random variables, whereas here they take categorical values. Hence, they are still estimated separately (as before), but with the MAP estimator instead of the posterior expected value,
\[ \hat{Z}_p = \arg\max_{z \in \{-2,-1,0,2\}} P(Z_p = z \mid Y, \hat{t}, \hat{k}, \mathit{cn}), \qquad (4.4) \]

where $\hat{t}$ and $\hat{k}$ are any estimates of $t_0$ and $k_0$, respectively.
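Let us define $y_{i,j} = (y_{i+1}, \ldots, y_j)$, representing the LOH data points in the interval $[i+1, j]$, and $K_{i,j}$ as the random variable which represents the number of segments in the interval $[i+1, j]$. Using Bayes' theorem and the independence of the LOH data points belonging to different segments, the probability in Equation (4.4), given the LOH data $y$, can be written as

\[
\begin{aligned}
P(Z_p = z \mid y, \hat{t}, \hat{k}, \mathit{cn})
&= P(Z_p = z \mid y_{\hat{t}_{p-1},\hat{t}_p}, \hat{t}_{p-1}, \hat{t}_p, \hat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1, \mathit{cn}) \\
&= \frac{P(y_{\hat{t}_{p-1},\hat{t}_p} \mid Z_p = z, \hat{t}_{p-1}, \hat{t}_p, \hat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1)\; P(Z_p = z \mid \hat{t}_{p-1}, \hat{t}_p, \hat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1, \mathit{cn})}{P(y_{\hat{t}_{p-1},\hat{t}_p} \mid \hat{t}_{p-1}, \hat{t}_p, \hat{K}_{\hat{t}_{p-1},\hat{t}_p} = 1, \mathit{cn})}. \qquad (4.5)
\end{aligned}
\]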
Therefore, if the boundary estimator misses a clear boundary between $\hat{t}_{p-1}$ and $\hat{t}_p$, the probability in the denominator of Equation (4.5) could be zero, and the level could then not be estimated. The only way to prevent this is to use a good estimator for the boundaries.
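The role of the denominator in Equation (4.5) can be illustrated with a minimal sketch. It assumes that the segment likelihoods and the prior probabilities of the four levels have already been computed elsewhere; the function name `map_level` and all numbers are hypothetical and only serve the illustration. The denominator is obtained by summing the numerator over the levels (law of total probability), and the function returns `None` when that sum is zero, which is the situation described above.

```python
from typing import Dict, Optional

LEVELS = (-2, -1, 0, 2)  # categorical level values appearing in Equation (4.4)


def map_level(likelihood: Dict[int, float], prior: Dict[int, float]) -> Optional[int]:
    """Return the MAP level of one segment, or None if the evidence is zero.

    likelihood[z] stands for P(y_segment | Z_p = z, ...) and prior[z] for
    P(Z_p = z | ..., cn); both are assumed to be given (hypothetical inputs).
    """
    # Numerator of Equation (4.5) for every candidate level.
    joint = {z: likelihood[z] * prior[z] for z in LEVELS}
    # Denominator of Equation (4.5): sum of the numerators over the levels.
    evidence = sum(joint.values())
    if evidence == 0.0:
        # The data in the segment are incompatible with every level,
        # e.g. because a clear boundary inside the segment was missed.
        return None
    posterior = {z: joint[z] / evidence for z in LEVELS}
    return max(posterior, key=posterior.get)


# Hypothetical numbers for one well-delimited segment.
prior = {-2: 0.1, -1: 0.3, 0: 0.5, 2: 0.1}
likelihood = {-2: 0.0, -1: 0.02, 0: 0.10, 2: 0.01}
print(map_level(likelihood, prior))  # 0

# A segment whose data have zero likelihood under every level.
print(map_level({z: 0.0 for z in LEVELS}, prior))  # None
```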
Fig. 4.3 Example of estimated posterior probabilities to be a breakpoint. The graph shows, for each probe, the estimated posterior probability to be a breakpoint on a sample of dataset B.
Previously, in Subsection 3.1.6, we found that the boundary estimator $\hat{T}_{BinErrAk}$ has a high sensitivity but a medium FDR. The problem with this estimator is the following. The vector $p$ of the posterior probabilities to be a breakpoint at each point of the sample usually represents a multimodal function with maxima at the breakpoint positions, but, because of the uncertainty, other points in a neighborhood of each maximum often have high probability as well (see Figure 4.3). If we take the $k_0$ points with the highest probability (according to the definition of $\hat{T}_{BinErrAk}$), we could select points in the neighborhood of the higher maxima and miss some maxima with a lower probability (see Figure 4.3). Thus, if $k_0$ were estimated exactly, the sensitivity of $\hat{T}_{BinErrAk}$ would be lower. In this case, we could lose important breakpoints, so that the denominator in Equation (4.5) would become zero. In practice, $\hat{K}_{01}$ often slightly overestimates $k_0$ because of the high noise of the data, so this phenomenon should not occur; to prevent even this rare case, however, we searched for a way to improve the estimation of the boundaries.
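The effect can be reproduced on a toy example. The sketch below uses a made-up probability vector (not data from the thesis) with a high, wide peak around index 3 and a lower peak at index 9, and a hypothetical helper `top_k_positions` that simply keeps the positions with the largest values. With the exact number of breakpoints the lower peak is lost; with a slight overestimate it is recovered, at the price of extra positions next to the high peak.

```python
from typing import List, Sequence


def top_k_positions(p: Sequence[float], k: int) -> List[int]:
    """Indices (0-based) of the k largest entries of p, in increasing order."""
    ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    return sorted(ranked[:k])


# Made-up posterior probabilities to be a breakpoint: true breakpoints are
# assumed to sit at indices 3 and 9.
p = [0.00, 0.01, 0.30, 0.90, 0.35, 0.02, 0.00, 0.01, 0.05, 0.25, 0.04, 0.00]

# Exact number of breakpoints: the neighbor of the high peak wins over the
# lower peak, so the breakpoint at index 9 is missed.
print(top_k_positions(p, 2))  # [3, 4]

# Slight overestimate: index 9 is recovered, together with two false positives.
print(top_k_positions(p, 4))  # [2, 3, 4, 9]
```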
Since the vector of the posterior probabilities usually shows the positions of the breakpoints clearly as maxima, we estimate the number of segments and the breakpoints with the number of peaks and the locations of their maxima, respectively (see the next subsubsection). After applying a kernel method to reduce the noise of the function, the algorithm for the determination of the peaks uses two thresholds: one for the determination of the peaks ($thr_1$) and one for the definition of the values close to zero ($thr_2$). We will denote the corresponding estimators by $\hat{K}_{Peaks,thr_1,thr_2}$ and $\hat{T}_{Peaks,thr_1,thr_2}$.
In Subsection 4.6.1, we will consider several pairs of thresholds and apply the corresponding estimators to simulated data, in order to determine the best pair of thresholds and to compare their performance with $\hat{T}_{BinErrAk}$. We will also compare $\hat{T}_{BinErrAk}$ with $\hat{T}_{Joint}$, another boundary estimator described in Subsection 3.1.4.
Algorithm to determine maxima of a multimodal function
We have just introduced the paired estimators $(\hat{K}_{Peaks,thr_1,thr_2}, \hat{T}_{Peaks,thr_1,thr_2})$ for the number of segments and the breakpoints. They correspond to the number and the locations of the peaks of the vector $p$ of the posterior probabilities to be a breakpoint at each SNP location. To compute them, we derived an algorithm for the determination of the maxima of a multimodal function.
Let us assume that we have to determine the positions of the maxima of a multimodal function $f$ and that we know its values at the positions $\{1, \ldots, n\}$ (denoted by $f = \{f_1, \ldots, f_n\}$). Moreover, the values $f$ are affected by noise (in fact, in our case $f$ is the posterior probability to be a breakpoint at each position, which depends on the estimates of the parameters).
In this framework, we have derived an algorithm to determine the positions of the maxima of $f$:
1. Denoising of $f$. In order to denoise the function, we use a regression method with kernel basis, obtaining $\hat{f}$.
2. Selection of only one position per peak. We identify the positions which belong to the same peak through a threshold $thr_1$ (i.e. an interval $A$ corresponds to a peak if all elements of $\hat{f}_A$ are greater than $thr_1$). Then, among the positions belonging to the same peak, we select the one with the highest value of $\hat{f}$. The vector of candidate locations is called $q_0$.
3. Final selection of the peak locations. Lastly, we choose all locations $i \in q_0$ such that $\hat{f}_i > thr_2$. The new vector of locations is denoted by $q_1$.
The use of a second threshold is necessary because the function $f$ (i.e. the estimated posterior probabilities, in our case) can have small peaks, due to the noise, even where it takes values very close to zero. Moreover, since we cannot estimate more than $k_{max}$ breakpoints (because of the definition of the prior of $K$), if more than $k_{max}$ peaks are selected, then the algorithm chooses the ones corresponding to the $k_{max}$ highest values of the set $\{\hat{f}_i \mid i \in q_1\}$.
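The three steps and the final cap at $k_{max}$ can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the text only says "a regression method with kernel basis", so the Nadaraya-Watson smoother with a Gaussian kernel and its `bandwidth` parameter are swapped-in choices, and the function names are hypothetical. Positions are 0-based in the code, whereas the text uses $\{1, \ldots, n\}$.

```python
import math
from typing import List, Sequence


def kernel_smooth(f: Sequence[float], bandwidth: float = 1.0) -> List[float]:
    """Step 1 (assumed variant): Nadaraya-Watson smoother, Gaussian kernel."""
    n = len(f)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        smoothed.append(sum(w * v for w, v in zip(weights, f)) / sum(weights))
    return smoothed


def find_peaks(f_hat: Sequence[float], thr1: float, thr2: float, k_max: int) -> List[int]:
    """Steps 2 and 3 plus the k_max cap, applied to a denoised vector f_hat."""
    # Step 2: each maximal run of consecutive positions with f_hat > thr1 is
    # one peak; keep the position with the highest value in each run (q0).
    q0: List[int] = []
    run: List[int] = []
    for i, value in enumerate(f_hat):
        if value > thr1:
            run.append(i)
        elif run:
            q0.append(max(run, key=lambda j: f_hat[j]))
            run = []
    if run:  # a peak that reaches the end of the vector
        q0.append(max(run, key=lambda j: f_hat[j]))

    # Step 3: discard peaks whose maximum is still close to zero (q1).
    q1 = [i for i in q0 if f_hat[i] > thr2]

    # Cap: keep only the k_max highest peaks, reported in increasing position.
    if len(q1) > k_max:
        q1 = sorted(sorted(q1, key=lambda j: f_hat[j], reverse=True)[:k_max])
    return q1


# Made-up, already denoised vector with two real peaks and one tiny bump.
f_hat = [0.00, 0.01, 0.30, 0.90, 0.35, 0.02, 0.04, 0.01, 0.05, 0.25, 0.04, 0.00]
print(find_peaks(f_hat, thr1=0.03, thr2=0.10, k_max=5))  # [3, 9]
print(find_peaks(f_hat, thr1=0.03, thr2=0.10, k_max=1))  # [3]
```

In this toy vector the bump at index 6 passes $thr_1$ and enters $q_0$, but it is removed in step 3 because its value does not exceed $thr_2$. On raw data one would first call `kernel_smooth` and pass its output to `find_peaks`.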
The described algorithm depends on the values of the thresholds $thr_1$ and $thr_2$. In the simulations in Section 4.6, we will try several pairs of the following types of thresholds:
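• $thr_{005} = \max(0.005, \text{quantile of } \hat{p} \text{ at } 0.95)$
• $thr_{01} = \max(0.01, \text{quantile of } \hat{p} \text{ at } 0.95)$
• $thr_{01\ 90} = \max(0.01, \text{quantile of } \hat{p} \text{ at } 0.90)$
• $thr_{mad} = \mathrm{median}(\hat{p}) + 3 \cdot \mathrm{mad}(\hat{p})$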
where mad is the median absolute deviation. All these thresholds derive from different definitions of which probability values are to be considered significant.
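As a rough sketch, the four threshold types could be computed as below. Two points are assumptions, because the text does not fix them: the empirical quantile is taken with linear interpolation between order statistics, and mad is left unscaled (`scale=1.0`), whereas some implementations, for example R's `mad`, multiply by 1.4826 by default. The function names and dictionary keys are hypothetical labels.

```python
import math
import statistics
from typing import Dict, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical quantile with linear interpolation (assumed convention)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def mad(values: Sequence[float], scale: float = 1.0) -> float:
    """Median absolute deviation from the median, times an optional scale."""
    med = statistics.median(values)
    return scale * statistics.median([abs(v - med) for v in values])


def thresholds(p_hat: Sequence[float]) -> Dict[str, float]:
    """The four threshold types listed above, for a probability vector p_hat."""
    return {
        "thr005": max(0.005, quantile(p_hat, 0.95)),
        "thr01": max(0.01, quantile(p_hat, 0.95)),
        "thr01_90": max(0.01, quantile(p_hat, 0.90)),
        "thrmad": statistics.median(p_hat) + 3 * mad(p_hat),
    }


# When all probabilities are zero, the quantile-based thresholds fall back
# to their floors and the mad-based threshold is zero.
print(thresholds([0.0] * 10))
```

The floors 0.005 and 0.01 matter precisely in flat regions like the one in the usage line, where the quantile alone would give a threshold of zero.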