
In this chapter, we have developed three hypothesis tests for detecting significant differential binding sites (or regions) in ChIP-Seq experiments. Given two ChIP-Seq samples, test and control, the observed read counts at genomic positions are treated as paired observations, so the proposed tests analyse the difference in read counts between the test and control samples. The assumption made on the data is clear: the read counts within a genomic window are assumed to follow a Poisson distribution with a single unknown rate parameter. Hence, the difference in read counts within a genomic window follows a PD distribution with two unknown rate parameters. An advantage of working with the difference is that the difference observations are closer to independent, even though the read counts in each sample sometimes show weak dependence.
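To make the distributional assumption concrete, the following minimal sketch computes the probability function of the difference of two independent Poisson counts by direct summation rather than through the Bessel-function form. The function names, the truncation length `tail` and the example rates are illustrative assumptions and are not taken from the analysis in this chapter.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability P(X = k) for rate lam > 0 (0 for negative k)."""
    if k < 0:
        return 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def pd_pmf(z, lam_x, lam_y, tail=200):
    """P(X - Y = z) for independent X ~ Poisson(lam_x), Y ~ Poisson(lam_y).

    Sums P(X = y + z) * P(Y = y) over y. `tail` truncates the infinite sum
    and is an illustrative choice that suits small to moderate rates.
    """
    start = max(0, -z)
    return sum(poisson_pmf(y + z, lam_x) * poisson_pmf(y, lam_y)
               for y in range(start, start + tail))

# Hypothetical rates for the test and control windows.
lam_x, lam_y = 3.0, 1.5
support = range(-60, 61)
probs = [pd_pmf(z, lam_x, lam_y) for z in support]

total = sum(probs)
mean = sum(z * p for z, p in zip(support, probs))
var = sum((z - mean) ** 2 * p for z, p in zip(support, probs))
```

Under these assumptions, `mean` recovers the difference of the two rates and `var` their sum, which is the property that the moment estimates and the tests rely on.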

Figure 5.7: Highest (top panel) and lowest (lower panel) significance windows from the union of the significant results by ET, WT and LRT based on their p-values. The horizontal axis represents bins (or positions) and the vertical axis represents read counts. The windows are shown using a window size of 200 bp, which is optimal and is the size used in the analysis.

Chapter 5. Parametric statistical methods for differential binding site studies 118

To perform the tests, the two parameters of PD need to be estimated, and we seek the maximum likelihood estimates (mle). Obtaining the mle analytically is not trivial, as the derivations involve Bessel terms, so numerical optimisation methods are used instead. As the two parameters are related, one of them is optimised and the other is estimated by using a constraint. We faced a challenge in the estimation process: the optimised parameter is always positive, because we use a transformation that guarantees this, but estimating the other parameter from the constraint sometimes yields a negative estimate. This happens when one of the windows contains no reads, or very few reads compared to the other window. To solve this issue, we swap the roles of the two parameters, i.e. instead of using Equation (5.15) we use Equation (5.16), and vice versa.
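The estimation strategy can be sketched as follows. This is only an illustration under stated assumptions: we assume that the constraint linking the two rates is that their difference equals the sample mean of the differences, we use a log transformation for positivity, and we use a simple golden-section search that presumes a single maximum in place of whichever optimiser was used in the analysis. The names `pd_logpmf`, `golden_max` and `fit_pd` are hypothetical.

```python
import math

def pd_logpmf(z, lam_x, lam_y, tail=150):
    """log P(X - Y = z) by truncated direct summation (tail is illustrative)."""
    start = max(0, -z)
    total = 0.0
    for y in range(start, start + tail):
        x = y + z
        total += math.exp(-lam_x + x * math.log(lam_x) - math.lgamma(x + 1)
                          - lam_y + y * math.log(lam_y) - math.lgamma(y + 1))
    return math.log(max(total, 1e-300))  # guard against underflow

def golden_max(f, lo, hi, tol=1e-5):
    """Golden-section search for the maximiser of f on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
        else:                  # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
    return (a + b) / 2

def fit_pd(zs):
    """Estimate (lam_x, lam_y), assuming the constraint lam_x - lam_y = zbar."""
    zbar = sum(zs) / len(zs)

    def rates(theta):
        # exp() keeps the optimised rate positive. Which rate is optimised is
        # swapped according to the sign of zbar, so that the rate derived from
        # the assumed constraint can never become negative.
        small = math.exp(theta)
        return (small + zbar, small) if zbar >= 0 else (small, small - zbar)

    def loglik(theta):
        lam_x, lam_y = rates(theta)
        return sum(pd_logpmf(z, lam_x, lam_y) for z in zs)

    # The search interval on the log scale is an illustrative assumption.
    return rates(golden_max(loglik, -8.0, 5.0))

# Hypothetical difference data for one window.
zs = [2, 0, 3, 1, -1, 4, 2, 1, 0, 3, 2, 1]
lam_x_hat, lam_y_hat = fit_pd(zs)
lam_x_neg, lam_y_neg = fit_pd([-z for z in zs])
```

Choosing the optimised parameter from the sign of the sample mean is one way to realise the swap described above; negating the data simply exchanges the roles of the two estimates.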

In PD parameter estimation, we focus on the mle, as they are required in the approximated tests, WT and LRT. However, moment estimates can be provided as well. That is, for difference data $z$, where $Z \sim \mathrm{PD}(\lambda_x, \lambda_y)$, the moment estimates of the parameters are $\hat{\lambda}_x = \tfrac{1}{2}(S^2 + \bar{z})$ and $\hat{\lambda}_y = \tfrac{1}{2}(S^2 - \bar{z})$, where $S^2$ and $\bar{z}$ are the sample variance and mean of the difference $z$. However, the moment estimates do not exist if the absolute value of the sample mean is larger than the sample variance, i.e. $|\bar{z}| > S^2$. If that happened, then we would obtain a negative estimate for one of the parameters, which is the same issue we faced with the mle (and resolved there by swapping the parameters). Some researchers, such as [4], suggest setting the negative estimate to zero and the other to the absolute value of the sample mean, $|\bar{z}|$. However, if we followed that suggestion, the probability function of PD, Equation (5.1), would become unusable, because it would be either 0 or ∞. More details about PD estimators and their properties, including asymptotic properties, are given in [4, 22, 23].
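The moment estimates and their existence condition can be written as a short sketch. The function name `pd_moment_estimates` and the example data are hypothetical; the sample variance uses the usual n - 1 divisor, which is an assumption about the definition of $S^2$.

```python
def pd_moment_estimates(zs):
    """Moment estimates (lam_x, lam_y) of PD from difference data zs.

    Returns None when |zbar| > S^2, the case in which the moment
    estimates do not exist (one of them would be negative).
    """
    n = len(zs)
    zbar = sum(zs) / n
    s2 = sum((z - zbar) ** 2 for z in zs) / (n - 1)  # sample variance
    if abs(zbar) > s2:
        return None
    return (s2 + zbar) / 2, (s2 - zbar) / 2

# Hypothetical data where the estimates exist ...
ok = pd_moment_estimates([2, 0, 3, 1, -1, 4, 2, 1, 0, 3, 2, 1])
# ... and where the mean dominates the variance, so they do not.
missing = pd_moment_estimates([5, 5, 6, 5, 6, 5])
```

The second data set mimics a window in which one sample has almost no reads: the differences are large but nearly constant, so the sample variance falls below the absolute sample mean.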

The maximum likelihood estimates are asymptotically unbiased estimates for the parameters of PD. That is, the bias tends to zero as the sample size and the rate parameter go to infinity. However, we notice that large parameters exhibit larger bias when the sample is small, and their bias converges to zero more slowly than that of small parameters. This may suggest looking for a formula that gives the sufficient sample size for a given rate parameter. However, such a formula would not be useful in practice, as the actual rate parameters are unknown. On the other hand, the bias would not cause any problem in the testing methods we proposed, because the testing concerns the expected value under PD, which is the difference between the two rates. Hence, the bias would not have much effect when we consider the difference.

The tests we propose comprise one exact test (ET) and two approximated tests (WT and LRT). That is, in ET the test statistic is assumed to follow the same distribution as under the null hypothesis (which is PD), while in WT and LRT approximated distributions are assumed for the test statistics (standard normal and chi-squared, respectively). In evaluating the observed against the assumed distributions of the test statistics, we find that all three tests fit poorly when the rate parameters are very small, which is due to the discreteness. However, the fit improves quite quickly as the rate parameters increase.
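As an illustration of how an approximated test refers its statistic to the standard normal distribution, the sketch below implements a generic Wald-type test of the hypothesis that the two rates are equal. It is not the WT statistic defined earlier in this chapter: for simplicity it estimates the variance of the differences by the sample variance rather than from the mle, and the function name `wald_type_test` and the data are hypothetical.

```python
import math

def wald_type_test(zs):
    """Wald-type test of H0: lam_x = lam_y from difference data zs.

    The mean difference estimates lam_x - lam_y and its variance is
    estimated by S^2 / n (assumes S^2 > 0). Returns the statistic and a
    two-sided p-value from the standard normal approximation.
    """
    n = len(zs)
    zbar = sum(zs) / n
    s2 = sum((z - zbar) ** 2 for z in zs) / (n - 1)
    w = zbar / math.sqrt(s2 / n)
    p_value = math.erfc(abs(w) / math.sqrt(2))  # equals 2 * (1 - Phi(|w|))
    return w, p_value

# Hypothetical difference data for one window.
w, p_value = wald_type_test([2, 0, 3, 1, -1, 4, 2, 1, 0, 3, 2, 1])
w0, p0 = wald_type_test([1, -1, 2, -2])
```

With very small rate parameters the differences take only a few distinct values, so the normal approximation used here can be poor, in line with the poor fit noted above.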

Furthermore, the evaluation of the proposed tests (ET, WT and LRT) based on simulated and real data shows that LRT is the most powerful test when the rate parameters in PD are very small. However, the false-positive evaluation based on simulated data suggests that this power may be owing to the high false-positive rate associated with small rate parameters. On the other hand, ET and WT show very low false-positive rates (based on simulated data) for low rate parameters, yet still do well in the power evaluation.

In the real-data-based evaluation, ET is the most accurate test compared to WT and LRT. Although LRT controls the false-positive rate well at 5%, the observed variability in ET is much smaller than that in LRT, Figure 5.5 (top panel). Moreover, taking WT's false-positive rate into account, WT does well in terms of power in the same figure (lower panel).

To sum up, the proposed tests (ET, WT and LRT) are able to detect differential regions or binding sites under the stated assumptions. Based on the false-positive evaluation at the given level of significance, the tests show reasonable control of the false-positive rate, especially when the rate parameters are not very small. Based on the power evaluation, the tests are able to detect significant differential regions with quite small observed differences in rate parameters. Based on the false-positive and power evaluations together, ET seems to be the most accurate test, followed by WT and LRT.


Chapter 6 Comparison study

6.1 Introduction

In this chapter, we perform a comparison study of the methods for differential binding site analysis using ChIP-Seq data that have been discussed in Chapters 4 and 5. These methods are MACS and MAnorm, diffReps, the exact test (ET), the Wald test (WT) and the likelihood ratio test (LRT).

The rest of this chapter is organised as follows. In Section 6.2, we perform comparisons based on simulated data. Comparisons based on real ENCODE data follow in Section 6.3. In Section 6.4, we perform a comparison based on RUNX1/ETO, together with a comparison based on the gene expression result of RUNX1/ETO. Finally, in Section 6.5 we discuss the findings.
