
In any analysis that involves aggregating marker-level test results, we must be able to detect and quantify the significance of regions like those depicted in Figure 1.1.

This is not trivial, however. As described in Section 2.2.2, the fact that p-values among markers within a window are statistically dependent greatly increases the difficulty of estimating the exact null distribution. This dependence introduces a lack of exchangeability between test results, which complicates matters and causes various naïve approaches to fail. In this section, I compare three approaches that I tried during my research and illustrate the consequences of non-exchangeability in testing the significance of a region with a preponderance of low p-values.

One approach, suggested in [35], is to use circular binary segmentation (CBS; implemented in the R package DNAcopy). This method aggregates neighboring p-values by calculating the two-sample t-test statistic comparing the mean intensity of a given region with that of the surrounding region. The significance of this test statistic is quantified by comparing it to the distribution of maximum test statistics obtained by permuting the $\{p_j\}$ values [30, 31]. However, the main assumption of this approach is that the test results $\{p_j\}$ are exchangeable, which is the justification for permuting them.

Alternatively, we may use the kernel-based method described in Section 2.2.1 to aggregate the neighboring test results, thereby obtaining $T_{\max} = \max_j \{T_j\}$. One way to generate the null distribution of $T_{\max}$ is Monte Carlo integration, based on the fact that, under the null hypothesis of no association, all p-values follow a uniform distribution. Thus, for any choice of transformation and kernel in (2.1), we may generate an arbitrary number of sets $\{T_j\}$ under the null and thereby obtain independent draws $\{T_{\max}^{(b)}\}_{b=1}^{B}$ from the null distribution function $F_0$ of $T_{\max}$. The estimate $\hat{F}_0$ is then the empirical CDF of those draws. With this approach, we test for the significant presence of a CNV-phenotype association by calculating $p = 1 - \hat{F}_0(T_{\max})$. The crucial assumption here is that, under the null, the p-values among markers are independent, and so are the $\{T_j\}$.
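The Monte Carlo procedure can be sketched as follows. This is only an illustration under stated assumptions: equation (2.1) is not reproduced in this section, so the transformation −log(p), the flat (moving-average) kernel, the `bandwidth` parameter, and all function names are hypothetical choices rather than the settings used in the analysis.

```python
import math
import random

def kernel_stats(pvals, bandwidth=5):
    """T_j: flat-kernel average of -log(p) over markers within `bandwidth` of j.
    Both the transformation and the kernel are assumptions of this sketch."""
    scores = [-math.log(p) for p in pvals]
    n = len(scores)
    stats = []
    for j in range(n):
        lo, hi = max(0, j - bandwidth), min(n, j + bandwidth + 1)
        stats.append(sum(scores[lo:hi]) / (hi - lo))
    return stats

def t_max(pvals, bandwidth=5):
    """T_max = max_j T_j."""
    return max(kernel_stats(pvals, bandwidth))

def monte_carlo_null(n_markers, n_draws=500, bandwidth=5, seed=0):
    """Draws of T_max when all p-values are independent Uniform(0, 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
        null_p = [1.0 - rng.random() for _ in range(n_markers)]
        draws.append(t_max(null_p, bandwidth))
    return draws

def monte_carlo_pvalue(observed_tmax, null_draws):
    """p = 1 - F0_hat(T_max), with F0_hat the empirical CDF of the null draws."""
    at_or_below = sum(1 for d in null_draws if d <= observed_tmax)
    return 1.0 - at_or_below / len(null_draws)

# Toy usage: 200 markers with a planted run of 30 very small p-values.
null_draws = monte_carlo_null(n_markers=200, n_draws=500)
rng = random.Random(1)
observed_p = [1.0 - rng.random() for _ in range(200)]
for j in range(80, 110):
    observed_p[j] = 1e-4
p_region = monte_carlo_pvalue(t_max(observed_p), null_draws)
```

Note that the null draws never look at the data: they depend only on the number of markers, which is exactly why the procedure relies on independence among the marker-level p-values.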

An alternative way to generate the null distribution and quantify the significance of $T_{\max}$ is the permutation approach that is proposed and described fully in Section 3.2.2. Permuting the phenotype prior to aggregation of the marker-level tests makes intensity and phenotype independent at every marker in each permutation, while leaving the dependence among markers intact. Thus, using the empirical CDF of an arbitrary number of draws $\{T_{\max}^{(b)}\}_{b=1}^{B}$, we obtain the estimate $\hat{F}_0(T_{\max})$.
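A minimal sketch of the permutation idea is given below; the full procedure is the one described in Section 3.2.2, not this code. The marker-level test (a Welch-type z statistic with a normal approximation), the binary phenotype, the −log(p) transformation, the flat kernel, and the toy data generator `simulate_region` are all assumptions introduced here for illustration.

```python
import math
import random

def marker_pvalue(values, labels):
    """Two-sided p-value from a Welch-type z statistic (normal approximation)."""
    g1 = [v for v, y in zip(values, labels) if y == 1]
    g0 = [v for v, y in zip(values, labels) if y == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
    v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
    z = (m1 - m0) / math.sqrt(v1 / len(g1) + v0 / len(g0))
    # Floor the p-value so that -log(p) stays finite.
    return max(math.erfc(abs(z) / math.sqrt(2.0)), 1e-300)

def region_tmax(intensity, labels, bandwidth=2):
    """T_max for one labeling; intensity[j][i] is marker j, individual i."""
    scores = [-math.log(marker_pvalue(row, labels)) for row in intensity]
    n = len(scores)
    best = -math.inf
    for j in range(n):
        lo, hi = max(0, j - bandwidth), min(n, j + bandwidth + 1)
        best = max(best, sum(scores[lo:hi]) / (hi - lo))
    return best

def permutation_pvalue(intensity, labels, n_perm=200, bandwidth=2, seed=0):
    """p = 1 - F0_hat(T_max), with F0_hat built from phenotype permutations."""
    rng = random.Random(seed)
    observed = region_tmax(intensity, labels, bandwidth)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        # Only the phenotype is shuffled; the intensities are left untouched,
        # so the dependence among markers carries over to every null draw.
        rng.shuffle(shuffled)
        if region_tmax(intensity, shuffled, bandwidth) > observed:
            exceed += 1
    return exceed / n_perm

def simulate_region(carrier, cnv_markers, n_markers=30, shift=3.0, seed=1):
    """Hypothetical toy data: unit-variance noise, plus `shift` for CNV
    carriers at the markers spanned by the CNV."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) + (shift * c if j in cnv_markers else 0.0)
             for c in carrier]
            for j in range(n_markers)]

carrier = [i % 2 for i in range(40)]            # half the sample carries the CNV
intensity = simulate_region(carrier, cnv_markers=set(range(10, 20)))
unrelated = [1 if i < 20 else 0 for i in range(40)]   # balanced across carriers
p_no_assoc = permutation_pvalue(intensity, unrelated)
p_assoc = permutation_pvalue(intensity, carrier)      # phenotype equals carrier status
```

The design point is that each null draw is computed from the same intensity matrix as the observed statistic, so no independence or exchangeability assumption about the marker-level p-values is needed.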

Consider a genomic region in which individuals may have a CNV. The purpose of the analysis is to detect and locate such a CNV if it is associated with a particular phenotype. Thus, the null hypothesis for our association test may hold in one of two ways: (1, No CNV) no individuals with CNVs in that region are present in the sample, or (2, No association) individuals with CNVs are present in the sample, but the CNV does not affect the disease and thus does not change the probability of developing the phenotype. The preservation of type I error by the three methods discussed above under each type of null hypothesis is shown in Table 3.1. It demonstrates that while all three methods have the proper type I error rate in the "No CNV" setting, only the permutation approach preserves the correct type I error in the case where a CNV is present but not associated with the disease (No association). It is easy to see that the p-values are independent for all methods under null hypothesis 1 (No CNV).

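When a CNV is present but not related to the disease (null hypothesis 2, No association), it is still true that the marginal distribution of each $p_j$ is Uniform(0,1). The p-values are no longer exchangeable, however, and this is what causes the methods that rely on exchangeability or independence to fail. This phenomenon is also illustrated graphically in Figure 3.1.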
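A small simulation can illustrate why a marginally valid test at each marker does not imply independence across markers in the No association setting. The sketch below is not the simulation of Section 3.4; the sample size, the `shift` added for CNV carriers, and the use of a simple case-minus-control mean difference as the marker-level statistic are assumptions made for illustration. It estimates, across replicates with a randomly assigned phenotype, the correlation between the statistics of two markers.

```python
import math
import random

def mean_difference(values, labels):
    """Marker-level statistic: case mean minus control mean."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def draw_marker(rng, carrier, shift):
    """Intensities at one marker: unit-variance noise plus `shift` for carriers."""
    return [rng.gauss(0.0, 1.0) + shift * c for c in carrier]

def statistic_correlation(shift, n_reps=400, n_ind=40, seed=0):
    """Correlation, across replicates, between the statistics of two markers
    that share the same carriers. The phenotype is assigned at random, so the
    null hypothesis of no association holds in every replicate."""
    rng = random.Random(seed)
    carrier = [i % 2 for i in range(n_ind)]
    first, second = [], []
    for _ in range(n_reps):
        labels = [1] * (n_ind // 2) + [0] * (n_ind // 2)
        rng.shuffle(labels)  # phenotype independent of carrier status
        first.append(mean_difference(draw_marker(rng, carrier, shift), labels))
        second.append(mean_difference(draw_marker(rng, carrier, shift), labels))
    return correlation(first, second)

corr_inside_cnv = statistic_correlation(shift=3.0)  # CNV present, no association
corr_no_cnv = statistic_correlation(shift=0.0)      # no CNV in the sample
```

In this toy setup, any chance imbalance of carriers between cases and controls pushes the statistics of all markers inside the CNV in the same direction, so they are positively correlated even though the phenotype is unrelated to the CNV; without a CNV the two statistics are essentially uncorrelated.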

Table 3.1: Preservation of Type I error for three methods with nominal α = .05 in two possible settings for which the null hypothesis holds. The simulated genomic region contained 200 markers, 30 of which were spanned by a CNV. The CNV was present in either 0% or 50% of the samples, depending on the null hypothesis setting.

A detailed description of the simulation data is given in Section 3.4.

Setting           Circular binary segmentation   Kernel Monte Carlo   Kernel permutation
No CNV            0.05                           0.06                 0.06
No Association    0.20                           0.54                 0.06

Table 3.1 and Figure 3.1 demonstrate that CBS and kernel Monte Carlo are not guaranteed to preserve the type I error in all settings. Exchangeability must therefore be taken into account when estimating the null distribution. I also make the following additional observations from the comparison: (1) The CBS approach is somewhat more robust to the exchangeability issue than the Monte Carlo approach; i.e., its type I error rate is not as badly violated. (2) The data simulated here for the No association setting are somewhat exaggerated: the CNV was present in 50% of the population, and the signal-to-noise ratio was about twice as high as that typically

[Figure 3.1 panels, each with horizontal axis p from 0.0 to 1.0: No CNV, Monte Carlo; No Association, Monte Carlo; No CNV, Permutation; No Association, Permutation]

Figure 3.1: Ability of the Monte Carlo and permutation approaches to maintain the family-wise error rate under the two null scenarios. The implementation of CBS provided by DNAcopy does not return p-values (only whether they fall above or below a cutoff), and thus could not be included in this plot.

observed in real data. In more realistic settings, the violation of the type I error rate will not be nearly as severe. (3) Circular binary segmentation was developed for the purpose of detecting CNVs, not aggregating marker-level tests, and thus its failure to preserve the family-wise error rate in this setting is in no way a criticism of CBS in general.
