of a different technology. Otherwise, no insight can be provided into the biological mechanisms of how the identified CNV influences the disease of interest. The presented application of strategy S1 was concluded without technical validation because no positive CNV association effects were found.
3.6 Application of Strategy S2 to the Phenotype Obesity
In the following chapter, the application of strategy S2 is demonstrated, by way of example, on genome-wide raw CNV data of a family-based obesity sample. For this purpose, the same data set of 424 obesity trios that was analysed with strategy S1 in the previous chapter is re-analysed here. Subsequently, the statistical results and genetic conclusions of strategy S2 for the phenotype obesity are compared to those of strategy S1. Finally, strengths and weaknesses of the genome-wide CNV analysis strategy S2 are discussed in comparison to the corresponding characteristics of strategy S1.
3.6.1 Data Set and Methods
Data set
Available genotype data for a family-based sample of 424 nuclear families, each comprising one obese child or adolescent and both biological parents, were analysed here. All families were previously recruited and phenotypically characterized through the Departments of Child and Adolescent Psychiatry of the Universities of Duisburg-Essen and Marburg. Details on phenotypical characteristics can be found in chapter 3.5.1 and in Jarick et al. (2011) (Supplementary Table S1). For all 1 272 individuals, genotyping was performed on the Affymetrix 6.0 chip by ATLAS Biolabs GmbH in Berlin (for details see chapter 3.5.1).
CNV calling and association testing
For each of the 1 272 individuals, CNV detection was performed with the PennCNV software (Wang et al., 2007) using default parameters.
In the course of quality control (QC) for the CNV calling procedure, each CNV call that did not cover more than 20 informative consecutive probe sets was discarded from subsequent statistical analyses. As shown in later chapters, the CNV detection threshold of 20 probe sets per CNV call is the optimal threshold for Affymetrix 6.0 data with regard to CNV stability and reproducibility rates. The remaining CNVs were tested for association with the binary trait obesity using the FBAT approach under an additive genetic effect model. Specifically, the marker genotypes were coded as 0, 1, 2, 3, 4, in accordance with the estimated total unphased number of DNA segment copies. To avoid redundancies, only the set of unique CNV start and end sites, rather than the whole set of available probe sets, was tested for association with the phenotype obesity. To this end, overlapping CNVs were first merged into CNV-containing regions (CNVRs). Each CNVR was then divided into multiple sub-CNVRs, whose boundaries were defined as the breakpoints of the maximal intervals with identical CNV configuration across all 1 272 individuals. Thus, the composition of each CNVR is completely specified by the set of start and end sites of all individual CNVs (see Figure 6.7 for details).
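The two-step construction of CNVRs and sub-CNVRs can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names `merge_into_cnvrs` and `split_into_sub_cnvrs`, the representation of a CNV call as a half-open `(start, end)` interval on a single chromosome, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: one CNV call = (start, end) on one chromosome,
# pooled across all individuals, with start < end (half-open interval).
Interval = Tuple[int, int]


def merge_into_cnvrs(cnvs: List[Interval]) -> List[Interval]:
    """Merge overlapping CNV calls into CNV-containing regions (CNVRs)."""
    merged: List[Interval] = []
    for start, end in sorted(cnvs):
        if merged and start < merged[-1][1]:
            # Call overlaps the current region: extend the region if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: open a new region.
            merged.append((start, end))
    return merged


def split_into_sub_cnvrs(cnvr: Interval, cnvs: List[Interval]) -> List[Interval]:
    """Split one CNVR at every CNV start or end site that lies inside it.

    Between two neighbouring breakpoints no call starts or ends, so the
    CNV configuration is identical for every individual.
    """
    lo, hi = cnvr
    breakpoints = sorted({p for s, e in cnvs for p in (s, e) if lo <= p <= hi})
    return list(zip(breakpoints[:-1], breakpoints[1:]))


# Illustrative calls only (not taken from the obesity data set).
calls = [(100, 200), (150, 300), (500, 600)]
cnvrs = merge_into_cnvrs(calls)
sub_cnvrs = {region: split_into_sub_cnvrs(region, calls) for region in cnvrs}
```

In this toy example the first two calls overlap and form one CNVR, which is split at the inner breakpoints 150 and 200; the third call forms a CNVR of its own with a single sub-CNVR.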
To ensure that each FBAT is based on a minimal number of informative families, only sites within 244 pre-specified genomic regions that show a CNV variability of at least five percent in both the offspring and the parent group were included in the association testing step. Details on how these 244 CNVRs were specified and on their structural characteristics are given in chapters 5.2.2 and 5.2.3. As previously explained in detail, genome-wide significance under simultaneous testing of multiple hypotheses was assessed with the lfdr method (Efron et al., 2001; see chapter 3.5.2).
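The region filter can be sketched as follows. The exact definition of CNV variability is given in chapters 5.2.2 and 5.2.3 and is not repeated here; the sketch therefore assumes, purely for illustration, that variability is the fraction of individuals in a group who carry at least one CNV call in the region. The function name `passes_variability_filter` and the carrier counts are hypothetical; the group sizes of 424 offspring and 848 parents follow from the 424 trios.

```python
N_OFFSPRING = 424  # one child per trio
N_PARENTS = 848    # two parents per trio


def passes_variability_filter(
    offspring_carriers: int,
    parent_carriers: int,
    threshold: float = 0.05,
) -> bool:
    """Return True if a region reaches the threshold in BOTH groups.

    Assumption: variability = carriers / group size (see lead-in).
    """
    offspring_variability = offspring_carriers / N_OFFSPRING
    parent_variability = parent_carriers / N_PARENTS
    return offspring_variability >= threshold and parent_variability >= threshold


# Hypothetical carrier counts for three regions.
region_a = passes_variability_filter(offspring_carriers=22, parent_carriers=43)
region_b = passes_variability_filter(offspring_carriers=21, parent_carriers=43)
region_c = passes_variability_filter(offspring_carriers=22, parent_carriers=42)
```

Under this assumed definition, a region needs at least 22 carriers among the offspring and at least 43 among the parents to be retained.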
3.6.2 Results
A total of 47 796 CNVs were detected in the 1 272 individuals: 15 863 CNVs in the offspring group and 31 933 CNVs in the parent group. Of all detected CNVs, 39 955 were located in the 244 pre-specified CNVRs with a minimal CNV variability of five percent, 13 455 in the offspring and 29 500 in the parents.
For association testing, FBATs were performed at a total of 3 525 unique CNV start and end sites (Figure 3.6). None of the tested sites reached statistical significance after correction for multiple testing (minimal p-value = 0.00071).
3.6.3 Discussion
Applying strategy S2 for the genome-wide analysis of raw CNV data to a family-based sample of 424 obesity trios did not reveal any evidence for an association of particular CNVs with the trait obesity. This is in concordance with the previous results of applying strategy S1 to the same data set. Apart from a true lack of a CNV-obesity association, one potential cause of the negative finding may be limited statistical power resulting from the moderate size of the analysed sample.

Figure 3.6: Histogram and lfdr curve of CNV FBAT z-values for the genome-wide analysis of 424 obesity trios at 3 525 unique CNV start and end sites in 244 CNVRs. Panel A: Histogram. The red dashed curve depicts the standard normal distribution, the dashed blue line is $\hat{p}_0 \hat{f}_0$, the empirical null density, $N(0.107, 1.056^2)$, and the green line is the empirically estimated mixture density. Panel B: Lfdr curve, derived from empirical estimates of $f_0$, $f$ and $p_0$ (Panel A). Observed CNV FBAT z-values are illustrated as ticks on the horizontal line at lfdr level 1.
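For reference, the quantities named in the caption of Figure 3.6 are related through the standard two-groups formulation of the local false discovery rate (Efron et al., 2001). The block below restates that general definition; the symbol $f_1$ for the density of non-null z-values is introduced here for illustration and does not appear in the caption.

```latex
\[
  f(z) = p_0 f_0(z) + (1 - p_0) f_1(z),
  \qquad
  \mathrm{lfdr}(z) = \frac{p_0 f_0(z)}{f(z)}
\]
```

Here $p_0$ is the proportion of true null hypotheses, $f_0$ the null density and $f$ the mixture density of all z-values. An lfdr value close to 1 across the observed range of z-values, as in Panel B, indicates that the observed z-values are compatible with the null distribution.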
In contrast to strategy S1, in which statistical association testing is based on raw copy number measurements, strategy S2 requires a computationally expensive CNV detection step prior to association testing. In the presented example for the phenotype obesity, the academically developed software tool PennCNV was used for CNV identification. Compared to alternative software programs, this HMM-based program was previously shown to perform comparably well in detecting CNVs from SNP genotyping array data (Winchester et al., 2009; Koike et al., 2011). As outlined in chapter 3.2, there is currently no consensus on the optimal choice of an algorithm or software for estimating individual CNV events with reliable accuracy. Following recent recommendations to apply a second algorithm to the same data set to increase confidence in the CNV data (Winchester et al., 2009) would considerably increase the complexity and computing effort of strategy S2. However, filtering CNVs by the calls of a second CNV calling tool would make it even less likely that all CNVs in a sample are listed. Of note, no CNV tested for association in strategy S2 can be regarded as confirmed without separate biological validation or replication.