Peter Chines and I also worked together to remove potentially bad probes from the EPIC array. Some probes have been reported to be cross-reactive, mapping to more than one genomic location [60, 303, 247, 442]. Measurements from such probes are unreliable, as they likely represent aggregate DNAme signal across multiple sites. To identify cross-reactive probes on the EPIC chip, we mapped non-control probes back to the entire bisulfite-converted genome (leaving out alternative haplotypes, and ignoring a single hit to a random contig when there is a single corresponding hit to a primary chromosome), using Novoalign’s -b4 option, with allowance for up to three mismatches in the 50 bp probe alignment beyond the best alignment seen (-R120 option). We kept only uniquely mapping probes, removing 49,495 probes (Figure 2.12).

[Figure 2.12: bar chart, "CpG Probes Excluded by Each Criterion". Axes: reason for exclusion versus probes affected (0 to 50,000). Bars colored by source: 1000G EUR MAF ≥ 1%, HRC MAF ≥ 1%, HRC in cohort, McCartney, ProbeProver.]
Figure 2.12 Summary of blacklist probes excluded from analysis. ProbeProver is the term used to describe the method developed and used by FUSION for ambiguous probe mapping.
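The unique-mapping rule described above can be illustrated with a minimal Python sketch. It is not the ProbeProver implementation: the input format (one `(probe_id, contig)` pair per reported alignment), the function name, and the simplification that any non-primary contig counts as a "random contig" are all assumptions made for illustration. In particular, the sketch does not check that the random contig corresponds to the same primary chromosome.

```python
from collections import defaultdict


def uniquely_mapping_probes(alignments, primary_contigs):
    """Return the set of probe IDs with exactly one effective alignment.

    alignments: iterable of (probe_id, contig) pairs, one per reported hit
        (hypothetical input format, not an aligner's native output).
    primary_contigs: set of contig names treated as primary chromosomes.
    """
    primary_hits = defaultdict(int)
    other_hits = defaultdict(int)
    probes = set()
    for probe_id, contig in alignments:
        probes.add(probe_id)
        if contig in primary_contigs:
            primary_hits[probe_id] += 1
        else:
            other_hits[probe_id] += 1

    unique = set()
    for probe_id in probes:
        n_primary = primary_hits[probe_id]
        n_other = other_hits[probe_id]
        # Assumption: a single hit to a non-primary contig is ignored when
        # the probe also has exactly one hit to a primary chromosome.
        if n_primary == 1 and n_other == 1:
            n_other = 0
        if n_primary + n_other == 1:
            unique.add(probe_id)
    return unique
```

Probes absent from the returned set would be the candidates for the ambiguous-mapping blacklist.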
In addition, probes may contain SNPs which, if common in the population of interest, could bias inter-individual comparisons. For example, “methylation” signals at polymorphic CpGs merely reflect the underlying genetic polymorphism [60] and exhibit significantly increased variation compared to all other probes [303]. To avoid such biases, we removed probes with a SNP within 10 bp of the 3’ end of the probe, a SNP within the target CpG itself, or, in the case of type I probes, a variant overlapping the single base extension site. We used 10 bp as the cutoff for consistency with previous studies [303].
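As a rough illustration of the three variant-overlap criteria, the sketch below flags a probe against a set of variant positions. The `Probe` fields, the precomputed coordinate windows, and the reason labels are hypothetical, and variants are simplified to single 1-based positions, which does not capture the extent of indels or structural variants.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Probe:
    probe_id: str
    chrom: str
    three_prime_window: Tuple[int, int]  # inclusive coordinates of the 10 bp at the 3' end
    cpg: Tuple[int, int]                 # inclusive coordinates of the target CpG
    sbe_pos: Optional[int] = None        # single base extension site (type I only)


def exclusion_reasons(probe: Probe, variants: Dict[str, Set[int]]) -> Set[str]:
    """Return the set of criteria under which the probe overlaps a variant."""
    sites = variants.get(probe.chrom, set())

    def overlaps(lo: int, hi: int) -> bool:
        # True if any variant position falls in the inclusive range [lo, hi].
        return any(pos in sites for pos in range(lo, hi + 1))

    reasons = set()
    if overlaps(*probe.three_prime_window):
        reasons.add("var_3prime10")
    if overlaps(*probe.cpg):
        reasons.add("var_CpG_site")
    # The extension-site criterion applies only to type I probes.
    if probe.sbe_pos is not None and probe.sbe_pos in sites:
        reasons.add("var_TypeI_ext")
    return reasons
```

A probe would be blacklisted when `exclusion_reasons` returns a non-empty set for any of the variant sources considered.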
For variants, we used common (MAF ≥ 1%) SNPs, indels, or structural variants in the phase 3 1000 Genomes European dataset, common (MAF ≥ 1%) SNPs in the HRC reference panel r1.1, and SNPs appearing at any frequency in our own samples after imputation to the HRC reference panel. We chose to filter probes overlapping a SNP at any frequency in our imputed HRC genotypes because we will likely use different sample subsets for future integrated muscle and adipose studies, and we wanted a consistent analysis data frame across all studies rather than a different MAF filter for only adipose or only muscle samples. In total, we removed 63,840 probes due to SNP overlaps. As a final step, we combined our blacklist with a previously published EPIC probe blacklist from McCartney et al. [247], for a total of 120,627 unique probes that were removed from subsequent analysis (Figure 2.12).

After removing blacklist probes, I flagged probes with a high detection p-value, defined as p-value > 0.05 in ≥ 5% of samples, for removal before later analyses. The probe detection p-value assesses whether the combined Meth and Unmeth signal is distinguishable from the background signal, which is estimated using negative control probes. One potential cause of such low-quality signal is spatial artefacts on the array [79].

I evaluated several filters for removing low-quality probes. First, I calculated probe failure rates across all tissues using four sample sets: (1) all samples and controls, (2) samples without controls, (3) only samples that passed QC, and (4) only the final analysis samples (after dropping samples removed in the genotype QC step and selecting one of each replicate pair). Overall, the choice of sample set affected only a small number of probes relative to the whole dataset (Figure 2.13; note that WG3000808 was dropped for this analysis). Second, using the final analysis samples (after tissue-specific filters), I evaluated a per-tissue probe filter. I found an increased number of failed probes in islet samples, likely because fewer islet samples were assayed. I decided to use a conservative approach, removing probes that failed in ≥ 5% of the final analysis samples per tissue type. After blacklist filters, I removed 578 adipose probes, 733 muscle probes, and 2,206 islet probes.
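The detection p-value filter can be sketched in a few lines of Python. The data layout (a dictionary of per-probe p-value lists with a parallel list of tissue labels) and the function names are assumptions for illustration only. The thresholds default to the values stated above.

```python
def failed_probe_ids(detection_p, p_threshold=0.05, min_fraction=0.05):
    """Return probes whose detection p-value exceeds p_threshold
    in at least min_fraction of samples.

    detection_p: dict mapping probe_id -> list of detection p-values,
        one per sample (hypothetical layout).
    """
    failed = set()
    for probe_id, pvals in detection_p.items():
        if not pvals:
            continue  # no samples, nothing to evaluate
        n_fail = sum(p > p_threshold for p in pvals)
        if n_fail / len(pvals) >= min_fraction:
            failed.add(probe_id)
    return failed


def failed_probe_ids_per_tissue(detection_p, sample_tissue, **kwargs):
    """Apply the same filter separately within each tissue.

    sample_tissue: list of tissue labels, aligned with the order of
        p-values in each probe's list.
    """
    result = {}
    for tissue in sorted(set(sample_tissue)):
        cols = [i for i, t in enumerate(sample_tissue) if t == tissue]
        subset = {pid: [pvals[i] for i in cols]
                  for pid, pvals in detection_p.items()}
        result[tissue] = failed_probe_ids(subset, **kwargs)
    return result
```

One property of this rule is worth noting: because the cutoff is a fraction of samples, any tissue with 20 or fewer samples loses a probe after a single failed sample, so smaller sample sets tend to flag more probes.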
[Figure 2.13: Venn diagrams. (a) Failed probes across all tissues; sets: S+C, S, Good S, Analysis S. (b) Failed probes per tissue; sets: Analysis S, A, M, I.]
Figure 2.13 Overlap of low quality probes with a high detection p-value across different sample sets. (a) S+C indicates that the probe failure rate was calculated across tissue samples and controls. S excludes the controls from the probe failure rate calculations. Good S excludes controls and samples that did not pass QC. Analysis S only uses samples that are included in the final analysis dataframe, after dropping samples removed in genotype QC step and selecting one of each replicate pair. (b) Comparison of failed probes across each tissue in the Analysis S set.