CTCF-bound genomic regions were retrieved from liver samples of adult mice of the same two inbred subspecies of Mus musculus used for the analysis in Chapter 2: Mus musculus domesticus (C57BL/6J, or BL6 for short) and Mus musculus castaneus (CAST), together with the F1 hybrid offspring of their two reciprocal crosses (BL6xCAST and CASTxBL6). Binding sites were derived from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) libraries of CTCF, with two biological replicates for each of the four lines, for a total of 8 biological replicates (2 for each F0 parental subspecies and 2 for each of the two F1 hybrid crosses) (Figure 3.1a). Differences in binding affinity between the F0 parental mice and their F1 hybrid offspring, using read enrichment as a proxy, were used to discern the evolutionary, genomic and functional dynamics of cis and trans variation and their impact on the pattern of CTCF binding.
Using normalised ChIP-seq read counts, a regulatory category was assigned to each binding site, based on the presence of single nucleotide variants (SNVs) overlapping ChIP-seq peaks and the differences in their read enrichments.
Using the approach detailed in the Methodology, we were able to distinguish between four regulatory categories that describe the binding of CTCF between the two subspecies: conserved, cis, trans and cistrans (Figure 3.1a). Under the conserved category, SNVs do not exhibit any measurable differences in their binding signal intensities between either the F0 parental alleles or their F1 progeny. For cis-acting variants, read enrichment is associated with that of the specific parental allele [257, 260, 612]. Despite distinctly different read signals from the two F0 parental alleles, trans-acting variation affects both alleles in the F1 equally, due to diffusible elements in the shared nuclear environment. Lastly, the cistrans classification encompasses the
parental subspecies, but their binding among their progeny was also observed to be different. This could indicate either a combination of cis- and trans-acting variation influencing CTCF binding in F1, due to both a common environment and allele-specific effects, or insufficient signal to allow confident category assignment, which may improve with additional biological replicates.
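The decision logic behind the four categories can be sketched as follows. This is a minimal illustration only, not the algorithm described in the Methodology: it assumes that three significance tests have already been run for each SNV, and the function name and the three boolean flags are hypothetical.

```python
def classify_site(f0_differs: bool, f1_differs: bool, ratio_shifts: bool) -> str:
    """Assign a regulatory category to one SNV-bearing binding site.

    All three flags are hypothetical outcomes of upstream significance tests:
      f0_differs   - BL6 and CAST binding differ between the F0 parents
      f1_differs   - BL6 and CAST alleles differ within the F1 hybrids
      ratio_shifts - the BL6:CAST ratio changes between F0 and F1
    """
    if not f0_differs and not f1_differs and not ratio_shifts:
        # No measurable difference in parents or progeny.
        return "conserved"
    if f0_differs and f1_differs and not ratio_shifts:
        # The F1 allelic ratio follows the parental ratio.
        return "cis"
    if f0_differs and not f1_differs and ratio_shifts:
        # Parental difference disappears in the shared F1 nucleus.
        return "trans"
    # Every other combination of outcomes.
    return "cistrans"
```

Under these assumptions, any combination of test outcomes that does not fit the first three patterns falls into cistrans, which is consistent with that category absorbing sites that lack sufficient signal.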
Figure 3.1: Overview of the experimental design and preliminary results.
a CTCF occupancy was profiled using ChIP-seq of liver samples of male mice from C57BL/6J (BL6), CAST/EiJ (CAST), and their reciprocal F1 crosses, BL6xCAST and CASTxBL6, in 2 biological replicates for each genetic background. Normalised ChIP-seq read counts between BL6 and CAST at SNVs were used to assign regulatory classes for variation in CTCF binding. As shown in the schematic diagram, by comparing BL6 and CAST ratios between F0 and F1 mice, CTCF binding sites can be classified into four regulatory categories: conserved, cis, trans and cistrans. b Summary table of the results of library alignment and peak calling, sorted by genetic background. The lower mappability of aligned reads is caused by the stringent criterion of a maximum of two mismatched bases per read (see Methods). c Bar plots of the results of ChIP-seq read alignment of F1 libraries to both F0 genomes. The top plot shows the F1 reads aligning to the BL6 and CAST genomes, whereas the bottom plot shows F1 CTCF peaks called from reads aligned to the BL6/CAST genomes (in 1000s). Similar numbers of peaks were called in F1s with the BL6 or CAST genomes, and the majority of them overlap reciprocally by over 90%.
A sufficient read depth was obtained following ChIP enrichment, and >20 million reads (over 50% of all reads) from each replicate aligned to their corresponding genome despite stringent mapping criteria (see Methods) (Figure 3.1b). A comparable number of CTCF peaks was obtained from all replicates in both F0 and F1, with a mean of >44,000 peaks per replicate, consistent with the overall number of CTCF binding sites previously reported [269, 559]. Notably, an equal proportion of F1 ChIP reads mapped back to each of the two parental genomes, BL6 and CAST, and produced a comparable number of binding sites when peaks were called on each set of aligned reads (Figure 3.1c), confirming the ability to map alleles from each F1 hybrid back to their parent of origin, and the feasibility of drawing comparisons between the two subspecies.
A total of over 58,000 CTCF binding sites were identified across replicates/crosses.
The majority of these sites (75% of all binding sites) did not contain any SNVs, so it was not possible to investigate allelic differences in binding between the two subspecies, as the alleles could not be told apart. There were, however, about 25% of binding sites that had one or more SNVs within the peak region with sufficient read enrichment signal to quantitatively resolve the difference in allelic binding in both the F0 and F1 mice (Figure 3.2a). Half of these sites have a single SNV in the peak region, with the remainder carrying two or more SNVs in their sequence. The numbers obtained in this analysis are roughly comparable to those obtained in a study of other liver-specific transcription factors (Figure 3.2a) [257]. In order to avoid inflating the results by repeatedly counting binding sites with more than one SNV, all downstream analyses used SNVs that are at least 250 bp from the next SNV, restricting them to a single SNV per binding site (see Methods).
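One reading of the 250 bp spacing filter is sketched below. It is an assumption rather than the procedure given in the Methods: it keeps only SNVs whose nearest neighbour on the same chromosome lies at least 250 bp away, and both the function name and the (chromosome, position) input format are hypothetical.

```python
from collections import defaultdict

def isolated_snvs(snvs, min_gap=250):
    """Keep SNVs whose nearest neighbour on the same chromosome is >= min_gap bp away.

    snvs: iterable of (chromosome, position) tuples (hypothetical input format).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in snvs:
        by_chrom[chrom].append(pos)

    kept = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        last = len(positions) - 1
        for i, pos in enumerate(positions):
            # An SNV passes if both of its neighbours are far enough away.
            left_ok = i == 0 or pos - positions[i - 1] >= min_gap
            right_ok = i == last or positions[i + 1] - pos >= min_gap
            if left_ok and right_ok:
                kept.append((chrom, pos))
    return kept
```

An alternative reading would retain one representative SNV from each cluster instead of discarding the whole cluster; the sketch takes the stricter interpretation.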
CTCF binding sites with SNVs informative for allelic differentiation were assigned
Figure 3.2: Regulatory categories assignment demonstrates that CTCF occupancy levels are equally cis- and cistrans-driven for 2/3 of sites.
a Pie charts of the number of CTCF binding sites obtained after peak calling and SNV mapping, with the associated number of SNVs. The percentages in the nested pie charts reflect their proportions out of the total number of binding sites. b Scatterplots of BL6 vs CAST log2 ratios of CTCF binding intensity signals in F0 and F1 mice. Every point represents an SNV. Regulatory category assignments are colour-highlighted in individual scatterplots. The direction of the distribution of SNVs is indicated above each plot. Grey points are the remaining SNVs that are not assigned to the highlighted category. c Pie charts displaying the regulatory class make-up of SNVs overlapping CTCF compared to those derived from 2 randomly selected replicates for each of the three other TFs. The numbers in brackets indicate the total number of TF binding sites with a minimum of one SNV in their sequence for each TF.
In order to verify this class assignment, the differences in binding ratio between the F0 and F1 alleles were visualised by plotting the ratio of the F1 BL6 allele to its CAST counterpart against the corresponding ratio of the F0 alleles. As seen in Figure 3.2b, cis-acting variants cluster along the diagonal, as their F1 ratios correspond to those of the parental lines, whereas trans-acting variants form a straight line parallel to the F0 ratio axis, a result of their departure from their parental allele signals. Cistrans variants deviate significantly from the diagonal, filling the area between cis and trans variants (Figure 3.2b).
In total, there were 14,364 CTCF binding sites characterised by the presence of SNVs with sufficient read coverage to resolve allelic differences and to which we were able to assign regulatory categories, comparable to the number of sites used for the other TFs (13,000 – 17,000) (Figure 3.2c). Although CTCF binding sites, like those of liver-specific transcription factors, were most frequently influenced by cis-acting SNVs (35%), these were followed very closely by cistrans (28%), then conserved (20%) and trans variants (17%) (Figure 3.2c). The proportion of cis variants in CTCF was noticeably lower than that observed in the liver-specific TFs with an equal number of biological replicates. CTCF, on the other hand, showed a marked increase in the fraction of trans regulatory variation in occupancy compared to all three other TFs (17% vs 10%). Estimates for the contribution of conserved (cons) variation to the binding of CTCF and the other TFs were equivalent (Figure 3.2c). The assignment of regulatory variation in CTCF binding was significantly different from that of all 3 liver-specific TFs (χ² test for pairwise comparisons between CTCF and the other TFs with Bonferroni correction, all p-values < 2.2e-16).
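The pairwise comparison can be illustrated with a small standard-library sketch. It assumes a Pearson χ² test on a 2 x 4 table of category counts (3 degrees of freedom) followed by a Bonferroni adjustment; the function names are hypothetical, and the counts in the usage lines are made-up placeholders, not the data behind Figure 3.2c.

```python
import math

def chi2_2xk(row_a, row_b):
    """Pearson chi-square statistic for a 2 x k table of counts.

    Assumes every column has at least one observation.
    """
    total_a, total_b = sum(row_a), sum(row_b)
    total = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / total
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Placeholder counts in the order (cis, cistrans, conserved, trans).
ctcf = [350, 280, 200, 170]
other_tfs = {"TF_A": [450, 270, 190, 90], "TF_B": [430, 290, 180, 100]}
raw = [chi2_sf_df3(chi2_2xk(ctcf, counts)) for counts in other_tfs.values()]
adjusted = bonferroni(raw)
```

The closed-form tail probability applies only to 3 degrees of freedom, which matches a comparison of four categories between two factors.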
Although cis-acting variation was the most prominent mode of variation present in the liver-specific TF binding sites in the original analysis that used 6 biological
of SNVs, is always the second most common type of regulatory variation in TF occupancy, with trans and cistrans variation contributing less than a third [257]. The increase in the cistrans effect size when the analysis was run with two biological replicates instead seems to have come at the expense of both the conserved and cis variants (Figure 3.2c).
A subsampling approach was undertaken in order to assess the validity of assigning cis/trans categories to CTCF binding sites based on only two biological replicates, and to ensure comparability with the three other liver-specific TFs. The aim was to elucidate the effect of the number of biological replicates on the ability to resolve differences in read coverage into distinct regulatory categories. By randomly combining a specific number of biological replicates (from 2 to 5) for each of the three TFs, then running the cis/trans category assignment algorithm one thousand times, we were able to obtain the range of category estimates for each of the four subsampling strategies (Figure 3.3). The results of subsampling biological replicates in the other TFs show that, with the addition of every extra replicate, the estimates of the four regulatory categories move closer to the values obtained when the experiments were run with 6 biological replicates (Figure 3.3 density plots). This is additionally evidenced by the narrower distribution of values, reflecting less dispersion, with increasing replicate number (Figure 3.3 dot plots).
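The subsampling loop can be sketched as below. This is a schematic under stated assumptions, not the code used for Figure 3.3: `classify` stands in for the cis/trans assignment algorithm and is a hypothetical callable that takes a list of replicates and returns one category label per binding site.

```python
import random
from collections import Counter

def subsample_estimates(replicates, k, classify, n_runs=1000, seed=0):
    """Repeat the category assignment on random subsets of k replicates.

    replicates: list of replicate identifiers or data objects
    classify:   hypothetical callable mapping a list of replicates to a
                list of category labels, one per binding site
    Returns a list of Counters, one per run, with the site count per category.
    """
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    estimates = []
    for _ in range(n_runs):
        subset = rng.sample(replicates, k)  # draw k replicates without replacement
        estimates.append(Counter(classify(subset)))
    return estimates
```

Calling this once for each k from 2 to 5 would yield, per category, the distribution of site counts that a density plot or dot plot summarises.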
These improvements, however, are not uniformly distributed among the cis/trans categories. The most conspicuous change is invariably observed with the resolution of cistrans sites into other categories, as their proportions strongly decrease with the addition of extra replicates (starting from 3 replicates). The range of values obtained in all cistrans 2-replicate runs for all TFs never matches the original 6-replicate estimate. Conserved sites (cons) show exactly the opposite pattern, increasing considerably with the addition of extra replicates. The range of cons values for 2 replicates likewise does not reach the 6-replicate estimate. The estimates of cis-influenced sites do increase slightly with increasing numbers of replicates, but the overall distribution of values for 2-replicate runs mostly overlaps with that of higher replicate numbers, and occasionally (especially in the case of CEBPA) considerably overlaps with the 6-replicate estimate.
The pattern for trans sites is even subtler, with tighter ranges of values, and estimates that do not generally deviate from the 6-replicate estimate. For example, CEBPA 2-replicate mean values are nearly at the 6-replicate estimate (Figure 3.3).
Figure 3.3: Ascending subsampling of biological replicates in other TFs supports the cis and trans proportions observed in CTCF.
Density plots of all 1000 randomised combinations of biological replicates in
factors (CEBPA, FOXA1 and HNF4A), faceted by the 4 regulatory categories:
cis, cistrans, conserved (cons) and trans. The area under each curve corresponds to the entire range of values (number of binding sites classified as such after one run of the algorithm, repeated for 1000 randomised runs) for each category, number of replicates and TF. The horizontal dot plots under each facet illustrate the distribution of the values obtained for each category per number of replicates in that TF, and correspond to the width of the curve above. The black dashed line indicates the original estimate of the number of sites for the particular category in that TF, as derived from the original analysis in Wong et al. [257].
The grey dashed line indicates the number of CTCF sites calculated for 2 biological replicates, estimated for every TF based on the proportion of each particular category in CTCF, and multiplied by the overall number of sites in each TF.
These results validate the proportions of cis and trans sites observed in CTCF.
Although the cistrans estimate for CTCF almost always overlaps the mean/median for 2-replicate runs in other TFs, and may similarly resolve into other categories with the addition of extra biological replicates for CTCF, the estimates for the three other categories appear different from those of the other TFs (Figure 3.3). The estimate for cons sites was generally higher than the equivalent 2-replicate mean values (almost equal to the mean/median of 3-replicate runs in CEBPA and HNF4A). CTCF cis variant estimates are always lower than any of their 2-replicate counterparts in other TFs.
Although this estimate may similarly go up with the addition of extra replicates (via the resolution of cistrans sites), on this evidence CTCF cis variants would remain less abundant than in other TFs. Conversely, the CTCF trans estimate is always much higher, and as trans-assigned sites only slightly decrease with added replicates, the CTCF trans component looks to be distinctly higher.
Even though cis-acting variation was the most common category in CTCF binding, its effect size on occupancy also differed from that in the other TFs. This is clearly reflected in the Pearson's correlation coefficient of CTCF occupancy between the two parental subspecies and their offspring (Figure 3.4a). When cis- and trans-acting variation have an equal effect on occupancy differences between the F0 and F1, an absence of correlation (correlation coefficient = 0) would be observed, whereas a perfect correlation (correlation coefficient = 1) signals the lack of any trans influences. Although all four TFs (CTCF, CEBPA, FOXA1 and HNF4A) showed considerably large correlation coefficients (r ≥ 0.7, all p-values < 2.2e−16), the distribution of read enrichments from CTCF binding sites exhibits a higher degree of dispersion and deviation from the strongly cis pattern seen in the other TFs (Figure 3.4a). This is further evidenced by a lower correlation coefficient (r = 0.70), which is significantly different from those for CEBPA, FOXA1 and HNF4A (z-test, all p-values < 0.0001).
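A common way to compare two correlation coefficients with a z-test is Fisher's r-to-z transformation, sketched below. Whether this exact form was used here is an assumption; the sketch also assumes that the two coefficients come from independent sets of sites, and the function name is hypothetical.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent Pearson r values.

    r1, r2: correlation coefficients (strictly between -1 and 1)
    n1, n2: number of points behind each coefficient (each > 3)
    Returns (z, p_value).
    """
    # Fisher transformation makes the sampling distribution approximately normal.
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With many thousands of sites per factor, the standard error is small, so even modest differences between coefficients yield very small p-values.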
Figure 3.4: Cis-acting variants do not display inter-peak correspondence in CTCF.
a Scatterplots of the mean F0 vs. F1 binding intensity ratios (BL6 vs. CAST) for CTCF (left) and the 3 liver-specific TFs (right). Data from 2 randomly selected replicates for the other TFs were used for plotting to allow for meaningful comparison. The correlation coefficient (r) indicates the level of
cis-from each CTCF site affected by cis-acting variation in both directions for a distance of 400 kb. Spearman’s ρ was computed for each bin through the BL6:CAST allelic ratio between SNVs in bins vs anchored SNVs. Spearman’s ρ values for each bin (right) were plotted (black dots). Red line is the linear regression line. Grey dots represent the null distribution of random subsampling from the total set of cis/trans CTCF sites. The grey line is the linear regression line for the Spearman’s ρ values from the null distribution. c A blow-up of the Spearman’s ρ values for each bin in the 50 kb range from the anchorage points for CTCF. The red line is the linear regression line and the red dashed lines mark the 90% confidence intervals of the slope of the line. Grey dots represent the null distribution. The grey line is the linear regression line for the Spearman’s ρ values from the null distribution.