Adapting ChIP-seq and other NGS technologies to assess how genetic sequence changes affect the presence or absence of a regulatory event of interest, such as TF occupancy, has improved our understanding of how variants within and across populations and species shape the regulatory landscape. DNA sequence changes within regulatory elements can alter the stability of TF binding and the ability of the TF to influence transcription or modify chromatin state, ultimately affecting the regulatory potential of the region and its impact on transcriptional regulation.
To study the regulatory effects of sequence variants (see 1.1.4 for more details), two methods have been used extensively: expression quantitative trait loci (eQTL) mapping, which uses gene expression levels as a quantitative phenotypic trait, and analysis of allele-specific expression divergence between parental strains and their F1 hybrids in genetically inbred organisms. eQTL are polymorphic DNA variants that are associated
seq to provide allele-specific gene expression levels[518]. RNA-seq produces tens to hundreds of millions of sequence tags that provide a complete profile of gene expression and of the isoform structure of each gene[519]. eQTL mapping has been utilised to identify regulatory regions driving variation in mRNA levels and to distinguish between local regulators acting over a short genomic range in an allele-specific fashion (cis) and distant-acting regulators (trans), which influence transcription by affecting the availability of other factors involved in gene expression, resulting in similar expression levels from both alleles[252, 518].
eQTL analysis comprises four main steps: DNA genotype processing, RNA-seq tag processing, counting of RNA-seq reads and eQTL mapping[520]. First, DNA reads are mapped back to their reference genome, genotypes are called and haplotypes are imputed using a phasing algorithm. Next, RNA-seq reads are aligned to the same reference and/or to the two haploid genomes imputed from the results of the phasing programme[521]. After that, total read counts per gene per sample, as well as allele-specific read counts per allele of a gene per sample, are obtained after removing reads with low mappability and quality scores. eQTL mapping follows, whereby variation in allele-specific expression and total gene expression is associated with cis/trans variants, using a beta-binomial distribution to test for a difference in gene expression between the two alleles of a gene[520]. A hierarchical Bayesian model has also been suggested to test the disparity of gene expression across alleles, combining information across genome-wide loci[522].
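The allele-level test in the last step can be illustrated with a minimal sketch. It assumes a symmetric beta-binomial null (both alleles expressed equally on average) with a fixed overdispersion parameter `rho`; the function names, the default `rho = 0.05` and the two-sided p-value definition (summing all outcomes no more likely than the observed one) are illustrative choices, not the procedure of the cited studies, which estimate dispersion from the data.

```python
from math import lgamma, exp


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def allelic_imbalance_pvalue(ref_reads, alt_reads, rho=0.05):
    """Two-sided test of equal allelic expression at one heterozygous gene.

    rho (0 < rho < 1) is an ASSUMED overdispersion value; rho close to 0
    approaches an ordinary binomial test.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no allele-informative reads, nothing to test
    # Mean 0.5 and intra-class correlation rho give a = b = 0.5 * (1 - rho) / rho.
    a = b = 0.5 * (1.0 - rho) / rho
    pmf = [betabinom_pmf(k, n, a, b) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome at most as likely as the observed one.
    p = sum(x for x in pmf if x <= observed * (1.0 + 1e-9))
    return min(1.0, p)


if __name__ == "__main__":
    for ref, alt in [(50, 50), (60, 40), (90, 10)]:
        print(ref, alt, allelic_imbalance_pvalue(ref, alt))
```

Allowing for overdispersion makes the test more conservative than a plain binomial test, which is the usual motivation for the beta-binomial when read counts vary more between samples than sampling noise alone predicts.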
A variant of the method, chromatin immunoprecipitation quantitative trait loci (ChIP-QTL) mapping[523], combines this approach with the identification of TF-DNA contacts discussed in the previous sections. eQTL mapping has been implemented successfully on a genome-wide scale in a number of studies to investigate distant-acting variation and epistatic interactions, and to determine gene expression divergence phenotypes[524-527].
Distinguishing the regulatory effects of cis- and trans-acting variation in eQTL studies remains challenging despite advances in both experimental techniques and computational approaches[528, 529]. First, eQTL analyses require a very large number of genetically diverse samples to reach sufficient statistical power for detection[530-533]. Furthermore, eQTL analyses cannot fully distinguish between cis- and trans-acting elements: some trans-acting variants may be located in close proximity to the target gene on the same DNA molecule, and some cis-acting variants may be distantly located[534, 535]. In addition, trans-eQTL have much smaller effect sizes, are less robust and less common, and require a far higher number of association tests than cis-eQTL[536], which in turn reduces statistical power and hinders their detection[537]. Additionally, trans-eQTL suffer from the same confounding factors that influence their cis counterparts, be they biological (e.g. haplotype effects, tagging cis-eQTL), technical (variation in probe binding sites) or statistical (missing genotypes, population structure)[538].
The F1 hybrid method, in contrast, can avoid many of those caveats and reliably resolve the regulatory changes brought about by cis- and trans-acting variation on gene expression and TF occupancy. In this approach, sequence variation between the two parents allows allele-specific expression to be evaluated. In the F1 hybrid of the two F0 parental strains, cis-acting regulatory variants remain linked to their target gene, which is reflected in allele-specific expression. Trans-acting regulatory variants, by contrast, affect both parental alleles equally in the F1 hybrid because the alleles share the same nuclear environment. These two fundamentally different effects allow differential expression between the F0 strains to be compared with allele-specific expression in the F1 hybrids, resolving regulatory divergence in cis and trans across the entire transcriptome. Genes that are differentially expressed due to one or more regulatory variants acting in cis show a ratio of allele-specific expression in F1 hybrids equal to the ratio of expression between the parental strains. On the other hand, if a gene is differentially expressed between the parental strains but both alleles are expressed equally in the F1 hybrids, the difference is due to one or more trans-acting regulatory variants[260]. A major additional advantage is that, whereas eQTL studies require a large number of crosses/samples, the F1 hybrid method requires only two parental strains and their F1 hybrid for analysis[252]. This approach has been used to study allele-specific gene expression in F1 hybrids in yeast[539-541], fruit flies[542, 543], and mice[257, 260, 544, 545].
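The comparison described above can be sketched in a few lines. The example below assumes normalised read counts for one gene in each parental strain and for each allele in the F1 hybrid, and it decomposes the parental log2 ratio into a cis component (the F1 allelic log2 ratio) and a trans component (the remainder). The function name, the pseudocount and the fixed tolerance `tol` are hypothetical simplifications; published analyses use statistical tests on replicated counts rather than a hard threshold.

```python
from math import log2


def classify_regulatory_divergence(parent_a, parent_b, f1_a, f1_b,
                                   tol=0.5, pseudocount=1.0):
    """Label one gene as conserved, cis, trans or cis+trans.

    parent_a, parent_b: expression of the gene in each F0 strain.
    f1_a, f1_b: allele-specific expression of each parental allele in the F1.
    tol: ASSUMED log2 fold-change below which a ratio counts as "equal".
    Returns (label, cis_effect, trans_effect) with effects in log2 units.
    """
    parental = log2((parent_a + pseudocount) / (parent_b + pseudocount))
    hybrid = log2((f1_a + pseudocount) / (f1_b + pseudocount))

    cis_effect = hybrid               # divergence that persists in a shared nucleus
    trans_effect = parental - hybrid  # divergence removed by the shared nucleus

    parents_differ = abs(parental) > tol
    alleles_differ = abs(hybrid) > tol
    trans_present = abs(trans_effect) > tol

    if not parents_differ and not alleles_differ:
        label = "conserved"
    elif alleles_differ and not trans_present:
        label = "cis"        # F1 allelic ratio matches the parental ratio
    elif parents_differ and not alleles_differ:
        label = "trans"      # parents differ, F1 alleles are expressed equally
    else:
        label = "cis+trans"  # both components contribute (possibly opposing)
    return label, cis_effect, trans_effect


if __name__ == "__main__":
    examples = {
        "gene1": (400, 100, 200, 50),    # ratio preserved in the F1
        "gene2": (400, 100, 150, 150),   # ratio lost in the F1
        "gene3": (100, 100, 300, 75),    # allelic imbalance only in the F1
    }
    for gene, counts in examples.items():
        print(gene, classify_regulatory_divergence(*counts)[0])
```

The third example shows a case the two-way rule in the text does not cover directly: equal parental expression with unequal F1 alleles, which this sketch reports as combined cis and trans effects acting in opposite directions.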
To apply this method to polymorphic sites that are linked to regulatory variation, ChIP-seq peaks can be searched for sequences that align across a heterozygous base in F1 hybrids. An observed difference in binding signal intensity between one allele and the other suggests a possible allelic effect on TF binding. For a TF binding site with two alleles, the binding signal from both alleles would be equal in F0 and in F1 if no sequence effect is present, resulting in non-differential binding.
On the other hand, if the binding signal differs between the F0 strains or between the alleles in the F1 hybrids, this indicates sequence-specific effects on the regulation of TF occupancy. These could be due to cis- and/or trans-acting variation[493]. This type of analysis requires adapting the standard ChIP-seq analysis pipeline discussed above to accommodate sequence variants, which may introduce bias at the alignment step: at heterozygous sites, reads identical to the reference genome are aligned at a higher rate because of the mismatch penalty imposed on the non-reference allele. In the F1 hybrid analysis, two reference genomes may be created, each containing one allele of the variant site, making it possible to align reads separately to each strain genome and subsequently combine the aligned reads for further analysis[257]. Furthermore, there are allele-aware aligners that dynamically account for multiple alleles during alignment, such as the Genomic Short-read Nucleotide Alignment Program (GSNAP)[498]. In sum, this type of analysis requires particular care in the alignment of sequence variants in order to allow accurate detection of differential binding signals at TF binding sites.
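The two-genome strategy can be illustrated with a toy sketch. It assumes that the variants are SNPs only, so coordinates are shared between the two strain genomes, and that the position of each read is already known. A real aligner searches for the position and handles indels and base qualities. All function names and sequences are hypothetical.

```python
def apply_snps(reference, snps):
    """Build a strain-specific genome by substituting SNP alleles.

    snps maps a 0-based position to a (reference_base, alternative_base) pair.
    """
    seq = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        if seq[pos] != ref_base:
            raise ValueError(f"reference base mismatch at position {pos}")
        seq[pos] = alt_base
    return "".join(seq)


def count_mismatches(read, genome, start):
    """Number of mismatching bases when the read is placed at `start`."""
    window = genome[start:start + len(read)]
    return sum(r != g for r, g in zip(read, window))


def assign_allele(read, start, genome_a, genome_b):
    """Assign a read to the strain genome it matches with fewer mismatches."""
    mm_a = count_mismatches(read, genome_a, start)
    mm_b = count_mismatches(read, genome_b, start)
    if mm_a < mm_b:
        return "A"
    if mm_b < mm_a:
        return "B"
    return "ambiguous"  # read does not overlap an informative variant


if __name__ == "__main__":
    strain_a = "ACGTACGTACGT"                       # reference strain
    strain_b = apply_snps(strain_a, {5: ("C", "T")})

    read_from_b = "TATGTA"                          # covers the SNP at position 5
    # Against the reference alone this read carries a mismatch penalty,
    # which is the source of reference bias described in the text.
    print(count_mismatches(read_from_b, strain_a, 3))
    print(assign_allele(read_from_b, 3, strain_a, strain_b))
    print(assign_allele("ACGT", 8, strain_a, strain_b))
```

Reads labelled ambiguous still contribute to total peak signal but carry no allelic information, so only reads assigned to A or B enter the allele-specific comparison.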