3.4.1 Two distinct bias regions in RNA-seq data indicate two distinct molecular mechanisms †
Unlike the ChIP-seq data, the RNA-seq data examined show significant bias only in the nucleotides of the fragment itself, with negligible bias in the nucleotides immediately preceding the start of the fragment. It has already been proposed [40] that this bias is due to the binding of the random hexamers to the RNA, which is an early stage in the conversion of RNA to DNA by reverse transcriptase.
However, the modelling in this paper suggests that the situation is more complex than was revealed by previous analyses, which assumed a single bias pattern. Modelling using multiple PCMs shows that there are two clearly distinct regions. The first, which requires multiple alternative PCMs to fit the observed data, covers the first six nucleotides. The second covers the region from nucleotide seven onwards, where a single nucleotide bias pattern is observed that is virtually identical in all the data examined. This suggests that two distinct mechanisms are responsible for the PCM patterns in these two regions, which may provide more information on how random hexamer binding causes the observed bias.
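As a rough illustration of how position-specific bias of this kind can be tabulated, the following Python sketch counts nucleotide frequencies at each position from the fragment start. It is a single-matrix summary only and does not reproduce the multiple-PCM modelling used in this paper; the function name `position_frequencies` and the toy sequences are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"

def position_frequencies(fragment_starts, width):
    """Return one {nucleotide: frequency} dict per position, for the
    first `width` positions of each fragment start sequence."""
    freqs = []
    for pos in range(width):
        # Count only unambiguous bases in sequences long enough to reach pos.
        counts = Counter(
            seq[pos]
            for seq in fragment_starts
            if len(seq) > pos and seq[pos] in ALPHABET
        )
        total = sum(counts.values())
        freqs.append(
            {nt: (counts[nt] / total if total else 0.0) for nt in ALPHABET}
        )
    return freqs

# Hypothetical toy input: the first three bases of four fragments.
toy_starts = ["GAC", "GTC", "CAC", "GAT"]
freqs = position_frequencies(toy_starts, width=3)
```

On real data, `width` would be chosen to span both regions, for example the first six nucleotides and several positions beyond them, so that the two regions can be compared column by column.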
3.4.2 Random hexamer related RNA-seq bias in nucleotides 1-6 †
The six-nucleotide width of the first region is consistent with the hypothesis that the bias occurs as a result of the binding of the six-nucleotide-long random hexamers. In previous analyses that assumed a single bias pattern, it was not possible to explain the results in terms of binding energies [40]. The new, more detailed picture of the binding presented here should provide a better starting point for examining how DNA/RNA binding energies could give rise to the observed characteristic.
The asymmetry of the pattern in these six nucleotides is particularly striking, with a strong GC preference at the nucleotide at the 5’ end of the RNA. This would arise during the creation of the second strand of the DNA and may be an indication that binding is initiated at the 5’ end of the random primer. In addition, a preference for an initial GC binding may indicate that it is the three hydrogen bonds in this pairing, rather than the two bonds of the AT pairing, that make it more likely that the binding will start with a GC pairing. The pattern for the following nucleotides shows a significant correlation between adjacent nucleotide positions, with a tendency for runs of Us, Cs or As. This pattern of runs of alternative nucleotides was hidden in the previous analyses because they assumed that only a single bias pattern was present.
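To see why a single per-position pattern can hide such runs, note that a mixture of U-runs, C-runs and A-runs looks like a flat average at each individual position, while the dependence between neighbouring positions remains visible in pairwise statistics. The following Python sketch measures that dependence as the mutual information between adjacent positions. This is one possible diagnostic chosen for illustration, not the method used in this paper, and the function name and toy sequences are hypothetical.

```python
import math
from collections import Counter

def adjacent_mutual_information(seqs, pos):
    """Mutual information (in bits) between the nucleotide at `pos`
    and the nucleotide at `pos + 1` across a set of sequences."""
    pairs = [(s[pos], s[pos + 1]) for s in seqs if len(s) > pos + 1]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# Hypothetical toy input: pure runs, so adjacent positions always agree.
runs = ["UUUU", "CCCC", "AAAA"]
mi_runs = adjacent_mutual_information(runs, pos=0)

# Hypothetical toy input: adjacent positions vary independently.
independent = ["AA", "AC", "CA", "CC"]
mi_independent = adjacent_mutual_information(independent, pos=0)
```

In the run-dominated toy set the value is well above zero, whereas in the independent set it is zero, even though neither set shows a strong preference at any single position.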
Appendix D also shows in more detail the previously observed, virtually identical patterns for the 5’ end and the reverse-complemented 3’ end of the RNA fragments. The bias at the 5’ end will result from the binding of random DNA primers to the RNA as part of the process of creating the first strand of DNA, and the bias at the 3’ end is a result of random primer binding to the DNA in order to create the second DNA strand. The more detailed model shows that the biases from both stages are very similar, which suggests that random hexamer binding to DNA and to RNA is governed by very similar physical processes.
3.4.3 Reverse-transcriptase related bias from nucleotide seven onwards †
The consistency of the pattern from nucleotide seven onwards suggests that it is caused by a different mechanism from the one acting on the first six nucleotides. While these nucleotides may contribute to preferences in the binding of the random primer, they are likely to be of greater significance for the binding of the reverse transcriptase and the progression of the enzyme along the RNA or DNA.
One possible explanation is that there is a greater probability of binding and transcription occurring if the first nucleotide after the random primer is an A and the fourth is a U/T. These data may consequently provide a useful additional insight into nucleotide preferences of the reverse transcriptase used in the RNA-seq protocol.
3.4.4 Implications for correcting bias in RNA-seq †
When correcting for any bias that is introduced in RNA-seq data, the existence of two separate mechanisms will influence the way in which the correction might be made. A preference for the random primer to bind at certain locations will affect the distribution of the start sites of fragments, but not necessarily the number of fragments that are ultimately sequenced in a specific region.
However, a bias in the likelihood that the reverse transcriptase will bind and transcribe a fragment may result in the number of fragments in some regions being over- or underestimated. This new analysis suggests that the details of any process for removing bias from RNA-seq data may depend on how each of these two effects creates inaccuracies in the characteristics being investigated.
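The distinction between the two effects can be illustrated with a toy calculation. In the Python sketch below, a placement bias moves fragment starts to preferred sites within the same region and leaves the per-region counts unchanged, while a yield bias removes fragments from a region and changes its count. All coordinates, the region size and the recovery efficiency of one half are hypothetical values chosen for illustration, and the inverse-efficiency weighting is a generic correction, not a procedure proposed in this paper.

```python
from collections import Counter

def region_counts(starts, region_size, weights=None):
    """Sum (optionally weighted) fragment starts per fixed-size region."""
    if weights is None:
        weights = [1.0] * len(starts)
    totals = Counter()
    for start, weight in zip(starts, weights):
        totals[start // region_size] += weight
    return dict(totals)

REGION_SIZE = 100  # hypothetical region width in nucleotides

# Hypothetical unbiased fragment starts: six in each of two regions.
unbiased = [5, 25, 45, 65, 85, 95, 105, 125, 145, 165, 185, 195]

# Effect 1 (placement): starts pile up on preferred sites, same regions.
placement_biased = [5, 5, 45, 45, 85, 85, 105, 105, 145, 145, 185, 185]

# Effect 2 (yield): region 1 fragments assumed recovered at half efficiency.
yield_biased = [5, 25, 45, 65, 85, 95, 105, 145, 185]

# Generic correction: weight each fragment by 1 / assumed efficiency.
assumed_efficiency = {0: 1.0, 1: 0.5}  # hypothetical, per region
weights = [1.0 / assumed_efficiency[s // REGION_SIZE] for s in yield_biased]

counts_unbiased = region_counts(unbiased, REGION_SIZE)
counts_placement = region_counts(placement_biased, REGION_SIZE)
counts_yield = region_counts(yield_biased, REGION_SIZE)
counts_corrected = region_counts(yield_biased, REGION_SIZE, weights)
```

Under these assumptions only the yield bias distorts the per-region totals, so a count-based analysis would need to correct for it, whereas an analysis of start-site positions would also need to account for the placement bias.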