While 5′-specific RNA-seq provided an excellent resource even in its basic state, the format was not ideal for automated high-throughput interrogation of transcription start sites. Genome browsing software was available which could show individual reads aligned to the genome, but the most useful data format was the coverage plot, such as those displayed in Artemis (e.g. Fig. 4.6). The coverage plot represents the number of reads aligned to each base position, and takes into account every aligned base along the length of each read. However, the nature of 5′ RNA-seq was such that only the aligned position of the first base in each read was informative.
In order to accurately assess putative transcription start sites, it was beneficial to alter the format of the mapped data from a series of 50 base pair reads to one which represents only the 5′ end of each read. This process identified 5′ ends from a pileup of aligned reads. By searching pileup data for each chromosome from 5′ to 3′, the number of reads starting at each position was calculated. Alignment frequencies of the following 49 positions were then reduced by the same number, so that only the 5′ end positions remained. Moving sequentially along each chromosome from 5′ to 3′ meant that each base position had any non-5′ end alignments removed before being processed, ensuring accurate representation of 5′ end positions and mapping frequencies in the resulting pileup. This process was designed in collaboration with Dr H. Wu, who developed novel software to perform this task. Appendix 2.3 contains software produced by Dr H. Wu for RH identification and all future elements of this analysis, as well as a description of software utilisation (provided by Dr Wu).
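The sweep described above can be sketched in a few lines of Python. This is a minimal illustration only, not the software written by Dr H. Wu (see Appendix 2.3). It assumes that coverage for one chromosome is held as a list of integers indexed from the 5′ end, that every read aligns over exactly `read_length` bases, and that only one strand is considered. The function name and its parameters are hypothetical.

```python
def coverage_to_five_prime_counts(coverage, read_length=50):
    """Convert per-base read coverage into per-base counts of read 5' ends.

    Assumptions (not taken from the original software): `coverage` is a list
    of non-negative integers for one chromosome, ordered 5' to 3', and every
    read covers exactly `read_length` consecutive bases.
    """
    residual = list(coverage)          # working copy; the input is left intact
    starts = [0] * len(residual)

    for pos in range(len(residual)):
        n = residual[pos]              # coverage left after removing upstream reads
        if n <= 0:
            continue
        starts[pos] = n                # these n reads must begin at this position
        # Reduce the following (read_length - 1) positions by the same number,
        # so that only coverage from reads starting further 3' remains.
        for offset in range(1, read_length):
            nxt = pos + offset
            if nxt >= len(residual):
                break
            residual[nxt] -= n

    return starts
```

Because positions are processed strictly from 5′ to 3′, any coverage still present at a position when it is reached can only come from reads that begin there, which is the property the procedure relies on.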
The chromosomal position of the first mapped nucleotide at the 5′ end of each read was identified as the readhead (RH). From this we were able to find the frequency of RH mapping to each base position and convert read coverage data to a new format listing the number of reads, and therefore the number of transcripts, which start at each base (Fig. 4.7).
Figure 4.7. Conversion of mapped read alignments to RH frequencies. Standard RNA-seq read coverage data (blue) is shown against base position. The start position of each read was identified and mapped independently of other bases to convert this to RH coverage (red), which mapped only the base at the 5′ end of each read. Converting the data in this way greatly simplifies visualisation and further manipulation of TSS data.
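The difference between the two representations in Fig. 4.7 can be illustrated with a small sketch that derives both from the same list of read start positions. It is an assumption-laden toy example: positions are 0-based, all reads lie on one strand and share a single length, and the function names are hypothetical rather than part of the software used in this work.

```python
from collections import Counter


def read_coverage(read_starts, read_length, chrom_length):
    """Standard coverage: every base spanned by a read is counted once."""
    coverage = [0] * chrom_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, chrom_length)):
            coverage[pos] += 1
    return coverage


def readhead_frequency(read_starts, chrom_length):
    """RH coverage: only the first (5') base of each read is counted."""
    counts = Counter(read_starts)
    return [counts.get(pos, 0) for pos in range(chrom_length)]
```

For three reads of length 4 starting at positions 2, 2 and 5, standard coverage spreads over positions 2 to 8, whereas the RH frequency is non-zero only at positions 2 and 5.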
Visualisation of both standard and RH mapping data in Artemis revealed a substantial amount of background mapping. Many bases throughout the genome had only 1-2 reads aligned, a number far too low for the position to be considered a candidate transcription start site. While the majority of these bases appeared to map randomly, a large number were noted within annotated genes and shortly downstream of well-mapped transcription start sites, possibly representing the 5′ ends of partially degraded mRNAs. This posed a problem when designing novel software for downstream analysis of the RH data, as such software might have identified low coverage positions as potential transcription start sites. To circumvent this issue, these positions were filtered out prior to any further analysis.
Filtering of any nucleotide position with low RH mapping frequency was performed using novel software that set the RH frequency to 0 at any base where it was below a given threshold (Chapter 2.5.5). Various minimum RH values were tested and the resultant RH mapping files inspected visually in Artemis to determine an optimum cut-off. A minimum RH value of 3 was found to eliminate the majority of background noise, while having a negligible impact on mapping around sites of high RH frequency.
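The thresholding step amounts to a single pass over the RH frequencies. The sketch below is illustrative only and is not the software described in Chapter 2.5.5; it assumes the RH frequencies for one chromosome are held as a list of integers, and the function name and the `min_rh` parameter are hypothetical.

```python
def filter_low_readheads(rh_counts, min_rh=3):
    """Set the RH frequency to 0 wherever it falls below `min_rh`.

    Positions with a frequency equal to or above the threshold are kept
    unchanged; the input list is not modified.
    """
    return [n if n >= min_rh else 0 for n in rh_counts]
```

With the default threshold of 3, positions supported by only 1 or 2 readheads are zeroed, while a position with exactly 3 readheads is retained.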
Having removed background mapping, the remaining positions to which readheads were aligned were taken to represent genuine 5′ ends of sequenced transcripts. This allowed regions of significant RH clustering to be more easily identified as potential transcription start sites.
4.7. Identification of TSS regions
Previous studies into the internal structure of core promoters have shown that transcription start sites are not absolute positions within the promoter. The FANTOM3 (functional annotation of mouse 3) project applied cap analysis of gene expression (CAGE) methods (Kodzius et al., 2006; Shiraki et al., 2003) to 20 tissues from mouse and human (Carninci et al., 2005; Carninci et al., 2006). In many cases, initiation of transcription was found to occur at multiple nucleotide positions within a core promoter region. This suggested that most core promoters do not have a single TSS, but rather a number of closely located initiation sites. These sites form distinct TSS regions and are conceptually different from alternative promoters, in which core promoters are separated by clear genomic space.
Visual inspection of the filtered RH data alongside standard whole transcriptome data and the current gene model annotation showed excellent correlation between regions of RH mapping and the annotated 5′ ends of genes. The majority of RHs mapped in clusters close to predicted start sites, often with a distribution of multiple large peaks surrounded by a range of positions of lower RH frequency. This supported the hypothesis that transcription start sites are not confined to a single base, instead presenting as regions in which transcription can be initiated. In an attempt to characterise these regions, a sample of 100 genes with high levels of RH mapping at the 5′ end was selected at random, and the RH distribution was investigated for the start site region of each gene.
Selecting the nucleotide with the highest RH mapping frequency as the predominant transcription start site for each region, the distribution of RH mapping both upstream and downstream of this position was investigated. In 98% of cases nucleotides with substantial levels of RH mapping were confined to within a 60 bp range in either direction from the predominant start site. In those cases where mapping was observed beyond this 60 bp limit, the frequency of RH mapping was extremely low. While transcripts starting at these positions are probably real, their scarcity meant that they were disregarded during further analysis, which focused on the TSS of more common transcripts. Therefore, a region of 121 bp centred on each predominant TSS (the predominant site plus 60 bp on either side) was defined to investigate the spread of RH distribution. These regions were identified on a global scale by Dr H. Wu.
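As an illustration of how such a region could be defined, the sketch below selects the position with the highest RH frequency within a search interval and returns the 121 bp window centred on it. It is a simplified example, not the software used for the global analysis: positions are 0-based, a single strand is assumed, and the function names, the search interval and the `flank` parameter are hypothetical.

```python
def predominant_tss(rh_counts, search_start, search_end):
    """Return the position with the highest RH frequency in
    [search_start, search_end); ties resolve to the most 5' position."""
    return max(range(search_start, search_end), key=lambda pos: rh_counts[pos])


def tss_window(peak, chrom_length, flank=60):
    """Return inclusive (start, end) bounds of the region centred on `peak`.

    With flank=60 the region spans 60 + 1 + 60 = 121 bp, clipped at the
    chromosome ends.
    """
    return max(0, peak - flank), min(chrom_length - 1, peak + flank)
```

For example, a predominant TSS at position 150 on a sufficiently long chromosome gives the inclusive window 90 to 210, which contains 121 positions.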