Our evaluation specifically covers the most important addition described in this thesis: the variation-aware alignment. We do not discuss the general performance of PALMapper in comparison to other alignment approaches, but rather focus on improvements in the specific setting in which RNA-Seq data and reference genome show substantial sequence differences. For the evaluation, we considered two datasets: an artificial dataset of simulated data, for which we had full control over the read generation, and a biological dataset, produced from two related subspecies of A. thaliana, to evaluate the performance in the context of natural variation. We begin by describing the evaluation on the simulated data and subsequently discuss the biological dataset.
Figure 2.5: Identifying junction combinations for junction remapping. A: List of available junctions. B: Possible combination of junctions. C: Alternative combination of junctions. Genome sequence is blue, reads are green. Junction spans are indicated as dark gray solid lines.
Evaluation on Simulated Data. The general aim of this evaluation was to measure how much single-nucleotide differences between the RNA-Seq source and the reference sequence influence alignment performance, and to quantify the improvement when variation-aware alignment was used. To answer these questions, we constructed an RNA-Seq dataset originating from a heterozygous genome, generating the same number of reads from each haplotype. Consequently, when the reads are aligned back to the genome in an optimal way, heterozygous positions should show no difference in read coverage between the two alleles. Any measurable deviation for one of the two alleles would be due to the alignment procedure. To generate such a set of reads, we randomly chose 5,000 genes from the TAIR10 genome annotation for A. thaliana and used the FluxSimulator [102] (version 1.1.1-20121103021450) to sample 10^7 reads of length 76 nt from these genes. We chose the default error model and selected a normal distribution with mean 300 and standard deviation 50 as the insert size distribution. A list of all simulation parameters is provided in Appendix A.1. The read set was then duplicated into two identical read sets, simulating the contribution of two parents. One of the two read sets was then mutated with a uniform mutation rate of 10^-4 to randomly introduce single-base substitutions, thus generating heterozygous positions present in the read set but not in the reference genome. Given an estimated substitution rate of 1 mutation per genome per generation [183] and a generation time of ≈ 5 weeks for Arabidopsis, the last common ancestor of the two simulated individuals would date back 500 years. In total, we altered 2,951 positions in the sequence of the 5,000 genes. As we mutated the read sets and not the source genome, we expect no biases from statistical fluctuations due to the expression model of FluxSimulator.
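The duplicate-and-mutate step can be illustrated with a small sketch. This is not the procedure used to build the dataset, only a toy version under stated assumptions: reads are unspliced, error-free (start, sequence) pairs on a single made-up reference, substitution sites are drawn per reference position, and the names choose_variants and apply_variants are hypothetical. The rate is inflated relative to 10^-4 so that the short toy reference is likely to receive a few variants.

```python
import random

BASES = "ACGT"

def choose_variants(reference, rate, rng):
    """Pick substitution sites: each position mutates with probability `rate`."""
    variants = {}
    for pos, base in enumerate(reference):
        if base in BASES and rng.random() < rate:
            # Substitute with one of the three other bases.
            variants[pos] = rng.choice([b for b in BASES if b != base])
    return variants

def apply_variants(reads, variants):
    """Rewrite every read overlapping a variant site; reads are (start, seq) pairs."""
    mutated = []
    for start, seq in reads:
        chars = list(seq)
        for offset in range(len(chars)):
            alt = variants.get(start + offset)
            if alt is not None:
                chars[offset] = alt
        mutated.append((start, "".join(chars)))
    return mutated

rng = random.Random(0)
# Toy reference and reads (assumption: 76 nt reads tiled every 10 nt).
reference = "".join(rng.choice(BASES) for _ in range(2000))
reads = [(s, reference[s:s + 76]) for s in range(0, 1900, 10)]

variants = choose_variants(reference, 0.01, rng)  # inflated toy rate, not 1e-4
parent_a = list(reads)                            # unmodified copy
parent_b = apply_variants(reads, variants)        # mutated copy
merged = parent_a + parent_b                      # equal contribution per parent
```

Because both copies contain the same reads before mutation, every variant site in the merged set is covered by equally many reads of each allele, which is the property the evaluation relies on.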
The two read sets were then merged and used in two different alignment settings. In the first setting, we aligned the reads to the same genome they were originally sampled from. In the second setting, we used the variation-aware alignment to the same genome, taking the list of altered positions into account. As the dataset was artificially constructed from the same parent, only the artificially introduced variant positions were heterozygous, each with the same allele frequency of 0.5.
[Figure 2.6 plot: "Coverage ratio of heterozygous alleles"; x-axis: genes; y-axis: log ratio; series: norm, varVarInd]
Figure 2.6: Comparison of allele-specific alignment performance at a set of artificial variant locations. Shown is the log-ratio of the two alleles at all simulated heterozygous loci for the alignments with (green, varVarInd) and without (red, norm) the variation-aware extension. The optimal alignment set would show no deviation from zero for any gene.
To assess how well the alignment was able to reconstruct the allele frequencies at variant positions, we computed the log-ratio of the number of reads carrying one allele over the number of reads carrying the other. This results in a value of 0 if both alleles occur at the same frequency, and in a value above or below zero if the first or the second allele is overrepresented, respectively. With the variation-aware alignment, a substantially larger number of variant positions showed equal allele frequencies than with the alignment without variant information. A diagram of the results can be found in Figure 2.6.
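The log-ratio described above can be sketched as follows. This is a minimal illustration rather than the code used for Figure 2.6: the base-2 logarithm, the pseudocount that guards against zero counts, and the function name allele_log_ratio are all assumptions.

```python
import math

def allele_log_ratio(first_count, second_count, pseudocount=0.5):
    """Log2 ratio of reads carrying the first allele over reads carrying the second.

    The pseudocount (an assumption, not taken from the text) avoids division
    by zero when one allele has no supporting reads.
    """
    return math.log2((first_count + pseudocount) / (second_count + pseudocount))

balanced = allele_log_ratio(10, 10)       # equal support for both alleles
first_biased = allele_log_ratio(20, 10)   # first allele overrepresented
second_biased = allele_log_ratio(10, 20)  # second allele overrepresented
```

Equal counts give exactly zero, and swapping the two counts flips the sign, so a systematic bias toward the reference allele appears as a consistent offset in one direction.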
Evaluation on Biological Data. To assess the alignment sensitivity of variation-aware alignment in a biological setting, we used RNA-Seq data published in earlier work [92]. The two A. thaliana ecotypes Col-0 (originating from Columbia, USA) and Can-0 (originating from the Canary Islands, Spain) were two of the evolutionarily most distant sub-species analyzed in [92] and showed a substantial amount of sequence variation between their genomes. To test the effect of the variation-aware extension on alignment sensitivity, we aligned RNA-Seq reads originating from Can-0 to the Can-0 genome, to the Col-0 genome, and to the Col-0 genome with additional information about the sequence variation. The original data was split into 23 chunks of 250,000 reads each, using the UNIX split command. All chunks were then aligned independently. Even without the variation-aware extension, PALMapper shows a higher sensitivity than comparable state-of-the-art tools (TopHat [287]; TH CA and TH CO in Figure 2.7), and its sensitivity increases further when using the variation-aware index (Figure 2.7, PM COvi) and the variation-aware local alignment (PM COv). The fully variation-aware alignment, using the improved index and local alignment (PM COvvi), shows almost the same sensitivity as the alignment to the original Can-0 genome (PM CA). As discussed above, this additional sensitivity is mainly caused by alignments over regions in the genome that show variability in the reference. Although a sensitivity improvement of 2% seems only moderate, it can be essential for the analysis of allele-specific expression or in the context of genome-wide association studies, where the link between expression differences and the genetic background is investigated.
Legend: PM = PALMapper; TH = TopHat; CO = aligned to Col-0; CA = aligned to Can-0; vi = variant indexing; v = variant alignment; vvi = v and vi together.
Figure 2.7: Sensitivity of variation-aware alignments on a biological dataset. From left to right, the bars show the percentage of aligned Can-0 reads for 7 different alignment settings: PALMapper alignment to Col-0, PALMapper alignment to Can-0, TopHat alignment to Col-0, TopHat alignment to Can-0, PALMapper alignment to Col-0 plus variant-aware index, PALMapper variant-aware alignment to Col-0, PALMapper variation-aware alignment to Col-0 plus variant-aware index. The red dashed line shows that the fully variation-aware alignment (rightmost bar) is almost as sensitive as the alignment to the Can-0 genome (second bar from left). Error bars indicate the standard error of the mean over replicates of 23 read chunks with 250,000 reads per chunk.
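The per-setting sensitivity and its standard error over read chunks can be computed along the following lines. This is a sketch under assumptions: the sample standard deviation (n - 1 denominator) is used, the function name sensitivity_summary is hypothetical, and the counts in the example are toy numbers chosen for easy arithmetic, not results from this evaluation.

```python
import math
import statistics

def sensitivity_summary(aligned_per_chunk, chunk_size):
    """Mean percentage of aligned reads and its standard error across chunks."""
    percentages = [100.0 * n / chunk_size for n in aligned_per_chunk]
    mean = statistics.mean(percentages)
    # Standard error of the mean: sample standard deviation / sqrt(number of chunks).
    sem = statistics.stdev(percentages) / math.sqrt(len(percentages))
    return mean, sem

# Toy example with three chunks of 250,000 reads (made-up aligned-read counts).
toy_counts = [200_000, 205_000, 195_000]
mean_pct, sem_pct = sensitivity_summary(toy_counts, 250_000)
```

With the 23 chunks used here, the same function would take a list of 23 aligned-read counts per alignment setting and return the bar height and error bar for that setting.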