Chapter III: External Evaluation
3.1. Three-Dimensional Analysis of Nations:
3.1.1. National Interests: National Interest Matrix (MIN)
We begin with an informal description of our algorithm. Its main idea is to use local coverage information to decide where to place a read if multiple alignment locations exist. In the simplest case, two genomic locations share an identical sequence part and each also has a non-identical part. Reads aligning to the non-identical part can be used to infer the expected coverage of the gene, which in turn can be used to infer the number of alignments to the identical part in each locus. Following the example in Figure 2.13, we would use the level of blue reads to infer the desired number of red reads and use this information to assign the red reads to one of the two locations. An optimal assignment would place each read at a location such that alternative regions optimally fit into their non-alternative contexts. However, testing all possible combinations of read assignments to all mapping locations is computationally infeasible. Therefore, we suggest an iterative approach that alters the alignment location of only one read at a time, keeping all other reads fixed, and applies this sequentially to all reads. Repeating this process for several iterations converges to a local optimum in global read coverage.
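The iterative scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the `(start, length)` alignment representation, the names `place`, `reassign_reads` and `per_locus_variance`, the toy genome of two 8 nt loci, and the per-locus variance used as a cost are all hypothetical stand-ins.

```python
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple

Alignment = Tuple[int, int]  # hypothetical representation: (start, length)


def place(coverage: List[int], aln: Alignment, delta: int) -> None:
    """Add (delta=+1) or remove (delta=-1) one read from the coverage track."""
    start, length = aln
    for pos in range(start, start + length):
        coverage[pos] += delta


def reassign_reads(
    genome_length: int,
    candidates: Sequence[Sequence[Alignment]],
    cost: Callable[[List[int]], float],
    max_iterations: int = 10,
) -> List[int]:
    """Return, per read, the index of the candidate alignment finally chosen."""
    chosen = [0] * len(candidates)  # start with the first candidate of each read
    coverage = [0] * genome_length
    for alns in candidates:
        place(coverage, alns[0], +1)

    for _ in range(max_iterations):
        changed = False
        for read, alns in enumerate(candidates):
            if len(alns) < 2:
                continue  # uniquely mapped reads are never moved
            place(coverage, alns[chosen[read]], -1)  # lift this read out
            costs = []
            for aln in alns:  # try every location, all other reads kept fixed
                place(coverage, aln, +1)
                costs.append(cost(coverage))
                place(coverage, aln, -1)
            best = chosen[read]
            for idx, value in enumerate(costs):
                if value < costs[best]:  # move only on strict improvement
                    best = idx
            changed = changed or best != chosen[read]
            chosen[read] = best
            place(coverage, alns[best], +1)
        if not changed:
            break  # no read moved in this pass: a local optimum is reached
    return chosen


def per_locus_variance(coverage: List[int]) -> float:
    """Stand-in cost (assumption): variance summed over two toy loci of 8 nt."""
    return pvariance(coverage[:8]) + pvariance(coverage[8:16])


unique = [[(0, 4)]] * 3 + [[(8, 4)]]   # reads on the non-identical parts
ambiguous = [[(12, 4), (4, 4)]] * 4    # reads on the identical part
print(reassign_reads(16, unique + ambiguous, per_locus_variance))
```

In this toy input, three unique reads support the first locus and one supports the second, so three of the four ambiguous reads end up at the first locus and one stays at the second.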
However, this approach only works if we assume a uniform coverage over the length of the gene. An idealized sequencing process would sample reads from a source sequence following a uniform distribution. That is, each read can originate from any location with the same probability. This is not the case for real sequencing samples. Due to various biases in different parts of the sequencing process, such as priming, amplification or fragmentation, the reads show a non-uniform distribution over the length of a transcript or gene [31, 67, 106, 210]. Therefore, the assumption that read coverage is uniform over the length of a whole gene is inaccurate. However, most of these biases act over a longer range of several hundred bases or have an effect that is sequence specific and thus locally similar. Hence, within a local window the distribution of reads is much more uniform. Based on this observation, we make the reasonable assumption that the coverage of a given transcript is relatively smooth, that is, the difference in coverage between neighboring positions is small. Stated differently, we assume that the coverage within a small local window is almost uniform. Hence, we apply the procedure described above not on the level of genes but rather on windows around the alignment location. This central assumption of the algorithm is violated if gene structure and alternative usage of isoforms within a gene influence smoothness even within a local window. To resolve this, the algorithm is able to take known structures into account. We will discuss this in the context of MiTie [25] in Section 2.4.3.
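The difference between gene-wide and local uniformity can be illustrated with a small numerical sketch. The coverage track below is synthetic: the linear bias, the transcript length of 1,000 nt and the window position are hypothetical values chosen only for illustration and are not taken from any real sample.

```python
from statistics import variance

# Synthetic coverage with a slow linear bias along a 1,000 nt transcript
# (hypothetical numbers, chosen only for illustration).
coverage = [10 + 0.1 * pos for pos in range(1000)]

gene_variance = variance(coverage)            # over the whole transcript
local_variance = variance(coverage[500:520])  # over one 20 nt window

print(f"whole transcript: {gene_variance:.2f}")
print(f"20 nt window:     {local_variance:.2f}")
```

Under a bias of this kind, the variance within the window is orders of magnitude smaller than the variance over the whole transcript, which is the behavior the local smoothness assumption relies on.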
Following the idea described above and using the assumption of locally smooth coverage, the whole set of possible alignments for a given read is evaluated with the goal of identifying the mapping that results in the locally smoothest coverage. In this context, we measure smoothness as the empirical variance of the position-wise coverage in a window around the alignment location. The algorithm then minimizes the variance over all possible alignment locations, choosing the alignment with the smoothest coverage as optimal. This works as follows. Given an input of k different alignments for a given read, one alignment is designated as the current best. Depending on user preference, this is either an arbitrary alignment or the mapping with the highest alignment quality. The current best mapping is then compared to each of the remaining mapping possibilities in a pairwise manner. For a single comparison, four variance values are computed. Given two possible alignments $a_1$ and $a_2$ to the genomic start locations $l_1$ and $l_2$, respectively, the score $v_1^+$ is the local variance around genomic location $l_1$ if $a_1$ is mapped to that location, and $v_1^-$ is the local variance if it is mapped somewhere else; $v_2^+$ and $v_2^-$ are defined analogously using the alignment $a_2$ to genomic locus $l_2$. In each case, the score is defined as the empirical variance over the genomic coverage of all window positions

$$v_1 = \frac{1}{k_1 - 1} \sum_{i=0}^{k_1 - 1} \left( a_1[l_1 + i] - \frac{1}{k_1} \sum_{j=0}^{k_1 - 1} a_1[l_1 + j] \right)^2$$

where $k_1$ is the number of positions in a window around alignment $a_1$ and $a_1[i]$ indicates the coverage at genomic position $i$. The window length k defaults to 20 nt and can be adapted by the user. If an alignment is present within the window, it influences the coverage and thus the local variance. After computing all four values, $a_1$ is chosen if

$$v_1^+ + v_2^- < v_1^- + v_2^+$$

is true; otherwise $a_2$ is chosen. A schematic visualization of the MMR principle is shown in Figure 2.14.
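A single pairwise comparison can be sketched as follows. This is an illustrative sketch, not the MMR implementation. It assumes that the two windows do not overlap, that a window starts at the alignment start and spans k positions (as in the formula above), and that an alignment is an ungapped `(start, length)` tuple; the function names are hypothetical.

```python
from statistics import variance  # sample variance, 1 / (k - 1) normalisation
from typing import List, Tuple

Alignment = Tuple[int, int]  # hypothetical representation: (start l, read length)


def window_variance(coverage: List[int], start: int, k: int = 20) -> float:
    """Empirical variance of the coverage at positions start .. start + k - 1."""
    return variance(coverage[start:start + k])


def with_read(coverage: List[int], aln: Alignment) -> List[int]:
    """Copy of the coverage track with one read added at the given alignment."""
    start, length = aln
    placed = list(coverage)
    for pos in range(start, start + length):
        placed[pos] += 1
    return placed


def prefers_first(coverage: List[int], a1: Alignment, a2: Alignment, k: int = 20) -> bool:
    """True if a1 is chosen; `coverage` must not yet contain the read itself."""
    v1_plus = window_variance(with_read(coverage, a1), a1[0], k)
    v1_minus = window_variance(coverage, a1[0], k)
    v2_plus = window_variance(with_read(coverage, a2), a2[0], k)
    v2_minus = window_variance(coverage, a2[0], k)
    return v1_plus + v2_minus < v1_minus + v2_plus


coverage = [0] * 40
coverage[0:12] = [3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3]  # dip at positions 4..7
coverage[20:32] = [1] * 12                             # flat second locus
print(prefers_first(coverage, (4, 4), (24, 4), k=8))
```

In this toy track, placing the read at the first location fills the dip and yields a flat window, whereas placing it at the second location creates a step, so the first alignment is preferred. Note that `statistics.variance` requires at least two values, so a window clipped to a single position at the end of the track would need separate handling.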
A major complication arising during the computation of $v_1^+$, $v_1^-$, $v_2^+$ and $v_2^-$ is the special case that occurs when the windows of $a_1$ and $a_2$ share common positions. In this situation, two different scenarios can occur:
i) the windows share positions but the alignments do not share positions,
ii) the alignments share positions.
As the read is placed at either one location or the other, in case i) the computation of $v_1^-$ needs to consider the coverage contributed by $a_2$, as this will be placed instead of $a_1$, and $v_2^-$ needs to consider the coverage contributed by $a_1$. In case ii), a subset of positions shared by $a_1$ and $a_2$ is not altered by the decision. These positions can be masked and left out of the computation, as they contribute to both locations and thus do not change the result.
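Both cases can be handled in one routine, sketched below under the same simplifying assumptions as before: ungapped `(start, length)` alignments, a window that starts at the alignment start and spans k positions, and hypothetical function names. For case i), the variances without the read at one location are computed on a track in which the read has been placed at the other location. For case ii), positions covered by both alignments are masked.

```python
from statistics import variance
from typing import List, Set, Tuple

Alignment = Tuple[int, int]  # hypothetical representation: (start l, read length)


def positions(aln: Alignment) -> Set[int]:
    """Genomic positions covered by an ungapped alignment."""
    start, length = aln
    return set(range(start, start + length))


def with_read(coverage: List[int], aln: Alignment) -> List[int]:
    """Copy of the coverage track with one read added at the given alignment."""
    placed = list(coverage)
    for pos in positions(aln):
        placed[pos] += 1
    return placed


def four_variances(coverage: List[int], a1: Alignment, a2: Alignment, k: int = 20):
    """Return (v1+, v1-, v2+, v2-); `coverage` excludes the read being placed."""
    read_at_a1 = with_read(coverage, a1)
    read_at_a2 = with_read(coverage, a2)
    shared = positions(a1) & positions(a2)  # case ii): unaffected by the decision

    def masked_variance(track: List[int], start: int) -> float:
        window = range(start, min(start + k, len(track)))
        return variance([track[pos] for pos in window if pos not in shared])

    return (
        masked_variance(read_at_a1, a1[0]),  # v1+: read placed at a1
        masked_variance(read_at_a2, a1[0]),  # v1-: read placed at a2 instead
        masked_variance(read_at_a2, a2[0]),  # v2+: read placed at a2
        masked_variance(read_at_a1, a2[0]),  # v2-: read placed at a1 instead
    )


coverage = [2] * 20
coverage[6:9] = [1, 1, 1]  # dip that the second alignment would fill
v1_plus, v1_minus, v2_plus, v2_minus = four_variances(coverage, (2, 3), (6, 3), k=8)
print(v1_plus + v2_minus < v1_minus + v2_plus)
```

In this example the windows overlap while the alignments do not (case i). Because $v_1^-$ is computed with the read placed at the second alignment, the window around the first location is already flat in that scenario, and the second alignment is chosen. Masking can leave fewer than two positions in a window, in which case `statistics.variance` raises an error; a complete implementation would have to treat that case explicitly.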
[Figure 2.14 (schematic). Legend: read pair, variance measure, evaluation window, coverage, average coverage. Panel labels: "Read not mapped to location 1 ... but mapped to location 2"; "Read mapped to location 1 ... and not mapped to location 2"; "Assignment 1"; "Assignment 2".]
Figure 2.14: Schematic overview of the principle to resolve ambiguous read mappings. The candidate read pair in gray has two possible alignments, $a_1$ to location 1 (left) and $a_2$ to location 2 (right). Variance measures (yellow) are computed for both locations, with and without the read pair. The variance values from the text correspond to the schema as follows: $v_1^-$: location 1 (top), $v_1^+$: location 1 (bottom), $v_2^+$: location 2 (top), $v_2^-$: location 2 (bottom). The evaluation windows are shown in red and the coverage of placed reads as black solid lines.

Our approach can easily be extended to also work for paired-end RNA-Seq alignments. In this case, a preprocessing step creates all possible valid pairs of alignments of the two mates. An alignment pair is valid if the corresponding alignments do not overlap in a conflicting manner. For instance, a conflict would occur if the first read mate is aligned into the intronic portion of the second read mate, if both reads are aligned in the same direction, if the reads align to different chromosomes, or if the distance between both alignments is outside of a user-defined maximum range. After this preprocessing step, each alignment pair is treated as a single alignment possibility $a_k$ and the algorithm above is applied. As the number of possible pairs is quadratic in the number of alignments in the worst case, the number of allowed pairs can be limited by the user.
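The pairing step can be sketched as a filter over all combinations of mate alignments. The sketch below is illustrative only. The `Alignment` class, its fields, the half-open block coordinates, the symmetric intron check, and the definition of distance as the outer span of the pair are assumptions made for this example; `max_distance` and `max_pairs` stand in for the user-defined limits mentioned above.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


@dataclass(frozen=True)
class Alignment:
    """Hypothetical mate alignment: aligned blocks separated by introns."""
    chrom: str
    strand: str                   # '+' or '-'
    blocks: Tuple[Interval, ...]  # sorted, non-overlapping aligned segments

    @property
    def start(self) -> int:
        return self.blocks[0][0]

    @property
    def end(self) -> int:
        return self.blocks[-1][1]

    def introns(self) -> List[Interval]:
        return [(a[1], b[0]) for a, b in zip(self.blocks, self.blocks[1:])]


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def hits_intron(x: Alignment, y: Alignment) -> bool:
    """True if any aligned block of x falls into an intron of y."""
    return any(overlaps(blk, intron) for blk in x.blocks for intron in y.introns())


def is_valid_pair(m1: Alignment, m2: Alignment, max_distance: int) -> bool:
    if m1.chrom != m2.chrom:      # mates on different chromosomes
        return False
    if m1.strand == m2.strand:    # mates aligned in the same direction
        return False
    if hits_intron(m1, m2) or hits_intron(m2, m1):  # checked both ways (assumption)
        return False
    span = max(m1.end, m2.end) - min(m1.start, m2.start)  # outer span (assumption)
    return span <= max_distance


def valid_pairs(
    first: Sequence[Alignment],
    second: Sequence[Alignment],
    max_distance: int,
    max_pairs: Optional[int] = None,
) -> List[Tuple[Alignment, Alignment]]:
    """All valid mate pairs, optionally truncated to the first `max_pairs`."""
    pairs = [(m1, m2) for m1, m2 in product(first, second)
             if is_valid_pair(m1, m2, max_distance)]
    return pairs if max_pairs is None else pairs[:max_pairs]
```

Each returned pair can then be treated as one alignment possibility in the pairwise comparison. Truncating the list is only one possible way to limit the number of pairs; the text does not specify how the limit is applied.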