Evaluation of general statistics. To get a first overview, we evaluated each submitted alignment set with respect to the following criteria:

• Distribution of edit operations. All edit operations were classified as mismatches, deletions, or insertions. For each category, we computed its distribution as the average number of edit operations per position over the length of the read.

• Distribution of split positions. Split positions are the end positions of the read segments that are implied by spliced alignments. The distribution was computed as the number of split positions per position over the read length. Multiple split positions per alignment were counted individually.

• Alignment error rate. The error rate was computed as the fraction of alignments that showed a mismatch either at an alignment position or for a certain quality value. We computed two distributions: one per read position over the length of the reads, and one per quality value over the full quality range.

• Distribution of quality values. For each alignment position, we determined its average quality value over all alignments.

All information used to compute the statistics was inferred directly from the CIGAR string (a specific alignment representation in the SAM format) or from the sequence and quality strings present in the alignment files.
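As an illustration of how such statistics can be read off a CIGAR string, the following is a minimal sketch and not the evaluation code used here. The function names `cigar_events` and `per_position_rate` are hypothetical, and positions are recorded as the number of read bases consumed before the event, which is an assumption about the coordinate convention. Mismatches are not covered, since a plain `M` operation does not distinguish matches from mismatches.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation character.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_events(cigar):
    """Return read offsets of insertions, deletions, and split positions.

    An offset is the number of read bases consumed before the event
    (assumed convention for this sketch).
    """
    events = {"insertion": [], "deletion": [], "split": []}
    read_pos = 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "I":
            # Every inserted base counts as one edit operation.
            events["insertion"].extend(range(read_pos, read_pos + length))
            read_pos += length
        elif op == "D":
            events["deletion"].append(read_pos)
        elif op == "N":
            # A skipped reference region implies a split of the read.
            events["split"].append(read_pos)
        elif op in "MS=X":
            read_pos += length
        # H and P consume no bases of the stored read sequence.
    return events

def per_position_rate(cigars, kind, read_length):
    """Average number of events of one kind per read offset."""
    counts = Counter()
    for cigar in cigars:
        counts.update(cigar_events(cigar)[kind])
    return [counts[pos] / len(cigars) for pos in range(read_length + 1)]
```

For example, `cigar_events("10M100N5M2I3M")` reports one split at offset 10 and two inserted bases at offsets 15 and 16.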

Agreement to the Annotation. We used each alignment set to compute the agreement of its predicted intron positions to the given annotation. Agreement was measured by the F-score, which is the harmonic mean of precision (ratio of true positive introns over predicted introns) and recall (ratio of true positive introns over annotated introns). To optimize each submission's agreement to the annotation individually, we determined optimal filter settings for each submission. For this, we performed an exhaustive search over a grid of 700 different filter parameter combinations, computed the corresponding F-scores, and retained only the best for comparison. For a detailed list of tested parameters, we refer to Appendix A.2.
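The F-score and the selection of the best filter setting can be sketched as follows. This is an illustrative example only: introns are assumed to be hashable tuples such as (chromosome, start, end), the names `f_score` and `best_filter_setting` are hypothetical, and the filter settings are opaque labels, since the actual parameters are listed in Appendix A.2.

```python
def f_score(predicted_introns, annotated_introns):
    """Harmonic mean of precision and recall over two intron collections."""
    predicted = set(predicted_introns)
    annotated = set(annotated_introns)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(annotated)
    return 2 * precision * recall / (precision + recall)

def best_filter_setting(introns_by_setting, annotated_introns):
    """Exhaustively score every setting and return the best one with its F-score.

    introns_by_setting maps a (hypothetical) filter setting label to the
    introns predicted after applying that filter.
    """
    scores = {
        setting: f_score(introns, annotated_introns)
        for setting, introns in introns_by_setting.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With 2 true positives among 3 predicted and 4 annotated introns, precision is 2/3, recall is 1/2, and the F-score is 4/7.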

Evaluation of Ambiguous Mappers. Ambiguous mappers, or multimappers, are reads that map to more than one genomic location. To increase sensitivity, we extended this definition and defined a multimapper as a read that maps to more than one genomic location, measured over the union of all input submissions. Two genomic locations were considered the same if they shared at least one exonic position in the genome. The multimapper evaluation was based on the comparison of alignment strata, which are subgroups of alignments stratified by their respective number of edit operations. To form such a stratum, we joined all alignments of a given read that used the same number of edit operations. We call this number the stratum level. Such a list of strata was generated for each submission (submission list) as well as for the union of all submissions (union list). The use of strata enables the comparison of alignment sensitivity for multimappers without a confounding effect of edit distance.
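Forming strata amounts to grouping the alignments of each read by their number of edit operations. The sketch below illustrates this under simplifying assumptions: an alignment is a hypothetical (read_id, edit_ops, location) tuple, and the merging of locations that share an exonic position is omitted.

```python
from collections import defaultdict

def build_strata(alignments):
    """Group alignments per read by stratum level (number of edit operations).

    alignments: iterable of (read_id, edit_ops, location) tuples
    (an assumed representation for this sketch).
    Returns {read_id: {stratum_level: [location, ...]}} with levels sorted.
    """
    strata = defaultdict(lambda: defaultdict(list))
    for read_id, edit_ops, location in alignments:
        strata[read_id][edit_ops].append(location)
    return {
        read_id: dict(sorted(levels.items()))
        for read_id, levels in strata.items()
    }
```

A submission list would be obtained by calling `build_strata` on one submission's alignments, and a union list by calling it on the alignments of all submissions combined.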

We tried three different strategies to compare the lists of strata. In each strategy, we computed a score between 0 and 1 for each read and stratum, describing its multiple alignment accuracy. The computation of the score differs, depending on how the alignments of the read were split into strata. The total score of a submission was then computed as the score over all reads. The three strategies are defined as follows:

• Comparison per mismatch stratum defines the stratum score as the fraction of alignments in a stratum of the union list that can be explained by the alignments of the corresponding stratum in the submission list. Strata correspond if they have the same stratum level. An alignment is counted as explained when it has at least 90% overlapping exonic positions with an arbitrary alignment in the respective union list stratum.

• Comparison per alignment list stratum computes for each stratum in a submission list the fraction of alignments in the corresponding stratum of the union list. Here, an alignment counts as explained if there exists at least one alignment in the submission stratum that overlaps the respective alignment in one of the union strata with a lower or equal level in at least 90% of exonic positions. This fraction is then assigned to the sum of lengths of the union list strata up to the current stratum. Averaged over all alignments, the score reflects how well a single submission can explain the first k alignments of all present multimappers.

• Comparison per weighted mismatch stratum computes the score similarly to the comparison per mismatch stratum but in a simplified manner. Each stratum in the submission list is scored as the fraction of identical alignments from the same stratum of the union list. Finally, each stratum is weighted with its level plus one, thus assigning strata with more edit operations a lower weight.
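To make the first strategy concrete, the following sketch scores a single stratum. It is an illustration under stated assumptions, not the scoring code used in the evaluation: each alignment is represented as a set of exonic genome positions, the overlap is measured relative to the union list alignment, and the names `overlap_fraction` and `stratum_score` are hypothetical.

```python
def overlap_fraction(alignment, other):
    """Fraction of exonic positions of `alignment` also covered by `other`."""
    if not alignment:
        return 0.0
    return len(alignment & other) / len(alignment)

def stratum_score(union_stratum, submission_stratum, min_overlap=0.9):
    """Fraction of union-stratum alignments explained by the submission stratum.

    Both strata are lists of sets of exonic positions and are assumed to
    have the same stratum level.
    """
    if not union_stratum:
        return 0.0
    explained = sum(
        1
        for union_alignment in union_stratum
        if any(
            overlap_fraction(union_alignment, candidate) >= min_overlap
            for candidate in submission_stratum
        )
    )
    return explained / len(union_stratum)
```

If the union stratum holds two alignments and the submission explains only one of them with at least 90% exonic overlap, the stratum score is 0.5.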

Pairwise intron agreement. To evaluate the pairwise agreement of spliced alignments, we generated the relative intron agreement for each pair of submissions. To this end, we computed the Jaccard index of the intron agreement (ratio of intersection over union of two submissions' intron lists). We further computed for each submission what fraction of its introns is shared with exactly k other submissions. Furthermore, we computed the relative fraction of a submission's intron list shared with each of the other submissions.
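The two set-based measures can be sketched as below. This is a minimal example assuming that each submission's introns are given as a set of hashable identifiers; the function names and the dictionary layout are hypothetical.

```python
from collections import Counter

def jaccard(introns_a, introns_b):
    """Ratio of intersection over union of two intron collections."""
    a, b = set(introns_a), set(introns_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def shared_with_exactly_k(submissions, name):
    """For each k, the fraction of `name`'s introns found in exactly k others.

    submissions maps a submission name to its set of introns.
    """
    own = set(submissions[name])
    others = [set(introns) for other, introns in submissions.items() if other != name]
    if not own:
        return {k: 0.0 for k in range(len(others) + 1)}
    counts = Counter(
        sum(intron in other for other in others) for intron in own
    )
    return {k: counts[k] / len(own) for k in range(len(others) + 1)}
```

The fractions returned by `shared_with_exactly_k` sum to 1 for any non-empty submission, since every intron is shared with exactly one value of k (including k = 0).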

Effects on transcript prediction. We used two different in silico transcript predictors to assess the downstream effects of read alignment on their results: Cufflinks [288] and Scripture [103]. To meet the input specifications of both tools, we sorted all alignments by starting position with SAMtools [167] and, where necessary and possible, inferred strand information for spliced reads to provide a valid XS flag. For some alignments, insertions and deletions at the alignment boundaries had to be replaced by clippings, since they would otherwise have caused runtime errors. If submissions showed alignment qualities generally equal to zero, we replaced them by 255 (no quality measurement available). Due to limited computational resources, all computations were carried out on chromosomes I, 2L, and 1 for worm, fly, and human, respectively. For Scripture, we used the option -upWeightSplices in all cases. For Cufflinks, we limited the intron size to a maximum value of 20,000, 50,000, and 200,000 for worm, fly, and human, respectively. Otherwise, we used the default parameters.
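The replacement of uninformative alignment qualities can be illustrated with a short sketch operating on SAM text lines. It assumes that "generally equal to zero" means that every record has a mapping quality of 0, which is an interpretation and not stated in the text; the function name `replace_zero_mapq` is hypothetical. The mapping quality is the fifth tab-separated SAM field.

```python
def replace_zero_mapq(sam_lines):
    """Set MAPQ to 255 for all records if every record has MAPQ 0.

    sam_lines: iterable of SAM lines without trailing newlines.
    Header lines (starting with '@') are passed through unchanged.
    """
    lines = list(sam_lines)
    records = [line for line in lines if not line.startswith("@")]
    all_zero = bool(records) and all(
        line.split("\t")[4] == "0" for line in records
    )
    if not all_zero:
        return lines
    fixed = []
    for line in lines:
        if line.startswith("@"):
            fixed.append(line)
            continue
        fields = line.split("\t")
        fields[4] = "255"  # 255: mapping quality not available
        fixed.append("\t".join(fields))
    return fixed
```

Files in which at least one record carries a non-zero quality are returned unchanged under this interpretation.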
