H corresponds to the case where failure starts with sliding of the units through
5.3 ANALYSIS OF FAILURE WITH CRACKING IN THE UNITS
5.3.2 Validation of the numerical model against the tests performed by Charry (2009)
Our aims in constructing a new model incorporating extrinsic information were the following.
Firstly, in the search for an optimal gene structure, a gene structure that respects a hint should get a ‘bonus’ over one that ignores it. Suppose S1 and S2 are two gene structures that have equal a-posteriori probabilities in the ab initio model AUGUSTUS, and suppose we have a single hint that supports S1 but not S2. Then S1 should get a higher a-posteriori probability than S2 in the new model.
Secondly, the bonus of a gene structure respecting a hint which refers to a range of the input sequence (exonpart and exon) should – if at all – only moderately depend on the length of that range. Experience showed that long matches of the DNA input sequence in a protein or EST sequence are not much less likely to be misleading or artifactual than shorter ones (section 5.3.3).
Thirdly, gene structures that only ‘partially respect’ a hint referring to a range should not be rewarded at all. For example, an exon covering only half of an EST match does not get a bonus. If such an exon were correct, the EST would have to be wrong or belong to a different alternative splice form.
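The all-or-nothing rule of the third aim can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule as implemented in AUGUSTUS: the names `Interval` and `range_hint_factor`, the inclusive coordinates, and the bonus value of 10 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Sequence range with inclusive coordinates (an assumption of this sketch)."""
    start: int
    end: int


def range_hint_factor(exon: Interval, hint: Interval, bonus: float = 10.0) -> float:
    """Return a multiplicative factor for an exon given one range hint.

    The exon earns the full (hypothetical) bonus only if it covers the
    whole hinted range; partial overlap earns nothing, i.e. a factor of 1.
    """
    covers_hint = exon.start <= hint.start and hint.end <= exon.end
    return bonus if covers_hint else 1.0


est_match = Interval(150, 250)
full = range_hint_factor(Interval(100, 300), est_match)  # exon covers the whole match
half = range_hint_factor(Interval(100, 200), est_match)  # exon covers only half of it
```

Here `full` equals the bonus, while `half` stays at 1.0, so the partially overlapping exon is treated exactly like an exon with no hint at all.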
Fourthly, the program should not be forced to respect an uncertain hint, since the hint may be wrong. (Advisors sometimes give ill counsel.) As Figure 5.2 shows, hints can be misleading. If the a-posteriori probability of the most likely gene structure is very high in the ab initio model, an uncertain hint that is incompatible with this gene structure should not necessarily lead to a different prediction in the new model.
And fifthly, if actual genes are usually supported by extrinsic information, a gene structure containing genes for which there is no supporting extrinsic information should get a ‘malus’. Suppose again we have two gene structures S1 and S2 that have the same a-posteriori probability in the ab initio model and are equal except that S1 contains an additional gene or exon that S2 does not contain. Suppose further that no extrinsic information supports this additional gene or exon. Then S1 should have a lower a-posteriori probability than S2 in the new model. This aim needs some explanation. Hints found through database searches all support coding regions in some way. We follow the guideline that ‘no information’ is also information. As an example, consider extrinsic information of type start. Suppose the process generating the start hints were so reliable that almost every gene received a supporting start hint and only a very small fraction of the start hints were wrong. Then, intuitively, a predicted gene without a supporting start hint would be suspicious, because it would contradict the practical experience that true genes very rarely lack a supporting hint.
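The first, fourth and fifth aims can be illustrated with a toy re-weighting of ab initio posteriors. The sketch below is not the model developed in this chapter. It simply multiplies each candidate's ab initio posterior by a constant factor per respected hint and per unsupported gene or exon, then renormalizes. The function `reweight` and the values `bonus=4.0` and `malus=0.5` are hypothetical choices made for illustration only.

```python
def reweight(candidates, bonus=4.0, malus=0.5):
    """Toy re-weighting of gene-structure posteriors (illustrative only).

    candidates maps a name to a tuple
        (ab_initio_posterior, hints_respected, unsupported_features).
    Each respected hint multiplies the weight by `bonus`; each gene or exon
    without supporting extrinsic information multiplies it by `malus`.
    """
    weights = {
        name: p * bonus ** respected * malus ** unsupported
        for name, (p, respected, unsupported) in candidates.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Aim 1: equal ab initio posteriors, one hint supports S1 only.
aim1 = reweight({"S1": (0.5, 1, 0), "S2": (0.5, 0, 0)})

# Aim 4: S1 is strongly favoured ab initio; a single hint supports S2 only.
aim4 = reweight({"S1": (0.99, 0, 0), "S2": (0.01, 1, 0)})

# Aim 5: equal ab initio posteriors, S1 has one extra unsupported exon.
aim5 = reweight({"S1": (0.5, 0, 1), "S2": (0.5, 0, 0)})
```

With these assumed factors, S1 rises from 0.5 to 0.8 in the first case, remains the most likely structure in the second case despite the contradicting hint, and drops to one third in the third case. A finite bonus therefore rewards respected hints without forcing the prediction to follow them.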
[Figure 5.2 graphic: sequence HSALIFA, positions 0 to 4000, with ‘annotation’ and ‘hints’ tracks]

Figure 5.2: Section of an input sequence containing one gene and extrinsic information. The upper half of the graph refers to the forward strand, the lower half to the reverse strand. The first line shows an annotated gene with three exons (black). No exons on the reverse strand were annotated. The lines labeled ‘hints’ show the extrinsic information retrieved by AGRIPPA from the results of a BLAST search in a protein database. The white boxes are exonpart hints, the two black boxes are exon hints. The grey (green if viewed in color) triangles on the forward strand at 3753, and on the reverse strand at 498 and 608 are stop hints. The one on the forward strand coincides with the annotated position. The right grey triangle on the reverse strand is a start hint and the grey ‘half-triangle’ on the forward strand at 2650 is a DSS hint at the correct position.

Remark: While accomplishing the first aim tends to increase exon-level sensitivity by giving some exons a bonus, accomplishing the fifth aim tends to increase exon-level specificity because some exons ‘get punished’ through the malus and are therefore not predicted. All of the programs mentioned in section 5.1 reach the first of these goals. Only GENOMESCAN reaches the second goal, since there the extrinsic information referring to a range is reduced to a single base. For the other programs, the relative bonus of a parse that respects a homology on a certain range is a product over all bases or codons of that range and is therefore approximately exponential in the length of that range. Only GENIE reaches the third goal. All programs reach the fourth goal with one exception: GENIE is forced to respect extrinsic information about introns. Only TWINSCAN reaches the fifth of our goals. In the TWINSCAN model, missing extrinsic information corresponds to a conservation sequence with the classification ‘unaligned’, which is presumably more likely to be emitted from non-coding states than from coding states. The other programs that have an ab initio version behave as that version does when no extrinsic information is found in a region. The approach explained below was chosen to attain the above goals.
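Before turning to that approach, the length dependence behind the second aim can be made concrete with a small numerical sketch. It contrasts a bonus accumulated as a product over the bases of a range with a single constant bonus per hint. The per-base factor 1.05, the per-hint factor 10 and the function names are hypothetical and are not taken from any of the programs discussed.

```python
def per_base_bonus(length: int, q: float = 1.05) -> float:
    """Bonus built as a product of a factor q over every base of the range.

    The product of `length` identical factors is q ** length, so the bonus
    grows exponentially with the length of the hinted range.
    """
    return q ** length


def per_hint_bonus(length: int, b: float = 10.0) -> float:
    """Bonus granted once per respected hint, independent of its length."""
    return b


for n in (30, 150, 600):
    print(f"length {n:4d}: per-base {per_base_bonus(n):.3g}, per-hint {per_hint_bonus(n):.3g}")
```

Under these assumed values the per-base bonus of a 600-base match exceeds that of a 30-base match by many orders of magnitude, although, as noted in the second aim, long matches are not much more trustworthy than short ones. The per-hint bonus stays the same for all three lengths.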