MOBILITY ANALYSIS BASED ON MARKOV CHAINS

As the aim of our study is to speed up fold recognition, we evaluate different methods, including our own, with respect to fold recognition accuracy, i.e. the number of correct predictions divided by the size of the test set, given as a percentage. Further, we take a closer look at the preselection accuracy, i.e. the number of targets for which the correct fold was included by a preselection approach divided by the number of all targets, also given as a percentage.
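The two measures defined above can be written down directly. The following is a minimal sketch in Python; the function names and the representation of predictions as lists and of selections as sets are assumptions made for illustration and do not come from the original implementation.

```python
def fold_recognition_accuracy(predicted, actual):
    """Percentage of targets whose single predicted fold equals the true fold."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def preselection_accuracy(selections, actual):
    """Percentage of targets whose true fold is contained in the preselected set."""
    if len(selections) != len(actual):
        raise ValueError("selections and actual must have the same length")
    hits = sum(a in s for s, a in zip(selections, actual))
    return 100.0 * hits / len(actual)
```

Since the final prediction is chosen from the preselected folds, the preselection accuracy is an upper bound on the fold recognition accuracy of any method that runs on the preselected folds only.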

3.5.1 Preselection Performance on ASTRAL25

For comparison, we configured both preselection approaches (relative number of fold classes and threshold-based selection) such that they find the correct fold in their selection in exactly the same number of cases (see above). When used as described above, both find the correct fold in their selection for 88% of all targets; the first approach uses a fixed number of 22 folds, whereas the second approach visits 24 folds on average.

We evaluated whether it was necessary to introduce a "special treatment" for targets with few predicted elements, which we expected to be harder to predict than those with more elements. For this reason, we selected, as an example, all targets with only one or two secondary structure elements (excluding coils) from our ASTRAL25 set. On this data, the selection of the first approach (using the top 5% of fold classes) still contains the correct fold in 85% of the cases (as compared to 88% for all targets). In contrast, the threshold-based version shows a clearly reduced performance: its selection contains the correct class in only 51% of the cases. This shows that the first approach in particular is also applicable to targets with few secondary structure elements.

For both approaches, we observe that the preselection accuracy increases with the number of secondary structure elements: Using only targets with more than 20 secondary structure elements, the first approach selects the correct fold in 91% of the cases and the second approach nearly reaches 99%.

Overall, although both approaches were tuned to the same preselection accuracy on all targets, the relative number of folds works much better than the threshold for targets with few secondary structure elements, whereas the threshold is better for very high numbers of secondary structure elements. When used in combination, i.e. using at least 22 folds and running until the threshold is reached, it is possible to capture the strengths of both approaches. Then, in 91.5% of all cases we find the correct fold in our selection. However, with an average of 34 folds, this combination is actually comparable to the first approach alone when simply using the top 8% instead of the top 5% of folds. Indeed, for the top 8% of folds, we would have achieved a very similar preselection accuracy of about 91%.
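The two preselection strategies and their combination can be sketched as follows. This is an illustrative Python sketch, not the original implementation: the function names are hypothetical, one SSEA score per fold class is assumed to be available (higher is better), and the threshold is assumed to act on the score gap to the best-ranked fold, which is only one plausible reading of "threshold-based selection".

```python
import math


def rank_folds(fold_scores):
    """Return fold names sorted by descending score (ties keep input order)."""
    return sorted(fold_scores, key=fold_scores.get, reverse=True)


def select_top_fraction(fold_scores, fraction=0.05):
    """First approach: keep a fixed relative number of the best-ranked folds."""
    ranked = rank_folds(fold_scores)
    k = max(1, math.ceil(fraction * len(ranked)))  # rounding up is an assumption
    return ranked[:k]


def select_by_threshold(fold_scores, max_gap):
    """Second approach (assumed form): keep folds scoring within max_gap of the best."""
    ranked = rank_folds(fold_scores)
    if not ranked:
        return []
    best = fold_scores[ranked[0]]
    return [fold for fold in ranked if best - fold_scores[fold] <= max_gap]


def select_combined(fold_scores, min_folds, max_gap):
    """Combination: at least min_folds folds, then continue until the threshold is reached."""
    ranked = rank_folds(fold_scores)
    by_threshold = select_by_threshold(fold_scores, max_gap)
    k = max(min(min_folds, len(ranked)), len(by_threshold))
    return ranked[:k]
```

Because both selections are prefixes of the same ranking, the combination is simply the longer of the two prefixes. This is why it can only increase the number of visited folds relative to either approach alone.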

Apparently, there is a tradeoff between fold recognition accuracy and speed-up. Using the individual approaches or the combination of the two, an increased number of folds or a less restrictive threshold will increase the preselection accuracy but in turn include more potential folds. On the other hand, as we will see on the CATH MJ set in the next subsection, the restriction to only a few fold classes by SSEA can in some cases even improve accuracy over PPA alone. In addition, the threshold-based preselection depends much more strongly on the properties of a prediction setup (the expected sequence similarity between the template classes, for instance) than the first approach: When trained on a set with low similarity between template classes and then used on a set with high sequence similarities between template classes (and thus smaller score differences), the threshold will probably find many more folds than expected from the training data, and vice versa. In contrast, the relative number of folds can be expected to yield a speed-up on most datasets independently of the contained sequence similarities, as long as a very large part of the templates is not concentrated in just a few of the available template classes. This illustrates that the application of SSEA can help to concentrate on potential fold classes in fold recognition setups, but the choice of approach and of the corresponding parameters will depend on the intended application.

3.5.2 Fold Recognition Accuracy

In this subsection, we combine preselection with subsequent refinement using PPA for fold recognition. In direct comparison, the characteristics of the first approach seem better suited for this purpose than those of the second, as it does not depend on the number of secondary structure elements to work well, whereas the threshold-based version has considerable problems in the presence of only few secondary structure elements. Further, the combination of both approaches increases the number of folds over the first approach by more than 50% while only resulting in a few percent better preselection accuracy. In the following, we therefore use the relative number of folds as defined by the first approach as an exemplary choice of preselection method for the purpose of fold recognition: We select the top 5% of fold classes with SSEA, and we subsequently apply PPA to predict a single, final fold class for a target.
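The two-stage procedure described above (fast preselection followed by refinement on the remaining templates) can be sketched as follows. This is a minimal Python sketch under stated assumptions: `ssea_score` and `ppa_score` are hypothetical callables standing in for the two alignment methods, a fold class is assumed to be scored by its best-scoring template, and the final prediction is assumed to be the fold of the best PPA-scoring template. None of these details are confirmed by the text.

```python
import math


def recognize_fold(target, templates_by_fold, ssea_score, ppa_score, fraction=0.05):
    """Predict one fold: preselect the top fraction of folds, then refine."""
    # Stage 1 (fast): score each fold by its best template (assumption).
    fold_scores = {
        fold: max(ssea_score(target, t) for t in templates)
        for fold, templates in templates_by_fold.items()
        if templates
    }
    ranked = sorted(fold_scores, key=fold_scores.get, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    selected = ranked[:k]

    # Stage 2 (slow): run the expensive method only on templates of selected folds.
    best_fold, best_score = None, float("-inf")
    for fold in selected:
        for template in templates_by_fold[fold]:
            score = ppa_score(target, template)
            if score > best_score:
                best_fold, best_score = fold, score
    return best_fold
```

In this sketch, the expensive scoring function is never called for templates outside the selected folds, so a template from a discarded fold cannot win the second stage regardless of its score.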

The fold recognition accuracy for this approach as well as our comparison methods on the two benchmark sets and on the training set is shown in Figure 3.3. All values were rounded to full percentages. The difficulty level of the benchmark sets decreases from left to right as indicated by the accuracy of the methods for each set.

• CATH MJ: For the most difficult set we find that sequence-based methods perform poorly (PDB-BLAST: 13%, GenThreader: 14%). Secondary structure element alignment achieves 32% accuracy and PPA achieves 38% accuracy. Nonetheless, on this set, the combination with SSEA can further increase prediction accuracy to 41%, in comparison to 34% for MANIFOLD [Bindewald et al., 2003]. The reason for this improvement is that, when only very low sequence similarities to sequences of the same fold are given (as in this set), PPA finds only very low scores against all templates. Then it is possible for an unrelated template to gain a slightly higher PPA score than a remotely related template by accident, for instance because of a few similar residues, although the overall topology may be completely different. On this set, for some cases, the restriction of the available folds by preselection prevented PPA from running into such traps. In such difficult situations, confidence measures such as score gaps [Sommer et al., 2002] may be used to abstain from a prediction completely and apply other methods instead when available. However, for this test set, the score differences between the first-ranked and the second-ranked fold are usually small, and therefore such an approach might significantly reduce PPA’s sensitivity.

• ASTRAL25: Here, with only 54% accuracy, secondary structure element alignment achieves significantly fewer hits than PPA with 79%. The combination of both yields 76%, this time decreasing accuracy by about 3 percentage points.

• SCOP DD: On the easier benchmark set containing distant homologs we find that our approach achieves 83% accuracy as compared to 75% for MANIFOLD, which is 24% more fold recognition accuracy than the recently published BAYESPROT and even 27% more than the machine learning methods proposed by Ding and Dubchak. Again, the best result is obtained by PPA alone with 84%, whereas secondary structure element alignment achieves 73%.

We find that, by speeding up the fold recognition process using preselection, we can obtain a similar performance to using PPA directly (CATH MJ: +3%, ASTRAL25: -3%, SCOP DD: -1%). On all three sets, both PPA and the combination of preselection and PPA clearly outperform their comparison methods.

3.5.3 Speed-Up Evaluation

In a runtime evaluation of the implementations used on an Intel Xeon DP at 2.8 GHz, SSEA was more than a hundred times faster than PPA, with between $10^3$ and $10^4$ alignments per second as compared to 10 to 100 alignments per second for global PPA with both secondary structure and sequence profiles in our setup. This shows that SSEA is faster than PPA by about two orders of magnitude. Therefore, the speed-up achieved by preselection can indeed be considered relative to the number of discarded templates.

When using the top 5% of folds, under the assumption that discarding 95% of the fold classes also discards about 95% of the templates, we can expect to save 95% of the PPA alignments (a 20-fold speed-up). In fact, the real speed-up depends on the selected classes. For the ASTRAL25 dataset, the average number of templates per fold class is about 9, whereas the maximum number is 175. Interestingly, the median is 4, and the distribution shows that only about 100 (i.e. about 25%) of the fold classes actually have more than 9 members in our set. Nonetheless, if we align against each template of the selected fold classes, this distribution results in a true reduction, as measured by the number of templates for each target, of about 87%, i.e. roughly 8-fold, as we have to use PPA against 532 templates on average instead of all 3998 of the ASTRAL set.
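The arithmetic behind these figures can be checked with a few lines of Python. The helper name `template_reduction` is hypothetical; the numbers 3998 and 532 are taken from the text, and the cost of the SSEA stage itself is ignored here, on the assumption that it is negligible at roughly two orders of magnitude faster than PPA.

```python
def template_reduction(total_templates, aligned_templates):
    """Return (fraction of PPA alignments saved, speed-up factor)."""
    saved = 1.0 - aligned_templates / total_templates
    factor = total_templates / aligned_templates
    return saved, factor


# Idealized case: equally sized fold classes, top 5% of folds kept.
ideal_saved, ideal_factor = template_reduction(100, 5)

# ASTRAL25 as reported: 532 of 3998 templates aligned with PPA on average.
observed_saved, observed_factor = template_reduction(3998, 532)

print(f"ideal:    {ideal_saved:.0%} saved, {ideal_factor:.1f}-fold")
print(f"observed: {observed_saved:.0%} saved, {observed_factor:.1f}-fold")
```

The gap between the idealized 20-fold and the observed factor of about 7.5 reflects the skewed distribution of templates over fold classes: the selected classes contain more templates on average than a uniform distribution would suggest.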
