select template fold classes based on the secondary structure elements in the target sequence as predicted by Psipred [Jones, 1999b] using a fast alignment method designed for such cases.
Once a number of potential template fold classes have been selected, approximate topological similarity based only on the order and length of secondary structure elements is often not enough to discriminate between them, as shown for the SSEA method in the results section. At this stage, we therefore switch to a finer level of description by using profile-profile alignment (PPA) on both sequence and secondary structure profiles. This method is much slower than the preselection step at finding a single, final predicted class for a target, but it is very accurate in comparison to other fold recognition methods.
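The coarse description mentioned above, the order and length of secondary structure elements, can be illustrated with a minimal sketch. It assumes a three-state per-residue prediction string (H for helix, E for strand, C for coil), which is the alphabet Psipred predictions use; the function name `to_elements` and the example string are hypothetical and not taken from the SSEA implementation.

```python
from itertools import groupby


def to_elements(ss_string):
    """Collapse a per-residue secondary structure string into a list of
    (state, length) elements, preserving their order along the sequence."""
    return [(state, len(list(run))) for state, run in groupby(ss_string)]


# Hypothetical 16-residue prediction: coil, helix, coil, strand, coil
predicted = "CCHHHHHHCCEEEECC"
print(to_elements(predicted))
# [('C', 2), ('H', 6), ('C', 2), ('E', 4), ('C', 2)]
```

Two proteins with the same element list are indistinguishable at this level, which is why a finer, residue-level comparison is needed once candidate classes have been selected.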
An overview of the approach is shown in Figure 3.1: (1) potential classes are quickly preselected, and (2) the selected classes are rescored using the second, more expensive measure to select the final predicted class. As our evaluation will show, this idea reduces computation time by about one order of magnitude compared to PPA alone while achieving comparable results in fold recognition.
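The two steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in this chapter: the scoring functions are passed in as placeholders for the fast and the expensive measure, and keeping a fixed number of top-ranked classes (`n_classes`) is an assumed cutoff rule. All names are hypothetical.

```python
from collections import defaultdict


def predict_fold_class(target, templates, fast_score, slow_score, n_classes=3):
    """Two-stage fold class prediction (illustrative sketch).

    templates  : list of (template_id, fold_class) pairs
    fast_score : callable (target, template_id) -> float, cheap, higher is better
    slow_score : callable (target, template_id) -> float, expensive, higher is better
    n_classes  : number of classes kept after preselection (assumed cutoff rule)
    """
    # Stage 1: score all templates with the cheap measure and keep,
    # for every fold class, the score of its best member.
    best_fast = defaultdict(lambda: float("-inf"))
    for template_id, fold_class in templates:
        score = fast_score(target, template_id)
        best_fast[fold_class] = max(best_fast[fold_class], score)
    ranked = sorted(best_fast, key=best_fast.get, reverse=True)
    preselected = set(ranked[:n_classes])

    # Stage 2: rescore only members of the preselected classes with the
    # expensive measure; the class of the best template is the prediction.
    best_class, best_score = None, float("-inf")
    for template_id, fold_class in templates:
        if fold_class not in preselected:
            continue
        score = slow_score(target, template_id)
        if score > best_score:
            best_class, best_score = fold_class, score
    return best_class, preselected


# Toy data: four templates in three classes with made-up scores.
templates = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("c1", "C")]
fast = {"a1": 0.9, "a2": 0.2, "b1": 0.8, "c1": 0.1}
slow = {"a1": 0.3, "a2": 0.4, "b1": 0.7, "c1": 0.99}
predicted, kept = predict_fold_class(
    "query", templates,
    lambda t, tid: fast[tid], lambda t, tid: slow[tid], n_classes=2)
print(predicted, sorted(kept))  # B ['A', 'B']
```

In the toy run, template c1 would win under the expensive measure but is never rescored because its class was not preselected. This is the trade-off of the scheme: the time saved depends on how many templates are skipped, and the accuracy depends on the correct class surviving preselection.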
3.2 Material
3.2.1 Training and Test Data
We use three different data sets for this chapter: one well-known “difficult” set, one newly compiled “intermediate” set, and one well-known “easy” set:
1. CATH MJ: The first set was introduced by [McGuffin and Jones, 2002]. It contains 542 nonredundant domains based on CATH [Orengo et al., 1997] version 1.7 and is divided into a subset of 252 “known” domains, which have at least one other match in this set, and 290 “unique” domains, i.e. domains whose folds are unique with respect to this set. In order to compare our method to the results of [Bindewald et al., 2003], we followed their approach by selecting the set of known folds as targets and the complete set as templates, excluding identical hits. For comparison purposes we used the classifications given by CATH V2.4 as described in [Bindewald et al., 2003]. It should be noted that, using this CATH version, we can find matching partners with respect to the CATH topology level for only 241 of the set of known domains. For the evaluation, we nonetheless keep all 252 domains as the reference number for 100% accuracy.
2. ASTRAL25: The second set was compiled from the ASTRAL [Chandonia et al., 2004] subset with less than 25% sequence identity based on SCOP version 1.65. We performed leave-one-out tests on all fold classes containing at least 2 members (3999 domains in 441 fold classes). This set was used for training our approach (we evaluated the percentage of selected folds on this set) for two reasons: First, we have no comparison with other methods on this dataset (in contrast to the two other sets), and second, the ASTRAL set is the most similar set to the template sets usually used for structure prediction and fold recognition by methods competing, e.g., in the CASP experiment. Therefore, high performance on the ASTRAL set is desirable, especially when considering preselection for fold recognition in prediction methods such as SSEP-Domain, which also uses ASTRAL as its template database.
3. SCOP DD: The third set is the test set provided by [Ding and Dubchak, 2001]. It contains 386 SCOP domains in 27 SCOP folds. This set is known to contain (distant) homologs [Bindewald et al., 2003], a fact that leads to higher recognition rates for such target-template pairs. We again follow the MANIFOLD procedure by performing leave-one-out tests on the test set only (Silvio C. E. Tosatto, personal communication).
[Figure 3.1 diagram labels: All Template Fold Classes; Select High-Scoring Classes; Preselected Fold Classes; Rescore Members; Refined Scores for Classes; Select Highest-Scoring Class; Predicted Fold Class]
Figure 3.1: Graphical overview of the preselection and refinement approach. At first, all templates are scored with a fast but sensitive method and potential fold classes are selected. Using only these classes, all remaining templates are rescored in the refinement step with a more selective approach and the class with the highest-scoring template is selected for the final prediction.
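The leave-one-out protocol used for the ASTRAL25 and SCOP DD sets can be sketched as follows. This is a minimal illustration under assumptions: the predictor is an arbitrary callable, the template pool for each target is taken to be all other evaluated domains, and the function and variable names are hypothetical.

```python
from collections import Counter


def leave_one_out_accuracy(domains, predict, min_members=2):
    """Leave-one-out fold recognition accuracy (illustrative sketch).

    domains     : dict mapping domain_id -> fold class
    predict     : callable (target_id, template_ids) -> predicted fold class
    min_members : only classes with at least this many members are evaluated
    """
    class_sizes = Counter(domains.values())
    targets = [d for d, fold in domains.items() if class_sizes[fold] >= min_members]
    correct = 0
    for target in targets:
        # Assumption: every other evaluated domain serves as a template.
        template_ids = [d for d in targets if d != target]
        if predict(target, template_ids) == domains[target]:
            correct += 1
    return correct / len(targets) if targets else 0.0


# Toy example: class C has a single member and is therefore skipped.
toy = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
always_a = lambda target, template_ids: "A"
print(leave_one_out_accuracy(toy, always_a))  # 0.5
```

Requiring at least two members per class guarantees that every target has at least one correct template left after it is removed. The CATH MJ evaluation deliberately departs from this: since 252 domains are kept as the reference number although only 241 have a matching partner, the highest accuracy reachable on that set is 241/252, about 95.6%.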
For this updated evaluation of the preselection approach, sequence and secondary structure profiles as well as secondary structure predictions were generated in the same manner as for the SSEP-Domain method, which makes use of the preselection approach to speed up protein domain prediction (see section 2.5.2 for details).
It should be noted that 26% of the targets in the SCOP DD set are contained in the ASTRAL25 set, i.e. 100 of the 386 domains are also used in the ASTRAL set. However, the set is much smaller and the conditions are very different from those of the ASTRAL set. From the CATH MJ set, 36.5% of the protein chains used in the test set are also used in the ASTRAL data (92 of 252). Nonetheless, the setup is again very different from the ASTRAL data: no cross-validation is used, the set is much smaller, and the domain definitions were taken from CATH instead of SCOP. Therefore, using the SCOP DD and CATH MJ sets as test sets allows for a fair comparison with the methods quoted for these sets.
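The overlap figures quoted above follow from simple set arithmetic. The sketch below shows the calculation with placeholder identifiers; the function name `overlap_percent` and the synthetic ID lists are assumptions for illustration, and only the counts (100 of 386, 92 of 252) come from the text.

```python
def overlap_percent(test_ids, training_ids):
    """Return (number of shared entries, percentage of the test set they make up)."""
    test = set(test_ids)
    if not test:
        return 0, 0.0
    shared = test & set(training_ids)
    return len(shared), 100.0 * len(shared) / len(test)


# Synthetic identifiers reproducing the counts reported in the text.
scop_dd = [f"d{i}" for i in range(386)]
shared_with_astral = [f"d{i}" for i in range(100)]
count, percent = overlap_percent(scop_dd, shared_with_astral)
print(count, round(percent))         # 100 26

cath_mj = [f"c{i}" for i in range(252)]
count, percent = overlap_percent(cath_mj, [f"c{i}" for i in range(92)])
print(count, round(percent, 1))      # 92 36.5
```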
3.2.2 Quoted Methods
For the sets obtained from the literature, we are able to compare our results directly to the accuracy values reported for other methods:
• MANIFOLD (MF). The MANIFOLD method [Bindewald et al., 2003] is the most interesting comparison, since it also makes use of secondary structure element alignment. The results are combined with PDB-BLAST and enzyme code similarity by training a two-layer neural net for weighting the three contributions.
• PDB-BLAST (PB). From [Bindewald et al., 2003] we quote their results for the PDB-BLAST method [Rychlewski et al., 2000] which generates PSI-BLAST profiles [Altschul et al., 1997] for each target and then aligns them to all template sequences.
• GenTHREADER (GT). From [McGuffin and Jones, 2002] we used the results for GenTHREADER, an approach introduced by [Jones, 1999a] which uses a sequence profile-based algorithm and subsequently analyzes the alignments by using energy potentials.
• BAYESPROT (BP). BAYESPROT utilizes tree-augmented naïve Bayesian classifiers. Here, we quote the results from [Chinnasamy et al., 2004].
• Ding and Dubchak (DD). Ding and Dubchak studied support vector machines and neural nets for fold recognition. The results are quoted from the original paper of 2001 [Ding and Dubchak, 2001].
Since these results were not recomputed, it should be noted that there are small differences in the setup between our approach and the quoted methods. We use Psipred [Jones, 1999b] predictions while, for the Ding and Dubchak set, MANIFOLD makes use of consensus secondary structure predictions as described by [Albrecht et al., 2003]. Furthermore, since we made use of an NR version of April 2004 to compute our profiles, these will differ slightly from the profiles generated by Bindewald et al. for MANIFOLD. The final revision of their paper was in August 2003.