
We performed an all-against-all comparison of the test set domains with the three iterative homology detection methods and show their performance as ROC plots (see Figure 4.2). In a ROC (receiver operating characteristic) plot analysis, the true positive (TP) and false positive (FP) hits at different E-value thresholds are counted and the cumulative values are plotted. To prevent a few large folds from dominating the benchmark, we weight each hit by 1/(number of members in the query's SCOP fold). The fold-weighted FPs on the x-axis are shown on a logarithmic scale to highlight the important region with low false discovery rates ($\mathrm{FDR} = \frac{FP}{FP + TP}$, see appendix A.3). In all search iterations, HHblits detects significantly more TPs than PSI-BLAST and HMMER3. More precisely, HHblits detects twice as many TPs as PSI-BLAST and 54% more TPs than HMMER3 at 1% FDR in the first iteration (Figure 4.2A). At 10% FDR, HHblits detects 112% more TPs than PSI-BLAST and 68% more than HMMER3. For an iterative search strategy, performance at these low FDRs is important because even a single non-homologous protein could corrupt the resulting profile. At an FDR of 10%, even two iterations of HHblits detect significantly more true positives than PSI-BLAST with three or even eight (data not shown) iterations, and approximately the same number of TPs as HMMER3 with three iterations (light dashed red line in Figure 4.2C).
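The fold-weighted counting described above can be sketched as follows. This is a minimal illustration, not the benchmark code used for Figure 4.2: the `Hit` record, its field names, and the helper functions are hypothetical, and ties between E-values are not treated specially.

```python
from collections import namedtuple

# Hypothetical record for one query-template match:
#   evalue          - reported E-value of the match
#   is_tp           - True if both domains are in the same SCOP fold
#   query_fold_size - number of members in the query's SCOP fold
Hit = namedtuple("Hit", ["evalue", "is_tp", "query_fold_size"])


def weighted_roc(hits):
    """Return cumulative fold-weighted (FP, TP) points, best E-value first."""
    points = []
    tp = 0.0
    fp = 0.0
    for hit in sorted(hits, key=lambda h: h.evalue):
        # Each hit counts 1/(members in query fold), so large folds
        # do not dominate the curve.
        weight = 1.0 / hit.query_fold_size
        if hit.is_tp:
            tp += weight
        else:
            fp += weight
        points.append((fp, tp))
    return points


def tp_at_fdr(points, max_fdr):
    """Largest weighted TP count reached while FDR = FP/(FP+TP) <= max_fdr."""
    best = 0.0
    for fp, tp in points:
        if fp + tp > 0 and fp / (fp + tp) <= max_fdr:
            best = max(best, tp)
    return best


example_hits = [
    Hit(1e-30, True, 2),
    Hit(1e-20, True, 2),
    Hit(1e-10, False, 4),
    Hit(1e-5, True, 1),
]
example_points = weighted_roc(example_hits)
```

Comparing `tp_at_fdr(points, 0.01)` or `tp_at_fdr(points, 0.10)` across methods corresponds to reading the curves at the 1% and 10% FDR levels quoted in the text.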

High-scoring false positives

We further analyzed the high-scoring false positive matches in the third iteration of the previous benchmark to see whether we could identify a systematic error. HHblits has 90 false positives with an E-value better than $10^{-3}$. For many of them, however, we could identify a similar structure or function, so it is doubtful whether they are real false positives. For example, 10 of the highest-scoring false positives (the best one has an E-value of $10^{-33}$) are between members of the SCOP families d.211.1.1 and a.118.24.1. These families are annotated as Ankyrin and Pseudo-Ankyrin, and the SCOP annotation for Pseudo-Ankyrin explains that there are similarities in the repeat sequence and assembly with the ankyrin repeat. Another large group of high-scoring false positives (15 matches with an E-value better than $10^{-3}$) indicates a relationship between the SCOP families a.1.2.1 and d.58.1.5. The first family belongs to the superfamily of alpha-helical ferredoxin and the other family is annotated as Ferredoxin domains from multidomain proteins. The detailed annotation describes that members of this family may be more closely related to other ferredoxins than to each other. Further matches are between the SCOP families c.2.1.2 and c.72.3.1; the first one belongs to

4.1 Benchmarks 46

Figure 4.2: Homology detection sensitivity of iterative search methods. (A)-(C) show ROC plots for different numbers of iterations (1, 2 and 3, respectively) on the SCOP20 test set. All but the last search iteration are performed against UniProt to build a profile as input for the last search iteration, which is run against a combined database containing the UniProt database and the SCOP20 dataset. TPs are defined as pairs from the same SCOP fold, FPs as pairs from different folds. At a false discovery rate (FDR) of 10%, HHblits detects significantly more TPs than the other methods, for example in the first iteration twice as many as PSI-BLAST and 68% more than HMMER3. In (C), the light red curve shows the performance of 2 iterations of HHblits and demonstrates the clear improvement over 3 iterations of PSI-BLAST.

the large group of Rossmann-fold domains, and the second one is annotated as a combination of the Rossmann-like and Ribokinase-like topologies.
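The family pairs discussed here are counted as false positives because the benchmark compares SCOP folds, and a SCOP classification string has the form class.fold.superfamily.family. A minimal sketch of that TP/FP decision follows; the function names are hypothetical and not taken from the benchmark scripts.

```python
def scop_fold(sccs):
    """Return the fold part (class.fold) of a SCOP classification string."""
    parts = sccs.split(".")
    if len(parts) < 2:
        raise ValueError("expected at least class.fold, got %r" % sccs)
    return ".".join(parts[:2])


def is_true_positive(sccs_a, sccs_b):
    """A pair is a TP if both domains share a fold, otherwise an FP."""
    return scop_fold(sccs_a) == scop_fold(sccs_b)


# Ankyrin (d.211.1.1) versus Pseudo-Ankyrin (a.118.24.1): different folds,
# so the pair is scored as a false positive despite the annotated similarity.
ankyrin_pair_is_tp = is_true_positive("d.211.1.1", "a.118.24.1")
```

Under this rule, c.2.1.2 and c.72.3.1 share a class but not a fold, so they are also scored as a false positive.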

The analysis of high-scoring false positive matches of the other methods shows that these tools report many more false positives with an E-value better than $10^{-3}$. HMMER3 has 1809 such false positives, 95 of which have an E-value better than $10^{-9}$. PSI-BLAST yields more than 15900 false positive matches with an E-value better than $10^{-3}$, but only 32 with an E-value better than $10^{-9}$. Some of these false positives belong to the same families mentioned before. Many of the highest-scoring false positives in PSI-BLAST stand out because they are matches between two different SCOP families located on the same protein (e.g., d1vgya2 and d1vgya1). Hence these false positive matches might be due to homologous over-extension (Gonzalez and Pearson, 2010), where alignments of homologous domains extend into neighboring non-homologous regions and unrelated information is included in the query profile for the next iteration.

ROC5 benchmark

Figure 4.3B gives the ROC5 plot for this benchmark, which assesses how well a method ranks the matched proteins within each search; thus, E-values do not need to be comparable between searches. From the TP and FP hits at various E-value thresholds we can infer a ROC5 score ($\in [0,1]$) for each query. This score is defined as the area under the TP-versus-FP ROC curve up to the fifth false positive hit, divided by the area under the optimal ROC curve (see example in Figure 4.3A). To assess the overall homology detection performance, we plot the fraction of queries with ROC5 scores above a variable ROC threshold ($\in [0,1]$). Again, HHblits outperforms PSI-BLAST and HMMER3 in all search iterations. After two search iterations, for example, HHblits achieves a ROC5 score above 0.2 for 38% of all queries, whereas HMMER3 has 29% and PSI-BLAST only 27% of queries with a ROC5 score above 0.2. For 20% of the queries, HHblits achieves a ROC5 score above 0.51, HMMER3 above 0.39 and PSI-BLAST above 0.33.
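The ROC5 definition above can be sketched in a few lines. This is an illustrative reading of the definition, not the evaluation code behind Figure 4.3. The function names are hypothetical, and two points are assumptions: the optimal curve is taken to rank all `total_positives` homologs of the query before the first FP, and if a hit list ends before the fifth FP, the missing FPs are assumed to follow all reported hits.

```python
def roc_n(ranked_is_tp, total_positives, n=5):
    """ROC_n score of one query.

    ranked_is_tp    - hits ordered from best to worst E-value; True marks a TP
    total_positives - number of true homologs of the query in the database
    """
    if total_positives == 0:
        return 0.0
    tps_seen = 0
    fps_seen = 0
    area = 0
    for is_tp in ranked_is_tp:
        if is_tp:
            tps_seen += 1
        else:
            # Each FP adds a column whose height is the TPs ranked above it.
            fps_seen += 1
            area += tps_seen
            if fps_seen == n:
                break
    # Assumption: FPs never reported would rank below every reported hit.
    area += (n - fps_seen) * tps_seen
    # The optimal curve has area n * total_positives.
    return area / (n * total_positives)


def fraction_above(scores, threshold):
    """Fraction of queries whose score lies above the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > threshold) / len(scores)


example_ranking = [True, True, False, True, False, False, False, False]
example_score = roc_n(example_ranking, total_positives=4)
```

Evaluating `fraction_above` over a grid of thresholds in [0, 1] yields one curve per method of the kind described for Figure 4.3B.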