The huge number of new sequences obtained each year has caused sequence databases to grow exponentially, but the functional annotation of genes and their products has so far been unable to keep up with this pace. Adopting an already established approach, a group of researchers from the function prediction community joined forces and organized the first CAFA experiment. The Critical Assessment of protein Function Annotation algorithms (CAFA) is designed to provide a large-scale assessment of computational methods dedicated to predicting protein function. The general set-up is shown in figure 32, extracted from the assessment paper (ref. 142) published after the first

Figure 32 Timeline for the first edition of the CAFA experiment, extracted from the Nature paper (ref. 142) published in 2013.

Briefly, the CAFA organizers provide a large number of almost unannotated protein sequences and set a deadline. Until that deadline (the prediction phase), participating groups predict the function of these proteins by associating them with Gene Ontology (GO) terms. In the following phase, target accumulation, the assessors gather experimental functional evidence for the target dataset. The long duration of this phase (almost a year in the first experiment) is meant to give the scientific community time to generate as many annotations as possible. Even so, in the first experiment it was possible to retrieve information for only 0.01% of the initial set. Finally, in the analysis phase, the methods are tested against the resulting benchmark set.

The second edition of the experiment started on 29 August 2013. The whole protein dataset comprised 102,117 sequences. A small part of them (1,301), all putative enzymes, came from a large-scale collaborative project called the Enzyme Function Initiative (EFI), whose goal is to develop integrated strategies that will enable focused experimental enzymology, genetics, and metabolomics. The organizers gathered the rest, more than 100 thousand proteins, from 27 different organisms.

We decided to participate with a modified version of firestar. In principle the method cannot predict GO terms directly. FireDB does store functional annotations from UniProt associated with PDB entries, so firestar could transfer this information to the target from the different templates used to generate the prediction. However, although functionally important residues can be found in very diverse proteins, not all of them are equally relevant for determining function. To overcome these limitations we used two different approaches.

In the first approach we extracted the terms related to ligand binding from the Molecular Function GO domain and associated them with the corresponding PDB molecules. For example, beta-lactose (PDB code LAT) is associated with the general term GO:0005529 (carbohydrate binding) and with the more specific term GO:0030395 (lactose binding). In total we were able to annotate 112 compounds. Using this information, whenever firestar predicts a binding site and the corresponding interacting ligand, it automatically transfers the associated GO terms to the target protein.
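The ligand-based transfer described above can be illustrated with a minimal sketch. It assumes the ligand annotations are held in a simple dictionary keyed by PDB ligand code; the names `LIGAND_TO_GO` and `transfer_ligand_go` are hypothetical and are not part of firestar, and only the LAT entry is taken from the text.

```python
# Hypothetical ligand-to-GO lookup table. Only the LAT entry comes from the
# text; the real table is described as covering 112 compounds.
LIGAND_TO_GO = {
    "LAT": ["GO:0005529", "GO:0030395"],  # carbohydrate binding, lactose binding
}


def transfer_ligand_go(predicted_ligands):
    """Collect the GO terms mapped to the ligands predicted to bind a target.

    Ligands without an entry in the table contribute no terms.
    """
    terms = set()
    for ligand in predicted_ligands:
        terms.update(LIGAND_TO_GO.get(ligand, []))
    return sorted(terms)


if __name__ == "__main__":
    # "NOMAP" stands for any ligand code that has no GO mapping.
    print(transfer_ligand_go(["LAT", "NOMAP"]))
```

Under these assumptions, a target with a predicted LAT binding site receives both the general and the specific binding term, while ligands outside the table add nothing.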

In the second approach we used the mapping between Enzyme Commission (EC) codes and GO terms generated by the Gene Ontology consortium itself (http://www.geneontology.org/external2go/ec2go). Using the Catalytic Site Atlas (CSA) information stored in FireDB, firestar is able to predict catalytic sites and, through the mapping, to transfer the corresponding GO terms at the same time. To generate GO predictions we decided to use only information coming from manually annotated CSA entries, and we set more restrictive conservation and coverage filters. If a catalytic site is fully conserved and the SQUARE scores of the individual residues are higher than an established cut-off, the GO term(s) is transferred directly. If a third of the site is poorly conserved while the rest is highly conserved, the parent GO term(s) is transferred instead. If more than a third of the site is not conserved at all, the prediction is discarded.
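As a rough illustration of the filtering rule just described, the sketch below parses one line of the ec2go mapping file and classifies a catalytic site from its per-residue scores. Several points are assumptions rather than the actual firestar implementation: the function names, the cut-off value used in the example, and the simplification that every residue not above the cut-off counts as weakly conserved, which merges the "poorly conserved" and "not conserved at all" cases of the text.

```python
def parse_ec2go_line(line):
    """Parse one ec2go line into (ec_code, go_id); return None otherwise.

    Mapping lines look like:
        EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
    Lines starting with "!" are comments.
    """
    line = line.strip()
    if not line or line.startswith("!"):
        return None
    left, sep, right = line.partition(" > ")
    if not sep or ";" not in right:
        return None
    go_id = right.rsplit(";", 1)[-1].strip()
    return left.strip(), go_id


def decide_transfer(residue_scores, cutoff):
    """Classify a predicted catalytic site as 'direct', 'parent' or 'discard'.

    residue_scores: one conservation score per catalytic residue.
    cutoff: hypothetical threshold; a residue is treated as highly
            conserved only if its score is strictly above it.
    """
    n = len(residue_scores)
    if n == 0:
        return "discard"
    weak = sum(1 for score in residue_scores if score <= cutoff)
    if weak == 0:
        return "direct"   # fully conserved: transfer the mapped GO term
    if 3 * weak <= n:
        return "parent"   # at most a third weak: transfer the parent term
    return "discard"      # more than a third weak: no prediction


if __name__ == "__main__":
    example = "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022"
    print(parse_ec2go_line(example))
    print(decide_transfer([5, 5, 5], cutoff=4))
    print(decide_transfer([5, 5, 1], cutoff=4))
    print(decide_transfer([5, 1, 1], cutoff=4))
```

The integer comparison `3 * weak <= n` avoids floating-point rounding when testing the one-third boundary. Looking up the parent term itself would require the GO graph and is left out of this sketch.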

Our lab also participated in this edition of the experiment with the Statistically Inferred Annotation Method, or SIAM, developed by Angela del Pozo (manuscript in preparation). Briefly, the algorithm searches for sequence homologs of the target protein in the Swiss-Prot database. Using a non-parametric statistical coefficient of concordance, the set of functional annotations (GO terms) that best fits the pool of homologs found is transferred to the target.

In principle the two methods do not overlap. firestar generates specific annotations related to binding and/or catalytic activity, and essentially all of its terms come from the "Molecular Function" GO domain. Since the source of information for SIAM is Swiss-Prot, its terms can come from any of the three domains and are generally expected to be less specific. For these reasons, a third set of predictions was submitted, coming from the integration of the previous two.

Figure 33 presents general statistics for the firestar and SIAM results on the CAFA2 experiment dataset.

Figure 33 Overview of the predictions for the CAFA2 experiment. The pie charts show the coverage of the methods, orange for firestar and green for SIAM; the grey portion corresponds to unpredicted sequences. The bar charts compare the total number of annotations generated.

Considering only the limited EFI dataset, firestar was able to generate a prediction for all but 52 sequences, with an average of two annotations per sequence. SIAM coverage was worse, but overall the method transferred more GO terms per sequence. Looking at the entire dataset, the coverage of firestar was again slightly better, but the number of annotations transferred by SIAM was more than three times greater.

Figure 34 Overview of the integration of firestar and SIAM predictions for the CAFA2 experiment.