4.2.1 Introducing the rfunSim Score

In Section 2.3.3, we described our method for assessing the functional similarity of two gene products, the funSim score. It is based on the BMA approach, which calculates the similarity between two gene products A and B with sets of GO annotations GO^A and GO^B, respectively, as follows. For each term in GO^A, find the most similar term in GO^B, and calculate the average of these similarities as rowScore(A, B). Then, for each term in GO^B, find the term with the highest similarity in GO^A, and calculate the average as columnScore(A, B). GOscoreBMA_max(A, B) is then defined as the maximum of rowScore(A, B) and columnScore(A, B). It is calculated for the BP ontology (BPscore) and for the MF ontology (MFscore) (Section 2.3). The combined funSim score is calculated as follows:

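$$\mathrm{funSim}(A, B) = \frac{1}{2} \cdot \left[ \left( \frac{\mathrm{BPscore}}{\max(\mathrm{BPscore})} \right)^{2} + \left( \frac{\mathrm{MFscore}}{\max(\mathrm{MFscore})} \right)^{2} \right], \tag{4.1}$$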

where max(BPscore) and max(MFscore) denote the maximum possible scores for BP and MF, respectively. The funSim score ranges from 0 for completely unrelated gene products to 1 for gene products with identical functionality. In most cases, the funSim score is lower than the average of BPscore and MFscore. In order to obtain a more balanced score distribution, we define the rfunSim score for two gene products as (Schlicker et al., 2007b):

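$$\mathrm{rfunSim}(A, B) = \sqrt{\mathrm{funSim}(A, B)}, \tag{4.2}$$

which also ranges from 0 to 1, but its values are up to 0.25 (25% of the score range) higher; the largest difference occurs at a funSim score of 0.25.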
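The BMA procedure and Equations 4.1 and 4.2 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names, the term labels `t1` to `t3`, and the toy similarity table are hypothetical, and the term similarity measure is passed in as an arbitrary function.

```python
import math
from typing import Callable, Sequence


def bma_max(terms_a: Sequence[str], terms_b: Sequence[str],
            sim: Callable[[str, str], float]) -> float:
    """GOscoreBMA_max: maximum of the row and column best-match averages."""
    # rowScore: for each term of A, the best match in B, averaged over A's terms.
    row_score = sum(max(sim(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    # columnScore: for each term of B, the best match in A, averaged over B's terms.
    column_score = sum(max(sim(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return max(row_score, column_score)


def fun_sim(bp_score: float, mf_score: float,
            max_bp: float = 1.0, max_mf: float = 1.0) -> float:
    """Equation 4.1: mean of the squared, normalized BP and MF scores."""
    return 0.5 * ((bp_score / max_bp) ** 2 + (mf_score / max_mf) ** 2)


def rfun_sim(bp_score: float, mf_score: float,
             max_bp: float = 1.0, max_mf: float = 1.0) -> float:
    """Equation 4.2: square root of the funSim score."""
    return math.sqrt(fun_sim(bp_score, mf_score, max_bp, max_mf))


# Hypothetical term similarities, for illustration only.
toy_similarity = {("t1", "t3"): 0.8, ("t2", "t3"): 0.4}


def toy_sim(a: str, b: str) -> float:
    return toy_similarity[(a, b)]


bp_example = bma_max(["t1", "t2"], ["t3"], toy_sim)  # rowScore 0.6, columnScore 0.8
fun_example = fun_sim(0.8, 0.6)
rfun_example = rfun_sim(0.8, 0.6)
print(bp_example, fun_example, rfun_example)
```

With a BPscore of 0.8 and an MFscore of 0.6 (both with a maximum of 1), the funSim score is about 0.5, below the plain average of 0.7, while the rfunSim score is about 0.71.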

Despite being a simple transformation, the square root changes the performance of the score. In order to test how well the funSim and rfunSim scores differentiate between protein pairs without sequence similarity and orthologous protein pairs, we utilized the sets of Inparanoid orthologs (IO) and of protein pairs without significant sequence similarity (NSS) described in Section 3.2. For all protein pairs in both sets, the funSim and the rfunSim scores were computed and used for estimating the performance of predicting true positives (protein pairs in IO) and true negatives (protein pairs in NSS). The receiver operating characteristic (ROC) curve (Figure 4.1) was calculated and visualized using the ROCR package (Sing et al., 2005) for the statistical computing environment R (http://www.r-project.org). It can be seen that the rfunSim score threshold is higher at given true positive and false positive rates.
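The relationship between thresholds and rates can be illustrated with a small Python sketch. It is not the ROCR-based analysis described above: the helper `roc_points`, the six scores, and the labels (1 for an IO pair, 0 for an NSS pair) are hypothetical toy values chosen only to show the mechanics.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) tuples,
    one per distinct score, from the highest threshold to the lowest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A pair is predicted positive if its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data: funSim scores and class labels (1 = IO, 0 = NSS).
fun_sim_scores = [0.81, 0.64, 0.49, 0.36, 0.25, 0.09]
labels = [1, 1, 0, 1, 0, 0]
rfun_sim_scores = [math.sqrt(s) for s in fun_sim_scores]

fun_points = roc_points(fun_sim_scores, labels)
rfun_points = roc_points(rfun_sim_scores, labels)

for (t_f, fpr, tpr), (t_r, _, _) in zip(fun_points, rfun_points):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f} funSim>={t_f:.2f} rfunSim>={t_r:.2f}")
```

In this toy setting the square root preserves the ranking of the pairs, so each (false positive rate, true positive rate) point is reached by both scores, and the rfunSim threshold at that point is the square root of the funSim threshold and therefore higher.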

The calibration error of a score measures how well the score coincides with the true class membership (Caruana and Niculescu-Mizil, 2004). For example, protein pairs with a score of 0.6 should belong to IO in 60% of the cases and to NSS in 40% of the cases, and the calibration error measures the deviation from this ideal scenario. To calculate the calibration error, all protein pairs are ordered according to their score. Then, pairs 1 to 100 are put into a bin and the fraction of true positives in this bin is calculated. Next, the mean prediction (the mean score of the pairs in the bin) is calculated, and the absolute difference between the observed true positive frequency and the mean prediction gives the calibration error for this bin. This computation is repeated for protein pairs 2 to 101, 3 to 102, and so on. The final calibration error is the mean of the calibration errors of the single bins. For this test, the funSim and rfunSim scores are interpreted as probabilities that two proteins are functionally similar. ROCR was used for calculating and plotting the calibration error of both scores (Figure 4.2). The rfunSim score has a smaller calibration error than the funSim score up to a value of approximately 0.75, and the two errors are roughly equal thereafter. The results from the ROC curves and the calibration error analysis support the intuition that the rfunSim score gives better results.
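The sliding-window computation described above can be written down directly. The following Python sketch follows the textual description rather than the ROCR source code; the function name `calibration_error`, the `window` parameter, and the four-pair toy data are assumptions made for illustration (the text uses bins of 100 pairs).

```python
def calibration_error(scores, labels, window=100):
    """Mean absolute difference between the observed positive fraction and
    the mean score, taken over all sliding windows of `window` sorted pairs."""
    pairs = sorted(zip(scores, labels))  # order the pairs by score
    if len(pairs) < window:
        raise ValueError("need at least `window` scored pairs")
    bin_errors = []
    for start in range(len(pairs) - window + 1):
        chunk = pairs[start:start + window]
        mean_prediction = sum(s for s, _ in chunk) / window
        observed_positives = sum(y for _, y in chunk) / window
        bin_errors.append(abs(observed_positives - mean_prediction))
    return sum(bin_errors) / len(bin_errors)


# Hypothetical toy data with a window of 2 instead of 100.
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [0, 0, 1, 1]  # 1 = IO pair, 0 = NSS pair
toy_error = calibration_error(toy_scores, toy_labels, window=2)
print(toy_error)
```

For the toy data, the three windows have errors of 0.15, 0.0, and 0.15, which gives a calibration error of about 0.1.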

For a more detailed analysis of the differences between the two scores, we performed a functional comparison between proteins from Schizosaccharomyces pombe (NCBI Taxonomy id: 4896) and Saccharomyces cerevisiae (NCBI Taxonomy id: 4932) with the two scores. The proteins and their GO annotations were extracted from UniProtKB release 8.4. In the following, some examples of protein pairs with varying functional similarity illustrate the difference between the funSim and rfunSim scores. The stress response protein bis1 (UniProtKB accession: O59793) from S. pombe is annotated with the function "protein heterodimerization activity" (GO:0046982) and the process "response to stress" (GO:0006950). The high pH protein 2 (UniProtKB accession: P39734) from S. cerevisiae is involved in the same process but annotated with "protein binding" (GO:0005515) as function. The funSim score of these two proteins is 0.655 and the rfunSim score is 0.809. Since both proteins are involved in the same process and "protein heterodimerization activity" is a descendant of "protein binding" in the GO graph, the rfunSim score seems to more accurately reflect the true functional similarity.

Figure 4.1: ROC curve for classifying protein pairs as belonging to the sets IO or NSS. The ROC curve shows the performance of the funSim score and the rfunSim score. The squares and circles mark funSim and rfunSim score thresholds, respectively. The symbols are colored according to the score threshold they represent: red corresponds to a threshold of 0.8, orange to a threshold of 0.6, green to a threshold of 0.4, and blue to a threshold of 0.2.

The S. pombe protein glucan endo-1,3-alpha-glucosidase agn1 precursor (UniProtKB accession: O13716) is involved in "cell septum edging catabolism" (GO:0030995) and has "glucan endo-1,3-alpha-glucosidase activity" (GO:0051118). The protein EGT2 precursor (UniProtKB accession: P42835) from S. cerevisiae is annotated with the function "cellulase activity" (GO:0008810) and the process "cytokinesis" (GO:0000910). These two proteins have a funSim score of 0.364 and an rfunSim score of 0.603. Looking at the GO graph, it becomes evident that "cytokinesis" is an ancestor of "cell septum edging catabolism" and that the functions of the two proteins are related through the common ancestor "hydrolase activity, hydrolyzing O-glycosyl compounds" (GO:0004553). These close relationships between the MF terms and the BP terms annotated to the two proteins are more precisely captured by the rfunSim score.

Figure 4.2: Calibration error for classifying protein pairs as belonging to the sets IO or NSS. The green curve shows the calibration error for the funSim score and the red curve for the rfunSim score.

Phosphatidylinositol-4-phosphate 5-kinase fab1 (UniProtKB accession: O59722) from S. pombe has "1-phosphatidylinositol-3-phosphate 5-kinase activity" (GO:0000285) in the process of "endocytosis" (GO:0006897). The 1-phosphatidylinositol-3-phosphate 5-kinase FAB1 (UniProtKB accession: P34756) from S. cerevisiae has the same function, but is annotated with three different processes, namely "phospholipid metabolism" (GO:0006644), "response to stress" (GO:0006950), and "vacuole organization and biogenesis" (GO:0007033). Assuming that the two proteins perform the same function, the rfunSim score of 0.711 seems more accurate than the funSim score of 0.505 although they are involved in completely unrelated processes.
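As a quick consistency check on the three examples, the reported rfunSim values can be recomputed from the reported funSim values with Equation 4.2. The Python sketch below uses only the scores quoted in the text; the variable names are illustrative, and because the funSim values are themselves rounded to three decimals, agreement is only expected at that precision.

```python
import math

# (UniProtKB accession pair, reported funSim, reported rfunSim), as quoted in the text.
reported = [
    ("O59793 / P39734", 0.655, 0.809),
    ("O13716 / P42835", 0.364, 0.603),
    ("O59722 / P34756", 0.505, 0.711),
]

# Recompute rfunSim as the square root of funSim, rounded to three decimals.
recomputed = [
    (pair, round(math.sqrt(fun_sim_value), 3), rfun_sim_value)
    for pair, fun_sim_value, rfun_sim_value in reported
]

for pair, sqrt_value, rfun_sim_value in recomputed:
    print(f"{pair}: sqrt(funSim) = {sqrt_value:.3f}, reported rfunSim = {rfun_sim_value:.3f}")
```

All three recomputed values agree with the reported rfunSim scores to three decimals.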


4.2.2 Adding Cellular Component to the funSim Score

The funSim and the rfunSim scores calculate functional similarity based on BP and MF annotations. In order to assess the overall functional similarity of two annotated entities, however, it is also important to take into account in which cellular compartments they execute their specific functions. Therefore, we introduce the funSimAll and rfunSimAll scores that additionally integrate CC annotation. They are defined as follows:

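$$\mathrm{funSimAll}(A, B) = \frac{1}{3} \cdot \left[ \left( \frac{\mathrm{BPscore}}{\max(\mathrm{BPscore})} \right)^{2} + \left( \frac{\mathrm{MFscore}}{\max(\mathrm{MFscore})} \right)^{2} + \left( \frac{\mathrm{CCscore}}{\max(\mathrm{CCscore})} \right)^{2} \right]$$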
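The funSimAll formula extends the two-ontology average to three ontologies, which a short Python sketch makes concrete. The function name `fun_sim_all`, its parameter names, and the example scores are hypothetical; only funSimAll is sketched, since it is the formula given above.

```python
def fun_sim_all(bp_score: float, mf_score: float, cc_score: float,
                max_bp: float = 1.0, max_mf: float = 1.0,
                max_cc: float = 1.0) -> float:
    """funSimAll: mean of the squared, normalized BP, MF, and CC scores."""
    return ((bp_score / max_bp) ** 2
            + (mf_score / max_mf) ** 2
            + (cc_score / max_cc) ** 2) / 3.0


# Hypothetical scores: identical cellular component annotation (CCscore = 1)
# combined with BP and MF scores of 0.8 and 0.6.
example_all = fun_sim_all(0.8, 0.6, 1.0)
print(example_all)
```

In this example, the matching CC annotation raises the combined score to about 0.67, compared with a funSim score of 0.5 for the same BP and MF scores.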
