5. PROGNOSTIC INDICES (SEVERITY SCORES)
5.2. SEVERITY ASSESSMENT SYSTEMS FOR SEPTIC PATIENTS IN
Group-wise functional similarity approaches can be sub-divided into three groups, namely set-based approaches, vector-based approaches and graph-based approaches, depending on how the terms annotated to gene products are considered. An overview of all approaches discussed in this section is given in Table 2.2.
Set-based approaches
The simplest approach to establish the similarity between two gene products based on their functional annotations is to apply set-based similarity techniques, such as Jaccard’s index [Jaccard, 1908] or the Dice coefficient [Dice, 1945], to the sets of GO terms attributed to these gene products. Whilst relatively inexpensive in terms of computing power, purely set-based approaches are very rarely used in this context, at least not on their own. This is due to the very subtle differences that can exist between adjacent levels in biomedical ontologies. Two gene products annotated with terms that are not identical but are very close in the ontology would be scored at a much lower similarity using direct set matching than using a more complex approach taking into account ontological structure.
2.3 Similarity between gene products
Measure                        Approach      Similarity measure       Weighting
Lee et al. [2004]              Graph-based   Term overlap             None
Mistry and Pavlidis [2008]     Graph-based   Normalised term overlap  None
Martin et al. [2004]           Graph-based   Czekanowski-Dice         None
Gentleman [2005]               Graph-based   Jaccard                  None
Gentleman [2005]               Graph-based   Shared path              None
Pesquita et al. [2008]         Graph-based   Jaccard                  IC
Lin et al. [2004]              Graph-based   Intersection             Annotation set probability
Yu et al. [2007]               Graph-based   LCA                      Annotation set probability
Ye et al. [2005]               Graph-based   Normalised LCA           None
Sheehan et al. [2008]          Graph-based   IC-based (Resnik, Lin)   Annotation set probability
Jain and Bader [2010]          Graph-based   LCA                      Term-to-leaves subgraph IC
Chabalier et al. [2007]        Vector-based  Cosine similarity        IDF
Huang et al. [2007]            Vector-based  Kappa-statistic          None
Benabderrahmane et al. [2010]  Vector-based  Cosine similarity        Combination of evidence code and IDF
Table 2.2: Overview of the group-wise functional similarity approaches presented in Section 2.3.1
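The two set-based measures named in this subsection, Jaccard's index and the Dice coefficient, can be sketched in a few lines of Python. This is only an illustration: the function names and the GO-style identifiers are hypothetical, and each gene product is assumed to be represented by the set of its direct annotation terms.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard's index: shared terms divided by all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the shared terms divided by the summed set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical direct annotations of two gene products.
gene1 = {"GO:A", "GO:B", "GO:C"}
gene2 = {"GO:B", "GO:C", "GO:D"}
print(jaccard(gene1, gene2))  # 2 shared terms out of 4 distinct terms
print(dice(gene1, gene2))     # 2 * 2 shared terms over 3 + 3 terms

# Two closely related but non-identical terms share nothing as plain sets,
# which is the weakness of direct set matching described above.
print(jaccard({"GO:A1"}, {"GO:A2"}))
```

Because both functions see only term identity, two sibling terms contribute exactly as little as two unrelated terms.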
Graph-based approaches
Graph-based approaches consider the sub-graph formed by annotation terms, and thus include indirect as well as direct annotations in the similarity calculations. They are by far the most commonly used group-wise functional similarity approaches in the GO.
Although set-based similarity techniques are not used in the GO for comparing sets of annotations, they are used in conjunction with graph-based approaches, treating GO term induced subgraphs as sets. The earliest example of this in the GO was by Lee et al. [2004], who defined the similarity between two gene products as the size of the intersection of their sets of GO terms. The sets of GO terms include all ancestor terms of each direct annotation term, i.e. the subgraphs from term to root induced by each term. Mistry and Pavlidis [2008] refer to Lee et al.’s measure as “Term Overlap” (TO). They present a normalised version of TO (NTO) in which the term overlap similarity is divided by the size of the smaller of the two GO term sets.
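As a minimal sketch of TO and NTO, the following Python fragment expands the direct annotations of each gene product to the full term-to-root sets and then compares them. The toy ontology in PARENTS, the term names and the function names are hypothetical and serve illustration only; they are not taken from Lee et al. or from Mistry and Pavlidis.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "b1": ["b"],
}


def induced_terms(direct_terms):
    """Return the direct terms plus all of their ancestors up to the root."""
    seen = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def term_overlap(direct1, direct2):
    """TO: number of terms shared by the two induced term sets."""
    return len(induced_terms(direct1) & induced_terms(direct2))


def normalised_term_overlap(direct1, direct2):
    """NTO: TO divided by the size of the smaller induced term set."""
    s1, s2 = induced_terms(direct1), induced_terms(direct2)
    return len(s1 & s2) / min(len(s1), len(s2))


# gene1 induces {a1, a, root}; gene2 induces {a2, a, b1, b, root}.
gene1, gene2 = {"a1"}, {"a2", "b1"}
print(term_overlap(gene1, gene2))             # shared terms: a and root
print(normalised_term_overlap(gene1, gene2))  # 2 shared / 3 terms in the smaller set
```

Unlike direct set matching, the two gene products here receive a non-zero score even though they share no direct annotation, because their annotations share ancestors.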
Martin et al. [2004] used a slightly more sophisticated distance measure, the Czekanowski-Dice formula, which is the cardinality of the symmetric difference between two term sets divided by the sum of the cardinalities of their union and intersection. A similar approach was proposed by Gentleman [2005], whose simUI measure divides the cardinality of the intersection of two induced subgraphs by the cardinality of their union. This is effectively Jaccard’s index. In the same work, Gentleman also proposed another measure, simLP, which is not based on set similarity. simLP is defined as the length of the longest path shared by the two subgraphs. A fifth set similarity-based approach of induced subgraphs, simGIC, was defined by Pesquita et al. [2007]. They combined the Jaccard index with information content by replacing the cardinalities of intersection and union with the sums of the information content of all the terms in the intersection and union of two term sets.
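The three set-based calculations on induced term sets described in this paragraph can be sketched as follows. The function names, the example term sets and the information content values in IC are all hypothetical; in practice the information content would be derived from an annotation corpus.

```python
def czekanowski_dice_distance(s1: set, s2: set) -> float:
    """Symmetric difference size over (union size + intersection size)."""
    denominator = len(s1 | s2) + len(s1 & s2)
    return len(s1 ^ s2) / denominator if denominator else 0.0


def sim_ui(s1: set, s2: set) -> float:
    """Intersection size over union size (Jaccard's index on induced sets)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def sim_gic(s1: set, s2: set, ic: dict) -> float:
    """Like sim_ui, but summing information content instead of counting terms."""
    union_ic = sum(ic[t] for t in s1 | s2)
    inter_ic = sum(ic[t] for t in s1 & s2)
    return inter_ic / union_ic if union_ic else 0.0


# Hypothetical induced term sets (direct annotations plus ancestors).
set1 = {"root", "a", "a1"}
set2 = {"root", "a", "a2", "b", "b1"}

# Hypothetical information content values: deeper terms are more informative.
IC = {"root": 0.0, "a": 1.0, "b": 1.0, "a1": 2.0, "a2": 2.0, "b1": 2.0}

print(czekanowski_dice_distance(set1, set2))  # 4 differing terms / (6 + 2)
print(sim_ui(set1, set2))                     # 2 shared terms / 6 distinct terms
print(sim_gic(set1, set2, IC))                # shared IC 1.0 / total IC 8.0
```

Note that the first function is a distance (0 for identical sets) while the other two are similarities (1 for identical sets). With these values, the IC weighting lowers the score relative to simUI, since the shared terms are the uninformative ones near the root.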
Not all graph-based functional similarity approaches applied to GO annotation make use of set similarity concepts. Lin et al. [2004] proposed to establish the shared subtree, called the “intersection tree”, for all pairs of proteins in a population, then calculate the similarity between each protein pair as the frequency of their intersection tree in the overall population. The “total ancestry measure” by Yu et al. [2007] is effectively a normalised version of this as it defines the functional similarity between two proteins as the number of protein pairs in a population with exactly the same set of LCAs as the proteins in question, divided by the total number of protein pairs in the population. Although the two measures differ in their conceptual definitions, the actual calculations are essentially the same.
The approach suggested by Ye et al. [2005] focusses on the depth of the shared part of the induced subgraphs of two gene products. Similarity is calculated by dividing the difference between the depth of the deepest common term and the minimum depth of the ontology (always 1) by the difference between the maximum and minimum depths of the ontology.
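The normalisation described for Ye et al.'s measure amounts to a single ratio, sketched below. The function and parameter names are hypothetical, and the depths in the example are made-up values rather than properties of any particular GO release.

```python
def normalised_depth_similarity(deepest_common_depth: int,
                                max_depth: int,
                                min_depth: int = 1) -> float:
    """(depth of deepest common term - min depth) / (max depth - min depth)."""
    if max_depth <= min_depth:
        raise ValueError("max_depth must be greater than min_depth")
    return (deepest_common_depth - min_depth) / (max_depth - min_depth)


# Hypothetical example: deepest shared term at depth 4 in an ontology of depth 13.
print(normalised_depth_similarity(4, 13))  # (4 - 1) / (13 - 1)
```

Two gene products that share only the root (depth 1) score 0, and two that share a term at the maximum depth score 1.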
Sheehan et al. [2008] propose the SSA algorithm, a rule-based system that extends information content similarity between GO terms, particularly Resnik’s and Lin’s measures, to a framework for describing the similarity between sets of annotations. Based on the GO graph structure and the relationships between terms, the SSA algorithm derives a set of “contextual terms” that describe the annotations of two gene products. This term set, called “nearest common annotation” (NCA), is used as the LCA of the gene products’ annotations and based on the instances of the term set in a corpus of annotations, the similarity between gene products is calculated according to the same principle as Resnik’s or Lin’s similarity between GO terms.
The measure by Cho et al. [2007] was classed as a separate graph-based functional similarity measure by Pesquita et al. [2009]. This measure is, however, essentially Resnik’s information content semantic similarity measure, combined with maximum functional similarity (see Section 2.3.2). The only difference in this new measure is in the way the calculation is defined. Rather than calculating the similarity between each pair of GO terms, then combining the pairwise semantic similarities into a functional similarity score, Cho et al. use the smallest GO term “annotation size” of all GO terms shared between two gene products. Annotation size is defined as the number of proteins annotated to a GO term or any of its child terms. The smallest annotation size is divided by the annotation size of the root and the similarity between two gene products is the negative log of this ratio. The measure is mentioned here due to its inclusion in Pesquita et al.’s review but is not considered as a graph-based functional similarity measure.
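The calculation attributed to Cho et al. can be sketched as below. The function name, the term names and the annotation sizes are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```python
import math


def annotation_size_similarity(shared_terms, annotation_size, root="root"):
    """Negative log of (smallest shared annotation size / root annotation size)."""
    smallest = min(annotation_size[t] for t in shared_terms)
    # Assumption: natural logarithm; the base is not specified in the text.
    return -math.log(smallest / annotation_size[root])


# Hypothetical annotation sizes: proteins annotated to a term or its descendants.
ANNOTATION_SIZE = {"root": 1000, "a": 100, "a1": 10}

# Terms shared by two hypothetical gene products (always includes the root).
shared = {"root", "a"}
print(annotation_size_similarity(shared, ANNOTATION_SIZE))  # -log(100 / 1000)
```

Because the term with the smallest annotation size is also the shared term with the highest information content, this yields the same value as taking the maximum Resnik similarity over the shared terms, which is the equivalence pointed out above.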
Jain and Bader [2010] similarly define functional similarity between two gene products as the maximum information content of the lowest common ancestor of their annotation terms. In this work, the authors transform the GO into a set of subgraphs but, unlike other works, where a subgraph is generally the part of the ontology between a term and the root, these subgraphs reach from a high-level term to the leaves of the ontology. The subgraphs are defined so that there is minimal overlap between them. Multiple subgraphs form a meta-graph based on the position of their respective root nodes in the original GO hierarchy. The information content for each GO term within a subgraph is calculated using only the terms and annotation frequencies within that subgraph. The higher-level terms that are not part of a subgraph have their information content calculated based on occurrence probabilities from all the subgraphs they subsume. Through this system, gene products that are annotated with terms from the same subgraph have higher similarity than gene products annotated with terms from different subgraphs.
Vector-based approaches
Vector-based approaches generally represent the gene product annotations as multidimensional vectors, where each dimension represents one possible GO term. Vectors can be binary, with the presence or absence of each term in a given set of annotations denoted by 1 or 0 respectively. Alternatively, vectors can be weighted, making the contribution of each term to the vector more nuanced. While vector-based approaches have been used in the GO context, they are far less common than graph- and information theory-based methods. This is mostly because they are highly computationally intensive yet, just like set-based approaches, fail to capture the information contained in the ontological structure. Efforts to date include the cosine similarity-based functional similarity of Chabalier et al. [2007], the kappa statistics approach of Huang et al. [2007] used in DAVID and, most recently, the variant on weighted cosine similarity of Benabderrahmane et al. [2010].
Chabalier et al. used the same approach described by Bodenreider et al. [2005], but calculated the similarity between gene products based on vectors of GO terms rather than the other way round. The authors also used IDF to weight the contribution of each GO term to a gene’s annotation vector. Pesquita et al. [2009] equate this to weighting using information content, which is not entirely appropriate as the probability of occurrence in IC is based on the total number of annotations of a term or any of its children divided by the total number of annotations in the corpus, while IDF is based on the number of occurrences of term t divided by the total number of distinct genes.
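An IDF-weighted cosine similarity between gene products can be sketched as follows. This is an illustration under stated assumptions rather than Chabalier et al.'s implementation: the IDF is taken in its common text-retrieval form, log(N / n_t), with N the number of genes and n_t the number of genes annotated with term t, and all gene names, term names and function names are hypothetical.

```python
import math


def idf_weights(annotations: dict) -> dict:
    """Assumed IDF form: log(number of genes / number of genes with the term)."""
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log(n_genes / count) for term, count in counts.items()}


def weighted_vector(terms: set, idf: dict) -> dict:
    """Sparse vector: one dimension per annotated term, weighted by its IDF."""
    return {term: idf[term] for term in terms}


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical annotation corpus: gene -> set of GO terms.
ANNOTATIONS = {
    "g1": {"t1", "t2"},
    "g2": {"t1", "t3"},
    "g3": {"t1"},
    "g4": {"t2", "t3"},
}

idf = idf_weights(ANNOTATIONS)
v1 = weighted_vector(ANNOTATIONS["g1"], idf)
v2 = weighted_vector(ANNOTATIONS["g2"], idf)
print(cosine(v1, v2))
```

In this toy corpus g1 and g2 share only t1, the most frequent term, so the weighted score falls below the 0.5 that unweighted binary vectors would give. The weights depend only on how many genes carry a term, not on the term's position in the ontology.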
A new GO-specific weighting approach was defined by Benabderrahmane et al. [2010]. In their approach, each dimension of each vector consists of a base vector and a coefficient; the coefficient is the product of the IDF for that term and a weight that reflects the evidence code of that annotation. In the calculation of functional similarity between two gene products using cosine similarity, the dot product between the two base vectors for a given dimension reflects the ratio of the depth of the two terms’ common ancestor to the sum of the depths of the two terms.
Huang et al. [2007] proposed to quantify the similarity between gene products using kappa statistics [Cohen, 1960], a chance-corrected measure of co-occurrence. They also represented the gene products as vectors of their annotations but included not only GO terms but also annotations from a number of other sources, such as KEGG pathways [Kanehisa and Goto, 2000], UniProt sequence features [The UniProt Consortium, 2008] and InterPro domains [Mulder et al., 2003]. Each gene product-term association is binary, with no weighting. The DAVID tool also uses the reverse approach (annotations represented as vectors of the gene products they annotate) to calculate the similarity between annotation terms.
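Cohen's kappa on two binary annotation vectors can be sketched as below. This follows the textbook definition of the statistic and is not taken from the DAVID implementation; the function name and the example vectors are hypothetical, and returning 0.0 when the statistic is undefined is a convention chosen only for this sketch.

```python
def cohen_kappa(x: list, y: list) -> float:
    """Chance-corrected agreement between two equal-length binary vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(x)
    # Observed agreement: positions where both vectors hold the same value.
    observed = sum(1 for a, b in zip(x, y) if a == b) / n
    # Expected agreement by chance, from each vector's share of 1s and 0s.
    p_x, p_y = sum(x) / n, sum(y) / n
    expected = p_x * p_y + (1 - p_x) * (1 - p_y)
    if expected == 1:
        return 0.0  # kappa is undefined here; 0.0 is this sketch's convention
    return (observed - expected) / (1 - expected)


# Hypothetical gene products over six annotation terms (1 = annotated).
gene1 = [1, 1, 0, 0, 1, 0]
gene2 = [1, 0, 0, 0, 1, 1]
print(cohen_kappa(gene1, gene2))  # agreement 4/6 against 0.5 expected by chance
```

Swapping the roles of rows and columns, so that each vector lists the gene products annotated with one term, gives the term-to-term comparison mentioned for DAVID.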