
Node-based approaches use the information contained in a graph’s nodes to quantify the similarity between two terms without taking into consideration the edges that connect the nodes. The majority of node-based approaches use the concept of “information content” (IC), which requires the use of information external to the ontology, in the similarity computation. There are only very few node-based semantic similarity approaches that use only the internal node-structure of the GO, including information derived from node depth and density, in order to compute semantic similarity between concepts.

First introduced by Resnik [1995], the concept of information content is based on the idea that the deeper in the hierarchy a term is, the more informative it is (i.e. the higher its information content) and the closer to the root, the less informative it is. The information content of each term in a hierarchy is calculated through the probability of occurrence of that term in a corpus or body of knowledge, i.e. the

2.2 Similarity between GO terms

| Measure | Approach | Notes |
|---|---|---|
| Resnik [1995] | Node-based | IC(MICA) |
| Lin [1998] | Node-based | IC(MICA) & IC(terms) |
| Jiang and Conrath [1997] | Node-based | IC(MICA) & IC(terms) |
| Couto et al. [2005] | Node-based | Disjoint common ancestors |
| Schlicker et al. [2006] | Node-based | IC(MICA) & IC(terms) |
| Wu et al. [2005] | Node-based | Largest shared path from LCA to root |
| Bodenreider et al. [2005] | Node-based | Cosine similarity with IDF weighting |
| del Pozo et al. [2008] | Node-based | Cosine similarity, then depth of LCA |
| Herrmann et al. [2009] | Node-based | Corpus-free variant of IC |
| Chiang et al. [2006] | Node-based | Shared path with IC weighting |
| Chiang et al. [2008] | unclear | Shortest path and depth of MICA |
| Rada et al. [1989] | Edge-based | Shared path |
| Cheng et al. [2004] | Edge-based | Shared path with depth-based edge weighting factor |
| Yu et al. [2005] | Edge-based | Shared path and distance to LCA |
| Wu et al. [2006] | Edge-based | Shared path and distance to leaf nodes and LCA |
| Jakonienė et al. [2006] | Edge-based | Shared path with weighting based on edge type |
| Yuan and Zhou [2008] | Edge-based | Shortest path between terms |
| Wang et al. [2007] | Hybrid | Shared ancestors with edge weighting |
| Othman et al. [2008] | Hybrid | IC/depth/number of children; distance |


information content of a term c is

\[
\mathrm{IC}(c) = -\ln p(c) \tag{2.1}
\]

where p(c) is the probability of concept c occurring in the taxonomy.

Concept frequencies in a taxonomy are derived from occurrence frequencies of a concept and its children in a corpus. In his research, Resnik used “WordNet” [Fellbaum, 1998] as the taxonomy and the “Brown Corpus of American English” [Francis and Kucera, 1982] as the corpus. The occurrence of a child term counts towards all the occurrences of all its parents. This is logical as some term β, which is a child of α, occurring in a hierarchy implies that α is occurring as well. This is called the “true path rule” [Ashburner et al., 2001].

The probability of a concept c occurring in a taxonomy is

\[
p(c) = \frac{\mathrm{freq}(c)}{N} \tag{2.2}
\]

where

• \(\mathrm{freq}(c) = \sum_{n \in \mathrm{concepts}(c)} \mathrm{total}(n)\);
• concepts(c) is the set of concepts that are descendants of c;
• total(n) is the number of occurrences of term n in the corpus;
• N is the total number of terms in the corpus.
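As a minimal sketch of Equations 2.1 and 2.2, the following Python fragment derives freq(c), p(c) and IC(c) for a small made-up taxonomy. The graph, the occurrence counts and all function names are hypothetical and chosen purely for illustration. Two assumptions are made: N is taken to be the sum of all direct occurrence counts, and a concept is counted among its own descendants so that its own occurrences contribute to freq(c).

```python
import math

# Hypothetical toy taxonomy: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],  # a term may have more than one parent
}
# Hypothetical direct occurrence counts total(n) in a corpus.
TOTAL = {"root": 0, "A": 1, "B": 2, "C": 3, "D": 2, "E": 2}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def freq(term):
    """Occurrences of the term plus those of all its descendants."""
    return sum(count for n, count in TOTAL.items() if term in ancestors(n))


# Assumption of this sketch: N is the sum of all direct occurrence counts.
N = sum(TOTAL.values())


def p(term):
    """Probability of the term occurring (Equation 2.2)."""
    return freq(term) / N


def ic(term):
    """Information content of the term (Equation 2.1)."""
    return -math.log(p(term))


for term in PARENTS:
    print(term, freq(term), round(p(term), 2), round(ic(term), 3))
```

With these counts the root has p = 1 and therefore an IC of 0, while the less frequent leaf terms receive the highest IC values. Because every occurrence of a child also counts for its parents, p(c) can never decrease on the way up to the root.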

The use of occurrence frequencies can be considered a disadvantage of IC-based measures as variations in the underlying corpus lead to changes in similarity results. This makes it difficult to compare results from experiments based on different corpora, such as the annotations of different species and older or newer versions of the data.

In information content-based measures, the link between two ontological terms c1 and c2 is established through the ancestor terms they share. As c1 and c2 may have more than one common ancestor, the most meaningful of those ancestors is usually considered. This is generally the “first” or “lowest common ancestor” (LCA), and also the ancestor with the smallest p(c) (or largest −ln p(c)). In Lord et al. [2003a], this is defined as the “probability of the minimum subsumer”. Another term for this ancestor is “most informative common ancestor” (MICA) [Pesquita et al., 2008], which is how this concept will be referred to from now on in this thesis. It should be noted that LCA will be distinguished here from MICA insofar as it is theoretically possible for an ancestor term a to be the LCA of two terms but not their MICA, if the IC of a is lower than that of another ancestor term b, but its distance from the root is greater than or equal to that of b. For this reason, LCA will be used when referring to the distance from the root, while MICA will be used for all IC references.
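The distinction between LCA and MICA can be illustrated with a small constructed case. The sketch below uses a hypothetical graph and hypothetical probabilities, chosen so that the deepest shared ancestor is not the most informative one. Depth is measured here as the number of edges on the longest path to the root, which is an assumption of this example.

```python
import math

# Hypothetical toy graph: x and y each have two parents, a and b.
PARENTS = {
    "root": [],
    "a1": ["root"],
    "a": ["a1"],
    "b": ["root"],
    "x": ["a", "b"],
    "y": ["a", "b"],
}
# Hypothetical probabilities p(c): the deeper ancestor a occurs often
# (low IC), the shallower ancestor b occurs rarely (high IC).
P = {"root": 1.0, "a1": 0.8, "a": 0.8, "b": 0.2, "x": 0.1, "y": 0.1}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Number of edges on the longest path from the term to the root."""
    return max((depth(parent) + 1 for parent in PARENTS[term]), default=0)


def ic(term):
    return -math.log(P[term])


common = ancestors("x") & ancestors("y")
lca = max(common, key=depth)  # deepest shared ancestor
mica = max(common, key=ic)    # most informative shared ancestor
print("LCA:", lca, "MICA:", mica)
```

Here a lies two edges below the root but has an IC of only about 0.22, whereas b lies one edge below the root with an IC of about 1.61, so the two notions select different ancestors.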

Similarity between concepts c1 and c2 according to Resnik [1995] is given by

\[
\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\ln p(c)] \tag{2.3}
\]

where S(c1, c2) is the set of terms that subsume both c1 and c2.
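Equation 2.3 can be sketched in a few lines of Python. The taxonomy and the probabilities p(c) below are hypothetical, as are the function names. A term is assumed to subsume itself, so that S(c, c) contains c.

```python
import math

# Hypothetical toy taxonomy (term -> direct parents) and probabilities p(c).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}
P = {"root": 1.0, "A": 0.8, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.2}


def subsumers(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    return -math.log(P[term])


def sim_resnik(c1, c2):
    """IC of the most informative common ancestor (Equation 2.3)."""
    return max(ic(c) for c in subsumers(c1) & subsumers(c2))


print(sim_resnik("C", "D"))  # MICA is A
print(sim_resnik("E", "B"))  # MICA is B
print(sim_resnik("C", "B"))  # only the root is shared
```

Terms that share only the root obtain a similarity of 0, because the root has p = 1.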

All other IC-based semantic similarity approaches developed after Resnik are variations on the same theme. While Resnik’s approach only uses the IC of the MICA to quantify the semantic similarity between two terms, other approaches take into account the IC of the terms whose similarity is calculated as well.

Resnik tested his approach against human similarity judgement data and concluded that it performed “encouragingly well” [Resnik, 1995], and also “significantly better than the traditional edge counting approach” [Resnik, 1995]. The main drawback of Resnik’s approach is that it only captures the position of the common ancestor within the hierarchy but not its distance from the query terms. This means that two terms directly connected to their most informative common ancestor would have the same similarity as two other terms with the same MICA but that are several levels removed from it in the hierarchy.

An IC-based approach by Lin [1998] addresses this problem by considering the information content of the terms being compared as well as that of their common ancestor. It defines the similarity between concepts c1 and c2 as

\[
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \cdot \max_{c \in S(c_1, c_2)} [-\ln p(c)]}{[-\ln p(c_1)] + [-\ln p(c_2)]} \tag{2.4}
\]
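A minimal Python sketch of Equation 2.4 follows, using a hypothetical taxonomy and hypothetical probabilities. The handling of the case where both terms have an IC of 0 (a zero denominator) is an assumption of this sketch and not part of the equation.

```python
import math

# Hypothetical toy taxonomy (term -> direct parents) and probabilities p(c).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}
P = {"root": 1.0, "A": 0.8, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.2}


def subsumers(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    return -math.log(P[term])


def ic_mica(c1, c2):
    """IC of the most informative common ancestor."""
    return max(ic(c) for c in subsumers(c1) & subsumers(c2))


def sim_lin(c1, c2):
    """Lin similarity (Equation 2.4)."""
    denominator = ic(c1) + ic(c2)
    if denominator == 0:
        # Both terms have IC = 0 (the root); returning 1.0 here is an
        # assumption of this sketch, not part of Equation 2.4.
        return 1.0
    return 2 * ic_mica(c1, c2) / denominator


print(sim_lin("C", "C"))  # identical terms
print(sim_lin("C", "D"))  # siblings below A
print(sim_lin("C", "B"))  # only the root is shared
```

Identical terms reach the upper bound of 1, and terms that share only the root obtain 0.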

This approach could be considered as a normalised version of Resnik’s approach because Lin’s similarity coefficient lies between 0 and 1, unlike Resnik’s value, which can vary between 0 and infinity (see footnote 2) [Resnik, 1995]. Lin used the same test set as Resnik to test his similarity score. He found that his approach led to a marginally higher correlation with human judgements than Resnik’s measure [Lin, 1998].

Footnote 2: Practically, Resnik’s upper limit is −ln 1

While addressing the drawback of Resnik’s method of not reflecting the distance between two terms and their common ancestor, the Lin approach has its own disadvantage in that the similarity is displaced from the graph and does not reflect the overall position of the three elements in the hierarchy. This means that two very shallow terms can have the same level of semantic similarity as two very deep terms, provided the two pairs are equally close to their respective common ancestor.
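Both drawbacks can be made concrete with a short numerical illustration. The IC values below are invented for the purpose of the example and do not come from any real ontology. The functions take IC values directly rather than terms.

```python
def sim_resnik(ic_mica, ic_c1, ic_c2):
    """Resnik similarity: only the IC of the MICA matters."""
    return ic_mica


def sim_lin(ic_mica, ic_c1, ic_c2):
    """Lin similarity: IC of the MICA relative to the IC of both terms."""
    return 2 * ic_mica / (ic_c1 + ic_c2)


# Resnik's drawback: same MICA, but the terms of the second pair are
# much further away from it (hypothetical IC values).
close_pair = {"ic_mica": 2.0, "ic_c1": 2.5, "ic_c2": 2.5}
far_pair = {"ic_mica": 2.0, "ic_c1": 6.0, "ic_c2": 6.0}
print(sim_resnik(**close_pair), sim_resnik(**far_pair))  # identical
print(sim_lin(**close_pair), sim_lin(**far_pair))        # different

# Lin's drawback: a shallow pair and a deep pair with the same ratio
# between the IC of the MICA and the IC of the terms.
shallow_pair = {"ic_mica": 1.0, "ic_c1": 2.0, "ic_c2": 2.0}
deep_pair = {"ic_mica": 4.0, "ic_c1": 8.0, "ic_c2": 8.0}
print(sim_lin(**shallow_pair), sim_lin(**deep_pair))        # identical
print(sim_resnik(**shallow_pair), sim_resnik(**deep_pair))  # different
```

Resnik’s measure cannot separate the first two pairs, while Lin’s measure cannot separate the last two, which is the complementary behaviour described above.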

This same problem also applies to the approach by Jiang and Conrath [1997], who combined the elements used in Lin’s approach into an IC-based distance measure. The semantic distance between two nodes is the inverse of the semantic similarity. For a measure bounded between 0 and 1, this translates to similarity = 1−distance [Othman et al., 2008]. Semantic distance according to Jiang and Conrath [1997] however is calculated as

\[
\mathrm{dist}_{\mathrm{Jiang}}(c_1, c_2) = [-\ln p(c_1)] + [-\ln p(c_2)] - 2 \times [-\ln p(c)] \tag{2.5}
\]

where c is the MICA of c1 and c2.

This measure therefore ranges from 0 if c1 and c2 are identical to 2 × maxIC for two leaf nodes which only have the root of the ontology as a common ancestor. maxIC is the maximum information content for a given ontology, which corresponds to an annotation frequency of 1, as a term with an annotation frequency of 0 would not have any information content, both conceptually and mathematically, as ln 0 is undefined. Jiang and Conrath’s semantic distance can be transformed into a similarity measure using

\[
\mathrm{sim}_{\mathrm{Jiang}}(c_1, c_2) = \frac{1}{\mathrm{dist}_{\mathrm{Jiang}}(c_1, c_2) + 1} \tag{2.6}
\]

where the addition of one to the distance is necessary to avoid infinity values [Couto et al., 2007]. Alternatively, the semantic distance could be normalised by division by 2 × maxIC, which would bring it into the [0,1] range, and then converted to similarity by subtracting it from 1.
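The distance of Equation 2.5 and both conversions to a similarity can be sketched as follows. The taxonomy, the probabilities and the corpus size N are hypothetical. maxIC is computed as −ln(1/N), which is this sketch’s reading of “an annotation frequency of 1”.

```python
import math

# Hypothetical toy taxonomy (term -> direct parents) and probabilities p(c).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}
P = {"root": 1.0, "A": 0.8, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.2}
N = 10  # hypothetical total number of occurrences in the corpus
MAX_IC = -math.log(1 / N)  # assumption: IC of a term that occurs once


def subsumers(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    return -math.log(P[term])


def ic_mica(c1, c2):
    return max(ic(c) for c in subsumers(c1) & subsumers(c2))


def dist_jiang(c1, c2):
    """Jiang and Conrath distance (Equation 2.5)."""
    return ic(c1) + ic(c2) - 2 * ic_mica(c1, c2)


def sim_jiang(c1, c2):
    """Similarity via 1 / (distance + 1) (Equation 2.6)."""
    return 1 / (dist_jiang(c1, c2) + 1)


def sim_jiang_normalised(c1, c2):
    """Alternative: normalise the distance by 2 * maxIC, subtract from 1."""
    return 1 - dist_jiang(c1, c2) / (2 * MAX_IC)


print(dist_jiang("C", "B"), sim_jiang("C", "B"), sim_jiang_normalised("C", "B"))
```

The two conversions generally give different values for the same pair, so results obtained with one are not directly comparable with results obtained with the other.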

In its original form, the Jiang approach was actually a hybrid approach including edge weighting factors whose influence can be controlled by two further weighting factors. Virtually all GO applications of this measure set these parameters to exclude the weighting factors, which reduces the distance measure to the node-based approach described here. For more details on the full measure, see the work by Othman et al. [2008] described in Section 2.2.3.

The validation of Jiang and Conrath’s approach used a noun portion of WordNet containing about 60000 nodes. Unlike Resnik and Lin, Jiang and Conrath did not use the entire Brown Corpus of American English to estimate the frequencies of concepts. Instead, they used SemCor [Miller et al., 1993], a subset of around 100 passages from the Brown Corpus. Their results confirmed that Resnik’s information content approach produces better results than Rada et al. [1989]’s edge-based approach. Both methods performed less well than Jiang and Conrath’s approach.

In their 2003 Bioinformatics paper, Lord et al. [2003a] proposed to investigate the relationships between gene products using semantic similarity rather than sequence similarity. They considered the three IC-based approaches described so far, although only the Resnik approach was used as it was the simplest of the three. In the same year, the authors also published a conference paper [Lord et al., 2003b] in which all three approaches were compared. These two papers marked the beginning of the use of semantic similarity in the context of the Gene Ontology. Since then, a number of node-based semantic similarity measures have been developed specifically for the Gene Ontology in order to address various drawbacks of the “original three” measures used by Lord et al.

Schlicker et al. [2006] proposed relevance similarity simRel, a measure that tackles both Resnik’s flaw of disregarding the distance between two terms and their common ancestor and Lin’s drawback of being displaced from the graph structure. Using the same information content concept as the other measures so far, relevance similarity is defined as

\[
\mathrm{sim}_{\mathrm{Rel}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left( \frac{2 \cdot [-\ln p(c)]}{[-\ln p(c_1)] + [-\ln p(c_2)]} \cdot (1 - p(c)) \right) \tag{2.7}
\]
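The following Python sketch illustrates relevance similarity on a hypothetical taxonomy with hypothetical probabilities. It reads Equation 2.7 as a maximum over all common subsumers of the product of the Lin-style ratio and the factor (1 − p(c)). Both this reading and the return value for a zero denominator are assumptions of the sketch.

```python
import math

# Hypothetical toy taxonomy (term -> direct parents) and probabilities p(c).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}
P = {"root": 1.0, "A": 0.8, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.2}


def subsumers(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    return -math.log(P[term])


def sim_rel(c1, c2):
    """Relevance similarity, read as a maximum over the whole product."""
    denominator = ic(c1) + ic(c2)
    if denominator == 0:
        # Both terms have IC = 0 (the root); returning 0.0 is an
        # assumption of this sketch.
        return 0.0
    return max(
        2 * ic(c) / denominator * (1 - P[c])
        for c in subsumers(c1) & subsumers(c2)
    )


print(sim_rel("C", "C"))  # identical terms no longer reach 1
print(sim_rel("E", "B"))
print(sim_rel("C", "B"))  # only the root is shared
```

Unlike Lin’s measure, a term compared with itself does not reach 1 here: the factor (1 − p(c)) scales the result down for frequently occurring, and therefore shallow, terms.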

Couto et al. [2005] argued that considering only the MICA of the query terms ignores important ontological information. They presented GraSM (GRAph-based Similarity Measure), a method that considers all disjunctive ancestors (ancestors that can be reached by at least one distinct path) of the query terms. The IC of all the disjunctive ancestors is averaged and used instead of the MICA’s IC in any IC-based approach. GraSM is technically not a semantic similarity measure in its own right but is included here because it is used in conjunction with IC-based approaches.

Taking a different approach than other researchers in the field, Chiang et al. [2006] created an algorithm for their GeneLibrarian tool which computes semantic similarity between GO terms as a sequence alignment measure where the path from a term to the root is the sequence and information content is used to weight each GO term. The same group also proposed another measure [Chiang et al., 2008] for another system, Similar Genes Discovery System (SGDS). This second measure is a function of the length of the shortest path between two terms and the depth of their common ancestor. It is unclear whether this method should be classed as node-based, edge-based or hybrid as the authors give no indication whether they count the nodes or the edges to determine path length and term depth.

Not all node-based semantic similarity measures make use of information content to quantify the similarity between ontology terms. Using annotation data but not information content, Bodenreider et al. [2005] proposed to compute the similarity between GO terms using cosine similarity [Baeza-Yates and Ribeiro-Neto, 1999] (see Section 3.2.1 for details on cosine similarity) in a vector space model, in which each GO term is represented as a vector of the genes it annotates. The GO term vectors are weighted to balance the effect of genes that are annotated with many GO terms and the weighting method used is inverse document frequency (IDF) (see Section 3.2.1 for details on IDF). The weight for a given GO term is defined as the log of the total number of distinct genes in the database divided by the number of genes annotated to the GO term in question. This is similar to, although not the same as, the concept of information content.

Bodenreider et al. [2005] also used statistical analysis of co-occurrence and association-rule mining to find relations between GO terms. The overall purpose of their study was to find associations between GO terms from different branches of the GO. This cannot be done using most other semantic similarity approaches as these rely on GO structure-related elements such as the common ancestor of two terms.

The cosine similarity approach was also taken by del Pozo et al. [2008]. In their work, the similarity between GO terms is effectively calculated twice. First, the similarity between terms is calculated using cosine similarity, based on the GO terms’ annotations to InterPro [Mulder et al., 2003] entries. From the resulting similarity matrix of GO terms, a “Functional Tree” is built using spectral clustering [Ng et al., 2001]. The similarity, or rather “Functional Distance” between GO terms is then defined as the height of their LCA in the functional tree. Pesquita et al. [2009] classed this measure as edge-based. Based on the definitions in del Pozo et al. [2008], the approach is presented here as part of the node-based approaches rather than the edge-based ones, since the first level of GO term similarity takes into account only the terms themselves, while the second level is based on a hierarchical clustering tree rather than the GO graph and the “height” concept is derived as part of the clustering algorithm rather than through the counting of edges.

Finally, some approaches use only the internal graph structure of the GO, excluding all external information. One such measure was proposed by Wu et al. [2005]. Although Pesquita et al. [2009] classed this approach as an edge-based approach, the present analysis found no indication that anything other than the nodes of the GO graph were used. The confusion may be due to the language used in the paper, as the similarity between GO terms is calculated based on the “shared path” between two terms, which usually implies edge-counting. The definition of path in this paper however makes it clear that it is the terms rather than the edges that connect them that are counted. Specifically, Wu et al. [2005] define the similarity between two GO terms c1 and c2 as the maximum number of common terms in any path from c1 to the root and any path from c2 to the root. This can be rephrased as the maximum number of terms from the LCA of c1 and c2 to the root.
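The definition above can be sketched in Python by enumerating every path to the root and counting shared terms. The taxonomy and function names are hypothetical, and the exhaustive path enumeration is chosen for clarity rather than efficiency. It is not claimed to be the authors’ implementation.

```python
from itertools import product

# Hypothetical toy taxonomy: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}


def paths_to_root(term):
    """All paths from the term up to the root, each as a list of terms."""
    if not PARENTS[term]:
        return [[term]]
    return [
        [term] + path
        for parent in PARENTS[term]
        for path in paths_to_root(parent)
    ]


def sim_wu(c1, c2):
    """Maximum number of terms shared by any pair of root paths."""
    return max(
        len(set(path1) & set(path2))
        for path1, path2 in product(paths_to_root(c1), paths_to_root(c2))
    )


print(sim_wu("C", "E"))  # shares A and root
print(sim_wu("C", "B"))  # shares only root
print(sim_wu("C", "C"))  # shares C, A and root
```

Because terms rather than edges are counted, two terms that share only the root obtain a similarity of 1 rather than 0.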

Herrmann et al. [2009] proposed a variant of information content which also does not use frequency counts from an external corpus but is based entirely on the structure of the GO. Their measure, precision (footnote 3) pre(c) of an ontological concept c, is defined as

\[
\mathrm{pre}(c) = -\frac{\log \dfrac{O_d(c)}{O \cdot O_a(c)}}{\log\left(O \cdot O_a^{\max}\right)} \tag{2.8}
\]

where \(O\) is the total number of terms in the ontology (footnote 4), \(O_d(c)\) the number of (distinct) descendant terms of c, \(O_a(c)\) the number of ancestor terms of c and \(O_a^{\max}\) the largest possible number of ancestor terms of any leaf node in the ontology. The similarity between two terms c1 and c2 is then defined as the precision of their most precise common ancestor,

\[
\mathrm{sim}_{\mathrm{simCT}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \mathrm{pre}(c) \tag{2.9}
\]
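As a rough sketch, the precision measure can be computed from the graph structure alone. The code below reads Equation 2.8 as pre(c) = −log(O_d(c) / (O · O_a(c))) / log(O · O_a^max) and assumes that a term is counted among both its own ancestors and its own descendants. Both points are assumptions of this example, as are the toy taxonomy and the function names.

```python
import math

# Hypothetical toy taxonomy: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "B"],
}


def ancestors(term):
    """The term itself plus all of its ancestors (assumed convention)."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(term):
    """The term itself plus all of its descendants (assumed convention)."""
    return {n for n in PARENTS if term in ancestors(n)}


O = len(PARENTS)  # total number of terms in the ontology
LEAVES = [t for t in PARENTS if len(descendants(t)) == 1]
O_A_MAX = max(len(ancestors(leaf)) for leaf in LEAVES)


def pre(term):
    """Precision of a term under the reading of Equation 2.8 given above."""
    o_d = len(descendants(term))
    o_a = len(ancestors(term))
    return -math.log(o_d / (O * o_a)) / math.log(O * O_A_MAX)


def sim_simct(c1, c2):
    """Precision of the most precise common ancestor (Equation 2.9)."""
    return max(pre(c) for c in ancestors(c1) & ancestors(c2))


for term in PARENTS:
    print(term, round(pre(term), 3))
print(sim_simct("C", "D"))
```

Under these conventions the root has a precision of 0 and a leaf with the largest number of ancestors has a precision of 1, which mirrors the behaviour of corpus-based IC without requiring annotation counts.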

The authors use their precision measure as part of a functional annotation-based clustering algorithm for gene products. The paper does not provide an evaluation of the measure and despite its advantage of being corpus-independent, the measure is not used anywhere else in the literature to date.