1.2.2. Quantitative studies
This section provides an overview of semantic similarity measures and some of their applications in the biomedical context. So far, most of the aforementioned Word Sense Disambiguation (WSD) approaches treat the background knowledge sources as terminologies, without taking into account their taxonomic structure or the semantic similarity between terms. This gap is filled by the systematic comparison of the three approaches, which use ontologies with inference, semantic similarity and metadata to solve the problem of WSD for ontological terms (see Chapter 3, Section 3.3). In this context, a semantic similarity measure is a function that, given two ontology terms, returns a numerical value reflecting the closeness in meaning between them.
Semantic Similarity Measures
An overview of some semantic similarity measures proposed to assess the conceptual distance between concepts or sets of concepts, together with some of their applications, is given in Table 2.3.
Rada et al. (1989) were among the first to discuss similarity between concepts in semantic nets. They proposed a metric, called Distance, to assess the conceptual distance between sets of concepts in a semantic net of hierarchical relations. The Distance between two concepts in a hierarchy is defined as the minimum number of edges separating them. They also defined Distance on sets of nodes, in order to measure the similarity between sets of concepts. They tested the appropriateness of the metric for measuring the conceptual distance between concepts in MeSH (Nelson et al., 2001) and compared it to human assessment. They concluded that Distance is a valuable tool for simulating human assessments of conceptual distance and for evaluating some cognitive aspects of semantic nets. Their long-term goal was to solve the problem of ranking documents in response to a query.
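The edge-counting idea can be illustrated with a minimal sketch. The hierarchy below is a hypothetical toy example (it is not taken from MeSH), and the set-level distance is assumed here to be the mean of all pairwise distances.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parent); not taken from MeSH.
PARENT = {
    "heart": "organ",
    "lung": "organ",
    "organ": "body part",
    "finger": "body part",
}

def neighbours(node):
    """Nodes one hierarchical edge away: the children plus the parent."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def distance(a, b):
    """Minimum number of edges between two concepts (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == b:
            return steps
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    raise ValueError(f"no path between {a!r} and {b!r}")

def set_distance(group_a, group_b):
    """Assumed set-level variant: mean distance over all cross pairs."""
    pairs = [(a, b) for a in group_a for b in group_b]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

In this toy hierarchy, "heart" and "lung" are two edges apart (via "organ"), whereas "heart" and "finger" are three edges apart.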
Sussna (1993) used the WordNet semantic network (Fellbaum, 1998) and applied disambiguation to a Times magazine corpus (of 5 documents). Sussna introduced the idea of mutual constraint among terms and its special case, the frozen past approach, in order to achieve total distance minimization (or “energy minimization”). He used a moving window of terms in focus, advancing from the beginning of a document towards its end. In the frozen past approach, all terms except the one being disambiguated have already had their senses determined and “frozen”. Sussna concluded that performance with the moving frozen past window rises up to a point and then plateaus. The method trades space for time: it keeps large data structures in memory, requires minimal runtime processing and performs no syntactic analysis.
Resnik (1995) introduced and quantified a new measure of semantic similarity based on the information content of a concept. He converted the measure from pure distance (the number of intervening is-a links) to similarity. Resnik defined the similarity between two concepts as the extent to which they share information in common. In a hierarchical concept/class space, this common information “carrier” can be identified as a specific concept node that subsumes both concepts in the hierarchy (a parent super-class of both). The similarity value was defined as the information content of this specific super-ordinate class. The information content of a class was in turn obtained by estimating the probability of occurrence of this class in a large text corpus. The problem is that the measure sometimes produces spuriously high similarity values for words on the basis of inappropriate word senses (e.g., due to synonyms). In measuring similarity between words, it is the relationship among word senses that matters.
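As a minimal sketch of this definition, the information content of a class can be taken as the negative logarithm of its probability, and the similarity as the highest information content among the classes that subsume both concepts. The taxonomy and the corpus counts below are hypothetical, made-up values used only for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "bird": "animal"}
COUNTS = {"dog": 30, "cat": 30, "mammal": 10, "bird": 20, "animal": 10}

def ancestors(concept):
    """The concept itself plus every class that subsumes it."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def probability(concept):
    """An occurrence of a concept also counts for all classes subsuming it."""
    total = sum(COUNTS.values())
    covered = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return covered / total

def information_content(concept):
    """IC(c) = log(1 / p(c)), i.e. -log p(c)."""
    return math.log(1 / probability(concept))

def resnik(a, b):
    """Information content of the most informative common subsumer."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)
```

With these counts, "dog" and "cat" share the subsumer "mammal" (probability 0.7), while "dog" and "bird" share only the root, whose information content is zero.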
Richardson and Smeaton (1995) introduced an approach to Information Retrieval (IR) based on computing a semantic distance measurement between concepts of words and using this word distance to compute a similarity between a query and a document. They applied Resnik’s (Resnik, 1995) information-based similarity estimator and Rada’s conceptual distance estimator to WordNet synsets and found that the measures were less accurate than expected. Richardson and Smeaton found that irregular densities of links between concepts result in unexpected conceptual distance outcomes.
Jiang and Conrath (1997) proposed a combined approach that inherits the edge-based approach of the edge counting scheme (Rada’s distance (Rada et al., 1989)), enhanced by the node-based approach of the information content calculation (Resnik’s information content (Resnik, 1995)). They first considered the link strength factor, which is the difference between the information content values of a child concept and its parent concept. Considering other factors, such as local density (the greater the density, the closer the distance between the nodes), node depth (distance shrinks as one descends the hierarchy) and link type (relation type, e.g. is-a, part-of), Jiang and Conrath first defined the overall edge weight for a child node and its parent, and then the overall distance between two nodes as the sum of the edge weights along the shortest path linking the two nodes. Jiang and Conrath tested their approach on a common dataset of word pair similarity ratings; it outperformed other computational models and gave the highest correlation with a benchmark based on human similarity judgements.
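The following sketch illustrates the simplest case only, under the assumption that the edge weight reduces to the link strength (local density, node depth and link type are ignored). The taxonomy and the information content values are hypothetical. Summing the link strengths along the path through the lowest common subsumer then telescopes to IC(a) + IC(b) - 2 IC(lcs).

```python
# Hypothetical toy taxonomy (child -> parent) with made-up information content.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "bird": "animal"}
IC = {"animal": 0.0, "mammal": 0.4, "bird": 1.6, "dog": 1.2, "cat": 1.2}

def path_to_root(concept):
    """The concept followed by its successive parents up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def link_strength(child):
    """Edge weight: information content gained from parent to child."""
    return IC[child] - IC[PARENT[child]]

def jiang_conrath_distance(a, b):
    """Sum of edge weights along the tree path from a to b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    lcs = next(c for c in up_a if c in up_b)  # lowest common subsumer
    below_a = up_a[:up_a.index(lcs)]  # nodes strictly below lcs on a's side
    below_b = up_b[:up_b.index(lcs)]  # nodes strictly below lcs on b's side
    return sum(link_strength(c) for c in below_a + below_b)
```

Because the toy taxonomy is a tree, the shortest path between two nodes always passes through their lowest common subsumer; a hierarchy with multiple inheritance would need an explicit shortest-path search.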
Lin (1998) provided a universal definition of similarity in terms of information theory: “the similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are”. Lin demonstrated the universality of this definition by its application in different domains, such as similarity between ordinal values, feature vectors (string similarity), word similarity and semantic similarity in a taxonomy.
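For the taxonomy case, this ratio is commonly instantiated as twice the information content of the lowest common subsumer divided by the sum of the information contents of the two concepts. The sketch below assumes that instantiation; the taxonomy and the concept probabilities are hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent) with made-up probabilities.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "bird": "animal"}
PROB = {"animal": 1.0, "mammal": 0.7, "bird": 0.2, "dog": 0.3, "cat": 0.3}

def path_to_root(concept):
    """The concept followed by its successive parents up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def information_content(concept):
    """IC(c) = log(1 / p(c)), i.e. -log p(c)."""
    return math.log(1 / PROB[concept])

def lin(a, b):
    """Shared information divided by the information describing a and b."""
    up_b = path_to_root(b)
    lcs = next(c for c in path_to_root(a) if c in up_b)  # lowest common subsumer
    shared = 2 * information_content(lcs)
    total = information_content(a) + information_content(b)
    return shared / total  # undefined when both concepts are the root (IC = 0)
```

The value is 1 for a concept compared with itself and 0 when the only common subsumer is the root.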
Approaches to measuring semantic similarity (or semantic relatedness) can be categorized into dictionary-based, corpus-based and hybrid (Budanitsky and Hirst, 2006; Tsatsaronis et al., 2010). Resnik’s measure (based on the Information Content, Resnik (1995)) can be considered a hybrid measure, since it combines the hierarchy of the thesaurus used with statistical information on concepts measured in large corpora. The same applies to the measures of Jiang and Conrath (1997) and Lin (1998).
Budanitsky and Hirst (2006) performed an evaluation of five semantic similarity measures (Jiang and Conrath, 1997; Hirst and St-Onge, 1998; Leacock and Chodorow, 1998; Lin, 1998; Resnik, 1995), all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. The information-content-based measure proposed by Jiang and Conrath (Jiang and Conrath, 1997) was found to perform best.
Applications in Biomedical Ontologies
Lord et al. (2003) implemented GOGraph, a tool for calculating the semantic similarity of protein pairs based on Resnik’s information content measure. They investigated the application of semantic similarity measures to ontological annotations of the SWISS-PROT database, as well as how the ontological structure affects the similarity.
Metric                           Description
Rada et al. (1989)               Minimum number of edges separating the concepts
Sussna (1993)                    Frozen past: all terms except the ambiguous one have their senses determined and frozen
Resnik (1995)                    Information content (common information between two concepts)
Lin (1998)                       Universal definition: sim(A, B) = (information needed to state the commonality of A and B) / (information needed to fully describe what A and B are); applied to similarity between ordinal values, feature vectors, word similarity and semantic similarity in a taxonomy

Application                      Description
Richardson and Smeaton (1995)    Resnik + Rada; distance between concepts of words used to compute similarity between a query and a document (Information Retrieval)
Jiang and Conrath (1997)         Rada + Resnik; compared to human similarity judgements
Lord et al. (2003)               Resnik; semantic similarity of protein pairs
Azuaje et al. (2005)             Gene similarity (based on assigned GO terms)
Schlicker et al. (2006)          Lin + Resnik; comparison of sets of GO terms and assessment of gene functional similarity
Camous et al. (2007)             Resnik; semantic similarity in MeSH; extension of the MeSH representation of Medline documents
del Pozo et al. (2008)           Functional distance between GO terms (term co-occurrence in InterPro)

Table 2.3: Semantic similarity measures and some applications.
Azuaje et al. (2005) used a semantic similarity measure to assess gene similarity, with a view to providing a solid basis for the implementation of classification tools and the automated validation of functional associations. Azuaje et al. assessed the similarity between genes based on their GO terms. They used a distance measure and, for each gene in group A, considered only the best semantic match amongst the genes of group B. The method gave an asymmetrical measure expressing the semantic contribution of the genes in A in relation to B.
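The asymmetry of such a best-match scheme can be seen in a small sketch. The pairwise gene distances below are hypothetical, made-up values, and averaging the best matches is an assumption made here for illustration; it is not necessarily the aggregation used by Azuaje et al.

```python
# Hypothetical pairwise gene distances (lower means more similar); made-up values.
GENE_DISTANCE = {
    ("a1", "b1"): 0.2, ("a1", "b2"): 0.9, ("a1", "b3"): 0.8,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.4, ("a2", "b3"): 0.6,
}

def gene_distance(x, y):
    """Look up a symmetric gene-to-gene distance in the toy table."""
    if (x, y) in GENE_DISTANCE:
        return GENE_DISTANCE[(x, y)]
    return GENE_DISTANCE[(y, x)]

def directed_distance(group_a, group_b):
    """Mean, over the genes in A, of the distance to their best match in B."""
    best = [min(gene_distance(a, b) for b in group_b) for a in group_a]
    return sum(best) / len(best)

GROUP_A = ["a1", "a2"]
GROUP_B = ["b1", "b2", "b3"]
```

Here `directed_distance(GROUP_A, GROUP_B)` averages the best matches 0.2 and 0.4, whereas `directed_distance(GROUP_B, GROUP_A)` averages 0.2, 0.4 and 0.6, so the two directions give different values.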
Schlicker et al. (2006) introduced two semantic similarity measures for comparing sets of GO terms and for assessing the functional similarity of gene products. The first measure (simRel) was based on Lin’s (Lin, 1998) and Resnik’s (Resnik, 1995) measures and took into account how close two GO terms are to their lowest common ancestor (LCA) as well as the LCA’s relevance (i.e. how general or specific it is). Based on the simRel score, the second measure, called funSim, compared the annotations of two gene products. The funSim score could compare two sets of GO terms from different ontologies, allowed for partial matches and was independent of sequence similarity. It was therefore suitable for comparing multi-functional gene products.
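One way to combine closeness to the LCA with the LCA’s relevance is to multiply a Lin-style ratio by one minus the probability of the LCA, so that very general ancestors contribute little. The sketch below assumes this weighting; the term names, the tree-shaped hierarchy and the probabilities are hypothetical, and only the single lowest common ancestor is considered.

```python
import math

# Hypothetical toy term hierarchy (child -> parent) with made-up probabilities.
PARENT = {"t1a": "t1", "t1b": "t1", "t1": "root", "t2": "root"}
PROB = {"root": 1.0, "t1": 0.7, "t2": 0.2, "t1a": 0.3, "t1b": 0.3}

def path_to_root(term):
    """The term followed by its successive parents up to the root."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def lowest_common_ancestor(a, b):
    up_b = path_to_root(b)
    return next(t for t in path_to_root(a) if t in up_b)

def lin(a, b):
    """Lin-style ratio: closeness of a and b to their LCA."""
    lca = lowest_common_ancestor(a, b)
    shared = 2 * math.log(1 / PROB[lca])
    return shared / (math.log(1 / PROB[a]) + math.log(1 / PROB[b]))

def sim_rel(a, b):
    """Assumed relevance weighting: scale the ratio by 1 - p(LCA)."""
    lca = lowest_common_ancestor(a, b)
    return lin(a, b) * (1 - PROB[lca])
```

Under this weighting a term compared with itself no longer scores 1 automatically: a frequent, general term is penalized more heavily than a rare, specific one.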
Camous et al. (2007) applied Resnik’s information content measure to evaluate semantic proximity between concepts within the MeSH hierarchy. They proposed a method for extending ontology-based representations of biomedical documents and used the Medical Subject Headings for this representation. The initial MeSH-only representations were extended with MeSH concepts that were semantically close within the MeSH hierarchy. The extension method was evaluated in a document triage task organized by the Genomics track of the 2005 Text REtrieval Conference (TREC) and led to an improvement of 18.3% over a non-extended baseline in terms of normalized utility, the metric defined for the task.
A recent review by Pesquita et al. (2009) describes the semantic similarity measures applied to biomedical ontologies and proposes a classification according to the strategies they employ: node-based vs. edge-based and pairwise vs. groupwise. The authors also survey the existing implementations of semantic similarity measures and describe examples of applications to biomedical research.