LOS GRUPOS INTELECTUALES ESPAÑOLES Y LOS DEBATES SOBRE LA CIENCIA
4. Científicos, grupos intelectuales y los debates sobre la ciencia
4.4. La ciencia “útil”
The systems which participated in the BioCreAtIvE workshop [27] are relevant for this work. In the first subtask 2.1, a system is given a document, an associated protein and a GO code, and is asked to return a segment of text from the document that supports the annotation of the text with the GO code (the evidence text). In the second subtask (2.2), a system is given a document and an associated protein, and is asked to return all GO codes that the pair should be annotated with, along with the associated evidence text for each GO code.
FiGO. In Couto et al. [57] the authors present a method to identify GO terms from unstructured text. This task is highly relevant to the open problem number 1 discussed in this thesis. The systems aim was to identify Gene Ontology terms coocurring with protein names in sentences.
The main idea of the system named FiGO is the use of the evidence content of term names. In a preprocessing step the system creates a statistic of each word used in the ontology, basically the frequency of usage in ontology terms. The term punt binding adds
9www.medstract.org
one to the frequency of the two words ’punt’ and ’binding’. The word evidence content is defined as:
W ordEC(w) = −log( F req(w) M axF req)
where M axF req is defined as the highest observed word frequency in the ontology. The name evidence content is defined as:
N ameEC(n) = X
w∈W ords(n)
W ordEC(w))
A sentence is associated with a concept if one of the concepts synonyms shares enough words with the sentence. A threshold can be set to manipulate the sensitivity of the algorithm. Unfortunately no information about the specific tokenization method, the stopword list and the sentence segementation algorithm is provided. The authors do not mention any morphological methods employed by FiGO.
The authors tested their system in the first BioCreAtIvE workshop. Their system per- formed second best in task 2.1 with a precision of 28.72% and in task 2.2 about 8.91%. In comparision with the other systems this was a good result. The authors argue reasonably that their system would achieve about 30% more precision if the task had not a second part, the identification of protein names, which is a difficult task by itself.
The authors list several problems in their approach which they think decreased the performance:
• FiGO predicted obsolete terms, terms which have be marked for deletion from the ontology. In another context this might even be a desireable behaviour of the system. Obsolete terms might be still relevant for other ontologies.
• FiGO did not filter the GO terms that could not be annotated with Human proteins (e.g. germination). The workshop task restricted the annotation to human proteins. The workshop used manual curation data from the Human GOA database which itself was restricted to human genes and proteins. This seems more a problem of gene name mapping than of Concept Recognition. Hakenberg et al. [111] present a system to map human gene and protein names to their database identifiers. A prefiltering of the sentences for mentions of human gene/protein names would clearly improve FiGOs performance without changing the system.
• FiGO predicted GO terms out of context. A pre-selection of relevant ontology terms would resolve this problem.
• FiGO selected sentences from irrelevant sections (e.g. “Material and Methods”). Again a pre-processing step segmenting the text correctly would improve FiGOs performance without changing the system at all.
• FiGO does not search for term mentions accross sentence bounderies. It is not clear if the overall performance of the system would benefit from segmenting the text differ- ently. A high number of false positive candidates could decrease the precision of the system dramatically without further filtering.
• FiGO does not take into account how often a term was used in the texts. A high frequency might indicate a correct term. On the other hand it might also indicate a rather general term.
• FiGO does not take the word order into account. It is not clear whether the system would benefit from taking the word order into account. A sense disambiguation step to filter out false candidates seems more preferably.
FiGO represents a promissing approach to Concept Recognition. The system achieved 28.72% precision in the BioCreAtIvE task 2.1. The recall numbers were among the highest. The evaluation included the task of identifying protein names which reduced the precision in the combined task. Therefore another evaluation was executed:
GOAnnotator [58] is an application of the FiGO algorithm. It additionally uses a concept similarity measure to discard candidates which are not similar enough to a pre-selected GO term, named uncurated annotations. The pre-selected term was derivded from other automatically derived annotation of annotation databases, e.g. UniProt. The validation of automatically annotated proteins with GO terms is a tedious work. Out of 1953 uncurated proteins for 66 proteins GOAnnotator presented evidence text to human curators. The 66 proteins were seleceted because GOAnnotator reported the highest confidence for its annotation and the identified terms were most similar to the pre-selected terms. Given the hints of the uncurated annotation the system had a very good precision of 93%. The authors argue that syntactical techniques, the modification of the segmentation and the identification of noun phrases could improve the performance.
MeKE. Another GO term extraction method uses a hybrid approach of phrase pattern matching and Bayes sentence classification [50]. It is based on Chiang and Yu [49] and uses a simple tokenization at whitespaces, punctuations and brackets. The tokenizer turns “alpha(1,3)- fucosyltransferase” as well as “alpha-1,3-fucosyltransferase” into “alpha 1 3 fucosyltransferase”. A simple stemming of plural forms in contrast to a full morphological anaylsis is executed, e.g. “protein transport” and “protein transporter” are equal.
The system works in stages. First sentences are split and positions of candidates of protein names and GO terms stored in a relational database. Protein names and GO terms were stored as a dictionary and processed with the same tokenizer and stemming algorithm. GO term variants were added to the dictionary by mining other terminological resources for terms with a minimal edit distance to the original terms, e.g. “inner mitochondrial membrane” (source LocusLink, now EntrezGene) becomes a variant for “mitochondrial inner membrane” (source GO).
Now sentences with co-occurring GO terms and proteins are processed by a POS tagger and phrases are identified using a finite state automaton. Previously identified syntactic pattern, using sentence alignment [49], are then used to filter the sentences, e.g.
{ {NP {{. . P}}} <- Place holder for Protein {VP {{‘plays‘ . .}}}
{NP {{‘role‘ . .}}} {. {{. IN .}}}
{NP {{. . G}}} <- Place holder for GO term }
A sentence matching such a specific pattern is selected immediately. If no pattern matches the sentence is classified using a Na¨ıve Bayes classifier described in Chiang and Yu [49]. If the classifier accepts the sentence as a protein-function describing sentence the sentence is selected as a candidate. Candidates selected by the sentence classifier have a lower priority than the ones selected by the pattern matching. Finally the same steps are executed for the GO term variants. The candidates are ranked using again the sentence classifier which was basically trained to estimate the probability that a sentence describes a gene function.
The system was most precise in the BioCreAtIvE task 2.1 with 80% correct annotations for proteins with GO terms and 34.15% in task 2.2. However the methods recall was very low overall. The authors argue that the use of learned patterns account for the high precision. Patterns in their approach are short sequences of consequtive words, e.g. prefixes such as “Amino acid”, “Role of”, “Show that”.
Proximity words network. Verspoor et al. [272] found that GO terms have frequently short labels and are therefore not informative enough. They suggest a method to compute a network of other words frequently appearing in the same paragraph of a GO term. Those words were taken as additional hints for the mention of the GO term itself. The system score 25,73% precision in the BioCreAtIvE task 2.1 and 5.52% in task 2.2.
Sliding windows. Some systems use a technique named sliding window to restrict the range of the text under consideration for each extraction candidate. Ehrler et al. [75] used a sliding window of 5 words across each sentence and a vector space model to identify GO term. Official results reported 25.57% precision for the task 2.1 and 12.3% in task 2.2. Krallinger et al. [147] use sentences as their basic unit of sliding windows and achieve 28.86% precision on the task 2.1.
Label-based SVM. Rice et al. [227] trained a Support Vector Machine on 1858 term- protein-document triplets for 638 distinct GO terms. (2.82 per Term) As document features they used labels identified by the C-value method which is also used for terminology gen- eration from large document corpora [87]. The idea behind this is that labels generated with this method represent a corpus well and linguistic variants can be identified during this generation process. Grouping those lexical variants of the same label creates normalized features of documents which can be learn by a machine learning method. The precision of the system with 12.53% precision was low and also the recall was lower than other system. A reason for this result is the small amount of training data and the fact that 43% of the GO terms were not at all represented in the training data. In task 2.2 the system had a precision of 3.48%.
Statistical learning. Ray and Craven [219] followed a statistical approach. They exten- sively used external databases to train their Na¨ıve Bayes classifier on bags of words. Namely they used the SGD, FlyBase, WormBase and TAIR databases which are all contributers to the GOA database [41]. In contrast to Couto et al. [57] they estimated the informative nature of a term word not on the ontology terminology but on the training data. For each GO term they counted unigrams, bigrams and trigrams of words in the abstracts provided by GOA annotations. They run a χ2 test and ranked the strings according to their indi-
cation of the term. They found for example that the word ’symporter’ is not significantly indicative for the presence of the term sodium symporter activity. The current version of GO contains 94 concept names with the word ’symporter’ and 47 concepts with the word ’sodium’. Therefore the approach of Couto et al. [57], FiGO, would rate the word ’sodium’ more informative than the word ’symporter’ as well but in contrast to this method based on the ontology terminology.
The authors used the ontology relations to collect more training data for general terms. GOA annotations associated with a terms descendant concepts were also used for training the ancestor terms. The accumulated training examples were used to train Na¨ıve Bayes classifiers for each GO term. In the BioCreAtIvE task 2.1 the system had 20.10% precision and 21.31% precision in task 2.2.
BioCreAtIvE Evaluation. The evaluation carried out by Blaschke et al. [27] revealed that the systems made mostly non-overlapping predictions. Except the system of Chiang and Yu [50] which made mostly predictions overlapping with predictions of other systems. This suggests that all approaches found a partial solution to the problem. The MeKE system seems to be most general to the problem solution. Although it generated a low number of prediction. The recall performances of the systems are not directly comparable as the manual curation was carried out after the submissions. The curators only judged the correctness of the submissions but did not curate the full corpus. So the maximum number of correct predictions is unknown for the corpus.
Interestingly the GOA database curators use their own large scale electronic GO anno- tation tool which is not textmining based. They report a precision of 91 to 100%. They conclude that current textmining methods can not support their work [43].
Latent Semantic Analysis. A recently published article [270] describes a system which uses latent semantic analysis (LSA) for browsing PubMed articles using the Gene Ontology. The system transforms documents and queries into vectors of stemmed words. The high dimensional vector is then projected into a lower dimension via singular value decomposition. The reduced vectors are then ranked using a the cosine similarity measure. To assess the performance of their system the authors select three GO terms and manually compare 16 articles for their association with GO terms assigned by Doms [67] and their system. They conclude that latent semantic analysis can associate documents with terms not mentioned in the text.
LSA & Patterns. A similar approach was combined with the idea of sliding windows as mentioned earlier. In Ruch [234] the author describes a system combining a module aiming at high recall, a vector space model, based on word stems and cosine similarity, and a module aiming at high precision, a pattern based approach similar to the one described in Ehrler et al. [75]. The result of the first module is a ranked list of candidates. Candidates also identified by the second module get a boost and are ranked higher depending on their length. The author finds that GO term are more difficult to identify than MeSH term. Synonyms are more helpful for GO than for MeSH. The vector space module is complementary to the pattern module.
MetaMap. MetaMap [12] is a program that maps biomedical text to concepts in the UMLS Metathesaurus. The program relies heavily on dictionaries. Namely the SPECIAL- IST lexicon and the UMLS Metathesaurus. The data files provided for the program are man- ually maintained and contain several hundred thousand entries. The SPECIALIST lexicon lists acronyms, abbreviations, synonyms, derivational variants and meaningful combinations of these. The Metathesaurus contains more than 1.3 million strings and the SPECIALIST lexicon provides 360,688 entries with 625,041 inflected forms.
The basic idea of MetaMap is to identify noun phrases in the text and marking the head noun. Then for each word in the phrases known variants are looked up in the SPECIALIST lexicon. Metathesaurus labels containing those variants become candidates. The candidates are ranked by mapping the labels to the noun phrase from the text. Four measures are combined: centrality (involvement of the head noun), variation (an average of inverse variant distance from its original), coverage and number of pieces. The best of the candidates are organized into a final mapping in such a way as to best cover the text.
The output is used by human curators to annotate MEDLINE abstracts. New articles in PubMed are manually annotated with a delay of one or two month. MetaMap is written
in Prolog and C. A Java implementation, named MMTx, is available for UMLS licensees. The following section discusses means of evaluating Concept Recognition algorithms.