1. CAPÍTULO I: PLANTEAMIENTO DEL PROBLEMA
2.1. Marco histórico 10
2.2.3. Artículos relacionados con el tema
In the beginning Section 2.1 (Introduction) provided a wider introduction and mo- tivated the importance of ontology learning methods to facilitate the development of biomedical ontologies, to assist biocuration, and to endorse ontology-based lit- erature search. The preliminaries from information retrieval, linguistics, and statis- tics used in this chapter have been introduced in Section 2.2 (Preliminaries). An overview was provided on the related literature on term recognition (Section 2.3.1) and the associated extraction of synonyms and abbreviations (Sections 2.3.2 and 2.3.3). The literature on the generation of textual definitions (Section 2.3.4) includ- ing definitional question answering has been reviewed prior the methods on finding taxonomic relations (Section 2.3.5). The next pages will provide a summary and the critical assessment of the related work from the area of ontology learning which has been presented in this chapter. Specifically, the focus will be set upon the applica- bility and availability of the published work for the applications in the life sciences motivated earlier. The relation to own work is drawn by identifying open issues re- quired to be addressed for the intended use in ontology engineering, biocuration and semantic search.
Term recognition
In literature the quality of term recognition is describe with precision values be- Overview on
term recognition methods
see Table 2.7
tween 0.70 and up to 0.98 which truly depends on the experimental setting and the judgement on correctness of predicted terminology. The recall measured was usually above 0.70. The precision depends on how terms are selected from a big- ger set of term candidates, often formulated as a ranking problem. High precision corresponds to a good ranking which retrieves domain-relevant terms first. It can be assumed, that for state-of-the-art term recognition a precision of 0.75 and more can be reached within the top region of predicted terms. Hence term recognition is already useful for the fast acquisition of domain vocabulary. For the special case of gene or protein name recognition an F-measure of 0.86 was reached in the BioCre- ative II challenge. For all evaluations where the terms to extract are contained in the texts, overall recall is only depended on the quality of the detection method for noun phrases. Therefore high recall values can be easily achieved. In open systems without a fixed corpus and therefore no guarantee that requested terms are avail- able, it cannot be assumed that all terms can be found. Here, recall measurements are not representative for a method and are difficult to compare between methods.
Concepts, synonyms, abbreviations and lexical variants Concept discovery is not well defined in theory and no tool support is available. Nevertheless it is possible to form concepts by grouping syntactic variations of a term (Section 2.3.6) which then can be enriched with synonyms (Section 2.3.2), abbreviations (Section 2.3.3), and definitions (Section 2.3.4).
In literature synonym detection has been reported to achieve high accuracy with Overview on
synonym discovery
see Table 2.8
0.98 for finding TOEFL synonyms (Turney, 2003) and 0.90 precision at a recall of 0.40 evaluated against the UMLS (Mccrae and Collier, 2008). Given such results, the methods can be regarded as already good enough to be used. On the other side,
the computational expenses of several hours reported by (Shimizu et al., 2008) to use previously learned distances for a word distance based experiment make such approaches not applicable for the interactive acquisition of synonyms. Nonetheless, different senses contained in dictionaries or ontologies can be selected as synonyms after disambiguation.
Unlike general synonyms, abbreviations (Wren et al., 2005), e.g. the technique
Overview on abbreviation detection methods
see Table 2.10
“RNAi” standing for “RNA interference” or more precisely for “Ribonucleic acid in- terference”, can be accurately identified: Gaudan et al. (2005); Yu et al. (2007); Zhou et al. (2006); Okazaki and Ananiadou (2006) report all precision and recall above 0.9. Detection and disambiguation of abbreviations from MEDLINE abstracts can be pre- formed with high quality, and most importantly, there are look-up services available which can be integrated in applications. We conclude that automatic methods can play an important role in finding abbreviations and will most probably be included soon in appropriate ontology engineering tools.
To achieve a meaningful ranking, lexical variations, synonyms, and abbrevia- tions are important. As discussed by Nenadi´c et al. (2004a), the introduction of in- flection variants, acronyms can improved precision significantly. Especially frequent terms typically abbreviated benefit from acronym variation detection. This group- ing should also be exploited for the ranking of single-word terms. As shown later, local document frequencies as well as global corpus frequencies significantly change when grouping different variants of terms, compare Section 3.1 (DOG4DAG Term Generation Method). With respect to synonymy, abbreviations and lexical variants the result of the literature study is, that the robust discovery of synonymy is very subjective as synonymy can be defined in various ways depending on the intended use. Promising methods exist, require big amounts of data, and are usually com- putationally very expensive. Therefore synonym extraction is not applicable for the intended use scenarios. The task of finding abbreviations and lexical variants can be regarded as scientifically solved, but few available implementations exist. For sim- ple abbreviation detection the method of Adar (2004) is sufficient. In cases where disambiguation is required the method of Gaudan et al. (2005) could be used.
For the extraction of term candidates it becomes clear from the literature and ex- plicitly formulated by Wermter and Hahn (2006), that linguistic information matters. Part-Of-Speech tagging or parsing should be used for extracting candidate phrase, e.g. noun phases. Wermter and Hahn also highlighted in his analysis, that all meth- ods, no matter if they are based on, statistical test, ATR, or collocation extraction, promote true negative terms instead of keeping them in the lower segments of the ranked list. This suggest that improved methods should build on better comparative measures using stable corpus statistics to normalize local observations.
As done for evaluation by Lee et al. (2006), existing ontology terms should play an important role in the validation of term candidates extracted from text. Extracting terms with definitional character from text as they exist in the GO (Ogren et al., 2004) (see also illustration in Figure 2.4) has not been addressed in literature.
Applicability of term generation tools The literature overview describes methods and available tools to extract terms from text. For example TerMine (Frantzi et al., 2000), or TermExtractor (Sclano and Velardi, 2007) are the two available tool which
are well defined and lead to the extraction of meaningful multi-word term candi- dates. A drawback of the methods is, that single word terms are being ignored and therefore gene and method names, topic descriptors and substances identifiers con- sisting of only one word are not included in the candidate lists. All methods aiming to extract domain-relevant vocabulary extract terms from text composed of at least two words. This is reasonable, because it is difficult to distinguish between domain- relevant and non-domain-relevant single word terms because corpora usually con- tain less domain-relevant terms than not domain-relevant terms. Also, multi-word terms in English are a priori most likely to be a technical term and hence can be easily extracted.
Based on result from in literature it can be expected that more than 75% of the terms presented to the user by a automatic system are domain relevant terms. It remains open work how to extract and rank single and multi-word terms by relevance to the domain of interest. In this thesis a method for term generation has been specified, implemented and evaluated (Chapter 3). This method recognises local abbreviations and lexical variants found in text. A ranking method has been specified and evaluated which allows to extracts domain relevant single and multi-word phrases in on result set.
Definition extraction
Non of the reviewed systems is capable to generate complete well structured defini- Overview on
definition extraction methods
see Table 2.11
tions fully automatically. Definition extraction as required for ontology development should retrieve a selection of proper definitions for terms. Results by Liu et al. (2003) and Westerhout and Monachesi (2008) show that textual definitions can be found on
web pages with F1=0.71. But, different to Liu et al. (2003), the task now is not rank-
ing web pages, but ranking definitions. The extraction of existing definitions from text can be performed with a reasonable precision (above 0.7) but low recall (below 0.4). The user defined benchmarks used to measure performance are not comparable between systems as they vary strongly in size and the expected quality.
Extracting definitional statements has been done as part of definitional question Overview on
definitional question answering
see Table 2.12
answering. Definitional question answering can be an indicator of the state-of-the- art as the task is to retrieve sentences from a corpus which contain the definitional statement. This does not mean, that necessarily a proper definition of a term need to be obtained. Definitional question answering has been evaluated for domains other than the life science. The evaluation of definitional question answering systems in TREC 2003 lead to comparable results. The results for definitional question answer-
ing in the literature reach F-measure values between F1 = 0.16 and F1 = 0.53. The
best systems in TREC2003 achieved F-measures around 0.30, later systems have been
reported to reach F=0.53(Cui et al., 2004). The extraction of shorter statement lead
to lower performance like Han et al. (2006) with F=0.16. The results suggest that it
is advantageous to use (1) context information for disambiguation, e.g. co-occurring words, (2) external information, e.g. WordNet, web pages, for re-ranking and vali- dation of candidate statements, and (3) handcrafted rules to perform in TREC def- initional question answering. Not unimportant for good results is the size of the answer in the question answering task. Good systems should be able to estimate correctly, which definitions are likely to be correct and of relevance.
Recently, Velardi et al. (2008) evaluated for few terms the extraction of definitions from web pages and presented a F-measure of 0.86. This score was calculated for the whole set of generated definitions (also multiple per term), regardless whether for each term a definition could be found. Hence, multiple, easy to find, good definitions for widely used domain specific terms will influence the performance measurement positively, while terms where it is difficult to find definitions have a lower influence on the result. Finding definitions is especially difficult for new domain specific terms – which are of special interest when creating ontologies – and for general terms which are likely ambiguous and often used. Recall should be calculated weighing all terms equally.
From literature it remains open how well domain specific textual definitions can be ex- tracted in a life science domain. The goal is to find proper definitions of the form A is B with property C. To establish a method, the quality of the method has been evaluated. In this thesis a method for definition extraction from web search results has been developed (Chapter 4). The method has been manually evaluated on the common benchmark from question answer- ing from TREC2003. More importantly the method has been evaluated large scale in the life sciences by manually validating generated definitions for 1,000 randomly chosen terms from MeSH and GO.
Taxonomic generation
There are lexico-syntactic and statistical methods to extract taxonomic relationships
Overview on taxonomy generation methods
see Table 2.14
such as is-a and part-of from text. In general, the most challenging problem for such methods is the fact that many relationships are not made explicit in text. An example for a lexico-syntactic method are Hearst-patterns (Hearst, 1992), like X such as Y. With these patterns one can infer e.g. from the text fragments “organelles such as mitochondria”, that mitochondria are organelles. Pattern-based methods show typically high precision around 0.90, but a low recall of 0.10. An example for the statistical methods is reported in Heymann and Garcia-Molina (2006). Here, the decision on the existence of a relation between two concepts depends on the measured co-occurrence of the concepts. The concept that shows more independence of the other will be suggested as parent in that relation. Another approach, which can generate ontologies from the concept usage in documents, is formal concept
analysis (Ganter et al., 2005; Cimiano et al., 2005) reaching an F-measure of 0.39−
0.45. Generally statistical approaches reach a F-measure of below 0.50 (Ryu and Choi, 2006; Snow et al., 2006). Recent work by (Brewster et al., 2009) describes an experiment of creating an initial hierarchy of terms in the animal behaviour domain. Like Hearst they use lexico-syntactic patterns to obtain taxonomic relations. The authors provide an small evaluation for 198 selected out of 13.755 terms between which predict relations with an precision of 0.28. This suggest, that simple string inclusion as used is not suitable to obtain good taxonomic relations.
To judge on the applicability of automatic methods for the determination of hierarchical information it has to be clarified how often hypernym relationships can be found in text. Snow et al. (2004) gives evidence, that hypernym and non-hyponym word pairs were found by a ratio of approximately 1:50. Generalised methods can rely on machine learning as done by Snow et al. (2004) who could reproduce that all patterns originally suggested by Hearst
(1992) where also identified by the proposed machine learning method to find patterns for hypernymy. Nonetheless, with recall far below 0.50, taxonomy generation remains an open problem.
Ontology learning integration in ontology editors
Apart from the TERMINE term recognition plug-in for Protégé, non of the currently used ontology editors comprises automated support for finding term, definitions, or novel taxonomic relationships.
To address the needs of the ontology engineers who develop the steadily growing number of ontologies in Biology and Medicine, the ontology generation methods developed in this thesis have been integrated in the two most used editors OBO-Edit and Protégé (Chapter 7)
Terminology Generation
References
Wächter, T. and Schroeder, M. (2010). Semi-automated ontology generation within OBO-Edit. In ISMB (Supplement to Bioinformatics), Impact factor 2009: 4.3 (accepted for publication)
Alexopoulou, D.∗, Wächter, T.∗, Pickersgill, L., Eyre, C., and Schroeder, M. (2008). Terminologies for text-mining; an experiment in the lipoprotein metabolism domain. BMC Bioinformatics, 9(Suppl 9):S2, Impact factor 2009: 3.7 ∗shared first author
Wächter T., Tan, H., Wobst, A., Lambrix, P., and Schroeder, M. (2006). A corpus-driven approach for the design, evolution, and alignment of ontologies. In Proceedings of Winter Simulation Confer- ence (Computational Systems Biology), Monterey, CA, USA (3rd–6th December 2006), pages 1595–1602, Invited contribution.
A term generation method has been designed, implemented, and evaluated. It generates terms by identifying statistically relevant noun-phrases in text and leads to high quality terms also found in manually created ontologies. It specifically deals with the relevance ranking of single word terms by the use of term normalisation and reference corpora which are big enough to capture the relative frequency of terms in comparison to each other. The method was evaluated manually by domain experts and against an ontology which has been created in collaboration with Unilever Research UK in the domain of lipoprotein metabolism. The method outperforms other systems and is capable to retrieve 75% relevant terms in the top 50 and 55% in the top 200 generated terms. Further, it has been analysed how the ranking of terms depends on the selected Part-Of-Speech tagger for noun phrase extraction, the source for the global corpus statistics as well as on the used scoring method. Especially the use of a reference corpus has a high influence on the ranking of terms, but surprisingly the differences are not significant whether a general corpus based on web sites or a domain specific corpus like PubMed is used to obtain global term frequencies. The method has been encapsulated in a web service which enabled the integration in various applications.
From our experience the manual collection of domain relevant terms is time- consuming and not objective. In need of faster acquisition of domain relevant ter- minology, automatic term recognition methods have been already developed in the 1990s (Section 2.3.1) and are now again of interest to efficiently create vocabularies and ontologies for semantic applications. With respect to Research Question 1 – To what extent can ontology construction be automated? (Section 9.1) – a term recognition method has been developed within this work to automatically generate terminology from natural language texts and to rank these terms by relevance for easier affir- mation by a domain expert. In contrast to other existing methods, the new method explicitly allows the extraction of single word terms, which significantly complicates the relevance ranking. The majority of technical terms i.e. almost 90% of the biomed- ical terms in the GENIA text corpus are compounds. The majority of compounds in a corpus will automatically be domain relevant. In English, there are much more sin- gle word nouns than compounds and only a small fraction will be domain relevant technical terms in the domain.
In biology and medicine, single word terms are required to be extracted because e.g. many identifiers, names, acronyms, cellular components, species, or anatomi- cal terms are single words which need to be contained. To clarify to what extend the ranking of terms changes with different configurations of the term generation pipeline and to check the hypotheses associated with research question 1, a number of experiments have been performed. In particular it has been experimentally inves- tigated how the impact of technical parameters such as the specific method to assign part of speech tags (Section 3.5.1), the source of corpus statistics (Section 3.5.2), and the chosen statistical measures (Section 3.5.3) influence the obtained ranking of can- didate terms.
Requirements
With respect to the intended use in biomedical applications a number of require- ments have been collected:
(1)The method should represent the state of the art, and as such perform comparable
well or better compared to other methods.
(2)The method should favour terminology from the biomedical domain.
(3)The method needs to be fast enough to be used in interactive applications, e.g.
ontology editors or on the fly document classification.
(4)The method should be organized modular, to allow to experiment with different
algorithms and components.
Requirement (1) Despite the availability of other term generation (or term recog- nition) methods (Section 2.3.1) a different method was needed to support the process of developing novel biomedical ontologies e.g. to be used in semantic text-mining based applications like GoPubMed. From the few existing systems, TerMine (Frantzi et al., 2000) for instance is available as ready-to-use web service, but ignores all phrases consisting of one word only. The same applies to TermExtractor (Sclano and Velardi, 2007). Both systems build on the assumption that longer terms have on aver- age a higher probability to be technical terms. This assumption certainly holds and while neglecting single words both systems achieve good performance. But single
word terms are very important to be extracted as they often have a distinct meaning in the domain. Many gene, protein, species names, general anatomical terms, and method names are single words and must not be ignored.
To allow appropriate ranking even after including single words the normalisation of term variation is important. As shown in Nenadi´c et al. (2004a), the introduction of inflection variants improved precision by approximately 25%. Acronym variations significantly improved precision by 70% when considering the most frequent terms while recall improved up to 25%. Acronym variation detection lead to improve- ment for frequent terms, which are typically abbreviated. The new term recognition method in-cooperates such information on abbreviations and acronyms, as well on lexical variants found in the analysed text. Additionally local and global corpus