In biomedical text mining, researchers use lexical, syntactic, and semantic techniques to extract desired information from text (Jensen et al., 2006). Related research fields are natural language processing and computational linguistics, as well as information retrieval, including machine learning and word sense disambiguation. Natural Language Processing (NLP) is an area of computer science which deals with natural languages by using techniques provided by computational linguistics, such as statistical or rule-based modelling of natural languages.
Before text is analysed or interpreted, a number of standard processing steps are usually performed. The input character stream of a text is tokenised, meaning that it is split into tokens. The tokens are categorised into pre-defined classes standing for types of symbols, such as number, punctuation, comma, opening or closing bracket, word, etc. These classes help to specify algorithms to process the text. Sentence splitting assembles the tokenised text into sentences. A difficulty in sentence splitting is to decide whether a possible punctuation mark is a true delimiter or part of some textual unit within a sentence, such as an organism name (“C. elegans”) or a person name (“Mr. Smith”). For ontology learning, obtaining the structural units of the text is of greater interest: sentence splitting, the identification of tokens and noun phrases, as well as awareness of term variations and normalisation using stemming and dictionaries.
Term variations occur at several levels:
• morphological: inflection, e.g. singular vs. plural;
• orthographic: hyphens, slashes, upper case, lower case, etc.;
• lexical: lexical synonyms, e.g. “cancer” vs. “carcinoma”;
• structural: use of prepositions, e.g. “clones of human” vs. “human clones”;
• acronyms and abbreviations.
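The tokenisation and sentence-splitting steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the token class names, the regular expressions, and the small abbreviation list are assumptions chosen for the example, and the rule that a single capital letter followed by a full stop is an abbreviation (as in “C. elegans”) is only a heuristic.

```python
import re

# Token classes, tried in this order. Names and patterns are illustrative.
TOKEN_CLASSES = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z]+(?:-[A-Za-z]+)*"),
    ("OPEN_BRACKET", r"[(\[{]"),
    ("CLOSE_BRACKET", r"[)\]}]"),
    ("COMMA", r","),
    ("PUNCTUATION", r"[.!?;:]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_CLASSES))

# Hypothetical abbreviation dictionary; a real system would use a larger one.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "Prof"}


def tokenize(text):
    """Return (class, token) pairs; whitespace and unknown symbols are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def split_sentences(tokens):
    """Group tokens into sentences, ignoring full stops after abbreviations."""
    sentences, current = [], []
    for i, (cls, tok) in enumerate(tokens):
        current.append(tok)
        if cls == "PUNCTUATION" and tok in ".!?":
            prev = tokens[i - 1][1] if i > 0 else ""
            # Heuristic: "Mr." or a single capital letter such as "C." is
            # treated as part of a name, not as a sentence delimiter.
            is_abbreviation = tok == "." and (
                prev in ABBREVIATIONS or (len(prev) == 1 and prev.isupper())
            )
            if not is_abbreviation:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


text = "Mr. Smith studied C. elegans. The worms (n = 20) were fed."
for sentence in split_sentences(tokenize(text)):
    print(" ".join(sentence))
```

The single-letter heuristic misfires on sentences that genuinely end in a capital letter (for example “vitamin C.”), which is one reason why practical sentence splitters combine dictionaries with statistical models.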
Stemming can resolve morphological variations by reducing each term to a normalised base form (Porter, 1997). Stemming is fast and simple, but introduces additional ambiguity. Words often appear in different forms, such as “binding” and “binds”. These refer to the same concept, which can be captured by reducing both words to their stem (“bind”). However, the analogous reduction of “dimerisation” to “dimer” is more questionable: the former denotes a process, the latter its result. Similarly, transforming “organisation” into “organ” is invalid. Likewise, “sensitive” and “sensitisation” are both stemmed to “sensiti”, which suggests equality although one is a property while the other describes a whole process.
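The trade-off between normalisation and over-conflation can be reproduced with a deliberately naive suffix stripper. The sketch below is not the Porter algorithm; the suffix list and the minimum stem length are assumptions made only to illustrate the effect on the words discussed above.

```python
# Suffixes are tried in this order, so longer suffixes come first.
# The list is illustrative and far smaller than the rule set of a real stemmer.
SUFFIXES = ("isation", "ization", "ing", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for w in ("binding", "binds", "dimerisation", "organisation"):
    print(w, "->", naive_stem(w))
```

Here “binding” and “binds” are both reduced to “bind”, which is the desired behaviour, whereas “organisation” collapses to “organ”, which merges two unrelated concepts.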
Part-of-speech tagging
Part-of-speech (POS) denotes the grammatical classification, or syntactic category, to which a word can be assigned in the context of a phrase, sentence or paragraph. These categories can be manifold and can be mapped to classes such as noun, adjective, adverb, or verb. POS tagging is the next step in making use of linguistic knowledge to interpret the tokens obtained from text. The concrete categories depend on the tag set of the annotated corpus. As an example, the tags used in the Penn Treebank corpus (Marcus et al., 1993) are listed in Table 2.3. An example sentence is tagged in Example 2.1.
Tag   Description                           Tag  Description
CC    Coordinating conjunction              TO   to
CD    Cardinal number                       UH   Interjection
DT    Determiner                            VB   Verb, base form
EX    Existential there                     VBD  Verb, past tense
FW    Foreign word                          VBG  Verb, gerund/present participle
IN    Preposition/subord. conjunction       VBN  Verb, past participle
JJ    Adjective                             VBP  Verb, non-3rd ps. sing. present
JJR   Adjective, comparative                VBZ  Verb, 3rd ps. sing. present
JJS   Adjective, superlative                WDT  wh-determiner
LS    List item marker                      WP   wh-pronoun
MD    Modal                                 WP$  Possessive wh-pronoun
NN    Noun, singular or mass                WRB  wh-adverb
NNS   Noun, plural                          #    Pound sign
NNP   Proper noun, singular                 $    Dollar sign
NNPS  Proper noun, plural                   .    Sentence-final punctuation
PDT   Predeterminer                         ,    Comma
POS   Possessive ending                     :    Colon, semi-colon
PRP   Personal pronoun                      (    Left bracket character
PP$   Possessive pronoun                    )    Right bracket character
RB    Adverb                                "    Straight double quote
RBR   Adverb, comparative                   ‘    Left open single quote
RBS   Adverb, superlative                   “    Left open double quote
RP    Particle                              ’    Right close single quote
SYM   Symbol (mathematical or scientific)   ”    Right close double quote
Table 2.3. The tag set of the Penn Treebank Part-of-Speech tagged corpus.
Noun phrase chunking
Phrase chunking divides sentences into non-overlapping sequences of tokens. Noun phrase chunking recognises chunks that consist of noun phrases (NP). Other tasks are recognising verbal phrases, pronoun phrases, or participle phrases. For term recognition (Section 2.3.1), the notion of a noun phrase as term candidate is of interest. A noun phrase is a sequence of words that form a unit and can act as subject, complement, or object in a sentence. A recent overview by Wermter et al. (2005) evaluated the performance of state-of-the-art machine-learning-based noun phrase chunkers on biomedical text. The chunkers were trained on the Penn Treebank newspaper corpus and tested on the biomedical GENIA corpus. The results on GENIA were 3-6% lower, depending on the system. In Example 2.1, the noun phrases extracted after POS tagging are shown in square brackets.
Example 2.1 (Part-of-Speech tagging with the Stanford POS-tagger). A sentence from a PubMed abstract (PMID 19442486) was Part-of-Speech tagged using the Stanford POS-tagger (Toutanova and Manning, 2000), and the noun phrases were extracted with the noun phrase chunker by Ramshaw and Marcus (1995).
Sentence:
The mouse embryonic stem cell test (EST) was designed to predict embryotoxicity based on the inhibition of the differentiation of embryonic stem cells (ESC) into beating cardiomyocytes in combination with cytotoxicity data in monolayer ESC cultures and 3T3 cells.
POS-tagged sentence:
The/DT mouse/NN embryonic/JJ stem/NN cell/NN test/NN -LRB-/-LRB- EST/NNP -RRB-/-RRB- was/VBD designed/VBN to/TO predict/VB embryotoxicity/RB based/VBN on/IN the/DT inhibition/NN of/IN the/DT differentiation/NN of/IN embryonic/JJ stem/NN cells/NNS -LRB-/-LRB- ESC/NNP -RRB-/-RRB- into/IN beating/VBG cardiomyocytes/NNS in/IN combination/NN with/IN cytotoxicity/JJ data/NNS in/IN monolayer/NN ESC/NN cultures/NNS and/CC 3T3/CD cells/NNS ./.
NP chunking (NPs in square brackets):
[ The/DT mouse/NN embryonic/JJ stem/NN cell/NN test/NN -LRB-/-LRB- EST/NNP -RRB-/-RRB- ] was/VBD designed/VBN to/TO predict/VB embryotoxicity/RB based/VBN on/IN [ the/DT inhibition/NN ] of/IN [ the/DT differentiation/NN ] of/IN [ embryonic/JJ stem/NN cells/NNS -LRB-/-LRB- ESC/NNP -RRB-/-RRB- ] into/IN beating/VBG [ cardiomyocytes/NNS ] in/IN [ combination/NN ] with/IN [ cytotoxicity/JJ data/NNS ] in/IN [ monolayer/NN ESC/NN cultures/NNS ] and/CC [ 3T3/CD cells/NNS ] ./.
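How noun phrases can be read off a POS-tagged sentence may be illustrated with a simple rule-based sketch. This is not the machine-learning chunker of Ramshaw and Marcus (1995) used in Example 2.1: the rule (a maximal run of determiner, adjective, number, and noun tags that contains at least one noun) and the names `NP_TAGS` and `NOUN_TAGS` are assumptions made for illustration. Unlike the chunker in the example, this rule does not attach bracketed abbreviations to the preceding noun phrase.

```python
# Tags allowed inside a noun phrase in this sketch (an assumption, not the
# feature set of a trained chunker).
NP_TAGS = {"DT", "JJ", "JJR", "JJS", "CD", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def chunk_noun_phrases(tagged):
    """Return maximal runs of NP_TAGS tokens that contain at least one noun."""
    chunks, current = [], []
    # A sentinel entry flushes the last run at the end of the sentence.
    for word, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


fragment = (
    "the/DT inhibition/NN of/IN the/DT differentiation/NN of/IN "
    "embryonic/JJ stem/NN cells/NNS into/IN beating/VBG cardiomyocytes/NNS ./."
)
for chunk in chunk_noun_phrases(parse_tagged(fragment)):
    print("[", chunk, "]")
```

On this fragment the rule yields “the inhibition”, “the differentiation”, “embryonic stem cells”, and “cardiomyocytes”.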
Word Sense Disambiguation
Word sense disambiguation (WSD) is a sub-task of semantic tagging and deals with relating the occurrence of a word in a text to a specific meaning, which is distinguishable from other meanings that can potentially be related to that same word (Schuemie et al., 2005). WSD is essentially a classification problem: given an input text and a set of sense tags for the ambiguous words in the text, assign the correct senses to these words. Sense assignment often involves two assumptions: (a) within a discourse, e.g. a document, a word is only used in one sense (Gale et al., 1992), and (b) words have a tendency to exhibit only one sense in a given collocation of neighbouring words (Yarowsky, 1993).
Alexopoulou et al. (2009) analysed and evaluated four approaches to word sense disambiguation. The ‘Closest Sense’ method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The ‘Term Cooc’ method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The ‘MetaData’ approach (Doms, 2009, chapter: Algorithms for Concept Recognition) trains a maximum entropy classifier on metadata, such as journal, author, and date of publication. It does not require any ontology, but requires training data, which the other methods do not.
To evaluate these approaches we defined a manually curated training corpus of 2,600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Across all approaches and conditions, the average success rate is 80%. The ‘MetaData’ approach performed best with 96% when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The ‘Term Cooc’ approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy, but rather a loose is-related-to hierarchy. The ‘Closest Sense’ approach achieves on average an 80% success rate. Alexopoulou et al. concluded that metadata, such as journal and author name, is valuable for disambiguation, but requires high-quality training data. The closest sense method requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term co-occurrence achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.
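The view of WSD as a classification problem can be made concrete with a small sketch. It is not one of the approaches evaluated by Alexopoulou et al.: it is a generic context-word classifier with add-one smoothing and uniform sense priors, and the ambiguous term, the sense labels, and the training contexts are invented for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Count context words per sense from (sense, context_words) pairs."""
    counts = {}
    for sense, words in examples:
        counts.setdefault(sense, Counter()).update(words)
    return counts


def disambiguate(context, counts):
    """Pick the sense whose training contexts best explain the given context."""
    vocabulary = {w for c in counts.values() for w in c}

    def score(sense):
        sense_counts = counts[sense]
        total = sum(sense_counts.values())
        # Sum of smoothed log-probabilities of the context words.
        return sum(
            math.log((sense_counts[w] + 1) / (total + len(vocabulary)))
            for w in context
        )

    return max(counts, key=score)


# Hypothetical training contexts for the ambiguous word "cold".
examples = [
    ("temperature", ["water", "ice", "shock", "temperature"]),
    ("disease", ["virus", "infection", "cough", "nasal"]),
]
counts = train(examples)
print(disambiguate(["nasal", "virus"], counts))
```

The sketch relies on assumption (b) above: the neighbouring words of an ambiguous term are taken as evidence for exactly one of its senses.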
Maximum entropy method. The maximum entropy method was introduced by Berger et al. (1996) and is a method for statistical modelling in which minimal assumptions are made about the data. The method allows the assignment of a-priori probabilities to known classes based on incomplete information. As the name suggests, the method aims to maximise the entropy, and the authors describe the method's goal as follows: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible. In information theory, entropy (or self-information) measures the amount of information associated with a random variable (Manning and Schütze, 1999).
\[ H(X) = -\sum_{x} p(x) \log_2 p(x), \]
with p(x) the probability mass function of a random variable X over a discrete set of symbols.
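The entropy formula can be evaluated directly for small distributions. The following sketch is only an illustration; the function name `entropy` and the example distributions are chosen here and do not come from the cited works. It also shows why “as uniform as possible” corresponds to maximal entropy: for a fixed number of outcomes, the uniform distribution yields the largest value.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))    # fair coin
print(entropy([0.9, 0.1]))    # biased coin, less uncertainty
print(entropy([0.25] * 4))    # uniform over four symbols
```

A fair coin carries 1 bit and a uniform choice among four symbols carries 2 bits, while the biased coin carries less than 1 bit.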