The detection of abbreviations in natural language text is an important task in information retrieval. An abbreviation is the short form of a word or phrase. Abbreviations are widely used in scientific literature. A distinction can be made between local and global abbreviations. Local abbreviations occur in documents together with their long forms, while global abbreviations occur without their long forms being explicitly stated. As a result, global abbreviations are often ambiguous, meaning that they correspond to different word senses in different documents. Some examples of how local abbreviations occur are listed in Table 2.9. The task of abbreviation detection is usually solved in two steps. (1) A dictionary mapping abbreviations to long forms is created, where an abbreviation can be assigned to multiple long forms and vice versa. To choose the correct long form,
| Rule | Example |
|---|---|
| The first letter of an abbreviation matches the first letter of the meaningful word of the full form. | The Unified Medical Language System (UMLS) |
| The abbreviation matches the first letter of each word in the full form. | tumour necrosis factor (TNF) |
| A word in the full form can be skipped if the abbreviation letter matches the first letter of the following word. | extracellular signal-regulated protein kinase 1 (ERK1) |
| The abbreviation letter matches consecutive letters of a word in the full form. | insulin receptor (InR) |
| The abbreviation letter matches the last letter of a word in the full form if the letter is an s and if the first letter of the word matches the abbreviation. | cysteine-rich domains (CRDs) |
| The abbreviation letter matches a middle letter of a word in the full form if the first letter of the word matches the abbreviation. | immunoglobulin G1 (IgG1) |

Table 2.9. Pattern-matching rules for mapping an abbreviation to its full form (Yu et al., 2002)
(2) each occurrence of an abbreviation in the text has to be disambiguated, meaning that the correct sense (abbreviation/long form pair) has to be selected.
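The dictionary of step (1) can be illustrated with a minimal sketch. It assumes that long forms directly precede the abbreviation in parentheses and applies only the second rule of Table 2.9 (each abbreviation letter matches the first letter of a word). The names `PAREN`, `first_letter_long_form` and `build_dictionary` are hypothetical and are not taken from any of the cited systems.

```python
import re
from collections import defaultdict

# A parenthesised token that looks like a short form, e.g. "(TNF)".
# The length limits are an assumption made for this sketch.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9]{1,9})\)")


def first_letter_long_form(words, abbreviation):
    """Return the trailing words whose initials spell the abbreviation, else None."""
    n = len(abbreviation)
    if n > len(words):
        return None
    candidate = words[-n:]
    initials = "".join(word[0] for word in candidate)
    if initials.lower() == abbreviation.lower():
        return " ".join(candidate)
    return None


def build_dictionary(sentences):
    """Map each abbreviation to the set of long forms found directly before it."""
    dictionary = defaultdict(set)
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            abbreviation = match.group(1)
            # Only the text in front of the parenthesis is searched.
            words = re.findall(r"[A-Za-z0-9]+", sentence[:match.start()])
            long_form = first_letter_long_form(words, abbreviation)
            if long_form is not None:
                dictionary[abbreviation].add(long_form.lower())
    return dictionary


sentences = [
    "Levels of tumour necrosis factor (TNF) were high.",
    "The androgen receptor (AR) was expressed.",
    "Inhibition of aldose reductase (AR) was observed.",
]
dictionary = build_dictionary(sentences)
```

In this toy input the abbreviation AR is assigned two long forms, which is exactly the ambiguity that step (2) has to resolve. Cases covered by the other rules of Table 2.9, such as CRDs or IgG1, are deliberately not handled here.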
In the literature, various methods for finding abbreviations have been reported that use machine learning (Pakhomov, 2001; Nadeau and Turney, 2005; Gaudan et al., 2005; Yu et al., 2007; Okazaki et al., 2008), heuristic rules and algorithms (Taghva and Gilbreth, 1999; Wren and Garner, 2002; Yu et al., 2002; Schwartz and Hearst, 2003; Liu et al., 2003; Adar, 2004; Ao and Takagi, 2005; Zhou et al., 2006; Yu et al., 2007; Okazaki et al., 2008), or statistics (Hisamitsu and Niwa, 2001; Liu et al., 2003; Zhou et al., 2006). Table 2.10 summarises these methods, including performance estimates (if available) and their major characteristics.
Generally, the machine learning approaches that disambiguate “long form”/“short form” pairs achieve very high accuracy (precision and recall > 0.9); however, they require training data. This training data is usually obtained using rule-based methods or available annotated corpora, which typically yield pairs with high precision but low recall. These high-quality “long form”/“short form” pairs are used to train a machine learning classifier, such as a maximum entropy classifier (Okazaki et al., 2008), a support vector machine (Gaudan et al., 2005; Yu et al., 2007), or a Bayesian classifier (Yu et al., 2007), to find true abbreviations. Considering all approaches in Table 2.10, abbreviation detection can be regarded as scientifically solved for most domains. Disambiguation using terms from controlled vocabularies (Adar, 2004), context words (Gaudan et al., 2005), or high-quality abbreviation data sets improves the results significantly.
Simpler approaches, such as that of Adar (2004), extract only acronyms from abstracts of the biomedical literature. The system achieved a high precision of 0.95 and a recall of 0.75 on the detection of long form/abbreviation pairs. The long forms were detected by
| Method | Characteristics (rules and algorithms, machine learning, statistics) | Precision | Recall | Comment |
|---|---|---|---|---|
| Taghva and Gilbreth (1999) | rules and algorithms | 0.98 | 0.86–0.93 | method based on an inexact pattern matching algorithm applied to text surrounding the possible acronym |
| Pustejovsky et al. (2001) | ✓ | 0.98 | 0.72 | evaluated on Medstract gold standard |
| Yu et al. (2002) | rules and algorithms | 0.95 | 0.70 | rule-based extraction of abbreviations in parentheses |
| Chang et al. (2002) | ✓ | 0.80 | 0.83 | evaluated on Medstract gold standard |
| Pakhomov (2001) | machine learning | 0.98 | | acronym detection on 10,000 rheumatology notes |
| Schwartz and Hearst (2003) | rules and algorithms | 0.96 / 0.76 / 0.81 | 0.82 / 0.64 / 0.82 | evaluated on Medstract corpus: gold standard / evaluation corpus / development corpus |
| Liu et al. (2003) | rules and algorithms, statistics | 0.9 | 0.89 | extraction of collocations before parentheses |
| Adar (2004) | rules and algorithms | 0.95 | 0.85 | evaluated on Medstract corpus |
| Ao and Takagi (2005) | rules and algorithms | 0.75 / 0.87 | 0.63 / 0.85 | evaluated on Medstract corpus: evaluation corpus / development corpus |
| Nadeau and Turney (2005) | machine learning | 0.89 | 0.88 | replication of the algorithm by Schwartz and Hearst (2003) using supervised learning |
| Gaudan et al. (2005) | machine learning | 0.99 | 0.98 | uses C-Value method by Frantzi et al. (1998) for disambiguation |
| Chang and Schütze (2006) | ✓ | 0.80 | 0.83 | evaluated on Medstract corpus |
| Okazaki and Ananiadou (2006) | ✓ | 0.99 | 0.82–0.95 | exploits overlapping definitions of acronyms from several authors; evaluated against own corpus |
| Zhou et al. (2006) | rules and algorithms, statistics | 0.97 | | one third novel and 19% novel non-acronym abbreviations not contained in other databases |
| Yu et al. (2007) | rules and algorithms, machine learning | up to 0.92 | up to 0.91 | rule-based dictionary construction followed by disambiguation with machine learning |
| Okazaki et al. (2008) | rules and algorithms, machine learning | 0.89–0.98 | 0.87–0.98 | high F-measure of 0.91–0.97 depending on the corpus tested; disambiguation is not addressed |

Table 2.10. Overview of abbreviation detection approaches regarding their characteristics and quality. Typically, abbreviations can be reliably found using statistics, machine learning, or rules (patterns). Precision and recall above 0.80, and often above 0.90, have been achieved for various benchmarks.
searching for the longest common subsequence in conjunction with a set of scoring rules (see Taghva and Gilbreth, 1999) that favour the first letter of each word of the long form. The algorithm recognises the cases where the long form precedes the abbreviation in brackets. Morphologically similar long forms are merged if the n-grams they contain are similar. Instead of training a machine learning classifier, common MeSH annotations of the associated abstracts are used to merge long forms sharing the same context.
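The longest common subsequence search with a preference for word-initial letters can be sketched as follows. This is only an illustration of the idea. The scoring function, in particular the weight `initial_bonus` and the tie-breaking towards shorter candidates, is an assumption and does not reproduce the scoring rules of Adar (2004) or Taghva and Gilbreth (1999).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def score_candidate(abbreviation, long_form, initial_bonus=1.0):
    """LCS coverage of the abbreviation plus a bonus for matching word initials.

    The abbreviation is assumed to be non-empty; the weighting is hypothetical.
    """
    short = abbreviation.lower()
    letters = long_form.lower()
    coverage = lcs_length(short, letters) / len(short)
    initials = "".join(word[0] for word in letters.split())
    bonus = initial_bonus * lcs_length(short, initials) / len(short)
    return coverage + bonus


def best_long_form(abbreviation, preceding_words):
    """Choose the best-scoring run of words directly preceding the abbreviation.

    Candidates grow leftwards one word at a time; on ties the shorter one wins.
    """
    best, best_score = None, float("-inf")
    for k in range(1, len(preceding_words) + 1):
        candidate = " ".join(preceding_words[-k:])
        score = score_candidate(abbreviation, candidate)
        if score > best_score:  # strict comparison keeps the shorter candidate
            best, best_score = candidate, score
    return best


words = "Levels of tumour necrosis factor".split()
long_form = best_long_form("TNF", words)
```

For the example, the candidate "tumour necrosis factor" covers all three letters of TNF both in the full text and in the word initials, and longer candidates cannot score higher, so the shortest full match is selected.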
Extending the approach of Adar (2004), Gaudan et al. (2005) developed a better disambiguation methodology. In contrast to Adar, the similarity of long forms is no longer defined by common MeSH annotations of the abstracts that contain the long forms, because MeSH annotations are only available for MEDLINE abstracts and the approach therefore cannot be applied to arbitrary text. Instead, the similarity is defined by common words contained in the long forms. Acronyms for which no long form could be found are disambiguated using a context model derived from Frantzi et al. (2000). A support vector machine is trained for each sense of an acronym by incorporating all abstracts containing the corresponding long forms. Before training, the long forms are removed from the abstracts. The authors report disambiguating acronyms with a precision of 0.99, a recall of 0.98, and an accuracy of 0.99.
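Grouping long forms by the words they share can be sketched as below. The text does not state which similarity measure Gaudan et al. (2005) use, so the Jaccard overlap, the threshold of 0.5, and the greedy grouping are assumptions made for illustration only. The function names are hypothetical.

```python
def word_overlap(long_form_a, long_form_b):
    """Jaccard overlap of the word sets of two long forms (an assumed measure)."""
    words_a = set(long_form_a.lower().split())
    words_b = set(long_form_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def group_long_forms(long_forms, threshold=0.5):
    """Greedily put each long form into the first group with a similar member."""
    groups = []
    for long_form in long_forms:
        for group in groups:
            if any(word_overlap(long_form, member) >= threshold for member in group):
                group.append(long_form)
                break
        else:
            # No sufficiently similar group exists, so a new sense is opened.
            groups.append([long_form])
    return groups


senses = group_long_forms([
    "androgen receptor",
    "human androgen receptor",
    "aldose reductase",
    "the androgen receptor",
])
```

Each resulting group stands for one sense of the abbreviation. In a setting like the one described above, each such sense would then receive its own classifier.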
With a similar approach, the method by Okazaki and Ananiadou (2006) achieved a precision of 0.99 and a recall of 0.82–0.95 on a self-defined evaluation corpus, and thereby supports the results of Gaudan et al. (2005).
The system ADAM by Zhou et al. (2006) also finds non-acronym abbreviations, a problem that previous systems did not address. Abbreviations are found in a five-step procedure: (1) extracting candidate abbreviations (only single-word abbreviations) and the surrounding text, (2) identifying long forms using statistical information, (3) filtering short form/long form pairs according to a length ratio (≥ 2.5), (4) verifying that short forms are used in text separately from their long forms, and (5) grouping together morphologically similar long forms. ADAM reaches a precision of 0.97. One third of its abbreviations are novel, that is, not found by other methods, and 19% of the abbreviations in ADAM are non-acronym abbreviations.
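Step (3) of this procedure can be sketched as a simple filter. The text only gives the threshold of 2.5, so it is an assumption here that the ratio is the number of characters of the long form divided by the number of characters of the short form. The function names are hypothetical.

```python
def length_ratio(short_form, long_form):
    """Characters in the long form per character in the short form (assumed definition)."""
    return len(long_form) / len(short_form)


def filter_pairs(pairs, min_ratio=2.5):
    """Keep only short form/long form pairs whose length ratio reaches the threshold."""
    return [
        (short_form, long_form)
        for short_form, long_form in pairs
        if short_form and length_ratio(short_form, long_form) >= min_ratio
    ]


candidates = [
    ("TNF", "tumour necrosis factor"),
    ("InR", "insulin receptor"),
    ("Figs", "Figures"),
]
kept = filter_pairs(candidates)
```

Under this definition the pair ("Figs", "Figures") is discarded because the supposed long form is hardly longer than the short form.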
Yu et al. (2007) disambiguate abbreviations like others, but resolve syntactic variations before training. This is said to be especially important when classifying abbreviations in full-text articles. Two machine learning approaches were tested. The naïve Bayes classifier obtained a precision/recall of 0.86/0.79 on the Journal of Biological Chemistry (JBC) and 0.90/0.84 on the Journal of Clinical Investigation (JCI), whereas the support vector machine (SVM) reached 0.89/0.91 on the JBC and 0.92/0.88 on the JCI.
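To compare the two classifiers with a single number, the reported precision/recall pairs can be combined into the balanced F-measure, F1 = 2PR / (P + R). The following sketch only recomputes this value from the figures quoted above. The resulting F1 values are derived here and are not taken from Yu et al. (2007).

```python
def f1_score(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the text for Yu et al. (2007).
reported = {
    ("Naive Bayes", "JBC"): (0.86, 0.79),
    ("Naive Bayes", "JCI"): (0.90, 0.84),
    ("SVM", "JBC"): (0.89, 0.91),
    ("SVM", "JCI"): (0.92, 0.88),
}

f1_by_setting = {
    setting: f1_score(precision, recall)
    for setting, (precision, recall) in reported.items()
}

for (classifier, journal), value in sorted(f1_by_setting.items()):
    print(f"{classifier:12s} {journal}: F1 = {value:.3f}")
```

On both journals the derived F1 of the SVM is higher than that of the naïve Bayes classifier, which is consistent with the precision and recall figures taken individually.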
Motivated by the limitations of manually created heuristic rules for extracting the correct long forms from text, Okazaki et al. (2008) propose a learning approach for the alignment of abbreviations and their long forms.