2. INTRODUCTION
2.3. ESSENTIAL CONCEPTS IN CPR
2.3.15. Model for data communication in cardiac arrest
The concept of text normalisation was first proposed by Sproat et al. (2001) as a preprocessing step for text-to-speech conversion. For instance, bdrm and apt in Example (2.33) cannot be read aloud without appropriate treatment, because text-to-speech systems, which are often trained on formal, edited datasets, have no word-level phonetic transcriptions for these non-standard words. However, if the two non-standard words were normalised to “bedroom” and “apartment”, respectively, it would be much easier to pronounce them appropriately in speech synthesis (Schwarm and Ostendorf 2002).
(2.33) In 1988, a four bdrm apt only costs $1M.
In general, text normalisation transforms non-standard words into their contextually appropriate canonical forms, making the data more amenable to downstream processing (Sproat et al. 2001).¹⁷ This definition is vague, but it is also very flexible, allowing for application-dependent normalisation. For instance, a keyword-based event detection system might require normalisation of misspellings (e.g., shakin “shaking”), informal abbreviations (e.g., dis mgt ctrs “disaster management centres”) and phonetic approximations (e.g., earthquick “earthquake”) for accurate keyword counting. Similarly, 2014 is pronounced differently depending on whether it represents a year or a cardinal number. While NER is sensitive to capitalisation, syntactic parsing is directly affected by incorrectly split characters (e.g., l o v e “love”) and concatenated words (e.g., cu “see you”). In addition, restoring missing punctuation and sentence constituents (e.g., subjects) may also help to improve tweet readability for humans. Nonetheless, most work on text normalisation primarily focuses on non-standard words consisting of alphanumeric characters (Cook and Stevenson 2009; Beaufort et al. 2010). These words can be categorised into four types, as shown in Figure 2.1.
¹⁷ Originally, text normalisation was defined to include sentence tokenisation, and the detection, categorisation, and restoration of non-standard words. This thesis focuses on detection and restoration of non-standard words.
[Figure 2.1 (diagram): words divided along two axes, IV versus OOV and standard versus non-standard. Examples: seriously (IV, standard), Obama (OOV, standard), wit for “with” (IV, non-standard), srsly (OOV, non-standard). Normalisation maps non-standard words to standard forms.]
Figure 2.1: Word categorisations in text normalisation.
Non-standard words include both Out-Of-Vocabulary (OOV) non-standard words and In-Vocabulary (IV) non-standard words. Both types differ from their standard forms and can take some effort and context for humans to comprehend. On the one hand, many non-standard words in tweets are OOV words, although they often correspond to standard words, e.g., srsly represents “seriously” in Example (2.34).
(2.34) $50 for shopping is srsly not enough, and i’m not even kidding
On the other hand, some non-standard forms of IV words happen to coincide with other IV words; however, they usually do not fit the context, e.g., wit in Example (2.35).
(2.35) I will come wit you
The detection of non-standard words is challenging for both types. IV non-standard words are computationally expensive to identify, because every token in the text must be examined. Due to this pragmatic issue, OOV non-standard words receive more attention in text normalisation and IV non-standard words are often ignored. OOV words are relatively easy to identify, by checking whether they are present in a given lexicon. However, not all OOV words are non-standard words. Many named entities are not included in lexicons but are nevertheless standard forms, e.g., Obama. The classification of OOV words as standard or non-standard is therefore not trivial. As such, many normalisation systems either assume non-standard words have already been identified (Liu et al. 2011a), or largely treat all OOVs as non-standard words (Sproat et al. 2001).¹⁸
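The lexicon lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited system: the toy `LEXICON`, the tokenising regular expression and the function name `find_oov_tokens` are all assumptions made for the example.

```python
import re

# Toy lexicon (assumption): a real system would load a full dictionary.
LEXICON = {"i", "will", "come", "wit", "you", "is", "not",
           "enough", "for", "shopping"}

def find_oov_tokens(text, lexicon):
    """Return the tokens of `text` that are absent from `lexicon`."""
    # Simple alphanumeric tokenisation (assumption, not a tweet tokeniser).
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Lookup is case-insensitive; the original casing is returned.
    return [t for t in tokens if t.lower() not in lexicon]

# "wit" is an IV word, so the IV non-standard usage goes undetected.
print(find_oov_tokens("I will come wit you", LEXICON))
# Both the non-standard "srsly" and the named entity "Obama" are flagged.
print(find_oov_tokens("shopping is srsly not enough for Obama", LEXICON))
```

The two calls show both limitations discussed above: a lookup cannot detect IV non-standard words such as wit, and it cannot by itself separate non-standard OOV words from standard ones such as Obama.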
When mapping non-standard words to their standard forms, another unresolved issue is what defines a standard form. While talkin is non-standard and its standard form is “talking”, it is arguable whether IBM is a standard word or should be normalised to “International Business Machines”. To make the normalisation task more tractable, a standard form is often drawn from an IV lexicon. This lexicon can be based on a commonly used off-the-shelf dictionary. Alternatively, a corpus-derived lexicon from the target domain can serve the same purpose, e.g., all types whose token frequency in a particular corpus meets some threshold. In both approaches, whether IBM is an IV word or should be normalised is then settled by the lexicon.
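A corpus-derived lexicon of this kind could be built as follows. This is a sketch under stated assumptions: the function name `build_lexicon`, the lowercasing step and the toy corpus are illustrative choices, and the threshold value is arbitrary.

```python
from collections import Counter

def build_lexicon(corpus_tokens, min_count):
    """Keep every word type whose token frequency is at least `min_count`."""
    # Lowercasing is an assumption; a case-sensitive lexicon is equally valid.
    counts = Counter(token.lower() for token in corpus_tokens)
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus (assumption): "ibm" and "me" occur twice, everything else once.
tokens = "ibm hired me and ibm pays me talkin".split()
lexicon = build_lexicon(tokens, min_count=2)
print(sorted(lexicon))
```

With this toy corpus and threshold, ibm is frequent enough to count as an IV word and would be left unnormalised, whereas talkin falls below the threshold and remains a candidate for normalisation.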
Text normalisation has been applied at various granularities, depending on the normalisation scope. One straightforward approach is to manipulate characters within non-standard words to revert them to their canonical forms (i.e., context-insensitive lexical normalisation), e.g., dropping repetitive characters in hooooot “hot” and adding a missing g in takin “taking”. In some cases, this word-centric character manipulation is insufficient to resolve ambiguous non-standard words in the data. For instance, hw represents “how” in Example (2.36) and “homework” in Example (2.37).
¹⁸ Sproat et al. (2001) also included common abbreviations and rule patterns to improve the
(2.36) Hi, hw are you?
(2.37) Let me finish my hw though
To deal with this uncertainty, context information beyond the target non-standard word is considered in normalisation (i.e., context-sensitive lexical normalisation). In Example (2.37), “homework” makes more sense than “how” as the standard form of hw, because “finish my homework” is more likely than “finish my how” in terms of trigram frequency. These normalisations are scoped to individual non-standard words, each of which is normalised independently, as in conventional spell checkers. As a result, these methods are referred to as spell checking-based approaches, as discussed in Section 2.2.3.
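The trigram argument can be made concrete with a small sketch. Everything here is an assumption for illustration: the counts in `TRIGRAM_COUNTS` are invented toy values rather than corpus statistics, and scoring a candidate by summing the counts of the trigrams that cover its position is only one simple way to use context.

```python
# Invented toy counts (assumption); a real system would use corpus statistics.
TRIGRAM_COUNTS = {
    ("finish", "my", "homework"): 120,
    ("hi", "how", "are"): 300,
    ("how", "are", "you"): 500,
}

def normalise_token(tokens, index, candidates, trigram_counts):
    """Pick the candidate for tokens[index] that best fits its context."""
    def score(candidate):
        words = tokens[:index] + [candidate] + tokens[index + 1:]
        total = 0
        # Sum the counts of every trigram window that covers `index`.
        for start in range(index - 2, index + 1):
            if start >= 0 and start + 3 <= len(words):
                total += trigram_counts.get(tuple(words[start:start + 3]), 0)
        return total
    return max(candidates, key=score)

candidates = ["how", "homework"]
print(normalise_token("let me finish my hw though".split(), 4,
                      candidates, TRIGRAM_COUNTS))
print(normalise_token("hi hw are you".split(), 1,
                      candidates, TRIGRAM_COUNTS))
```

Each call normalises a single token and treats the surrounding words as fixed, which is the independence assumption of spell checking-based approaches.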
Nevertheless, normalisations of multiple non-standard words may influence one another when the context words are themselves non-standard. For instance, yr can be interpreted as “you’re” or “year” in Example (2.38). If the second non-standard word srs were normalised to “serious”, then the chance of yr being “you’re” should become higher.
(2.38) Oh wow yr srs
To capture the mutual influence of normalisations for adjacent non-standard words, joint normalisation of the whole sentence is preferable. In this setting, the canonical forms are not selected independently for each non-standard word; instead, the decisions are made by optimising the likelihood of normalisations over the whole sentence. This more flexible and powerful approach is often framed as a sequence labelling task.
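One way to picture joint normalisation is a Viterbi search over a lattice of candidate forms, scored by a bigram model. The sketch below is an illustrative assumption rather than the method of any cited system: the candidate lists, the log-scores in `BIGRAM_SCORES` and the `default` penalty for unseen bigrams are all invented.

```python
# Invented toy bigram log-scores (assumption); higher means more likely.
BIGRAM_SCORES = {
    ("oh", "wow"): -0.5,
    ("wow", "you're"): -2.0,
    ("wow", "year"): -1.5,
    ("you're", "serious"): -1.0,
    ("year", "serious"): -6.0,
    ("you're", "series"): -5.0,
    ("year", "series"): -4.0,
}

def joint_normalise(candidates, bigram_scores, default=-10.0):
    """Viterbi search for the best-scoring path through the candidates."""
    # best[word] = (score of the best path ending in word, that path)
    best = {word: (0.0, [word]) for word in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for current in options:
            score, path = max(
                ((s + bigram_scores.get((prev, current), default), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[current] = (score, path + [current])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Candidate forms for each token of "oh wow yr srs" (assumption).
lattice = [["oh"], ["wow"], ["you're", "year"], ["serious", "series"]]
print(joint_normalise(lattice, BIGRAM_SCORES))
```

With these toy scores, “year” is locally the better continuation of “wow”, but the path through “you're” wins once “serious” is taken into account, which is the mutual influence described above.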
In most cases, a non-standard word is normalised to a single canonical form. However, non-standard words can also result from splitting or concatenating standard words. As a result, non-standard words may be grouped together for many-to-many normalisations, e.g., I l o v e i t “I love it” and cu “see you”. Additionally, text normalisation may involve the insertion of missing words and the deletion of redundant words, such as deleting the repeated I in Example (2.39). These powerful normalisations are often modelled as a monolingual machine translation task that translates noisy text with non-standard words to standard English text in the target domain, as addressed in Section 2.2.3.
(2.39) I I swear all my friends have boyfriends and I’m like ... Oh