IR and NLP for informal text, such as social media, must handle production-error and processing-error noise. IR and NLP for digitised text, such as historical newspapers, must consider all three sources of noise. We discuss the digitisation process and why noise can be so prevalent in OCR text later in this chapter (Section 2.3).
Below we review some of the recent work on applying common NLP tasks, such as named entity recognition and machine translation, to noisy sources of text.
NLP in noisy informal text
Minkov et al. (2005) present two methods for improving the performance of named entity recognition of people in informal text, specifically various email corpora extracted from the CSpace email corpus (Kraut et al., 2004) and the Enron corpus (Klimt and Yang, 2004). Their analysis of the highly weighted features in their conditional random field model (Lafferty et al., 2001) showed that names in informal text have different, less informative types of contextual evidence. For example, formal name titles such as "Mr" and "Ms", as well as job titles and other prenominal modifiers such as "President" and "Judge", were highly weighted in a typical news dataset, while most of the important features in the email dataset were related to email-specific structure and metadata, much of which was corpus-specific. For example, quoted excerpts from other email messages would often be prefixed with "from author name".
Locke and Martin (2009) were the first to evaluate NER performance on Twitter posts. They compared the performance of a named entity classifier on CoNLL 2003 shared task data (Tjong Kim Sang and De Meulder, 2003) and Twitter posts. They achieved an F1 measure of 85.72% on the CoNLL data but only 31.05% on the Twitter dataset, highlighting the great disparity between performance on clean news text and noisy text when using a model trained only on clean text.
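The F1 measure quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of an entity-level computation, assuming exact-match scoring in which a predicted entity counts as correct only if its span and type both match the gold annotation. The function name `entity_f1` and the `(start, end, type)` tuple representation are illustrative choices, not taken from any of the cited papers.

```python
def entity_f1(gold: set, predicted: set) -> float:
    """Exact-match F1 over sets of (start, end, type) entity tuples.

    Illustrative helper: the tuple layout is an assumption made for
    this sketch, not a format prescribed by the cited work.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the case of no correct predictions.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for a single sentence.
gold = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "LOC")}  # second entity has the wrong type
score = entity_f1(gold, predicted)
```

Here one of two predictions is correct (precision 1/2) and one of three gold entities is found (recall 1/3), so a span with the right boundaries but the wrong type earns no credit under this scoring.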
Ritter et al. (2011) analyse various stages of the NLP pipeline on Twitter messages. By manually annotating 800 tweets and retraining a part-of-speech (POS) tagger, they are able to reduce error by 41% compared to the Stanford POS tagger (Toutanova et al., 2003). By training the individual components of the NLP pipeline with domain-specific data, Ritter et al. reduced the processing-error type of noise.
NLP in noisy OCR text
Miller et al. (2000) were the first to analyse the effects of noisy OCR text on named entity recognition performance. In their paper, they evaluate a hidden Markov model (HMM; Baum and Petrie, 1966) for named entity extraction applied to news text with artificially introduced noise. They printed digital documents, scanned the physical copies, and applied OCR to produce four versions of the same text with varying error rates. Miller et al. demonstrated that the F1 measure of the HMM NER model decreases linearly as the word error rate (WER) increases. They also showed that increasing the number of training examples increases the performance of their NER model, regardless of the WER.
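Word error rate is conventionally defined as the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below follows that standard definition. Whether Miller et al. computed it in exactly this way is an assumption, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical OCR output with one misrecognised word and one dropped word.
wer = word_error_rate("the cat sat on the mat", "tbe cat sat on mat")
```

Because insertions are counted, WER can exceed 1.0 when the output is much longer than the reference.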
Packer et al. (2010) extract person names from digitised historical documents with relevance to genealogy and family history research. Their analysis noted that word order errors and lost page formatting played a bigger role in poor extraction performance than character recognition errors.
Chen et al. (2015a) explore both NER and machine translation in digitised handwritten Arabic documents. They note that the poor sentence boundaries in the OCR output from handwritten Arabic posed significant problems for both NER and machine translation. Chen et al. (2015a) developed a conditional random field-based sentence boundary detector to address the issue.
Jean-Caurant et al. (2017) introduce a technique for post-OCR lexicographical-based named-entity correction. They produce a graph of named entities and compute edit-distance similarities and contextual similarities between nodes. They then cluster the graph to identify entities that are the same. They generated a simulated dataset by converting text into images, adding visual noise, then processing the images with the ABBYY FineReader OCR system. By analysing the reduction in word error rate on their post-processed text, they indirectly calculate that their method corrected 37% of the named entities.
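The general idea of grouping OCR variants of the same entity through a similarity graph can be sketched as follows. This is a deliberately simplified illustration, not the method of Jean-Caurant et al.: it uses only normalised character edit distance (no contextual similarity), links any two entities below a hypothetical threshold, and takes connected components as the clusters. The names `cluster_entities` and `max_normalised_distance` are assumptions made for this example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def cluster_entities(entities, max_normalised_distance=0.25):
    """Group entity strings into connected components of a similarity graph.

    An edge joins two entities whose edit distance, divided by the length
    of the longer string, is at most the (hypothetical) threshold.
    """
    parent = {entity: entity for entity in entities}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in combinations(parent, 2):
        if edit_distance(a, b) / max(len(a), len(b)) <= max_normalised_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for entity in parent:
        clusters.setdefault(find(entity), set()).add(entity)
    return list(clusters.values())


# Hypothetical entity mentions, two of them with OCR-style substitutions.
clusters = cluster_entities(["Canberra", "Canbcrra", "Sydney", "Sydncy", "Melbourne"])
```

Connected components are transitive, so a chain of near-duplicates can merge strings that are individually far apart. A stricter clustering step, or the contextual evidence used in the original work, would help guard against that.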
In summary, NLP models trained on clean news text are sensitive to noise, and errors in NLP pipelines compound. To better handle noisy sources of text, NLP models should be trained with domain-specific data. Furthermore, NLP on noisy OCR text struggles more because of the lack of structural information than because of word error rates. Therefore, more emphasis needs to be placed on extracting structural information during digitisation.
2.3 Digitisation
Digitisation is the process of taking a physical document or document representation and transferring it into a faithful digital (i.e. computer-readable) representation. The ideal goal is to preserve all of the information contained within the physical document. This achieves the goal of preservation, as physical documents deteriorate over time. It also achieves the goal of improved access, as digital documents can be quickly retrieved from archives and returned to a user, without them having to be in the same physical location as the document. Digitisation is necessary to allow sophisticated computer processing of documents.
In the digitisation pipeline, we highlight four key steps: acquisition, image processing, layout analysis, and character recognition. Figure 2.3 shows the output of each of these steps. Each of these steps presents unique challenges. Since the output of each step forms the input to the next step, errors early in the process compound and can significantly impact the faithfulness of the final output.
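To give a rough sense of how errors compound across the four steps, the sketch below multiplies per-step accuracies under the simplifying assumption that each step independently preserves a fixed fraction of the content it receives. The accuracy values are invented purely for illustration and are not measurements; real pipeline errors are unlikely to be independent. `math.prod` requires Python 3.8 or later.

```python
from math import prod


def end_to_end_accuracy(step_accuracies: dict) -> float:
    """Fraction of content surviving all steps, assuming independent errors."""
    return prod(step_accuracies.values())


# Hypothetical per-step accuracies (illustrative values only).
steps = {
    "acquisition": 0.99,
    "image processing": 0.98,
    "layout analysis": 0.95,
    "character recognition": 0.97,
}
overall = end_to_end_accuracy(steps)
```

Under these assumed values, every individual step is at least 95% accurate, yet the end-to-end figure falls below 90%.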