Below are common problems present in the Google ocr dataset. Many of the problems are typical for a digitisation dataset, but some present unique challenges for creating an article segmentation dataset.
Missing characters Optical noise, de-noising and inconsistent printing can lead to characters being missed by the ocr engine. While this is very common with smaller punctuation, such as full stops and commas, characters within words can also go missing. For example, in Figure 5.1, the red characters indicate text missing from the ocr output. In this example, the em dash is missing on the second line, and the first three characters of the word Canberra are missing on the third line.
Missing words Often the characters of whole words will be absent from the ocr data. It is not clear why this occurs, but it is either an artefact of the ocr engine or of the post-processing/storage of the data. Figure 5.1 shows an example of this phenomenon, with red text indicating text that was missing from the ocr. Here entire words and short sequences of words are completely missing from the ocr output.
Incorrect characters Characters are often misclassified by the ocr engine. This is more common with punctuation marks and with characters that do not form words. The ocr engine’s language model typically corrects misclassified characters within words, whereas punctuation marks are often small, affected by de-noising and easily confused with similar punctuation marks (e.g. commas and full stops).
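The three error types above can be separated mechanically when a trusted transcription of the same passage is available. The following is a minimal sketch under that assumption; the dataset itself does not provide such a reference, and the helper name diff_against_reference is hypothetical. It uses the standard-library difflib module to align the two strings, reporting reference spans with no counterpart in the ocr output as missing and spans that were replaced as substituted.

```python
from difflib import SequenceMatcher


def diff_against_reference(reference, ocr_text):
    """Return (missing, substituted) spans of `reference` relative to `ocr_text`.

    Hypothetical helper: assumes a trusted reference transcription exists.
    """
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    missing = []        # reference text with no counterpart in the ocr output
    substituted = []    # (reference span, ocr span) pairs that differ
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            missing.append(reference[i1:i2])
        elif tag == "replace":
            substituted.append((reference[i1:i2], ocr_text[j1:j2]))
    return missing, substituted
```

A character-level alignment of this kind cannot tell a missing word from several adjacent missing characters; that distinction would need an additional check against word boundaries.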
(a) image from newspaper page
GARRY Jack has publicly declared war on his Kangaroo understudy of 1986 — Canberra’s Gary Belcher.
Jack has heard of nothing else since his return from England that Belcher is going to press him for the Test fullback spot against Great Britain.
“He’s not getting the spot, nor is anyone
else.” Jack told the public when asked the question at a Leichardt shopping centre on Thursday night.
JOIN us today from noon on 2GB for Sydney’s number one football cover featuring the Balmain v Parramatta match at Parramatta Stadium. It’ll be fun, and accurate, with constant updated scores from
other grounds.
(b) ocr text
Figure 5.1: The red coloured text indicates characters that were not present in the recognised text. Excerpt from the smh 1988-03-13, p. 87.
Font size Characters in the ocr output are typically classified into one of only a few sizes, most commonly three. These size categories do not translate into physical sizes and can only be used to determine the relative difference between characters of different sizes.
Hyphenation It is common for newspaper text to be typeset in a justified alignment. Words are often hyphenated when text is justified to prevent loose lines, that is, lines that have been stretched so far that the spaces between words exceed a visually palatable amount. The ocr output contains many words that have been split due to hyphenation, and often the connecting hyphens are absent from the ocr text for the reasons described above. Hyphenation also results in increased character errors, as the language model in the ocr engine is not as effective at correcting character errors in parts of words as it is in complete words.
Figure 5.2 shows a two-paragraph excerpt where six words have been hyphenated. In this example, the ocr engine has failed to recognise all six hyphens.
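One way to repair such splits is to rejoin adjacent fragments whenever the joined form is a known word. The sketch below is illustrative only and is not the method used to build the dataset. It assumes whitespace-separated tokens and a caller-supplied vocabulary of lowercase words, and the function name rejoin_hyphenated is hypothetical. To stay conservative, it merges a pair only if a hyphen survived or if at least one fragment is not itself a known word.

```python
def rejoin_hyphenated(tokens, vocabulary):
    """Merge adjacent tokens that were probably one word split by hyphenation.

    `vocabulary` is an assumed set of lowercase known words.
    """
    punctuation = ".,;:!?\"'"

    def known(word):
        # Ignore surrounding punctuation and case for the lookup.
        return word.strip(punctuation).lower() in vocabulary

    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            had_hyphen = token.endswith("-") and len(token) > 1
            stem = token[:-1] if had_hyphen else token
            candidate = stem + nxt
            # Only join letter-to-letter, never across punctuation.
            joins_letters = stem[-1:].isalpha() and nxt[:1].isalpha()
            if (joins_letters and known(candidate)
                    and (had_hyphen or not known(stem) or not known(nxt))):
                merged.append(candidate)
                i += 2
                continue
        merged.append(token)
        i += 1
    return merged
```

This handles fragments such as "min utes" and "compla- cency" from Figure 5.2, but not compounds whose hyphen is part of the word itself (e.g. see-sawing), which would need a separate rule.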
Unordered text Words within the same paragraph tend to be grouped together in the
html file. However, paragraphs are often disordered throughout the html file.
An example of this is in Figure 5.3. Colour indicates the position of characters within the html file according to the spectrum at the bottom of the figure. For example, characters coloured in red appear towards the start of the html file, while characters coloured in
5.1. Google ocr (gocr) 99
(a) image from newspaper page
Two tries in three min- utes, the first by Tony Paton and the second by Andrew Ettingshausen made it anyone’s match and knocked the compla- cency out of the Canter- bury players.
It was a see-sawing second half with Canter- bury grabbing an 18-12 lead then Cronulla level- ling at 18-all after a spectacular passing move- ment.
(b) ocr text
Figure 5.2: Hyphens are often missing from the recognised text — indicated here in red. Excerpt from the smh 1988-03-13, p. 87.
magenta appear towards the end of the html file. The article on the lower left portion of the page, Business attitudes a ‘decade out of date’, shows a clear example of adjacent paragraphs being dislocated in the html file. The first paragraph of the second column is coloured green–yellow, indicating it is located in the first third of the html file, while the next paragraph is coloured blue, indicating it is located in the last third of the html file. Notice also that the colour of text within paragraphs tends to be quite consistent, indicating the order of words within the same paragraphs tends to be preserved in the html.
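The colour coding described above can be approximated by mapping each character's offset in the html file onto a hue ramp. The sketch below is an assumption about how such a figure could be produced, not a description of how Figure 5.3 was actually generated. It assumes a linear ramp in hsv hue from red (0 degrees) to magenta (300 degrees), which places green–yellow in the first third and blue in the last third, consistent with the description. The function name position_to_rgb is hypothetical.

```python
import colorsys


def position_to_rgb(offset, total_length):
    """Map a character offset in the file to an (r, g, b) colour.

    Assumed ramp: hue runs linearly from red (start of file)
    to magenta (end of file).
    """
    if total_length <= 1:
        fraction = 0.0
    else:
        fraction = offset / (total_length - 1)
    hue = fraction * (300.0 / 360.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))
```

Adjacent paragraphs that receive very different colours under this mapping are far apart in the file, which is the dislocation the figure makes visible.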
Noisy characters The ocr engine frequently incorrectly identifies characters on border
lines towards the edges of articles and images. This is often due to incorrect segmentation triggering recognition in areas that do not contain any text, such as vertical rules or optical noise. Figure 5.4 shows an example of this. The left side of this excerpt contains much optical noise, and the ocr engine has attempted to recognise characters in this region. The recognition output shows the characters incorrectly recognised in red.
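Output of this kind consists mostly of punctuation and symbols, so a simple screen is the proportion of alphanumeric characters in each recognised line. The sketch below is a heuristic under stated assumptions, not part of the original pipeline: the 0.5 threshold is an arbitrary illustrative value, and the function names are hypothetical.

```python
def alnum_ratio(line):
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def flag_noisy_lines(lines, min_ratio=0.5):
    """Return the lines whose alphanumeric ratio falls below `min_ratio`.

    `min_ratio` is an assumed threshold and would need tuning on real data.
    """
    return [line for line in lines if alnum_ratio(line) < min_ratio]
```

A ratio test will also flag legitimate lines made up mostly of punctuation, so flagged lines are better treated as candidates for review than removed automatically.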
Duplicate ocr lines The ocr engine occasionally duplicates lines of text. An example of
this is in Figure 5.5, which shows the last line of the paragraph being present twice in the ocr output. We do not know why this occurs.
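Duplicates of this kind can be detected approximately by checking whether a line closely resembles some part of the line before it. The sketch below is illustrative only. It slides a window over the previous line and compares it with the candidate using difflib's similarity ratio. The threshold of 0.75 and the minimum length of 8 characters are assumptions chosen for the example, and the function name is hypothetical. Fuzzy matching is used because, as Figure 5.5 shows, the duplicate may itself contain recognition errors.

```python
from difflib import SequenceMatcher


def _normalise(text):
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def is_probable_duplicate(line, previous, threshold=0.75, min_length=8):
    """Return True if `line` closely resembles some part of `previous`.

    `threshold` and `min_length` are assumed values, not tuned ones.
    """
    short = _normalise(line)
    long_text = _normalise(previous)
    if len(short) < min_length or len(short) > len(long_text):
        return False
    best = 0.0
    for start in range(len(long_text) - len(short) + 1):
        window = long_text[start:start + len(short)]
        ratio = SequenceMatcher(None, short, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best >= threshold
```

The minimum length guards against short lines that match part of almost any text by chance.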
Figure 5.3: Colour represents position in the ocr output — showing the scattering of paragraphs. Page 3 of the smh, 1988-03-12.
(a) image from newspaper page
Many other important shopkeepers arrived in this period, among them William Moffat and David Jones. Moffat founded a business which was to become W. C. Penfolds. David Jones opened his first shop in 1838 opposite the GPO, a convenient halt for the bullock wagons which traded with the test of the rapidly growing colony. In 1838, Sydney was still a small town of 30,000 people, but within 20 years it had become a flourishing city of 100,000. Melbourne, under the infiu- ence of the gold rushes, was growing even more rapidly and had a popula ion of 1 40,000 by 1 86 1 .
These cities required much more
" ' ' " ' " * " - ;^ £" t o
(b) ocr text
Figure 5.4: The ocr engine has incorrectly recognised characters on the left-hand side of the article (marked in red). Excerpt from the smh 1988-03-08, p. 27.
(a) image from newspaper page
"Launching closer to the equator means that your satellite has a longer life."
•China has signed several pre liminary agreements for the launch of satellites for American and European companies, but has not finalised dates.
Senator Button, who is in Beijing for a joint ministerial economic commission meeting, said two feasibility studies for the spaceport were to be deliv ered in coming months.
cred in coinin
(b) ocr text
Figure 5.5: A portion of the last line of the paragraph is recognised by the ocr engine twice.
(a) image from newspaper page
ENISE and John had already got to picking names. If the child was a boy, it was to be Luke. Even now, years after they have given up all hope of having children, the name Luke has some special meaning for them, reminding them of what they might have had if things had been different. But they just smile and shrug it off as part of life. They have learnt to accept their infertility. To get to this point, though, they had to survive 1 1 years of hoping and trying in vain; of medical tests and people telling them that if they could D
(b) ocr text
Figure 5.6: The ocr engine often places drop caps in erroneous places and does not group the letter with the word to which it belongs. Note the letter D in the top left. Excerpt from the smh 1987-03-05, p. 15.
Drop caps These are usually not grouped with the other characters of the first word of the paragraph, and the ocr engine often places them in unexpected positions. This is not surprising, since the drop cap would most likely belong to a different geometric layout analysis region from the remainder of the paragraph, so the ocr engine would likely recognise the text in each region separately. In Figure 5.6 the ocr engine placed the letter D far to the left of the remaining text, where it could easily be overlooked.
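A stray drop cap can sometimes be reattached with a dictionary check. The sketch below is a heuristic under stated assumptions rather than an established procedure. It assumes the tokens of a single paragraph and a caller-supplied vocabulary of lowercase words, and the function name reattach_drop_cap is hypothetical. It moves an isolated capital letter onto the first token only when the first token is not a known word and the joined form is.

```python
def reattach_drop_cap(tokens, vocabulary):
    """Move a stray single capital letter onto the first token if that
    forms a known word. `vocabulary` is an assumed set of lowercase words.
    """
    if not tokens:
        return []
    first = tokens[0]
    if first.lower() in vocabulary:
        # The first word is already complete; nothing to repair.
        return list(tokens)
    for i in range(1, len(tokens)):
        token = tokens[i]
        if len(token) == 1 and token.isalpha() and token.isupper():
            if (token + first).lower() in vocabulary:
                return [token + first] + tokens[1:i] + tokens[i + 1:]
    return list(tokens)
```

Applied to the text of Figure 5.6, this would join the trailing D with ENISE, provided the name appears in the vocabulary. Proper nouns are a weak point for any dictionary-based check.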
Missing pages The digitisation dataset we obtained from Fairfax Media was not complete. Many pages that were digitised by Google and are available on the Google News Archive website are missing from the gocr collection. There was confusion at Fairfax about this missing data: our contact was under the impression that Fairfax had copies of the data, but did not know the location of the hard drives.