The text-line detection accuracy measures how many text-lines were correctly detected. A text-line is said to be correctly detected if it does not fall into any of the three types of error defined below.
Oversegmented text-lines: the number of text-lines that are either split into more than one text-line, or partially detected.
Undersegmented text-lines: the number of text-lines merged with some other text-line.
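The error categories above can be expressed as a small classification routine. The Python sketch below is only an illustration and is not the evaluation procedure used in this work. It assumes a hypothetical mapping `gt_to_det` from each ground-truth text-line to the set of detected lines that overlap it, and a hypothetical set `partial` of ground-truth lines that were only partially covered. The "missed" label is taken from the column headings of the result tables; its exact definition here (no overlapping detection at all) is an assumption.

```python
from collections import Counter

def classify_lines(gt_to_det, partial=frozenset()):
    """Label each ground-truth text-line.

    gt_to_det: dict mapping a ground-truth line id to the set of
               detected-line ids that overlap it (hypothetical input).
    partial:   ids of ground-truth lines that were only partially detected.
    """
    # Count how many ground-truth lines each detected line touches.
    det_usage = Counter(d for dets in gt_to_det.values() for d in dets)
    labels = {}
    for gt, dets in gt_to_det.items():
        if not dets:
            labels[gt] = "missed"            # assumed definition
        elif any(det_usage[d] > 1 for d in dets):
            labels[gt] = "undersegmented"    # merged with another line
        elif len(dets) > 1 or gt in partial:
            labels[gt] = "oversegmented"     # split or partially detected
        else:
            labels[gt] = "correct"
    return labels

def detection_accuracy(labels):
    """Percentage of ground-truth lines labelled as correctly detected."""
    correct = sum(1 for v in labels.values() if v == "correct")
    return 100.0 * correct / len(labels)

example = {"a": {1}, "b": {2, 3}, "c": {4}, "d": {4}, "e": set(), "f": {5}}
labels = classify_lines(example, partial={"f"})
```

In this toy input, line "b" is split across two detections, lines "c" and "d" share one detection, line "e" has no detection, and line "f" is flagged as partial, so only line "a" counts as correct.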
98 6.3. EXPERIMENTS, RESULTS, AND DISCUSSION
[Bar chart: text-line extraction accuracy (%) for the Book, Poetry, Digest, Magazine, and Newspaper categories; series: Proposed Method, x-y cut.]
Figure 6.6: A comparison of text-line segmentation accuracies of the proposed technique and the x-y cut algorithm.
The documents in the dataset were segmented using the x-y cut algorithm and the text-line extraction algorithm described in Section 6.2. The x-y cut algorithm was chosen as a representative state-of-the-art algorithm for segmenting Nastaliq script documents. A comparison of the percentage of correctly segmented text-lines by both methods is shown in Figure 6.6.
The horizontal projections of sample paragraphs in Nastaliq script are shown in Figure 6.7. In both paragraphs, there are no zero-valleys in the projection profile. To segment Urdu text with x-y cut, the noise thresholds are therefore set to a high value so that the algorithm can find the main body of each text-line. The parts of a text-line cut off at this stage can be assigned back to the text-line by simple post-processing steps. Hence, text-lines such as those shown in Figure 6.8 are considered correctly segmented. The results of segmentation using the x-y cut algorithm are shown in Table 6.1. The parameter values used for the x-y cut algorithm were: tx = 30, ty = 2, tnx = 20, tny = 250 (see Section 4.2.1 for an explanation of these parameters).
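The effect of a high noise threshold on a projection profile without zero-valleys can be illustrated with a short Python sketch. It is a simplified illustration, not the x-y cut implementation evaluated here: the function names and the single `noise_threshold` argument are hypothetical and do not correspond one-to-one to the parameters tx, ty, tnx, and tny.

```python
import numpy as np

def horizontal_projection(binary):
    """Ink pixels per row; `binary` is a 2D array with 1 = ink, 0 = background."""
    return np.asarray(binary).sum(axis=1)

def line_bodies(profile, noise_threshold):
    """Row ranges (start, end), end exclusive, where the profile exceeds the threshold."""
    above = np.asarray(profile) > noise_threshold
    # Pad with False so that runs touching the image border are closed properly.
    padded = np.concatenate(([False], above, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

# Toy image: two text-lines whose descenders and ascenders overlap in row 3,
# so the profile never drops to zero between them.
ink_per_row = [0, 8, 9, 2, 8, 7, 1]
image = np.zeros((len(ink_per_row), 10), dtype=int)
for row, count in enumerate(ink_per_row):
    image[row, :count] = 1

profile = horizontal_projection(image)
merged = line_bodies(profile, noise_threshold=0)   # [(1, 7)]
split = line_bodies(profile, noise_threshold=3)    # [(1, 3), (4, 6)]
```

With a threshold of zero the two lines are merged into a single block, whereas the higher threshold isolates the two line bodies and leaves the low-density rows (3 and 6) to be reassigned in post-processing.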
The results show that the algorithm is able to segment book and poetry documents with high accuracy owing to their relatively large inter-line spacing. However, it fails to segment digest, magazine, and newspaper layouts due to smaller inter-line gaps and the presence of multiple columns.
Figure 6.7: Horizontal projection of Urdu text samples. The top figure shows the case of larger inter-line spacing, the bottom figure shows the case of small inter-line spacing. Note that in both cases, there are no between-line zero-valleys in the projection profile.
Figure 6.8: Results of segmenting an Urdu piece of text using the x-y cut algorithm. The resulting text-lines are considered correct as the body of each text-line is correctly identified. The ascenders and descenders can be assigned to the text-line body using simple post-processing steps.
Table 6.1: The performance of the x-y cut algorithm on the Urdu documents dataset.
Document type (n)    Correctly detected (%)    Over-segmented (%)    Under-segmented (%)    Missed (%)
Book (231)           82.68                     5.19                  6.93                   5.19
Poetry (284)         94.72                     1.76                  0.00                   3.52
Digest (702)         64.67                     0.57                  29.91                  4.84
Magazine (1156)      64.45                     0.17                  33.82                  1.56
Newspaper (819)      24.54                     0.85                  60.2                   2.20
Table 6.2: The performance of the presented text-line extraction algorithm on each category of the dataset.
Document type (n)    Correctly detected (%)    Over-segmented (%)    Under-segmented (%)    Missed (%)
Book (231)           92.21                     5.63                  0.00                   2.16
Poetry (284)         92.25                     4.58                  0.00                   3.17
Digest (702)         80.63                     11.54                 0.00                   7.83
Magazine (1156)      90.48                     4.15                  0.87                   4.33
Newspaper (819)      72.16                     7.81                  4.15                   15.87
Table 6.2 summarizes the results obtained by applying the presented layout analysis algorithm to the test data and manually inspecting the text-lines detected by the system. Example images showing the result of the algorithm are given in Figures 6.9 and 6.10. Note that the presented system achieved better results than the x-y cut approach on all categories of documents except poetry documents. The presented algorithm made more split (over-segmentation) errors than the x-y cut algorithm. These errors arose from partially detected text-lines in which the first or the last word of the line was not vertically aligned with the rest of the text-line and hence was ignored by the text-line extraction algorithm.
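The per-category figures in Tables 6.1 and 6.2 can also be combined into a single pooled accuracy by weighting each category by its number of text-lines. The Python sketch below shows this arithmetic. The pooled figure is not reported in the text and is derived here purely for illustration; the names `counts`, `xy_cut`, `proposed`, and `pooled_accuracy` are introduced for this example only.

```python
# Number of text-lines per category (the n values in Tables 6.1 and 6.2).
counts = {"Book": 231, "Poetry": 284, "Digest": 702,
          "Magazine": 1156, "Newspaper": 819}

# Correctly detected text-lines (%) from Table 6.1 (x-y cut)
# and Table 6.2 (presented algorithm).
xy_cut = {"Book": 82.68, "Poetry": 94.72, "Digest": 64.67,
          "Magazine": 64.45, "Newspaper": 24.54}
proposed = {"Book": 92.21, "Poetry": 92.25, "Digest": 80.63,
            "Magazine": 90.48, "Newspaper": 72.16}

def pooled_accuracy(per_category, counts):
    """Accuracy over all text-lines, weighting each category by its size."""
    total = sum(counts.values())
    return sum(per_category[k] * counts[k] for k in counts) / total

pooled_xy = pooled_accuracy(xy_cut, counts)
pooled_proposed = pooled_accuracy(proposed, counts)

# Categories in which the presented algorithm has the higher accuracy.
better = [k for k in counts if proposed[k] > xy_cut[k]]
```

Under this weighting, the pooled accuracy comes to roughly 58.3% for the x-y cut algorithm and 83.9% for the presented algorithm over the 3192 text-lines, and `better` contains every category except poetry.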
Based on the results presented in Table 6.2, the following observations can be made:
• The algorithm works quite well on documents in the book, magazine, and poetry categories. The few missed text-lines are those that consist of only one connected component, since the text-line detection algorithm needs at least two components to form a line. The over-segmentation errors occur due to page curl in some documents.
• In the digest layout, many over-segmentation errors occur due to enumerated lists, in which the enumerator symbol is segmented separately.
(a) Facing pages of a prose book
(b) Facing pages of a poetry book
Figure 6.9: Example images illustrating results of the proposed layout analysis system on a book and a poetry page. The yellow rectangles show the detected vertical gutters. Thin horizontal blue lines indicate detected text-line segments, and the thick magenta lines running down and diagonally across the image indicate reading order.
Figure 6.10: Example image illustrating results of the proposed layout analysis system on a magazine page. Note that images and graphics were not removed, so they result in some spurious text-lines.
Figure 6.11: An example image cropped from the front page of a newspaper containing both normal and inverted text in the same image. The binarization step fails to identify the inverted text and makes it a part of the page background (white) as shown in the binarized image. (cp. [SuHKB06])
• In the newspaper layout, many text-lines are missed. This is due to the very small inter-line spacing between consecutive text-lines, which sometimes causes connected components from two neighboring text-lines to merge. In such a case, the upper text-line is missed by the algorithm. Another source of missed text-line errors is the presence of inverted text, which results in poor binarization of the image, as shown in Figure 6.11. A number of false alarms also appear in these layouts due to non-text elements on the page.