The output of our detector is a set of overlapping detection rectangles around the text region, as shown in Fig. A.3(b). We obtain a text saliency map by counting, at each pixel, the number of detection rectangles that cover it. After normalizing this count to lie in the range [0, 255], we obtain an image such as the one shown in Fig. A.3(c). The combined saliency map is obtained by pixel-wise multiplication of the text saliency map and the generic saliency map (Fig. A.3(d)). The result is shown in Fig. A.3(e). The combined saliency maps are more amenable to binarizing and isolating text, which can then be fed to an OCR engine.
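The counting, normalization, and multiplication steps described above can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation used in this work. The function names `text_saliency` and `combine`, the half-open rectangle format `(x0, y0, x1, y1)`, normalization by the maximum count, and the rescaling of the product back to [0, 255] are all assumptions made for the example.

```python
def text_saliency(shape, rects):
    """Count, per pixel, how many detection rectangles cover it, then scale to [0, 255].

    shape -- (height, width) of the image
    rects -- iterable of (x0, y0, x1, y1), assumed half-open: [x0, x1) x [y0, y1)
    """
    h, w = shape
    counts = [[0] * w for _ in range(h)]
    for x0, y0, x1, y1 in rects:
        # Clip each rectangle to the image before counting.
        for y in range(max(0, y0), min(h, y1)):
            for x in range(max(0, x0), min(w, x1)):
                counts[y][x] += 1
    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts  # no detections: the map is all zeros
    # Assumption: normalize by the largest count so the peak maps to 255.
    return [[round(255 * c / peak) for c in row] for row in counts]


def combine(text_map, generic_map):
    """Pixel-wise product of two [0, 255] maps, rescaled back to [0, 255]."""
    return [[round(t * g / 255) for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(text_map, generic_map)]
```

Because the maps are multiplied, a pixel keeps a high value only if both the text detector and the generic saliency detector respond there, which is what makes the combined map easier to binarize.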
Figure A.2: Text detection and binarization examples. (a) The initial detections on the input image using the AdaBoost cascade. (b) The combined detection (black bounding box) using the algorithm presented in Section A.4. (c) The k-means segmented result. (d) The binarized output.
Figure A.3: Combining text saliency map and generic saliency detection. (a) Original natural image with a text region. (b) Raw AdaBoost text detection rectangles (in black). (c) Text saliency map obtained by counting the number of detections at each pixel. (d) Generic saliency detection using techniques from Chapter 3. (e) Result of multiplying the text saliency and generic saliency maps. The combined saliency map can be more easily used for binarizing text regions and feeding them to an OCR engine.
A.7 Summary of the chapter
In this chapter we presented a task-specific detection technique for text occurring in natural images. We first presented the state of the art in text detection. We then elaborated upon the text detection technique that is based on a well-known face detection approach. Our method takes few examples to train, uses simple features, is robust, and computationally efficient. We showed how the multiple detections that result can be combined to obtain the final detection. We then presented a method of binarizing the detected text so that it can be passed on to an OCR engine. Finally, we showed how text saliency maps can be created and combined with generic saliency maps to obtain task-specific saliency maps.
Appendix B
Object scale and Gaussian filtering
What appears from an aeroplane to be a yellow patch of land turns out, closer to the ground, to be a field of sunflowers. An even closer look can show the constituent molecules, atoms, and subatomic particles. We say that the yellow patch is observed at a coarse scale, while the atoms are visible at a much finer scale. Real-world objects appear different depending on the scale of observation. Just like objects in the world, details in an image exist only over a limited range of resolution. For a computer vision system analyzing an unknown scene, there is no way to know a priori what scales are appropriate for describing the structures of interest in the image. Hence, a reasonable approach is to consider descriptions of the image at multiple scales.
The formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, is scale-space theory [143, 78, 46, 90]. The notion of scale-space applies to signals of an arbitrary number of variables. Here we restrict ourselves to two-dimensional images. For a given image I(x, y), its linear (Gaussian) scale-space representation is a family of derived signals L(x, y; σ) defined by the convolution of I(x, y) with the Gaussian kernel
$$g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{B.1}$$
with σ being the standard deviation of the Gaussian kernel, such that
$$L(x, y; \sigma) = \big(g(\cdot, \cdot; \sigma) * I\big)(x, y) \tag{B.2}$$
Typically, only a finite, discrete set of levels of L with σ ≥ 0 is considered in the scale-space representation.
As σ → 0, g approaches an impulse function, so that L(x, y; 0) = I(x, y); that is, the scale-space representation at the finest scale level σ = 0 is the image I itself. As σ increases, L is the result of smoothing I with a larger Gaussian filter, thereby removing more fine structures, i.e. high-frequency detail. Specifically, details that are significantly smaller than σ in extent are removed from the image as we move towards coarser scales, leaving us with low-pass versions of the image. This illustrates the relation between scale and the spatial frequency content of images.
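Eqs. B.1 and B.2 can be illustrated with a small, dependency-free sketch that samples the Gaussian kernel on an integer grid and builds a few scale-space levels. It is written for clarity rather than speed and is not the filtering method used in this thesis. The function names, truncation of the kernel at 3σ, renormalization of the sampled kernel to unit sum, and replication of border pixels are assumptions made for the example.

```python
import math


def gaussian_kernel_2d(sigma, radius=None):
    """Sample Eq. B.1 on an integer grid and renormalize the samples to unit sum."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))  # assumption: truncate at 3 sigma
    norm = 2 * math.pi * sigma * sigma
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / norm
               for x in range(-radius, radius + 1)]
              for y in range(-radius, radius + 1)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def smooth(image, sigma):
    """L(x, y; sigma) = (g * I)(x, y) as in Eq. B.2; sigma = 0 returns a copy of I."""
    if sigma == 0:
        return [row[:] for row in image]
    kernel = gaussian_kernel_2d(sigma)
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                yy = min(max(y + dy, 0), h - 1)  # assumption: replicate border pixels
                for dx in range(-r, r + 1):
                    xx = min(max(x + dx, 0), w - 1)
                    # The kernel is symmetric, so correlation equals convolution.
                    acc += kernel[dy + r][dx + r] * image[yy][xx]
            out[y][x] = acc
    return out


def scale_space(image, sigmas):
    """Return a finite set of scale-space levels, keyed by sigma."""
    return {s: smooth(image, s) for s in sigmas}
```

Applied to an image containing a single bright pixel, the level at σ = 0 reproduces the input, and the peak value decreases as σ grows, which reflects the progressive removal of fine detail at coarser scales.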
It would seem that any low-pass filter g could be used to generate a scale-space. This is, however, not the case. It is of crucial importance that no new structures (i.e., structures that do not correspond to simplifications of structures at a finer scale) are introduced at the coarse scales. The Gaussian filter is the unique filter that generates a linear scale-space satisfying this essential requirement [78, 46, 90].
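The requirement that no new structures appear can be checked numerically in one dimension by counting local extrema before and after smoothing. The sketch below is only an illustration under stated assumptions: it uses the unnormalized binomial kernel [1, 2, 1] as a discrete stand-in for a Gaussian, integer arithmetic to avoid rounding effects, replicated end samples, and hypothetical function names. It does not address the two-dimensional case, where the formulation of the requirement is more subtle.

```python
def smooth_121(signal):
    """One pass of the unnormalized binomial kernel [1, 2, 1] with replicated ends."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [padded[i - 1] + 2 * padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]


def count_extrema(signal):
    """Count interior local extrema as sign changes of the non-zero first differences."""
    signs = [(b > a) - (b < a) for a, b in zip(signal, signal[1:]) if a != b]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def extrema_counts(signal, passes):
    """Extrema count of the input and of each successively smoother version."""
    counts = [count_extrema(signal)]
    for _ in range(passes):
        signal = smooth_121(signal)
        counts.append(count_extrema(signal))
    return counts
```

For the sample signal `[0, 5, 1, 6, 2, 9, 3, 4, 0, 7]`, the count drops from 8 to 2 after a single pass, and further passes do not increase it.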
B.1 Gaussian filtering in practice
In practice, Gaussian filtering is done using separable filters to reduce the computational overhead. For good results in the discrete case, binomial filters [15] are used, as they approximate Gaussian filters well for small values of σ. In addition, they use shifts and additions instead of the computationally expensive division operation (used for normalization). However, for large values of σ, the use of the binomial kernel or of a discrete approximation of the Gaussian kernel (Eq. B.1) becomes computationally expensive. In such cases, recursive filtering approaches are far more advantageous, as they use a small, constant number of operations per pixel for any value of σ. The most popular approaches are those of Deriche [35] and Young et al. [148]. In this thesis we use the latter method, along with the correct boundary conditions proposed by Triggs and Sdika [129].
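The separable binomial filtering mentioned above can be sketched as follows. This is an illustrative, dependency-free example, not the recursive filter of Young et al. [148] that is actually used in this thesis. The function names, the replicated borders, and the rounding before the shift are assumptions. A binomial kernel of order n sums to 2^n and has variance n/4, so n = 4 corresponds to σ = 1, and normalization reduces to a right shift by n bits.

```python
import math


def binomial_kernel(n):
    """Binomial coefficients C(n, k), k = 0..n; they sum to 2**n (variance n/4)."""
    return [math.comb(n, k) for k in range(n + 1)]


def filter_rows(image, kernel, shift):
    """Filter each row with integer taps; normalize with a right shift, not a division."""
    r = len(kernel) // 2
    w = len(image[0])
    half = (1 << shift) >> 1  # added before shifting so the result is rounded
    return [[(sum(c * row[min(max(x + i - r, 0), w - 1)]  # replicate border pixels
                  for i, c in enumerate(kernel)) + half) >> shift
             for x in range(w)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def binomial_blur(image, n=4):
    """Separable smoothing of an integer image: filter the rows, then the columns.

    n should be even so that the kernel has a center tap.
    """
    kernel = binomial_kernel(n)
    rows_done = filter_rows(image, kernel, n)
    return transpose(filter_rows(transpose(rows_done), kernel, n))
```

Separability reduces the cost per pixel from (n + 1)^2 to 2(n + 1) multiply-adds, but the kernel length still grows with σ^2, which is why recursive filters become preferable for large σ.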