Classifier training begins by setting a target training error rate to which the learning algorithm is required to converge. Ordinarily, this target is a zero error rate. The more weak classifiers the learning algorithm requires to reach this target, the longer the runtime will be. More powerful and efficient learning frameworks require fewer weak classifiers to converge to the target training error rate.
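The relationship between the target training error rate and the number of weak classifiers can be illustrated with a minimal sketch of Discrete AdaBoost over one-dimensional decision stumps. The toy dataset, the function names and the `max_rounds` limit are assumptions made for illustration only; they are not taken from any of the cited works.

```python
import math

def stump_predict(x, threshold, polarity):
    # A decision stump: predicts `polarity` below the threshold, `-polarity` otherwise.
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    # Exhaustively search midpoints between sorted values for the lowest weighted error.
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def strong_predict(ensemble, x):
    # Sign of the alpha-weighted vote of all weak classifiers selected so far.
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def train_to_target(xs, ys, target_error=0.0, max_rounds=50):
    # Add weak classifiers until the training error reaches the target (or rounds run out).
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble, history = [], []
    for _ in range(max_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest, renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        mistakes = sum(1 for x, y in zip(xs, ys) if strong_predict(ensemble, x) != y)
        history.append(mistakes / n)
        if history[-1] <= target_error:
            break
    return ensemble, history

# Hypothetical toy data: no single stump separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, history = train_to_target(xs, ys, target_error=0.0)
print(len(ensemble), history)
```

The length of `ensemble` is the weak classifier count needed to reach the target. Each additional round repeats the exhaustive stump search, which is why a larger count translates directly into longer training.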
Viola and Jones (2001b) discuss convergence issues of boosting algorithms and mention that training becomes more difficult in the later layers of a cascade. They point out that in those layers, weak classifiers tend not to be powerful enough to discriminate positive from negative training samples. They observe that error rates of 10% to 30% are possible in early layers, whereas in later layers they rise to 40% to 50%. This weak discriminatory ability in later layers prolongs the convergence of the final layers towards their cascade target rates and ultimately delays the overall training process.
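The effect of these error rates on convergence can be made concrete with the standard AdaBoost bound, under which the training error after T rounds is at most the product of 2*sqrt(e_t * (1 - e_t)) over the weighted errors e_t of the selected weak classifiers. The sketch below assumes, purely for illustration, that every round has the same weak error; the function name and the target value are hypothetical.

```python
import math

def rounds_for_bound(weak_error, target_bound):
    # Rounds needed before the AdaBoost upper bound on training error
    # drops to `target_bound`, assuming a constant weak error per round.
    if not 0.0 < weak_error < 0.5:
        raise ValueError("weak_error must lie strictly between 0 and 0.5")
    shrink = 2.0 * math.sqrt(weak_error * (1.0 - weak_error))  # per-round factor, < 1
    return math.ceil(math.log(target_bound) / math.log(shrink))

# Compare a weak error typical of early layers with one typical of later layers.
early = rounds_for_bound(0.20, 0.01)
late = rounds_for_bound(0.45, 0.01)
print(early, late)
```

Because this is an upper bound, the actual training error may fall faster. Even so, the gap between the two counts suggests why weak classifiers with error rates close to 50% slow convergence so sharply.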
To achieve faster convergence, some authors have modified the types of features used during learning. Xiao et al. (2003) and Withopf et al. (2007) attempt to reduce the number of weak classifiers by using more discriminatory features. The more powerful feature type they select is the output of previous layers, which is incorporated into the current layer as an additional, stronger learner. By using historical information from a previous layer, Withopf et al. (2007) achieved a reduction of up to 58% in the weak classifier count, which translated to a 15% reduction in their training time. Lienhart and Maydt (2002), on the other hand, introduce a novel set of rotated Haar-like features on top of the set already used by Viola and Jones (2004), thus expanding the filter set to nearly twice its original size. Although the additional rotated feature types led to a decrease in false alarm rates for a given hit rate, they did not decrease training times but instead prolonged them. The lengthening of their training phase can be attributed to the growth in the size of the feature space. This demonstrates that a richer feature space with more powerful discriminants does not necessarily shorten the training phase.
Similarly, Mita et al. (2005) employ more powerful features in the form of joint Haar-like features for their cascaded face detector. Using the co-occurrence of multiple Haar-like features accelerated convergence and produced fewer weak classifiers, because each weak classifier is associated with multiple features. However, their system is susceptible to overfitting, depending on the number of co-occurring features specified per weak classifier. The best value of this parameter depends on the problem at hand and cannot be known a priori, so training must be re-run to determine the optimal configuration. Training durations for a single run were not discussed in their research. It is nevertheless fair to assume that the additional calculation of the co-occurring features incurs a substantial computational penalty, which is not likely to shorten the training phase.
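One way to picture a joint feature is as a binary code built from several thresholded feature responses, with one bit per feature. The sketch below is only loosely modeled on that idea and is not the authors' implementation; the function name, thresholds and polarities are hypothetical.

```python
def joint_feature_index(responses, thresholds, polarities):
    # Binarise each feature response and pack the bits into one integer,
    # so F features yield an index in the range 0 .. 2**F - 1.
    index = 0
    for value, threshold, polarity in zip(responses, thresholds, polarities):
        bit = 1 if polarity * value > polarity * threshold else 0
        index = (index << 1) | bit
    return index

# Three hypothetical feature responses for one image window.
responses = [0.2, 0.9, 0.7]
thresholds = [0.5, 0.5, 0.5]
polarities = [1, 1, 1]
index = joint_feature_index(responses, thresholds, polarities)  # bits 0, 1, 1

# The number of distinct joint codes doubles with every added feature.
bins_per_feature_count = {f: 2 ** f for f in range(1, 6)}
print(index, bins_per_feature_count)
```

Under this reading, the number of joint codes grows as 2**F, so each code is supported by fewer training samples as F increases. That is one plausible explanation for the sensitivity to overfitting noted above.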
Also in the area of image detection, Wu et al. (2004) and Whitehill and Omlin (2006) experimented with the more powerful Gabor wavelet features. Wu et al. (2004) report faster convergence to a target training error rate using these features, but fail to observe a marked improvement in the classifiers’ generalization over the weaker Haar-like features. Moreover, both studies (Wu et al., 2004; Whitehill and Omlin, 2006) show that the more powerful Gabor features prolong feature extraction, which increases training runtimes as well as detection runtimes by orders of magnitude.
Other researchers have put forward modified versions of the original AdaBoost with the aim of achieving not only better accuracy but also faster training convergence. Lienhart, Kuranov and Pisarevsky (2003) compared three main AdaBoost algorithms, namely Discrete AdaBoost, Gentle AdaBoost and Real AdaBoost (Friedman et al., 2000), described in Chapter 2.2. They found that, with slightly more complex weak learners, Gentle AdaBoost outperformed the other two on their datasets. It not only needed fewer weak classifiers to converge to its target training error rate, but its accuracy on test sets was also better. Nevertheless, the authors do not state whether using Gentle AdaBoost also reduced training runtimes.
Another variant of AdaBoost, termed FloatBoost, was proposed by Li et al. (2002). It uses a backtracking algorithm that can remove weak classifiers deemed harmful or not useful in minimizing the overall error rate. Consequently, their strong classifiers contained fewer weak classifiers, but their training runtime proved to be considerably longer than that of AdaBoost and its variants.
Lastly, Viola and Jones (2001a) introduced Asymmetric AdaBoost, which alters the weighting distribution of the boosting process by placing greater weight on misclassified positive training samples. It demonstrated a greater capacity to reject false positives for a given hit rate, but no speed-up in convergence runtimes was reported. If anything, the proposed boosting algorithm appeared to extend the training phase, since it required a larger number of cascade layers than standard AdaBoost to achieve its target training error rates.
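The asymmetric weighting can be sketched as a per-round multiplicative correction applied on top of the usual AdaBoost update. The sketch assumes a cost ratio `k` between false negatives and false positives, spread evenly over `total_rounds` boosting rounds; these parameter names, the function name and the example values are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import math

def asymmetric_reweight(weights, labels, k, total_rounds):
    # Shift weight towards positive samples (label +1) and away from
    # negatives (label -1), then renormalise to a distribution.
    factor = math.log(math.sqrt(k)) / total_rounds
    scaled = [w * math.exp(y * factor) for w, y in zip(weights, labels)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Two positive and two negative samples with uniform starting weights.
labels = [1, 1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
reweighted = asymmetric_reweight(weights, labels, k=4.0, total_rounds=1)
positive_share = sum(w for w, y in zip(reweighted, labels) if y == 1)
print(reweighted, positive_share)
```

With `k` greater than 1, positive samples keep a larger share of the weight in every round, which biases the selection of weak classifiers towards preserving the hit rate.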