After the initial success of CNNs for classification, it was natural to use them for the task of object detection as well. In classification challenges the goal is to predict the main object of an image. Usually, the object covers a large area of the image and is in the center. Detection, on the other hand, is the task of predicting which objects lie within the image and also specifying their locations. This makes the CNNs trained for the ImageNet classification challenge not directly useful for detection.
The initial approach to bridging this gap was to extract areas of the image with a sliding window at various scales and evaluate the classifier on each of them. This approach requires an enormous amount of computation and does not offer real-time performance, because the CNN has to be evaluated many times. Another approach is to pre-process the extracted areas and rank them by the probability that they contain an object. After the most promising areas have been selected, the CNN is evaluated only on those and not on all the initially extracted areas. This method reduces the computational burden, but not enough to make it suitable for real-time systems.
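To give a feeling for why the sliding-window approach is expensive, the following minimal sketch enumerates the windows that would each need one classifier evaluation. The function names, the square window shape and the stride parameter are assumptions made for illustration; they are not taken from any specific detector.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window position at every scale.

    (x, y) is the top-left corner. Windows are assumed to be square and
    must fit entirely inside the image (an illustrative simplification).
    """
    for win in window_sizes:
        for y in range(0, image_h - win + 1, stride):
            for x in range(0, image_w - win + 1, stride):
                yield (x, y, win, win)


def count_windows(image_w, image_h, window_sizes, stride):
    """Number of classifier evaluations a naive sliding window would need."""
    return sum(1 for _ in sliding_windows(image_w, image_h, window_sizes, stride))


if __name__ == "__main__":
    # Hypothetical setting: a 640x480 image, three scales, stride of 32 pixels.
    n = count_windows(640, 480, window_sizes=(64, 128, 256), stride=32)
    print("CNN evaluations needed:", n)
```

Every yielded window corresponds to one full forward pass of the classifier, so reducing the stride or adding scales increases the cost quickly.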
Convolutions preserve information about the location of the features being extracted, so the next generation of object detectors makes use of this property by training the network to predict bounding boxes directly. The following CNN-based object detectors represent the two aforementioned methods and are worth considering for the object detector that has to be developed.
• Region-based CNN (R-CNN)[5] is a very well-known object detector that was originally based on AlexNet for feature extraction. Instead of the sliding window method, it uses selective search to extract about 2000 regions, which are then evaluated with the classifier. On top of the classifier, an SVM determines the existence of an object and its location. The first version was not fast enough to be a candidate for real-time systems. It also suffers from not being trainable end-to-end, which limits its performance. Fast R-CNN and Faster R-CNN are newer versions that improve on those weak points and also use fewer regions for speed-up. The improvements include the use of a region proposal network (RPN) for region proposal and the ability to train the whole detection pipeline (RPN → CNN → SVM) together.
Figure 14: R-CNN object detector pipeline[5]. The first step is the extraction of the regions that have the highest probability of containing an object. After the selection, those regions are resized to fit the classification CNN and then evaluated. Finally, the results of the evaluations are combined to create bounding boxes around the detected objects.
• YOLO[27] is an object detector that treats detection as a regression problem solved by a single CNN that outputs bounding boxes and class probabilities. This is done by dividing the image into multiple areas, where each area can contain a fixed number of bounding boxes. For each box of each area, the output layer predicts a pair of coordinates $(x, y)$ for the center of the box, its width and height $(w, h)$, and a confidence value. Furthermore, for each area the neural network makes a class prediction. The last layer's output vector has a length of $A \times (B \times 5 + C)$, where $A$ is the number of areas, $B$ is the number of boxes per area and $C$ is the number of classes that the CNN can classify. Note that $B$ is multiplied by 5 because each box has 5 variables $(x, y, w, h, \mathit{confidence})$. The final step is to combine the predicted boxes and classes of each area to create the final result. Because the CNN is evaluated only once, the required computation drops considerably, which makes YOLO the most prominent candidate for real-time systems. This approach has been shown to deliver robust results comparable to those of the region-based detectors while being much faster. Lately, more and more object detectors are based on similar methods, which extract the locations of objects directly through the convolutional layers.
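The output size formula A×(B×5 + C) can be checked with a short sketch. The helper names and, in particular, the memory layout of the flat vector (all boxes of an area first, followed by that area's class scores) are assumptions made for illustration only; an actual implementation may order the values differently.

```python
def yolo_output_length(num_areas, boxes_per_area, num_classes):
    """Length of the last layer's output vector: A * (B * 5 + C)."""
    return num_areas * (boxes_per_area * 5 + num_classes)


def split_area_predictions(output, num_areas, boxes_per_area, num_classes):
    """Split a flat output vector into per-area boxes and class scores.

    ASSUMED layout per area: B groups of (x, y, w, h, confidence),
    followed by C class scores. This ordering is hypothetical.
    """
    per_area = boxes_per_area * 5 + num_classes
    if len(output) != num_areas * per_area:
        raise ValueError("output length does not match A * (B * 5 + C)")
    areas = []
    for a in range(num_areas):
        chunk = output[a * per_area:(a + 1) * per_area]
        boxes = [tuple(chunk[b * 5:(b + 1) * 5]) for b in range(boxes_per_area)]
        class_scores = list(chunk[boxes_per_area * 5:])
        areas.append((boxes, class_scores))
    return areas


if __name__ == "__main__":
    # Illustrative values: A = 49 areas (a 7x7 grid), B = 2 boxes, C = 20 classes.
    A, B, C = 49, 2, 20
    length = yolo_output_length(A, B, C)
    print("output vector length:", length)
    areas = split_area_predictions([0.0] * length, A, B, C)
    print("areas:", len(areas), "boxes per area:", len(areas[0][0]))
```

With these illustrative values each area contributes 2×5 + 20 = 30 numbers, so the whole vector has 49×30 = 1470 entries.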
Figure 15: YOLO object detector pipeline[27]. As YOLO is a 'one shot' object detector, we only need to resize the input image and then evaluate the CNN. The last step is to use non-maximum suppression to filter out the boxes that are not valid detections.
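The non-maximum suppression step mentioned in the caption can be sketched as a greedy procedure: keep the most confident box, discard every remaining box that overlaps it too much, and repeat. The version below is a minimal, class-agnostic sketch; the function names, the (x, y, w, h, confidence) tuple format with (x, y) as the box center, and the default threshold of 0.5 are assumptions and are not taken from the YOLO implementation.

```python
def to_corners(box):
    """Convert (x_center, y_center, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (x_center, y_center, w, h) boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5, min_confidence=0.0):
    """Greedy NMS over a list of (x, y, w, h, confidence) tuples.

    Returns the kept detections, highest confidence first.
    """
    candidates = sorted(
        (d for d in detections if d[4] >= min_confidence),
        key=lambda d: d[4],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(det[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    # Two heavily overlapping boxes and one separate box (made-up values).
    dets = [(10, 10, 10, 10, 0.9), (11, 10, 10, 10, 0.8), (50, 50, 10, 10, 0.7)]
    for d in non_maximum_suppression(dets):
        print(d)
```

In a multi-class detector this filtering would typically be applied per class; the sketch ignores class labels to stay short.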
Figure 16: TPU block diagram[14]. Note that most of the components serve as data buffers, owing to the very high demand for local memory in neural network accelerators. The core of the system is the matrix multiply unit; everything else is built around it to fetch the parameters and the data and to write back the results.