
Texture-less Object Detection in RGB-D: A comparison between traditional approaches & deep neural networks


Academic year: 2023


Bachelor’s thesis

Texture-less Object Detection in RGB-D: A comparison between traditional approaches & deep neural networks

Tomás Golvano García

Tutor: Marcos Escudero Viñolo
Ponente: Jesús Bescós Cano¹
Madrid, June 2021

¹ Escuela Politécnica Superior, Universidad Autónoma de Madrid


Abstract

Uniformly colored, smooth, texture-less objects occur frequently in domestic and industrial environments. Learning to classify and detect these objects accurately is a frequent need in both industrial and personal robotics and their applications. Localizing the objects and estimating their pose precisely is required for any further interaction with them [1].

This thesis grew out of genuine curiosity about the topic. The task is reputed to differ considerably from regular detection because of the non-linearity of the objects to detect, which makes regular feature extractors less reliable than in tasks where the objects have stable gradients in their shapes.

The work also served as a way to learn the statistical tools used for detection and classification in computer vision, starting from the mathematical perspective and then moving to the more empirical approach that the field requires.

To understand the differences between an approach that was traditional a decade ago and a state-of-the-art object detection method, a publicly available dataset by Hinterstoisser [2] was used.

Keywords

textureless, texture-less, HOG, PCA, SVM, Faster R-CNN, detection


Summary

Uniformly colored and lacking well-defined edges, texture-less objects are found frequently in both domestic and industrial environments.

Learning to classify and detect this kind of object accurately is a common need in both industrial and household robotics and their possible applications. Locating the objects and detecting their position precisely is necessary for any interaction intended with an object of these characteristics.

This work began as genuine curiosity about texture-less objects, since the task was rumored to be quite different from the detection of other objects because of the lack of linearity in the object to detect and in its features, so that traditional descriptors were not as reliable as they can be for objects with stable gradients and shapes.

This text began as a process of learning the statistical tools used for both detection and classification in the field of computer vision, for the most part starting from the mathematical perspective and then moving to the more empirical side that this field demands.

To understand the differences between what was a traditional strategy a decade ago and one of the methods currently reaching the state of the art in object detection, a dataset made publicly available by Hinterstoisser was used.

Keywords

textureless, texture-less, HOG, PCA, SVM, Faster R-CNN, detection

Contents

1 Introduction
  1.1 Motivation
  1.2 Aims
2 State of the art and related work
  2.1 Object Detection
  2.2 Traditional approaches
    2.2.1 Histogram of oriented gradients
    2.2.2 Principal Component Analysis Algorithm
    2.2.3 Classification algorithms: SVM and KNN
  2.3 Convolutional Neural Networks
    2.3.1 Region-based: R-CNN, Fast R-CNN and Faster R-CNN
    2.3.2 One-stage: EfficientDet and YOLO
3 Materials and Methods
  3.1 RGBD datasets for object detection
  3.2 Tools
    3.2.1 Software
    3.2.2 Hardware
  3.3 Traditional approach
    3.3.1 Featuring the data
    3.3.2 Training
    3.3.3 Test
  3.4 Faster Region-based CNN
    3.4.1 Initialization
    3.4.2 Training
    3.4.3 Test
4 Results and comparisons
  4.1 Quantitative results
  4.2 Qualitative results
5 Conclusions and future work
References

1 Introduction

1.1 Motivation

Despite their ubiquitous presence in many environments, texture-less objects are challenging to recognize and localize in several respects. The appearance of a texture-less object is dominated by its shape, its material properties and the configuration of light sources. Such objects present significant challenges to contemporary visual object detection methods, principally because common local appearance descriptors are not discriminative enough to provide reliable correspondences.

In 2D images, the most sensible way to describe texture-less objects is to use a representation based on the object edges, i.e. the points where image characteristics change sharply. For texture-less objects, these edges correspond mainly to the object's outline.

Depth images used as additional input can simplify the detection task. Such images, known as RGB-D images, have aligned color and depth and thus describe the appearance and the geometry of the scene simultaneously. RGB-D images can be obtained with Kinect-like sensors. The extra information provided by the 3D shape allows a more detailed description, which is expected to reduce the number of false detections (hallucinations).

Over the last decade, neural networks and GPUs have revolutionized the computer vision landscape; this work focuses specifically on recognition. Its aim is to check how large the improvement of these new solutions is compared with a more traditional approach. In particular, this is done with a set of three classes of texture-less objects taken from an RGB-D image set by Hinterstoisser.

Convolutional neural networks (CNNs) are currently the most popular approach in computer vision for detecting and recognizing objects within an image. Originally they were used as simple classifiers, but they were soon applied beyond that, to segmentation, region proposals, pose recognition, and so on.

1.2 Aims

The main goal of this work is to understand how and why this breakthrough happened in computer vision by examining two very different methods: one that could have been tried a decade ago and a current state-of-the-art method.

Convolutional neural networks (CNN) yield remarkable results in many computer vision fields. Up to now, in the field of texture-less object detection, Wohlhart et al. [32] used a CNN to obtain descriptors of object views that efficiently capture both the object identity and 3D pose. The CNN was trained by enforcing simple similarity and dissimilarity constraints between the descriptors, untangling the images from different objects and different views into clusters that are not only well-separated but also structured as the corresponding sets of poses. The method can work with either RGB or RGB-D images and outperforms the state-of-the-art methods on the dataset of Hinterstoisser et al. [18]. Another method using CNNs is the one presented by David Held et al. [26], who also outperformed the state of the art. They introduced a new approach for recognition with limited training data, in which a multi-view dataset is used to train the network to be robust to viewpoint changes, improving the recognition of objects when only a single training image is available for each object.

Another objective is to check how different sources of information, such as grayscale, RGB and RGB-D, affect the detection of these texture-less objects. This may help to determine whether more information always yields more accuracy for this problem.

2 State of the art and related work

2.1 Object Detection

2.2 Traditional approaches

Most traditional approaches start with standardization of the input, followed by a feature extractor. A traditional classifier, such as KNN, SVM or Random Forest, is then trained on the extracted features. Finally, non-maximum suppression is used to select the most likely bounding box from the output.
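The last step of this pipeline, non-maximum suppression, can be illustrated with a minimal sketch. It assumes boxes given as (x1, y1, x2, y2) tuples and a greedy variant with an overlap threshold of 0.5; the function names, the threshold and the sample boxes are illustrative assumptions, not the code used in this work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates and one isolated candidate.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

In this example the second box overlaps the first almost completely and has a lower score, so only the first and third boxes survive.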

2.2.1 Histogram of oriented gradients

The histogram of oriented gradients is a feature descriptor technique whose principles were first used in 1986 [29], although not under that name: the idea is to describe the shape of an object through its gradients. The same principles were used again in 1994 by Mitsubishi for hand gesture recognition [30], which introduced the term "orientation histogram". Even so, it was not until much later, in 2005, that Dalal and Triggs popularized it as a reliable descriptor for pedestrian detection.

HOG is an image descriptor presented by Dalal and Triggs [17], which counts occurrences of gradient orientation in localized portions of an image.

The HOG descriptor effectively encodes the object’s shape. The shape is one of the dominating properties of texture-less objects and thus the HOG descriptor seems to be a suitable choice to describe this type of objects.

Additionally, it is possible to extend the HOG descriptor to depth images and compare its discriminative power to that of its original version.

HOG has been a very successful descriptor. By default it is applied to grayscale images, but it can also be computed on each channel separately, with the results concatenated, to try to improve the information obtained.
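The idea of counting gradient orientations and concatenating the result over several channels can be sketched as follows. This is a deliberately simplified illustration: it computes a single unsigned-orientation histogram per channel, without the cell grid, block normalization or bin interpolation of the full descriptor, and all names and the 9-bin choice are assumptions rather than the configuration used in this work.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Unsigned-orientation gradient histogram of one 2D cell (list of rows)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # central differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


def multichannel_descriptor(channels, n_bins=9):
    """Concatenate the per-channel histograms (e.g. R, G, B and depth)."""
    descriptor = []
    for channel in channels:
        descriptor.extend(cell_histogram(channel, n_bins))
    return descriptor


# A vertical step edge: intensity changes only along x.
edge = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
hist = cell_histogram(edge)
rgbd = multichannel_descriptor([edge, edge, edge, edge])
```

For the step edge all the gradient energy falls into the first orientation bin, and using four channels yields a descriptor four times as long as the single-channel one.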

2.2.2 Principal Component Analysis Algorithm

High-dimensional data poses a problem: some dimensions tend to be more meaningful than others, and if every dimension of an observation is given similar relevance by the subsequent classifiers, the results and computing times may not be as satisfactory as expected.

The principal component analysis (PCA) algorithm is used for dimensionality reduction. It is useful when the raw data has high dimensionality and is also highly correlated, which often makes the data group into ellipsoids. Those ellipsoids are projected onto a lower-dimensional hyperplane. PCA finds the k orthogonal directions of a linear space along which the concentration of the data is highest, where concentration is measured by the variance (spread) of the projected data.

In MATLAB or Python this is computed with a single function call, but the process is described here in order to understand it fully. First, the raw data must be centered by subtracting the mean, so that the mean of each input dimension is zero. Once every dimension has zero mean, the covariance matrix is computed:

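$$\Sigma = \frac{1}{m-1} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$$

where m is the number of observations, n is the number of dimensions, and $x^{(i)} \in \mathbb{R}^{n}$ is the i-th row of the centered observations matrix X, written as a column vector.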

From this point there are several ways to compute the PCA projections: one uses the alternating least squares (ALS) algorithm, another uses eigenvectors, but the most common nowadays is singular value decomposition (SVD). The last two options give the same results; however, SVD is numerically more stable, so it is generally preferred.

SVD factorizes the covariance matrix into three matrices:

$$\Sigma = U S V^{T}$$

Only one component of the factorization is of interest for PCA, the matrix U. U is an n×n square unitary matrix over $\mathbb{R}$, so it is an orthogonal matrix.

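$$U = \begin{bmatrix} | & | & | & | & & | & & | \\ u_1 & u_2 & u_3 & u_4 & \cdots & u_k & \cdots & u_n \\ | & | & | & | & & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}$$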

To reduce the number of dimensions from n to k, the first k column vectors of the U matrix are taken, sorted from the most to the least variance explained by the subsequent projection.

$$U_{\text{reduce}} = \underbrace{\begin{bmatrix} | & | & | & | & & | \\ u_1 & u_2 & u_3 & u_4 & \cdots & u_k \\ | & | & | & | & & | \end{bmatrix}}_{n \times k}$$

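So, for any original value $x^{(i)}$ from the zero-mean data, the projection goes from:

$$x^{(i)} \in \mathbb{R}^{n} \longrightarrow z^{(i)} \in \mathbb{R}^{k}$$

$$z^{(i)} = \underbrace{\begin{bmatrix} | & | & | & | & & | \\ u_1 & u_2 & u_3 & u_4 & \cdots & u_k \\ | & | & | & | & & | \end{bmatrix}^{T}}_{k \times n} \; \underbrace{x^{(i)}}_{n \times 1}$$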

The result is a dimensionality reduction that minimizes the squared projection error of the original data. In a nutshell, the PCA algorithm seeks a matrix W that minimizes:

$$\lVert X W W^{T} - X \rVert^{2}$$

where X is the observations matrix with the mean of every dimension subtracted.
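The steps described in this section (centering, covariance, SVD, truncation and projection) can be put together in a short sketch. It assumes NumPy is available and that the observations are stored as rows; the function name and the synthetic data are illustrative assumptions, not the code used in this work.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X (m observations, n dimensions) onto k components."""
    X_centered = X - X.mean(axis=0)              # zero mean in every dimension
    m = X_centered.shape[0]
    sigma = X_centered.T @ X_centered / (m - 1)  # n x n covariance matrix
    U, S, Vt = np.linalg.svd(sigma)              # singular values in descending order
    U_reduce = U[:, :k]                          # n x k, first k columns of U
    Z = X_centered @ U_reduce                    # m x k, one projection per row
    return Z, U_reduce, S


rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Strongly correlated 2D data lying close to the line y = 2x.
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200)])
Z, U_reduce, S = pca_project(X, k=1)
explained = S[0] / S.sum()   # fraction of variance kept by the first component
```

Because the synthetic data is almost perfectly correlated, a single component retains nearly all of the variance, which is the situation in which PCA is most useful.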

PCA is considered here both to deal with the high dimensionality and because of the suspicion that texture-less objects can be described in a simpler way. The combination of convex and concave shapes that is implicitly captured when a texture-less object is described by a histogram of oriented gradients (HOG) might allow this kind of object to be described with a manageable number of more meaningful dimensions.

2.2.3 Classification algorithms: SVM and KNN

Support Vector Machines

Commonly known as SVMs, support vector machines are a family of supervised models used to separate two categories of data. They have been widely used and remain a common classifier because of their simplicity. When there are several classes, one model can be trained per class, creating a one-versus-all scheme. It can be shown that, on linearly separable data, a logistic regression trained for long enough by gradient descent converges in direction to the maximum-margin SVM solution.

The general working method consists of splitting the data of the two classes, using the kernel trick to produce a non-linear boundary. In a nutshell, the kernel trick implicitly maps the data into a higher-dimensional space where a maximum-margin hyperplane can be drawn, even when no hyperplane could separate the classes in the original lower-dimensional space because the boundary is non-linear.
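The kernel trick can be illustrated with a small numerical check. The sketch below uses the degree-2 homogeneous polynomial kernel as an example, an assumption chosen for simplicity since the kernel used in this work is not stated, and shows that evaluating the kernel in 2D gives the same value as an inner product in an explicit 3D feature space.

```python
import math


def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3D feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain inner product of two equally sized tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)                      # computed in 2D
explicit = dot(feature_map(x), feature_map(y))    # computed in 3D
```

Both values agree, so a classifier that only needs inner products can work in the higher-dimensional space without ever computing the mapped vectors.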

K-nearest neighbor

The K-nearest neighbors (KNN) classifier is a non-parametric model, meaning that the number of parameters grows with the amount of training data. Such models are more flexible than parametric ones, but they are frequently computationally intractable for large datasets [5].

For each test observation, the algorithm finds the K closest points in the training set and counts how many of them belong to each class. From these counts a probability is estimated for each class: the more points of a given class lie close to the test point, the more likely the test point is to be assigned to that class.

The distance is often the Euclidean distance, but there are other variants, such as Minkowski or Chebyshev. By default, the score given to each class is computed uniformly, so the actual distance of each training point is not taken into consideration; all that matters is being among the K nearest. There are, however, ways to weight the neighbors so that the scores depend on how far the training points are from the test point. These distance-weighting techniques include, but are not limited to, inverse distance and inverse squared distance.
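The voting scheme and the effect of distance weighting can be sketched in a few lines. The example assumes Euclidean distance and Python 3.8 or later (for `math.dist`); the function name, the weighting labels and the toy training points are illustrative assumptions.

```python
import math
from collections import defaultdict


def knn_scores(train, query, k=3, weighting="uniform"):
    """Class scores for `query` from its k nearest training points.

    `train` is a list of (point, label) pairs; the scores sum to 1.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, query)
        if weighting == "uniform":
            votes[label] += 1.0
        elif weighting == "inverse":
            votes[label] += 1.0 / (d + 1e-12)      # small term avoids division by zero
        elif weighting == "inverse_square":
            votes[label] += 1.0 / (d * d + 1e-12)
        else:
            raise ValueError(f"unknown weighting: {weighting}")
    total = sum(votes.values())
    return {label: v / total for label, v in votes.items()}


train = [((0.0, 0.0), "duck"), ((3.0, 0.0), "cat"), ((0.0, 4.0), "cat")]
query = (0.0, 1.0)
uniform = knn_scores(train, query, k=3)
weighted = knn_scores(train, query, k=3, weighting="inverse")
```

With uniform voting the two distant "cat" points outvote the single close "duck" point, whereas inverse-distance weighting favors "duck", which shows how the weighting choice can change the decision.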

Although this algorithm is very simple, it performs quite well with a moderate value of K. However, since it is based on distance measurements, it is affected by the curse of dimensionality: the higher the number of dimensions, the poorer the performance of KNN.

A very low K, such as K=1 or K=2, makes the classifier inaccurate, whereas increasing K has been observed to reduce the noise in the classification among classes (Miscellaneous Clustering Methods, Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl, 2011). Techniques such as hyperparameter optimization, which can rely on cross-validation, help to find a quasi-optimal value.

2.3 Convolutional Neural Networks

A decade ago, researchers were mostly focused on improving classifiers and designing better feature extractors [17] [3]. Although neural networks had been around for decades [6] [12], it was not until 2012 that the real breakthrough in CNN precision and performance happened. Three main causes can be credited: larger publicly available datasets, more powerful computers and the advent of AlexNet [22].

Arguably, AlexNet was the coming-out party for CNNs. Its main novelties can be summarized as the use of GPUs to train the model and techniques that are now standard, such as dropout layers and rectified linear units (ReLU).

2.3.1 Region-based: R-CNN, Fast R-CNN and Faster R-CNN

Region-based Convolutional Neural Network

The first version of these algorithms begins by proposing around 2000 bounding boxes in the image, called region proposals. For each proposal, a deep CNN performs feature extraction. Finally, an SVM classifies the region. This method achieves great accuracy, although it cannot be used in real-time applications because of its latency.

Fast R-CNN

To reduce the latency of R-CNN, the researchers noticed that if the convolution is computed just once per image, the features of overlapping bounding boxes are no longer computed redundantly. This considerably relieved the bottleneck of extracting features for every region proposal.


Faster R-CNN

The previous versions solved some bottlenecks, but generating the region proposals, which relied on selective search, remained a troublesome part. The researchers therefore created a small neural network specialized in generating regions of interest, called the Region Proposal Network.

2.3.2 One-stage: EfficientDet and YOLO

In all the methods described below, the information is extracted from the image in a single step, that is, simultaneously, and is then processed in parallel.

EfficientDet [8]

In the pursuit of a balance between accuracy and size, EfficientNet [13] introduced certain novelties that extend the work on MobileNet [9]. The two main characteristics of this improvement were compound scaling and neural architecture search.

Before compound scaling, neural networks were scaled by varying three main dimensions: depth, that is, the number of layers; width, the number of channels of each layer; and the resolution of the input image. Most researchers tuned these three dimensions almost arbitrarily.

To balance these three dimensions when scaling a ConvNet, the authors noticed that the number of floating point operations (FLOPS) of a convolution is directly proportional to its depth and to the square of its width and resolution. To obtain a significant improvement, they decided to keep the product of the depth factor, the squared width factor and the squared resolution factor around a fixed value of 2. From there, a small grid search over parameters around the starting point yields the best combination.

In this way they managed to balance the scaling of the different dimensions of the ConvNet, which makes sense given that a higher resolution is needed to appreciate fine-grained patterns.
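The arithmetic behind compound scaling can be checked with a short sketch. The default factors 1.2, 1.1 and 1.15 for depth, width and resolution are the values I recall from the EfficientNet paper and should be read as an assumption here, as should the function and variable names.

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width, resolution multipliers and FLOPS growth for coefficient phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPS grow with depth * width^2 * resolution^2.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


base_product = 1.2 * 1.1 ** 2 * 1.15 ** 2   # constraint: close to 2
for phi in range(4):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

Since the product of the factors is close to 2, each unit increase of the compound coefficient roughly doubles the FLOPS of the network.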


Figure 1: Compound scaling method

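The other innovation was neural architecture search (NAS). The authors realized that an appropriate mobile-size baseline architecture was critical for the performance of the network. As the optimization goal they used an objective from previous work [19], maximizing:

$$ACC(m) \times \left[ FLOPS(m) / T \right]^{w}$$

where ACC(m) and FLOPS(m) are the accuracy and the FLOPS of a model m, T is the target FLOPS and w is an exponent that controls the trade-off between accuracy and FLOPS.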

In this way the search promotes accuracy while penalizing computationally heavy solutions, which also tend to have slow inference times. Quoting the authors: “EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.” [13]
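The behavior of the search objective ACC(m) × [FLOPS(m)/T]^w can be illustrated numerically. The exponent w = -0.07 is the value I recall from the EfficientNet paper, while the target of 400 million FLOPS and the two candidate models are hypothetical numbers chosen only for illustration.

```python
def nas_objective(accuracy, flops, target_flops, w=-0.07):
    """ACC(m) * (FLOPS(m) / T) ** w for one candidate model m."""
    return accuracy * (flops / target_flops) ** w


target = 400e6   # hypothetical target FLOPS T
# A light candidate that meets the target exactly.
light = nas_objective(accuracy=0.76, flops=400e6, target_flops=target)
# A heavier candidate: slightly more accurate but four times the FLOPS.
heavy = nas_objective(accuracy=0.77, flops=1600e6, target_flops=target)
```

With a negative exponent, the small accuracy gain of the heavier candidate does not compensate for its fourfold cost, so the lighter candidate obtains the higher score.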

At that point the authors decided to go one step further and apply these solutions to the field of object detection. They noted that recent state-of-the-art detectors, such as [24], were hard to deploy in real-world applications because of the number of parameters they need and the latency required to obtain a result.

The main contributions of EfficientDet were a feature pyramid network called BiFPN and compound scaling, the key idea behind the EfficientNet paper mentioned previously.


You Only Look Once

YOLO is short for You Only Look Once and, unsurprisingly, it belongs to the category of one-stage methods. It works by splitting an image into a fixed S×S grid and predicting m bounding boxes within each grid cell. If the center of an object falls into a grid cell, that cell is responsible for detecting that particular object. All cells are evaluated simultaneously, which is one of the reasons why this model is so fast.

The first version of YOLO had some limitations. It divided the image into a 7×7 grid and, despite generating two bounding boxes per grid cell, each cell could detect only a single object, so the number of detections was limited. This meant that if objects were close together and shared the same grid cell, the algorithm could detect only one of them.
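The assignment of objects to grid cells, and the resulting limitation, can be sketched as follows. The 448×448 image size, the sample boxes and the function name are illustrative assumptions; only the 7×7 grid comes from the description above.

```python
def responsible_cell(box, image_width, image_height, s=7):
    """(row, col) of the S x S grid cell containing the center of `box`.

    `box` is (x1, y1, x2, y2) in pixels.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Clamp so that a center on the far border still maps to the last cell.
    col = min(int(cx * s / image_width), s - 1)
    row = min(int(cy * s / image_height), s - 1)
    return row, col


# Two nearby objects in a 448 x 448 image (64-pixel cells for S = 7).
bird_a = (100, 100, 120, 120)      # center (110, 110)
bird_b = (105, 112, 125, 126)      # center (115, 119)
far_object = (300, 50, 400, 150)   # center (350, 100)

cell_a = responsible_cell(bird_a, 448, 448)
cell_b = responsible_cell(bird_b, 448, 448)
cell_far = responsible_cell(far_object, 448, 448)
```

Both nearby objects map to the same cell, so a detector that allows one object per cell can report only one of them, while the distant object falls into a different cell and is unaffected.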

The network outputs a class probability and offset values for each bounding box. The bounding boxes whose class probability is above a threshold are selected and used to locate the object within the image. The confidence is zero when a grid cell contains no object; when the cell contains part of an object, the confidence reflects the intersection over union, which is non-zero. This value helps to avoid detecting background and also helps to produce the final bounding box.

The CNN used by YOLO optimizes a loss function that incorporates localization, confidence and classification terms. The network has 24 convolutional layers.

For YOLOv2 the researchers focused mainly on improving recall and mAP. To achieve this they introduced several improvements: batch normalization, a higher-resolution classifier and convolutional anchor boxes, which allow several objects to be detected in the same grid cell.

To obtain higher accuracy, the researchers made YOLOv3 bigger than the previous version. YOLOv3 applies logistic regression to the overlapping bounding boxes to obtain their confidence. Another novelty was support for multilabel classification, e.g. a cat is also an animal. This version has a 53-layer network, which is a hybrid between the network of the previous version and a residual network architecture.

As if it were a fireworks competition, YOLOv4 introduced a combination of many techniques, among them: weighted residual connections (WRC), cross-stage partial connections (CSP), cross mini-batch normalization (CmBN), self-adversarial training (SAT), Mish activation, mosaic data augmentation, DropBlock regularization and CIoU loss [14].

YOLO is orders of magnitude faster (45 frames per second) than other object detection algorithms. Its limitation is that it struggles with small objects within the image; for example, it might have difficulty detecting a flock of birds. This is due to the spatial constraints of the algorithm.

3 Materials and Methods

3.1 RGBD datasets for object detection

In the dataset considered for this work, the bounding boxes provided with the data are not complete. That is, a given image may contain all three objects to detect while the bounding box information for that image registers only one of them.

Given this constraint, it was decided to build one model for each object, that is, one versus all for each object. This differs from the usual approach in the current world of CNNs and large datasets, but it made it possible to appreciate even more the achievements of the last decade in the field of computer vision.

Thus, the method described in the following sections is applied to each of the three classes to be detected in the images: one model for each class, in both the SVM and the CNN experiments.

Additionally, it was decided not to resize the images, because every image in the dataset contains the object at a very similar size, so resizing was considered counterproductive.


3.2 Tools

3.2.1 Software

• MATLAB

• VLFeat: Cross-platform open source collection of vision algorithms with a special focus on visual features and clustering. It bundles a MATLAB toolbox, a clean and portable C library and a number of command line utilities.

• Python 3, especially the PyTorch libraries.

3.2.2 Hardware

For the experimental evaluation, datasets captured with the following sensors were used:

• Primesense Carmine 1.09 (Short Range)

• Microsoft Kinect v2

• Canon Digital IXUS 950 IS

3.3 Traditional approach

This part describes how the dataset was handled with an approach that is already a decade old. Traditionally, the first step is image preprocessing, standardizing the input image. The HOG descriptor explained previously is then computed and the image is classified; this process is repeated before non-maximum suppression is applied and the test results are obtained.

Several experiments were carried out. The main one is described at the beginning of its corresponding section; the less successful ones, which also show what is not worth trying, are described right after it.

3.3.1 Featuring the data

Since some publications report that certain detectors do not work properly for texture-less objects, HOG was chosen, because the version used provides information about both directed and undirected gradients in the object.


3.3.2 Training

To train the classifier, two crops of 100×100 pixels were taken from every image in the training set. One crop was positive: it contained the object to detect, with noise of 25% added to its bounding box position. The other crop was negative: it did not contain the class to be detected.
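One possible reading of the positive-crop sampling is sketched below. It assumes that the 25% noise is a uniform shift of the crop center by up to a quarter of the crop size in each axis, and that images are 640×480 pixels; both are assumptions made for illustration, as are the function name and the sample bounding box.

```python
import random


def jittered_crop(box, image_w, image_h, size=100, jitter=0.25, rng=random):
    """Top-left corner of a size x size crop near the center of `box`.

    The crop center is shifted by up to `jitter` * size in x and y,
    then clamped so that the crop stays inside the image.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    cx += rng.uniform(-jitter, jitter) * size
    cy += rng.uniform(-jitter, jitter) * size
    x0 = int(min(max(cx - size / 2.0, 0), image_w - size))
    y0 = int(min(max(cy - size / 2.0, 0), image_h - size))
    return x0, y0


rng = random.Random(0)                # fixed seed for reproducibility
box = (300, 200, 380, 290)            # hypothetical ground-truth box
x0, y0 = jittered_crop(box, 640, 480, rng=rng)
```

The jitter exposes the classifier to slightly misaligned views of the object, which resembles what it encounters when candidate windows at test time are not centered on the object.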

3.3.3 Test

To avoid the sliding-window technique as much as possible, a thousand randomly placed crops were taken from each image and each was classified into one of the classes, that is, ape, cat, duck or background.
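The random sampling of test crops can be sketched as follows, assuming 100×100 crops and 640×480 images; the function name, the seed and the image size are illustrative assumptions, and the classification of each crop is left out.

```python
import random


def random_windows(image_w, image_h, n=1000, size=100, seed=0):
    """Top-left corners of n randomly placed size x size crops inside the image."""
    rng = random.Random(seed)
    return [(rng.randint(0, image_w - size), rng.randint(0, image_h - size))
            for _ in range(n)]


windows = random_windows(640, 480)
# Each window would then be described with HOG and passed to the classifier.
```

Compared with an exhaustive sliding window, a fixed budget of random crops bounds the number of classifier evaluations per image, at the cost of possibly missing the best-aligned position.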

3.4 Faster Region-based CNN

3.4.1 Initialization

The initial weights came from a network pretrained on the COCO dataset.

3.4.2 Training

Training was performed for just 10 epochs; thanks to the pretrained weights, the network did not need any longer to converge.

3.4.3 Test

To conclude, and to verify that the model was working properly, the predicted bounding boxes were drawn on the resulting test images and the results were inspected.

4 Results and comparisons

4.1 Quantitative results

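The tables show, for each class and method, the average precision (AP) and average recall (AR) achieved, taking the intersection over union with the ground-truth bounding boxes as the relevant measure. A value of -1.000 indicates that no ground-truth objects fall into the corresponding area range.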

(19)

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

(AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.935 (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000 (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 1.000 (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.935 (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.958 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.958 (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.958 (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.958 (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 1: Faster-RCNN Ape RGB

(AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.930 (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000 (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.990 (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.930 (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.955 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.955 (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.955 (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.955 (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 2: Faster-RCNN Ape RGBD

(AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 3: SVM Ape RGB

(20)

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

Average Precision Average Precision Average Precision Average Precision Average Precision Average Precision Average Recall Average Recall Average Recall Average Recall Average Recall Average Recall

(AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000 (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000 (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 4: SVM Ape RGBD

(AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.947 (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000 (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 1.000 (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.945 (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.978 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.968 (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.968 (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.968 (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000 (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.966 (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.984

Table 5: Faster-RCNN Cat RGB

Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.946
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.945
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.963
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.967
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.967
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.967
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.966
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.979

Table 6: Faster-RCNN Cat RGBD


Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.007
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.033
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.066
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.006
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.038
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.081
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.082
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.064
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.289

Table 7: SVM Cat RGB

Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.006
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.035
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.066
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.003
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.033
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.079
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.080
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.063
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.274

Table 8: SVM Cat RGBD

Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.930
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.930
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.953
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.953
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.953
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.953
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 9: Faster-RCNN Duck RGB


Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.932
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.932
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.954
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.954
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.954
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.954
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 10: Faster-RCNN Duck RGBD

Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.015
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.003
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.013
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.013
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.013
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 11: SVM Duck RGB

Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.017
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.004
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
Average Recall    (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.015
Average Recall    (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall    (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.015
Average Recall    (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

Table 12: SVM Duck RGBD
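
The tables above follow the COCO-style evaluation summary. As a reading aid, the following is a minimal sketch of the quantities behind those rows; it is not the evaluation code used in this thesis. It assumes boxes in (x, y, width, height) format, and the helper names (iou, area_category, mean_or_minus_one) are hypothetical. The size buckets use the usual COCO pixel-area limits of 32² and 96², with boundary handling simplified. The last helper illustrates the convention by which a row is reported as -1.000 when there is nothing to average, for instance when no ground-truth object falls in that size bucket.

```python
# Illustrative sketch only: NOT the evaluation code used in the thesis.
# Boxes are assumed to be (x, y, width, height) tuples in pixels.

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap, clamped to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds behind "IoU=0.50:0.95" (0.50, 0.55, ..., 0.95).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def area_category(box):
    """Size bucket of a box by pixel area (boundaries simplified)."""
    area = box[2] * box[3]
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def mean_or_minus_one(values):
    """Average of the values, or -1.0 when the list is empty."""
    return sum(values) / len(values) if values else -1.0
```

Under these assumptions, a detection counts as correct at a given threshold only if its IoU with a ground-truth box reaches that threshold, so the "IoU=0.50:0.95" rows are stricter than the "IoU=0.50" rows.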


Figure 2: left column: Faster-RCNN; right column: SVM; top row: RGB; bottom row: RGBD


Figure 3: left column: Faster-RCNN; right column: SVM; top row: RGB; bottom row: RGBD


Figure 4: left column: Faster-RCNN; right column: SVM; top row: RGB; bottom row: RGBD


4.2 Qualitative results

5 Conclusions and future work

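Unsurprisingly, Faster-RCNN beat the SVM by a tremendous margin: on the Cat and Duck objects its AP at IoU=0.50:0.95 lies between 0.930 and 0.947 (Tables 5, 6, 9 and 10), while the SVM never exceeds 0.007 (Tables 4, 7, 8, 11 and 12). Running these experiments gave me a broader understanding of why the shift towards CNNs happened over the last decade.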

Although the comparison was very interesting, it was also arduous work, because it was not easy to compare both methods on a dataset that was probably designed for testing one-versus-all models.

That was one of the main reasons why YOLO was not also tested on this dataset.

In conclusion, while doing this work I realized how much the computer vision world has changed over recent years, and probably few people can foresee what it will look like in another decade.


References

[1] Hodaň, Tomáš. Pose Estimation of Specific Rigid Objects. PhD Thesis, Czech Technical University in Prague, 2021.

[2] Hinterstoisser, Stefan, et al. Gradient response maps for real-time detection of textureless objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 34.5 (2011): 876-888.

[3] F. Huang, S. Huang, J. Ker and Y. Chen, "High-Performance SIFT Hardware Accelerator for Real-Time Image Feature Extraction," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 3, pp. 340-351, March 2012, doi: 10.1109/TCSVT.2011.2162760.

[4] T. Hodaň, J. Matas. Texture-less object detection. PhD thesis proposal. Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic. September 2015.

[5] Murphy, Kevin P. "Machine learning: a probabilistic perspective." MIT Press, 2012.

[6] Hinton, Geoffrey E. "How Neural Networks Learn from Experience." Scientific American, vol. 267, no. 3, 1992, pp. 144–151. JSTOR, www.jstor.org/stable/24939221. Accessed 19 June 2021.

[7] T. Hodaň, X. Zabulis, M. Lourakis, Š. Obdržálek, J. Matas. Detection and Fine 3D Pose Estimation of Texture-less Objects in RGB-D Images. International Conference on Intelligent Robots and Systems (IROS) 2015, Hamburg, Germany.

[8] Mingxing Tan, Ruoming Pang, Quoc V. Le. EfficientDet: Scalable and Efficient Object Detection. arXiv:1911.09070v7, July 2020.

[9] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861v1, April 2017.

[10] A. Collet, M. Martinez, S. Srinivasa. The MOPED framework: object recognition and pose estimation for manipulation. I. J. Robotic Res., vol. 30, no. 10, pp. 1284–1306, 2011.

[11] T. Tuytelaars, K. Mikolajczyk. Local invariant feature detectors: A survey. Found. Trends. Comput. Graph. Vis., vol. 3, no. 3, pp. 177–280, July 2008.

[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, November 1998.


[13] Tan, M. & Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:6105-6114, http://proceedings.mlr.press/v97/tan19a.html, 2019.

[14] Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934, 2020.

[15] H. Cai, T. Werner, and J. Matas. Fast detection of multiple textureless 3-D objects. ICVS, volume 7963 of LNCS, pages 103–112. 2013.

[16] F. Tombari, A. Franchi, L. Di, BOLD features to detect textureless objects. ICCV, 2013, pp. 1265–1272.

[17] N. Dalal, B. Triggs. Histograms of oriented gradients for human detection. International Conference on Computer Vision & Pattern Recognition (CVPR '05), Jun 2005, San Diego, United States. IEEE Computer Society, 1, pp. 886–893, 2005. doi: 10.1109/CVPR.2005.177.

[18] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. ACCV, 2012.

[19] Tan, Mingxing, and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR, 2019.

[20] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient response maps for real-time detection of textureless objects. IEEE PAMI, 2012.

[21] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. Model globally, match locally: Efficient and robust 3D object recognition. CVPR, pages 998–1005, 2010.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (June 2017), 84–90. DOI: https://doi.org/10.1145/3065386

[23] E. Brachmann, A. Krull, F. Michel, S. Gumhold, and J. Shotton. Learning 6D object pose estimation using 3D object coordinates. ECCV, 2014.

[24] Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.

[25] E. Brachmann, A. Krull, F. Michel, S. Gumhold, and J. Shotton. Learning 6D object pose estimation using 3D object coordinates. ECCV, 2014.


[26] D. Held, S. Thrun, S. Savarese. Deep learning for single-view instance recognition. Stanford University. arXiv:1507.08286v1 [cs.CV], 29 Jul 2015.

[27] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2009.

[28] Jan Fischer, R. Bormann, G. Arbeiter and A. Ver. A feature descriptor for texture-less object representation using 2D and 3D cues from RGB-D data. 2013 IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, May 6-10, 2013.

[29] Robert K. McConnell. Method of and apparatus for pattern recognition. United States Patent, number 4,567,610. Jan. 28, 1986.

[30] William T. Freeman, Michal Roth. Orientation Histograms for Hand Gesture Recognition. Tech. Rep. TR94-03, Mitsubishi Electric Research Laboratories, Cambridge, MA. Dec, 1994.

[31] D.G. Lowe. Object recognition from local scale-invariant features. In ICCV, volume 2, pages 1150–1157, 1999.

[32] P. Wohlhart, V. Lepetit. Learning descriptors for object recognition and 3D pose estimation. arXiv preprint arXiv:1502.05908, 2015.
