
regation)

The NORB benchmark introduced in Section 4.2.1 focuses on pose and lighting continual learning and, unlike the original NORB protocol, it does not split the objects of each class into two disjoint groups (5 objects in the training set and 5 objects in the test set).

Figure B.2: HTM and CNN incremental tuning accuracy, when splitting class objects as in the original NORB protocol (for each class: 5 objects in the training set and 5 in the test set). No mindist is necessary here between test and training batches because of the object segregation.

Our choice was aimed at isolating the capability of learning pose and invariance from the capability of recognizing different objects of the same class (which is critical in NORB because of the small number of objects per class). However, to further validate the efficacy of the proposed continual tuning, here we go back to the native object segregation and report the results corresponding to those of Section 5.3.1 under this scenario. Figure B.2 shows HTM and CNN accuracy for different tuning strategies. We observe that:

• The trend is very similar to the Section 5.3.1 experiments: even in this case, supervised strategies work well for both HTM and CNN, while semi-supervised tuning is effective for HTM but not for our CNN implementation.

• The accuracy achieved is markedly lower with respect to Section 5.3.1, but is in line with results reported above if we consider the number of training samples and the forgetting effect due to continual learning.

Appendix C

Adapting Pre-trained CNN to Different Input Size

In recent years, the pervasiveness of deep neural networks and the complexity of training such architectures on very large datasets have led to the proliferation of pre-trained models, which are a very good starting point for many customized solutions. However, this approach requires adapting problem-specific data to a fixed-size architecture that was designed and optimized to solve another task. In computer vision and object recognition, for example, it is very common to stretch images of arbitrary size to 227×227 pixels, the typical input of well-known CNN models pre-trained on ImageNet: this often highly distorts the original patterns and significantly increases inference time. A more elegant (and efficient) approach is to adapt the pre-trained model to work with input patterns of a different size. This is straightforward for convolution and pooling layers thanks to their local (shared) connections, but is much more problematic for fully connected layers, whose number of weights depends on the input image size. In this case, two main strategies can be used:

1. Applying fixed size pooling (global or pyramidal) over the last convolution/pooling layer as proposed in [He et al., 2015; Lin et al., 2014; Ren et al., 2017]. However, finetuning of upper levels might be necessary if the input scale changes dramatically or the original model was not designed with a fixed-size pooling layer at all.

2. Reusing the pre-trained network up to the last convolution layer and retraining the fully connected layers from scratch on the new task and input size. A typical approach is also to train an external classifier (e.g., SVM) from pooled features just after the last convolutional layer.

Independently of the network adaptation to a different input size, when the problem classes change, the final softmax layer needs to be replaced and re-trained from scratch.

Since in our experiments we used the classic CaffeNet and VGG models (which have not been trained in a multi-scale fashion) and we aimed at fast processing, we opted for the second strategy.
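The dependence of the fully connected layers on the input size can be illustrated with a small calculation. The sketch below assumes the standard CaffeNet layer geometry (kernel sizes, strides and padding are not given in this appendix) and the Caffe convention of rounding down for convolution and rounding up for pooling; all function and variable names are hypothetical.

```python
import math

# (name, kind, kernel, stride, pad): assumed standard CaffeNet geometry,
# not stated in the text.
CAFFENET_LAYERS = [
    ("conv1", "conv", 11, 4, 0),
    ("pool1", "pool", 3, 2, 0),
    ("conv2", "conv", 5, 1, 2),
    ("pool2", "pool", 3, 2, 0),
    ("conv3", "conv", 3, 1, 1),
    ("conv4", "conv", 3, 1, 1),
    ("conv5", "conv", 3, 1, 1),
    ("pool5", "pool", 3, 2, 0),
]
POOL5_CHANNELS = 256  # assumed number of pool5 feature maps


def out_size(size, kind, kernel, stride, pad):
    """Spatial output side of one layer for a square input."""
    span = size + 2 * pad - kernel
    if kind == "conv":
        return span // stride + 1        # convolution rounds down
    return math.ceil(span / stride) + 1  # pooling rounds up


def pool5_side(input_side):
    """Side of the pool5 feature maps for a square input image."""
    side = input_side
    for _name, kind, kernel, stride, pad in CAFFENET_LAYERS:
        side = out_size(side, kind, kernel, stride, pad)
    return side


def fc6_inputs(input_side):
    """Number of inputs seen by each fc6 unit."""
    return POOL5_CHANNELS * pool5_side(input_side) ** 2


for side in (227, 128):
    print(side, pool5_side(side), fc6_inputs(side))
```

Under these assumptions the convolutional stack accepts both input sizes unchanged, while the number of inputs to fc6 changes with the input side, which is why the pre-trained fc6 weights cannot be reused directly.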


Table C.1: Accuracy differences between reduced-size CNNs (Mid) and the corresponding full-size models on the 50 classes task. All the models have been pre-trained on ILSVRC-2012. SVM training and CNN fine-tuning were performed on CORe50.

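Accuracy (object level: 50 classes), CNN + SVM on top of fc6 or pool5

| Row | Model        | fc6    | pool5  |
|-----|--------------|--------|--------|
| 1   | CaffeNet     | 63.46% | 63.14% |
| 2   | Mid-CaffeNet | -      | 52.84% |
| 3   | VGG          | 69.03% | 70.91% |
| 4   | Mid-VGG      | -      | 59.25% |

Accuracy (object level: 50 classes), CNN + Finetuning

| Row | Model        | Accuracy |
|-----|--------------|----------|
| 5   | CaffeNet     | 75.97%   |
| 6   | Mid-CaffeNet | 65.98%   |
| 7   | VGG          | 77.39%   |
| 8   | Mid-VGG      | 69.08%   |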

Hence, we reshaped the input volume to 3×128×128, halved the number of units in the fully connected layers fc6 and fc7 (from 4096 to 2048) and re-trained them from scratch. This results in a substantial speedup at inference time (3.4× for CaffeNet and 4.67× for VGG). The resulting mid-size models are now suitable to be tuned on CORe50 native 128×128 frames.

Table C.1 summarizes our findings. For the full-size models, extracting features from fc6 or pool5 is nearly equivalent in terms of accuracy (compare columns fc6 and pool5 for rows 1 and 3 in the table), so the lack of a fully pre-trained fc6 in the mid-size models is not critical. However, in the experiments with SVM (rows 1-4), the mid-size networks lose about 10 percentage points of accuracy with respect to their original versions. A similar gap (just slightly smaller for VGG) can be observed when the networks are finetuned (rows 5-8). The reason for this accuracy drop is not totally clear to us. On the one hand, if we consider the finetuning experiments (rows 5-8), fc6 and fc7 have been pre-trained on a higher number of patterns in the full-size networks, and therefore it is reasonable to expect higher accuracy; on the other hand, if we consider the pool5 + SVM experiments, both networks exploit the same pre-training, and stretching our input patterns from 128×128 to 227×227 does not (in principle) add new information.
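The size of the gap can be checked directly from the figures in Table C.1. The short sketch below recomputes the differences between each full-size model and its mid-size counterpart; it assumes that the mid-size SVM results are to be compared with the pool5 column of the full-size models, and the dictionary and function names are hypothetical.

```python
# Accuracies (%) copied from Table C.1.
# Assumption: mid-size SVM results are compared with the full-size pool5 column.
svm_pool5 = {"CaffeNet": 63.14, "Mid-CaffeNet": 52.84,
             "VGG": 70.91, "Mid-VGG": 59.25}
finetuned = {"CaffeNet": 75.97, "Mid-CaffeNet": 65.98,
             "VGG": 77.39, "Mid-VGG": 69.08}


def gap(table, model):
    """Accuracy lost (in percentage points) by the mid-size version of `model`."""
    return round(table[model] - table["Mid-" + model], 2)


gaps = {
    (setting, model): gap(table, model)
    for setting, table in (("svm", svm_pool5), ("finetuning", finetuned))
    for model in ("CaffeNet", "VGG")
}

for key, value in gaps.items():
    print(key, value)
```

With these inputs the gaps are 10.3 and 11.66 points for the SVM experiments and 9.99 and 8.31 points for finetuning, consistent with the statement that the finetuning gap is slightly smaller for VGG.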

We did similar experiments on other datasets (e.g., NORB, COIL-100, BigBrother, iCubWorld32) and obtained close results: it seems that the zoomed image, even if blurred, allows the network to perform a more detailed feature extraction. This may be due to the spatial scale of the filters learned on ILSVRC-2012 or to a richer hierarchical representation (more neurons and links between neurons cover the object region). We believe that more investigations are necessary to fully understand the reasons and to make available pre-trained mid-size networks (for patterns whose native size is close to 128×128) which are competitive with full-size ones.

Appendix D

Single-Incremental-Tasks Experiments Details

D.1 Implementation Details (Caffe framework)

Since implementing dynamic output layer expansion was tricky in the Caffe framework, we initially implemented the different strategies by using a single maximal head (i.e., including all the problem classes from the beginning) instead of an expanding head. In principle, the two approaches are quite similar, since if a particular batch does not contain patterns of a given class, no relevant error signals are sent back along the corresponding connections during SGD. However, looking at the details of the training process, the two approaches are not exactly the same.

For example, for CWR+ we verified, with some surprise, that the simplifying maximal head approach consistently leads to better accuracy (up to 6-7% on CORe50) with respect to the expanding head approach. We empirically found that the reason is related to the gradient dynamics during the initial learning iterations: working with a higher number of classes makes initial predictions smaller (because of softmax normalization) and the gradient correction for the true class stronger; in a second stage, predictions start to converge and the gradient magnitude is equivalent in the two approaches. It seems that for SGD learning (with fixed learning rate) boosting the gradient in the first iterations favors accuracy and reduces forgetting. We checked this by experimentally verifying that the expanding head approach combined with a variable learning rate performs similarly to maximal head with fixed learning rate. Therefore, to maximize accuracy and reduce complexity, CWR and its evolutions (CWR+ and AR1) have been implemented with the maximal output layer approach. Referring to the pseudocode in Algorithms 1, 2 and 3, it is sufficient to keep the output layer at its constant maximum size (e.g., 50 for CORe50) and remove the line “expand output layer with...”.
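The effect of softmax normalization on the initial gradient can be illustrated numerically. The sketch below is only an illustration under simplifying assumptions: all logits start at zero, so every class receives probability 1/C, and the head sizes (10 classes for the expanding head, 50 for the maximal head) are hypothetical examples rather than the exact experimental setting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def true_class_grad(num_classes, true_class=0):
    """Gradient of the cross-entropy loss w.r.t. the true-class logit,
    assuming all logits are initially zero (uniform predictions)."""
    probs = softmax([0.0] * num_classes)
    return probs[true_class] - 1.0


expanding = true_class_grad(10)  # hypothetical head with 10 classes seen so far
maximal = true_class_grad(50)    # head sized for all 50 CORe50 classes

print(expanding, maximal)
```

With more output units the initial true-class probability is smaller (1/50 instead of 1/10), so the magnitude of the correction on the true-class logit is larger, which matches the qualitative explanation given above.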

For the other approaches we verified that LWF performs slightly better with the expanding head approach, while EWC and SI work better (and are easier to tune) with the maximal head. To produce the results presented in Section 5.1.4 we used for each strategy the approach that proved to be the most effective. Strategy-specific notes on the Caffe implementation are reported in the following.

LWF It is worth noting that in Caffe a cross-entropy loss layer accepting soft target vectors is not available in the standard layer catalogue, so a custom loss layer needs to be created.
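As a reference for what such a layer has to compute, the following framework-independent sketch evaluates the cross-entropy between a softmax output and a soft target vector, together with its gradient with respect to the logits. It is not the custom Caffe layer used in the experiments; function names are hypothetical, and the gradient formula assumes that the soft targets sum to one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy with soft targets and its gradient w.r.t. the logits.

    Assumes soft_targets is a probability vector (non-negative, sums to 1).
    """
    probs = softmax(logits)
    # Terms with zero target weight contribute nothing to the loss.
    loss = -sum(t * math.log(p) for t, p in zip(soft_targets, probs) if t > 0)
    # Forward = loss, backward = softmax(logits) - targets.
    grad = [p - t for p, t in zip(probs, soft_targets)]
    return loss, grad


loss, grad = soft_cross_entropy([2.0, 0.5, -1.0], [0.7, 0.2, 0.1])
print(loss, grad)
```

With a one-hot target the same function reduces to the usual cross-entropy, which is a convenient sanity check for a custom layer.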

EWC To implement EWC in Caffe we:

• compute, average and clip Fi values in pyCaffe (for maximum flexibility). To calculate Fki the variance of the gradient should be computed by taking the gradient of each of the ni patterns in isolation. To speed up the implementation and improve efficiency we computed the variance at mini-batch level, that is, using the average gradients over mini-batches. In our experiments we did not notice any performance drop even when using mini-batches of 256 patterns.

• pass F and Θ* to the solver via a further input layer.

• modify the SGD solver by adding a custom regularization term that performs EWC regularization in weight-decay style.
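The three steps above can be sketched in a framework-independent way. This is a minimal illustration, not the pyCaffe code or the modified solver: it assumes the common diagonal Fisher estimate (mean of squared gradients, here taken over mini-batch-averaged gradients), a hypothetical clipping threshold, and the standard EWC penalty (λ/2) Σ F (Θ − Θ*)², whose gradient is added to the loss gradient at the point where weight decay would normally be applied. All names and parameters are hypothetical.

```python
def fisher_from_minibatches(minibatch_grads, clip_max):
    """Diagonal Fisher estimate from mini-batch-averaged gradients.

    minibatch_grads: list of gradient vectors, one per mini-batch.
    clip_max: hypothetical upper bound applied to each Fisher value.
    """
    n = len(minibatch_grads)
    dim = len(minibatch_grads[0])
    fisher = [sum(g[k] ** 2 for g in minibatch_grads) / n for k in range(dim)]
    return [min(f, clip_max) for f in fisher]


def ewc_sgd_step(theta, loss_grad, fisher, theta_star, lr, lam):
    """One SGD step with the EWC term added in weight-decay style.

    The penalty gradient for each weight is lam * F * (theta - theta_star).
    """
    return [
        t - lr * (g + lam * f * (t - ts))
        for t, g, f, ts in zip(theta, loss_grad, fisher, theta_star)
    ]


# Toy usage with two weights and two mini-batches.
fisher = fisher_from_minibatches([[1.0, 0.0], [3.0, 0.0]], clip_max=4.0)
theta = ewc_sgd_step(theta=[1.0, 1.0], loss_grad=[0.0, 0.0], fisher=fisher,
                     theta_star=[0.0, 0.0], lr=0.1, lam=1.0)
print(fisher, theta)
```

In the toy usage the first weight has a high Fisher value and is pulled back towards its previous optimum, while the second weight, with zero Fisher value, is left free to change.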

SI Starting from the EWC implementation, SI can be easily set up in Caffe: the regularization stage is the same and we only need to compute Fi values during SGD. To this purpose, in current

