Convolutional Neural Networks (CNNs) form the basis of most current deep learning architectures, especially in computer vision. In contrast to the units of fully connected layers, CNNs employ convolutional filters to create sparse connections between layers. As an example, consider the model introduced in [4], commonly named “AlexNet”, whose architecture is illustrated in Figure 2.9. Designed for image processing, its first convolutional layer employs kernels (illustrated in green) that assess only a 5 × 5 region of the input image at a time. Each position in the kernel corresponds to a learnable weight that multiplies the corresponding input value; as in any other neural unit, the resulting weighted sum is then passed through an activation function (typically a ReLU). Thus, the convolution operation outputs a single activation value for each evaluated input region. Similar to its signal processing counterpart, the convolution operation in a CNN evaluates all regions of the input by sliding the kernel over the input data points, with a step size given by a predefined stride value.
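The sliding-kernel operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in [4]: the function names `conv2d` and `relu`, the 7 × 7 toy image, the averaging kernel and the stride of 2 are all assumptions chosen for readability, and no padding is applied. As in most deep learning frameworks, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
def relu(value):
    """Rectified linear unit applied to a single number."""
    return value if value > 0.0 else 0.0


def conv2d(image, kernel, stride=1, bias=0.0):
    """Slide `kernel` over `image` (no padding) and apply ReLU at each location."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for top in range(0, n_rows - k_rows + 1, stride):
        out_row = []
        for left in range(0, n_cols - k_cols + 1, stride):
            # Weighted sum over the region currently covered by the kernel.
            total = bias
            for u in range(k_rows):
                for v in range(k_cols):
                    total += kernel[u][v] * image[top + u][left + v]
            # One activation value per evaluated input region.
            out_row.append(relu(total))
        output.append(out_row)
    return output


# Hypothetical 7x7 single-channel image and a 5x5 averaging kernel.
image = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
kernel = [[1.0 / 25.0] * 5 for _ in range(5)]

feature_map = conv2d(image, kernel, stride=2)
print(len(feature_map), len(feature_map[0]))  # spatial size of the output
```

With an input of side n, a kernel of side k and stride s, the output side is floor((n - k) / s) + 1, which gives 2 for the values assumed here.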
Importantly, the kernel weights are shared across all input locations, i.e., the same weight values are used to evaluate the whole input. This strategy provides CNNs with the ability to learn representations that are invariant to translation, a powerful property for describing sets of complex descriptors, as detailed in the next paragraphs. As Figure 2.10 illustrates, the collection of kernel outputs at the different input locations generates new representations commonly referred to as feature maps.
Figure 2.9: Diagram illustrating the architecture of the model introduced in [4], now widely known as AlexNet.
Figure 2.10: Diagram illustrating the typical composition of a convolutional layer, combining a convolutional kernel, followed by an activation function (ReLU) and a max-pooling layer.
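The effect of weight sharing can be checked on a toy case: because the same kernel is applied at every location, a pattern that moves in the input produces the same response at a correspondingly moved position in the feature map. The sketch below assumes a hypothetical 3 × 3 cross pattern, an 8 × 8 canvas and a kernel equal to the pattern itself; the helper names `feature_map`, `place` and `peak_location` are illustrative only.

```python
def feature_map(image, kernel):
    """Stride-1 convolution without padding, followed by ReLU."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            max(0.0, sum(kernel[u][v] * image[i + u][j + v]
                         for u in range(k) for v in range(k)))
            for j in range(size)
        ]
        for i in range(size)
    ]


def place(pattern, top, left, size=8):
    """Draw `pattern` on an empty square canvas at the given offset."""
    canvas = [[0.0] * size for _ in range(size)]
    for u, row in enumerate(pattern):
        for v, value in enumerate(row):
            canvas[top + u][left + v] = value
    return canvas


def peak_location(fmap):
    """Row and column of the strongest activation."""
    _, row, col = max((value, i, j)
                      for i, line in enumerate(fmap)
                      for j, value in enumerate(line))
    return row, col


cross = [[0.0, 1.0, 0.0],
         [1.0, 1.0, 1.0],
         [0.0, 1.0, 0.0]]
kernel = cross  # the single shared set of 9 weights

peak_a = peak_location(feature_map(place(cross, 1, 2), kernel))
peak_b = peak_location(feature_map(place(cross, 4, 3), kernel))
print(peak_a, peak_b)  # the peak follows the pattern
```

The same nine weights detect the cross wherever it appears, so no location-specific parameters have to be learned. Note that in this sketch the response moves along with the pattern; the subsequent pooling stages are what reduce the sensitivity to its exact position.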
Pooling
After each convolutional layer, most CNNs employ a pooling layer. Pooling is another type of kernel-based operation which, instead of using learnable weights, transforms multiple input values into a single output by means of a fixed operation such as averaging or taking the maximum, the latter being the most commonly used in CNNs. Figure 2.10 illustrates the typical composition of a convolutional layer, where a learnable kernel is followed by a ReLU layer and a max-pooling operator.
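As a minimal sketch of the pooling operation, the function below slides a fixed window over a feature map and reduces each window with a parameter-free operation. The name `pool2d`, the 2 × 2 window with stride 2 and the toy 4 × 4 feature map are assumptions made for illustration.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Reduce each size x size window of `feature_map` to one value using `op`."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for top in range(0, rows - size + 1, stride):
        out_row = []
        for left in range(0, cols - size + 1, stride):
            window = [feature_map[top + u][left + v]
                      for u in range(size) for v in range(size)]
            out_row.append(op(window))  # fixed operation, no learnable weights
        output.append(out_row)
    return output


def mean(values):
    return sum(values) / len(values)


fmap = [[1.0, 3.0, 2.0, 0.0],
        [4.0, 2.0, 1.0, 1.0],
        [0.0, 0.0, 5.0, 6.0],
        [1.0, 2.0, 7.0, 8.0]]

max_pooled = pool2d(fmap, op=max)   # [[4.0, 2.0], [2.0, 8.0]]
avg_pooled = pool2d(fmap, op=mean)  # [[2.5, 1.0], [0.75, 6.5]]
```

Both variants halve each spatial dimension here; they differ only in the fixed reduction applied inside each window.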
These operations are employed to increase the model’s receptive field, i.e., to evaluate increasingly larger regions of the input. Intuitively, relying solely on small kernels (e.g., 5 × 5 pixels) would lead to models only capable of learning local feature descriptors, insufficient to characterize complex, larger structures. However, increasing the kernel size implies more weight parameters to be learned, which in turn tends to make training infeasible as the training data requirements increase. Thus, pooling operations serve as a mechanism to downsample the input representations and allow the evaluation of increasingly larger receptive fields without adding parameters.
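The growth of the receptive field can be worked out with simple arithmetic. The sketch below uses the standard recurrence for stacked layers without dilation: each layer enlarges the receptive field by (kernel size - 1) times the cumulative stride of the layers before it. The layer configurations (three 5 × 5 convolutions, with or without 2 × 2 pooling of stride 2 in between) are hypothetical and do not correspond to any specific model in the text.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per side) of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    field, jump = 1, 1  # `jump` is the cumulative stride seen so far
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


conv = (5, 1)  # 5x5 kernel, stride 1
pool = (2, 2)  # 2x2 pooling window, stride 2

without_pooling = receptive_field([conv, conv, conv])
with_pooling = receptive_field([conv, pool, conv, pool, conv])
print(without_pooling, with_pooling)  # 13 32

# Learnable weights per kernel (single channel, biases ignored).
weights_three_small_kernels = 3 * 5 * 5                     # 75
weights_one_large_kernel = with_pooling * with_pooling      # 1024
```

Under these assumptions, interleaving pooling lets three 5 × 5 kernels cover a 32 × 32 input region with 75 weights, whereas a single kernel spanning the same region directly would need 1024.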
Hierarchical features
The combination of multiple convolutional layers and downsampling techniques gives deep CNNs their extraordinary ability to learn hierarchical features. These features are a key factor in the success of such models in comparison to the earlier hand-engineered methods described in Section 2.4.3 [41]. As described in [42], the convolutional layers C1-C2 in Figure 2.9 learn to identify low-level features such as corners and other edge/color combinations. The following layers C3-C5 combine this low-level information into more complex structures, such as motifs, object parts and finally entire objects.
Data augmentation, fine-tuning, transfer learning
Traditional deep CNNs are composed of millions of parameters: as an example, the early AlexNet [4] contained 60 million parameters. Thus, although many large publicly available datasets have been introduced, gathering enough domain-specific data to train such deep models is a daunting task. One way to reduce the required amount of labeled data is data augmentation, a technique that benefits the training of many types of machine learning models. Data augmentation is typically performed by applying transformations such as translation, rotation and color space shifts to already labeled data, as illustrated in Figure 2.11 for an image from the PASCAL dataset [43].
Figure 2.11: Examples of common data augmentation strategies in computer vision, using an image from the PASCAL dataset [43] for illustration.
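The following sketch shows how a single labeled image can be expanded into several training samples that all keep the original label. It is an illustration under simplifying assumptions: the image is a tiny single-channel grid, rotation is restricted to 90 degrees, and a clamped intensity shift stands in for a color space shift. The function names and the label "cat" are hypothetical.

```python
def translate(image, dy, dx, fill=0):
    """Shift the image by (dy, dx) pixels, filling uncovered pixels with `fill`."""
    rows, cols = len(image), len(image[0])
    output = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            new_r, new_c = r + dy, c + dx
            if 0 <= new_r < rows and 0 <= new_c < cols:
                output[new_r][new_c] = image[r][c]
    return output


def rotate90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def shift_intensity(image, delta, low=0, high=255):
    """Add `delta` to every pixel, clamping to the valid intensity range."""
    return [[min(high, max(low, pixel + delta)) for pixel in row] for row in image]


def augment(image, label):
    """Return the original sample plus transformed copies sharing its label."""
    variants = [
        image,
        translate(image, 0, 1),
        rotate90(image),
        shift_intensity(image, 40),
    ]
    return [(variant, label) for variant in variants]


image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
samples = augment(image, "cat")
print(len(samples))  # 4 labeled samples from 1 labeled image
```

No new annotation effort is needed, since each transformation is assumed to preserve the class of the depicted object.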
In addition, various transfer learning approaches such as fine-tuning have been investigated [44, 45]. Earlier layers of a deep neural network tend to contain more generic information (low-level features), which is then combined by the later layers into task-specific objects of interest. Thus, a network that can recognize the different objects present in a large dataset must contain a set of low-level descriptors robust enough to characterize a wide range of patterns. Under this premise, fine-tuning procedures typically aim at adjusting only the higher-level part of a network pre-trained on a large generic dataset, rather than training the full network from scratch. This greatly reduces the need for task-specific data, since only a smaller set of parameters has to be refined for the particular application [45].
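The idea of refining only the higher-level part of a network can be sketched with a toy two-stage model: a "pre-trained" feature layer whose weights are kept frozen, and a small linear head that is the only part updated by gradient descent. Everything here is a hypothetical stand-in (the weights, the three training samples, the learning rate and the number of epochs); it is not the procedure of [44, 45], only an illustration of which parameters are trained.

```python
def features(x, frozen_weights):
    """Frozen lower layer: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in frozen_weights]


def predict(x, frozen_weights, head):
    """Trainable head: linear combination of the frozen features."""
    return sum(h * f for h, f in zip(head, features(x, frozen_weights)))


def mse(data, frozen_weights, head):
    return sum((predict(x, frozen_weights, head) - y) ** 2 for x, y in data) / len(data)


def fine_tune(data, frozen_weights, head, lr=0.05, epochs=200):
    """Gradient descent on the head only; `frozen_weights` is never modified."""
    head = list(head)
    for _ in range(epochs):
        grad = [0.0] * len(head)
        for x, y in data:
            feats = features(x, frozen_weights)
            error = sum(h * f for h, f in zip(head, feats)) - y
            for k, f in enumerate(feats):
                grad[k] += 2.0 * error * f / len(data)
        head = [h - lr * g for h, g in zip(head, grad)]
    return head


# Hypothetical "pre-trained" weights and a small task-specific dataset.
frozen_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frozen_copy = [list(row) for row in frozen_weights]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]

head_init = [0.0, 0.0, 0.0]
loss_before = mse(data, frozen_weights, head_init)
head_tuned = fine_tune(data, frozen_weights, head_init)
loss_after = mse(data, frozen_weights, head_tuned)
print(loss_before, loss_after)
```

Only the three head parameters are adjusted in this sketch, while the six frozen weights are reused as they are, which mirrors the reduction in trainable parameters that motivates fine-tuning.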
Skip-connections
Increasing the number of layers is the most natural way of increasing the capacity of deep neural networks. However, it has been observed that beyond a certain depth the performance of a model starts to degrade as more layers are added. In [46], He et al. introduce the concepts of skip-connections and residual learning to address this problem. As Figure 2.12 illustrates, a skip-connection provides a direct pathway between the input and the output of the corresponding layer. This pathway is equivalent to an identity layer, such that the weights composing the layer under consideration only have to learn a residual mapping with respect to the identity function. Hence, if a shallower option is the optimal solution for a certain part of the network, it is easier for the learning process to converge to such a solution, with the weights set to approximately 0, while a pathway for gradient backpropagation still exists. By exploiting skip-connections, the authors introduced the ResNet model, one of the most popular network architectures, used as a backbone in many state-of-the-art models for various image understanding tasks.
Figure 2.12: Representation of the concept of skip-connections for the design of residual networks.
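The role of the identity pathway can be illustrated with a simplified residual block computing y = x + F(x), where F is two linear maps with a ReLU in between. This is a sketch under stated simplifications: it omits the normalization layers and the activation applied after the addition in the blocks of [46], and the function names and the 2-dimensional input are assumptions.

```python
def linear(x, weights):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(vector):
    return [max(0.0, value) for value in vector]


def plain_block(x, w1, w2):
    """Two stacked layers without a skip-connection."""
    return linear(relu(linear(x, w1)), w2)


def residual_block(x, w1, w2):
    """Same layers, but they only model the residual added to the input."""
    residual = plain_block(x, w1, w2)
    return [xi + ri for xi, ri in zip(x, residual)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]

out_residual = residual_block(x, zeros, zeros)  # input passes through unchanged
out_plain = plain_block(x, zeros, zeros)        # input information is lost
print(out_residual, out_plain)
```

With all weights at 0 the residual block reduces to the identity, so effectively skipping the block only requires driving its weights toward 0, whereas the plain block would have to learn an explicit identity mapping.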