Typical Convolutional Neural Networks (CNNs) consist of a sequence of layers, each transforming one volume of activations into another through linear as well as non-linear operators. Three main types of layers are used to build CNNs: the convolutional layer, the pooling layer, and the fully-connected layer. The remainder of this section introduces each of them in turn.
Figure 2.1 illustrates a classic CNN architecture, AlexNet [26]. Our DS-CNN, introduced in Chapter 5, is also built on this architecture. In Figure 2.1, each convolutional layer together with its following pooling layer is labeled Ci, and each fully-connected layer is labeled Fi, where i is the layer index. The size of a convolutional layer is described as depth@width×height, where depth is the number of convolutional filters, and width and height denote the spatial dimensions of the resulting 2D feature maps.
Convolutional Layers
The convolutional layers are the key component of a CNN and are what differentiates CNNs from conventional neural networks. Figure 2.2 illustrates how the convolution operation is applied to a small region of the input data to build multiple feature maps. Although Figure 2.2 shows only the first convolutional layer (which is why the input data is an image), the other convolutional layers work in the same manner. The learnable parameters of a convolutional layer are a set of convolutional filters of size n×n×i, where n denotes the spatial size of a filter (also called the receptive field) and i denotes its depth, assuming the filters are square in the spatial dimensions (some works use filters of other shapes, depending on the task). For an input image, the depth is the number of color channels, e.g. three for an RGB image. For a set of 2D feature maps produced by a convolutional layer, the depth is the number of convolutional filters in that layer.
Figure 2.2: The diagram of the first convolutional layer in a CNN architecture.
During the forward pass, each filter slides across the width and height of the input data in raster order, producing a 2D feature map. Each pixel in a feature map is the dot product between the filter and a region of the input data. Note that each filter must extend through the full depth of the input data, so the depth of each filter is set equal to the depth of the input data. If a convolutional layer has j filters, it produces j feature maps. Stacking these j feature maps along the depth dimension forms the output data (feature volume) with a depth of j. The feature volume is the input to the next convolutional layer or fully-connected layer. The spatial dimension of a feature map (k in Figure 2.2) can be computed as
$$k = \frac{m - n + 2p}{s} + 1$$
where m is the input data size, n is the spatial size of the filter, p is the amount of zero padding on the border to handle the alignment problem, and s is the scan stride of the convolutional filter.
Figure 2.3: 96 filters (11×11×3 for each) learned by the first convolutional layer in AlexNet.
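The forward pass described above can be sketched in a few lines of plain Python. This is only a minimal, unoptimized illustration and not the implementation used in this work: the function name `conv2d_forward`, the nested-list data layout, and the small example values are assumptions made here for clarity, and bias terms and non-linearities are omitted.

```python
def conv2d_forward(volume, filters, stride=1, pad=0):
    """Naive convolution forward pass (illustrative sketch).

    volume:  input indexed [channel][row][col], size i x m x m
    filters: list of j filters, each indexed [channel][row][col], size i x n x n
    Returns j feature maps, each k x k with k = (m - n + 2*pad) // stride + 1.
    """
    depth = len(volume)
    m = len(volume[0])
    n = len(filters[0][0])
    # Every filter must extend through the full depth of the input.
    assert all(len(f) == depth for f in filters)

    # Zero-pad each channel on all four borders.
    size = m + 2 * pad
    padded = [[[0.0] * size for _ in range(size)] for _ in range(depth)]
    for c in range(depth):
        for r in range(m):
            for col in range(m):
                padded[c][r + pad][col + pad] = volume[c][r][col]

    k = (m - n + 2 * pad) // stride + 1
    output = []
    for f in filters:
        fmap = [[0.0] * k for _ in range(k)]
        for y in range(k):
            for x in range(k):
                # Dot product between the filter and one region of the input.
                total = 0.0
                for c in range(depth):
                    for dy in range(n):
                        for dx in range(n):
                            total += (f[c][dy][dx]
                                      * padded[c][y * stride + dy][x * stride + dx])
                fmap[y][x] = total
        output.append(fmap)
    # Stacking the j feature maps along the depth axis gives the feature volume.
    return output


# Hypothetical 1x3x3 input and two 1x2x2 filters (values chosen for illustration).
image = [[[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]]
filters = [
    [[[1, 0], [0, 1]]],  # sums the main diagonal of each 2x2 region
    [[[1, 1], [1, 1]]],  # sums all four values of each 2x2 region
]
feature_volume = conv2d_forward(image, filters)  # two 2x2 feature maps
```

With m = 3, n = 2, p = 0 and s = 1, each feature map has k = 2, and the output depth equals the number of filters.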
One main drawback of conventional neural networks is that they cannot handle high-dimensional data such as images without any scaling, because the fully-connected layers would then have a huge number of learnable weights, which leads to overfitting. Convolutional layers address this problem by taking advantage of the spatial relations between small image regions. In a convolutional layer, the whole input shares a set of j 3D filters of size n×n×i, so the total number of weights is n×n×i×j, which does not depend on the image size.
Taking the AlexNet architecture as an example (shown in Figure 2.1), the first convolutional layer has 96 filters with a spatial size of 11, a stride of 4, and no zero-padding, and the input image size is 227×227 with three color channels. The number of weights is then 11×11×3×96 = 34,848, and the size of the output feature volume is 55×55×96.
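The arithmetic for the first AlexNet layer can be checked with a short script. This is a sketch under stated assumptions: the helper names `conv_output_size` and `conv_weight_count` are introduced here for illustration, weights are counted over the full n×n×i extent of each filter (matching the 11×11×3 filters of Figure 2.3), and bias terms are not counted.

```python
def conv_output_size(m, n, p, s):
    """Spatial size k of a feature map: k = (m - n + 2p) / s + 1."""
    span = m - n + 2 * p
    if span % s != 0:
        # The filter positions do not tile the (padded) input evenly.
        raise ValueError("stride does not evenly divide m - n + 2p")
    return span // s + 1


def conv_weight_count(n, i, j):
    """Number of weights in j filters of size n x n x i (biases excluded)."""
    return n * n * i * j


# First convolutional layer of AlexNet, using the values given in the text.
k = conv_output_size(m=227, n=11, p=0, s=4)
weights = conv_weight_count(n=11, i=3, j=96)
print(f"output volume: {k}x{k}x96, weights: {weights}")
```

Because the weight count depends only on n, i and j, doubling the image size would change k but leave the number of weights unchanged.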
Figure 2.3 visualizes the 96 filters learned by the first convolutional layer in AlexNet. We can see that the different filters have learned to detect edges and patterns at different locations and orientations in the image. This is useful for capturing the translation-invariant and rotation-invariant properties of images.
Pooling Layers
In typical CNNs, some convolutional layers are followed by a pooling layer. The pooling layer has two main functions. First, it effectively reduces the number of parameters and computations in the subsequent layers. Although increasing the stride of the convolution operation achieves a similar reduction, details in the feature maps may be lost due to the lower dimension. The pooling layer applies a down-sampling operation (max or average) to each feature map independently. The intuition behind the pooling layer is that, because of the high correlation between small regions in an image, features that are valuable in one region are also likely to be valuable in neighboring regions. Thus, it is reasonable to aggregate extracted features at various locations. Second, the pooling layer provides translation invariance, i.e. the same pooled neuron will be activated even when there is a small translation in the image.
In the AlexNet architecture [26], the pooling layer scans each feature map using a 3×3 max filter with a stride of 2. As a result, each feature map is down-sampled by a factor of about 2 in each spatial dimension, while the depth of the feature volume remains unchanged. Since the first convolutional layer of AlexNet produces a feature volume of size 55×55×96, the following pooling layer down-samples it to 27×27×96. Figure 2.4 illustrates the pooling process performed on a 5×5 matrix, which produces a 2×2 pooled matrix.
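A minimal sketch of this max pooling step on a single feature map is given below. The function name `max_pool2d` and the 5×5 example values are assumptions for illustration (the actual values in Figure 2.4 are not reproduced here); the window size of 3 and stride of 2 follow the AlexNet setting described above.

```python
def max_pool2d(fmap, size=3, stride=2):
    """Max pooling over one square 2D feature map given as nested lists."""
    m = len(fmap)
    k = (m - size) // stride + 1  # pooled spatial size, no padding
    return [
        [
            # Maximum over the size x size window whose top-left corner
            # is at (y * stride, x * stride).
            max(
                fmap[y * stride + dy][x * stride + dx]
                for dy in range(size)
                for dx in range(size)
            )
            for x in range(k)
        ]
        for y in range(k)
    ]


# Hypothetical 5x5 feature map (values chosen for illustration only).
grid = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
]
pooled = max_pool2d(grid)  # 2x2 pooled matrix

# A 55x55 feature map is pooled to 27x27, as in the first AlexNet pooling layer.
pooled_55 = max_pool2d([[0] * 55 for _ in range(55)])
```

Pooling is applied to each of the 96 feature maps independently, which is why the depth of the feature volume is unchanged.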
Fully-connected Layer
Neurons in a fully-connected layer have full connections to all neurons in the previous layer. The activations of a fully-connected layer can hence be computed using a matrix multiplication plus a bias offset. In CNNs, fully-connected layers encode the feature volume produced by the convolutional layers into a feature vector specific to a learning task. Usually, fully-connected layers account for the largest share of the total parameters in a CNN, which is likely to lead to overfitting. Therefore, some recent works [36, 37] remove the full connections between the final convolutional layer and the following fully-connected layer. As a result, the number of parameters can be substantially reduced.
Figure 2.4: A max pooling layer scans a 5×5 matrix using a 3×3 max filter with a stride of 2.
In the AlexNet architecture shown in Figure 2.1, the final pooling layer produces a feature volume of size 6×6×256, while the first fully-connected layer has 4,096 neurons. Thus, there are 6×6×256×4,096 = 37,748,736 connections (weights) between these two layers.
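The fully-connected computation and the weight count above can be sketched as follows. The helper names `flatten`, `fully_connected` and `fc_weight_count` and the toy numbers are assumptions for illustration; the activation function is omitted and bias terms are not included in the count.

```python
def flatten(volume):
    """Flatten a feature volume indexed [channel][row][col] into a vector."""
    return [value for channel in volume for row in channel for value in row]


def fully_connected(inputs, weights, biases):
    """Matrix multiplication plus bias: one weight row per output neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def fc_weight_count(height, width, depth, neurons):
    """Weights between a height x width x depth volume and a dense layer."""
    return height * width * depth * neurons


# Toy example: a 1x1x3 feature volume feeding a layer with two neurons.
features = flatten([[[1]], [[2]], [[3]]])
activations = fully_connected(
    features,
    weights=[[1, 0, 0], [1, 1, 1]],
    biases=[0.5, -1],
)

# Final pooling output (6x6x256) to the first fully-connected layer (4,096 neurons).
alexnet_fc_weights = fc_weight_count(6, 6, 256, 4096)
```

For comparison, this single layer holds far more weights than a convolutional layer, which is consistent with the overfitting concern raised above.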