
Convolution is the core building block of a CNN and captures the spatial relations among pixels in an image. For an image $x$ and a kernel $k$, the discrete 2D convolution is:

$$(x * k)(i, j) = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} x_{i-u,\, j-v}\, k_{u,v}, \tag{2.1}$$
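As a minimal sketch of Eq. 2.1 for finite arrays, the infinite sums reduce to sums over the kernel support if both the image and the kernel are assumed to be zero outside their bounds. The function name `conv2d_full` and the toy inputs are illustrative assumptions, not part of the original text.

```python
import numpy as np

def conv2d_full(x, k):
    """Discrete 2D convolution (Eq. 2.1) of two finite arrays.

    Both arrays are treated as zero outside their bounds, so the
    result has shape (H + kh - 1, W + kw - 1).
    """
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    # out[i, j] = sum over (u, v) of x[i - u, j - v] * k[u, v]:
    # each kernel entry adds a shifted, scaled copy of x.
    for u in range(kh):
        for v in range(kw):
            out[u:u + H, v:v + W] += x * k[u, v]
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
k = np.array([[0.0, 1.0], [1.0, 0.0]])
y = conv2d_full(x, k)
print(y)
# [[0. 1. 2.]
#  [1. 5. 4.]
#  [3. 4. 0.]]
```

Note that many deep learning libraries implement the closely related cross-correlation (the same sum without flipping the kernel); since the kernels are learned, the two differ only in how the learned weights are indexed.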

At a convolution layer, the previous layer's feature maps are convolved with learnable kernels and passed through the activation function to form the output feature maps, as shown in Fig. 2.1 (figures are from the website cited in footnote 1). Each output map may combine convolutions with multiple input maps, which can be formally expressed as Eq. 2.2.

$$x_j^l = f\Big( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \Big), \tag{2.2}$$

where $M_j$ represents a selection of input maps. Each output feature map is given an additive bias $b$; however, for a particular output map, the input maps are convolved with distinct kernels. In other words, if output feature maps $j$ and $h$ both sum over input map $i$, then the kernels applied to input map $i$ are different for output feature maps $j$ and $h$.
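The following is a minimal NumPy sketch of Eq. 2.2, written under several assumptions that are not in the original text: every output map sums over all input maps ($M_j$ is the full set), the convolution is "valid" (no padding), and tanh is used as the default activation $f$. The names `conv2d_valid` and `conv_layer` are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D convolution (flipped kernel, no padding) of one map with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    kf = k[::-1, ::-1]  # flip the kernel so this is a convolution, not a correlation
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kf)
    return out

def conv_layer(x_prev, kernels, bias, f=np.tanh):
    """Eq. 2.2 with M_j = all input maps (an assumption of this sketch).

    x_prev:  (C_in, H, W) feature maps of layer l-1
    kernels: (C_in, C_out, kh, kw), one distinct kernel k_ij per (input i, output j) pair
    bias:    (C_out,) additive bias b_j per output map
    """
    c_in, c_out = kernels.shape[:2]
    maps = []
    for j in range(c_out):
        s = sum(conv2d_valid(x_prev[i], kernels[i, j]) for i in range(c_in))
        maps.append(f(s + bias[j]))
    return np.stack(maps)

rng = np.random.default_rng(0)
x_prev = rng.normal(size=(2, 5, 5))
kernels = rng.normal(size=(2, 3, 3, 3))
out = conv_layer(x_prev, kernels, np.zeros(3))
print(out.shape)  # (3, 3, 3)
```

The `kernels[i, j]` indexing makes the point of the paragraph explicit: input map `i` is convolved with a different kernel for each output map `j`.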

Much work has been done to improve the convolution operation. Locally connected convolution differs from plain convolution in that every location in the feature map learns a different set of filters. It was proposed to take location information into account and is used to process images in which the key points are relatively fixed across images, as in face recognition and prostate segmentation [196, 152]. Deformable convolution [42] is a generic operation that can model geometric transformations without additional supervision and can thus contribute to networks with greater capacity. Depth-wise convolution, a kind of separable convolution, takes full advantage of group convolution [105] to reduce computational cost. Combined with point-wise convolution, which is simply a convolution with a 1×1 kernel, it can be used to build efficient lightweight networks [79] that run on mobile devices.
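To make the saving concrete, the sketch below counts weights for a standard convolution and for a depth-wise convolution followed by a point-wise convolution. The channel counts (64 in, 128 out) and the 3×3 kernel are illustrative assumptions, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise k x k convolution (one filter per input channel)
    followed by a point-wise 1 x 1 convolution that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

standard = conv_params(64, 128, 3)                  # 64 * 128 * 9 = 73728
separable = depthwise_separable_params(64, 128, 3)  # 576 + 8192 = 8768
print(standard, separable, round(standard / separable, 1))  # 73728 8768 8.4
```

For these assumed sizes the separable form needs roughly 8 times fewer weights, which is the kind of reduction that makes such layers attractive on mobile devices.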

2.1.1.2 Pooling

The pooling operation is designed to downsample the feature maps: if there are N input feature maps, there will be exactly N corresponding output feature maps of reduced size. Pooling layers are periodically inserted between successive convolution layers in CNNs to progressively reduce the spatial size of the representation. This reduces the number of parameters and the computational cost of the network, and hence also helps to control overfitting to some extent. Formally,

$$x_j^l = f\big( \beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l \big), \tag{2.3}$$

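where down(·) represents a sub-sampling function, $\beta_j^l$ is a multiplicative coefficient, and $b_j^l$ is an additive bias. The most widely used sub-sampling functions are max pooling and average pooling.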
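A minimal sketch of the sub-sampling function down(·) for non-overlapping windows is given below. It covers only the downsampling step of Eq. 2.3 (the scaling, bias, and activation are omitted), and the function name `pool2d`, the window size of 2, and the `(N, H, W)` layout are assumptions of this example.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling on feature maps of shape (N, H, W).

    Each of the N input maps yields exactly one smaller output map.
    """
    n, h, w = x.shape
    h2, w2 = h // size, w // size
    # Crop so H and W are multiples of the window size, then split into windows.
    windows = x[:, :h2 * size, :w2 * size].reshape(n, h2, size, w2, size)
    if mode == "max":
        return windows.max(axis=(2, 4))
    return windows.mean(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(pool2d(x, 2, "max")[0])   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "mean")[0])  # [[ 2.5  4.5] [10.5 12.5]]
```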

2.1.1.3 Activation function

The activation function is one of the most important building blocks in neural networks. It is placed at the end of a network or between its layers to help decide whether a neuron fires, as shown in Fig. 2.2(a). The activation function is a non-linear transformation applied to the input signal. Several activation functions are widely used, such as sigmoid (Eq. 2.4), the Rectified Linear Unit (denoted as ReLU) [144] (Eq. 2.5), and tanh (Eq. 2.6).

Figure 2.2: (a) The activation function in a neural network, mapping the net input x to the output o through f(x). (b) Typical activation functions (sigmoid and rectified linear).

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \tag{2.4}$$

$$\mathrm{ReLU}(x) = \max(0, x), \tag{2.5}$$

$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1. \tag{2.6}$$
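Eqs. 2.4 to 2.6 translate directly into NumPy. The sketch below is only an illustration (the sample inputs are arbitrary); it also checks that the form of tanh in Eq. 2.6 agrees with NumPy's built-in `np.tanh`.

```python
import numpy as np

def sigmoid(x):
    """Eq. 2.4: squashes the input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Eq. 2.5: passes positive inputs, zeroes out negative ones."""
    return np.maximum(0.0, x)

def tanh(x):
    """Eq. 2.6: squashes the input into (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(relu(x))                           # [0. 0. 2.]
print(np.allclose(tanh(x), np.tanh(x)))  # True
```

Eq. 2.6 is the identity tanh(x) = 2·sigmoid(2x) − 1, so tanh can be seen as a rescaled and shifted sigmoid.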

Compared with sigmoid, which was previously the most frequently used activation function, ReLU has recently been adopted more often in CNNs (shown in Fig. 2.2(b)), because it can alleviate the vanishing or exploding gradient problems that often occur with sigmoid [202]. Vanishing (exploding) gradients are a well-known problem in deep neural networks: as the gradient information is back-propagated, repeated multiplication or convolution with small (large) weights leads to ineffectively small (large) gradients in the shallow layers. Several activation functions have also been proposed to further improve on ReLU, such as LeakyReLU [130] (Eq. 2.7), the Parametric Rectified Linear Unit (denoted as PReLU) [70] (Eq. 2.8; note that $\theta$ is a learned vector of the same size as $x$), and the exponential linear unit (denoted as ELU) [39] (Eq. 2.9).

$$\mathrm{LeakyReLU}(x) = \begin{cases} \alpha x, & x < 0 \\ x, & x \geq 0 \end{cases} \tag{2.7}$$

$$\mathrm{PReLU}(x) = \begin{cases} \theta x, & x < 0 \\ x, & x \geq 0 \end{cases} \tag{2.8}$$

$$\mathrm{ELU}(x) = \begin{cases} \alpha (e^{x} - 1), & x < 0 \\ x, & x \geq 0 \end{cases} \tag{2.9}$$

2.1.1.4 Normalization

To further address the vanishing and exploding gradient issues, many normalization techniques have been proposed to make networks easier to optimize. Ioffe et al. [87] proposed batch normalization to reduce internal covariate shift, which refers to the change in the input distribution of the internal layers of a deep network, and thus to accelerate the training of neural networks. Following batch normalization, instance normalization [183] and layer normalization [19] were proposed to improve training under certain conditions. Later, group normalization [199] was proposed for cases in which the batch size is small. It is now standard practice to include a normalization technique to make networks easier to train.
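The four techniques can be viewed as the same standardization applied over different axes. The sketch below illustrates this under stated assumptions: inputs have the layout `(N, C, H, W)`, statistics come from the current batch only (no running averages), and the learnable scale and shift that these layers normally carry are omitted. The helper name `normalize` is hypothetical.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x):
    """Statistics per channel, shared across the batch and spatial positions."""
    return normalize(x, (0, 2, 3))

def layer_norm(x):
    """Statistics per sample, shared across channels and spatial positions."""
    return normalize(x, (1, 2, 3))

def instance_norm(x):
    """Statistics per sample and per channel, over spatial positions only."""
    return normalize(x, (2, 3))

def group_norm(x, groups):
    """Statistics per sample and per group of channels (independent of batch size)."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    return normalize(g, (2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 5, 5))
print(np.allclose(batch_norm(x).mean(axis=(0, 2, 3)), 0.0, atol=1e-7))  # True
```

Because group normalization never averages over the batch axis, its statistics do not degrade when the batch is small, which matches the motivation given above.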

2.1.1.5 Architectures

After years of development, numerous network architectures have been proposed for different applications. Among them, AlexNet [105], VGGNet [173], Inception [177], ResNet [71], and DenseNet [82] are milestone architectures that have driven the computer vision field forward. I briefly cover two of them: AlexNet and ResNet.

Figure 2.3: Illustration of a typical residual block. Two weight layers with ReLU activation compute F(x), and an identity shortcut adds the input x to give F(x) + x.

AlexNet is the most representative CNN architecture of the early stage. It consists of 5 convolution layers, 3 fully connected layers, and 3 pooling layers. Dropout [175] is adopted to enhance the model's generalization ability. To save GPU memory, the convolutions are partitioned into two groups. AlexNet achieved state-of-the-art performance on image recognition challenges at that time.

ResNet [71] is another milestone for deep learning. With residual learning, performance on many computer vision and medical image analysis tasks has been greatly improved [71, 68, 152]. In ResNet, the authors design a residual block and stack a series of residual blocks to form the residual network. Fig. 2.3 shows a typical residual block with two intermediate layers. The essence of the residual block is identity mapping, which can be mathematically formulated as follows:

$$Z = F(A, \{\theta_i\}) + A, \tag{2.10}$$

where $\{\theta_i\}$ is the set of convolutional filters in the bottleneck residual unit, $F$ denotes the convolutional layers in a residual block, and $A$ and $Z$ are the input and output feature maps, respectively.
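Eq. 2.10 can be sketched in a few lines. To keep the example short, the residual branch $F$ below uses two dense (matrix) layers with a ReLU in between instead of convolutions; that substitution, the function name `residual_block`, and the weight shapes are assumptions of this illustration. The sketch follows Eq. 2.10 literally and applies no activation after the addition.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(a, w1, w2):
    """Z = F(A, {theta_i}) + A, with {theta_i} = {w1, w2}."""
    f = relu(a @ w1) @ w2  # residual branch F(A, {w1, w2})
    return f + a           # identity shortcut adds the input back

a = np.array([[1.0, -2.0, 3.0]])
# With all-zero weights the residual branch outputs 0, so the block is the identity map.
z = residual_block(a, np.zeros((3, 3)), np.zeros((3, 3)))
print(np.array_equal(z, a))  # True
```

The zero-weight case shows why the shortcut helps optimization: a block can fall back to the identity mapping simply by driving its residual branch toward zero.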

2.1.1.6 Regularization

Several techniques have been developed to regularize networks and prevent overfitting. L2 regularization, also called weight decay, is a widely adopted regularization term for neural networks. Dropout [175], which randomly drops a subset of neurons at each training iteration, is a strong regularizer that can effectively reduce overfitting when used well. Early stopping is another option for improving the generalization ability of network models.
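The first two techniques can be sketched as follows. Two choices here are assumptions of the example rather than statements from the text: the L2 term is written as `lam` times the sum of squared weights (some formulations include a factor of 1/2), and dropout is implemented in its "inverted" form, which rescales the surviving units during training so that nothing needs to change at test time.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum of squared weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p  # True for the units that are kept
    return x * mask / (1.0 - p)

weights = [np.array([1.0, 2.0]), np.array([[3.0]])]
penalty = l2_penalty(weights, lam=0.1)  # 0.1 * (1 + 4 + 9)

rng = np.random.default_rng(0)
h = np.ones(1000)
h_train = dropout(h, p=0.5, rng=rng)                 # entries are either 0.0 or 2.0
h_test = dropout(h, p=0.5, rng=rng, training=False)  # unchanged at test time
```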

2.1.1.7 Loss function

Cross-entropy loss is widely adopted to optimize CNN models for classification problems. The mathematical formulation is given by Eq. 2.11.

$$H(y, p) = -\sum_{i=1}^{k} y_i \log(p_i), \tag{2.11}$$

where $y_i$ and $p_i$ are the ground truth and the CNN score for category $i \in \{1, 2, \ldots, k\}$, with $k$ denoting the number of distinct classes.
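A minimal sketch of the cross-entropy loss is shown below, using the conventional leading minus sign that makes the loss non-negative. It assumes that the CNN scores $p_i$ are probabilities obtained by a softmax over raw network outputs; the softmax step, the small `eps` guard against log(0), and the example scores are assumptions of this illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_i y_i * log(p_i) for a one-hot (or soft) label vector y."""
    return -np.sum(y * np.log(p + eps))

y = np.array([0.0, 1.0, 0.0])           # ground truth: class 2 of k = 3
p = softmax(np.array([1.0, 2.0, 3.0]))  # predicted class probabilities
loss = cross_entropy(y, p)
print(round(float(loss), 4))  # 1.4076
```

With a one-hot label, only the term for the true class survives, so the loss is simply the negative log of the probability assigned to the correct class.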
