Convolutional Neural Networks (CNNs) are designed to work with grid-structured data, such as raw representations of images, and to recognise visual objects. They were inspired by the way the visual cortex works: neurons respond to stimuli in a restricted region called the receptive field, and the receptive fields of different neurons overlap. Similarly, in a CNN the neurons are connected only to certain regions of the input or of the previous layer, and these regions usually overlap. CNNs resemble the MLPs discussed in the previous section in that both process information in a feedforward manner. However, the two architectures differ in several ways. An MLP treats every input independently, and each of its neurons is fully connected to all neurons of the previous layer. A CNN can be seen as applying an MLP to a certain patch (or kernel) of the data, with the patch window shifted repeatedly so that the whole input is covered. The weights of this MLP are shared across all patches.

This operation is called convolution and is illustrated in Figure 2.3. The input is divided into patches called kernels or filters; in practice, a window containing the patch is shifted across the input, and the size of each shift is called the stride. Each kernel is passed to a hidden layer that produces $k$ outputs called feature maps. Using this convolution operation, an input of size $\text{height}_{\text{inp}} \times \text{width}_{\text{inp}} \times p$, where $p$ is the number of channels (e.g. 3 for red, green and blue), is mapped to an output of size $\text{height}_{\text{out}} \times \text{width}_{\text{out}} \times k$.
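The mapping from an input with $p$ channels to $k$ feature maps can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used in this work. The function name `conv2d`, the window side `f`, the absence of padding and the resulting output-size formula `(size_in - f) // stride + 1` are all assumptions made for the example.

```python
import numpy as np

def conv2d(x, weights, bias, stride=1):
    """Naive 2-D convolution without padding (illustrative assumption).

    x:       input of shape (height_in, width_in, p)
    weights: shape (f, f, p, k), one f x f x p kernel per feature map
    bias:    shape (k,)
    Returns an array of shape (height_out, width_out, k).
    """
    h_in, w_in, p = x.shape
    f, _, _, k = weights.shape
    h_out = (h_in - f) // stride + 1
    w_out = (w_in - f) // stride + 1
    out = np.zeros((h_out, w_out, k))
    for i in range(h_out):
        for j in range(w_out):
            # Window shifted by `stride` positions at each step.
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            # The same weights are applied to every patch (weight sharing).
            out[i, j, :] = np.tensordot(patch, weights, axes=([0, 1, 2], [0, 1, 2])) + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))       # 6 x 6 input with p = 3 channels
kernels = rng.normal(size=(3, 3, 3, 4))  # 3 x 3 window, k = 4 feature maps
feature_maps = conv2d(image, kernels, np.zeros(4), stride=1)
print(feature_maps.shape)  # (4, 4, 4)
```

With a stride of 2 the same input would yield a 2 × 2 × 4 output, since the window then fits only twice along each spatial dimension.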

The convolution is usually combined with a pooling operation. Pooling combines the vectors in a certain neighbourhood into a single vector, for instance by summing them or by taking their maximum or average. A common type of pooling is max-pooling, which is illustrated in Figure 2.4.
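As a small worked example of max-pooling, the sketch below uses a 2 × 2 window and a stride of 1, the setting shown in Figure 2.4. The function name `max_pool2d` and the 3 × 3 input values are hypothetical and chosen only for illustration.

```python
import numpy as np

def max_pool2d(x, size=2, stride=1):
    """Max-pooling over a 2-D array with a size x size window."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()  # keep only the largest value in the window
    return out

x = np.array([[1, 3, 2],
              [4, 6, 5],
              [7, 9, 8]])
pooled = max_pool2d(x, size=2, stride=1)
print(pooled)
# [[6 6]
#  [9 9]]
```

Replacing `window.max()` with `window.sum()` or `window.mean()` would give the sum- and average-pooling variants mentioned above.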

CNNs are designed to work on grid-shaped data, such as images. They are known to work very well on the task of handwritten digit recognition and achieve state-of-the-art performance on the MNIST² database of handwritten digits.³ Most architectures combine several convolutional and pooling layers. One of the first successful convolutional architectures was LeNet, proposed by LeCun et al. (1998), which combined a few convolutional and max-pooling layers with fully-connected layers. Another very famous convolutional architecture is AlexNet (Krizhevsky et al., 2012), which was initially developed for the task of image classification (Deng et al., 2009). This model and its variations were also successfully applied to object detection (Girshick et al., 2014), video classification (Karpathy et al., 2014), visual tracking (Wang and Yeung, 2013) and other computer vision tasks.

² MNIST stands for Modified National Institute of Standards and Technology, but the database is usually known as just MNIST.

Figure 2.3: Illustration of a two-dimensional convolution. The input is divided into patches called kernels or filters. The window containing the patch is shifted each time; the size of the shift is called the stride. Each kernel is passed to a hidden layer that produces outputs called feature maps.

Even though CNNs were designed to work with visual data, they have also been applied to textual data (Kim, 2014; Kalchbrenner et al., 2014; dos Santos and Gatti, 2014). The motivation for using CNNs for text lies in their ability to convert a variable-sized input into a fixed-sized output, which is often needed in NLP tasks. A typical CNN applied to text differs slightly from one applied to images. Images are usually represented as 3-dimensional grids: the first two dimensions are the spatial position and the third corresponds to the colour channel. When dealing with textual data, the text is represented as a sequence of word indices, and these indices are then replaced with the corresponding word embeddings. The convolution is then applied to the word embeddings. Figure 2.5 illustrates a CNN that encodes a variable-length text as a fixed-size vector. First, the words are represented as word embeddings $e_{w^{(1)}}, e_{w^{(2)}}, e_{w^{(3)}}, \dots, e_{w^{(n)}}$. These can be either initialised randomly or pretrained. For each word, all the word vectors in a window of size $k$ around it are concatenated,⁴ and weights, biases and an activation function are applied to the resulting vector:

$$z^{(i)} = \sigma\left(W\left[e_{w^{(i-k)}}, \dots, e_{w^{(i)}}, \dots, e_{w^{(i+k)}}\right] + b\right)$$

Figure 2.4: Illustration of a max-pooling operation with a kernel of size 2x2 and a stride of 1.

In Figure 2.5, $k$ is equal to 1, i.e. only one word to the left and one word to the right are used to calculate the local features. The same weights $W$ and biases $b$ are used for all words, i.e. they are shared across all positions, unlike in the MLP architecture.
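The windowed computation of the local features can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function name `text_convolution` is hypothetical, positions outside the text are zero-padded, and the activation $\sigma$ is taken to be tanh. None of these choices is specified in the text above.

```python
import numpy as np

def text_convolution(embeddings, W, b, k=1):
    """Compute z(i) = sigma(W [e(i-k), ..., e(i+k)] + b) for every position i.

    embeddings: (n, d) array, one d-dimensional embedding per word
    W:          (h, (2k + 1) * d) weight matrix, shared by all positions
    b:          (h,) bias vector, shared by all positions
    Returns an (n, h) array whose i-th row is z(i).
    """
    n, d = embeddings.shape
    # Assumption: positions before the first and after the last word are zeros.
    padded = np.vstack([np.zeros((k, d)), embeddings, np.zeros((k, d))])
    z = np.empty((n, W.shape[0]))
    for i in range(n):
        window = padded[i:i + 2 * k + 1].reshape(-1)  # concatenate 2k + 1 vectors
        z[i] = np.tanh(W @ window + b)                # assumption: sigma = tanh
    return z

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))  # n = 5 words, d = 4 (randomly initialised)
W = rng.normal(size=(3, 3 * 4))       # h = 3 local features, window of 2k + 1 = 3
b = np.zeros(3)
z = text_convolution(embeddings, W, b, k=1)
print(z.shape)  # (5, 3)
```

Because `W` and `b` do not depend on `i`, the number of parameters stays the same regardless of the length of the text.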

Finally, max-pooling is applied to the output of the convolutional layer: each dimension $i$ of the resulting representation $r$ is the maximum among the values of the vectors $z$ along this dimension:

$$r_i = \max\left(z^{(1)}_i, z^{(2)}_i, \dots, z^{(n)}_i\right)$$
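The following sketch shows how this max-pooling step turns inputs of different lengths into vectors of the same size. The function name `max_over_positions` and the numeric values are hypothetical and serve only as an illustration.

```python
import numpy as np

def max_over_positions(z):
    """r_i = max(z(1)_i, ..., z(n)_i): element-wise maximum over the n positions."""
    return z.max(axis=0)

z_short = np.array([[0.1, -0.5], [0.7, 0.2]])              # n = 2 positions
z_long = np.array([[0.3, 0.9], [-0.2, 0.4], [0.8, -0.1]])  # n = 3 positions
r_short = max_over_positions(z_short)
r_long = max_over_positions(z_long)
print(r_short, r_long)  # [0.7 0.2] [0.8 0.9]
```

Both results have two dimensions even though the inputs contain two and three positions respectively, which is the fixed-size property that motivates the use of CNNs for text.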

We use a similar architecture in Chapter 3 for detecting semantically equivalent questions. A similar convolutional architecture, but operating on a character-level representation, was used by dos Santos and Gatti (2014) for sentiment analysis.
