3.3.2.1 Production. Kuentz (1992: 34-65) proposes three levels of analysis with regard to the school textbook that
LITERARY POLYSYSTEM IN THE SCHOOL TEXTBOOK
3.3.2.2 Secondary discourse in El Laberinto de Creta. In the secondary discourse we find:
The proposed network is composed of two branches: a local branch, based on stacked LSTMs, and a global branch, based on an MLP, operating on the features Vα and vβ, respectively. The local branch consists of N stacked LSTMs, which yield a higher level of abstraction of the input data [58]. The LSTM was chosen because of the remarkable results it has obtained in several tasks involving the analysis of body movements, such as gesture recognition [6] and action recognition [167].
According to the stacked LSTM architecture (Sec. 3.2.3), each unit of LSTM_l, at time t, takes as input a vector x_t and the previous hidden state h_{l,t−1}. The input x_t is a temporal local feature vector v_t ∈ Vα if l = 0 (i.e., the first level of the stack); otherwise it is the hidden vector of the underlying layer, x_t = h_{l−1,t} (i.e., for layers above the first one). The output vector zα of the local branch is h_{N−1,T−1}, namely the hidden state of the last layer N − 1 at the last time instant T − 1 of the analysed time window.
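The data flow of the local branch can be illustrated with a minimal pure-Python sketch. It assumes the standard LSTM gate formulation (input, forget, candidate, and output gates), which may differ in detail from the one given in Sec. 3.2.3; the helper names (`init_layer`, `lstm_step`, `local_branch`), the sizes, and the random weights are all hypothetical and serve only to show which vector feeds which layer.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_layer(input_size, hidden_size, rng):
    """Random weights and zero biases for the four gates (hypothetical init)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return {name: (mat(), [0.0] * hidden_size) for name in "ifgo"}


def lstm_step(params, x, h_prev, c_prev):
    """One step of a standard LSTM cell (assumed formulation)."""
    xh = x + h_prev  # list concatenation of input and previous hidden state

    def gate(name, act):
        W, b = params[name]
        return [act(a + bi) for a, bi in zip(matvec(W, xh), b)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def local_branch(V_alpha, layers, hidden_size):
    """Run N stacked LSTMs over the window and return h_{N-1,T-1}."""
    N = len(layers)
    h = [[0.0] * hidden_size for _ in range(N)]
    c = [[0.0] * hidden_size for _ in range(N)]
    for v_t in V_alpha:          # t = 0 .. T-1
        x = v_t                  # layer 0 receives the local feature vector
        for l in range(N):
            h[l], c[l] = lstm_step(layers[l], x, h[l], c[l])
            x = h[l]             # layer l+1 receives h_{l,t}
    return h[N - 1]              # hidden state of the last layer at the last step


rng = random.Random(0)
feature_size, hidden_size, N, T = 4, 3, 2, 5   # illustrative sizes only
layers = [init_layer(feature_size if l == 0 else hidden_size, hidden_size, rng)
          for l in range(N)]
V_alpha = [[rng.uniform(-1, 1) for _ in range(feature_size)] for _ in range(T)]
z_alpha = local_branch(V_alpha, layers, hidden_size)
```

Only the final hidden state of the top layer is kept, so the whole time window is summarized by a single vector whose length equals the hidden size.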
Regarding the global branch, it consists of an MLP with 3 hidden layers, in which each hidden node applies a rectified linear unit (ReLU) activation function to the weighted sum of the output values of the previous layer. The first hidden layer has weighted connections with the temporal global feature vector vβ, which represents the input layer. The number of hidden nodes in each layer (as well as the number of output nodes) is smaller than the number of input entries. In this way, the MLP extracts highly significant patterns from the input, mapping vβ into a low-dimensional description represented by the output layer vector zβ.
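As a rough illustration of the global branch, the following pure-Python sketch maps a global feature vector through three ReLU hidden layers to a smaller output vector. The layer sizes, the random weights, and the helper names (`init_dense`, `dense`, `global_branch`) are hypothetical. The text does not state the activation of the output layer, so a linear output is assumed here.

```python
import random


def relu(a):
    return max(0.0, a)


def init_dense(n_in, n_out, rng):
    """Random weight matrix (n_out rows, n_in columns) and zero bias."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def dense(W, b, v, activation=None):
    """Weighted sum of the previous layer's outputs, optionally activated."""
    out = [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]
    return [activation(a) for a in out] if activation else out


def global_branch(v_beta, hidden_layers, output_layer):
    a = v_beta
    for W, b in hidden_layers:       # 3 hidden layers with ReLU
        a = dense(W, b, a, relu)
    W, b = output_layer              # linear output layer (assumption)
    return dense(W, b, a)


rng = random.Random(0)
# input, three hidden layers, output: every layer is smaller than the input
sizes = [16, 12, 8, 6, 4]            # illustrative sizes only
params = [init_dense(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
v_beta = [rng.uniform(-1, 1) for _ in range(sizes[0])]
z_beta = global_branch(v_beta, params[:3], params[3])
```

Because every layer is narrower than the input, the output is a compressed description of the global features.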
In the last part of the network, the zα and zβ vectors are combined into a new vector z using the concatenation operator. Afterwards, a dense layer with a ReLU activation function is applied, connecting each entry of z to every entry of the output vector y via a weight. This layer maps the vector z to a number of output nodes equal to the size K of the set of affects to be recognized. The final classification ŷ is obtained by applying the softmax function to y:

\[ \hat{y}(k) = \frac{e^{y(k)}}{\sum_{q=0}^{K-1} e^{y(q)}} \tag{3.21} \]
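The fusion step and Eq. (3.21) can be sketched as follows. The branch outputs, the weights, and K = 3 are made-up values chosen only for illustration. Subtracting the maximum before exponentiating is a common numerical-stability device that leaves the result of Eq. (3.21) unchanged; it is not something the text prescribes.

```python
import math


def softmax(y):
    """Eq. (3.21), with the maximum subtracted for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


def dense_relu(W, b, z):
    """Dense layer followed by ReLU, mapping z to the K output nodes."""
    return [max(0.0, sum(w * x for w, x in zip(row, z)) + bi)
            for row, bi in zip(W, b)]


# Hypothetical branch outputs (not taken from the experiments)
z_alpha = [0.2, -0.1, 0.4]
z_beta = [0.5, 0.3]
z = z_alpha + z_beta                 # concatenation operator

K = 3                                # illustrative number of affects
W = [[0.1, 0.2, 0.3, 0.4, 0.5],
     [0.5, 0.4, 0.3, 0.2, 0.1],
     [-0.3, 0.1, 0.0, 0.1, -0.1]]    # K rows, len(z) columns
b = [0.0, 0.0, 0.0]

y = dense_relu(W, b, z)
y_hat = softmax(y)
predicted = max(range(K), key=lambda k: y_hat[k])
```

The entries of `y_hat` are positive and sum to one, so they can be read as class scores, and the predicted affect is the index with the largest score.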
Finally, the proposed network is trained using the cross-entropy loss and the RMSprop optimization algorithm [38].
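To make the training objective concrete, the sketch below pairs the cross-entropy loss with the RMSprop update rule on a toy problem. For brevity the update is applied directly to a vector of three logits rather than to network weights, which in practice would receive their gradients through backpropagation. The learning rate, decay factor `rho`, and `eps` are assumed values, since the text does not report the hyperparameters.

```python
import math


def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_hat, target):
    """Cross-entropy loss for a single sample with true class `target`."""
    return -math.log(y_hat[target])


def rmsprop_step(params, grads, cache, lr=0.01, rho=0.9, eps=1e-8):
    """In-place RMSprop update; lr, rho and eps are assumed values."""
    for i, g in enumerate(grads):
        # running average of squared gradients
        cache[i] = rho * cache[i] + (1.0 - rho) * g * g
        # step scaled by the root of that average
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


logits = [0.0, 0.0, 0.0]             # toy parameters standing in for y
cache = [0.0, 0.0, 0.0]
target = 1                           # hypothetical true class
losses = []
for _ in range(50):
    y_hat = softmax(logits)
    losses.append(cross_entropy(y_hat, target))
    # gradient of the cross-entropy w.r.t. the logits: y_hat - one_hot(target)
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(y_hat)]
    rmsprop_step(logits, grads, cache)
```

RMSprop divides each gradient by a running root-mean-square of its recent values, so the step size adapts per parameter; in this toy run the loss starts at ln 3 and decreases as the logit of the true class grows.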
3.5 Conclusions
In this chapter, the proposed framework was presented and its implementation details were explained. For each module, the features given as network input were indicated and motivated. Since the inputs are mainly data sequences, stacked LSTMs were fundamental to each method, although the action recognition and body affect recognition modules also required the support of other DNN architectures. In fact, the MLP was used to analyse the global features obtained from the movements of the body in Section 3.3, while, in Section 3.4, the CNNs (specifically the 3DCNNs) made it possible to analyse the surrounding environment and the details of the objects with which people interacted within the scene.
Chapter 4
Test and evaluation
This chapter presents the experimental phases of all the modules of the proposed framework and is structured as follows. In Section 4.1, the results obtained in the classification of hand gestures are introduced. In Section 4.2, the tests and analyses performed on the action recognition module are reported. In Section 4.3, the experiments carried out on the non-acted affect recognition module are described. Finally, in Section 4.4, a summary discussion of the experiments on the framework modules is provided.
4.1 Hand gesture recognition experiments
This section describes the experimental tests performed to evaluate the performance of the proposed hand gesture recognition module. All the experiments were executed using an LMC on an Intel i5 3.2 GHz machine with 16 GB of RAM and a GeForce GTX 1050 Ti graphics card. The stacked LSTMs and the BPTT algorithm, used to perform the minimization via stochastic gradient descent, were implemented using the Keras framework.
The experimental session had two main aims: validating the proposed method, including the assessment of the joint angles as salient features for hand gesture recognition, and outperforming competing state-of-the-art works. The first goal was achieved by creating a challenging dataset based on sign language (Section 4.1.1), on which the optimal number of stacked LSTMs (Section 4.1.2) and the effectiveness of the selected joint features (Section 4.1.3) were analysed. In addition, on the same dataset, a set of well-known metrics was computed to evaluate the overall performance of the approach (Section 4.1.4). The second goal was achieved by comparing the proposed method with other relevant works on the SHREC dataset (Section 4.1.5).