As described in Section 4.3, a two-layer IndRNN has the same order of number of parameters as the traditional RNN. Moreover, the computational complexities of the traditional RNN and the proposed IndRNN are also similar. In terms of matrix multiplication, both involve two matrix product operations (M × N and N × N). In addition, the two-layer IndRNN requires two elementwise vector product operations. On the other hand, LSTM has many more (about 4 times) parameters than RNN and IndRNN, and it requires much more computation than either model.

Figure 4.6: Complexity evaluation in terms of training and testing time (sec).
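The parameter comparison above can be checked with a short count. This is a minimal sketch under stated assumptions: M is the input size, N the number of neurons per layer, bias terms are included, and the values M = 2 and N = 128 are illustrative only and are not taken from the experiments. The function names are hypothetical.

```python
def rnn_params(m, n):
    """Vanilla RNN: input weights (n x m), recurrent weights (n x n), bias (n)."""
    return n * m + n * n + n


def indrnn_two_layer_params(m, n):
    """Two stacked IndRNN layers with n neurons each.

    Layer 1: input weights (n x m), recurrent weight vector (n), bias (n).
    Layer 2: input weights (n x n), recurrent weight vector (n), bias (n).
    """
    layer1 = n * m + n + n
    layer2 = n * n + n + n
    return layer1 + layer2


def lstm_params(m, n):
    """LSTM: three gates plus the cell candidate, each shaped like a vanilla RNN."""
    return 4 * rnn_params(m, n)


# Illustrative sizes (assumed, not from the thesis).
m, n = 2, 128
print("RNN:             ", rnn_params(m, n))
print("two-layer IndRNN:", indrnn_two_layer_params(m, n))
print("LSTM:            ", lstm_params(m, n))
```

Under these assumptions the two-layer IndRNN and the RNN differ only by a few vectors of length N, while the LSTM count is exactly four times the RNN count.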

To evaluate the computational complexity of the different models, the adding problem with sequences of length 100 is used. The settings are the same as those described in Section 4.4.1. The program is implemented with Theano [61] and Lasagne, and runs on a TITAN X GPU. The training and testing times (in seconds) for the RNN, one-layer IndRNN, two-layer IndRNN and LSTM are shown in Fig. 4.6. For the RNN and IndRNN models, ReLU is used as the activation function. The results are consistent with our argument that the two-layer IndRNN takes a similar time to the RNN, while the LSTM takes much more time.
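For readers unfamiliar with the benchmark, the following sketch generates data for the adding problem in its common formulation: each time step carries a random value and a marker, exactly two steps are marked, and the target is the sum of the two marked values. The exact settings of Section 4.4.1 are not reproduced here; the function name, the uniform value range and the batch size are assumptions.

```python
import numpy as np


def make_adding_batch(batch_size, seq_len, rng):
    """Return inputs of shape (batch, seq_len, 2) and targets of shape (batch,)."""
    # First input channel: random values in [0, 1).
    values = rng.uniform(0.0, 1.0, size=(batch_size, seq_len))
    # Second input channel: marker that is 1 at exactly two positions.
    markers = np.zeros((batch_size, seq_len))
    for i in range(batch_size):
        idx = rng.choice(seq_len, size=2, replace=False)
        markers[i, idx] = 1.0
    x = np.stack([values, markers], axis=-1)
    # Target: sum of the two marked values.
    y = (values * markers).sum(axis=1)
    return x, y


rng = np.random.default_rng(0)
x, y = make_adding_batch(batch_size=4, seq_len=100, rng=rng)
print(x.shape, y.shape)
```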

4.5 Summary

In this chapter, we presented an independently recurrent neural network (IndRNN), where the neurons in one layer are independent of each other. The gradient backpropagation through time for the IndRNN has been explained, and a regulation technique has been developed to effectively address the gradient vanishing and exploding problems. Compared with existing RNN models, including LSTM and GRU, the IndRNN can process much longer sequences. The basic IndRNN can be stacked to construct a deep network, especially when combined with residual connections over layers, and the deep network can be trained robustly. In addition, the independence among neurons in each layer allows better interpretation of the neurons. Experiments on multiple fundamental tasks have verified the advantages of the proposed IndRNN over existing RNN models.
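The independence of neurons summarized above can be illustrated with a minimal NumPy sketch of one IndRNN layer. It assumes the usual IndRNN recurrence, in which the recurrent weight is a vector `u` applied elementwise rather than a full matrix, with ReLU as the activation. The function name and shapes are illustrative, and the regulation of the recurrent weights is not included.

```python
import numpy as np


def indrnn_layer(x_seq, W, u, b):
    """Run one IndRNN layer over a sequence.

    x_seq: (T, M) inputs; W: (N, M) input weights;
    u: (N,) recurrent weights, one per neuron; b: (N,) bias.
    Returns the hidden states, shape (T, N).
    """
    T = x_seq.shape[0]
    N = u.shape[0]
    h = np.zeros(N)
    outputs = np.empty((T, N))
    for t in range(T):
        # Each neuron sees the full input but only its own previous state.
        pre = W @ x_seq[t] + u * h + b
        h = np.maximum(pre, 0.0)  # ReLU
        outputs[t] = h
    return outputs


rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W = rng.normal(size=(4, 3))
u = np.full(4, 0.5)
b = np.zeros(4)
print(indrnn_layer(x_seq, W, u, b).shape)
```

Because `u * h` is an elementwise product, changing one entry of `u` changes only the trajectory of the corresponding neuron within this layer; neurons interact only through the input weights of the next layer.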

Chapter 5

Application to Skeleton-based Activity Recognition

5.1 Introduction

Human action recognition has received increasing interest in recent years owing to its wide range of applications in video analytics, robotics, health monitoring and autonomous driving. The success of deep learning in computer vision has driven the development of many deep models [88]–[96] for action recognition. Among these models, the recurrent neural network (RNN) [68], [97]–[99] is one of the most popular because of its capability of modeling sequential data. Recently, RNNs have been further augmented with attention models [100], [101] to explicitly model the observation that discriminative information is present in different body parts at different time steps. Noticeable improvements in performance have been attained [102], [103].

This chapter is concerned with two fundamental and challenging issues in attention-based RNNs for action recognition from skeleton data, where the attention weights are associated with the joints of the skeletons. First, state-of-the-art attention models, such as those presented in [102], [103], lack proper regularization of the attention weights. Such regularization should enforce that actions of the same class share similar attention weights and that the attention weights of different actions are sufficiently different, while still allowing the attention weights within a class to vary to accommodate different performing styles. For example, the leg joints in the action “kicking” would have higher attention weights than other joints, as would the arm joints in the action “boxing”. The attention on joints therefore differs between actions; that is, the attention weights of two skeleton samples of “kicking” should be more similar than those of a skeleton sample of “kicking” and a skeleton sample of “boxing”. This makes the regulation of attention weights for different action categories possible. In addition, multiple sets of joints may be discriminative for different samples of the same class of actions. For example, one subject may perform “hand waving” with their left hand while another performs it with their right hand. Therefore, while similarities exist in the attention weights for one action class, there may also be differences, which should also be considered in the attention regularization.

Second, the general principle that deeper networks extract more discriminative features is hardly implementable with a conventional RNN, such as the vanilla RNN and long short-term memory (LSTM), due to the notorious gradient vanishing and exploding problems. Attention-based RNNs for action recognition usually include only one or two fully connected layers to obtain the attention and one or two LSTM layers for classification, as in [102], [103]. Such shallow networks are hardly able to explore long-range dependencies both temporally and spatially, and a deep (e.g. multiple-layer) RNN is expected to improve the performance, as observed in the last chapter. In addition, using a single fully connected layer to estimate the attention weights tends to trap the end-to-end training in a local optimum, as shown in the experiments. Such a local optimum issue cannot be resolved by the double stochastic attention regularization [101], which aims to encourage the model to pay equal attention to every joint over a sequence of skeletons.
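To make the shallow attention setting concrete, the sketch below computes per-joint attention weights with a single fully connected layer followed by a softmax over the joints at each time step. It is an illustration only and not the formulation of [102], [103]: the input layout (T frames, J joints, C coordinates), the softmax normalization and all names are assumptions.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(skeleton, W, b):
    """Single-layer joint attention (hypothetical sketch).

    skeleton: (T, J, C) joint coordinates per frame.
    W: (J * C, J) weights and b: (J,) bias of one fully connected layer.
    Returns attention weights (T, J) and the re-weighted skeleton (T, J, C).
    """
    T, J, C = skeleton.shape
    flat = skeleton.reshape(T, J * C)
    scores = flat @ W + b              # one score per joint and frame
    weights = softmax(scores, axis=1)  # weights over joints sum to 1 per frame
    attended = skeleton * weights[:, :, None]
    return weights, attended


rng = np.random.default_rng(0)
skeleton = rng.normal(size=(3, 5, 3))
W = rng.normal(size=(15, 5))
b = np.zeros(5)
weights, attended = joint_attention(skeleton, W, b)
print(weights.shape, attended.shape)
```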

To address these two issues, this chapter proposes

• a new deep attention architecture in which the IndRNN model is adopted to build up a deep RNN for classification and multiple fully connected layers are employed to estimate the attention weights for each joint at each time step. An ablation study has shown that the proposed deep attention architecture provides much more stable and better performance than the shallow counterparts.

• a new triplet loss function to regulate the attention among different action categories. This triplet loss function is further extended with a sample-to-class distance to enforce that the intra-class attention distances are no larger than the inter-class distances, while at the same time allowing different sets of attention weights within the same class.
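The idea behind the second contribution can be illustrated with a generic triplet loss applied to attention-weight vectors. This is a sketch of the standard margin-based triplet loss, not the loss function proposed in this chapter; the squared Euclidean distance, the margin value and the toy attention vectors are assumptions, and the sample-to-class extension is not shown.

```python
import numpy as np


def attention_triplet_loss(anchor, positive, negative, margin=0.1):
    """Standard triplet loss on attention-weight vectors (generic sketch).

    anchor, positive: attention weights of two samples from the same class.
    negative: attention weights of a sample from a different class.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Zero once the same-class pair is closer than the cross-class pair by the margin.
    return max(0.0, d_pos - d_neg + margin)


# Toy attention over four joints: [left arm, right arm, left leg, right leg].
kick_a = np.array([0.10, 0.10, 0.40, 0.40])
kick_b = np.array([0.05, 0.15, 0.45, 0.35])
boxing = np.array([0.40, 0.40, 0.10, 0.10])

print(attention_triplet_loss(kick_a, kick_b, boxing))  # well-separated triplet
print(attention_triplet_loss(kick_a, boxing, kick_b))  # violating triplet
```

The first call returns zero because the two “kicking” samples already attend to the same joints; the second call is positive because the roles of the positive and negative samples are swapped.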

Experimental results have shown that the proposed deep attention architecture and the new loss function significantly improve classification performance, and that the learned attention is much more stable than that of the traditional attention models [102], [103]. In addition, the double stochastic attention regularization [101] is no longer required to train the network.
