3.1. Abstraction of the Human Body
3.1.2. Human Action Recognition (HAR)
Figure 3.4: Architecture of CPN, taken from [8].
With a ResNet-152 backbone, the model achieves the highest AP (73.7) on the COCO test-dev set.
The Cascaded Pyramid Network (CPN) [8] has two stages, GlobalNet and RefineNet (see Figure 3.4). GlobalNet is a pyramid network that successfully locates simple keypoints, such as eyes and hands, but may fail to recognise occluded or invisible keypoints. RefineNet handles these hard keypoints by integrating the feature representations from all levels of GlobalNet. CPN is a top-down approach. It achieves competitive results in the COCO challenge, with an AP of 73.0 on the test-dev set.
Currently, bottom-up methods are the fastest and can run in real time.
However, top-down approaches remain more accurate on the most commonly used datasets.
Figure 3.6: Overall pipeline of the ST-GCN method [69].
form the natural skeleton, and temporal edges, which connect the same joint over time.
This representation is one of the most relevant contributions of the work. The method applies several graph convolutional layers that gradually generate a higher-level feature map on the graph, as shown in the central part of Figure 3.6. The resulting features are classified with a standard SoftMax classifier. The model is evaluated on the NTU-RGB+D dataset, reaching accuracies of 81.5 and 88.3 on XS and XV, respectively.
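The graph construction and classification steps described above can be illustrated with a minimal sketch. It is not the ST-GCN implementation from [69], which uses its own spatial kernels and temporal convolutions. Here the whole sequence is flattened into a single graph and one symmetric-normalised graph convolution is applied, which is a simplifying assumption. The toy three-joint skeleton and the names `build_adjacency`, `graph_conv` and `classify` are hypothetical.

```python
import math

# Hypothetical toy skeleton: a 3-joint chain (0-1-2) observed over 2 frames.
NUM_JOINTS = 3
NUM_FRAMES = 2
BONES = [(0, 1), (1, 2)]  # spatial edges: natural skeleton connections


def node(t, j):
    """Index of joint j at frame t in the flattened spatial-temporal graph."""
    return t * NUM_JOINTS + j


def build_adjacency():
    """Adjacency with self-loops, spatial edges and temporal edges."""
    n = NUM_FRAMES * NUM_JOINTS
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loops
    for t in range(NUM_FRAMES):
        for a, b in BONES:  # spatial edges inside each frame
            adj[node(t, a)][node(t, b)] = adj[node(t, b)][node(t, a)] = 1.0
    for t in range(NUM_FRAMES - 1):
        for j in range(NUM_JOINTS):  # temporal edges: same joint, next frame
            adj[node(t, j)][node(t + 1, j)] = 1.0
            adj[node(t + 1, j)][node(t, j)] = 1.0
    return adj


def normalize(adj):
    """Symmetric normalisation D^-1/2 A D^-1/2 (assumed, for illustration)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][k] / math.sqrt(deg[i] * deg[k]) for k in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def graph_conv(adj_norm, feats, weight):
    """One graph convolution layer: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def softmax(scores):
    """Numerically stable SoftMax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feats, weight):
    """Graph convolution, average pooling over all nodes, then SoftMax."""
    hidden = graph_conv(normalize(build_adjacency()), feats, weight)
    pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    return softmax(pooled)
```

Calling `classify(feats, weight)` with one feature row per node (for example the 2D joint coordinates) and a weight matrix whose columns correspond to the action classes returns one probability per class. A practical model would stack several such layers and use a tensor library rather than nested lists.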
In [67], the authors propose a novel deep architecture for skeletal human action recognition that better models the spatial and temporal features of human actions. The basic structure is a spatial-temporal module (STM), which contains motif-based GCNNs with a variable temporal dense block (VTDB). Figure 3.7 shows that the STM contains a motif-based graph convolution sub-module for modelling spatial information, where a weighted adjacency matrix is used to model the action-specific spatial structure. The VTDB encodes temporal features from different ranges (T1, T2, and T3). TransLayer represents the transition layer in the VTDB. A residual connection is applied to each STM. The non-local block is used only in the last stage of the network to reduce computation. Motif-GCNs effectively fuse information from different semantic roles of physically connected and disconnected joints to learn high-order features. This model achieves improvements over the state-of-the-art methods on two challenging large-scale datasets, Kinetics and NTU-RGB+D.
Figure 3.7: Architecture of the Spatial-Temporal Module proposed in [67].
In a slightly more recent study, Zhao et al. [74] propose an end-to-end framework that combines neural networks with probabilistic models. The method is a Bayesian neural network (BNN) that combines graph convolutions with a long short-term memory (LSTM) network. The graph convolutions capture the spatial dependencies among the body's joints, while the LSTM captures the temporal dependencies between poses. The model is probabilistic because it treats its parameters as random variables, which allows it to better handle the randomness of motion data. Inspired by adversarial learning, a discriminator is added to regularise the model parameters so that the model can cope with new data. Classification is formulated as a Bayesian inference problem, which helps reduce over-fitting. The general framework of this model is presented in Figure 3.8, where the blue arrows represent the data flow at test time and the red arrows the flow at training time. The model achieved competitive results on the evaluated datasets.
Figure 3.8: Overview of the framework proposed in [74].
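Treating the parameters as random variables means that a prediction averages over many plausible parameter values instead of relying on a single estimate. The sketch below illustrates this with a Monte Carlo average. It is not the inference scheme of Zhao et al. [74]: the linear classifier, the independent Gaussian distribution over the weights and the names `sample_weights` and `bayesian_predict` are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable SoftMax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, features):
    """Class probabilities of a linear classifier for one parameter draw."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)


def sample_weights(mean, std, rng):
    """Draw one weight matrix from an assumed independent Gaussian."""
    return [[rng.gauss(m, std) for m in row] for row in mean]


def bayesian_predict(mean, std, features, num_samples=100, seed=0):
    """Monte Carlo estimate of the predictive class distribution.

    Averages the class probabilities obtained from num_samples
    parameter draws instead of using a single point estimate.
    """
    rng = random.Random(seed)
    avg = [0.0] * len(mean)
    for _ in range(num_samples):
        probs = predict(sample_weights(mean, std, rng), features)
        avg = [a + p / num_samples for a, p in zip(avg, probs)]
    return avg
```

With `std` set to zero the function reduces to an ordinary deterministic classifier. Larger values spread the parameter draws, and the averaged prediction typically becomes less confident, which is one intuition for why the Bayesian formulation can help against over-fitting.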
Table 3.1: Comparison of the methods reviewed in the 'Abstraction of the Human Body' section.

Method                        Cites  Topology                    Classifier  Dataset      Metrics             Results
Human Pose Estimation
RMPE-PAF, 2017 [5]            2134   Two-branch multi-stage CNN  –           MPII & COCO  AP (COCO test-dev)  61.8
CPN, 2018 [8]                 271    GlobalNet + RefineNet       –           COCO         AP (COCO test-dev)  73.0
Simple Baseline, 2018 [68]    275    ResNet                      –           COCO         AP (COCO test-dev)  73.7
HR-net, 2019 [60]             260    HR-net                      –           COCO & MPII  AP (COCO test-dev)  75.5
Human Action Recognition
ST-GCN, 2018 [69]             465    ST-GCN                      SoftMax     NTU-RGB+D    XS / XV             81.5 / 88.3
Bayesian GCN-LSTM, 2019 [74]  5      BNN & GCN-LSTM              Bayesian    NTU-RGB+D    XS / XV             81.8 / 89.0
Directed GNN, 2019 [56]       36     DGNN                        SoftMax     NTU-RGB+D    XS / XV             89.9 / 96.1
GCN Motif & VTDB, 2019 [67]   7      Motif-based GCN & VTDB      –           NTU-RGB+D    XS / XV             84.2 / 90.2

Shi et al. [56] present a two-stream approach that merges spatial and motion data. In this model, joints and bones constitute the spatial data: the joints are the vertices of the skeleton graph and the bones are its edges. The skeleton is represented as a directed graph, which provides information about the direction and dynamics of the limbs of the body. These directed graphs are processed by a Directed Graph Neural Network (DGNN) to extract features and perform HAR. The motion information is represented with the same graph structure. The model can be extended from processing images to videos by replacing the 2D convolutions with 3D convolutions and changing how the directed graphs are obtained. The final model outperforms state-of-the-art methods on two large-scale datasets, Kinetics and NTU-RGB+D.
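The directed-graph representation described for the DGNN of Shi et al. [56] (joints as vertices, bones as directed edges, motion carried on the same structure) can be sketched as follows. This is a minimal illustration, not the authors' code: the four-joint skeleton, the choice of edges pointing away from joint 0, and the helper names `bone_vectors`, `incidence` and `motion` are assumptions.

```python
# Hypothetical 4-joint skeleton; each bone is a directed edge
# (source joint -> target joint) pointing away from root joint 0.
BONES = [(0, 1), (1, 2), (1, 3)]


def bone_vectors(joints):
    """Edge attribute of each bone: target coordinate minus source coordinate."""
    return [tuple(t - s for s, t in zip(joints[src], joints[dst]))
            for src, dst in BONES]


def incidence(num_joints):
    """Source and target incidence matrices (joints x bones) of the graph."""
    source = [[0] * len(BONES) for _ in range(num_joints)]
    target = [[0] * len(BONES) for _ in range(num_joints)]
    for e, (src, dst) in enumerate(BONES):
        source[src][e] = 1  # bone e leaves joint src
        target[dst][e] = 1  # bone e enters joint dst
    return source, target


def motion(sequence):
    """Frame-to-frame differences of per-frame vectors (joints or bones)."""
    return [[tuple(b - a for a, b in zip(u, v)) for u, v in zip(prev, nxt)]
            for prev, nxt in zip(sequence, sequence[1:])]
```

Given a list of frames, each a list of joint coordinates, `[bone_vectors(f) for f in frames]` yields the bone stream, and applying `motion` to the joint or bone sequences yields the corresponding motion streams, all indexed by the same vertices and edges.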
Table 3.1 summarises the main criteria for comparing the HPE and HAR methods reviewed above. One of the most relevant criteria is 'Results', the values of the evaluation metrics on a dataset shared among the works. The most accurate works are the Directed Graph Neural Network [56] for HAR and HR-net [60] for HPE. The XS and XV metrics, explained in Section 2.2.2.2, are the official way to report results on the NTU-RGB+D dataset. Another interesting criterion is 'Cites', which indicates the relevance of each work.