3.1. Abstraction of the Human Body
3.1.1. Human Pose Estimation (HPE)
Owing to the rapid development of deep learning, HPE has progressed to the point of being used in real-world applications [10, 7].
One of the most important works in recent years is RMPE-PAF, presented by Cao et al. [5], a bottom-up deep learning method. It runs in real time on images with multiple people and is the basis for much subsequent research. RMPE-PAF relies on a non-parametric representation called 'part affinity fields', which allows the method to learn to associate body parts with the corresponding person in the image. The general pipeline of the method is illustrated in Figure 3.1. The inputs to the system are colour images (Figure 3.1(a)), and the outputs are the 2D keypoint locations for each person in the image (Figure 3.1(e)). To obtain this result, a CNN simultaneously predicts a set of 2D confidence maps of body part locations (Figure 3.1(b)) and a set of 2D vector fields, the part affinity fields, which encode the degree of association between parts (Figure 3.1(c)). The confidence maps and affinity fields are then parsed by a greedy algorithm that associates the detected parts and assembles the poses (Figure 3.1(d)).
Figure 3.1: Overall process of RMPE-PAF [5].
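The association step can be illustrated with a minimal sketch. It assumes that the affinity between two candidate parts is scored by sampling the field along the segment that joins them, and that pairs are then accepted greedily from best to worst. The function names, the toy field and the number of samples are illustrative assumptions, not the implementation of [5].

```python
import math

def paf_score(paf, a, b, samples=10):
    """Average alignment between the field and the segment a -> b.

    paf[y][x] is a (vx, vy) vector; a and b are (x, y) candidate locations.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm  # unit vector of the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy  # dot product with the field
    return total / samples

def greedy_match(paf, parts_a, parts_b):
    """Pair candidates greedily, best score first, using each part at most once."""
    scored = sorted(
        ((paf_score(paf, a, b), i, j)
         for i, a in enumerate(parts_a)
         for j, b in enumerate(parts_b)),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score <= 0 or i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
    return pairs

# Toy field (hypothetical data): two horizontal limbs on rows 1 and 3.
ROWS, COLS = 5, 8
paf = [[(0.0, 0.0) for _ in range(COLS)] for _ in range(ROWS)]
for x in range(COLS):
    paf[1][x] = (1.0, 0.0)
    paf[3][x] = (1.0, 0.0)

shoulders = [(1, 1), (1, 3)]  # one candidate per person
elbows = [(6, 3), (6, 1)]     # listed in the opposite order on purpose
pairs = greedy_match(paf, shoulders, elbows)
```

In this toy case each shoulder is paired with the elbow on its own row, because a segment that crosses rows passes through cells where the field is zero and therefore scores lower.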
Figure 3.2 shows part of the RMPE-PAF architecture, specifically its CNN. The network has two branches, distinguished by colour in Figure 3.2: branch 1 in beige and branch 2 in blue. Branch 1 predicts the confidence maps, while branch 2 predicts the affinity fields. Both branches are iterative: each stage t refines the predictions of the previous stage. Intermediate supervision is applied after each stage.
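Intermediate supervision can be sketched as a loss that sums the error of both branches over all stages. The snippet below assumes a plain squared-error loss on flattened maps; the loss form, the function names and the numbers are illustrative assumptions and do not reproduce the exact objective of [5].

```python
def l2_loss(pred, target):
    """Sum of squared differences between two flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def multi_stage_loss(stage_outputs, target_maps, target_pafs):
    """Add the loss of both branches at every stage (intermediate supervision)."""
    total = 0.0
    for pred_maps, pred_pafs in stage_outputs:
        total += l2_loss(pred_maps, target_maps)  # branch 1: confidence maps
        total += l2_loss(pred_pafs, target_pafs)  # branch 2: affinity fields
    return total

# Hypothetical flattened targets and two stages of predictions.
target_maps = [1.0, 0.0]
target_pafs = [0.5, -0.5]
stage_outputs = [
    ([0.5, 0.5], [0.0, 0.0]),       # stage 1: coarse prediction
    ([0.75, 0.25], [0.25, -0.25]),  # stage 2: refined prediction
]
loss = multi_stage_loss(stage_outputs, target_maps, target_pafs)
```

Because every stage contributes to the total, early stages receive a direct training signal instead of relying only on gradients propagated back from the last stage.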
This method is evaluated on the MPII dataset and on the COCO 2016 keypoint challenge dataset. On MPII, the mAP was measured according to the PCKh threshold, achieving the best results in precision and execution time among bottom-up methods. On COCO, the object keypoint similarity (OKS) was used to calculate the mAP, again achieving the best results among bottom-up methods.
Figure 3.2: Architecture of the two-branch multi-stage CNN [5].
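The two measures behind these scores can be sketched as follows. PCKh counts a keypoint as correct when it lies within a fraction of the head size from the ground truth, and OKS maps the distance of each labelled keypoint to a similarity between 0 and 1, normalised by the object area and a per-keypoint constant. The helper names, the sample coordinates and the constants `k` are placeholders chosen for illustration, not the official values of either benchmark.

```python
import math

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of keypoints within alpha * head_size of the ground truth."""
    hits = sum(
        math.dist(p, g) <= alpha * head_size for p, g in zip(pred, gt)
    )
    return hits / len(gt)

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity averaged over the labelled keypoints."""
    total, count = 0.0, 0
    for p, g, v, k_i in zip(pred, gt, visible, k):
        if v <= 0:  # unlabelled keypoints are ignored
            continue
        d2 = (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2
        total += math.exp(-d2 / (2.0 * area * k_i ** 2))
        count += 1
    return total / count if count else 0.0

# Hypothetical annotations for one person with three keypoints.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(10.0, 10.0), (23.0, 24.0), (30.0, 50.0)]
visible = [2, 2, 0]      # the third keypoint is not labelled
k = [0.05, 0.05, 0.05]   # placeholder per-keypoint constants

pckh_score = pckh(pred, gt, head_size=12.0)  # threshold is 6 pixels
oks_score = oks(pred, gt, visible, area=400.0, k=k)
```

The mAP is then obtained by thresholding these per-person scores and averaging the precision over the thresholds, which this sketch leaves out.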
One of the most valuable contributions of Cao et al. [5] is the short execution time, reaching 8.8 fps on a video with 19 people, although the method achieves an accuracy of only 61.8. The most accurate methods are usually those based on a top-down approach; according to [7], the most accurate are the Deep High-Resolution Network [60], followed by the simple baseline for HPE [68] and the Cascaded Pyramid Network [8].
Deep High-Resolution Representation Learning for HPE, known as HR-net, was proposed by Sun et al. [60]. HR-net addresses HPE by learning reliable high-resolution representations.
The HR-net architecture (Figure 3.3) starts with a high-resolution sub-network; high-to-low resolution sub-networks are then added gradually, one by one, to form further stages.
The multi-resolution sub-networks are connected in parallel. Multi-scale fusion is performed repeatedly, so that each of the high-to-low resolution representations receives information from the others. As a result, the predicted heat maps are potentially more accurate. HR-net results are reported on the COCO keypoint detection challenge (75.5 AP on the test-dev set) and on the MPII dataset (92.3 PCKh@0.5 on the test set). In addition, the model shows its superiority in the pose tracking task on the PoseTrack dataset.
Figure 3.3: Architecture of HR-net [60].
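The exchange of information between parallel branches can be illustrated with a toy fusion step. This sketch replaces the learned layers with fixed operations (2×2 average pooling to go down in resolution, nearest-neighbour repetition to go up) and uses only two branches; these simplifications and all names are assumptions made for illustration, not the layers of [60].

```python
def downsample(x):
    """Halve the resolution with 2x2 average pooling (stand-in for a learned layer)."""
    return [
        [
            (x[2 * r][2 * c] + x[2 * r][2 * c + 1]
             + x[2 * r + 1][2 * c] + x[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(len(x[0]) // 2)
        ]
        for r in range(len(x) // 2)
    ]

def upsample(x):
    """Double the resolution by nearest-neighbour repetition."""
    return [
        [x[r // 2][c // 2] for c in range(2 * len(x[0]))]
        for r in range(2 * len(x))
    ]

def add(a, b):
    """Element-wise sum of two maps of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(high, low):
    """Each branch receives the other branch resampled to its own resolution."""
    new_high = add(high, upsample(low))
    new_low = add(low, downsample(high))
    return new_high, new_low

# Hypothetical single-channel feature maps at two resolutions.
high = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]
low = [[10.0, 20.0], [30.0, 40.0]]
new_high, new_low = fuse(high, low)
```

Both outputs keep their original resolution, so the high-resolution branch is never discarded and is enriched at every fusion.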
The simple baseline model, as the name implies, is a simple way to solve HPE [68].
The idea behind this model is to assess how good a simple method for HPE and person tracking can be. The answer is a simple but very effective method and architecture, intended to serve as a starting point for future research on HPE and person tracking. The proposed HPE model adds a few deconvolutional layers to a backbone network such as ResNet [20]. The backbone can be changed to improve the results; ResNet was chosen because it is one of the most common backbone networks for image feature extraction. The added layers are placed after the last convolution stage, usually called C5. Three deconvolutional layers with batch normalisation and ReLU are used, each with 256 filters and a 4×4 kernel. A final 1×1 convolutional layer is added to generate the heatmap for each keypoint. For training, the backbone is initialised by pre-training on ImageNet [12].
Figure 3.4: Architecture of CPN, taken from [8].
Using ResNet-152 as the backbone, the model achieves its highest AP (73.7) on the COCO test-dev set.
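The effect of the three deconvolutional layers on resolution can be checked with a short calculation. The sketch assumes a stride of 2 and a padding of 1 for each 4×4 deconvolution, a backbone whose C5 stage has a total stride of 32, and a 256×192 input; the stride, padding and input size are assumptions made for illustration and are not stated in the text above.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel

def heatmap_size(height, width, backbone_stride=32, num_deconv=3):
    """Spatial size of the heatmaps after the backbone and the deconvolutions."""
    h, w = height // backbone_stride, width // backbone_stride  # C5 resolution
    for _ in range(num_deconv):
        h, w = deconv_out(h), deconv_out(w)  # each layer doubles the size
    return h, w

size = heatmap_size(256, 192)  # assumed input resolution
```

Under these assumptions the C5 feature map of 8×6 is upsampled to 64×48, that is, one quarter of the input resolution in each dimension.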
The network structure named Cascaded Pyramid Network (CPN) [8] comprises two stages: GlobalNet and RefineNet (see Figure 3.4). GlobalNet is a pyramid network that can successfully locate simple keypoints, such as eyes and hands, but may fail to recognise occluded or invisible keypoints accurately. RefineNet tries to handle these hard keypoints by integrating the levels of the GlobalNet feature representations. CPN is a top-down approach. It achieves competitive results on the COCO challenge, with 73.0 AP on the test-dev set.
Currently, bottom-up methods are the fastest and can be executed in real time.
However, top-down approaches remain more accurate on the most commonly used datasets.