Error concealment method for HEVC
based on motion vector redundancy
by
Domingo Guzmán Estrada
Submitted as partial requirement to obtain the degree of Master of Science, in the area of Computational Sciences
at the
Instituto Nacional de Astrofísica, Óptica y Electrónica
February, 2018
Tonantzintla, Puebla
Supervised by:
Claudia Feregrino Uribe, PhD
Alicia Morales Reyes, PhD
© INAOE 2018
All rights reserved
The author grants INAOE permission to reproduce and distribute this document
Contents
Nomenclature
Abstract
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Methodology
1.3.1 Motion information acquisition
1.3.2 Motion information embedding and extraction
1.3.3 Error concealment
1.4 Document structure
2 Background
2.1 Video background
2.2 Human visual system
2.3 YCbCr color space
2.4 Subsampling
2.4.1 Frame and field coding
2.5.1 H.26X
2.5.2 MPEG family
2.6 High efficiency video coding
2.6.1 HEVC coding process
2.6.2 Picture partitioning
2.6.3 Intra prediction
2.6.4 Inter prediction
2.6.5 Transform coding
2.6.6 Reconstruction and buffering
2.6.7 Entropy coding
2.7 Error resilience
2.8 Video evaluation metrics
2.8.1 PSNR
2.8.2 SSIM
2.9 Chapter summary
3 Related work
3.1 Temporal approaches
3.1.1 MAD
3.1.2 Bidirectional motion vector tracking
3.1.3 Sub-block based methods
3.2 Data hiding based methods
3.2.1 Information to embed
3.2.2 Where to embed
3.3 Chapter discussion
4.1 Information to embed
4.1.1 Coding unit selection
4.1.2 Prediction unit selection
4.2 Coding scheme
4.2.1 Bitstring coding
4.3 Information embedding and retrieval
4.3.1 Information per level
4.3.2 Embedding
4.3.3 Information retrieval
4.4 Block reconstruction
4.4.1 Structure assignment
4.4.2 Candidate list construction
4.4.3 Candidate evaluation
4.5 Chapter discussion
5 Experimental assessment and results analysis
5.1 Experimental framework
5.1.1 Dataset
5.1.2 Quality evaluation metrics
5.2 Experimental assessment
5.2.1 Result comparison
5.3 Results analysis
6 Conclusions and future work
6.1 Future work
List of Figures
1.1 Example of a frame partitioning into slices and CU/CTU. Each CU/CTU is sub-partitioned into several SCU. For each SCU a coding mode between intra and inter is selected.
2.1 Visual difference between 4:2:0, 4:2:2 and 4:4:4 subsampling schemes in increased compression order. At higher sampling, 4:4:4, there is no compression, meanwhile a lower subsampling, 4:2:0 achieves compression in the chroma components.
2.2 Field coding example, frame (c) shows the result of interlacing field 1 (a) and field 2 (b).
2.3 HEVC basic coding diagram summarized into 5 main steps, (a), (b), (c), (d) and (e).
2.4 HEVC group of pictures. An example of picture types organization and dependencies between them are shown.
2.5 HEVC frame partitioning at CU level. (a) Example of frame partitioning into LCUs. (b) Quad-tree structure example of a LCU, [Aguirre-Ramos et al., 2014].
2.6 Coding tree unit composed of luma, Cr, and Cb coding blocks.
2.7 Depending on the CU prediction mode (intra or inter) different PU
2.8 35 angular modes available in HEVC intra coding, including planar and DC mode.
2.9 Example of temporal and spatial motion vector candidates. Only motion vectors from spatial neighboring blocks to the left and above the current block are considered as spatial MVP candidates; this is because the blocks to the right and below the current block are not yet decoded.
3.1 OBMA is calculated through MAD between the outer boundaries of the candidate and lost blocks respectively.
3.2 BMVT example. Blocks in frame n−1 are translated to frame n using MV extrapolation.
3.3 Sub-block partitioning with their extrapolated MVs.
3.4 Spatial and temporal motion vector dependencies; dotted arrows signal disabled MV dependencies.
3.5 CU partitioning example in HEVC. (a) Example of a LCU partitioning structure. (b) Associated TCU of the structure shown in (a).
4.1 HEVC coding diagram with the stages of the motion information extraction (a) and backup embedding (b) of the presented method.
4.2 HEVC decoding diagram with the stages of the motion information extraction (a) and backup embedding (b) of the presented method.
4.3 Frame partitioning example of an HD sequence into CU. (a) CU partitioned, unsuitable for the proposed selection scheme. (b) CU without partitions, suitable for the proposed method. Image obtained from a sequence generated by Digiturk cable channel.
4.4 Different PU partitions available on inter prediction. Gray zones
4.5 Quad-tree structure partitioning example. (a) LCU recursively partitioned into small CU, including an example of PU partitioning in gray zones. (b) The corresponding tree of the LCU, including its associated PUs and TUs.
4.6 PU partitioning into 4x4 coefficient blocks. The 1st and 2nd level of embeddable blocks are illustrated.
4.7 Flow chart that represents the proposed embedding algorithm; the process is repeated for each TU within the host CU until the complete bitstring has been embedded.
4.8 TU inverse diagonal order used for data embedding and retrieval. (a) Example of CU partitioning into TUs. (b) Partitioning of the gray 32x32 TU into 4x4 blocks and inverse diagonal order used to perform embedding. (c) Embedding example of a 4x4 sub block.
4.9 Block reconstruction flow chart for missing CUs, including restoration even when the host CU is missing. Notice that making reconstruction with backup information is faster than in an uninformed way.
4.10 Generic structure examples. The chosen structure can be modified depending on the CU size.
4.11 Prioritized PUs depending on the partition and position of the concealed PU. Arrows signal the direction over which the 4 candidates are prioritized.
4.12 Reach Forward Technique, during the concealment of the X PU part; correctly received neighbors (C) are used as candidates, arrows signal the direction of the most important neighbor. (a) Normal behavior of the candidate building process. (b) Using the RFT, the next block along the same direction of the missing most important neighbor is used as substitution.
5.1 PSNR variation during frame concealment of the sequence YachtRide at 2% of random losses.
5.2 Visual quality comparison of a concealed frame in RaceHorse sequence at 1% damage level. (a) Original frame without losses. (b) Zero MV, PSNR = 31.5. (c) [Aguirre-Ramos et al., 2014], PSNR = 34.6. (d) Proposed method, PSNR = 36.3. Highlighted zones in (c) and (d) show the reconstruction quality differences in the frame.
List of Tables
3.1 PUs structures and their corresponding VLCs.
4.1 Variable-length codes included in the TMuC [I.-K. Kim, 2012]. Bold codes represent the selected PU structures for the proposed method.
5.1 RAW YUV video sequences used for experimentation and their corresponding main characteristics and configuration parameters.
5.2 Motion characteristics of included sequences for experimentation. Notice that UHD version of some sequences describes the same kind of objects and movement, but in different resolutions.
5.3 Characteristic comparison of more related works. (1) Aguirre-Ramos et al., 2014, (2) Carreira et al., 2014a, (3) Zero MV.
5.4 Bitrate increment comparison between the presented method and the one of [Aguirre-Ramos et al., 2014].
5.5 Concealment results comparison simulating 1%, 2%, 5%, 10% and 20% random packet losses. Proposed method, (1) Aguirre-Ramos et al., 2014, (2) Zero MV.
Nomenclature
AECOD Adaptive Error Concealment Order Determination
ALF Adaptive Loop Filtering
MVP Motion Vector Prediction
AMVP Advanced Motion Vector Prediction
AVC Advanced Video Coding
B-frame Bi-Predictive Frame
BD Boundary Distortion
BMA Boundary Matching Algorithm
BMT Backward Motion Vector Tracking
BMVT Bidirectional Motion Vector Tracking
BWPS Block-based Weighted Pixel-value Superposition
CABAC Context-based Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable-Length Coding
Cb Blue Chrominance
Cr Red Chrominance
Cg Green Chrominance
CMV Candidate Motion Vector
CMVs Candidate Motion Vectors
CU Coding Unit
DCT Discrete Cosine Transform
DMVE Decoder Motion-Vector Estimation
DST Discrete Sine Transform
DWT Discrete Wavelet Transform
EC Error Concealment
FMO Flexible Macroblock Ordering
FMT Forward Motion Vector Tracking
FPS Frames per Second
GOP Group of Pictures
HD High Definition
UHD Ultra High Definition
HEVC High Efficiency Video Coding
HM HEVC Test Model
HVS Human Visual System
ICMVs Interpolated Candidate Motion Vectors
IDCT Integer Discrete Cosine Transform
IP Intra-Period
ITU International Telecommunications Union
ITU-R ITU Radiocommunication Sector
JCT-VC Joint Collaborative Team on Video Coding
JVT Joint Video Team
LCU Large Coding Unit
MB Macroblock
MBMA Modified Boundary Matching Algorithm
MPEG Moving Picture Experts Group
MSE Mean Squared Error
MV Motion Vector
MVs Motion Vectors
OBMA Outer Boundary Matching Algorithm
PSNR Peak Signal-to-Noise Ratio
PU Prediction Unit
QoS Quality of Service
RBMA Refined Boundary Matching Algorithm
RFT Reach-Forward Technique
RS Redundant Slice
SAD Sum of Absolute Differences
SAO Sample Adaptive Offset
SSIM Structural Similarity
TEC Temporal Error Concealment
TMuC Test Model under Consideration
TU Transform Unit
VCEG Video Coding Experts Group
VLC Variable-Length Coding
VQEG Video Quality Experts Group
WPP Wavefront Parallel Processing
Gratitude to Consejo Nacional de Ciencia y Tecnología (CONACYT) for its sponsorship of this work under the CVU grant number 702715, and to Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) for providing the necessary support and elements for my personal development.
Abstract
Several video compression standards have been produced with the aim of reducing transmission bit rates without decreasing the video quality. The High Efficiency Video Coding (HEVC) standard, developed by the Joint Collaborative Team on Video Coding (JCT-VC), delivers improved coding efficiency relative to previous standards such as H.264/AVC. HEVC provides improved compression ratios mainly for High Definition (HD) and Ultra High Definition (UHD) video; however, such gains come with large increases in computational complexity and therefore long processing times.
Because of the HEVC design, the temporal and spatial dependency of motion information is higher than in the H.264/AVC standard, which reduces robustness against the loss of information caused by transmission errors or data losses in storage. Consequently, current error concealment methods might not be prepared to work with these new features.
This work addresses the development of an Error Concealment (EC) method for improving the resilience of the HEVC standard by backing up a set of inter-coded Coding Units (CUs), including their corresponding motion information and partitioning. CUs are selected by choosing quad-tree structures and PU shapes with lower partitioning depth, because losing larger structures represents a significant degradation of the visual quality, and are then transmitted as redundancy. Redundant information is embedded in a reversible way within Discrete Cosine Transform (DCT) coefficients, introducing no visible distortion.
A lost CU can be restored at the decoding stage using the backup structure and motion information or, when both the coding unit and its backup information are lost, through a generic structure, choosing between neighboring candidates with the Outer Boundary Matching Algorithm (OBMA).
Simulations show that the proposed method can efficiently restore lost CUs using redundant motion information, improving error resilience and quality by up to 3 dB in Peak Signal-to-Noise Ratio (PSNR) at the frame level when compared to the related state-of-the-art proposal, and by up to 4 dB compared to the original HEVC standard without redundant embedded information.
Chapter 1
Introduction
1.1 Motivation
The increasing demand for digital video in areas such as entertainment, education, medicine and science has led to a growing need for bandwidth and storage capacity. To meet current video compression requirements, the International Telecommunications Union (ITU) and the ISO/IEC Moving Picture Experts Group (MPEG), working together as the Joint Collaborative Team on Video Coding (JCT-VC), developed the High Efficiency Video Coding (HEVC) standard, also known as H.265 [Sullivan et al., 2012]. First published in March 2013, HEVC achieves roughly double the data compression ratio at the same level of quality compared with its predecessor, the H.264 Advanced Video Coding (AVC) standard [Wiegand et al., 2003].
The development of HEVC focused on supporting higher resolutions, up to 8K, and on making better use of the parallel architectures that predominate nowadays. To meet these requirements, HEVC introduces tools and characteristics that were not present in previous standards, most notably: the replacement of the macroblock by the larger quad-tree structure, which is optimized for video compression at high resolutions; the removal of Flexible Macroblock Ordering (FMO), used as an error resilience tool in previous standards; and the substantial improvement of inter coding, including the replacement of Motion Vector Prediction (MVP) by the new Advanced Motion Vector Prediction (AMVP), which is based on motion vector competition.
The new characteristics and tools included in the HEVC standard bring further compression efficiency and a relative performance increase, but they also come with disadvantages, such as higher complexity, which translates into longer computing time [Correa et al., 2012], and lower error resilience, because little effort has been made to ensure successful transmission in real scenarios. Error resilience and performance over lossy networks were analyzed by [Oztas et al., 2012], who showed that HEVC has lower error resilience than the H.264/AVC standard in terms of the number of frames affected when information is missing or erroneous. The Temporal Motion Vector Prediction (TMVP), included in the AMVP, increases the number of dependencies between motion vectors (MVs). The error propagation caused by the TMVP was analyzed by [Li et al., 2011], who showed that, due to AMVP vulnerabilities, an error may lead to an immediate failure in the decoding of the current frame and of all subsequent dependent frames.
In error-prone environments such as network transmissions or storage devices, it is common for some information of a video sequence to get lost due to factors such as network congestion, noise in transmission lines, or failures in storage media. Error Concealment (EC) methods are sets of algorithms that try to conceal the unwanted effects produced by missing data by reconstructing the lost areas of each frame of the sequence, in order to provide reliable video presentation, mainly in sensitive settings such as real-time applications.
EC methods can be classified depending on the information source: spatial, temporal or hybrid. Spatial methods use information coming from neighboring regions, temporal methods use information from previous correctly decoded frames, and hybrid methods combine techniques from both approaches to conceal lost areas. The decision over which method to use is usually based on the evaluation of a distortion function. Moreover, there are techniques based on data hiding, Data Hiding Error Concealment (DHEC), which create a communication channel between encoder and decoder by embedding redundant information and using it as a backup mechanism to perform concealment.
In this work, a DHEC method to improve the error resilience of HEVC is proposed. To reduce error propagation due to incorrect MV predictions produced by missing data, redundant MVs are used as a backup mechanism with the aim of increasing error robustness. A set of redundant MVs, together with the corresponding quad-tree structure and Prediction Unit (PU) shape, is coded into a host Coding Unit (CU). The proposed mechanism can conceal missing CUs using the backup motion information, or with a generic structure when both the current block and the block carrying the backup are lost.
1.2 Objectives
The general objective is to develop an error concealment method for HEVC using motion information as redundancy. A set of motion vectors, including the corresponding quad-tree and prediction unit structure, is transmitted as redundancy and used to conceal missing areas of the video; in this way, it is possible to reduce the propagation of mismatched MV predictions. To accomplish this, the following specific objectives are necessary:
• Identify and select redundant motion information.
• Explore and determine the best stage of the encoding process to embed information.
• Design a coding method for the motion information which minimizes the amount of introduced distortion.
• Design an update detection method to be applied during the decoding process to detect and conceal errors produced by the loss of information.
1.3 Methodology
Among the enhanced features of HEVC compared with its predecessor, the H.264/AVC standard, is the replacement of the MVP by the new AMVP, which is based on motion vector competition (MVC) and in which the best predictor for each motion block is signaled to the decoder. Motion vector competition determines which Motion Vector Predictor (MVP) from a list of Motion Vector Predictors (MVPs) is used for motion vector derivation. Another improvement of HEVC is the replacement of the direct and skip modes by block merging, which is used to derive motion information for large areas without changes or with regular motion.
The proposed DHEC method involves acquiring useful motion information, embedding the redundant information, and concealing errors, as described below:
1. Motion information acquisition. Involves the selection of motion vectors based on the partitioning of their corresponding PUs, selecting those with the smallest number of partitions.
2. Motion information embedding and extraction. This process evaluates the amount of available storage, embeds the information through the modification of certain Discrete Cosine Transform (DCT) coefficients, and extracts it before frame decoding.
3. Error concealment. Comprises the CU reconstruction using the backup motion information, or generic information when necessary.
1.3.1 Motion information acquisition
In the frame partitioning of HEVC, each frame is divided into slices, which can be processed separately from any other region of the same frame (see section 2.6.2). At the same time, a frame is partitioned into square blocks of the same size, commonly 64x64, named Coding Units (CUs). Increasing the size of the coding units up to 64x64, compared with the previous standard H.264/AVC, which uses blocks of at most 16x16, is advantageous for high-resolution video because it allows better compression levels. Each CU, or Large Coding Unit (LCU), is then recursively sub-partitioned following a quad-tree structure that adapts to the structure of the objects in the image.
At the smallest level of the quad-tree subdivisions, the encoder determines the coding mode, intra or inter prediction, for each Small Coding Unit (SCU). If intra picture prediction is chosen, one of the 35 supported spatial intra prediction modes must be selected and signaled; if inter prediction is selected, motion estimation is performed and the CU is sub-partitioned into so-called Prediction Blocks (PBs). Figure 1.1 shows an example of frame partitioning into slices and several Coding Tree Units (CTUs). Each inter-coded PB has at least one associated MV.
The AMVP [Sze et al., 2014] is useful for achieving high compression levels in the inter coding process of HEVC. It derives the best Motion Vector Predictor (MVP) for each inter-coded SCU from a previously constructed list of spatial and temporal MVs. The candidate list construction includes the following candidates:
1. Up to two spatial candidate MVPs, derived from five spatial neighboring blocks.
2. One temporal candidate MVP, derived from two temporal, co-located blocks when both spatial candidate MVPs are not available or they are identical.
3. Zero motion vectors, when either the spatial or temporal candidates are not available.
Figure 1.1: Example of a frame partitioning into slices and CU/CTU. Each CU/CTU is sub-partitioned into several SCU. For each SCU a coding mode between intra and inter is selected.
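The candidate list construction listed above can be illustrated with a minimal Python sketch. It is a simplification and not the normative HEVC derivation: the function name `build_candidate_list`, the list size of two, and the rule of taking the first available neighbors in the given order are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]   # (horizontal, vertical) displacement
LIST_SIZE = 2          # assumed size of the final candidate list


def build_candidate_list(spatial: List[Optional[MV]],
                         temporal: List[Optional[MV]]) -> List[MV]:
    """Simplified AMVP-style list; None marks an unavailable neighbor."""
    candidates: List[MV] = []
    # 1. Up to two distinct spatial candidates from the spatial neighbors.
    for mv in spatial:
        if len(candidates) == LIST_SIZE:
            break
        if mv is not None and mv not in candidates:
            candidates.append(mv)
    # 2. One temporal candidate, used only when fewer than two distinct
    #    spatial candidates were found (missing or identical).
    if len(candidates) < LIST_SIZE:
        for mv in temporal:
            if mv is not None:
                candidates.append(mv)
                break
    # 3. Zero motion vectors fill any remaining positions.
    while len(candidates) < LIST_SIZE:
        candidates.append((0, 0))
    return candidates


# Five spatial neighbors (two unavailable, one duplicate), two co-located blocks.
print(build_candidate_list([(1, 2), None, (1, 2), (3, 4), None], [(9, 9), None]))
```

In this sketch the temporal candidate is consulted only when the spatial neighbors cannot fill the list, which mirrors the dependency between steps 1 and 2 described in the text.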
Because CUs cover large areas of the video, mainly at HD and UHD resolutions, the loss of one of them might represent a critical degradation of the visual quality. Moreover, because a CU with few partitions has a small amount of associated motion information, including the MVs, PU, and CU structure, the corresponding backup information requires little space for embedding. The proposed motion vector selection scheme, performed during the coding process, analyzes all CUs and chooses those with the fewest partitions.
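As a rough illustration of this selection idea, the following Python sketch ranks CUs by how little they are partitioned. The `CodingUnit` fields, the ranking key, and the `max_selected` limit are hypothetical; the actual selection criteria of the method are detailed in Chapter 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodingUnit:
    index: int      # position of the CU in the frame (hypothetical field)
    depth: int      # quad-tree depth, 0 means the CU is not split
    pu_parts: int   # number of PU partitions inside the CU


def select_cus_for_backup(cus: List[CodingUnit], max_selected: int) -> List[CodingUnit]:
    """Prefer CUs with the lowest quad-tree depth, then the fewest PU parts."""
    ranked = sorted(cus, key=lambda cu: (cu.depth, cu.pu_parts, cu.index))
    return ranked[:max_selected]


frame_cus = [CodingUnit(0, 2, 4), CodingUnit(1, 0, 1),
             CodingUnit(2, 0, 2), CodingUnit(3, 1, 1)]
print([cu.index for cu in select_cus_for_backup(frame_cus, 2)])
```

Sorting by depth first reflects the observation that unsplit CUs cover the largest areas while carrying the least motion information to back up.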
1.3.2 Motion information embedding and extraction
This method is based on data hiding, which means that redundant information is embedded during coding through the modification of specific DCT coefficients and transmitted to the decoder, creating a communication channel between the encoder and decoder.
In the Transform Units (TUs) resulting from the motion compensation process performed in inter coding, information is highly correlated, which complicates compression. Most video coding standards, including H.264 and HEVC, use transform coding to decorrelate the spatial information and then apply quantization, which maps a range of values to a single value, treating the smallest coefficients as non-significant and discarding them. The resulting large number of zero coefficients provides a suitable environment for a reversible embedding scheme.
In the presented method, information is embedded into the Transform Units (TUs) of the blocks resulting from the motion compensation process. The available embedding space is measured before embedding by counting the number of zero coefficients. If this number is enough to store the backup motion information, the CU is marked as available for embedding; otherwise, the CU is marked as not available.
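The capacity check described above can be sketched in a few lines of Python. The function names and the `bits_per_zero` parameter are assumptions for illustration: the sketch assumes every zero coefficient can carry a fixed number of bits, whereas the actual embedding levels are defined in Chapter 4.

```python
from typing import List

Block = List[List[int]]  # quantized coefficients of one TU


def zero_coefficients(tu: Block) -> int:
    """Count the zero-valued coefficients of a TU."""
    return sum(1 for row in tu for coeff in row if coeff == 0)


def available_for_embedding(tus: List[Block], payload_bits: int,
                            bits_per_zero: int = 1) -> bool:
    """Mark a CU as available when its TUs can hold the backup bitstring.

    bits_per_zero is a hypothetical capacity per zero coefficient.
    """
    capacity = bits_per_zero * sum(zero_coefficients(tu) for tu in tus)
    return capacity >= payload_bits


tu = [[5, 0, 0, 0],
      [1, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]
print(zero_coefficients(tu), available_for_embedding([tu], payload_bits=12))
```

Performing the check before embedding means that a CU is either used as a complete host or skipped, so a backup bitstring is never stored partially.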
1.3.3 Error concealment
EC methods can be classified according to the source of information used to perform concealment. Data Hiding (DH) based methods, also known as hybrid, take information that is available only during coding and hide it in different ways. Later, during decoding, this information is used to reconstruct the missing blocks of each frame. However, due to factors such as the storage available for embedding, there are situations in which the backup information is not available on the decoding side. In such cases, one of the most useful approaches is Temporal Error Concealment (TEC), which uses spatial and temporal information coming from the neighbor blocks to perform concealment. The Mean of Absolute Difference (MAD) between the boundaries of the lost block and the neighbor blocks is the main technique used to select a neighbor to replace a missing block. The EC method performed on the decoding side consists of the following steps:
1. Lost coding unit detection. A missing CU detection scheme identifies all missing CUs so that they can be restored before frame decoding.
2. Information retrieval. After CU dequantization, the decoding step that reverses the quantization of all the CUs, the embedded information is retrieved from the DCT coefficients, and the coefficients are restored to their original values, leaving the image without modifications.
3. CU restoration. All the missing inter-coded CUs are reconstructed using the backup motion information if available; otherwise, the missing CU is reconstructed using a generic structure and the replacement block is obtained by applying OBMA over a defined set of candidate blocks.
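The fallback in step 3 can be illustrated with a minimal Python sketch of boundary matching. It assumes that frames are plain 2-D lists of luma samples, that all pixels surrounding the lost block were received correctly, and that the candidate MVs are already known; the function names are illustrative and do not come from the HM reference software.

```python
from typing import List, Optional, Sequence, Tuple

Frame = List[List[int]]
MV = Tuple[int, int]


def outer_boundary(frame: Frame, x: int, y: int, size: int) -> List[Optional[int]]:
    """Pixels one sample outside the four sides of a size x size block."""
    h, w = len(frame), len(frame[0])
    pixels: List[Optional[int]] = []
    for i in range(size):
        for px, py in ((x + i, y - 1), (x + i, y + size),
                       (x - 1, y + i), (x + size, y + i)):
            inside = 0 <= px < w and 0 <= py < h
            pixels.append(frame[py][px] if inside else None)
    return pixels


def mad(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Mean absolute difference over positions available in both boundaries."""
    pairs = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    if not pairs:
        return float("inf")
    return sum(abs(p - q) for p, q in pairs) / len(pairs)


def select_candidate(cur: Frame, ref: Frame, x: int, y: int, size: int,
                     candidate_mvs: Sequence[MV]) -> MV:
    """Pick the MV whose reference block has the best-matching outer boundary."""
    target = outer_boundary(cur, x, y, size)
    return min(candidate_mvs,
               key=lambda mv: mad(target, outer_boundary(ref, x + mv[0], y + mv[1], size)))


# Synthetic check: the current frame is the reference shifted by one pixel.
ref = [[7 * x + 13 * y for x in range(12)] for y in range(12)]
cur = [[ref[y][min(x + 1, 11)] for x in range(12)] for y in range(12)]
print(select_candidate(cur, ref, 4, 4, 4, [(0, 0), (1, 0), (0, 1)]))
```

Only the outer boundary is compared because the pixels inside the lost block are unknown at the decoder, which is the reason for matching outside the block rather than on its contents.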
Restoring all missing CUs in a video sequence through the steps mentioned above can attenuate the effects of error propagation and, in the best case, stop it, slowing down the deterioration of video quality caused by missing information. Error propagation remains in the image until an intra frame that refreshes the image appears, which implies that the quality of the concealment will be different for each intra-period of a video sequence.
1.4 Document structure
This work is organized as follows. Chapter 2 reviews the background needed to understand the proposed method: the basics of color and the Human Visual System (HVS), a brief introduction to video coding standards, including the H.26X family, and a more detailed description of the main steps of the HEVC standard. Video evaluation metrics and video error resilience are also covered in that chapter. Chapter 3 reviews the existing approaches to error concealment, focusing on temporal methods and on those based on data hiding, which are the most closely related to the presented method. Chapter 4 describes the proposed EC method, which includes the selection of motion information to be embedded, the coding scheme, the information retrieval and the concealment; the concealment consists of reconstructing the CU structure using the backup information, or a generic structure when necessary. Chapter 5 addresses the experimental framework, the experiments performed and the results obtained, which are analyzed and compared with other methods. Finally, Chapter 6 presents the conclusions of this work and possible future improvements.
Chapter 2
Background
This chapter introduces the knowledge, methods, and definitions necessary to understand the elements that compose the proposal. An introduction to the video coding standards, including the Moving Picture Experts Group (MPEG) and H.26X families, is given. Some concepts and techniques common in video compression are also included. The HEVC coding elements are presented, focusing on those involved in the EC method. Furthermore, the chapter gives some essential concepts that help to understand error resilience, error concealment and data hiding in video. Finally, the video evaluation metrics most widely used in the state of the art of EC are presented.
2.1 Video background
An image is a spatial distribution of intensities that remains constant with time. Digital video is the representation of a real-world scene by a group of static images, also called frames, that create the illusion of movement when presented in sequence; the illusion of continuous video is obtained by changing them at a certain frame rate, measured in Frames Per Second (FPS). Sampling intervals of 1/60, 1/30 or 1/24 of a second are commonly used to produce a moving video signal [Richardson, 2010]. A higher FPS gives smoother motion but requires more information to be stored. A frame is composed of multiple objects, each with its own characteristics, such as shape, depth, texture, and illumination. Objects can be further divided into pixels; each pixel represents the brightness of a specific point of the picture in a gray-scale space, or color information in other color spaces such as RGB or YCbCr. All frames within the same video have the same vertical and horizontal dimensions, and each is represented by a matrix of pixels.
The representation of a video can be analog or digital. Digital video can be obtained from a digital camera or converted from an analog video signal; consequently, video sequences may vary in resolution, FPS, aspect ratio, color, depth, and quality.
A video sequence can be optimized for several final purposes, better known as profiles, such as broadcast, storage or network streaming, and a conventional encoder can optimize a sequence for fast decoding or for a better compression level, varying the overall coding quality.
2.2 Human visual system
Most of the audio, video, and image lossy compression methods are based on models of the human sensory system, focusing on the removal of redundant information without degrading the perceived quality. Video and image compression methods take advantage of the Human Visual System (HVS) replacing the photographic image with spatial, temporal and chromatic characteristics such as contours, color and motion [Nadenau et al., 2000], reaching high compression rates without significantly compromising the perceived quality.
The retina, located at the back of the eye, is a dense layer of interconnected neurons that samples the visual information and codes it before transmitting it along the optic nerve. The sub-sampling of the retina is done by two kinds of photoreceptors: rods and cones. Rods are sensitive to low levels of luminosity and are distributed at the borders of the visual field, whereas cones perceive several ranges of wavelengths and can be classified as long (L), medium (M) and short (S) according to their sensitivity. The human eye has about 110 million rods and approximately seven million cones. In general terms, rods are sensitive to intensity variations while cones detect color.
In the design of image and video compression methods, several characteristics of the HVS [Marziliano et al., 2004] are taken into account to achieve high compression rates without significantly compromising the visual quality, among which the following stand out:
• The HVS is very sensitive to image edges or missing edge information.
• The HVS is more sensitive to image features that persist for long periods of time.
• The HVS is more sensitive to low spatial frequencies, such as changes that occur over large areas, rather than high spatial frequencies, such as rapid changes in small areas.
There are several color spaces used to represent real-world information such as texture, light, and color in a digital context. The RGB color space, formed through different combinations of the R (red), G (green) and B (blue) primary components, is one of them. However, because the HVS is more sensitive to luminance than to color information, the YCbCr space was designed to take advantage of the HVS characteristics mentioned above [Richardson, 2002].
2.3 YCbCr color space
Most digital video applications need a mechanism to represent color information. A monochrome image requires only one number per pixel to indicate brightness or luminance at a defined color depth, while color images require at least three numbers per pixel to represent color accurately. In the RGB color space, a color image sample is described with three numbers that indicate the relative proportions of red, green and blue, the three additive primary colors of light; any color can be created by combining red, green and blue in different proportions. The YCbCr color space, in contrast, separates luminance (Y) from chrominance, which allows the chrominance (color) information to be sub-sampled, for example to two blue chrominance (Cb) and two red chrominance (Cr) samples for every four luminance samples. Hence, although RGB is used to represent color scenes in most applications, the YCbCr space is more efficient for image compression in digital video.
The Y component of the YCbCr color space represents the luminance or brightness of a pixel. Cb and Cr are the chrominance components of the pixel and are proportional to the color differences B − Y and R − Y, respectively. Luma (Y) can be obtained as a weighted average of the R, G, and B components through formula (2.1), and the chroma components are obtained from the difference between R, G or B and the luminance Y as in formula (2.2).
Y = kr R + kg G + kb B    (2.1)
where kr, kg and kb are the weighting factors.
Cr = R − Y,  Cb = B − Y,  Cg = G − Y    (2.2)
In the YCbCr color space, only the luma (Y) and the red and blue chroma components (Cr, Cb) are transmitted. The three color differences are linearly dependent (with weights that sum to one, kr Cr + kg Cg + kb Cb = 0), so only two of the three chrominance components are needed and Cg can be recovered at the receiver. This is one reason why YCbCr has an essential advantage over RGB. Moreover, the HVS is less sensitive to color than to luminance, which makes it possible to reduce or subsample the amount of data required to represent the chrominance components without severe consequences on the perceived visual quality.
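As a minimal sketch of formulas (2.1) and (2.2), the following Python fragment computes Y and the three color differences for one pixel, and then recovers G from Y, Cr and Cb alone, which illustrates why only two chroma components need to be transmitted. The weights KR, KG and KB are an assumption: they are the ITU-R BT.601 values, whereas the text only speaks of generic weighting factors. The function names are hypothetical.

```python
# Assumed weighting factors (ITU-R BT.601); the text leaves kr, kg, kb generic.
KR, KG, KB = 0.299, 0.587, 0.114


def rgb_to_color_differences(r, g, b):
    """Return (Y, Cr, Cb, Cg) following formulas (2.1) and (2.2)."""
    y = KR * r + KG * g + KB * b          # formula (2.1)
    return y, r - y, b - y, g - y         # formula (2.2)


def recover_green(y, cr, cb):
    """Recover G from Y, Cr and Cb only.

    Because KR + KG + KB = 1, the weighted sum KR*Cr + KG*Cg + KB*Cb is zero,
    so Cg follows from the two transmitted color differences.
    """
    cg = -(KR * cr + KB * cb) / KG
    return y + cg


y, cr, cb, cg = rgb_to_color_differences(200, 120, 40)
g_recovered = recover_green(y, cr, cb)
```

Practical systems additionally scale and offset Cb and Cr to fit a fixed integer range; that step is omitted here to stay close to the formulas above.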
Figure 2.1: Visual difference between 4:2:0, 4:2:2 and 4:4:4 subsampling schemes in increased compression order. At the highest sampling, 4:4:4, there is no compression, while a lower sampling, 4:2:0, achieves compression in the chroma components.
2.4
Subsampling
The YCbCr color space is the essential way of encoding in digital video because it allows the sampling of the chrominance components at a lower frequency than the luminance, taking advantage of the characteristics of the HVS. YCbCr supports several sampling proportions for the Y, Cb and Cr components. Subsampling is commonly indicated by a three-digit notation separated by colons, such as 4:4:4, 4:2:2, and 4:2:0, as shown in Figure 2.1. In 4:4:4 sampling, for every 4 luminance samples there are 4 Cr and 4 Cb samples. In 4:2:2 sampling, the Cb and Cr components have the same vertical resolution as Y, but half the horizontal resolution. In 4:2:0 sampling, the Cb and Cr components have half the horizontal and half the vertical resolution of Y.
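The effect of each sampling pattern on the amount of data can be sketched as follows. The fragment assumes the usual convention in which 4:2:2 halves the chroma resolution horizontally and 4:2:0 halves it both horizontally and vertically; the helper names and the 2x2 averaging filter are illustrative choices, not taken from any standard.

```python
# Horizontal and vertical chroma decimation factors per scheme (assumed convention).
CHROMA_FACTORS = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def samples_per_frame(width, height, scheme):
    """Total number of Y, Cb and Cr samples stored for one frame."""
    fx, fy = CHROMA_FACTORS[scheme]
    luma = width * height
    chroma = (width // fx) * (height // fy)
    return luma + 2 * chroma              # one Y plane plus Cb and Cr planes


def subsample_420(plane):
    """Reduce a chroma plane (list of rows, even dimensions) by averaging 2x2 blocks."""
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


full = samples_per_frame(1920, 1080, "4:4:4")
reduced = samples_per_frame(1920, 1080, "4:2:0")
```

For a 1920x1080 frame, the 4:2:0 layout stores exactly half as many samples as 4:4:4, before any actual compression takes place.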
Due to high compression requirements, 4:2:0 subsampling has become the most popular sampling pattern in digital image and video processing applications, reaching an adequate trade-off between image compression and degradation of perceptible quality, for several kinds of uses mainly in real-time video transmission.
Figure 2.2: Field coding example, frame (c) shows the result of interlacing field 1 (a) and
field 2 (b).
2.4.1
Frame and field coding
A video signal can be sampled as a set of consecutive complete frames (progressive sampling), as previously mentioned, or as a sequence of interlaced fields, each of which contains half of the lines of a frame. As can be seen in Figure 2.2, in the first interlaced field only the even lines of the first frame are displayed, while in the second field the odd lines of the second frame are shown.
Field coding allows more detailed images to be created than would otherwise be possible within a given amount of bandwidth. As a result, images appear smoother and fast-motion sequences are sharper. However, interlaced video comes with unwanted effects, produced mainly during slow-motion sequences.
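The construction of an interlaced picture from two fields, as in Figure 2.2, can be sketched in a few lines. The fragment assumes frames stored as lists of rows, with the even rows taken from the first frame and the odd rows from the second; the function name is hypothetical.

```python
def interlace(frame1, frame2):
    """Build an interlaced picture: even rows from frame1, odd rows from frame2."""
    if len(frame1) != len(frame2):
        raise ValueError("both frames must have the same number of rows")
    return [
        list(frame1[i]) if i % 2 == 0 else list(frame2[i])
        for i in range(len(frame1))
    ]


first = [[1, 1], [1, 1], [1, 1], [1, 1]]    # rows of the first frame
second = [[2, 2], [2, 2], [2, 2], [2, 2]]   # rows of the second frame
woven = interlace(first, second)
```

When the two source frames were captured at different instants, moving objects end up at different positions in alternating rows, which is the origin of the visible artifacts of interlaced video.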
2.5
Video coding standards
Many video coding techniques are continuously being developed and published, along with technological advances, innovating and improving compression techniques. However, commercial video coding applications tend to use a limited number of methods for video compression [Rao et al., 2013]; this is because the standardization of video coding formats pursues the following objectives:
manufacturers.
• Standards make it possible to build systems that incorporate video, in which many different applications such as video codecs, audio codecs, network transport protocols, security, and rights management, interact in clear and consistent ways.
• Many video coding techniques are patented, and therefore there is a risk that a particular video codec implementation may infringe patents. The techniques and algorithms required to implement a standard are well defined, and so is the cost of licensing the patents that cover these techniques.
Since the development of H.120, the first digital video coding standard published by the CCITT, now the ITU-T [ITU-T, 1984], the development of international standards has been carried out by the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU-T) and by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO/IEC) [Rao et al., 2013].
Throughout history, compression standards have undergone a remarkable evolution, improving quality and interoperability together with technology. The next section gives a chronological account of the most relevant video coding standards.
2.5.1
H.26X
Even though the H.120 standard started a line of research in the development of video compression methods, its performance was deficient. A later revision added motion compensation, which is still used nowadays; however, its compression capability was not high enough for it to be widely adopted.
The H.261 video coding standard, the initial member of the H.26X family, ratified in 1988 by the ITU-T, was the first video coding standard widely adopted in several applications [Sullivan and Wiegand, 1998]. H.261 is characterized by the introduction of the macroblock structure partitioning, motion compensation features, scalar quantization and the use of Variable-Length Coding (VLC).
2.5.2
MPEG family
The MPEG family of standards, established in 1988, includes MPEG-1, MPEG-2, and MPEG-4, formally known as ISO/IEC 11172, ISO/IEC 13818, and ISO/IEC 14496, respectively. The MPEG working group is part of the JTC1, the Joint ISO/IEC Technical Committee on Information Technology.
MPEG-1, developed by ISO/IEC in 1993, has superior video quality when compared to H.261 at higher bit rates. Its main improvements are bi-directional half-pixel motion prediction, the use of quantization weighting matrices, and support for resolutions of 352x240 for NTSC or 352x288 for PAL systems at 1.5 Mbps.
H.262/MPEG-2, jointly developed by the VCEG and MPEG groups, whose first version was released in 1995, is similar to H.261 but adds support for interlaced video coding. MPEG-2 is also backward compatible with MPEG-1, which means an MPEG-2 player can play back MPEG-1 video without any modification [Rao et al., 2013]. This standard was widely used by digital broadcast TV systems.
H.263 V.1, developed by the ITU-T in 1995, is one member of the H.26X video coding family [ITU-T, 1998] and the best replacement for previous standards at all bit rates. It is characterized by the use of 3D variable-length coding of DCT coefficients, the enhancement of motion vector prediction, coding structure improvements and the addition of arithmetic entropy coding. H.263 V.2, also known as H.263+, and H.263 V.3, also known as H.263++, are later revisions published by the ITU-T VCEG.
The MPEG-4 standard, developed by MPEG in 1998, is based on the features of MPEG-1 and MPEG-2 among other related standards. For most of the features included in MPEG-4, individual developers decide whether or not to implement them. Therefore, the concept of profiles was included, which allows a set of characteristics to be defined appropriately for a specific application. The standard also includes some enhancements such as increased coding efficiency, intra DCT coefficient prediction improvements and the implementation of error resilience techniques.
H.264/MPEG-4 Part 10 /AVC
The H.26L project, started in 1998, offered significantly better video compression efficiency than previous ITU-T standards. The Joint Video Team (JVT), a group of coding experts, was created in 2001 as a joint effort of the VCEG and MPEG groups to finalize the H.264/AVC video coding standard; H.26L was included into MPEG-4 (Part 10) and the result jointly adopted the name H.264 [Sunna, 2005].
The improved coding efficiency of H.264 [Rao et al., 2013] can be attributed to the additional coding tools and new features listed below:
• Adaptive intra-picture prediction.
• Small block size transform with integer precision.
• Multiple reference pictures and generalized B-frames.
• Variable block sizes.
• Quarter pixel precision for motion compensation.
• Content adaptive in-loop deblocking filter.
• Improved entropy coding by introduction of Context Adaptive Binary Arithmetic Coding (CABAC) and Context Adaptive Variable Length Coding (CAVLC).
Throughout different ratifications of the H.264/AVC standard, several characteristics have been improved. A motion-compensation process with an adaptive block size approach was included; sub-blocks of different sizes and shapes are used in this process, which improves compression efficiency. Motion estimation precision was increased from 1/2 to 1/4 of a pixel, and the number of reference frames rose to five, relative to previous standards. The use of the Integer Discrete Cosine Transform (IDCT) was also introduced, making it possible to achieve better transform coding precision [Richardson, 2002]. Finally, the introduction of an adaptive deblocking filter increases the visual quality by concealing the blocking effect produced by the DCT block quantization.
2.6
High efficiency video coding
The High Efficiency Video Coding (HEVC) standard brings together the best work and research of previously developed standards, resulting in the now formally standardized H.265. The increasing demand for HD and Ultra High Definition (UHD) video has led to a growing need for bandwidth and storage capacity in several application areas like entertainment, education, medicine, and science. To meet current video compression requirements, the ITU-T VCEG and the ISO/IEC MPEG, working together as the Joint Collaborative Team on Video Coding (JCT-VC), have been working on the development of the HEVC standard, which is characterized by operating with resolutions up to 8K and by better use of parallel architectures.
HEVC introduces new tools and characteristics that were not present in previous standards, providing up to twice the coding efficiency of the H.264/AVC standard. The most relevant changes made in HEVC to reach such compression rates, in comparison to the previous H.264/AVC standard, are summarized as follows:
• Quad-tree partitioning. The macroblock structure partitioning was replaced by a more complex structure named coding unit, which uses a recursive quad-tree partitioning approach; also, the block size of the LCUs was increased from 16x16 to a maximum of 64x64.
• Parallel encoding and decoding. Due to the increase of computational complexity with respect to previous standards, several tools and mechanisms, such as wavefront parallel processing, were included to facilitate the parallel encoding and decoding processes.
• Intra-prediction mode improvement. H.264 uses nine different intra prediction modes, while in HEVC the number of prediction modes was increased to 35, achieving better approximations to the original block and minimizing residual coding.
• Integer transforms addition. Several integer transform techniques were added, with sizes from 32x32 down to 4x4; also, a 4x4 discrete sine transform is available.
• Merge mode motion coding. A new merge mode motion coding is available in HEVC, replacing the skip mode of the previous H.264 standard and reducing the coded information in areas where motion, texture, and color are uniform.
• Extensive in-loop processing. A deblocking filter and two new in-loop processing tools, named Sample Adaptive Offset (SAO) and Adaptive Loop Filtering (ALF), were added to reduce the blocking effect produced by the DCT block quantization.
Some of the features included in the HEVC standard imply an increase in computational complexity as well as a decrease in error resilience, as has been demonstrated [Correa et al., 2012]. Examples are the replacement of the macroblock structure by the new, larger quad-tree structure, which better approximates the characteristics of the image, and the removal of Flexible Macroblock Ordering (FMO), which allowed blocks to be coded in a nonconsecutive order and thus increased the probability that the neighbors of a lost block were received.
2.6.1
HEVC coding process
The HEVC coding and decoding processes are similar to those of many previous standards, taking advantage of spatial redundancies by inferring samples from their neighborhood and of temporal redundancies by coding differences between frames instead of complete pictures. Figure 2.3 shows a simplified version of the encoding process, which for explanatory purposes is summarized in the following steps:
• Picture partitioning.
• Intra/Inter prediction.
• Transform coding.
• Entropy coding.
• Reconstruction and buffering.
In the HEVC picture partitioning, a video sequence is composed of a series of continuous frames, grouped into Groups of Pictures (GOP), where each frame is divided into slices or tiles which can be processed separately from any other region of the same frame, which is useful for parallel encoding/decoding. Each slice or tile is then sub-partitioned into CUs of the same size. The CU structure is obtained by recursive partitioning into a quad-tree structure; the structure partitioning and block sizes can vary depending on the video characteristics and complexity, such as texture, objects, color, and movement, Figure 2.3 (a).
To approximate the original image accurately, the encoder selects a coding mode between intra and inter based on distortion measures, Figure 2.3 (b). Intra coded units are coded and decoded independently; they do not need any reference to be decoded and are commonly used as refresh points or at the beginning of a new sequence. In contrast, the inter coding process takes advantage of the similarity of successive frames by calculating motion vectors, which represent the direction of motion of a particular block from the current frame to the reference frame.
Video sequences are highly redundant, both temporally, among successive frames, and spatially, within the same frame. Compression algorithms usually take advantage of these redundancies to increase efficiency. The residual information obtained from the inter and intra prediction is transform coded using different transforms such as the DCT, and is subsequently quantized according to a predefined quantization parameter, Figure 2.3 (c).
In the entropy coding, motion information obtained from inter-prediction is combined with the quantized transform information, and both are entropy coded using Context-Adaptive Binary Arithmetic Coding (CABAC), Figure 2.3 (d). The resulting bitstring is ready for later processing such as storage or transmission. Finally, the transform information is decoded and, after in-loop filtering, the current frame is reconstructed and stored in the buffer, where it is available for future inter-coded units, Figure 2.3 (e).
In the following sections, a detailed review of the steps mentioned above is presented, focusing on those closely related to the process and techniques involved in the proposed method.
2.6.2
Picture partitioning
Sequence
A video sequence consists of sets of continuous frames which, when presented consecutively at a determined frame rate, provide the illusion of movement. A sequence offers an entry point into the coded video and contains a set of mandatory and optional parameters; mandatory system parameters are necessary to initialize the decoder system, while optional system parameters are used for system settings such as resolution or orientation, at the discretion of the network provider. Also, optional user data can be sent in the sequence header [Rao et al., 2013].
Figure 2.3: HEVC basic coding diagram summarized into 5 main steps, (a), (b), (c), (d) and (e).
Group of pictures
A Group Of Pictures (GOP) structure is a collection of successive pictures or frames within a coded video; it specifies the order in which intra and inter frames are arranged. In the HEVC standard, a GOP can contain the following picture types:
• I picture. An intra coded picture is coded independently of all other pictures. Each GOP begins with this type of picture; the interval at which I pictures appear is determined by the intra period.
• P picture. The predictive coded picture contains motion-compensated difference information relative to previously decoded pictures of type P or I.
• B picture. The bi-predictive coded picture contains motion-compensated difference information relative to two previously decoded pictures of type P or I.
Figure 2.4: HEVC group of pictures. An example of picture types organization and dependencies between them is shown.
An I picture indicates the beginning of a GOP; afterward, several P and B frames follow. I frames contain the full image and do not require any additional information to reconstruct it. Each I picture can be a clean random access point, such that any errors within the GOP structure are corrected by the next I picture. Figure 2.4 shows an example of a GOP in which arrows represent decoding dependences between frames: the last P frame depends on the previously decoded P frame, which in turn depends on the first I frame.
Tiles and slices
There are several tools included in HEVC that help to improve error resiliency while achieving parallel encoding/decoding; slices provide parallelism and resiliency while tiles increase the capability of parallel processing [Sullivan et al., 2012]. A slice is a picture partition that contains a certain number of contiguous CUs; depending on the given configuration, slices can provide a constant or variable number of CUs. In general terms there are two types of slices, regular and dependent [Aguirre-Ramos et al., 2014], described below:
• Regular slices. These are used for parallelism and resilient coding; each slice can be independently decoded using its own information. If any slice gets lost or is damaged, the decoding process can continue without major problems. Unfortunately, the use of regular slices can incur substantial coding overhead because prediction is not performed across slice boundaries.
• Dependent slices. In this kind, dependency is maintained across slice boundaries; the coding efficiency is increased, and reduced end-to-end delay is provided by allowing the transmission of sub-slices before the slice is completely coded. This mode does not provide any resiliency but allows better compression levels by reducing the amount of necessary NAL units during transmission.
The use of tiles is a feature introduced in HEVC to provide parallelism. Tiles are similar to regular slices in the sense that they group a uniform number of CUs and dependencies are broken across their boundaries. Tiles are designed to be processed independently in parallel, so each tile can be coded/decoded by one thread/core.
Coding unit
The Coding Unit (CU), the primary block unit of the picture partitioning, as seen in Figure 2.5 (a), is the equivalent of the Macro Block (MB) structure of the previous H.264 standard. A CU can be as small as an 8x8 Small Coding Unit (SCU) or as large as a 64x64 Large Coding Unit (LCU); the size of each CU and its partitions depends on the desired encoding configuration, which should adjust to the motion and detail characteristics of the picture. Each LCU is recursively partitioned into smaller CUs; this process creates a tree-like structure named quad-tree structure, as can be seen in Figure 2.5 (b). The quad-tree structure is one of the most relevant characteristics of HEVC; it helps to achieve better approximations to the object shapes in the image.
Figure 2.5: HEVC frame partitioning at CU level. (a) Example of frame partitioning into LCUs. (b) Quad-tree structure example of an LCU, [Aguirre-Ramos et al., 2014].
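The recursive partitioning of an LCU into a quad-tree can be sketched as follows. This is only an illustration: a real encoder decides each split by comparing rate-distortion costs, whereas this fragment uses the sample variance of the block against a hypothetical threshold as a stand-in for that decision. The names split_cu, min_size and threshold are assumptions.

```python
def variance(block):
    """Sample variance of a 2-D block given as a list of rows."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_cu(frame, x, y, size, min_size=8, threshold=100.0):
    """Return the leaf CUs of the quad-tree rooted at (x, y) as (x, y, size) tuples.

    A block is split into four quadrants while it is larger than min_size and
    its variance exceeds threshold (a stand-in for a rate-distortion decision).
    """
    block = [row[x:x + size] for row in frame[y:y + size]]
    if size <= min_size or variance(block) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_cu(frame, x + dx, y + dy, half, min_size, threshold)
    return leaves


# A flat 64x64 LCU with a single bright sample in its top-left corner.
lcu = [[0] * 64 for _ in range(64)]
lcu[0][0] = 255
leaves = split_cu(lcu, 0, 0, 64, min_size=8, threshold=1.0)
```

In this example only the quadrants containing the bright sample keep splitting, down to the 8x8 minimum, while the flat regions remain large CUs; the leaves always tile the whole 64x64 area.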
An LCU of luma samples together with its two corresponding CUs of chroma samples and the syntax associated with these sample blocks is subsumed under a so-called Coding Tree Unit (CTU) [Sze et al., 2014], as can be seen in Figure 2.6, where the picture components chroma Cr and chroma Cb are identical.
Figure 2.7: Depending on the CU prediction mode (intra or inter) different PU shapes are
available for prediction.
In the structure partitioning of HEVC, each CU is the root of two trees, one of Prediction Units (PUs) and the other of Transform Units (TUs). For each CU, a coding mode is chosen which determines whether a Prediction Unit (PU) is predicted using intra-picture or motion-compensated prediction. PUs are different partition shapes used for both intra and inter prediction; the selection of the best PU for each CU is based on a distortion function. In this way, different PU shapes allow the residual energy to be reduced, and thus compression efficiency increases. Figure 2.7 shows the different PU shapes available for intra and inter prediction. This scheme, introduced in HEVC, allows higher compression gains because the new inter and intra prediction produce smaller residuals.
Figure 2.8: 35 angular modes available in HEVC intra coding, including planar and DC
mode.
2.6.3
Intra prediction
Intra prediction takes advantage of previously coded PUs to reduce the amount of information used. All the intra prediction modes available in HEVC use reference samples from adjacent reconstructed blocks and extrapolate them in different directions. Such extrapolations can be classified into two categories: angular prediction methods, with which it is possible to accurately model structures with directional edges, and planar prediction / DC prediction [Sze et al., 2014], in which predictors estimate smooth image contents.
Figure 2.8 shows the 35 different intra prediction modes available in HEVC: 33 are directional (angular) modes, one is the planar mode, and one is the DC mode. The increased number of available modes, when compared to the previous H.264/AVC standard, helps to obtain a better approximation to the original image, reducing the number of coded coefficients.
2.6.4
Inter prediction
The inter picture prediction of the HEVC standard is the product of improvements and generalizations from previous video coding standards, including H.264/AVC. The MVP was replaced by the AMVP, in which motion data is obtained from already decoded blocks. The motion estimation process consists of finding the best match to the current block in a previously coded frame (reference frame); for each block, a corresponding reference block serves as a predictor. Motion compensation calculates the residual between the predicted and the original block; this residual, together with the motion information, is the result of the inter prediction process.
In the AMVP, motion vector representation is based on the translational motion model: the position of the block in a previously decoded picture is indicated by a motion vector (∆x, ∆y), where ∆x and ∆y specify the horizontal and vertical displacement, respectively, relative to the position of the current block.
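Motion estimation under the translational model can be illustrated with an exhaustive block-matching search. The sketch below is an assumption-laden simplification: it uses the Sum of Absolute Differences (SAD) as matching cost, integer-pixel displacements only, and a small square search window, none of which is prescribed by the text; the function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def motion_search(cur, ref, bx, by, n, search_range):
    """Full search: return (dx, dy, cost) of the best match inside the window."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip displacements that would read outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Reference frame with distinct values; the current frame is the same content
# moved one pixel to the right, so the best match lies one pixel to the left.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
dx, dy, cost = motion_search(cur, ref, bx=2, by=2, n=4, search_range=2)
```

A cost of zero means the predictor reproduces the block exactly, so no residual would need to be coded; in general the residual is the per-sample difference between the current block and the displaced reference block.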
Two kinds of inter-picture prediction are available in HEVC: uni-prediction and bi-prediction, with one or two sets of motion vectors, respectively. The reference pictures that can be used in bi-prediction are stored in two separate lists, namely list0 and list1, and the best predictor for each motion block is signaled to the decoder. Also, a new technique called inter-prediction block merging [Sze et al., 2014, Richardson, 2002], included in the HEVC standard, derives all motion data of a block from the neighboring blocks, replacing the direct and skip modes of the previous H.264/AVC. In the following subsections, a review of the AMVP, block merging and skip mode is made.
Advanced motion vector prediction
In the AMVP, motion vector competition signals which Motion Vector Predictor (MVP) from a list of MVPs should be used for representing the current block [Sze et al., 2014]. The variable quad-tree block structure in HEVC can result in one block having several neighboring blocks with motion vectors as potential MVP candidates. The AMVP candidate list construction includes the following MVP candidates:
Figure 2.9: Example of temporal and spatial motion vector candidates. Only motion vectors from spatial neighboring blocks to the left and above the current block are considered as spatial MVP candidates; this is because the blocks to the right and below the current block are not yet decoded.
• One temporal candidate MVP obtained from two temporal, co-located blocks when both spatial candidate MVPs are not available or they are identical.
• Zero motion vector when the spatial, the temporal or both candidates are not available.
Figure 2.9 shows an example of the locations of the temporal candidate MVP, coming from an already coded frame, Figure 2.9 (C0), and of the spatial candidate MVPs, coming from already coded neighboring blocks, Figure 2.9 (B0, B1, B2, A0, A1). The best candidate from the candidate list should be the one that minimizes the amount of residual information. Once a candidate MVP is selected, motion vectors need to be scaled according to the temporal distances between the candidate reference picture and the current reference picture. Motion scaling is performed according to a scale factor, which is based on the temporal distance between the current picture and the reference picture of the candidate block, and the temporal distance between the current picture and the reference picture of the current block.
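The idea behind motion vector scaling can be sketched as a simple ratio of temporal distances. This is a simplification under stated assumptions: temporal positions are given as picture order counts (POC), the scale factor is computed in floating point, and the result is rounded to the nearest integer. The HEVC specification instead uses clipped fixed-point arithmetic, which is not reproduced here; scale_mv and its parameter names are hypothetical.

```python
def scale_mv(mv, poc_cur, poc_cur_ref, poc_cand_ref):
    """Scale a candidate motion vector to the reference picture of the current block.

    tb: temporal distance between the current picture and the reference
        picture of the current block.
    td: temporal distance between the current picture and the reference
        picture of the candidate block.
    """
    tb = poc_cur - poc_cur_ref
    td = poc_cur - poc_cand_ref
    if td == 0 or tb == td:
        return mv                         # nothing to scale
    factor = tb / td
    return (round(mv[0] * factor), round(mv[1] * factor))


# The candidate points 4 pictures back, the current block only 2 pictures back,
# so the candidate vector is halved.
scaled = scale_mv((8, -4), poc_cur=8, poc_cur_ref=6, poc_cand_ref=4)
```

When both blocks refer to the same picture the vector is returned unchanged, which is the common case for spatial candidates.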
Block merging and skip mode
HEVC block merging reduces the amount of coded motion information and hence the bit rate. The merging technique is based on coding motion information only once for each connected area of a video with similar motion, called a motion region. At least one seed-PU not coded in merge-mode is required per motion region [Rao et al., 2013]. Skip mode is a particular case of merge-mode where all motion information is copied from a signaled candidate; the main difference from merge-mode is that in skip-mode residuals are not coded.
Block merging included in HEVC introduces a syntax that allows a sub-block to explicitly reuse the same motion parameters contained in neighboring blocks; it compiles a list of candidate motion tuples from neighboring blocks, then an index is signaled which identifies the candidate to be used.
After the intra/inter coding has been completed, the difference between the original block and its approximation is obtained; this residual is then transform coded and sent together with the motion information or the intra information.
2.6.5
Transform coding
The process of transform and quantization is a type of data compression used to convert spatial image pixel values to transform coefficient values. Transform coding decorrelates information, so that parts of it can be discarded without affecting its perceived quality; the transform functions concentrate image energy in a small number of transform coefficients [Richardson, 2002]. Later, the quantization process maps a range of values to a single value; low energy coefficients are taken as non-significant and are therefore discarded [Sze et al., 2014]. The number of coefficients produced is equal to the number of pixels transformed and, finally, coefficients are coded by lossless entropy coding.
prediction, then the signal is divided into residual square blocks, and each of them is the input of a two-dimensional transform. The resulting transform coefficients are then subjected to quantization. The final design of HEVC is based on two transform types: the Discrete Cosine Transform (DCT) and the alternate transform based on the Discrete Sine Transform (DST). In HEVC the amount of compression achieved by the quantization process is controlled by the quantization parameter (QP), which determines a quantization step size and takes one of 52 values, from 0 to 51, for 8-bit depth.
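The relation between QP and the quantization step size can be sketched as follows. The fragment assumes the commonly cited approximation in which the step size doubles every 6 QP values, Qstep = 2^((QP − 4) / 6), and a plain rounding quantizer; the standard itself works with integer scaling tables and shifts, so this is an illustration rather than the normative procedure. The function names are hypothetical.

```python
def qstep(qp):
    """Approximate quantization step size for a QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be between 0 and 51 for 8-bit depth")
    return 2 ** ((qp - 4) / 6)


def quantize(coefficients, qp):
    """Map transform coefficients to integer levels (the lossy step)."""
    step = qstep(qp)
    return [round(c / step) for c in coefficients]


def dequantize(levels, qp):
    """Reconstruct approximate coefficients from the integer levels."""
    step = qstep(qp)
    return [level * step for level in levels]


coefficients = [100, -37, 5, 0]
levels = quantize(coefficients, qp=28)        # step size 16 at QP 28
reconstructed = dequantize(levels, qp=28)
```

With QP 28 the small coefficient 5 is quantized to zero and disappears from the reconstruction, which shows how low energy coefficients are discarded as the step size grows.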
2.6.6
Reconstruction and buffering
Before entropy coding, the picture reconstruction is performed in the encoding process. The reconstruction process involves the transform decoding and all in-loop filtering processes; the resulting picture is an exact reproduction of the one reconstructed at the decoder side. The reproduced image is then buffered and used later as a reference picture during the inter-coding process.
In-loop filters
The HEVC standard specifies two in-loop filters, a deblocking filter and a Sample Adaptive Offset (SAO) filter. The in-loop filters are applied in the encoding and decoding loops, after the inverse quantization and before saving the picture in the decoded picture buffer [Sze et al., 2014]. The deblocking filter attenuates discontinuities produced at the prediction and transform block boundaries, while the SAO filter improves the decoded picture by attenuating ringing artifacts and intensity changes in some areas of an image.
2.6.7
Entropy coding
Entropy coding is performed at the last stage of video coding; all the residual information of previous steps and the reconstruction information obtained from the intra/inter prediction is coded. Entropy coding is a lossless compression scheme that uses the statistical properties of the data to compress it. Context-Based Adaptive Binary Arithmetic Coding (CABAC) [Qian et al., 2009] is the form of entropy coding used in HEVC. CABAC has two operating modes, regular and bypass. In regular mode, a context-modeling stage is carried out to select the probability model that best fits the data, and the selected context model is updated during the coding process. In bypass mode, an equiprobable model is assumed, which avoids the computational complexity of the regular mode but achieves lower compression.
The basic design of CABAC involves the processes of binarization, context modeling, and binary arithmetic coding, summarized in the following items:
• Binarization maps the syntax information to binary symbols (bins).
• Context modeling estimates the probability of each bin based on some specific context.
• Binary arithmetic coding compresses the bins to bits according to the estimated probability.
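The three stages listed above can be mimicked with a small, non-normative sketch. It assumes a unary binarization, a context model that estimates bin probabilities from running counts, and the ideal arithmetic coding cost of −log2(p) bits per bin instead of an actual arithmetic coder. CABAC in HEVC uses other binarizations and a table-driven probability state machine, so every name below is a hypothetical stand-in.

```python
import math


def unary_binarize(value):
    """Binarization: map a non-negative integer to `value` ones followed by a zero."""
    return [1] * value + [0]


class ContextModel:
    """Context modeling: adaptive estimate of the probability of each bin value."""

    def __init__(self):
        self.counts = [1, 1]              # start from an equiprobable model

    def probability(self, bin_value):
        return self.counts[bin_value] / sum(self.counts)

    def update(self, bin_value):
        self.counts[bin_value] += 1


def estimate_bits(bins):
    """Ideal arithmetic coding cost of the bins under the adaptive model."""
    model = ContextModel()
    bits = 0.0
    for b in bins:
        bits += -math.log2(model.probability(b))
        model.update(b)
    return bits


# Sixteen zero-valued symbols, each binarized as a single 0 bin.
bins = [b for _ in range(16) for b in unary_binarize(0)]
regular_cost = estimate_bits(bins)    # adaptive model, as in regular mode
bypass_cost = float(len(bins))        # one bit per bin, as in bypass mode
```

Because the model quickly learns that zeros dominate, the estimated regular-mode cost is about 4.1 bits for the 16 bins, against 16 bits under the equiprobable assumption of bypass mode.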
The binary information obtained from entropy coding is treated as a bitstring and is then subject to specific transport layers in transmission-based video applications. For this purpose, HEVC specifies two conceptual layers, the video coding layer (VCL) and the network abstraction layer (NAL). The VCL contains all the specifications related to the generation of a standard-compliant bit-string, while the NAL converts this VCL bit-string into a format suitable for specific transport layers or storage media.
2.7
Error resilience
Although new video coding tools bring further compression levels through the removal of a considerable amount of redundant information, some disadvantages come along, such as increased complexity and decreased error robustness. The compressed bit-string becomes more vulnerable to errors such as information losses during transmission or damage during storage, which degrade the resulting sequence or even make it undecodable.
Losses can be produced in error-prone environments, where it is not possible to assure the quality of service (QoS) between the source and the destination, mainly in wireless transmissions; under this assumption, any transmission over the Internet is susceptible to packet loss or corruption due to impairment of the physical channels [Wang et al., 2000]. In such cases, error resilience mechanisms should be employed in video transmission to mitigate the effects produced by the loss of information.
The techniques designed to deal with video transmission errors can be grouped into two categories: error resilience and error concealment.
• Error resilience. Consists of sending redundant information useful for recovery when transmission errors are present. This redundancy can be added through several coding techniques. Another way of providing error resiliency is by error resilient encoding; in these techniques, the video coder generates a resilient bit-string. The predictive coding loop and the variable length coder are carefully designed to avoid error propagation, providing error resiliency at the cost of decreasing coding efficiency.
• Error concealment. Also classified as error-control techniques are designed to deal with damaged areas in video caused by the loss of information. The main difference between Error Concealment (EC) and error resilience tech-niques is that in EC, the lost areas in the video are visually concealed with
neighborhood information, and the damage propagation is controlled.
Error concealment and error resiliency techniques are complementary. Although these methods significantly improve the overall video quality in case of errors, they may compromise the coding performance. Therefore, it is essential to achieve an efficient trade-off between the coding efficiency and error robustness while applying such approaches.
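As a purely illustrative sketch of concealing a lost area with neighborhood information, the following Python fragment fills a lost block with the co-located block of the previous frame (a zero-motion temporal copy). This is only one of the simplest possible strategies and is not the method proposed in this work; the function name, the list-of-rows frame representation, and the block coordinates are assumptions made for the example.

```python
def conceal_block_temporal(current, previous, top, left, size):
    """Return a copy of `current` whose lost block is filled from `previous`.

    Frames are lists of rows of pixel values (hypothetical representation).
    The lost block is the size x size square whose top-left corner is
    (top, left); it is replaced by the co-located block of the previous
    frame, i.e. a zero-motion temporal copy.
    """
    concealed = [row[:] for row in current]  # leave the input frame untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            concealed[r][c] = previous[r][c]
    return concealed


# Toy 4x4 frames: the 2x2 block at (1, 1) of the current frame was lost.
previous_frame = [[10] * 4 for _ in range(4)]
current_frame = [[20] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        current_frame[r][c] = 0  # lost samples

repaired = conceal_block_temporal(current_frame, previous_frame, 1, 1, 2)
```

Copying from the previous frame works well only when the lost area is static; for moving content, the concealment quality depends on how well the motion of the lost area can be estimated.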
2.8
Video evaluation metrics
To compare the quality of a defined set of sequences, it is necessary to determine the reliability of the images. However, measuring visual quality is complex and subjective in many situations because several factors, relative to the observer, can affect the results. Visual quality is inherently subjective and is therefore influenced by subjective factors that make it difficult to obtain an entirely accurate measure of quality [Richardson, 2010].
A significant effort has been made by the ITU radiocommunication sector (ITU-R) to create a recommendation [ITU-R, 2012] for the use of subjective quality evaluation metrics. The methods presented in this recommendation are accepted as realistic measures of subjective visual quality [Richardson, 2010]; however, these methods also present complications associated with the expertise of the observer in identifying video distortions. Such difficulties can be mostly solved by using a large number of observers, which makes subjective video evaluation an expensive process. Objective quality metrics, in turn, are designed to measure video quality by predicting the observer experience, but their results may not correlate well with subjective video quality measures. Nevertheless, objective metrics are commonly used in the state of the art due to their simplicity and usability as a first evaluation approach. The following subsections review the two most common image quality evaluation metrics, PSNR and SSIM.
2.8.1
PSNR
The Peak Signal-to-Noise Ratio (PSNR) is a widely used objective metric based on the Mean Squared Error (MSE) between an original and an impaired image or video frame [Richardson, 2002]. PSNR does not take into account essential elements such as the HVS and focuses on the pixel values, ignoring the picture content. The typical formula for PSNR is (2.3).
\[
\mathrm{PSNR}_{dB} = 10 \log_{10} \frac{(2^{n} - 1)^{2}}{\mathrm{MSE}} \tag{2.3}
\]
where n is the number of bits used to represent the maximum signal value, and the MSE is obtained through formula (2.4). The PSNR expresses the inverse of the difference on a logarithmic scale: a higher PSNR means that the two images are more similar, and the measured PSNR is infinite if both pictures are identical. Conversely, the more the pictures differ, the lower the PSNR.
\[
\mathrm{MSE} = \frac{\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( U(i,j) - V(i,j) \right)^{2}}{M \cdot N} \tag{2.4}
\]
where U(·,·) and V(·,·) refer to the pixel values from the original and impaired images, respectively, and M and N represent their size.
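Formulas (2.3) and (2.4) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the list-of-rows image representation are assumptions of the example, and 8 bits per sample is taken as the default value of n.

```python
import math


def mse(u, v):
    """Mean squared error (2.4) between two equally sized images.

    Images are lists of rows of pixel values (hypothetical representation).
    """
    rows, cols = len(u), len(u[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            diff = u[i][j] - v[i][j]
            total += diff * diff
    return total / (rows * cols)


def psnr_db(u, v, n_bits=8):
    """PSNR in dB (2.3); returns infinity when both images are identical."""
    err = mse(u, v)
    if err == 0:
        return math.inf
    peak = (1 << n_bits) - 1  # maximum signal value, 2^n - 1
    return 10.0 * math.log10(peak * peak / err)


original = [[0, 0], [0, 0]]
impaired = [[0, 0], [0, 255]]  # one of the four samples is fully corrupted
score = psnr_db(original, impaired)
```

In this toy case the MSE is 255²/4, so the ratio inside the logarithm is exactly 4 and the PSNR is 10·log10(4), roughly 6 dB.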
2.8.2
SSIM
The Structural Similarity (SSIM) index measures the similarity between two images by comparing their luminance, contrast, and structure. All the operations are computed within a local 8x8 square window, which moves pixel-by-pixel over the entire frame. The SSIM can be viewed as a quality measure when comparing two images, using one as the test image and the other as a perfect reference image. The formula for SSIM is given by (2.5).
\[
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{x,y} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)} \tag{2.5}
\]
where $\mu_x$ is the average of $x$; $\mu_y$ is the average of $y$; $\sigma_x^{2}$ is the variance of $x$; $\sigma_y^{2}$ is the variance of $y$; $\sigma_{x,y}$ is the covariance of $x$ and $y$; $c_1 = (k_1 L)^{2}$ and $c_2 = (k_2 L)^{2}$ stabilize the division when the denominator is weak; $L$ is the dynamic range of pixel values, usually $2^{\text{bits per pixel}} - 1$; $k_1 = 0.01$ and $k_2 = 0.03$.
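The computation of (2.5) over a sliding 8x8 window can be sketched as follows. This is a simplified illustration in plain Python, not a reference implementation: the function names are hypothetical, the variance and covariance are assumed to be the population versions (divided by the number of samples, without any window weighting), and the frame score is assumed to be the plain average of the per-window values.

```python
def ssim_window(x, y, bits_per_pixel=8, k1=0.01, k2=0.03):
    """SSIM (2.5) for two windows given as flat lists of pixel values."""
    n = len(x)
    dyn_range = 2 ** bits_per_pixel - 1  # L in the text
    c1 = (k1 * dyn_range) ** 2
    c2 = (k2 * dyn_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Assumption: population statistics (divide by n).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(img_a, img_b, win=8):
    """Average SSIM over all win x win windows, moved pixel-by-pixel."""
    rows, cols = len(img_a), len(img_a[0])
    if rows < win or cols < win:
        raise ValueError("image is smaller than the window")
    scores = []
    for top in range(rows - win + 1):
        for left in range(cols - win + 1):
            x = [img_a[r][c] for r in range(top, top + win)
                 for c in range(left, left + win)]
            y = [img_b[r][c] for r in range(top, top + win)
                 for c in range(left, left + win)]
            scores.append(ssim_window(x, y))
    return sum(scores) / len(scores)


# Toy 8x8 gradient image and its inverted version.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
inverted = [[255 - value for value in row] for row in reference]
same_score = mean_ssim(reference, reference)
diff_score = mean_ssim(reference, inverted)
```

Comparing an image with itself gives a score of 1, while the inverted image, whose structure is negatively correlated with the reference, gives a clearly lower score.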
Although SSIM and PSNR are not formally accepted measures, they are still the most used video and image quality metrics. In the following sections, both SSIM and PSNR are used to measure and compare the performance of the proposed method.
2.9
Chapter summary
The final product of a video coding standard is a system capable of encoding a video sequence into a compressed form, from which the decoded sequence can be identical to the original (lossless coding) or an approximation of it (lossy coding). In this chapter, an introduction to the HVS and the way in which video coding standards take advantage of it was briefly presented, including the relationship of color spaces and sub-sampling with the perceptible media quality. Also, a quick chronological review of the video coding standards, from H.120 to H.265, was presented, highlighting the strengths and weaknesses of each one of them.
The main stages of the HEVC coding process, which do not differ from those of other video coding standards, are summarized as: picture partitioning, intra/inter prediction, transform coding, entropy coding, and reconstruction and buffering. However, in a real scenario, the encoding process involves more complicated processes than those described above, and a detailed explanation is given only for the processes that