All of the above works address traditional single-view videos, and thus cannot handle next-generation videos in which users switch among views or viewports. Rate adaptation algorithms for single-view videos either waste network bandwidth if they prefetch all views, or cause playback interruptions if they prefetch only the single active view or viewport. We categorize this content into multiview videos and virtual reality (360-degree) videos as follows.
Multiview Videos. We categorize multiview videos into three types: 3D Free-viewpoint video [58, 59], 3D Multiview [125], and 2D Multiview [34]. Hamza and Hefeeda [58, 59] proposed an interactive free-viewpoint video streaming system using DASH. The client fetches original views and generates virtual views when required. They proposed a rate-based adaptation algorithm that chooses the quality levels of texture and depth views based on the estimated bandwidth and the rate-distortion model for the target virtual view. This work does not consider the temporal quality variance that arises when the quality levels of the original views change over time.
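The rate-based selection described above can be illustrated with a minimal sketch. It is not the algorithm of [58, 59]: it assumes a single texture ladder and a single depth ladder, searches all combinations exhaustively, and takes the rate-distortion model as a caller-supplied function. The names `choose_levels` and `toy_distortion`, and all bitrates, are hypothetical.

```python
from itertools import product

def choose_levels(texture_kbps, depth_kbps, distortion, bandwidth_kbps):
    """Pick (texture_level, depth_level) with the lowest modeled distortion
    whose combined bitrate fits the estimated bandwidth.

    `distortion(t, d)` stands in for the rate-distortion model of the
    target virtual view; its form here is an assumption.
    Returns None when even the lowest levels exceed the bandwidth.
    """
    best = None
    for t, d in product(range(len(texture_kbps)), range(len(depth_kbps))):
        rate = texture_kbps[t] + depth_kbps[d]
        if rate > bandwidth_kbps:
            continue  # this combination does not fit the budget
        candidate = (distortion(t, d), rate, t, d)
        if best is None or candidate < best:
            best = candidate
    return None if best is None else (best[2], best[3])

def toy_distortion(t, d):
    # Placeholder model: distortion falls as either level rises,
    # with texture weighted more heavily than depth.
    return 1.0 / (1 + t) + 0.5 / (1 + d)

levels = choose_levels([400, 800, 1600], [100, 200, 400],
                       toy_distortion, bandwidth_kbps=1100)
```

An exhaustive search is practical here only because the number of quality levels per ladder is small.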
Su et al. [125] proposed a rate-based adaptation algorithm for 3D Multiview video streaming using the High Efficiency Video Coding (HEVC) standard. They only address non-interactive 3D videos, where users cannot choose the viewing angles. In addition, the adaptation algorithm divides the estimated bandwidth over all views, and thus may not account for the relative importance of each view.
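To make the limitation concrete, the sketch below contrasts an even split of the estimated bandwidth with an importance-weighted split. Treating the division in [125] as an even one is an assumption made for this example. The weighted variant is a hypothetical alternative and is not part of [125]; the function names and the weights are assumptions.

```python
def equal_split(budget_kbps, n_views):
    """Divide the estimated bandwidth evenly over all views."""
    return [budget_kbps / n_views] * n_views

def weighted_split(budget_kbps, weights):
    """Divide the bandwidth in proportion to per-view importance weights.

    The weights are hypothetical; they could, for example, reflect how
    likely each view is to be watched.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one view needs a positive weight")
    return [budget_kbps * w / total for w in weights]

even = equal_split(9000, 3)
skewed = weighted_split(9000, [2, 1, 0])
```

With the same 9000 kbps budget, the even split gives every view the same share, while the weighted split gives nothing to the view with zero weight.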
360-degree Videos. There are two types of VR streaming: monolithic and tile-based streaming. In monolithic streaming, the content provider pre-creates video frames, each consisting of multiple viewports, and allocates more bits to popular viewports. Facebook employs such a technique [86] to stream its VR content. In the tiling approach, the video is temporally chopped into video segments (similar to 2D videos). Moreover, every video frame is spatially cut into tiles to save bandwidth. This approach requires efficient spatio-temporal adaptation algorithms that choose combinations of tile quality levels over time. We note that multiple works [140, 31, 55] have considered calculating the optimal set of tiles by balancing coding efficiency against bandwidth waste. In this report, we focus on adaptation algorithms and delivery systems for tile-based streaming.
Xie et al. [143] proposed a buffer-based adaptation algorithm based on head movement prediction. They predict the tile probability, and estimate the total bitrate budget that minimizes the difference between current and target buffer occupancy and avoids rebuffering. They minimize the total expected distortion and the spatial quality variance of tiles given the tile probability distribution and the bitrate budget. They reduce the target buffer occupancy to absorb the prediction errors. They predict the per-tile probability as follows. First, they use linear regression to predict the rotation angles, and assume that the prediction errors follow a Gaussian distribution to predict the viewing angle. Second, they predict the probability of a given spherical point being viewed by averaging the viewing angle probabilities. Finally, the probability of a given tile is the average of the probabilities of the spherical points that belong to that tile. The paper assumes a linear relationship between the current and predicted angles, and then draws conclusions about the prediction error probability distribution using a small dataset of head movements. In addition, the algorithm does not consider the temporal quality variance.
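The three prediction steps above can be sketched as follows. This is a minimal illustration, not the implementation of [143]: it tracks only one rotation angle (yaw), ignores wrap-around at the ±180 degree boundary, and samples each tile at evenly spaced yaw values. The function names, the field of view `fov_deg`, and the error standard deviation `sigma_deg` are assumptions introduced for the example.

```python
import math

def fit_line(times, values):
    """Ordinary least-squares fit: value = slope * t + intercept."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    var = sum((t - t_mean) ** 2 for t in times)
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = cov / var
    return slope, v_mean - slope * t_mean

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def point_probability(point_yaw, predicted_yaw, sigma_deg, fov_deg):
    """Probability that a point is inside the viewport, assuming the
    true yaw is Gaussian around the regression prediction."""
    half = fov_deg / 2.0
    # The point is visible when the true yaw lies within half a field
    # of view on either side of the point.
    return (normal_cdf(point_yaw + half, predicted_yaw, sigma_deg)
            - normal_cdf(point_yaw - half, predicted_yaw, sigma_deg))

def tile_probability(tile_start, tile_end, predicted_yaw,
                     sigma_deg, fov_deg, samples=16):
    """Average the point probabilities over sample points in the tile."""
    step = (tile_end - tile_start) / samples
    points = [tile_start + (i + 0.5) * step for i in range(samples)]
    total = sum(point_probability(p, predicted_yaw, sigma_deg, fov_deg)
                for p in points)
    return total / samples

# Step 1: extrapolate yaw one time unit ahead from three past samples.
slope, intercept = fit_line([0, 1, 2], [10, 20, 30])
predicted = slope * 3 + intercept
# Steps 2 and 3: per-tile probabilities for a near and a far tile.
near = tile_probability(30, 60, predicted, sigma_deg=10, fov_deg=100)
far = tile_probability(120, 150, predicted, sigma_deg=10, fov_deg=100)
```

In this toy setup the tile around the predicted yaw receives a probability close to one and the distant tile a probability close to zero, which is the input the bitrate allocation step needs.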
Nasrabadi et al. [102] consider tiled streaming of SVC videos. The server pre-creates base and enhancement layers for every tile in the cubemap projection. The goal is to improve quality while reducing rebuffering and quality switches. To reduce rebuffering, the algorithm aggressively downloads base layers for all tiles up to a threshold buffer occupancy of K seconds. If the buffer length exceeds K, the algorithm prefetches enhancement layers based on the network throughput (to reduce quality switches) and on the prediction of the next viewport using weighted linear regression.
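The two-phase download decision can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm of [102]: it uses one tile per cubemap face, a single enhancement layer with a fixed bitrate, and a predicted viewport that is already ordered by priority. The function name `plan_next_request` and all parameter names are hypothetical.

```python
CUBE_FACES = ["front", "back", "left", "right", "top", "bottom"]

def plan_next_request(buffer_s, threshold_s, all_tiles,
                      viewport_tiles, throughput_kbps, enh_layer_kbps):
    """Decide what to download next.

    Phase 1: while the buffer holds at most `threshold_s` seconds (K),
    fetch base layers for every tile to protect against rebuffering.
    Phase 2: once the buffer exceeds K, fetch enhancement layers for
    tiles in the predicted viewport, limited by what the measured
    throughput can carry.
    """
    if buffer_s <= threshold_s:
        return {"layer": "base", "tiles": list(all_tiles)}
    affordable = int(throughput_kbps // enh_layer_kbps)
    return {"layer": "enhancement",
            "tiles": list(viewport_tiles)[:affordable]}

filling = plan_next_request(2.0, 5.0, CUBE_FACES,
                            ["front", "left"], 3000, 1000)
enhancing = plan_next_request(6.0, 5.0, CUBE_FACES,
                              ["front", "left", "top"], 2500, 1000)
```

Because base layers are always available for all tiles, a wrong viewport prediction in the second phase lowers quality but does not stall playback.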
Petrangeli et al. [109] proposed a tile-based adaptation algorithm based on H.265 encoding, HTTP/2 server push, and estimates of the bandwidth and the future viewport. They use a non-uniform equirectangular projection, where every frame is divided into six tiles: two polar and four equatorial. The tiles are encoded using H.265 to allow the client to decode multiple tiles using one decoder. In addition, to reduce the number of GET requests that result from requesting multiple tiles, they use HTTP/2 to enable the server to push the requested tiles in response to a single GET request. The adaptation algorithm estimates the network bandwidth and allocates the lowest quality level to all tiles. It further predicts the future viewport by estimating the fixation speed. Once the current and future viewports are known, the algorithm uses the bandwidth slack to increase the quality level of tiles based on their spatial location relative to the predicted viewport. That is, the tiles closest to the future viewport are allocated higher quality levels. Similar to [143], this work does not consider quality switches over time.
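The allocation step described above (lowest quality everywhere, then spend the slack near the predicted viewport) can be sketched as a greedy loop. The exact upgrade order in [109] is not given here, so this sketch assumes that tiles are visited from nearest to farthest and each is raised as far as the remaining budget allows. The shared bitrate ladder, the distance values, and the name `allocate_tile_qualities` are assumptions for illustration.

```python
def allocate_tile_qualities(ladder_kbps, distances, bandwidth_kbps):
    """Return a quality level index per tile.

    ladder_kbps: bitrates of the quality levels in ascending order,
                 assumed identical for every tile.
    distances:   tile name -> distance to the predicted viewport.
    """
    # Every tile starts at the lowest quality level.
    levels = {tile: 0 for tile in distances}
    spent = ladder_kbps[0] * len(distances)
    # Visit tiles from nearest to farthest from the predicted viewport.
    for tile in sorted(distances, key=distances.get):
        while levels[tile] + 1 < len(ladder_kbps):
            extra = ladder_kbps[levels[tile] + 1] - ladder_kbps[levels[tile]]
            if spent + extra > bandwidth_kbps:
                break  # no slack left for this upgrade
            levels[tile] += 1
            spent += extra
    return levels

# Six tiles: four equatorial and two polar, with hypothetical distances.
tile_distances = {"eq_front": 0.0, "eq_left": 1.0, "eq_right": 1.0,
                  "eq_back": 2.0, "north": 1.5, "south": 1.5}
allocation = allocate_tile_qualities([500, 1000, 2000],
                                     tile_distances, bandwidth_kbps=5000)
```

With a 5000 kbps estimate, the base allocation consumes 3000 kbps and the remaining slack raises the front tile to the top level and one neighbor by one level.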
Alface et al. [115] presented a real-time 16K VR streaming system for untethered mobile clients, where users can zoom in and out in addition to the aforementioned rotation angles. The motivation of this work is that 4K VR streaming often results in a pixelated experience; hence, 16K streaming is preferred in such interactive applications. However, 16K streaming poses two challenges. First, the authors conjecture that naive 16K streaming requires a bandwidth of 300 Mbps, which is a huge overhead for wireless devices and their battery usage. Second, there is no display technology that can render such a resolution. They proposed a server-side component that produces two outputs: one 4K stream per user for the active viewport, and one 1K stream shared across all users as a fallback viewport when the corresponding 4K stream is not downloaded due to network latency. To generate the 4K stream, every tile in the scene is independently encoded into 8x8 tiles to allow random access of the files. Then, the tiles of the viewport are chosen and transcoded to one 4K stream to be sent to the user. This work does not exploit the spatial relationship between different tiles, which could result in better network utilization. In addition, because the transcoders run online, the cost of the system increases as the number of concurrent users increases.
Figure 2.3: Illustration of different interconnection models between ISPs. Customers connect to the Internet through ISPs. Dotted lines correspond to the interdomain traffic between ISPs. (Diagram labels: Provider 1, Provider 2, Provider 3, IXP, Clients, Servers.)