This section examines the QoS requirements of real-time audio streaming applications. Since most applications require either voice or high quality sound encoding, these two classes are examined in particular.
1.3. APPLICATION QOS REQUIREMENTS 17
1.3.1.1 Throughput
The throughput requirements of audio streaming applications depend entirely on the encoding scheme used for the audio data transmission. The encoding format is usually determined by the required sound quality of the application. Tools that simply transfer voice information usually deploy encoding techniques designed specifically for voice transmission (for example, a Voice Coder (VOCODER)), which differ from those used by applications that transmit high quality music.
A. Voice Encoding
The traditional digital voice encoding technique, known as 64 kbps Pulse Code Modulation (PCM), corresponds to the sound quality everybody knows from the public telephone system. It is thus referred to as Telephone Quality audio. The encoding scheme is defined within the ITU G.711 standard. The mono analog signal is sampled 8000 times per second and each sample is encoded in 8 bits. No compression is used. The resulting bit rate of telephone quality sound is therefore 8 bits × 8000 Hz = 64 kbps.
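The arithmetic above applies to any uncompressed PCM stream. The following is a minimal sketch in Python; the function name `pcm_bit_rate` is chosen here for illustration and does not come from the G.711 standard.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Return the bit rate in bit/s of an uncompressed PCM stream."""
    return sample_rate_hz * bits_per_sample * channels


# Telephone quality as described above: 8000 samples/s, 8 bits per sample, mono.
g711_bps = pcm_bit_rate(8000, 8)
print(g711_bps / 1000, "kbps")
```

The same function can be reused for other uncompressed formats by changing the sampling rate, sample width, and channel count.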
In the 1980s a number of encoding and compression techniques were developed that enable more efficient digital voice encoding than G.711. Telephone quality can also be achieved with only 32 kbps by applying a more sophisticated encoding technique known as Differential Pulse Code Modulation (DPCM), a loss-free encoding. Slightly lower voice quality can be provided with Adaptive Differential Pulse Code Modulation (ADPCM) encoded digital voice at 40, 32, 24, and 16 kbps. More recent encoding algorithms (for example, the Linear Predictive Coding (LPC) or Code Excited Linear Prediction (CELP) voice coders) can reduce bit rates for digital voice to as low as 2.4 and 4.8 kbps, respectively.
Voice Quality                Encoding Technique (Standard)      Bit Rate
Telephone Quality            PCM (G.711)                        64 kbps
Telephone Quality            DPCM                               32 kbps
(Lower) Telephone Quality    ADPCM (G.721, G.726, G.727)        40, 32, 24, 16 kbps
Lower Telephone Quality      LD-CELP (G.728)                    16 kbps
GSM Phone Quality            GSM                                13 kbps
Low-bandwidth Voice          CELP (Federal-Standard-1016)       4.8 kbps
Low-bandwidth Voice          LPC-10 (Federal-Standard-1015)     2.4 kbps

Table 1.1: Voice Quality Encoding Schemes and Throughputs
B. High Quality Sound Encoding
CD Quality is commonly recognized as a high quality sound encoding. The CD audio standard is based on sampling the analog signal at 44.1 kHz, with each sample coded in 16 bits. The result is 705.6 kbps for one monophonic channel. As compact discs are stereophonic, the throughput required to transmit full stereophonic sound in CD quality is 1411.2 kbps.
Within the last few years several encoding or compression techniques for CD quality sound have been developed (see also [Fro97, Gadml]). MPEG Layer-1 enables stereo CD quality encoding with a bit rate of 384 kbps. It should be noted that both stereo channels are multiplexed in the same stream. The MUSICAM scheme, adopted for MPEG Layer-2, allows encoding stereophonic CD quality sound with “medium” bit rates of 248 or 192 kbps. More advanced encodings (for example, MPEG Layer-3 using perceptual coding) achieve near CD quality at 64 kbps per audio channel.
Table 1.2 summarizes the throughput requirements of various types of audio streams.
Sound Quality          Encoding Technique (Standard)     Bit Rate
CD quality             CD-DA (stereo)                    1.4 Mbps
CD quality             MPEG Layer-1 (stereo)             384 kbps
Near CD quality        MPEG Layer-2 (stereo)             192-248 kbps
Near CD quality        MPEG Layer-3 (stereo)             128 kbps
Improved CD quality    MPEG (sound studio, stereo)       768 kbps

Table 1.2: Sound Quality Encoding Schemes and Throughputs
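To put the bit rates of Table 1.2 into perspective, the following sketch computes the compression ratio of each MPEG layer relative to uncompressed stereo CD audio. The dictionary and function names are illustrative. For MPEG Layer-2 the lower bound of the range given in the table is assumed.

```python
# Uncompressed CD audio: 44.1 kHz sampling, 16 bits per sample, 2 channels.
CD_DA_KBPS = 44100 * 16 * 2 / 1000  # 1411.2 kbps

# Stereo bit rates in kbps, taken from Table 1.2
# (Layer-2: lower bound of the 192-248 kbps range, an assumption of this sketch).
ENCODED_KBPS = {
    "MPEG Layer-1": 384,
    "MPEG Layer-2": 192,
    "MPEG Layer-3": 128,
}


def compression_ratio(encoded_kbps: float, reference_kbps: float = CD_DA_KBPS) -> float:
    """Return how many times smaller the encoded stream is than the reference."""
    return reference_kbps / encoded_kbps


for name, kbps in ENCODED_KBPS.items():
    print(f"{name}: about {compression_ratio(kbps):.1f}:1")
```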
Based on these findings one can conclude that the throughput requirements for real-time, high quality sound transmission are relatively high (although sophisticated compression mechanisms are used) compared to the throughput users experience in the public Internet. This is, in part, why current research in the area of real-time audio streaming focuses on low-bandwidth voice data.
1.3.1.2 Delay
The transit delay requirements for the transmission of continuous audio streams are highly dependent on the multimedia application. In the case of pure live audio data distribution (uni-directional transmission), long delays are usually tolerable. Large receiver buffers can be deployed to compensate for high delay variations and irregularities in the network and end systems. This of course is not the case for interactive applications such as Internet Telephony or live audio conferencing systems. Interactivity, especially human conversation, demands high responsiveness. The two-way or round-trip delay of the streaming application is crucial.
The impression of “real-time” which users experience from responsive applications is subjective. User studies for the ITU indicate that most telephony users perceive communication with round-trip delays greater than approximately 300 ms as simplex connections rather than duplex communication. However, depending on the application and user perception, more tolerant users are often satisfied with delays of 300-800 ms [G.196]. Conversations with a round-trip delay close to a second cannot easily use “normal” social protocols for talker selection.
For duplex audio transmission, a technical difficulty lies in the echo that may become audible if the end-to-end round-trip delay exceeds a certain threshold and no particular measure (such as the use of directional microphones and speakers, or echo canceling systems) is taken to limit the echo. The ITU has defined 24 ms as the upper limit of one-way transit delay for which echo canceling is not required.
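The thresholds quoted in this section can be combined into a simple check. The sketch below is illustrative only: the function name, the category labels, and the assumption of symmetric paths (round-trip delay equals twice the one-way delay) are not taken from the ITU recommendations, and the 300 ms figure is approximate.

```python
ECHO_FREE_ONE_WAY_MS = 24    # above this one-way delay, echo control is needed
DUPLEX_RTT_LIMIT_MS = 300    # most users perceive longer round trips as simplex
TOLERANT_RTT_LIMIT_MS = 800  # upper end of what more tolerant users accept


def assess_delay(one_way_ms: float) -> dict:
    """Classify a one-way transit delay against the thresholds above."""
    rtt_ms = 2 * one_way_ms  # assumption: both directions have the same delay
    if rtt_ms <= DUPLEX_RTT_LIMIT_MS:
        perception = "duplex"
    elif rtt_ms <= TOLERANT_RTT_LIMIT_MS:
        perception = "acceptable to tolerant users"
    else:
        perception = "impaired conversation"
    return {
        "rtt_ms": rtt_ms,
        "perception": perception,
        "needs_echo_control": one_way_ms > ECHO_FREE_ONE_WAY_MS,
    }


print(assess_delay(20))
print(assess_delay(200))
```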
1.3.1.3 Delay Jitter
Live audio is probably the media type most sensitive to delay variations. If packets carrying the audio information arrive with a wide distribution of transit delays, the receiving system needs to wait a sufficient time, called buffering or playout delay, before playing back the data in order to ensure that most of the delayed blocks arrive in time. Otherwise, a significant number of packets would arrive late. The gaps in the signal caused by late and lost packets result in audible artifacts and, if frequent, in intolerable sound quality.
Receiver buffering mechanisms temporarily store incoming packets in a so-called buffer until their playout point. The packets can then be played out smoothly without gaps in the signal. Buffering mechanisms are also often referred to as delay compensation. Although delay compensation clearly has advantages, this technique has two possible drawbacks. First, an additional delay is introduced at the receiver. Second, sufficient buffer memory must be available at the receiving system.
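The trade-off between playout delay and late packets can be illustrated with a small sketch. It assumes a fixed playout delay measured from the sending time of each packet, and the transit delays are made-up sample values rather than measurements. A packet counts as late if its transit delay exceeds the playout delay.

```python
def late_fraction(transit_delays_ms, playout_delay_ms):
    """Return the fraction of packets that miss their playout point."""
    late = sum(1 for delay in transit_delays_ms if delay > playout_delay_ms)
    return late / len(transit_delays_ms)


# Hypothetical transit delays in ms; two packets are strongly delayed.
delays = [40, 42, 45, 41, 90, 43, 44, 120, 41, 42]

for playout_ms in (50, 100, 150):
    print(playout_ms, "ms playout delay:", late_fraction(delays, playout_ms))
```

A larger playout delay reduces the share of late packets, at the cost of the additional receiver delay and buffer memory mentioned above.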
The process of determining the best buffering or playout delay is commonly called Playout Delay Estimation (see section 3.1.4). It is dictated mainly by the following two parameters:
• The maximum overall delay that the application or the end user can tolerate. In the case of interactive audio streaming, the maximum total delay is very restrictive. Since a large portion of the delay budget is consumed by network transmission and the processing in the end systems, additional delay introduced by network jitter and scheduling irregularities in the end systems should be minimized.
• The buffering capabilities of the receiving system. Even though the total delay might not be the limiting factor in all cases, the available memory in the end system, especially in small or mobile end devices, restricts the buffering delay. A delay of even a few seconds of high quality audio, for example, would require a considerable buffer size.
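The memory needed for a given buffering delay follows directly from the bit rate of the stream. A minimal sketch, with an illustrative function name and example values chosen here:

```python
def playout_buffer_bytes(bit_rate_bps: int, buffer_delay_ms: int) -> float:
    """Return the buffer size in bytes needed to hold buffer_delay_ms of audio."""
    return bit_rate_bps * buffer_delay_ms / 8000  # /1000 for ms, /8 for bits


# Two seconds of uncompressed stereo CD audio (1411.2 kbps).
cd_bytes = playout_buffer_bytes(1_411_200, 2000)

# 100 ms of telephone quality PCM voice (64 kbps).
voice_bytes = playout_buffer_bytes(64_000, 100)

print(cd_bytes, voice_bytes)
```

With these example values, two seconds of CD quality audio occupy roughly 350 kB, whereas 100 ms of telephone quality voice fit in under 1 kB.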
1.3.1.4 Reliability
It is important to note that bit errors usually lead to dropped packets within Internet communication. Therefore, only packet loss needs to be considered when examining the reliability requirements of Internet media streaming applications. Bit errors are dealt with on the transport layer; user applications need not consider them.
It is commonly recognized that humans are far more sensitive to erroneous audio transmission than to defective video transfer. This is due to the different processing of audio and visual information. Thus, the QoS requirements for audio with respect to transmission errors are very strict. The maximum error rate tolerable within audio communications is highly dependent on the application (see footnote 4), the encoding scheme (see footnote 5), and the sensitivity of the individual human user.
One study [Jay80] concludes that no more than 5% of erroneous audio data can be tolerated in human conversations. Another study [Sch97] discovered that a packet loss rate of 1% is clearly noticeable as a crackle. Up to 13% of packet loss of voice information still allows words to be understood, but there are many crackles in the signal. Loss rates of 20% still allow sentences to be understood. This is due to the redundancy in human language. Non-redundant information such as numbers is lost. Also, speakers with a (strong) accent are very hard to understand. At 25% packet loss only parts of phrases are understandable. Higher packet loss rates make voice transmissions useless for most people.
Packet losses within real-time audio streaming cannot simply be resolved by means of retransmission, since the end-to-end delay constraints would be greatly exceeded. If only a few consecutive packets are lost, techniques that replay the last frame(s) rather than playing no sound can mask the problem. It should be noted that gaps in the signal are immediately recognized by the listener (except during silent periods). Other techniques extrapolate the missing information by determining an approximate value from previously received frames. A similar technique interpolates missing blocks based on the predecessor and the successor blocks [T+96].
Both extrapolation and interpolation are called predictive techniques, as their approach is to provide estimates for missing information. Deploying the principle of these predictive techniques for transmission error recovery is often referred to as error concealment.
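The two simplest concealment strategies mentioned above, replaying the last frame and interpolating between neighbors, can be sketched as follows. This is a simplified illustration and not the method of [T+96]: frames are modeled as lists of sample values, a lost frame is represented by `None`, and the interpolation is a plain sample-wise average that only handles isolated losses.

```python
def conceal_by_repetition(frames):
    """Replace each lost frame (None) with the last received frame."""
    concealed, last = [], None
    for frame in frames:
        if frame is not None:
            last = frame
        concealed.append(last)  # remains None if nothing was received yet
    return concealed


def conceal_by_interpolation(frames):
    """Fill isolated lost frames with the sample-wise mean of both neighbors."""
    concealed = list(frames)
    for i in range(1, len(frames) - 1):
        prev, nxt = frames[i - 1], frames[i + 1]
        if frames[i] is None and prev is not None and nxt is not None:
            concealed[i] = [(a + b) / 2 for a, b in zip(prev, nxt)]
    return concealed


received = [[0, 2], None, [4, 6]]  # the middle frame was lost
print(conceal_by_repetition(received))
print(conceal_by_interpolation(received))
```

Note that interpolation has to wait for the successor frame before the gap can be filled, so it adds at least one frame of delay, whereas repetition does not.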
Summarizing, one can conclude that interactive real-time audio streaming has very strict end-to-end QoS requirements, especially with respect to the end-to-end delay, jitter and reliability. The throughput requirements are less demanding.
4. For example, audio artefacts in high quality music are usually less tolerable than erroneous voice information.
5. It should be noted that some encoding techniques generate packets of different priority, so the impact depends on which packets are lost; others add redundancy to the packets, which enables recovery from most packet losses.