A great many techniques have been developed over the past century for the production of spatial audio. In general, however, they take one of three approaches:
• The manipulation of level and/or time differences in pairs or multiple pairs of loudspeakers.
• The reconstruction of a sound field over a listening area using a loudspeaker array.
• The reconstruction of the ear signals using headphones or highly localized loudspeaker signals.
The first approach, manipulating time (phase) or, more usually, level differences between pairs of loudspeakers, is often referred to as stereophony. The ITD and ILD cues can be produced both acoustically, through the use of different microphone arrays, and electronically, through processing techniques such as amplitude panning. Stereophony originally referred to any method of reproducing a sound field using a number of loudspeakers, but the term is now generally used to refer specifically to techniques based on the manipulation of level and/or time differences in pairs or multiple pairs of loudspeakers, such as two-channel stereo and 5.1 surround sound.
Ambisonics and Wavefield Synthesis are two techniques which attempt to reconstruct a sound field within a listening area using loudspeaker arrays.
Ambisonics is a complete set of techniques for recording, manipulating and synthesizing artificial sound fields [Malham et al, 1995] which has been regularly used in spatial music and theatre for the past three decades. While never a commercial success, Ambisonics has proved enduringly popular for spatial music presentations for various reasons, such as its independence from a specific loudspeaker configuration and its elegant theoretical construction, which is based on spherical harmonics.
Wavefield Synthesis (WFS) is a more recently developed technique which can be considered as an acoustical equivalent to holography, or holophony [Berkhout, 1998]. The technique uses large numbers of loudspeakers arranged in linear arrays and can theoretically recreate a sound field over a much larger listening area than is possible with other sound field reconstruction techniques such as Ambisonics.
The third approach uses HRTF data to either record or synthesize spatial auditory cues. This binaural approach is well suited to a single listener, as it requires strict separation of the two ear signals, such as when listening with headphones. However, it is much more difficult to extend to large groups of listeners and so will not be covered in this thesis.
With the possible exception of WFS, these techniques can only recreate directional perceptual cues. The simulation of distance is often achieved using additional processes, which will be discussed later.
3 Stereophony
Fig. 3.1 Bell Labs stereophony, proposed (left) and implemented (right)
The earliest work on stereophony was carried out independently by Bell Laboratories in the United States and by Alan Blumlein at EMI in the UK in the early 1930s. The approach adopted by Bell Labs was based on the concept of an acoustic curtain [Steinberg et al, 1934], namely that a sound source recorded by a large number of equally spaced microphones could then be reproduced using a matching curtain of loudspeakers (Figure 3.1, left). In theory, the source wavefront is sampled by the microphone array and then reconstructed using the loudspeaker array. In practice, this approach had to use a reduced number of channels, so a system was developed using three matching spaced omni-directional microphones and three loudspeakers placed in a front-left, centre and front-right arrangement (Figure 3.1, right). This approach was problematic, however, as the reduction in channels distorted the wavefront and audible echoes sometimes occurred due to the phenomenon of spatial aliasing (see Section 3.2.2). Spaced microphone techniques such as this capture the different onset arrival times of high frequency transients, and so capture the ITD localization cues present in the original signal. However, this also makes it difficult to process the audio afterward, as unpredictable time differences are fixed in the recording.
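The arrival-time differences that a spaced array fixes into a recording can be illustrated with a short Python sketch. The microphone spacing, source position and speed of sound below are illustrative assumptions and are not taken from the Bell Labs experiments; the function name arrival_times is likewise hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def arrival_times(source_xy, mic_positions, c=SPEED_OF_SOUND):
    """Return the propagation time in seconds from a point source to each microphone."""
    sx, sy = source_xy
    return [math.hypot(sx - mx, sy - my) / c for mx, my in mic_positions]


# Assumed layout: three omni-directional microphones on a line, 1.5 m apart,
# with a source 3 m in front of the array and 2 m to the left of centre.
mics = [(-1.5, 0.0), (0.0, 0.0), (1.5, 0.0)]
source = (-2.0, 3.0)

times = arrival_times(source, mics)
earliest = min(times)
# Delay of each channel relative to the first arrival, in milliseconds.
relative_ms = [(t - earliest) * 1000.0 for t in times]

for name, delay in zip(("left", "centre", "right"), relative_ms):
    print(f"{name:>6}: +{delay:.2f} ms")
```

In this assumed layout the left microphone receives the onset first and the other two channels follow after a few milliseconds. These inter-channel delays depend on the source position and, once recorded, cannot easily be altered.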
Fig. 3.2 Blumlein’s coincident microphone arrangement
At the same time, Alan Blumlein was developing various alternative arrangements, such as the two coincident microphones with figure-of-8 directivity patterns shown in Figure 3.2 [Wiggens, 2004]. This coincident microphone arrangement records level differences which vary with the angular position of the source, but as the microphones are coincident, time differences are not captured. However, Blumlein realized that the resulting level differences would in fact result in an IPD at low frequencies due to the unavoidable cross-talk between the loudspeakers. To illustrate this, consider two loudspeakers radiating a low frequency signal with no time difference, but with a greater amplitude signal radiating from the left loudspeaker (Figure 3.3). The listener will receive at the left ear the louder signal from the left loudspeaker, combined with the quieter signal from the right loudspeaker, which is delayed due to the greater distance travelled. The sum of these two wavefronts will be a phase-shifted and amplified version of the louder wavefront. A similar and inverse summing process occurs at the right ear, and the resulting difference in phase between the two ear signals produces an interaural time cue at low frequencies that is proportional to the amplitude difference between the loudspeaker signals. In turn, at higher frequency ranges, head-shadowing acts as a greater obstacle to the two wavefronts, so the
extent resembles natural hearing, as it produces IPD and ILD cues in the frequency ranges at which these localization cues are most effective. It therefore turns the unavoidable cross-talk between the loudspeakers to advantage, as this cross-talk produces an IPD which is related to the original source direction. Critics of this summing localization approach argue that level differences alone cannot produce the ITD cues necessary for correct localization of onset transients [Thiele, 1980]. However, subjective listening tests have shown that this is not the case and that transients can be clearly localized in Blumlein stereo recordings [Rumsey, 2001]. In addition, this approach allows for the post-processing of the stereo image by adjusting the combination of the two microphone signals. More recently, alternative microphone arrangements such as ORTF or the Decca tree have been developed which represent a trade-off between the two approaches and reduce the conflicting ITD cues that arise for transient and steady-state signals with purely coincident techniques.
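The low frequency summing process described above can be checked numerically with a deliberately simplified model, sketched here in Python. The model assumes plane waves arriving from loudspeakers at ±30°, two ear positions 0.18 m apart with no head between them (so head shadowing is ignored, which is only plausible at low frequencies) and a speed of sound of 343 m/s. These values and the function names are assumptions made for illustration and are not taken from Blumlein's work.

```python
import cmath
import math


def ear_phase_difference(g_left, g_right, freq_hz,
                         half_angle_deg=30.0, ear_spacing_m=0.18, c=343.0):
    """Phase lead (radians) of the left ear signal over the right ear signal.

    Both loudspeakers radiate the same sinusoid with no time difference;
    only the gains g_left and g_right differ.
    """
    omega = 2.0 * math.pi * freq_hz
    # Time by which a wavefront reaches the nearer ear before the head centre.
    delta = (ear_spacing_m / 2.0) / c * math.sin(math.radians(half_angle_deg))
    near = cmath.exp(1j * omega * delta)   # same-side loudspeaker: arrives early
    far = cmath.exp(-1j * omega * delta)   # cross-talk path: arrives late
    left_ear = g_left * near + g_right * far
    right_ear = g_left * far + g_right * near
    return cmath.phase(left_ear) - cmath.phase(right_ear)


def equivalent_time_difference(g_left, g_right, freq_hz):
    """Interaural time difference (seconds) implied by the phase difference."""
    return ear_phase_difference(g_left, g_right, freq_hz) / (2.0 * math.pi * freq_hz)


for g_right in (1.0, 0.8, 0.5, 0.0):
    itd_us = equivalent_time_difference(1.0, g_right, 200.0) * 1e6
    print(f"g_left=1.0, g_right={g_right:.1f}: left ear leads by {itd_us:.0f} microseconds")
```

Under these assumptions, equal gains give no phase difference, and the left ear lead grows steadily as the right loudspeaker is attenuated, even though no time difference exists between the loudspeaker signals themselves.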
Fig. 3.3 Standard stereophonic arrangement
These microphone techniques can of course also be adapted to artificially position a monophonic recording in a stereo field. Introducing time differences to a monophonic signal routed to two loudspeakers can be used to position, or pan, the signal between the loudspeakers. This approach, however, can introduce contradictory phantom image positions for the transient and steady-state parts of the signal [Martin et al, 1999a], as additional phase differences are introduced by the summing effect of the loudspeaker cross-talk. In addition, comb filtering can occur
position is highly dependent on the position and orientation of the listener [Rumsey, 2001]. Amplitude panning introduces level differences by simply weighting the signal routed to each loudspeaker, and this technique is quite effective when used with a symmetrical pair of loudspeakers in front of a single, centrally positioned listener, with an optimal separation angle of ±30°. Amplitude panning can be considered as a simplification of Blumlein’s coincident microphone technique shown in Figure 3.2. With this arrangement, a signal in the front left quadrant will arrive at the maximum of the blue microphone response characteristic and at the null point of the red microphone. Amplitude panning simplifies this idea so that a signal panned hard left will only be produced by the left-most loudspeaker, and vice versa, while a signal panned to the centre will be created as a phantom image by both loudspeakers. As a result, a slight yet perceptible change in timbre occurs when a signal is panned from a loudspeaker position to a point in between.
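The relationship between amplitude panning and the coincident figure-of-8 pair can be sketched in a few lines of Python. The sketch assumes ideal cosine (figure-of-8) responses with the microphone axes at ±45°, and uses a sine/cosine constant-power pan law. The pan law is a common choice assumed here for illustration, since the text does not specify one, and all function names are hypothetical.

```python
import math


def blumlein_pair_gains(source_angle_deg):
    """Gains of two ideal figure-of-8 microphones with axes at +45 and -45 degrees.

    Angles are measured from straight ahead, positive towards the left.
    """
    left = math.cos(math.radians(source_angle_deg - 45.0))
    right = math.cos(math.radians(source_angle_deg + 45.0))
    return left, right


def constant_power_pan(position):
    """Sine/cosine pan law: position -1 is hard left, 0 is centre, +1 is hard right."""
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    angle = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)


def pan_mono(samples, position):
    """Weight a monophonic signal into left and right loudspeaker feeds."""
    g_left, g_right = constant_power_pan(position)
    return [g_left * s for s in samples], [g_right * s for s in samples]


for position in (-1.0, -0.5, 0.0, 0.5, 1.0):
    g_left, g_right = constant_power_pan(position)
    print(f"position {position:+.1f}: left {g_left:.3f}, right {g_right:.3f}")
```

With these assumptions, a source at 45° to the left of the idealised microphone pair yields the same gains as a signal panned hard left, and a centre source yields equal gains of about 0.707 in both channels, so that the summed power stays constant across the pan range.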
The specific implementation of stereophony for two loudspeakers, i.e. two-channel stereo, is by far the most commonly used audio format in the world today. However, as this format only utilises a pair of front loudspeakers, it must necessarily reproduce both the direct source signal and reverberation from the front. One of the earliest formal extensions of this method to more than two channels is the Quadraphonic system, which is summarized in the next section.