
As mentioned in the introduction, research into the acoustic correlates of the human voice with different emotional states has received a great deal of attention for a number of years, and has been spurred on by technological advances such as the telephone, audio recording technology (Scherer, 2003) and, more recently, advances in computer speech and speech synthesis (Schröder et al., 2010). Through these efforts, a vast body of work has accumulated, and a number of review articles have been published to chart and summarise the findings (e.g. Scherer (1986), Banse and Scherer (1996), and Scherer (2003)). In light of this, this section does not provide detailed coverage of the whole field of human emotional speech research (this is covered in adequate detail in the review articles); rather, it outlines the main aspects of the field that apply to the study of NLUs and gibberish speech.

Firstly, it must be pointed out that, in comparison to NLUs consisting of beeps and squeaks, human vocal expressions are very complex acoustic signals, particularly as they exploit the affordances of sound to encode both affect and natural language at once: what is said and how it is said are transmitted via the same channel, at the same time, in the same signal (Picard, 1997; Scherer, 1986, 2003). This makes the study of the human voice, and specifically emotional speech, particularly challenging. The first problem is to isolate these two components of the signal so that their underlying acoustic characteristics can be studied. A number of novel techniques have been developed to address this, varying from methods of masking linguistic context to artificially creating signals that have no linguistic context at all. Here, it is worth pointing out that this sounds familiar: both NLUs and gibberish speech have a similar underlying goal and, as we shall see later in this chapter, previous literature has used very similar techniques to create both NLUs and gibberish speech, while perhaps not being completely aware of this. Using these techniques, it has been possible to explore the acoustic correlates of emotional expression via the human voice and via music. This is presented in section 2.3.1.2.

2.3.1.1 Removing linguistic context from the human voice.

There have generally been two approaches to the problem of removing the distortion of natural language from expressive speech: cue masking, and cue manipulation by re-synthesis (Banse and Scherer, 1996; Scherer, 2003).

In cue masking approaches, the verbal cues are masked, distorted or removed from expressive vocalisations captured from humans (either by eliciting natural emotional expressions or by recording actor portrayals), in order to study the influence of the remaining acoustic features on the emotional meaning and content that people infer (Scherer, 1986). This approach was used early on in the field, with techniques such as low-pass filtering, which removes the higher frequency components of a voice sample to suppress the intelligibility of phonemes (e.g. Knoll et al. (2009)), and randomised splicing, where voice recordings are split into small segments and reordered in such a manner that the prosodic features of the utterance are generally retained, while the verbal cues are distorted and the verbal content corrupted (e.g. Scherer (1971) and Scherer et al. (1972)). Remez et al. (1981) used Sine Wave Synthesis to investigate the nature of speech perception. This technique involves analysing voice recordings and generating time-varying sinusoidal wave patterns that match the time-varying patterns of the vocal formants of the voice.
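To make the two masking operations more concrete, the following is a minimal sketch rather than the procedure of any of the cited studies. The function names, the first-order filter and the toy two-tone signal are all assumptions introduced here for illustration; published work typically uses steeper filters and real voice recordings.

```python
import math
import random


def low_pass(samples, sample_rate, cutoff_hz):
    """First-order (one-pole) low-pass filter that attenuates content above cutoff_hz."""
    # Smoothing factor of a discretised RC filter (illustrative choice only).
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def random_splice(samples, sample_rate, segment_ms, seed=None):
    """Cut the signal into fixed-length segments and shuffle their order."""
    seg_len = max(1, int(sample_rate * segment_ms / 1000.0))
    segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    random.Random(seed).shuffle(segments)
    return [x for seg in segments for x in seg]


# Toy stand-in for a voice sample: a 200 Hz component plus a 3000 Hz component.
rate = 16000
toy = [math.sin(2 * math.pi * 200 * i / rate) + math.sin(2 * math.pi * 3000 * i / rate)
       for i in range(rate // 2)]
masked_by_filter = low_pass(toy, rate, cutoff_hz=400.0)
masked_by_splicing = random_splice(toy, rate, segment_ms=250, seed=0)
```

Filtering leaves the low-frequency component largely intact while suppressing the high-frequency one, whereas splicing keeps every sample but changes the order of the segments.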

While the benefit of the cue masking approach is that it uses actual expressive human speech, which ensures a high degree of voice quality and accuracy, there are problems surrounding the methods through which the voice recordings are captured. In particular, when voice actors are asked to portray an emotion, there is a risk that they exaggerate it, so that the recording is not necessarily an accurate reflection of genuine emotional speech (Scherer, 2003). It has also been reported that people still exhibit an ability to recognise and understand, to a degree, the verbal content of the masked speech, which demonstrates the degree to which affective and verbal content are intertwined in the voice (Remez et al., 1981; Scherer et al., 1972).

Cue manipulation via re-synthesis is a more modern approach that has proven to be a remarkably useful tool (Cowie and Cornelius, 2003), particularly given the developments in general speech synthesis technology. Through this technology, the human voice can be explicitly parameterised, which allows for systematic manipulation of the vocal patterns and parameters, and for study of how people's affective inferences change as a result (Scherer, 2003). An early example of this, predating the large-scale development of speech synthesisers, comes from Scherer and Oshinsky (1977), who used a MOOG synthesiser to create concatenated tone sequences designed to resemble both sentence-like utterances and musical melodies, by specifically manipulating the pitch, rhythm, contour, timbre and tempo of the tones. More recently, the use of speech synthesisers has become popular, as reflected by the large number of publications on the subject (e.g. Cahn (1990), Murray and Arnott (1993), Murray and Arnott (1996), Burkhardt and Sendlmeier (2000), Laukka (2005), Schröder (2001), Schröder (2003a) and Schröder et al. (2010)), partly due to the direct application that findings have for speech technology applications, of which there are many.
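The idea of parameterised tone sequences can be illustrated with a short sketch. This is not a reconstruction of the stimuli of Scherer and Oshinsky (1977); the function name, its parameters and the example values are hypothetical, and timbre is left out by using plain sine tones. Pitch contour is given as a list of frequencies, rhythm as note lengths in beats, and tempo in beats per minute.

```python
import math


def synth_tone_sequence(pitches_hz, beat_durations, tempo_bpm,
                        sample_rate=8000, amplitude=0.5):
    """Concatenate sine tones: the pitch list sets the contour, the beats set the rhythm."""
    seconds_per_beat = 60.0 / tempo_bpm
    fade = 0.005 * sample_rate  # 5 ms linear fade to avoid clicks between tones
    samples = []
    for freq, beats in zip(pitches_hz, beat_durations):
        n = int(round(beats * seconds_per_beat * sample_rate))
        for i in range(n):
            ramp = min(1.0, i / fade, (n - 1 - i) / fade)
            samples.append(amplitude * ramp * math.sin(2.0 * math.pi * freq * i / sample_rate))
    return samples


# Same rhythm, but opposite contour and doubled tempo (illustrative values only).
rising = synth_tone_sequence([220, 262, 330, 392], [1, 1, 0.5, 1.5], tempo_bpm=120)
falling_fast = synth_tone_sequence([392, 330, 262, 220], [1, 1, 0.5, 1.5], tempo_bpm=240)
```

Because each cue is an explicit argument, one cue can be varied while the others are held constant, which is the property that makes re-synthesis attractive for controlled listening studies.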

The purpose of highlighting these two methods of creating stimuli with affective content for psychological studies is that there are a great many parallels with both NLUs and gibberish speech, with respect to the underlying goals but also the techniques that are used to actually produce utterances and stimuli. This is something that the related literature on both NLUs and gibberish speech has failed to observe.⁶ Furthermore, this emphasises the strong relationship between the human voice, speech synthesis and NLUs/gibberish speech, and highlights that the use of NLUs and gibberish speech need not only be geared toward application in social HRI: the methods used to create utterances can also have utility as scientific tools that help to further address research questions regarding emotional expression in the human voice.

⁶ It can be argued that the true roots of NLUs and gibberish speech lie in psychology, and that only the area of application, HRI, is different and new. It may be due to the difference in age of the respective fields of psychology (old) and HRI (young) that authors have not observed the strong links between the two fields with respect to the methods used to create stimuli/utterances.

2.3.1.2 Acoustic Correlates of Emotional Speech

Work investigating the acoustic correlates of emotional speech has tended to focus on a relatively small number of vocal cues, given the complexity of the human voice as an acoustic signal, and on how they change across different basic emotional categories (Juslin and Scherer, 2005). Moreover, as there is a very large body of research that addresses this, much of which reports different and sometimes conflicting findings, it is difficult to consolidate the results of these studies into a coherent overview of how these different parameters vary across the different emotional states. This is where invaluable review efforts come into their own (e.g. Scherer (1986), Banse and Scherer (1996), Scherer (2003), Juslin and Laukka (2003)). Drawing upon these review articles, this section provides an overview of the different vocal cues that have been studied, and how they vary. These parameters are taken into consideration in the next chapter, which outlines a custom method for characterising sentence-like NLUs.

Table 2.1 lists the main vocal cues that have been studied in the human voice, with a brief description of each. It can be seen that these parameters are all commonly associated with general properties of the voice, namely pitch, intensity, temporal aspects and voice quality. It is the changes in both pitch and the temporal aspects that translate to changes in prosody, which is a general umbrella term for the dynamics of the acoustic signal over time. With respect to NLUs and gibberish speech, all of these parameters hold relevance as they provide high-level ways of characterising utterances and, as we shall see later in this chapter, many of these vocal cues have been used when creating affectively charged utterances.

Table 2.1: Description of the Acoustic Cues in Vocal Expression. Table adapted from Juslin and Laukka (2003).

| General Property | Acoustic Cue | Perceived Correlate | Description |
|---|---|---|---|
| Pitch | Fundamental Frequency (F0) | Pitch | F0 represents the rate at which the vocal cords oscillate. Acoustically, F0 is the lowest periodic cycle component of the waveform. |
| Pitch | F0 Contour | Intonation contour | The F0 contour is the sequence of F0 values across an utterance over time. Besides changes in pitch, the F0 contour also contains temporal information, and as such is difficult to operationalise. |
| Pitch | Jitter | Pitch perturbations | Jitter is the small-scale perturbation in F0 related to random vibrations of the vocal cords. |
| Intensity | Intensity | Loudness of speech | Intensity is the measure of acoustic energy in the acoustic signal, and reflects the amount of effort required to produce an utterance. It is usually measured as the amplitude of the acoustic signal. |
| Intensity | Attack | Rapidity of voice onsets | The attack of a signal refers to the rate of the rise in the amplitude of the voiced segments of an utterance. |
| Temporal Aspects | Speech Rate | Velocity of speech | The rate can be measured as the overall duration of an utterance, or as units per duration. It can include either only the voiced segments of speech or the entire utterance as a whole. |
| Temporal Aspects | Pauses | Amount of silence in speech | Pauses are usually measured as the number or duration of silences in the acoustic waveform. |
| Voice Quality | High Frequency Energy | Voice quality | High frequency energy refers to the relative proportion of total acoustic energy above a certain frequency threshold. As the amount of high frequency energy in the spectrum increases, the voice sounds more sharp and less soft. |
| Voice Quality | Formant Frequencies | Voice quality | These are the frequency regions in which the amplitude of acoustic energy is high, reflecting the natural resonances in the vocal tract. The first two or three formants largely determine the quality of vowel pronunciation, while higher formants are usually speaker dependent. |
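As a rough illustration of how some of the cues in Table 2.1 can be operationalised, the sketch below measures intensity as root-mean-square amplitude, estimates F0 from the autocorrelation of a frame, and lists pause durations as runs of low-energy frames. The function names, frame length, search range and silence threshold are assumptions chosen for this example, and dedicated speech analysis tools use considerably more robust methods.

```python
import math


def rms_intensity(frame):
    """Root-mean-square amplitude of a frame, a simple intensity measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate F0 as the lag (within fmin..fmax) that maximises the autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def pause_durations(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Durations in seconds of runs of frames whose RMS falls below threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    pauses, run = [], 0
    for start in range(0, len(samples), frame_len):
        if rms_intensity(samples[start:start + frame_len]) < threshold:
            run += 1
        elif run:
            pauses.append(run * frame_len / sample_rate)
            run = 0
    if run:
        pauses.append(run * frame_len / sample_rate)
    return pauses


# Toy utterance: 0.2 s of a 200 Hz tone, 0.1 s of silence, 0.2 s of tone.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / rate) for i in range(1600)]
utterance = tone + [0.0] * 800 + tone
f0 = estimate_f0(utterance[:800], rate)
pauses = pause_durations(utterance, rate)
```

Pause durations are resolved only to whole frames here, so shorter frames give a finer estimate at the cost of a noisier silence decision.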

With respect to how these voice cues change across the expression of different emotions, Scherer (2003) has attempted to provide a rough characterisation of the main vocal cues across the basic emotions, based upon the general findings reported in the literature. These are shown in Table 2.2. It can be seen from the table that not all the voice cues have a characterisation for every emotion. This is because not all studies focus on the same emotions, and many studies report contradictory and conflicting results (Scherer, 2003). Generally, it can be seen that high arousal states such as anger, fear and joy are commonly associated with an increase in F0, as well as in its variability and range. In these states it is also commonly found that the speech and articulation rate is higher than in lower arousal states such as sadness and boredom, as is the high frequency energy.

While this table is rather vague, it does serve as a good basic guideline for how different acoustic signals might be designed to convey different affective states, and particularly for how the features of the signals covary. The main drawback is that these characterisations are very general. When it comes to implementing such insights in a system for creating synthetic utterances, many of the parameters characterising an utterance are system specific, and so the transfer of these broad characteristics of vocal cues to system-specific parameters can be limited, particularly in the case of NLUs, which are designed to be abstract sounds rather than to resemble human speech. Also, the characteristics identified by Scherer (2003) do not provide exact specifications for the measured values of each voice cue. The reason for this is that each human voice is different, and so the exact parameter values differ greatly from person to person, and therefore also from experiment to experiment in the literature. However, there appears to be more consistency in the dynamics of the human voice across the different emotional states than there is in raw parameter values. The table therefore serves as a good starting point from which to design and compare the dynamics of NLUs, but it makes gauging the initial cue values (such as speech rate, pauses, F0 range, etc.) difficult.
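One way to read this in practice is to treat the reported trends as directions of change applied to a voice-specific neutral baseline, rather than as absolute values. The sketch below is only an illustration of that reading: the cue names, the baseline values and the 20% step size are all hypothetical, and the directions encode only the broad high- versus low-arousal trends for anger and sadness given in Table 2.2.

```python
# Direction of change relative to a neutral voice: +1 = increase, -1 = decrease.
# Only the broad trends for anger (high arousal) and sadness (low arousal) are encoded.
CUE_DIRECTIONS = {
    "anger": {"f0_mean_hz": +1, "f0_range_hz": +1, "speech_rate_sps": +1},
    "sadness": {"f0_mean_hz": -1, "f0_range_hz": -1, "speech_rate_sps": -1},
}


def apply_emotion(baseline, emotion, step=0.2):
    """Scale each baseline cue by (1 + direction * step); unlisted cues are unchanged."""
    directions = CUE_DIRECTIONS[emotion]
    return {cue: value * (1.0 + directions.get(cue, 0) * step)
            for cue, value in baseline.items()}


# Two hypothetical voices with different neutral values (illustrative numbers only).
voice_a = {"f0_mean_hz": 120.0, "f0_range_hz": 40.0, "speech_rate_sps": 4.0}
voice_b = {"f0_mean_hz": 210.0, "f0_range_hz": 70.0, "speech_rate_sps": 5.0}

angry_a = apply_emotion(voice_a, "anger")
angry_b = apply_emotion(voice_b, "anger")
sad_a = apply_emotion(voice_a, "sadness")
```

The two voices end up with different absolute values but the same relative change, which mirrors the observation that the dynamics are more consistent than the raw parameter values. Choosing the baseline and the step size remains the open problem noted above.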

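Table 2.2: Summary of the acoustic patterning of the human voice for the basic emotions. Table adapted from Scherer (2003).

| Voice Property | Stress | Anger/rage | Fear/panic | Sadness | Joy/elation | Boredom |
|---|---|---|---|---|---|---|
| Intensity | ↗ | ↗ | ↗ | ↘ | ↗ | |
| F0 floor/mean | ↗ | ↗ | ↗ | ↘ | ↗ | |
| F0 variability | | ↗ | | ↘ | ↗ | ↘ |
| F0 range | | ↗ | ↗ (↘) | ↘ | ↗ | ↘ |
| Sentence contours | | ↘ | | ↘ | | |
| High frequency energy | | ↗ | ↗ | ↘ | ↗ | |
| Speech and articulation rate | | ↗ | ↗ | ↘ | ↗ | ↘ |

↗ indicates an increase and ↘ a decrease; blank cells indicate that no characterisation is given.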
