Another line of research uses data-driven methods to simulate (individual) speakers’ gesturing behavior.

2. The term ‘gesture’ includes not only hand and arm movements but also head nods, eyebrow movements, etc.

Creating Characters from Human Motion Capture Data

Stone et al. (2004) proposed a method for using a database of recorded speech and captured motion to create an animated conversational character (RS+CM approach). The framework tied together offline activities of content authoring and data preparation with online processes that use the prepared content and data for generation and animation.

In content authoring, a scriptwriter designs what the character will say. Automatic tools then compute the utterance units implicit in the specification, formulate a concise script for a performer, compute a database specification that organizes the anticipated sound and motion recordings, and compile an application-specific generator that will index the resulting database.

For data preparation, automatic tools for speech and motion data analysis are combined with manual annotations. In particular, the data is coded for points of perceived prominence in speech and gesture. In addition, gestures are classified into two categories: descriptive or expressive. Descriptive gestures elaborate the referential content of the utterance, which is typical of iconic gestures that represent objects or events in space. Expressive gestures, in contrast, highlight the attitude of the speaker towards what she is saying and comment on the relationship between speaker and addressee, which is typically the case for metaphorical and beat gestures.

To plan a new utterance, the generator automatically determines the content and communicative function of each of the phrases the character needs to realize. To animate these phrases, suitable sound recordings must be combined with adequate gesture performances. Since this is a unit selection problem, a cost function is defined to determine the best combination of a sound s and a motion m, namely the one that minimizes the degree to which a unit of performance must be modified in the final realization. The function takes two measures into account: first, the difference between two successive motions, temporally averaged across the short overlay window where corresponding samples are interpolated, and second, the difference in pitch (expressed as a ratio) between the peaks of two successive sounds.
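The two measures described above can be illustrated with a minimal sketch. It is not the implementation of Stone et al. (2004): the Euclidean pose distance, the absolute log of the pitch ratio, the weights `w_motion` and `w_pitch`, the dictionary layout of a unit, and the greedy one-step selection are all assumptions made for illustration.

```python
from math import log


def motion_blend_cost(prev_tail, next_head):
    """Mean pose distance across the overlay window.

    prev_tail / next_head: equal-length lists of poses (lists of floats)
    sampled at corresponding times. Euclidean distance is an assumption.
    """
    if not prev_tail or len(prev_tail) != len(next_head):
        raise ValueError("overlay windows must be non-empty and equal in length")
    total = 0.0
    for a, b in zip(prev_tail, next_head):
        total += sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return total / len(prev_tail)


def pitch_cost(prev_peak_hz, next_peak_hz):
    """Pitch mismatch as the absolute log of the peak ratio (0 when equal)."""
    return abs(log(next_peak_hz / prev_peak_hz))


def join_cost(prev_unit, cand, w_motion=1.0, w_pitch=1.0):
    """Weighted sum of both measures; the weights are hypothetical."""
    return (w_motion * motion_blend_cost(prev_unit["tail"], cand["head"])
            + w_pitch * pitch_cost(prev_unit["peak_hz"], cand["peak_hz"]))


def select_unit(prev_unit, candidates):
    """Greedy choice of the cheapest (sound, motion) candidate to follow prev_unit."""
    return min(candidates, key=lambda c: join_cost(prev_unit, c))
```

A full unit selection search would typically minimize the summed cost over the whole utterance rather than one join at a time; the greedy step is kept here only to make the cost terms easy to inspect.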

The unified approach makes it possible to capture the data needed for a character with a limited number of performances, to catalogue performance data with limited human effort, and to synthesize novel utterances.

Generating Gestures from a Speaker-Specific Model

Another data-driven gesture generation approach was developed by Neff et al. (2008): a system for generating believable gesture animations for novel text that reflect the gesturing style of particular individuals by building speaker models (SM approach). The approach is mainly data-driven, using a video corpus of the human performer, but also incorporates general, character-independent mechanisms. The approach is divided into two phases: an offline processing phase to build gesture profiles from speakers, and an online processing phase to generate speaker-specific gestures for arbitrary input texts. For an overview see Figure 3.9.

Figure 3.9: System to generate believable gesture animations for novel text that reflect the gesturing style of particular individuals (Neff et al., 2008). The approach is divided into two phases: an offline processing phase to build a gesture lexicon and gesture profiles from speakers, and an online processing phase to generate speaker-specific gestures for arbitrary input texts.

The offline processing phase is done individually for each speaker and begins with the annotation of video data from the particular speaker with respect to speech and gestures. Spoken words are grouped into clauses and annotated for their information structure. The gestural part of the annotation follows the hierarchical organization of gestures in phases, phrases, and units (cf. Section 2.1.2). In addition, four attributes are coded for each gesture: (1) handedness, (2) lexical affiliate, (3) co-occurrence, and (4) lexeme. The lexeme denotes the lexicon entry to which the gesture corresponds in a gesture lexicon built beforehand to capture the semantics of gestures (Kipp, 2004). A total of 39 gesture types, i.e., recurring gesture patterns, are identified and described with respect to gesture form constraints in terms of handshape, hand location, hand orientation, hand/arm movement, handedness, shoulder movement, and facial expression.

From the annotated corpus, a profile of a speaker’s gesturing behavior is built. This profile consists of a sample database, a statistical model, and average values. For the database, the annotations for each gesture in the corpus are stored as a reproducible sample of the specific speaker. To build the statistical model, the speech transcription is processed in order to assign semantic tags such as ‘agreement’ (“yes”) or ‘quest_part’ (“why”). The model is then automatically computed from the annotations and used in generation to trigger gestures, to predict where they are placed relative to speech, and to determine parameters such as handedness and frequency.
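As a rough illustration of how such a statistical model could be computed from annotations, the following sketch estimates the probability of each gesture lexeme given a semantic tag by relative frequency. The input layout (pairs of tag and lexeme, with `None` for words that carry no gesture), the function name, and the lexeme names in the usage note are hypothetical; the source does not specify the estimator used by Neff et al. (2008).

```python
from collections import Counter, defaultdict


def estimate_gesture_given_tag(samples):
    """Relative-frequency estimate of P(lexeme | semantic tag).

    samples: iterable of (semantic_tag, lexeme) pairs taken from an
    annotated corpus; lexeme is None when no gesture was annotated.
    Returns {tag: {lexeme: probability}}.
    """
    counts = defaultdict(Counter)
    for tag, lexeme in samples:
        counts[tag][lexeme] += 1

    model = {}
    for tag, lexeme_counts in counts.items():
        total = sum(lexeme_counts.values())
        model[tag] = {lex: n / total for lex, n in lexeme_counts.items()}
    return model
```

For example, with the annotations `[("agreement", "lexeme_A"), ("agreement", "lexeme_A"), ("agreement", None)]`, the estimate for `lexeme_A` given `agreement` is two thirds, and the remaining third is the probability that no gesture is triggered.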

Once a speaker’s gesture profile is created, the system can process any text that has been segmented into utterances and (manually) coded for its information structure. Words are stemmed and mapped to a semantic tag, just as in the offline modeling step. The generation then proceeds in two steps: gesture creation/selection, and gesture formation.

In the first step, a large number of underspecified gesture candidates are created and then reduced by a selection criterion. For each semantic tag in the input text, the generation system computes the conditional probability that a gesture g occurs given the semantic tag s. Additionally, a bi-gram model of gesture sequences is considered, i.e., the conditional probability that gesture g_i follows a previous gesture g_{i-1}. Similarly, the handedness is determined on the basis of a bi-gram model that captures handedness sequences. Handshape is determined by consulting a lexicon in which all suitable handshapes for a particular lexeme are specified. Following a rule of economy, the handshape employed in the previous gesture is retained if it is suitable for the current lexeme; otherwise the handshape is changed to a suitable one.
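The selection step can be sketched as follows. The source states which probabilities are considered but not how they are combined, so multiplying the tag-conditional probability by the bi-gram probability is an assumption, as are the smoothing `floor`, the table layouts, and the choice of the first lexicon entry when the handshape has to change.

```python
def score_candidates(tag, prev_gesture, p_g_given_tag, p_bigram, floor=1e-6):
    """Score each candidate gesture g for a semantic tag.

    p_g_given_tag: {tag: {g: P(g | tag)}}
    p_bigram:      {g_prev: {g: P(g | g_prev)}}
    The product of both terms is an assumed combination rule;
    `floor` stands in for unseen bi-grams.
    """
    scores = {}
    for g, p_tag in p_g_given_tag.get(tag, {}).items():
        p_seq = p_bigram.get(prev_gesture, {}).get(g, floor)
        scores[g] = p_tag * p_seq
    return scores


def choose_handshape(lexeme, previous_handshape, lexicon):
    """Rule of economy: keep the previous handshape when the lexicon
    lists it as suitable for this lexeme, otherwise switch.

    lexicon: {lexeme: [suitable handshapes]}. Taking the first entry
    on a switch is an arbitrary choice made for this sketch.
    """
    suitable = lexicon[lexeme]
    if previous_handshape in suitable:
        return previous_handshape
    return suitable[0]
```

The candidate with the highest score would then be kept, for instance via `max(scores, key=scores.get)`; a handedness bi-gram could be handled with the same table lookup.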

In the second step, timing details for the realization of gestures are planned. To this end, gesture and speech are arranged by positioning the end of the stroke at the end of the corresponding word, using a random onset based on the speaker’s mean value. Neighboring gestures are merged, resulting in multiple synchronized strokes, in order to enforce a minimum time span.
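The stroke alignment can be sketched under strong assumptions. Here the "random onset" is read as a Gaussian offset around a speaker-specific mean that shifts the stroke end relative to the word end; the parameter names, the Gaussian form, and the use of seconds are hypothetical and not taken from Neff et al. (2008).

```python
import random


def place_stroke(word_end, stroke_duration, mean_offset, jitter=0.0, rng=None):
    """Return (stroke_start, stroke_end) in seconds.

    The stroke end is anchored at the end of the affiliated word and
    shifted by an offset drawn around the speaker's mean value.
    With jitter=0.0 the offset equals mean_offset exactly.
    """
    rng = rng or random.Random()
    offset = rng.gauss(mean_offset, jitter)  # assumed distribution
    stroke_end = word_end + offset
    stroke_start = stroke_end - stroke_duration
    return stroke_start, stroke_end
```

Passing a seeded `random.Random` instance as `rng` makes the placement reproducible, which is convenient when comparing generated animations.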