
In this section, we proposed an efficient solution to structure the intermediate representations built by layered LSTMs. We have shown that gaze can be used effectively as a driving signal for head motion generation. This intervention is effective in terms of both accuracy and coordination patterning.

Figure 3.32 Average H1 RMSE of the baseline model without and with SP shifted frames, as a function of the number of training epochs.

For now, with the HHI data, the cascaded model, which predicts FX before generating head motion, improves only the pitch (H1) angle. This could be explained by the task layout: there is a large distance between the tablet (used to give the cube position) and the manipulator space, so head pitch contributes strongly to gaze shifts between them. In contrast, since the target and manipulator spaces are close to each other and the eye movement between these two positions lies within the field of view, head movement contributes less to the gaze. Therefore, H2 and H3 have a smaller correlation with the gaze target than H1. In the future, within the immersive teleoperation system, the pilot will perhaps move his/her eyes more slowly than normal in order to overcome the sensorimotor latency of the oculomotor control. Head movement is thus expected to contribute more to gaze shifts, so the proposed model could improve all the degrees of head movement through intermediate gaze prediction.

The quality of prediction may be enhanced in several ways. Other contextual information can be used as additional input (precise regions of interest for the gaze, gaze contacts, communicative functions of speech, etc.) as well as intermediate objectives (e.g. eyebrow movements or respiratory patterns). In addition, we did not use the segmentation of the task into IUs because most of these IUs were triggered by gaze or speech events. More complex tasks that involve switching between multiple interaction styles with multiple agents may motivate structuring the interaction by IUs, notably when alternative cues are used to trigger similar pragmatic frames.

Furthermore, the head motion generation model will be used to drive the head of our iCub humanoid robot when it autonomously instructs human manipulators. We first plan to perform a subjective assessment of our multimodal behavioral model (see [NBE16] for our crowd-sourcing methodology). Another challenge is to adapt this model to multiple manipulators, notably those with motor disabilities. In this case, the behavioral model should incrementally estimate both the best action and the optimal interaction style according to the goodness of fit between the actual behavior of the interlocutor and the expected behavior predicted by the joint behavioral model.

Figure 3.33 CCA of Hs vs. SP with various numbers of shifted frames.

3.6 Summary

In this chapter, we presented multimodal interactive behavioral models based on recurrent neural networks, namely Long Short-Term Memory (LSTM) RNNs, for predicting discrete (arm, gaze, backchannel) and continuous (head motion) variables.

The predictions of arm, gaze and interaction units obtained with LSTM for online prediction and Bidirectional LSTM (BiLSTM) for offline prediction are compared with those of other statistical methods: HMM and DBN. The LSTM behavioral models benefit from extracting contextual information from the data, instead of being limited to the boundaries of the hidden states of the HMM or to the immediately preceding frames of the DBN dependency graph. The LSTM methods perform better than the statistical methods with regard to both prediction performance and intermodal coordination.

Figure 3.34 Average H1 RMSE without and with SP shifted frames for the different models.

For backchannel generation, we compared two methods (CRF vs. LSTM), with and without contextual windows, using data from the RL/RI scenario. The LSTM models outperform the CRF models. LSTM seems to better capture the relevance and timing of BC in the dialog scenario, where the interviewer usually uses BC to encourage elderly people to be confident in answering questions.

We also investigated how to use LSTM models to generate continuous head motion. Because of its ability to capture long-term dependencies between latent variables, a single-layer LSTM was used as the baseline model. We used CCA to analyze the PTT interactive data and found that head motion has the highest correlation with gaze (FX) and interaction unit (IU) compared with the other features (speech, manipulator's arm, F0). A control model with an additional FX input significantly improves head motion generation quality. In order to improve the quality of prediction while keeping the same inputs as the baseline model, we built a cascaded LSTM model in which one LSTM predicts FX, which then serves as an input to another LSTM that generates head motion. We found that the cascaded LSTM model (with pre-trained and fine-tuned parameters) not only improves head motion accuracy compared with the baseline, but also achieves the best coordination.
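The correlation analysis with shifted frames (as in Figure 3.33) can be illustrated with a deliberately simplified sketch. It is not the analysis code used for this work: it assumes one-dimensional signals, for which the first canonical correlation reduces to the absolute Pearson correlation, so plain Pearson correlation stands in for CCA here. The function names `pearson`, `shifted_correlation` and `best_shift` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant signal carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def shifted_correlation(head, feature, shift):
    """Absolute correlation between head[t] and feature[t - shift].

    `shift` is a non-negative number of frames by which the feature
    is assumed to lead the head signal (a 1-D stand-in for CCA).
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if shift == 0:
        return abs(pearson(head, feature))
    return abs(pearson(head[shift:], feature[:-shift]))


def best_shift(head, feature, max_shift):
    """Return (shift with the highest correlation, all scores by shift)."""
    scores = {s: shifted_correlation(head, feature, s)
              for s in range(max_shift + 1)}
    best = max(scores, key=scores.get)
    return best, scores
```

With multidimensional head angles and features, a full CCA implementation would replace `pearson`; the scan over shifts would stay the same.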

Chapter 4

Gesture Controllers: Design and Evaluation

In chapter 2, we described how to collect human-human interactive (HHI) data and extract useful features for training multimodal interactive behavioral models. Then, we built interactive models (as described in chapter 3) that can generate actions from perception streams in two interactive tasks (Put That There and Selective Reminding Test).

Note that the interactive models can generate multimodal robot behaviors at two levels: abstract-level vs. skill-level. The abstract level represents elementary behavioral skills of the target task (e.g. look at (ROI), say (text), hand-point to (ROI)), which are described by discrete events. In contrast, the skill-level behavior is related to specific motions such as head trajectories. The feature-level behaviors are generated so that the score they compute can directly command the robot's motor micro-controllers, while the skill-level behaviors should trigger specific gesture controllers that further convert events into skill-level trajectories. Gesture controllers are thus the analog of the gesticon, the central gesture repository introduced by Krenn and Pirker [KP04], which stores gesture snippets and facial expressions relevant for generating the nonverbal behavior that accompanies the dialogue of virtual agents.

In this chapter, we focus on designing gesture controllers that can be used to execute the discrete events on our humanoid robot. Building gesture controllers is a fundamental step in developing robot behaviors, since it enables us to realize how our robot interacts physically with humans (see Figure 4.1).

For future work, the gesture controllers will be used to build semi-autonomous as well as autonomous robots that perform the interactive tasks (see chapter 5). Hence, we need to ensure that the events and their synchronization are still perceived correctly by the human observers for whom they are created. In this chapter, we propose an evaluation framework to spot the robot's faulty behaviors so that they can be redesigned or better adapted. Observing HHI is a good way to design and evaluate the gesture controllers and their relative synchronization. For evaluation, reusing the scores of the HHI data allows us to evaluate how the robotic gestures are perceived by human targets, without evaluating the interactive behavioral model at the same time. This lets us focus on correcting how gesture controllers encode elementary skills and when events are triggered, and thus partly disentangles execution problems from planning problems. In fact, robots have difficulty performing many actions that tutoring humans can easily perform (e.g. using one's hand to open/close a notebook or using a pen to write). Therefore, in order for our iCub humanoid robot to perform the interactive scenarios acceptably, some actions of the robot will be changed to better match the robot's abilities. In particular, in the RL/RI task, instead of opening/closing a notebook to show/hide items, the robot simulates item display and scoring events just by clicking on a fake tablet. This chapter covers how we adapt the HHI events to the HRI situation so that the events can be executed more easily by the robot while maintaining the semantics of the demonstrated HHI events.

Figure 4.1 Gesture controllers: design and evaluation. HHI data are used not only to design gesture controllers, but also to evaluate the capability of the robot to reproduce coordinated verbal and co-verbal behaviors.

We focus here on the RL/RI scenario, which requires the robot to perform much more complex multimodal behaviors and to exhibit more varied social skills than the Put That There scenario. We will detail how to adapt the HHI protocol to HRI and how to design and evaluate gesture controllers for this scenario.


Figure 4.2 Adapted RL/RI scenario for human-robot interaction: the robot uses a tablet to convince the subject that it drives the display of items and that it effectively takes notes. Another tablet facing the subject displays/hides items according to the robot's needs.

4.1 Adapt the RL/RI scenario from HHI to HRI

In chapter 2, we presented the HHI multimodal data, which consist of time-stamped speech, arm/hand gesture and gaze events labeled with their discrete values (e.g. looking at the subject's face, tablet, ...; uttering a text/backchannel with different attitudes; ...), organized in HHI multimodal scores. Now, we concentrate on developing modality-specific gesture controllers that map these events to robotic actions that a human observer can perceive and understand.
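The mapping from a multimodal score to modality-specific controllers can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the controller architecture of this work: the `Event` fields, the controller functions, the `CONTROLLERS` registry and `play_score` are all hypothetical names, and the controllers return command strings instead of driving a robot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """One time-stamped discrete event of a multimodal score (hypothetical)."""
    time: float     # onset time in seconds
    modality: str   # e.g. "gaze", "speech", "arm"
    value: str      # discrete label, e.g. a region of interest or a text


Controller = Callable[[Event], str]


# Hypothetical controllers: each turns an event into a command string.
def gaze_controller(ev: Event) -> str:
    return f"look_at({ev.value})"


def speech_controller(ev: Event) -> str:
    return f"say({ev.value!r})"


def arm_controller(ev: Event) -> str:
    return f"point_to({ev.value})"


CONTROLLERS: Dict[str, Controller] = {
    "gaze": gaze_controller,
    "speech": speech_controller,
    "arm": arm_controller,
}


def play_score(score: List[Event],
               controllers: Dict[str, Controller] = CONTROLLERS
               ) -> List[Tuple[float, str]]:
    """Dispatch events in time order to the controller of their modality."""
    commands = []
    for ev in sorted(score, key=lambda e: e.time):
        controller = controllers.get(ev.modality)
        if controller is None:
            raise KeyError(f"no gesture controller for modality {ev.modality!r}")
        commands.append((ev.time, controller(ev)))
    return commands
```

Keeping one controller per modality behind a shared registry means a single faulty controller can be redesigned without touching the score or the other modalities.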
