Johnston et al. (2009) define multimodal integration as the process of combining input from different modes to create an interpretation of composite input. A synonymous term is multimodal fusion, which is borrowed from the terminology of physics. Multimodal fusion combines multiple types of input data, each associated with a particular modality, and is a fundamental task in the integration of various modalities.
Classification of Multimodal Input
Nigay and Coutaz (1993) call the absence of fusion Independent Modalities and its presence Combined Modalities. Serrano and Nigay (2009) organise the combination space of interaction modalities into two dimensions: the type of relationship between modalities and the temporal relationship. The type of relationship is explained with the CARE properties (Coutaz et al., 1995):
The CARE properties (Complementarity, Assignment, Redundancy, and Equivalence) characterise multimodal interaction from a usability perspective in HCI. They are a set of properties that describe the relationship between modalities for reaching a goal or the next state in a multimodal system.
Equivalence - Expresses the concept of free choice of modality. Multiple modalities can reach the same goal and it is sufficient to use only one of them without any temporal constraint on them.
Assignment - Expresses the absence of choice. One, and only one, modality can be used in order to reach a goal. An example is the steering wheel of a car.
Redundancy - Two modalities have the same expressive power, but both are required to be used within a temporal window in order to reach a goal. Redundancy can be important for safety-relevant functionalities.
Complementarity - Two modalities are used within a temporal window for reaching a goal. Both modalities are needed to describe the desired meaning. A speak-and-point system is a classic example of this.
Equivalence and Assignment describe independent modalities, which can be interpreted individually. Redundancy and Complementarity describe combined modalities, which require a multimodal fusion of the input. While redundant input must be compared and verified for an identical meaning, complementary input must be combined in order to express the meaning.
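The distinction between independent and combined modalities can be illustrated with a minimal Python sketch. It assumes that each interpreted input has already been reduced to a flat dictionary of slot/value pairs; the names `Care`, `needs_fusion`, and `fuse` are hypothetical and are not taken from any of the cited systems.

```python
from enum import Enum


class Care(Enum):
    COMPLEMENTARITY = "complementarity"
    ASSIGNMENT = "assignment"
    REDUNDANCY = "redundancy"
    EQUIVALENCE = "equivalence"


def needs_fusion(prop):
    """Only combined modalities (redundancy, complementarity) need fusion."""
    return prop in (Care.REDUNDANCY, Care.COMPLEMENTARITY)


def fuse(prop, first, second=None):
    """Combine up to two partial meanings according to a CARE property.

    Assumption: a meaning is a flat dict mapping slot names to values.
    """
    if not needs_fusion(prop):
        # Independent modalities: one input is interpreted on its own.
        return dict(first)
    if second is None:
        raise ValueError("combined modalities need two inputs")
    if prop is Care.REDUNDANCY:
        # Redundant input must be verified for an identical meaning.
        if first != second:
            raise ValueError("redundant inputs disagree")
        return dict(first)
    # Complementary input: merge the partial meanings into one.
    merged = dict(first)
    merged.update(second)
    return merged


# Speak-and-point: speech supplies the action, the gesture the object.
speech = {"action": "put"}
gesture = {"object": "blue_box"}
combined = fuse(Care.COMPLEMENTARITY, speech, gesture)
print(combined)  # {'action': 'put', 'object': 'blue_box'}
```

The sketch ignores the temporal window that Redundancy and Complementarity impose; a real system would check the timestamps of both inputs before calling `fuse`.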
Temporal Relationship and Synchronisation
As mentioned above, the time frame during which multimodal input occurs is relevant for the multimodal fusion. This implies the importance of the temporal synchronisation of all input devices. Vernier and Nigay (2001) specify five distinct combination schemes for the temporal relation between multimodal inputs (Figure 2.6). Three of the relations describe multimodal inputs that overlap and occur simultaneously (Concomitance, Coincidence, Parallelism). Anachronism and Sequence are sequential and are distinguished by the size of the temporal window between the usage of the two modalities.
The temporal relations can provide relevant indications of whether and how multimodal input should be combined. Early multimodal systems like the “Put-That-There” system by Bolt (1980) relied on the fact that multimodal constructions temporally co-occur.
The meaning of the deictic term “that” in the spoken utterance “put that there”, for example, was resolved to the object at which the user was pointing when the word was spoken. This multimodal integration approach seems to be suitable for multimodal speak-and-point systems but is of limited practical use in the design of future multimodal systems that involve other modes, like gestures or body movements, without deictic-point relations (Oviatt, 2012). It turned out that the temporal overlap of signals does not necessarily determine which signals should be combined. A series of studies showed that there are two distinct types of users with respect to integration patterns and that these patterns occur across the lifespan, from children through the elderly (Xiao et al., 2002, 2003). An integration pattern here specifies the strategy by which users combine multimodal input with respect to the temporal relation. Simultaneous integrators overlap their input temporally, whereas sequential integrators begin one mode only after the other has finished (Oviatt, 1999b; Oviatt et al., 2005). Since a user’s habitual integration pattern remains highly consistent during a session, systems may be able to automatically detect and adapt to a user’s dominant multimodal integration pattern. This adaptation may also include the temporal thresholds applied during the sequential use of modalities.
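How a system might detect a user’s dominant integration pattern can be sketched as follows. This is a minimal Python illustration under stated assumptions: each multimodal construction is given as a pair of (start, end) intervals in seconds, and the function names and the `min_share` threshold of 0.6 are hypothetical choices, not values from the cited studies.

```python
from collections import Counter


def classify_pair(first, second):
    """Label one multimodal construction as simultaneous or sequential.

    Each argument is a (start, end) interval in seconds for one modality.
    """
    (s1, e1), (s2, e2) = sorted([first, second])
    # The inputs overlap if the later-starting one begins before the
    # earlier-starting one has ended.
    return "simultaneous" if s2 < e1 else "sequential"


def dominant_pattern(observations, min_share=0.6):
    """Return the dominant pattern, or None if neither pattern dominates.

    min_share is an assumed threshold for calling a pattern dominant.
    """
    if not observations:
        return None
    counts = Counter(classify_pair(a, b) for a, b in observations)
    label, count = counts.most_common(1)[0]
    return label if count / len(observations) >= min_share else None


def mean_lag(observations):
    """Average gap between the two modes over sequential constructions.

    Such a value could seed a user-specific temporal threshold.
    """
    gaps = []
    for a, b in observations:
        (s1, e1), (s2, e2) = sorted([a, b])
        if s2 >= e1:
            gaps.append(s2 - e1)
    return sum(gaps) / len(gaps) if gaps else None


# Made-up session data: (speech interval, gesture interval) per construction.
session = [
    ((0.0, 1.0), (1.5, 2.0)),
    ((0.0, 1.0), (2.0, 2.5)),
    ((0.0, 1.0), (0.5, 1.5)),
]
print(dominant_pattern(session))  # sequential
print(mean_lag(session))          # 0.75
```

Because the pattern is reported to stay consistent within a session, a small number of observed constructions may already suffice for such an estimate; how many are needed is left open here.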
Fusion level
An important aspect of multimodal fusion is the appropriate fusion technique that is applied to combine incoming unimodal events into a single representation of the user’s intention. The literature often distinguishes between two stages at which fusion occurs: early fusion and late fusion (Turk and Kölsch, 2003; Jaimes and Sebe, 2007; Nigay and Coutaz, 1993). The decisive factor here is the level of abstraction at which the fusion takes place.
Early fusion occurs at a feature level. The input signals are concatenated and provided to a joint classifier that generates an interpretation (see Figure 2.7a). The interpretation (or classification) is mostly based on machine-learning techniques such as neural networks or hidden Markov models. A classic example of early fusion is the audio-visual combination of speech and lip movements. Here the motion data from the lips are concatenated with features from the recorded voice in order to recognise a spoken utterance (Tamura et al., 2004).
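The principle of feature-level fusion can be sketched in a few lines of Python. The sketch is illustrative only: the feature values are made up, and the toy nearest-centroid classifier (`NearestCentroid`, a hypothetical name) merely stands in for the neural networks or hidden Markov models used in practice.

```python
import math


def early_fuse(audio_features, lip_features):
    """Feature-level fusion: concatenate the per-modality feature vectors."""
    return list(audio_features) + list(lip_features)


class NearestCentroid:
    """Toy joint classifier operating on the fused feature vectors."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # One mean vector (centroid) per class label.
        self.centroids = {
            label: [total / counts[label] for total in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        # Pick the label whose centroid is closest in Euclidean distance.
        return min(
            self.centroids,
            key=lambda label: math.dist(vec, self.centroids[label]),
        )


# Made-up training data: two audio features and one lip feature per sample.
train = [
    (early_fuse([0.9, 0.1], [0.8]), "ba"),
    (early_fuse([0.8, 0.2], [0.9]), "ba"),
    (early_fuse([0.1, 0.9], [0.2]), "ga"),
    (early_fuse([0.2, 0.8], [0.1]), "ga"),
]
clf = NearestCentroid().fit([v for v, _ in train], [l for _, l in train])
prediction = clf.predict(early_fuse([0.85, 0.15], [0.7]))
print(prediction)  # ba
```

The key point is that the classifier never sees the modalities separately: a single decision is made on the concatenated vector, which is why the per-modality features must be aligned in time.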
In late fusion, also called decision fusion, the signals are first classified independently at a feature level. The results are then combined into a joint interpretation (see Figure 2.7b). Late fusion is realised at a semantic level, and techniques such as unification on graphs or Bayesian networks are employed to combine the information.
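As an illustration of fusion at the semantic level, the following minimal Python sketch unifies two partial interpretations. It is a strong simplification: nested dictionaries stand in for the feature-structure graphs used by unification-based approaches, `None` marks an unfilled slot, and the function name `unify` and the example frames are assumptions made for this sketch.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts); return None on conflict.

    A value of None marks an unfilled slot that the other input may fill.
    """
    result = dict(a)
    for key, b_val in b.items():
        a_val = result.get(key)
        if a_val is None:
            # Slot missing or unfilled in a: take the value from b.
            result[key] = b_val
        elif b_val is None or a_val == b_val:
            # b adds nothing new for this slot.
            continue
        elif isinstance(a_val, dict) and isinstance(b_val, dict):
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        else:
            # Conflicting atomic values: the inputs cannot be combined.
            return None
    return result


# Hypothetical outputs of two independent, modality-specific recognisers
# for the utterance "put that there" with an accompanying pointing gesture.
speech_frame = {"act": "put", "object": {"type": None, "id": None}, "target": None}
gesture_frame = {"object": {"id": "box_3"}}

fused = unify(speech_frame, gesture_frame)
print(fused)
# {'act': 'put', 'object': {'type': None, 'id': 'box_3'}, 'target': None}
```

Since both inputs share the same representation, neither recogniser needs to know about the other, and a failed unification (a `None` result) can be used to reject or reinterpret a hypothesis.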
Figure 2.7 – Multimodal fusion on distinct levels (Oviatt and Cohen, 2015b): (a) Early fusion at a feature level; (b) Late fusion at a semantic level
Atrey et al. (2010) mention several advantages of late over early fusion. One is that the interpretations at a semantic level have the same form, which makes their fusion easier. A second is that, for each single modality, the most suitable methods for analysing the input data can be applied, making the process more flexible than early fusion. Wahlster (2003) explains that, at a semantic level, the back-tracking and reinterpretation of a result is easier. Furthermore, the development process is less complex, since new modalities can be handled without specifying all varieties of cross-modal references in advance. Oviatt and Cohen (2015b) point out that the development process is also simplified because commercial “black box” recognisers, which provide no access to their internal state or data, can be applied. Late fusion is able to fuse modalities that are not time-synchronous. With early fusion, by contrast, the feature vectors are usually closely bound in time. However, early fusion can exploit potentially useful information that would already have been discarded by the time late fusion is applied (Wasinger, 2006).
In SiAM-dp the focus is on the late fusion process, since the system works with content that has already been semantically represented by modality-specific interpreters/recognisers. Nevertheless, it is possible that recognisers have already applied early fusion when interpreting input from multiple modalities before they provide the result to the dialogue system.