LA ADMINISTRACION DE RIESGOS CORPORATIVOS



The “Put That There” system (Bolt, 1980) is one of the earliest demonstrations of the multimodal concept. Bolt built a media room with a wall-sized display and a chair for the user in front of it (see Figure 3.1a). The system lets the user interact through voice and pointing gestures in a spatial data management context. Commands such as “Create a blue square there”, “Put that there” or “Make that smaller” allow the user to create, move or manipulate geometric objects on the screen. Taken on their own, all of these phrases are incomplete, because either the object to be manipulated or the target position of a movement is missing. The missing content is supplied by pointing gestures, which provide spatial information, resolve pronoun references and eliminate ambiguity. Semantic processing is realised, for example, by replacing the deictic term “there” with the x,y coordinate indicated by the cursor at the time of the utterance.
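The substitution of deictic terms by cursor coordinates can be illustrated with a minimal Python sketch. It is not Bolt's implementation: the names `CursorSample`, `Word`, `cursor_at` and `resolve_deixis`, the set of deictic terms and the use of one timestamp per word are assumptions made for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Assumed set of deictic terms; the source only names "there" explicitly.
DEICTIC_TERMS = {"that", "there"}


@dataclass
class CursorSample:
    t: float  # time of the sample in seconds
    x: int
    y: int


@dataclass
class Word:
    text: str
    t: float  # time at which the word was spoken, in seconds


def cursor_at(samples, t):
    """Return the latest cursor sample taken at or before time t.

    `samples` must be sorted by time; if t precedes all samples,
    the first sample is returned.
    """
    times = [s.t for s in samples]
    i = bisect_right(times, t) - 1
    return samples[max(i, 0)]


def resolve_deixis(words, samples):
    """Pair every word with the cursor position if it is deictic, else None."""
    resolved = []
    for w in words:
        if w.text.lower() in DEICTIC_TERMS:
            s = cursor_at(samples, w.t)
            resolved.append((w.text, (s.x, s.y)))
        else:
            resolved.append((w.text, None))
    return resolved


if __name__ == "__main__":
    samples = [CursorSample(0.0, 10, 10), CursorSample(1.0, 200, 50),
               CursorSample(2.0, 640, 480)]
    words = [Word("put", 0.5), Word("that", 1.1), Word("there", 2.3)]
    print(resolve_deixis(words, samples))
```

In this sketch “that” also resolves to a coordinate; mapping that coordinate to the object displayed at that position would require a further lookup, which is left out here.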

The CUBRICON system (Neal et al., 1989) supports multimodal input and output in the context of map-based tactical mission planning. It enables the user to interact using spoken or typed natural language in combination with pointing gestures generated by mouse input on a graphical display. In the other direction, the system presents information multimodally across three output devices: two displays and speech synthesis. It thus combines generated spoken natural language output with graphical pointing gestures. For example, if the speech output provides information about an object, the icon that represents this object is simultaneously highlighted by blinking. The system uses a semantics-based reference resolution process that handles ambiguous pointing gestures by considering the type or particular properties of the object represented by a selected icon. For example, if more than one icon is close to the coordinate of the pointing gesture, the utterance “What is the status of this <point> airbase?” is only combined with icons of objects of the type airbase.

The XTRA system (Wahlster, 1991) allows the user to combine natural language input with pointing gestures in the context of an expert system that assists the user in filling out a tax form. The goal is to simulate a face-to-face conversation between humans, in which deictic gestures frequently accompany verbal descriptions to identify referents. A focus lies on interpreting pointing gestures of different granularity, ranging from exact pointing with a pencil, via standard pointing with the index finger, to vague pointing with the entire hand. Since each granularity level yields a different number of referential candidates, it is necessary to involve more knowledge in the reference resolution process.
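The two ideas described above, a candidate set whose size depends on the pointing granularity and a type constraint taken from the utterance, can be sketched together in a few lines of Python. This is an illustrative simplification and reproduces neither CUBRICON nor XTRA: the radii in `RADIUS`, the `Icon` class and the functions `candidates` and `resolve` are hypothetical.

```python
from dataclasses import dataclass
from math import hypot

# Hypothetical search radii (in pixels) per pointing granularity.
# The values are illustrative only and not taken from the cited systems.
RADIUS = {"pencil": 5.0, "index_finger": 20.0, "hand": 60.0}


@dataclass(frozen=True)
class Icon:
    name: str
    kind: str  # object type, e.g. "airbase"
    x: float
    y: float


def candidates(icons, point, granularity):
    """Return all icons within the radius associated with the granularity."""
    px, py = point
    r = RADIUS[granularity]
    return [i for i in icons if hypot(i.x - px, i.y - py) <= r]


def resolve(icons, point, granularity, required_kind=None):
    """Narrow the spatial candidates by the type named in the utterance."""
    found = candidates(icons, point, granularity)
    if required_kind is not None:
        found = [i for i in found if i.kind == required_kind]
    return found


if __name__ == "__main__":
    icons = [Icon("A1", "airbase", 100, 100),
             Icon("S1", "sam_site", 110, 100),
             Icon("A2", "airbase", 150, 100)]
    # "What is the status of this <point> airbase?"
    print(resolve(icons, (102, 100), "index_finger", required_kind="airbase"))
```

A vaguer gesture enlarges the candidate set, so more knowledge (here only the object type) is needed to arrive at a single referent.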
The knowledge base comprises the tax form and the form hierarchy, the pointing gestures, a conceptual domain-specific model, the functional-semantic structure of the natural language input, and the dialogue memory. In a multi-step approach, the correct referent is resolved by analysing the pointing gesture, the semantics of the verbal object descriptors, and the appearance of an object in the dialogue memory (Kobsa et al., 1986). Additionally, simultaneous pointing gestures with both hands are supported. Figure 3.1b shows how this can help to prevent ambiguities. Here, the pencil in one hand specifies the focus by pointing to a region of the form, and the index finger of the other hand points to a specific object in the marked region. Although the finger of the second hand points at the same location in both cases, the selected numbers differ depending on the location of the pencil, which is used for focusing (i: {3,4}; ii: {4,5}). Three advantages of pointing gestures are mentioned: they simplify the natural language dialogue by saving the speaker the generation, and the hearer the analysis, of complex referential descriptions; they make reference possible in situations in which linguistic reference is not sufficient; and they allow the speaker to be imprecise or ambiguous, especially if the precise technical term is unknown to the speaker.

3.1 Overview of research in multimodal interaction 37

(a) The “Put That There” system (Bolt, 1980)

(b) Simultaneous pointing gestures in the XTRA system (Wahlster, 1992)

Figure 3.1 – Multimodal interfaces with combined spoken and gestural interaction

The AlFresco system (Stock et al., 1996) is a multimodal system that integrates natural language and hypermedia. It is an interactive system for users interested in frescoes and paintings, and it provides information, images, and videos of fourteenth-century Italian frescoes and monuments. Besides natural language understanding, the system integrates the typing of sentences and navigation in the underlying hypertext via a touchscreen. To support hypertextual exploration, the images and text it outputs carry buttons that offer new entry points for further communication. It was one of the first systems to manage the coherence between dialogue and displayed output. The dialogue manager provides a graphical representation of the discourse, which helps to limit the opacity of the system’s behaviour and thus allows the user to resolve misinterpretations easily. By resolving anaphora and deictic references to displayed images and hypertext buttons, the system allows much more effective access to information than a system restricted to natural language communication.

Figure 3.2 – The collaborative and multimodal pen and voice system QUICKSET (Cohen et al., 1997)

QUICKSET (Cohen et al., 1997) is a multimodal pen and voice system that allows collaborative interaction via a number of distributed devices. It provides a multimodal interface to various applications by integrating components responsible for speech recognition, natural language generation, graphical user interfaces and multimodal integration. The architecture allows distributed devices to be connected and expensive input processing to be outsourced to resource-rich devices. Communication is achieved via WLAN and through a distributed multi-agent architecture. The same interaction capabilities can be enabled for distinct types of supported devices, e.g., handhelds, desktops, and wall-sized terminals. The system allows applications to be realised in diverse scenarios, with a special focus on map-based interaction. A core functionality is the unimodal and multimodal integration of spoken language and pen input. Whereas speech is the main modality for initiating interactions with the system, pen input can provide valuable additional information. The pen input is interpreted by a gesture recognition agent that uses neural networks and hidden Markov models and is able to recognise 68 pen gestures, including various military map symbols (platoon, mortar, fortified line, etc.), editing gestures (deletion, grouping), route indications, area indications, taps, etc. Furthermore, handwritten text can be interpreted.

One example application presented (Figure 3.2) is a military strategy simulator. The user can combine pointing gestures with utterances to create new objects, such as tanks, at a specific point on the map. Barbed-wire fences or fortified lines can be added by drawing lines at the desired locations. The type of an object is specified either unimodally, by drawing the appropriate military symbol, or multimodally, by saying the label.

The author specifies some architectural requirements for future multimodal human-computer interfaces that mark a further evolutionary step and add value over the multimodal systems of the first phase. The first is a flexible asynchronous architecture that allows multiprocessing, with recognisers and interpreters running in parallel. Their result should be a set of time-stamped meaning fragments for each input, described in a common representation. With the support of a time-sensitive grouping process and a fusion concept, it should then be possible to semantically combine meaning fragments from each modality stream into a joint interpretation.
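A time-sensitive grouping of time-stamped meaning fragments could look roughly like the following Python sketch. It is not the grouping process of any cited system: the `Fragment` class, the function `group_fragments` and the one-second window `max_gap` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    modality: str  # e.g. "speech" or "pen"
    start: float   # start time in seconds
    end: float     # end time in seconds
    meaning: dict  # partial meaning in a common representation


def group_fragments(fragments, max_gap=1.0):
    """Group fragments that overlap or follow each other within max_gap seconds.

    `max_gap` is a hypothetical threshold; a real system would tune it
    empirically. Returns a list of groups, each a list of fragments.
    """
    groups = []
    for frag in sorted(fragments, key=lambda f: f.start):
        if groups and frag.start - groups[-1]["end"] <= max_gap:
            # Close enough in time: extend the current group.
            groups[-1]["items"].append(frag)
            groups[-1]["end"] = max(groups[-1]["end"], frag.end)
        else:
            # Too far apart: start a new group.
            groups.append({"end": frag.end, "items": [frag]})
    return [g["items"] for g in groups]


if __name__ == "__main__":
    frags = [
        Fragment("speech", 0.0, 1.2, {"action": "create", "kind": "tank"}),
        Fragment("pen", 1.5, 1.6, {"location": (12, 34)}),
        Fragment("speech", 5.0, 6.0, {"action": "delete"}),
    ]
    for group in group_fragments(frags):
        print([f.modality for f in group])
```

Each resulting group is a candidate for fusion into one joint interpretation, while fragments that are far apart in time are treated as separate inputs.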

The QUICKSET system fulfils these requirements with a flexible asynchronous framework and employs continuous speech and continuous gesture recognisers running in parallel. Meaning is represented with typed feature structures, which are very well suited to multimodal integration because they allow the sharing of structures and the representation of partial meaning. By applying typed feature structure unification to the input, it is possible to combine complementary and redundant information, whereas contradictory information can be recognised and rejected.
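The behaviour of unification with complementary, redundant and contradictory information can be shown with a simplified Python sketch. It uses plain nested dictionaries instead of typed feature structures, so the type hierarchy and structure sharing of the real formalism are not modelled; `unify` and `UnificationError` are hypothetical names and not part of QUICKSET.

```python
class UnificationError(Exception):
    """Raised when two structures carry contradictory values."""


def unify(a, b):
    """Unify two feature structures given as nested dicts with atomic leaves.

    Complementary features are merged, identical (redundant) values are
    kept once, and conflicting values raise UnificationError.
    The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationError(f"conflict: {a!r} vs {b!r}")


if __name__ == "__main__":
    # Partial meaning from speech ("create a tank") and from a pen tap.
    speech = {"type": "create", "object": {"kind": "tank"}}
    pen = {"object": {"location": (12, 34)}}
    print(unify(speech, pen))

    # Contradictory information is detected and rejected.
    try:
        unify({"object": {"kind": "tank"}}, {"object": {"kind": "mortar"}})
    except UnificationError as err:
        print("rejected:", err)
```

Because each modality contributes only a partial structure, neither input has to be complete on its own; the joint interpretation emerges from the merge.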

Further early work on the combination of spoken and gestural interaction is presented by Siroux et al. (1995), Cohen et al. (1997) and Oviatt (1996).
