As previously outlined in Section 4.2, speech contains temporal redundancy which makes it slow for navigating large amounts of content. Although these arguments have led to the development of non-speech methods, there are scenarios in which the occurrence of unfamiliar information, the number of possible options, or the familiarity of the user with the interface make a purely symbolic representation unsuitable. An interesting interface design involves displaying several concurrent speech signals from which the user is able to selectively attend to a desired piece of content. This model allows the desired content, whatever it may be, to be displayed to the user with minimal delay and without requiring the learning of non-speech cues. In a serial presentation paradigm, the target speech is presented in isolation with minimal competing noise. By contrast, in the parallel display, the target phrase has to compete with other speech signals.

Much of the work on concurrent speech has been concerned with the design of communications systems for time-critical, multi-talker scenarios. This work has contributed greatly to understanding the important factors in the intelligibility of concurrent speech streams (see Section 3.3.3). Television use cases, however, introduce a different set of requirements and challenges regarding acceptable workload levels, interaction, and use context. Although such displays are not available in commercial products, some researchers have investigated and developed auditory displays for computer access that rely on concurrent speech streams.

AudioStreamer and Audio Hallway

Schmandt & Mullins (1995) proposed AudioStreamer, one of the first auditory displays that attempted to utilise people’s ability to selectively attend to a desired stream of speech in the presence of other talkers. The system presented three concurrent speech streams, which were binaurally spatialised to 0° and ±60° in order to make use of the cocktail party effect. It is notable that the authors chose not to separate the items maximally (i.e., 0° and ±90°). These positions were chosen “to be large enough to allow easy perceptual segregation of the sources, but still limit the time it takes to switch from one to the other, which is proportional to angle” (Schmandt & Mullins, 1995, p. 218). Mullins (1996) indicates that this decision was based on Rhodes’s (1987) findings of increased reaction times for increased angular separation in non-speech localisation tasks. More recent experiments, however, have found no significant differences from increased switching angles, though as the angular separation from an attended location increases, response times increase (Mondor & Zatorre, 1995). In addition to spatial separation, the display also used different talkers for each of the streams so as to exploit acoustic variations and reduce informational and energetic masking between concurrent talkers.

Perhaps the most interesting feature of the AudioStreamer display was the attempt to adapt to a user’s interest in one of the presented streams by analysing head movements. If the user turned their head towards a particular source, its level was temporarily increased, then exponentially decreased over time to return to the original level (Schmandt & Mullins, 1995). If the user wished to isolate one stream they could repeatedly look toward the virtual source, in which case the other streams would be silenced. The system was also designed to momentarily draw attention to key points in other streams so as to avoid important sections being missed.
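The head-movement behaviour described above (a temporary level increase followed by an exponential return to the original level) can be sketched as follows. This is only an illustration: the function name, the 10 dB boost and the 5 s time constant are assumptions made for the example and are not values reported by Schmandt & Mullins (1995).

```python
import math


def stream_gain_db(t_since_turn_s, base_db=0.0, boost_db=10.0, time_constant_s=5.0):
    """Level of a stream, in dB, t seconds after the listener turned towards it.

    All parameter values are hypothetical; the source does not report them.
    """
    if t_since_turn_s < 0:
        # The listener has not yet turned towards this stream.
        return base_db
    # The boost is applied at the moment of the head turn and decays
    # exponentially back towards the original (base) level.
    return base_db + boost_db * math.exp(-t_since_turn_s / time_constant_s)


if __name__ == "__main__":
    for t in (0, 1, 5, 20):
        print(f"t = {t:>2} s -> {stream_gain_db(t):.2f} dB")
```

Under this sketch, repeated head turns towards the same source would restart the decay each time, which is one plausible way to model a listener holding a stream in focus.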

Mullins’ master’s dissertation went into further detail about the development of the AudioStreamer display (Mullins, 1996). In it, Mullins states that participants were overwhelmed by three channels of simultaneous speech and therefore introduced five-second onset asynchronies between each stream. Such a large onset asynchrony vastly exceeds the length of a word and therefore is not comparable to the studies reviewed in the discussion of onset asynchrony in Section 3.3.3. Unfortunately, despite the development, no formalised experimentation was presented in either work, making it difficult to assess how effective the display was, either in terms of communicating the information or of the user experience it provided.

Schmandt (1998) proposed a second auditory display exploiting concurrent speech presentation called Audio Hallway. Similar to AudioStreamer, it was intended to allow the browsing of large collections of audio files. The Audio Hallway display provided the user with two levels of navigation: a ‘high-level’ navigation of groups of clustered content and a ‘low-level’ navigation of the individual audio files within a selected cluster. The high-level navigation was facilitated using the metaphor of a hallway in which doors were situated on either side leading to rooms filled with clustered content. The users travelled down the hallway using a head movement either forward or back and entered a door by tilting their head to the corresponding side. The doors were denoted by presenting all of the grouped items concurrently and automating the gain of each item so that individual items took it in turns to be the most prominent, a method which was termed braided audio. The hallway was rendered binaurally, with three clusters audible simultaneously, such that the closest door was heard on one side with the next and previous doors on the other side perceived as in front or behind the listener respectively. Azimuthal distance between concurrent clusters was increased by creating a model where the hallway increased in width the further away it was from the listener’s position (see Figure 4.2). The cluster closest to the listener was presented more loudly than the other two sources to make it more prominent and therefore more easily attended. Despite these modifications users reportedly struggled with this display, which was interpreted by the author as an indication that combining multiple spatially separated sounds with listener position movement was not appropriate for auditory displays.

Figure 4.2: Visual representation of the ‘Audio Hallway’ display. Adapted from Schmandt (1998, p. 167).
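The braided audio idea described above, in which concurrently playing items take turns being the most prominent, can be illustrated with a simple round-robin gain schedule. This is a minimal sketch under stated assumptions: the function name, the 2 s turn length and the two gain values are hypothetical, and the hard switch between items is a simplification (the source does not describe how Schmandt (1998) shaped the gain transitions).

```python
def braided_gains(n_items, t_s, turn_s=2.0, prominent=1.0, background=0.3):
    """Linear gain for each of n_items concurrently playing items at time t_s.

    Exactly one item is prominent at any time; the others stay audible at a
    lower background gain. All numeric defaults are hypothetical.
    """
    if n_items <= 0:
        return []
    # Work out whose turn it is: items lead in order, wrapping around.
    lead = int(t_s // turn_s) % n_items
    return [prominent if i == lead else background for i in range(n_items)]


if __name__ == "__main__":
    for t in (0.0, 2.0, 4.0, 6.5):
        print(t, braided_gains(3, t))
```

A practical implementation would probably crossfade between turns rather than switch abruptly, but the turn-taking structure would be the same.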

The low-level navigation in Audio Hallway was provided once the user had entered one of the rooms. Up to twenty items were presented on the azimuthal plane in the frontal hemisphere with up to four active at any one time. The horizontal location of the sources was distorted according to the orientation of the listener’s head so that the spatial separation between the items was exaggerated. To emphasise this effect, the gain of items in front of the listener was higher than those towards the sides. Although no formal user testing was described, Schmandt (1998) reported that users struggled less with the navigation and attributed this to the locations of the auditory items being more easily associated with the orientation of the listener’s head.

Figure 4.3: Visual representation of the ‘Dynamic Soundscape’ display. Adapted from Kobayashi & Schmandt (1997, p. 167).

Kobayashi and Schmandt’s Dynamic Soundscape

Kobayashi & Schmandt (1997) continued to research the use of multiple concurrent streams for browsing speech audio with the Dynamic Soundscape system. The system positioned virtual sources, termed ‘speakers’, around the head on the azimuthal plane using binaural techniques. This distributed the content around the user’s head, which each of the speakers would play as they orbited the user (see Figure 4.3). The user controlled the interface with a touch pad, knob or through pointing gestures, which they used to activate a maximum of four speakers at any time. A head tracker system was also used and, like AudioStreamer (Mullins, 1996), head movements were analysed to assess user attention to specific speakers and alter their relative prominence through level manipulations. A continuously playing audio cursor served to indicate the position of the user control (hand, or point on touchpad), which was found to be especially useful for users for whom the binaural rendering was less effective.

Figure 4.4: Visual representation of the virtual dial display. Adapted from Frauenberger & Stockman (2006, p. 143).

The display highlighted issues with the loss of spatial resolution in memory, which meant users struggled to remember the precise locations of specific content (Kobayashi & Schmandt, 1997). This issue may have been exacerbated by the movement of multiple speakers and the concept of spatially distributed continuous information, which would be expected to impair spatial acuity compared with discrete points in a static interface. Unfortunately, though some user testing of the interface was performed, no comparisons were made between this and a traditional serial display with ‘fast-forward’ or ‘rewind’ functionality, making it hard to assess how beneficial this design would be.

Frauenberger and Stockman’s virtual dial

Frauenberger & Stockman (2006) proposed a design using concurrent speech to navigate auditory menus using the idea of a virtual horizontal dial with items located around its perimeter. The display used a virtual room with the centre of the dial positioned outside so that a maximum of three items from the menu would be inside the room, and therefore audible, at any one time (Frauenberger, 2013). Two additional ‘preview sources’ were also audible if the selected item was a sub-menu (Frauenberger & Stockman, 2006) (see Figure 4.4). The user navigated the menu by rotating the virtual ring using a game pad dial until the desired item was directly in front. The display made use of different voice identities and talking styles (i.e., voiced or whispered) to reduce between-stream confusions.

The system was experimentally evaluated against a traditional screen-reader interface. Results indicated that performance was initially faster with the prototype interface, but performance significantly improved with the traditional screen-reader in the second trial. This, combined with a slight increase in task completion time with the prototype, led to the traditional interface becoming faster. Frauenberger & Stockman (2006) suggested that this phenomenon may have been due to the fatiguing effect of the constantly repeating audio, as participants commented on this being exhausting.

From a navigational speed perspective, the system’s design seems unlikely to have been optimal. The prototype interface included some redundancy in the display as, following the initial display of three items, each subsequent display contained only one new item. This effectively reduced the display to a serial presentation with a reduced SNR. Furthermore, selection required the user to position the target item at the central location. This would have meant that a target detected at a lateral location would have had to be repositioned before it could be selected. While this reduced the amount of hardware required to make selections, it would have taken participants longer than if they had been able to select any of the audible options.

Clique

Parente (2008) proposed an auditory display system using concurrent speech for computer-based GUI tasks, which was named Clique. The system was designed to use a collection of views, separating required information into different levels of relevance and similarity. Information that was most likely to be of interest to the current task was presented as part of the primary view. The preview provided information on items or tasks that were part of, or could become part of, the target task (e.g., a summary of the length of an email). Reminders of context (i.e., what application was in use) formed a further view, referred to as the overview. They could be accessed by the user at any point and therefore took the form of a background ambiance. A peripheral view was included to provide notifications regarding tasks completed by other tasks, such as the arrival of an email. A final view enabled the user to repeat the output of the other streams to compensate for the issues with memory; this was named the review.

One of Clique’s key features was that it was designed to separate the user experience from the underlying GUI, allowing consistent patterns of interaction to be deployed over different applications. The system received commands through keyboard shortcuts and allowed functions such as searching to be conducted over all applications. The interface presented concurrent streams of information provided by ‘virtual assistants’, each tasked with providing specific information. The content and the narrator formed the primary view, with the content providing information such as the text in an email and the narrator producing sounds to echo user inputs. The summary acted as part of the preview, providing information on the number of emails in the inbox or the amount of time it would take to read a presentation. The related assistant provided some information for the preview as well as some information for the peripheral, on any state changes within the current task. The unrelated assistant acted as part of the peripheral view, providing information on other tasks or subtask state changes, such as an email arrival. The environment provided the context view in the form of atmospheric sounds, which were presented to the listener without spatialisation.

Assistants were positioned at distinct points in space on the azimuthal plane using 3D audio. Different voices were assigned to the assistants to improve the user’s ability to distinguish concurrently presented content and a 200 ms onset asynchrony was included to assist with stream segregation. A mathematical proof was provided which demonstrated that concurrent presentation with an onset asynchrony would allow faster access to information than a serial interface does. In addition to the spoken content, the system used a combination of earcons, auditory icons and speech, depending on the nature of the content being expressed. States and actions were generally expressed using earcons, whilst auditory icons were reserved for identifying the type of subtask (i.e., list, table etc.).
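The intuition behind the claim that staggered concurrent presentation gives faster access than serial presentation can be illustrated with simple playback arithmetic. The sketch below is not Parente’s proof, which is not reproduced in this review; the function names and the 600 ms item duration are assumptions made for the example, and only the 200 ms onset asynchrony comes from the text above. It also measures playback time only and says nothing about how much of the overlapping content a listener can actually take in.

```python
def serial_duration_ms(durations_ms):
    """Total playback time when items are spoken one after another."""
    return sum(durations_ms)


def staggered_duration_ms(durations_ms, onset_asynchrony_ms):
    """Total playback time when item i starts i * onset_asynchrony_ms after
    the first item and the items are allowed to overlap."""
    # The display ends when the last-finishing item ends.
    return max(
        (i * onset_asynchrony_ms + d for i, d in enumerate(durations_ms)),
        default=0,
    )


if __name__ == "__main__":
    items = [600, 600, 600, 600]  # hypothetical item durations in ms
    print(serial_duration_ms(items))          # 2400
    print(staggered_duration_ms(items, 200))  # 1200
```

In this model the staggered display never takes longer than the serial one as long as the onset asynchrony is no longer than the shortest item, since each item then starts no later than it would have in the serial case.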

The interface was assessed experimentally with both visually-impaired and normal-sighted participants in two separate trials. The assessment with visually-impaired participants comprised several distinct tasks. It explored participants’ ability to recall information from a target stream and unattended streams with different numbers of maskers. Performance was compared with a commercial screen reader (JAWS) in terms of finding specified target items, learnability, and multi-tasking performance.

In the comparisons of performance with the prototype display with different numbers of competing streams, the results showed that the participants had significantly higher success rates for the target speech over the secondary streams. For the target information, the only significant difference was between the two and three concurrent streams, while the non-target results showed a significant difference between all the numbers of streams. When the interface was compared with JAWS, performance at finding specified target items appeared to depend on the capabilities of the system being represented, with Clique allowing for significantly more successful selections only where there was no search capability provided in the JAWS version of the interface. The learnability assessment results indicated that Clique led to more correct descriptions of how to complete tasks after a short training phase than JAWS did, which was attributed to the fewer commands required by the user to perform tasks. A multitasking assessment found that participants completed significantly more tasks with JAWS than Clique. Parente attributed this result to the marking system used, whereby the partial completions were not included. Assessment by normal-sighted participants found that if users started tasks using Clique they needed less time to complete them with the GUI later. The total time spent interacting with the interface was, however, substantially greater. It was also found that users preferred a simplified version of Clique which imposed a reduced workload.

Despite the apparent success of the Clique system in outscoring the JAWS interface in the majority of the experiments presented, it is unclear whether the final system is truly optimal. Though the use of asynchronous onsets in conjunction with gender and pitch differences is undoubtedly advantageous, justification is not provided for the final combination of parameters, which makes it hard to apply the findings to the design of future systems. Furthermore, interpretation of task durations was complicated by the different interaction capabilities of the different displays (i.e., availability of search functions) and therefore it is difficult to determine how much advantage was provided through the use of concurrent speech.

VoiceScapes

Werner et al. (2015) proposed a menu display using concurrent streams of speech in which between three and seven talkers were presented concurrently. Each source was presented from a unique spatial location and each was associated with different voices, which were arranged so that adjacent talkers were of different sexes. Interestingly, spoken items within the menus were looped. As words were of different durations, this meant the relative phase of the items changed during the presentation. A pilot study compared serial presentation of normal and of compressed versions with two different concurrent presentations. Although it is not entirely clear in the original paper, it is believed that one of the displays increased the number of talkers present (presenting the first three items before adding two more and then a further two), allowing the users to hear the display with fewer active talkers before more were added. The other concurrent presentation did the opposite, starting with seven concurrent items and finishing with three. Results indicated that normal speed serial speech presentation was faster and easier to use than the concurrent speech approaches.

vCocktail

The vCocktail system by Ikei et al. (2006) made use of onset asynchrony between overlapping successive spoken menu items, which they referred to as ‘multiplexed speech’. The authors conducted a series of experiments to determine the optimal configuration of the display. The first experiment investigated the localisation accuracy achieved using the system. This entailed randomly selecting one of the 40 words and 36 directions, presenting it to the participant over headphones and waiting for a response indicating its location on the user-controlled GUI. The results showed front-back confusion to be a considerable issue. After excluding these errors, however, practically all of the localisation results exhibited a mean error of ≈ 20°. A second experiment presented a display of between two and four concurrent words and then tasked the user with identifying a word that had been present from a list of on-screen words. The onset asynchrony was systematically varied between 0 and 500 ms in 100 ms steps. The presentation was either diotic or spatialised in the frontal hemisphere with equally spaced angular intervals (180°, 90° or 60°) for two, three and four item presentations. The results showed that performance was improved by the introduction of onset asynchrony and by the spatialisation. No further significant gain was observed by increasing the onset asynchrony past 300 ms, which was approximately half the word duration. It was found that the spatialisation gave most advantage over diotic presentation when the onset asynchrony was low.
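The spatial layout and timing described above can be made concrete with a short sketch. It assumes the sources span the frontal hemisphere from −90° (left) to +90° (right), which is what equal spacings of 180°, 90° and 60° imply for two, three and four items; the sign convention, the function names and the 600 ms word length are assumptions for illustration rather than details taken from Ikei et al. (2006). The second function counts how many words can sound at once for a given onset asynchrony, which is one way to see why half the word duration is a natural threshold.

```python
import math


def frontal_azimuths_deg(n_sources):
    """Equally spaced azimuths across the frontal hemisphere.

    Negative angles are to the left; this sign convention is an assumption.
    """
    if n_sources < 1:
        raise ValueError("n_sources must be at least 1")
    if n_sources == 1:
        return [0.0]
    step = 180.0 / (n_sources - 1)
    return [-90.0 + i * step for i in range(n_sources)]


def max_overlap(n_words, word_ms, asynchrony_ms):
    """Largest number of words sounding simultaneously when word i starts
    at i * asynchrony_ms and every word lasts word_ms."""
    if asynchrony_ms <= 0:
        return n_words  # all words start together
    # A word occupies [start, start + word_ms), so a word ending exactly as
    # another begins is not counted as overlapping it.
    return min(n_words, math.ceil(word_ms / asynchrony_ms))


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, frontal_azimuths_deg(n))
    for asynchrony in (0, 100, 200, 300, 400, 500):
        print(asynchrony, max_overlap(3, 600, asynchrony))
```

With a hypothetical 600 ms word, an asynchrony of 300 ms or more means at most two words are ever heard together, whereas shorter asynchronies allow three to overlap.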

In the final experiment, the authors spatialised speech sources using onset intervals which ranged between zero and 500 ms. These were assessed using between two and four sources, different source orderings (i.e., from left to right or alternating between sources from either hemisphere) and either with or without a linear increase in attenuation applied over the course of each word. It was found that with three or more voices, high accuracy (≥ 99.7%) could be achieved with onset delays of 200 ms. This increased to 300 or 400 ms if adjacent sources were used and no attenuation applied in the three- or four-source conditions, respectively. As noted by Ikei et al. (2006), the optimal asynchrony without attenuation (300 ms) is about half of the duration of the stimuli (530 - 600 ms). This is important when considering the playback of more than two voices, because when the onset asynchrony is less than half the duration of the stimuli, all three items overlap, but when the onset asynchrony is half the
