5. RESEARCH REPORT
5.1. Research Programmes
My design decision for SpEx to display exactly fifteen segments (discussed in Section 4.1.2) affected not only how much information was displayed to users but also the segmentation performance of TAFE. If I displayed too many segments in SpEx, the interface would become cluttered with Word Clouds; if I displayed too few, the interface would present too few Word Clouds to give a descriptive structure of an audio recording and support my task taxonomy.
As a result of enforcing a strict number of segments, TAFE must segment audio into fifteen pieces regardless of how many (or how few) segments an audio recording may be expected to have. Therefore, TAFE must generate segment boundaries where they may not be appropriate. I evaluated the segments TAFE produced against a baseline segmentation algorithm which segmented audio into fifteen uniformly sized segments. Segments derived from TAFE and the baseline segmentation algorithm were compared against the official topic locations provided with the online university lectures. The official topic locations were manually generated. My dataset consisted of the same twenty lectures described in Section 5.4.1. The number of official topics per lecture ranged from three to nine (inclusive), with a mean of 5.35.
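The uniform baseline can be illustrated with a minimal sketch. This is not the code used in the evaluation: the function name `uniform_segment_starts` is hypothetical, and representing each segment by its start time in seconds is an assumption made for illustration only.

```python
from typing import List


def uniform_segment_starts(duration_s: float, n_segments: int = 15) -> List[float]:
    """Return the start time (in seconds) of each of n equally long segments.

    Assumption: a segment is identified by its start time, so a recording
    split into fifteen segments yields fifteen start times, the first at 0.
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    segment_length = duration_s / n_segments
    return [i * segment_length for i in range(n_segments)]


# Example: a hypothetical 50-minute (3000 s) lecture split into fifteen segments.
starts = uniform_segment_starts(3000.0)
```

Because the baseline ignores the content of the audio entirely, any advantage TAFE shows over it can be attributed to the acoustic and text features rather than to the fixed segment count.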
Following the evaluation procedures for lecture segmentation in the literature [58, 75, 74], I used Precision, Recall, and F-Measure to evaluate the accuracy of the segments produced. I scored both TAFE and the baseline algorithm against the official topic locations provided by the respective universities. I considered a segment boundary to be correct if it was within thirty seconds (+/- thirty seconds) of an official topic boundary. The results are displayed in Table 5.4.
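The scoring rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code itself: the function name `boundary_scores` is hypothetical, boundaries are given as times in seconds, and each official boundary is matched to at most one predicted boundary (the text does not specify how multiple matches were handled). F-Measure is taken to be the harmonic mean of Precision and Recall, and is left undefined when both are zero.

```python
from typing import Optional, Sequence, Tuple


def boundary_scores(
    predicted: Sequence[float],
    reference: Sequence[float],
    tolerance_s: float = 30.0,
) -> Tuple[float, float, Optional[float]]:
    """Score predicted boundaries against reference boundaries.

    A predicted boundary counts as correct if it lies within tolerance_s
    seconds of a reference boundary. Assumption: matching is one-to-one,
    pairing each reference boundary with its nearest unused prediction.
    """
    unused = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        candidates = [p for p in unused if abs(p - ref) <= tolerance_s]
        if candidates:
            nearest = min(candidates, key=lambda p: abs(p - ref))
            unused.remove(nearest)  # each prediction may be matched only once
            hits += 1

    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, None  # F-Measure undefined
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: fifteen predicted boundaries, five official ones,
# two of which fall within thirty seconds of a prediction.
predicted = [i * 200.0 for i in range(15)]
official = [210.0, 590.0, 1100.0, 1700.0, 2300.0]
precision, recall, f_measure = boundary_scores(predicted, official)
```

With fifteen predictions and only a handful of official topics, precision is bounded well below one even for a perfect segmenter, which is worth keeping in mind when reading the results.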
Both TAFE and the baseline clearly produced poor precision (0.074 and 0.050 respectively). Most segments were not close to official topic locations because both TAFE and the baseline produced exactly fifteen segments, far more than the roughly five official topics present on average. However, the mean precision and recall produced by TAFE were greater than those produced by the baseline, which indicates that clustering acoustic and text features can help identify distinct regions in the audio. Moreover, the results need not be very accurate, because SpEx visually depicted important topics for users to find in the form of Word Clouds and offered interaction mechanisms to identify exactly where key topics occurred in the audio. The segments produced by TAFE merely broke audio into easily consumable segments which, when visualised by SpEx, served to provide a high-level structure of the audio to assist navigation.

                  TAFE                              Baseline
Task              Precision  Recall  F-Measure     Precision  Recall  F-Measure
ASTR160 Lec. 1    0.133      0.4     0.2           0.133      0.4     0.2
ASTR160 Lec. 2    0.067      0.333   0.111         0          0       -
ASTR160 Lec. 3    0.133      0.5     0.211         0.067      0.25    0.105
ASTR160 Lec. 4    0.067      0.333   0.111         0.133      0.667   0.222
ASTR160 Lec. 5    0.2        0.75    0.316         0.067      0.25    0.105
CLCV205 Lec. 2    0.067      0.25    0.105         0.067      0.25    0.105
CLCV205 Lec. 3    0.133      0.667   0.222         0          0       -
CLCV205 Lec. 4    0          0       -             0.133      0.667   0.222
CLCV205 Lec. 5    0.067      0.2     0.1           0.133      0.4     0.2
CLCV205 Lec. 6    0.067      0.25    0.105         0          0       -
CS50 Lec. 1       0.067      0.5     0.118         0          0       -
CS50 Lec. 2       0.067      0.25    0.105         0          0       -
CS50 Lec. 3       0.067      0.333   0.111         0          0       -
CS50 Lec. 4       0.067      0.5     0.118         0.067      0.5     0.118
CS50 Lec. 5       0.067      0.2     0.1           0.067      0.2     0.1
PSYC123 Lec. 1    0          0       -             0          0       -
PSYC123 Lec. 2    0.067      0.143   0.091         0.067      0.143   0.091
PSYC123 Lec. 3    0.067      0.2     0.1           0.067      0.2     0.1
PSYC123 Lec. 4    0.067      0.25    0.105         0          0       -
PSYC123 Lec. 5    0          0       -             0          0       -
Mean              0.074      0.303   0.116         0.050      0.196   0.078
Std. Dev          0.048      0.206   0.076         0.052      0.227   0.083

Table 5.4: TAFE and baseline segmentation accuracy compared to official topic locations. Values rounded to three decimal places.
The effectiveness of SpEx and, in part, the effectiveness of the segments TAFE produced were evaluated in a user study, which is described in the following two chapters.
Chapter 6
User Study
I undertook a user study to understand how well users were able to use SpEx to navigate audio and the strategies they employed during the process. My user study was designed to analyse user performance in the context of the primary personas, Jack and Amy.
To replicate the conditions of university education, lecture and presentation audio were used and undergraduate university students made up the majority of participants. In total, twenty participants took part in my user study. I asked each participant to perform a predetermined series of tasks for each audio recording. My tasks were designed to characterise the tasks in my task taxonomy and the scenarios of Jack and Amy.
I recorded user actions to produce a set of quantitative data for statistical analysis of usage patterns, while user opinions were obtained to gain insight into user thoughts and perceptions. The data from the user study will be analysed to verify or disprove the experimental hypotheses and provide answers to open questions about SpEx.
6.1 Type of User Study
There were two categories of user study I could create: a laboratory study, where participants are given artificial tasks, and a field study, where SpEx is deployed for a real university course. I opted for a laboratory study. While analysing the genuine usage of SpEx in the field to fulfil real goals is useful, a laboratory study offers a more controlled environment. A controlled environment would allow me to tailor the tasks users performed so that they corresponded directly to my task taxonomy, and I could additionally gather observational data for each participant. It also meant that I would not expose SpEx to untested environments where I could not guarantee a quality of experience at its prototype stage. Field studies are known to discover usage scenarios and behaviours not found in laboratory studies [44], so I leave a field study for future work.
Additionally, I did not design a comparative study, because I believed existing audio retrieval interfaces have few features comparable to those of SpEx. Further, no user study of an audio retrieval system has previously used my task taxonomy to structure its tasks, making comparison difficult. Analysing SpEx alone still allows me to gain insight into key usability issues that may hinder its use.