Evaluation of visual presentations, visualisation systems and, more generally, of computer-based interfaces is a key component in ensuring their quality and success. For example, poor system usability may lead to low user effectiveness, increased errors in completing tasks, and consequently low adoption rates. The foci of evaluation may relate to various development stages, such as evaluation of a prototype with respect to state-of-the-art techniques, or deployment-level evaluation in order to assess system effectiveness and usage as part of the users’ real-world workflow. In addition, they may relate to the visualisation itself, or to assessment of a more holistic view of the user experience (Lam et al., 2011). There is a rich flora of evaluation methodologies available, varying in complexity and typically involving representative users; the choice of methodology and its settings largely depend on the evaluation goals and the underlying application context. Popular techniques include informal usability testing, formal studies and controlled experiments, longitudinal studies and large-scale log-based usability testing (Hearst, 2009, Ch. 2).
Informal usability testing is common during early stages of development and includes iterative stages during which a usually small number of target users are given successive prototypes with the goal of identifying major problems or users’ preferences, or of testing candidate system features and designs. Evaluation and revision based on user feedback in a cyclical fashion is typical until required characteristics are attained, and low-fidelity designs are transformed into high-fidelity ones. Early stages of design may also include heuristic evaluations (Mack and Nielsen, 1995; Nielsen, 1992; Tory and Möller, 2005; Zuk et al., 2006), where a set of predefined guidelines or heuristics forms the basis for evaluation, or field studies, which focus on observing and documenting usage or completion of evaluator-defined tasks as part of the users’ everyday workflow, rather than being laboratory-based, and thus emphasise the element of realism. Additionally, observational studies are often combined with interviews and (self-reporting) questionnaires.
Formal studies are typically artificially constrained in order to focus on key points of interest, are commonly conducted in a laboratory, and involve a large number of users. They are usually preceded by pilot testing to check the experimental design and/or by user training to increase system and experiment familiarity. Controlled experiments (Blandford et al., 2008; Keppel et al., 1992; Kohavi et al., 2007, 2009), a form of formal testing, are used to test hypotheses, with the focus on quantitative analyses (Blandford et al., 2008). They are commonly used for rigorously comparing and benchmarking novel techniques against existing state-of-the-art counterparts, otherwise known as head-to-head comparisons, as participants can perform identical tasks across different systems (Lam et al., 2011), and the tasks tend to be simple and specific. Objective evaluation measures include overall task completion time, errors made, number of keystrokes, and number of correct answers per specific time interval; alternatively, experts may be involved to evaluate user results. The experimental design may be between or within subjects, and special care should be taken to minimise confounding variables, that is, variables that unintentionally vary between experimental conditions and can affect the results; for example, comparing user speed using two different systems on different hardware.
17 Examples of SNA measures include those representing the betweenness centrality of a node, which refers to how frequently it appears on the shortest path between other nodes, thus having control over the network flow (Freeman, 1979).
A common problem in formal studies is the order in which users are assigned to experimental conditions. Order effects can influence the users and bias the results. Popular techniques used to counterbalance these effects are the ‘blocked design’ and the ‘Latin-squares design’, which ensure a systematic approach to variation. For example, with two experimental conditions, C1 and C2, we can create 2! = 1 × 2 = 2 different orderings, ‘C1 C2’ and ‘C2 C1’, and randomly assign participants to each one of them. Last but not least, most such studies are followed by questionnaire-based assessments to solicit user opinions and ratings, with the use of five- or seven-point Likert scales (Likert, 1932) being quite common.
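The counterbalancing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`all_orderings`, `latin_square`, `assign`), the fixed random seed and the participant identifiers are hypothetical and do not come from the studies cited. The sketch enumerates every ordering of the conditions (full counterbalancing), builds a simple cyclic Latin square as an alternative when the number of conditions makes full enumeration impractical, and randomly distributes participants across the orderings.

```python
import itertools
import random


def all_orderings(conditions):
    """Full counterbalancing: every possible presentation order (n! of them)."""
    return list(itertools.permutations(conditions))


def latin_square(conditions):
    """Cyclic Latin square: each condition occurs exactly once in every
    row (ordering) and in every column (serial position)."""
    n = len(conditions)
    return [tuple(conditions[(row + col) % n] for col in range(n))
            for row in range(n)]


def assign(participants, orderings, seed=0):
    """Shuffle participants, then deal them across the orderings in turn,
    so that group sizes differ by at most one. The seed is an assumption
    made here for reproducibility."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: orderings[i % len(orderings)] for i, p in enumerate(shuffled)}


# Two conditions give 2! = 2 orderings, as in the example in the text.
orderings = all_orderings(["C1", "C2"])
print(orderings)

# Hypothetical participant identifiers P1..P6.
print(assign([f"P{i}" for i in range(1, 7)], orderings))
```

Note that a cyclic square of this kind balances the serial position of each condition but not which condition immediately precedes which, so it may not remove carry-over effects.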
Longitudinal studies are useful for revealing long-term usage and application patterns in everyday environments; observations of dozens of users over months or years contribute towards the reliability, validity, and generalisability of the results (Shneiderman and Plaisant, 2006). This type of study differs from the previous ones in that it goes beyond first-time user experience and examines participant behaviour as system familiarity increases. Shneiderman and Plaisant (2006) propose assessment of information visualisation tools through observation, interviews, surveys, automated logging of user activities and component frequency usage, difficulty in learning a tool and system-adoption rates, as well as success in achieving one’s goals.
Large-scale log-based testing is another form of evaluation that is typically adopted in Web-based systems, whose application context has the advantage of allowing for a large number of users. New features and designs can be tested by recording user behaviour and comparing it to that observed with other versions, and these experiments can be followed by laboratory studies. In contrast to formal studies, users are not required to undertake specific tasks; they are not explicitly asked to opt in to the study, nor is feedback explicitly elicited (Hearst, 2009).
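As a rough illustration of comparing recorded behaviour across two versions of a system, the following sketch aggregates log records per version and computes a pooled two-proportion z statistic. Everything specific here is an assumption made for the example: the record layout `(variant, succeeded)`, the function names and the toy log are hypothetical, and the z statistic is just one common choice of comparison, not a procedure prescribed by the sources cited above.

```python
import math
from collections import defaultdict


def success_rates(log):
    """Aggregate (variant, succeeded) records into
    {variant: (successes, sessions)}."""
    totals = defaultdict(lambda: [0, 0])
    for variant, succeeded in log:
        totals[variant][0] += int(succeeded)
        totals[variant][1] += 1
    return {variant: tuple(counts) for variant, counts in totals.items()}


def two_proportion_z(s1, n1, s2, n2):
    """z statistic for the difference between two success proportions,
    using the pooled estimate of the standard error. Undefined (division
    by zero) when every session succeeds or every session fails."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (s1 / n1 - s2 / n2) / se


# Toy log, invented purely for illustration: one record per session.
log = [("A", True), ("A", False), ("A", False), ("A", True),
       ("B", True), ("B", True), ("B", False), ("B", True)]

rates = success_rates(log)
(s_a, n_a), (s_b, n_b) = rates["A"], rates["B"]
print(rates)
print(two_proportion_z(s_a, n_a, s_b, n_b))
```

With real logs the number of sessions per version would be far larger than in this toy example, which is what makes such comparisons informative.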
Several resources provide valuable details and give guidance on the use of appropriate evaluation methodologies and practices (Blandford et al., 2008; Dumas and Redish, 1999; Hearst, 2009; Horsky et al., 2010; Käki and Aula, 2008; Kohavi et al., 2007, 2009; Lam et al., 2011; Mack and Nielsen, 1995; Munzner, 2009; Plaisant, 2004; Shneiderman and Plaisant, 2006; Tory and Möller, 2005).
CHAPTER 3
Linguistic competence
In this chapter, we demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ESOL examination scripts. In particular, we report experiments on rank preference SVMs trained on FCE data, on detailed analysis of appropriate feature types derived automatically from generic text processing tools, and on comparison with different discriminative models. Experimental results on the publicly available FCE dataset show that the system can achieve levels of performance close to the upper bound (as defined by the agreement between human examiners on the same corpora) for directly measuring linguistic competence. We report a consistent, comparable and replicable set of results based entirely on the FCE dataset and on public-domain tools and data, whilst also experimentally motivating some novel feature types for the automated assessment (AA) task, thus extending the work described in Briscoe et al. (2010). Finally, using a set of outlier texts, we test the validity of the model and identify cases where the model’s scores diverge from those of a human examiner.
Work presented in this chapter was submitted to, and accepted as a full paper at, the 49th meeting of the Association for Computational Linguistics: Human Language Technologies (Yannakoudakis et al., 2011).
3.1 Extending a baseline model
As described in Chapter 2, Section 2.3, Briscoe et al. (2010) were the first to apply discriminative machine learning methods to the AA task; such methods often outperform non-discriminative ones in the context of text classification (Joachims, 1998). They present a novel variant of the batch perceptron algorithm (Bös and Opper, 1998), the Timed Aggregate Perceptron (TAP), that efficiently learns preference ranking models (see next section for details). They experimentally show that their model, employing a variety of (linguistic) features and trained on around 3,000 FCE ESOL texts, performs very close to the upper bound and outperforms generative counterparts. Our contribution within this framework is fivefold:
1. We focus on reporting a replicable set of results based entirely on public domain tools and (training/test) data.
3. We study the contribution of different feature types to the AA task.
4. We present a comparison of different machine learning models.
5. We test the validity of our best model on outlier texts.