
This chapter introduced the task of automatically labeling student utterances in tutorial dialogues with moves from the DISCUSS taxonomy. More importantly, this work provides a starting point for exploring the practical and technical limitations associated with recognizing dialogue acts, rhetorical forms, and predicate types in unseen speech. The methods detailed above provide a straightforward, tractable mechanism for dealing with complex, multi-dimensional, multi-label dialogue move taxonomies. Recasting what is traditionally a strict multi-class classification task as a series of binary decisions circumvents the hard decision of choosing a single best label and allows for finer tuning of tagging behavior.
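The binary-decision framing can be illustrated with a minimal sketch. It assumes each label already has its own binary classifier that emits a score between 0 and 1. The function name `tag_utterance`, the label names, and the threshold values are illustrative placeholders, not taken from the thesis or from the DISCUSS taxonomy definition.

```python
from typing import Dict, List


def tag_utterance(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[str]:
    """Return every label whose binary classifier score clears its own threshold.

    Each label is accepted or rejected independently, so an utterance may
    receive zero, one, or several labels (hypothetical setup, for illustration).
    """
    return sorted(
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    )


# Hypothetical per-label scores for one student utterance.
example_scores = {"Answer": 0.91, "Describe": 0.62, "Explain": 0.35}

# A stricter threshold on one label trades recall for precision on that label only.
strict = tag_utterance(example_scores, {"Describe": 0.7})

# A looser threshold on another label admits it without affecting the rest.
loose = tag_utterance(example_scores, {"Explain": 0.3})
```

Because every label has its own decision threshold, tagging behavior can be tuned one label at a time, which a single argmax over all classes does not allow.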

Using a set of standard features for dialogue move classification in conjunction with DISCUSS-specific features yielded a set of promising results. The frequency-adjusted F1-scores of 0.935 for dialogue acts, 0.704 for rhetorical forms, and 0.530 for predicate types are on par with corpus inter-annotator agreement statistics. Training and evaluation of these classifiers yielded a detailed error analysis, which shed light on the common problems and pitfalls associated with this task. These analyses also provide guidance to further refine the DISCUSS taxonomy and to improve the annotation process. Merging similar or ambiguous DISCUSS labels helped to reduce noise and sparsity in the training data, and ultimately gave a boost in classification performance. The refined taxonomy and collection of DISCUSS classifiers are used in Chapter 8 to investigate the role of DISCUSS in characterizing the potential for learning in tutorial dialogues.

Question Ranking and Selection in Context¹

An overarching goal of this thesis is to improve the dialogue capabilities of intelligent tutoring systems. This chapter focuses on the crucial subtask of selecting follow-up questions within the context of a tutorial dialogue. Although asking questions is only a subset of the overall tutoring process, it is still a complex process that requires an understanding of the dialogue state, the student’s ability, and the learning goals. The challenge in this task is not simply to pick a context-relevant question, but to prioritize those that also encourage self-expression and stimulate learning and learner interest.

¹ Parts of this chapter were adapted from Learning to Tutor Like a Tutor: Question Ranking in Context, in Proceedings of the 11th International Conference on Intelligent Tutoring Systems (Becker et al., 2012a), and Question Ranking and Selection in Tutorial Dialogues, in Proceedings of the 7th Workshop on Building Educational Applications Using NLP (Becker et al., 2012b).

This work frames question selection as a task of scoring and ranking candidate questions for a specific point in the tutorial dialogue. Since dialogue is a dynamic process with multiple correct possibilities, the potential moves and questions used in this study are not restricted to those found in the MyST-WOZ corpus described in Chapter 5. Instead, this work explores the possibilities that stem from the question “What if a fully automatic question generation system existed?” This is accomplished through the use of candidate questions hand-authored for each dialogue context. To investigate the mechanisms involved in ranking follow-up questions, these questions are paired with judgments of quality from experienced human tutors. Features extracted from the questions’ surface form and underlying DISCUSS dialogue representation are embedded in machine learning classification algorithms to ultimately learn a function for ranking the appropriateness of questions for specific points in a dialogue.
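One way to picture learning a ranking function from tutor judgments is as pairwise preference learning. The sketch below uses a simple perceptron-style update over pairs of candidates from the same dialogue context. This is an assumed stand-in for illustration only and is not claimed to be the learner used in this chapter. The feature names (`open_ended`, `on_topic`, `yes_no`) and the ratings are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Features = Dict[str, float]


def score(weights: Dict[str, float], feats: Features) -> float:
    """Linear score: sum of weight * feature value (missing weights count as 0)."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_pairwise(
    contexts: List[List[Tuple[Features, float]]], epochs: int = 10
) -> Dict[str, float]:
    """Learn weights so higher-rated candidates outscore lower-rated ones.

    Each context is a list of (features, tutor_rating) pairs for one point
    in a dialogue. Candidates are only compared within the same context.
    """
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for candidates in contexts:
            for (fa, ra), (fb, rb) in combinations(candidates, 2):
                if ra == rb:
                    continue  # no preference expressed for this pair
                better, worse = (fa, fb) if ra > rb else (fb, fa)
                if score(weights, better) <= score(weights, worse):
                    # Misordered pair: move weights toward the preferred candidate.
                    for name, value in better.items():
                        weights[name] = weights.get(name, 0.0) + value
                    for name, value in worse.items():
                        weights[name] = weights.get(name, 0.0) - value
    return weights


def rank(weights: Dict[str, float], candidates: List[Features]) -> List[int]:
    """Return candidate indices ordered from highest to lowest score."""
    return sorted(
        range(len(candidates)),
        key=lambda i: score(weights, candidates[i]),
        reverse=True,
    )


# Hypothetical candidates for one dialogue context, with tutor ratings.
q_open = {"open_ended": 1.0, "on_topic": 1.0}
q_plain = {"on_topic": 1.0}
q_yes_no = {"yes_no": 1.0}

learned = train_pairwise([[(q_open, 3.0), (q_plain, 2.0), (q_yes_no, 1.0)]])
order = rank(learned, [q_yes_no, q_plain, q_open])
```

Comparing candidates only within a context reflects the framing above: a question is judged relative to its alternatives at one point in the dialogue, not against questions from unrelated contexts.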

The results show promise, with the best question ranking models exhibiting performance on par with experienced human tutors. Furthermore, training models on judgments collected from individual judges (tutors) helps to shed light on the differences in pedagogical style and questioning tactics, even among tutors with similar training and backgrounds. The experiments and results detailed below provide three major contributions toward the larger goal of enabling computers to learn tutorial dialogue policies directly from human examples. First, they demonstrate the utility and importance of rich dialogue representations, such as DISCUSS, for modeling decision making in task-oriented dialogues. Second, they provide a framework for learning behavior from Wizard-of-Oz data. Lastly, the question ranking task gives a scaffold for evaluating and learning behaviors using fully automatic question generation.

7.1 Connections to Prior Work

Learning tutorial dialogue policies from corpora is a growing area of research in NLP and ITS. Existing work has made use of hidden Markov models (Boyer et al., 2009a) and reinforcement learning (Chi et al., 2010, 2008) to discover tutoring strategies that maximize learning gains; however, much of this work assumes there is only one correct behavior, and the additional complexity required to model individual tutoring styles would require much more data. This work adopts an approach similar to that of Ai and Litman (2008), who use ranking to predict human judgments of simulated dialogue quality.

There is also an expanding body of work that applies ranking algorithms to the task of question generation (QG) using approaches such as over-generation-and-ranking (Heilman and Smith, 2010a), language model ranking (Yao, 2010), and heuristics-based ranking (Agarwal and Mannem, 2011). While these efforts center on issues of grammaticality, fluency, and content selection for automatic creation of standalone questions, the experiments described in this chapter shift focus toward the higher-level task of choosing context-appropriate questions. The work presented in this chapter merges aspects of these QG approaches with the sentence planning tradition from natural language generation (Walker et al., 2001; Rambow et al., 2001). In sentence planning, the goal is to select lexico-structural resources that encode communicative action. Rather than selecting representations, this system uses them directly as part of the feature space for learning functions to rank the questions’ actual surface form realization.

Previous work in categorizing dialogue acts and questions for tutoring (Graesser and Person, 1994; Core and Allen, 1997; Pilkington, 1999) has helped to shed light on the nature of interactions between tutors and students. Corpora tagged with dialogue and tutoring acts have been used to explore the correlation between tutoring moves and learning (Jackson et al., 2004; Litman and Forbes-Riley, 2006) as well as specific behaviors such as when to ask “why” questions (Rosé et al., 2003), provide hints (Tsovaltzi and Matheson, 2001), or insert discourse markers (Kim et al., 2000). To the best of this author’s knowledge, there has been no previous work on ranking questions for a tutorial dialogue context, nor has there been analysis of the role of dialogue act features in learning differences in tutoring style between experienced tutors.
