SPECIFICATIONS
2. General specifications.
One of the most fundamental challenges when applying multi-armed bandit techniques to the problem of selecting questions in education software is how to define the `reward' of a question. Intuitively, this reward should measure the amount of learning the question provided, or how much the student benefited from answering it. A bandit algorithm will learn to suggest questions with high reward, so the reward must be defined appropriately to ensure that the algorithm behaves in the desired way. Consider, for example, defining the reward as whether the student got a question correct. We may believe that it is desirable for students to get questions correct; however, using correctness as the reward in a bandit problem will lead the algorithm to suggest questions that are too easy for the student, as these have the highest chance of being answered correctly. Instead, there are several different approaches that can be taken. We discuss some of these here.
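The pitfall of using correctness as the reward can be illustrated with a small simulation. This is only a sketch under stated assumptions: the epsilon-greedy strategy, the three success probabilities, and the function name `run_epsilon_greedy` are chosen for illustration and are not taken from the text.

```python
import random


def run_epsilon_greedy(p_correct, rounds=2000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy where reward = 1 if the answer is correct, else 0."""
    rng = random.Random(seed)
    n_arms = len(p_correct)
    pulls = [0] * n_arms    # how often each question type was suggested
    means = [0.0] * n_arms  # running average reward of each question type
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a question type at random
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = 1.0 if rng.random() < p_correct[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean update
    return pulls


# Hypothetical probabilities of a correct answer: easy, medium, hard questions.
pulls = run_epsilon_greedy([0.95, 0.6, 0.2])
print(pulls)
```

Under these assumptions the easy question type should receive the large majority of suggestions, which is the behaviour the reward definition was meant to avoid.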
In most online education systems, the data collected when a student answers a question will include whether they got it correct and how long it took them to answer it. Hence, one option is to define the reward in terms of this data. For example, if the student took a long time to answer a question and then eventually got it correct, this would suggest that they thought about it a lot and then managed to figure it out. This is potentially the sort of question we want to be giving to the student. Hence, one could define the reward as $r_t = \mathbb{I}\{\text{correct}\}\, s_t$, where $s_t$ is the time it took them to answer the $t$th question. This would stop the system suggesting really easy questions that can be answered very fast. One possible drawback of this approach is that it treats all incorrect answers the same. There are different degrees of incorrectness which could be used to inform rewards (e.g. in a multiple choice scenario, one wrong answer may be closer to the correct answer than another). There have also been several similar data-based definitions of reward in the literature. For example, Clement et al. (2015) define the reward as the difference in the proportion of the last $d$ questions answered correctly, and Rafferty et al. (2011) use the negative of the time taken to answer the question as the reward.
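The two time-based rewards described above can be sketched as follows. The `Response` record and its field names are hypothetical, introduced only to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One answered question (field names are illustrative assumptions)."""
    correct: bool
    seconds: float  # s_t: time taken to answer the t-th question


def reward_correct_time(resp: Response) -> float:
    """r_t = I{correct} * s_t: time spent, counted only if the answer is correct."""
    return resp.seconds if resp.correct else 0.0


def reward_negative_time(resp: Response) -> float:
    """Negative of the time taken to answer the question."""
    return -resp.seconds


slow_correct = Response(correct=True, seconds=90.0)
fast_correct = Response(correct=True, seconds=5.0)
slow_wrong = Response(correct=False, seconds=90.0)

for resp in (slow_correct, fast_correct, slow_wrong):
    print(resp, reward_correct_time(resp), reward_negative_time(resp))
```

Note that `reward_correct_time` gives every incorrect answer a reward of zero regardless of how it was wrong, which is the drawback mentioned above.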
An alternative approach is to use an educational model and define the reward in terms of this. For example, if the model consists of various parameters representing the student's understanding of different topics, where large values indicate a high understanding of the topic, one approach could be to define the reward as the difference between the parameters after and before the question has been answered and the model has been updated with the new data. One drawback of this approach is that the reward can only ever be as good as the model, so if the model is wrong, the questions chosen may not be optimal. Model-based approaches have been considered in the literature; for example, Clement et al. (2015) measure the reward as the difference between the knowledge required to answer a question and the current knowledge of the student (both calculated by a model).
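A minimal sketch of the model-based reward follows. It assumes the student model is a mapping from topic to a single understanding parameter and that the per-topic differences are summed; both choices, along with the `toy_update` rule, are illustrative assumptions and not a model from the literature.

```python
from typing import Dict

Knowledge = Dict[str, float]  # topic -> estimated understanding (assumed form)


def toy_update(knowledge: Knowledge, topic: str, correct: bool,
               step: float = 0.1) -> Knowledge:
    """Hypothetical update rule: nudge the topic estimate up or down."""
    updated = dict(knowledge)  # copy so the 'before' state is preserved
    updated[topic] += step if correct else -step
    return updated


def model_based_reward(before: Knowledge, after: Knowledge) -> float:
    """Sum over topics of (parameter after the update - parameter before)."""
    return sum(after[topic] - before[topic] for topic in before)


before = {"fractions": 0.2, "algebra": 0.5}
after = toy_update(before, "fractions", correct=True)
reward = model_based_reward(before, after)
print(reward)
```

Any error in the update rule passes straight through to the reward, which is why this approach is only as good as the underlying model.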
In some cases, there may be something observable that we directly want to maximize. For example, if we know that student progress is monitored through a sequence of questions at the end of every homework (a mini-test or equivalent), then it is clear that we wish to give them questions which will maximize the score in these tests. Alternatively, if participation is optional, we could define the reward as the number of future questions answered. This definition of reward is very much dependent on the specifics of the online educational system, as not all of them will have the capacity (or desire) to test students regularly or measure engagement. Using alternative observable features to define the reward has been considered by Liu et al. (2014), Erraqabi et al. (2017), Lindsey et al. (2013), and Lan and Baraniuk (2016). In particular, Liu et al. (2014) use whether the next (randomly generated) question is answered correctly as a proxy for reward, whereas Erraqabi et al. (2017) use the number of additional questions the student answers. Lindsey et al. (2013) look at the score on a test after giving the student a sequence of questions, and Lan and Baraniuk (2016) consider an environment where a test is given after every activity selected by the algorithm.
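Two of the observable rewards mentioned above can be sketched as simple functions of a session record. The function names and the list-of-booleans representation of a session are assumptions made for illustration; they are not the implementations used in the cited studies.

```python
from typing import Sequence


def reward_next_correct(correct: Sequence[bool], t: int) -> float:
    """Proxy reward for question t: 1 if the following question was correct."""
    return 1.0 if correct[t + 1] else 0.0


def reward_additional_questions(n_answered: int, t: int) -> int:
    """Number of further questions answered after question t (0-indexed)."""
    return n_answered - (t + 1)


# Hypothetical record of one student's answers, in order.
session = [True, False, True, True]

print(reward_next_correct(session, 1))
print(reward_additional_questions(len(session), 1))
```

Both rewards are only observed after the student has continued working, so in practice the bandit update for question t would have to wait until that later data is available.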
From the above discussion, it is clear that defining the reward for a bandit algorithm used in education software is not straightforward. Many approaches have been proposed, each of which has advantages and disadvantages. Furthermore, not all of these definitions will be appropriate in all online education systems. Interestingly, in the studies that use multi-armed bandits in a live educational environment with real students, there has been no consensus on which definition of reward to use. However, it is pleasing that in most cases the bandit algorithm still performed well in practice. Hence, the challenge of defining the reward when using a multi-armed bandit algorithm largely comes down to the setup of the system and which particular features the educator/designer wants to optimize. In what follows, and for the remainder of the thesis, we will always assume that the reward has been defined and that it is an appropriate measure of the learning process. We now discuss the specific problems in education that have motivated the work in this thesis.