Shute and Regian (1993) observe that few intelligent tutoring systems (ITS) in the literature have been subjected to rigorous, controlled evaluations. The results from those that have been evaluated in this manner demonstrate that these tutors do accelerate learning with no degradation in outcome performance compared to appropriate control groups. However, the authors are concerned not only with the lack of interest in evaluation but also with the fact that only studies indicating successful results are being published. Shute and Regian put this down to the interdisciplinary makeup of the ITS community and a need for appropriate training. In order to provide some guidance, they maintain that what is required is "to evaluate the efficacy of an ITS. Efficacy, in this paper, refers to assessing if the system teaches what it was intended to teach, to what degree, in comparison to what, and at what cost". To this end the authors provide seven principles which are believed to underlie a good ITS evaluation study.
Before outlining the seven principles we need to review the general approach to research and development. The aim of experimental design is to measure the effect produced by some independent variable on a dependent variable. Shute and Regian point out that "If the causal link between independent manipulations and dependent measures is equivocal, the experiment is said to lack internal validity. If the ability to generalize from the experimental sample to the population of interest is equivocal, the experiment is said to lack external validity". The authors argue that internal validity is easier to achieve in an evaluation conducted within "a controlled, laboratory setting (addressing more basic research questions)", whereas external validity is easier to attain in an "evaluation experiment conducted in ‘the field’ or natural setting (more applied research questions)". As we have indicated earlier, this represents the classic tradeoff between internal and external validity. Research in field settings (e.g., high school classrooms) is preferred because all aspects of the target situation are present, but so too are the potential confounds. Consequently research in laboratory settings is often selected because of the extreme control that is possible. However, results in the laboratory are open to accusations of artificiality and lack of generalization. The authors believe that neither laboratory nor field research alone will give a complete and accurate picture of the instructional effectiveness of a particular intervention. Indeed they hold that "research on pedagogy should be driven by theory and constrained by empirical observation". Here theory refers to a coherent, plausible body of ideas about how people acquire, store, retrieve and apply knowledge and skill. This balanced view of the relationship between theory and data colours the approach taken by the authors to evaluation.
The authors argue that, in their experience, evaluation studies fail not because of the efficacy of the ITS, but rather because of poor experimental design, inadequately operationalized constructs and measures, or deficient logistical planning and implementation. Therefore they present seven main principles that may be used to design, plan, and implement an effective ITS evaluation:-
(1) Delineate the goals of the tutor,
(2) Define the goals of the evaluation study,
(3) Select the appropriate design to meet the defined goals,
(4) Instantiate the design with appropriate measures, number and type of subjects, and control conditions,
(5) Make careful logistical preparations for conducting the study,
(6) Pilot test the tutor and the study, and
(7) Determine the primary data analyses as you plan the study.
The importance of the first principle becomes apparent when the instructional goals have shifted (subtly or markedly) over the developmental life-cycle of the ITS. It is therefore a good idea to review these goals and ensure the evaluation designer is intimately knowledgeable about the following critical issues:
What instructional approach underlies the tutor? - Is it a system with pedagogical intelligence, a coached-practice environment, or a more free-form discovery microworld? Is the system supposed to guide learning, or provide a rich environment for the induction of principles, or allow students to practice skills?
What learning theory does it assume? - Is there a clear knowledge or skill-acquisition theory in the literature that motivates the instructional approach of the tutor?
What exactly does it teach? - Specific and measurable knowledge or skills should be expressed clearly as the desired learning outcomes.
What other impacts is it expected to have? - What other ways is the tutor expected to affect the student? For example, influencing perceived self-efficacy.
In what context is it supposed to operate? - Is the system intended to supplement a lecture or laboratory, or provide stand-alone instruction? Does the system teach to individuals or small groups? What prior knowledge is assumed of the students?
The second principle means thinking carefully about the goals of the study. Selection of an appropriate experimental design will depend on knowing what you want to find out from the research. Some of the following questions are worth considering:-
What would you like to know after the study is completed? - Do you want to know if the ITS improves the students’ present knowledge and skills, or whether the tutor affects the students’ capabilities to perform some other task, or even if the tutor is better than conventional classroom instruction using the same material (i.e. a comparative study)?
By what standards will you measure success? - You need to identify ways to measure whatever is being taught (e.g., indices to assess the veracity of knowledge, or successful application of problem-solving skills) and to whom your students will be compared on these measures.
What are potential confounds, and which of these can you control? - Pinpointing potential confounds before conducting the study makes it easier to control them (beforehand, by altering the design, or afterwards, statistically).
Will you use quantitative indices, protocols, or observational data? - These three types of data represent the most prevalent means of capturing what a student is learning from the ITS.
The third principle involves choosing an appropriate design to test your research questions. Are you conducting a formative or a summative evaluation of your system? Formative evaluations have an internal control condition and suggest how the system can be made better, whereas summative evaluations have an external control condition and can indicate how the system compares to some other system or approach. Formative evaluations are conducted during the developmental phase of an ITS to find weaknesses early enough for design changes to be implemented. For instance, the authors conducted a pilot study with real subjects and discovered that many of them had significant difficulties learning the programming curriculum because they lacked (or had forgotten) some prerequisite knowledge presumed by the system (e.g., not knowing what an integer or a variable was). So they built a pre-tutor, an approximately 2-hour computer-assisted instruction (CAI) module that taught 10 such prerequisite concepts.
Summative evaluations take place at the end of ITS development, or at the end of major development stages, and assess various aspects of the finished product. The authors present five different designs that are suitable for summative evaluation studies (some of these will be covered in chapter seven):-
(1) Within-system Design - How do two or more alternative versions of a single tutor compare to one another?
(2) Between-system Design - How effective is your tutor in relation to another one teaching the same subject matter?
(3) Benchmark Design - How does your tutor fare in relation to some standard instructional approach?
(4) Hybrid Design - A combination of the above options, and
(5) Quasi-experimental Design - How well does your system operate in a real-world setting?
The fourth principle is to plan the details of the design carefully by considering and instantiating the dependent (outcome) and independent (manipulated) measures, the number and type of subjects needed in the experiment, and the appropriate control group(s). Shute and Regian argue that it is very important to judge which criterion tasks (or other dependent measures assessing knowledge and skill acquisition) are needed in your evaluation. The dependent measures must reflect the goals of the ITS and relate to the goals of the study. Moreover they recommend using multiple dependent measures: "Firstly, because ITS instruction is done on computers, you have the option to capture as much data, of whatever kind, you choose. You should err on the side of gathering too much data....Second, it is in the nature of learning and instructional research that the effectiveness of an intervention will depend, in part, on the aspects of performance you are trying to teach, and how you measure these indicators of performance". The authors provide examples of the kinds of learning or performance measures that can be collected, i.e. performance latency, performance accuracy, declarative knowledge, procedural knowledge, and procedural skill. They also caution that "Teaching one thing and measuring another is bound to result in a failed study".
With any learning task, differences between individuals are significant. Hence individual difference measures, such as profiles of knowledge, skills and traits, must be collected. The authors give several reasons for collecting individual difference measures in an experimental study: "First, you can be sure that your treatment effects are real, and not simply an artifact of differential learner traits. Second, if you have collected aptitude data and find that they do impact the treatment condition, if necessary, you may then statistically control for those data that affect the treatment condition. Finally, if you don’t have any aptitude data, then you cannot investigate aptitude-treatment interactions". Ensuring that the evaluation study is rigorously controlled means overcoming one of the biggest obstacles, that of identifying suitable control conditions. Since various uncontrolled conditions and unanticipated interactions appear across settings (e.g. different instructional materials, classroom dynamics, and teachers’ personalities), the treatment effects can be masked. Therefore the authors offer a number of guidelines:
(a) Use tutors that are based on a theoretically principled approach to learning and instruction;
(b) Give preference to data collected at a single site (rather than multiple sites) with standard procedures and measures;
(c) Obtain a range of demographic and aptitude measures from subjects; and
(d) Pre-specify a standard criterion task along with multiple dependent measures to be taken at various intervals during the course of learning.
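The aptitude-treatment interactions mentioned above can be given a rough first look by comparing how strongly outcome depends on aptitude within each condition. The following is a minimal sketch only, not the authors' procedure: the function names, the record layout and the toy numbers are all assumptions made for illustration, and unequal slopes merely hint at an interaction that a proper statistical test would still have to confirm.

```python
from statistics import mean


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def aptitude_slopes(records):
    """Return the aptitude-to-outcome slope within each treatment condition.

    records: iterable of (condition, aptitude, outcome) tuples.
    """
    by_condition = {}
    for condition, aptitude, outcome in records:
        xs, ys = by_condition.setdefault(condition, ([], []))
        xs.append(aptitude)
        ys.append(outcome)
    return {c: slope(xs, ys) for c, (xs, ys) in by_condition.items()}


# Hypothetical records (made-up numbers, purely illustrative).
toy_records = [
    ("tutor", 80, 58), ("tutor", 100, 60), ("tutor", 120, 62),
    ("control", 80, 60), ("control", 100, 70), ("control", 120, 80),
]
toy_slopes = aptitude_slopes(toy_records)
```

In this toy data the outcome rises more steeply with aptitude in the control condition than in the tutor condition, which is the kind of pattern that cannot be examined at all if no aptitude data were collected.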
Shute and Regian emphasise that acquiring the right type and number of subjects (i.e. participants in research) is paramount for an ITS evaluation, and with this in mind they make a number of recommendations:-
(a) identify the target population for which the tutor is intended (e.g. university students taking an introductory Astronomy course) and make sure that the sample you are testing matches your target population;
(b) calculate how many subjects are needed for the study (as a general rule-of-thumb, ITS evaluations should have at least 30 subjects per condition for simple treatment comparisons);
(c) studies using individual difference measures as independent variables should use about 100 subjects per treatment, although this rule-of-thumb can be relaxed somewhat for sufficiently powerful designs involving extreme groups or matched cases;
(d) random assignment of subjects to conditions is critically important and should be achieved whenever possible.
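Recommendations (b) and (d) lend themselves to a short sketch. The code below is illustrative and is not taken from Shute and Regian: it uses the standard normal-approximation formula for a two-group comparison of means, n = 2((z_(1-alpha/2) + z_power) / d)^2, where the standardized effect size d, the significance level and the power are assumptions the evaluator must supply. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def subjects_per_condition(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-sided, two-group comparison.

    Normal approximation; a t-based calculation gives slightly larger numbers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def assign_randomly(subject_ids, conditions, seed=None):
    """Shuffle subjects, then deal them round-robin so group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, subject in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(subject)
    return groups


# Example: 60 hypothetical subject IDs split across two conditions.
groups = assign_randomly(range(60), ["tutor", "control"], seed=1)
```

Under these assumptions `subjects_per_condition(0.8)` gives 25 and `subjects_per_condition(0.5)` gives 63, so a figure of 30 per condition is enough only when a fairly large effect is expected. Fixing the `seed` makes the assignment reproducible for the study record.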
Principle five deals with making sure the study has been planned properly. Examples of poor logistical preparation include subjects failing to provide a critical piece of data, or insufficient materials being available at the data collection site. These kinds of calamity can ruin good studies and render expensive data useless. They can be avoided with careful thought beforehand and by considering (in advance) the worst possible scenarios, such as what you would do if your hardware or software failed. Many possible obstacles can also be avoided by piloting the tutor and the study, which is the sixth principle. As Shute and Regian say, "It is important to find these things out before committing to the expense and trouble of a full evaluation". The authors cite a few things to look for during the pilot test of the tutor:
(a) Is the tutor running bug-free?
(b) Do subjects know what they should be doing at all times?
(c) Are subjects learning anything?
(d) Do subjects indicate that they like the system?
(e) Did you estimate the learning time appropriately, or do subjects take longer (or less time) than you had anticipated? and
(f) Were all subjects able to complete the tutor?
Finally, principle seven requires that the primary data analysis should be planned before carrying out the study. A number of possible statistical techniques are presented and the authors clearly state that "Some kinds of statistical analyses are better suited to certain classes of design types than others". Three categories of statistical techniques are worth noting:
Confirmatory Data Analyses - used when you have a specific hypothesis you want to test;
Exploratory Data Analyses - an interactive and iterative type of data analysis with no fixed procedure, whose purpose is to leave the door open to alternative patterns that may exist in your data;
Cost-Benefit Analyses - methods which estimate both the cost and the utility of systems.
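One way to honour principle seven is to write the primary confirmatory analysis as code before any data are collected. The sketch below is an assumption about what such an analysis might look like for a simple two-condition comparison (it is not prescribed by Shute and Regian): it computes Cohen's d with a pooled standard deviation and Welch's t statistic with its approximate degrees of freedom. Turning t into a p-value requires a t distribution, which the Python standard library does not provide, so that step is left out.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def welch_t(treatment, control):
    """Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom."""
    v1 = variance(treatment) / len(treatment)
    v2 = variance(control) / len(control)
    t = (mean(treatment) - mean(control)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (len(treatment) - 1) + v2 ** 2 / (len(control) - 1)
    )
    return t, df
```

Deciding on functions like these in advance also forces a decision about which outcome measure is primary, which in turn feeds back into the choice of dependent measures under principle four.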
Many of the principles outlined are, surprisingly enough, no more than common sense guidelines. The authors have had experience of a large number of unreported studies, which remain unpublished largely because they failed. They explain their main reason for proposing these seven principles was in the hope of reducing the number of flawed future ITS studies.
4.4.2 Informal evaluations - the alternative to "controlled" evaluation
Twidale (1993) emphasises that problems may arise when using rigorous experimental methods and instead expounds the usefulness of informal techniques. He maintains that rigorous, formal, experimental, summative evaluation (i.e. controlled evaluation) is often assumed to be the proper method and any alternative is seen as a poor one. However the author contends there are a number of arguments against this:-
(1) Rigorous experiments are large, slow and costly;
(2) A controlled experiment only really measures one thing;
(3) A controlled experiment produces averaged out figures of overall performance;
(4) Unexpected interactions may lead to misleading results;
(5) The effect of the interface can be overwhelming;
(6) Learning to learn with an ILE (intelligent learning environment) takes time.
Normally, experiments are associated with summative evaluation, while informal techniques are more likely to be used for formative evaluation. Controlled evaluation is used more frequently because of its claim to a higher status of scientific objectivity and reproducibility. Furthermore, the research projects that outside bodies are likely to fund are those with controlled evaluative research, because it produces objective measures of effectiveness. Consequently, experiments are chosen because they are judged to be the most desirable method. Other factors, however, also play a role in this decision.
The author suggests that three of the most common disciplines in intelligent learning environments (i.e. computing, psychology and education) each represent their own research paradigm. Research in psychology and education belongs to the scientific paradigm and is more likely to emphasise the formal, objective, summative experiment, whereas parts of computing have more affinity with the engineering paradigm (which derives proof that a system works from its construction). A number of issues result from this culture clash. The engineering paradigm necessitates making decisions between design options and involves trade-offs on a number of dimensions (i.e. intuitive decision-making). Experiments are usually only useful when their scope is restricted to a few significant decisions within any single design (i.e. formal decision-making).
Nonetheless, there are several reasons why the scientific paradigm of repeatable experiments is less likely to be followed in the general field of computing, or specifically in Intelligent Learning Environments. Firstly, experiments take a lot of time to set up and need to build upon earlier ones to allow comparisons. Rapid improvements in hardware and software can reduce the value of the information gained. Experiments which involve computer systems can also be criticised because these systems are likely to grow in sophistication. Consequently, unexpected or unwanted results can be blamed on some technical flaw such as a primitive interface feature. Similarly, previous negative conclusions can be dismissed by pointing to better features incorporated in a newly developed system. In addition, it is very expensive to reproduce systems developed on particular hardware and using particular combinations of software. Another reason may be that computing researchers are more inclined to build and improve systems than to run comparative experiments.
Informal techniques, on the other hand, are particularly useful in formative evaluation, where the system is incomplete and so assessment of overall performance is inappropriate. In comparison to summative evaluations, the results from informal evaluations are more likely to be prescriptive, that is, to describe implications for future developments drawn from the shortcomings of the present system. One example of an informal technique is the individual case-based study, which is informative when the system needs to be studied in detail for small, incremental changes, especially when performance is less than ideal. A different technique is the Wizard of Oz method, which is widely used when experiments need to be undertaken during formative evaluation and conditions need not be as rigorous as those in controlled evaluations. The user interacts with a computer interface, but the input is passed to a human processor in another location who does the processing and passes back the reply via the interface. This method is suitable for testing the efficacy of the interface in advance of the development of the internal components. In addition, it can be used to test individual internal components of an incomplete system. Although Wizard of Oz techniques involve fewer subjects, they are also more labour intensive. Twidale argues that the differences between informal and controlled evaluation may be summarised by saying that one should use controlled evaluation when one wants to show the advantages of a system, and informal evaluation when one wants to reveal difficulties.
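The Wizard of Oz arrangement described above can be sketched in a few lines. This is a hypothetical illustration rather than any system described by Twidale: the class and function names are invented, and the "remote" human is represented simply by a callable, which in a real study might be a network link to an operator in another room.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WizardOfOzSession:
    """Interface shell that forwards each student input to a hidden human operator."""

    wizard: Callable[[str], str]  # stands in for the remote human "processor"
    log: List[Tuple[str, str]] = field(default_factory=list)

    def submit(self, student_input: str) -> str:
        reply = self.wizard(student_input)       # the human does the processing
        self.log.append((student_input, reply))  # keep the exchange for later analysis
        return reply


def console_wizard(student_input: str) -> str:
    """Hypothetical operator channel: the wizard types a reply at a console."""
    return input(f"[student wrote] {student_input}\n[wizard reply] > ")
```

Because the operator is just a callable, the same interface can later be pointed at a partly built internal component instead of a person, which matches the use of the method for testing individual components of an incomplete system. The log gives the evaluator a record of every exchange.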
Informal techniques can appear imprecise and may be perceived as far from scientific. However, Twidale (1993) asserts there are ways to ensure greater rigour without losing the benefits of speed and simplicity from these informal methods: "The test for the effectiveness of the evaluation technique is whether it leads to the building of better systems....A formative study should reveal problems in the system that can be corrected in subsequent versions....It should also reveal those elements that appear to be successful even in stripped down prototypical form, providing evidence that in the improved version of the system, they would be all the more effective....The acquisition of anecdotal evidence can also be used to answer the criticism of many studies both formally experimental and not, that they are unique to their circumstances".
A significant advantage of informal techniques over controlled evaluation is that they can