
Questions

Multiple-choice tasks are often employed in large-scale assessments, particularly to permit easy marking of responses. Open-response tasks, however, are commonly advocated as preferable to permit students to exhibit their own conceptions of the relevant features of the task. Mevarech and Kramarsky (1997) noted that the open-ended nature of their task “may explain why some of the alternative conceptions diagnosed... were not identified in previous studies” (p. 255). In the current study, the representation tasks were designed to be set in realistic contexts, and the desire to allow for alternative conceptions led to providing students with a blank page on which to draw their representation as done by Mevarech and Kramarsky, rather than providing labeled axes that offer structural support but constrain the form of representations as done by Bell et al. (1987b). The choices of variables and of axes on which the variables were placed were considered structural elements of interest.

In employing open-response paper-based questions, it is particularly important that tasks and rubrics are carefully designed. Multiple-choice tasks demand clarity of task wording, although marking is trivial; as Woolfolk (1993) commented, “All test items require skillful construction, but good multiple-choice items are a real challenge” (p. 547). In contrast, extended tasks or interviews often permit repeated paraphrasing or clarifying of the task with the student. Open-response paper-based questions must balance avoiding ambiguity about what the task demands with ensuring an openness that permits students a degree of discernment. As Woolfolk (1993) commented, “The most difficult part of essay testing is judging the quality of the answers: but writing good, clear questions is not particularly easy, either” (p. 548).

It is common practice in test development for scoring (or coding) schemes to be developed concurrently with tasks, in order to assist the task writer to refine the wording of the task to correspond to the complete response that would demonstrate the understanding being assessed. If a task is poorly worded, students may respond in ways that minimally satisfy the task without demonstrating the more complex skills or understandings that the task writer aimed to assess. This distinction has been referred to as functional versus optimal responses (Fischer & Knight, 1990): the task should optimally challenge students to demonstrate the depth of understanding appropriate to the task, but also remain accessible, so that students with partial or very limited understanding can engage with the task and demonstrate what they do know without being intimidated by a task they consider too challenging to attempt. In exploratory studies, it may not be possible to anticipate in detail the richness of student responses, so rubrics may be broadly defined in terms of partial and complete responses to the demands of the task (Woolfolk, 1993). It is also good practice to trial tasks with the target audience, to troubleshoot alternative interpretations. These practices were incorporated into the current study by consulting experienced researchers about draft tasks, and then piloting the tasks with a student before administering them to larger numbers of students. Some tasks were also refined based on evidence from previous investigations.

An example of a graphing task and coding scheme, similar to one used in the current study, is shown in Figure 3.02. The task emphasised “realistic scale” in the task wording, which was a key feature of the criteria in the coding scheme used to differentiate correct responses from partial and incorrect responses. In contrast, graph form was not referred to in the task, as it formed no part of the criteria apart from having axes that may be scaled. The coding scheme included different categories at each level of correctness; the code provided a diagnosis of why a level of correctness was assigned to the response. The coding scheme included not only the wording of the criteria for each code, but also an example to assist those rating responses. Explicit criteria focused attention on what should be evident in the written response, rather than on what might be presumed, often falsely, about the student’s understanding from information outside the response, such as awareness of a student’s grade level influencing the assigned level of the response.

Task:

Using the set of axes below, sketch a graph which shows the relationship between the height of a person and his/her age from birth to 30 years. Be sure to label your graph, and include a realistic scale on each axis.

Coding:

Figure 3.02. A sample task and coding scheme from TIMSS (1995) http://isc.bc.edu/timss1995i/Items.html (see C_Items.pdf#page23 - Question A10).

When categorising responses, some student responses prove difficult to categorise. Some students offer responses that suggest an alternative interpretation of the question, such that the response does not exhibit features referred to in the coding scheme; this issue is generally minimised by careful task design and trialling of tasks. Other student responses are difficult to rate because they exhibit features that might suggest assignment to more than one category. The magnitude of this issue is commonly measured by having two raters independently assign ratings, and then calculating an inter-rater reliability. In general, good coding schemes maximise the differences between categories and minimise the differences within categories, termed external heterogeneity and internal homogeneity (Patton, 2002, p. 465). A single category used for a large number of disparate responses provides less descriptive power, and suggests that the category should be divided into more restricted, meaningful categories. Defining categories thus may involve iterative refinements to coding (Lesh & Lehrer, 2000; Maxwell, 1996; Miles & Huberman, 1994), a practice that reduces the relevance of inter-rater reliability in exploratory studies.
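As an illustration of how inter-rater reliability might be calculated from two raters' independent codes, the following is a minimal sketch. The study does not state which statistic was used; simple percent agreement and Cohen's kappa are assumed here as two common choices. The category labels and the ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of responses to which both raters assigned the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of responses.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same category
    # if each assigned codes independently at their own observed rates.
    categories = counts_a.keys() | counts_b.keys()
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical codes assigned by two raters to the same ten responses.
ratings_a = ["correct", "correct", "partial", "partial", "incorrect",
             "incorrect", "correct", "partial", "incorrect", "correct"]
ratings_b = ["correct", "correct", "partial", "incorrect", "incorrect",
             "incorrect", "correct", "partial", "partial", "correct"]

agreement = percent_agreement(ratings_a, ratings_b)
kappa = cohens_kappa(ratings_a, ratings_b)
print(f"Percent agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa is lower than raw agreement because some matches would be expected by chance alone. Where categories are still being refined iteratively, as described above, either figure would describe only one snapshot of the coding scheme.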

Preliminary investigations (Data Collections 1 and 2) in the current study employed written coding schemes and techniques of inter-rater reliability, providing evidence of high reliability, as well as of the validity of the framework as shown in examples. Responses to tasks in later investigations (Data Collection 3) were iteratively coded by the researcher and reviewed at times by the researcher’s supervisor, with examples shown in results or appendices providing evidence of the validity of the assessment framework.