• No se han encontrado resultados

CAPÍTULO 7.- MECANISMOS DE COMERCIALIZACIÓN Y

7.5 Infonavit

inferences introduce errors of measurement (EoM). EoM are the differences between a measured value and a true value. Typically, EoM are not errors in the assessment instrument nor the design study. However, they influence the data and how it might be interpreted. For these reasons, EoM cannot be totally avoided, but they can be reduced. There are two types of EoM, systematic and random. “Systematic errors are constant across some set of scores (e.g., all scores for a particular person or occasion or test form) and are therefore potentially predictable” (Kane, 2011, p. 15). Random errors, conversely, are based on statistical fluctuations that exist when measured values are inconsistent and these cannot be predicted. One way EoM are minimized is through standardization.

Kane (2011) discusses the importance of standardization and its limitations in reducing random and systematic errors. Higher degrees of standardization have a way of decreasing random

error because standardization seeks sameness by reducing differences in the testing procedure. However, standardization can increase systematic error because performance assessments are conducted in unique contexts that are harder to generalize across a set of scores. Paradoxically, standardization in the testing procedure can promote fairness but, in some cases, that can also decrease fairness. When it comes to standardization more is not always better. “Standardization is less effective for absolute interpretations” and can “increase the overall error in some cases, especially for absolute interpretations (p. 18). However, “standardization tends to promote fairness and the appearance of fairness,” important factors when considering a high-stakes assessment (Kane, 2011, p. 17). Kane suggests that standardization works well in many contexts, but not all. In some cases, standardization practices can prove problematic, especially in large scale educational accountability assessment systems such as the TPA, where there is a rub between the nature of the type of assessment selected and the need for high levels of standardization in the assessment procedure (p. 26).

In addition, systematic errors associated with standardization tend to increase when the test consequences are high-stakes. For instance, accountability comparisons across programs and states are likely to be influenced by how well programs prepared candidates to take the test, as separate from preparing candidates to teach. When significant differences occur in test preparation and procedures, the “resulting systematic errors can be especially serious for the kinds of absolute interpretations (e.g., in terms of achievement levels) typically employed in accountability systems” (p. 26). Some standardization is necessary, but it is not the fix-all for errors of measurement (Kane, 2001). In fact, efforts to standardize a performance assessment may reduce its benefits as an authentic instrument because “even modest levels of standardization are difficult to implement in real practice situations, and these efforts may tend to make the performance assessment somewhat artificial and contrived” (Kane, 1992, p. 172). As observations of performance become more

standardized, they can be seen as less representative and accurate. Procedures to improve standardization are important for reliability, efficiency, scoring, and, especially, fairness, but may

contradict the benefits of authentic, context-specific and unique characteristic of performance assessment. The role of standardization, whether there is enough or too much, should be clarified in the evaluation, generalizability and extrapolation inferences and included as a part of the evidence collected in the VA for performance assessments.

Validating Measures of Performance.

All assessments have stronger and weaker links. Performance assessment, because of its strength as more easily extrapolated to the larger domain of teacher readiness, is preferred for decisions of teaching licensure and certification in the US. Cizek (2001) points out “there is simply no way to escape making decision about students” (p. 21) and decisions about which teachers should be licensed and which should not is one of the important decisions to be made if we are to preserve an educational system that promotes democratic activism and equity for all learners. The stakes are high and somewhat problematic. Goodman, Arbona and Dominquez de Ramirez (2008), state the challenges when they write, “if the test results are not valid with respect to evaluating candidates, they create the potential for individuals to be certified who really do not exhibit competence in authentic settings” (p. 26). However, “another concern is that failure to pass these high-stakes, minimum-competency tests could eliminate otherwise qualified candidates from the teaching profession” (p. 26). Democracy in the US is dependent upon active, participatory citizens who can read, write, analyze, and defend positions on issues. We must know that our teachers can develop learning environments in which student achievement leads to, at minimum, the maintenance of the status quo. In validation terms, “the end point of our chain of inferences consists of conclusions about readiness for practice in an area, and this end point is fixed, in the sense that we do not want to change our conception of readiness for practice or our standards of performance just to make the development of the assessment procedure easier” (Kane, 1994, p. 141).

It seems reasonable to determine readiness to practice in a specific professional field, “in terms of the extent to which a candidate is prepared to manage the situations that arise in practice” (Kane, 1994, p. 139). LaDuca (1994) describes this as “professional encounters” but most licensure or

certification exams do not ask that the samples of performance on these tasks be samples of actual practice (Kane, 1994). In this way the TPA is unique. For instance, we do not ask lawyers to

demonstrate their readiness to practice law by winning an actual court case. There are many reasons why high-stakes, authentic performance assessment have not been more widely adopted, including difficulties in designing studies that can provide data that verifies the judgments and decisions around test scores. Kane describes the issues in asking for specific, real-life samples of actual practice. The central concern is that there is “great variability in clients and their problems” (p. 139) and we see this in teaching as well. Candidate placements in different districts which practice different levels of scripted curriculum, schools with differing levels of socio-economic status, and diverse teachers and supervisors with different levels of experience both teaching and mentoring novices (Moir, 2013; Moir, 2012; Sandholtz & Shea, 2011; Darling-Hammond, 2006) present variations in the testing procedure. These concerns need to be a part of the validation design. For the purposes of this study, Kane’s definition of validity, which includes the uses and decisions that follow from the interpretation, was adopted. Robert Brennan (2001) notes that the work that Kane has published “offers the greatest hope for bridging the gap” between the seemingly simple theory of validity, and the complex practice of validation (Brennan, p. 12). Using Kane’s ABV model allows for questions about how context impacts performance to be a part of the study design, which makes it preferable for high-stakes, large-scale performance assessments of teacher readiness. As a model, ABV has been successfully applied to defend the plausibility of assessment inferences in several validation studies.

Several studies have applied ABV to language acquisition assessment (Chapelle, 2011; Chapelle, Enright, & Jamieson, 2010; Ryan, 2002), and several validation studies of high-stakes, large-scale tests in education have applied ABV. Shaw and Crisp (2012) used Kane’s (2006)

framework to examine International A level Physics exams in the UK (Shaw, Crisp, & Johnson, 2012). Gotch and Perie (2012) apply Kane’s model to local assessment systems using school districts in Pennsylvania state (Gotch & Perie, 2012). Bennett, Kane and Bridgeman (2011) use a modified ABV

framework to evaluate the use of formative evaluations as a part of a summative assessment system designed by the Partnership for Assessment of Readiness for College and Careers and the Smarter Balanced Assessment Consortium. In the case of the Bennett study, a modification to Kane’s original 2006 framework was made suggesting that the inclusion of a Theory of Action (Bennett, 2010) would augment validity arguments, specifically looking at formative assessment. Most of these tests have been designed for a P-12 audience, as a part of the testing systems adopted for public education. None of these tests were performance-based. At the time of writing this study, no ABV study of the TPA, or its predecessors the PACT, BEST, or Texas Portfolio, had been published.

Data to support the validity of assessments based on standards requires the collection of several types of evidence and requires multiple types of research. Because professional assessments of readiness take as a part of their construct the predictive nature of the test scores, ongoing

research must be conducted. These should include: collections of candidate work to understand the processes by which they understand and practice pedagogy, questionnaires to understand the breadth of candidate thinking, and interviews to understand the depth of candidate thinking. In addition, studies of different sub-groups, contexts, and time-periods must be undertaken to better understand the consequences of the assessment and any biases present. The evaluation of

standards, using performance assessment, introduces more variables in the testing experience and, unlike measurement of objective norm-based test comparisons, requires that validation practices used incorporate more of the candidate experience. Validators have, since Cronbach and Meehl, understood that there is a distinction between type of test and the validation method used. The validation model selected influences our ways of thinking about the learner, the tasks, the construct, and measuring validly and reliability (Kane, Crooks, & Cohen, 1999; Taylor, 1994).

Limitations of ABV.

Even among those proponents of a broad, unified, definition of validity, ABV is not without its detractors. Several researchers have questioned the practicality or even the prescriptive nature of ABV (Schilling, 2004; Briggs, 2004). Haertel (2004) does not object to

ABV, though he does take issue with the notion that ABV can verify direct observation assessments for certification (p. 194) without also demanding that content validity be present in the model. Similarly, Talbot and Briggs (2007) argue that, when applied, Kane’s validation approach can lack “substantive knowledge of the phenomenon being measured” (Talbot & Briggs, 2007, p. 207). Mainly the objections arise because Kane’s approach takes the IA as the theory to be validated, and does not demand that a separate content construct be identified unless it is already part of the IA. Talbot and Briggs suggest that two “amendments” to Kane’s approach be added. These include “(1) a clearer distinction between assumptions and inferences in the formation and evaluation of the interpretive argument; (2) breaking up the interpretive argument into what the authors describe as elemental, structural and ecological pieces” (Talbot & Briggs, 2007, p. 205). They go on to say that “each of these pieces is then associated with specific methods for gathering the evidence needed to support a validity argument” (p. 205). This builds on an earlier argument from Briggs (2004)

suggesting that Kane’s model include a third step, design validity, implemented before the development of an interpretive argument. Kane (2004) responded to these arguments reiterating that, while complicated, interpretations and uses of test scores deserve to be clearly stated in an effective validation framework that “can foster dialogue among stakeholders on the merits of specific inferences and assumptions” (Briggs, 2004, p. 199) but that the content construct need not necessarily be the overriding theory. Kane addressed most of these criticisms in his piece in

Educational Measurement (2006) which clarified the purposes and approach of ABV and addressed

errors in previous applications of ABV that did not incorporate the construct with enough clarity. Using ABV for the present study is not without its disadvantages. First, in order for ABV to successfully offer the plausibility of the IA, the interpretive argument must be stated very carefully and clearly. However, this process is time-consuming and often difficult because the assumptions underpinning the uses of a test can be general and implicit. Kane acknowledges this when he discusses the development of the interpretive argument as a “formative” stage that is “likely to be stated in very general terms” and that “[m]uch of the interpretive argument tends to be left implicit

on the assumption, presumably that the details are self-evident or unimportant” (2004, p. 141). It may take several iterations in the research process before an interpretive argument can be fully understood. It is not simple. “Getting agreement on the interpretation and use is not always an easy task and may require extended negotiations among stakeholders” (p. 142).

Secondly, a validator who is not also the test creator or a stakeholder in the “negotiations” above can find ABV a frustrating approach to validation studies because articulating the

interpretations, assumptions, and uses for the instrument requires “getting inside the head” of the test-maker. While the interpretive argument, the assumptions, and the inferences are often stronger when developed by a validator who can see outside the assessment tunnel, if the test-maker is not accessible (or the assessment procedures, the scoring, and the test-maker are represented by separate organizations) and the validator must rely on written documents (or press releases) to determine parts of the interpretive argument, ABV is made more difficult and less applicable. This is especially true if suggestions for improvements are not sought by the test-maker.

Finally, even when an interpretive argument is stated, it is not always clear that ABV offers a “methodology” or design for a validation study. ABV requires that the researcher take some leaps to get from the development of an argument to a study design. In addition, as a comprehensive approach to validation, responding to the assumptions and the questions they provoke for each inference from each occasion of the test is a limitless endeavor that, if not carefully managed, can consume the researcher.

Advantages of ABV.

There are several advantages of Kane’s ABV framework for the present study. The advantages to applying the ABV model to performance assessments for professional judgments include:

1. It is highly tolerant (Kane, 1992, p. 534). As a model, it can be used for any type of assessment or type of evidence. The TPA as an instrument is complex and requires a validation argument that can simultaneously look both broadly and narrowly. Because the TPA involves a variety of stakeholders and has a high-stakes impact on a large and diverse

population, the flexibility of ABV was necessary and key in its selection. Likewise, there is no preference made to include a specific type of data which allowed for a multiple-method design incorporating both correlational studies and personal narrative, both of which are central to understanding the proposed uses and value of the TPA in WA. Rather than a focus on one type of evidence, data is judged based on how well it responds to the inferences, assumptions, and uses as described in the interpretive argument (p. 534).

2. It does not “have to be associated with formal theories” (p. 534) which also allows it to be associated with any theory.

3. It “provides a basis for deciding on the kinds of evidence needed to validate a particular interpretation” (1990, p. 37).

4. It allows for a way to gauge progress preventing a never-ending, unwieldy investigation (Chapelle, 2011; Kane, 1992).

5. It increases the likelihood that the validation study results can lead to improvement in educational measurements, especially the assessment procedure for the assessment studied. It is productive with an aim to improve the procedure in order to make it more plausible (Kane, 2006).

6. “The term argument emphasizes the existence of an audience to be persuaded, the need to develop a positive case for the proposed interpretation, and the need to consider and evaluate competing interpretations” promoting a long-term view of validation practice that incorporates the needs of multiple, diverse stakeholders (1990, p. 37).

Finally, ABV is a preferred framework because the data and research can ultimately lead to the improvement of the instrument, rather than simply a critique of its uses. Because the TPA is a mandated instrument in WA, the overall goal of this study is not to eliminate or replace the TPA but to confirm that its claim to measure teacher readiness is strongly plausible and if/where it is not, to offer data to improve the TPA. Such an approach to validation studies can help to eliminate a bias either in support or against its use (Kane, 1992, p. 534). Because it can be used with high-stakes

assessments, assessments that are a part of a larger accountability system with multiple

stakeholders, and performance assessment, the ABV framework is ideally suited for an examination of the TPA.

Conclusion

Teacher educators have established a set of criteria to determine professional competence and readiness to teach. Over the past 30 years, an emphasis on evaluating those standards by using performance assessment has led to the widely acclaimed and adopted assessment for determining readiness, the Teacher Performance Assessment. Washington, with its emphasis on these shared criteria, as well as Student Voice and equity, has adopted the TPA for use in determining whether a teaching license should be granted. This practice has led to questions about the validity and reliability of the consequences and uses of TPA test scores, both for candidate licensure and for accountability of teacher training programs.

When it comes to validation studies that address consequences and uses, there are “two kinds of questions to ask of any procedure: can it be used as intended (a technical question) and

should it be used as intended (an ethical question)” (P. Newton, 2012, p. 14). This study suggests

that, in some cases, dissecting the consequences from the purpose of the assessment is a complex task. How do we validate an assessment whose evaluation question includes a construct of

accountability, itself a consequence of test score use? Validation as a contribution and practice depends upon a shared definition of what validity means and upon what such studies should focus. Any validation inquiry based upon a weak theory (IA) and too narrow a definition of validity will likely leave the end user without a true measure of whether their test-score-based-decisions are appropriate in the context in which they are being used. The benefit of ABV is that it allows the validator to select the evidence that is most needed (that most challenges the weakest links in the chain of inferences) based on the theory of validation required for those inferences based on the type of assessment studied and the use of the scores from that assessment. ABV applies the most

challenging consequences, best types of evidence, and most robust validation inquiry based on the inferences unique to that assessment procedure.

Kane reminds his readers that “The users of test scores have the primary responsibility for evaluating the decision that they adopt” (Kane, 2009, p. 62). As this is the case, users also have a responsibility to select the theory and model that will provide them with the most robust validation study. Because the TPA is a high-stakes performance assessment of teacher readiness, itself a fairly complex construct, ABV provides the best model for this undertaking. The TPA is a high-stakes measure for many diverse constituents and users. Therefore, it is even more important that the inferences and uses of TPA test scores be based on a validation study and not “beg the question” of validity and reliability (Kane, 2009).

This chapter outlined the standards currently used to determine teacher readiness in the US and the movement to assess those standards using performance assessment. The strengths and challenges of performance assessment, both as a measure of assessment and as an assessment in a validation study, were discussed. The TPA, as the performance assessment examined in this study, was described. Finally, a brief overview of the history of validity, its definition and practice, and the

Documento similar