As mentioned above, one way in which internal consistency can be determined is through inter-rater reliability and agreement. High levels of inter-rater reliability refer to the consistency “between evaluators in the ordering or relative standing of performance ratings, regardless of the absolute value of each evaluator’s rating” (Graham, Milanowski, & Miller, 2012, p. 5). Inter-rater agreement indicates the degree to which multiple evaluators use the same rubric scale to give the same score to identical evidence (p. 5). For this validation study, inter-rater agreement is of more importance than inter-rater reliability because score consequences are based on cut scores and agreement on absolute levels of performance.

Teacher candidate pass rate (n=58)
Cut score    Overall pass rate
35           93%
36 - 37      91%
38           90%
39 - 40      88%
41           83%
42           78%
43+          71%
Pearson is contracted to facilitate scoring. One set of raw scores per rubric, per candidate, was reported to Sterner. Pearson did not report scores to candidates. In addition to the Pearson scores, supervisors were asked to score each candidate’s TPA. Using these two sets of rubric scores, rubric score differences and similarities were analyzed. Two common indices for measuring inter-rater agreement were calculated: the percentage of absolute and adjacent agreement, and Cohen’s Kappa.
The percentage of absolute agreement refers to how often raters agree on the exact level or score given to each rubric. This percentage is simple to calculate in this study because the number of raters is small. However, results are difficult to interpret because the calculation does not account for the degree to which “chance” or “random” factors may explain agreement. In addition, because the TPA rubric has five levels and paired scores may differ by one or several levels, absolute agreement does not distinguish between these different degrees of disagreement. For this reason, adjacent agreement percentages, or the frequency with which scores fell within +1/-1, +2/-2, and +3/-3 of each other, will also be shared. Cohen’s Kappa reports how well scorers agreed while addressing the “chance” or “random” factors that may influence scores. Kappa is considered a better estimate of agreement when raters are reporting for different groups of candidates. Cohen’s Kappa is, however, more difficult to interpret when the rubrics have many levels (five or more) and can be “misleadingly low if a large majority of ratings are at the highest or lowest levels” (Graham, Milanowski, & Miller, 2012, p. 8).
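The indices described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function names (`absolute_agreement`, `agreement_within`, `cohens_kappa`) and both pairs of score lists are hypothetical, and Kappa is computed in its standard unweighted two-rater form.

```python
from collections import Counter


def absolute_agreement(rater_a, rater_b):
    """Proportion of paired scores that are exactly equal."""
    pairs = list(zip(rater_a, rater_b))
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_within(rater_a, rater_b, levels):
    """Proportion of paired scores that differ by at most `levels` rubric levels."""
    pairs = list(zip(rater_a, rater_b))
    return sum(abs(a - b) <= levels for a, b in pairs) / len(pairs)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater_a)
    p_observed = absolute_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters choose the same level
    # if each scored independently according to their own score distribution.
    p_expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores on a five-level rubric (not data from the study).
supervisor = [3, 3, 4, 2, 5, 3, 4, 3, 2, 4]
pearson = [3, 4, 4, 2, 4, 3, 3, 3, 2, 5]

print(absolute_agreement(supervisor, pearson))      # exact matches
print(agreement_within(supervisor, pearson, 1))     # within +1/-1
print(round(cohens_kappa(supervisor, pearson), 2))  # chance-corrected

# Hypothetical ratings clustered at the top level.
clustered_a = [5, 5, 5, 5, 5, 5, 5, 5, 5, 4]
clustered_b = [5, 5, 5, 5, 5, 5, 5, 5, 4, 5]

print(absolute_agreement(clustered_a, clustered_b))
print(round(cohens_kappa(clustered_a, clustered_b), 2))
```

In the second pair the raters give identical scores on 80% of items, yet Kappa comes out slightly negative because nearly all ratings sit at a single level. This is the situation the quoted caution about “misleadingly low” values describes.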
Findings.
In total, 920 rubrics were scored by both Pearson and the supervisor, for thirty candidates. Of these, 271 were scored differently (.29). The majority (.79) of these differences fell within one rubric level (+1/-1). Most (.52) occurred when supervisors scored the candidate one level higher (+1) than Pearson scorers, and a further .27 occurred when supervisors scored the candidates one level lower (-1). Differences of two rubric levels (+2/-2) accounted for 18% of the total difference, and differences of three rubric levels (+3/-3) for 3%.
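The arithmetic behind these figures can be checked with a short sketch. It assumes the reported breakdown (.52, .27, 18%, 3%) is expressed as shares of the 271 differing score pairs, which is the only reading under which the shares sum to 1. The variable names are illustrative, and the last quantity is derived here rather than reported in the study.

```python
# Figures reported in the findings.
total_scores = 920      # rubric scores given by both Pearson and the supervisor
differing_scores = 271  # pairs in which the two scores were not identical

share_differing = differing_scores / total_scores
share_identical = 1 - share_differing

# Reported breakdown of the differing pairs (assumed to be shares of the 271).
breakdown = {"+1": 0.52, "-1": 0.27, "two levels": 0.18, "three levels": 0.03}

share_within_one = breakdown["+1"] + breakdown["-1"]
share_two_or_more = breakdown["two levels"] + breakdown["three levels"]

# Derived, not reported: share of all 920 pairs that are identical
# or at most one rubric level apart.
overall_within_one = share_identical + share_differing * share_within_one

print(round(share_differing, 2))     # proportion of pairs that differ
print(round(share_identical, 2))     # proportion in absolute agreement
print(round(share_within_one, 2))    # differences of exactly one level
print(round(share_two_or_more, 2))   # differences of two or more levels
print(round(overall_within_one, 2))  # identical or adjacent, all pairs
```

Under that assumption, about .29 of pairs differ and about .71 match exactly, and roughly .94 of all pairs are either identical or one level apart.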
The three rubrics with the greatest scorer difference were Rubric 3 (Planning), Rubric 11 (AL), and Rubric 13 (SV). Other rubrics with more than 30% scorer difference were Rubric 1 (Planning), Rubric 2b (Planning), Rubric 8 (Assessment), and Rubric 10 (AL). These are also the rubrics with the widest range of score differences (+3/-3) (see Appendix O for agreement indexes).
One question sometimes asked about inter-rater agreement is whether that agreement should be assessed at the standard (rubric) level, construct level (task, category, domain), or by overall scores. Construct agreement varied (.68 to .74). Most agreement occurred in Task 4 scores (.74). Task 4 has only one rubric. Least agreement occurred for Task 1 (.68), which has the highest number of rubrics (see Appendix P for agreement percentages).
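The difference between rubric-level and construct-level agreement can be illustrated with a small sketch. The records, the rubric-to-task mapping, and the helper `agreement_by` are all hypothetical and are not taken from the study's data. The sketch only shows that the same paired scores can be grouped at either level.

```python
from collections import defaultdict


def agreement_by(records, key):
    """Absolute agreement (share of identical score pairs) per group of `key`."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += record["supervisor"] == record["pearson"]
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical paired scores; rubric and task labels are illustrative only.
records = [
    {"candidate": "c1", "rubric": "R1", "task": "Task 1", "supervisor": 3, "pearson": 3},
    {"candidate": "c1", "rubric": "R2", "task": "Task 1", "supervisor": 3, "pearson": 4},
    {"candidate": "c1", "rubric": "R5", "task": "Task 2", "supervisor": 4, "pearson": 4},
    {"candidate": "c2", "rubric": "R1", "task": "Task 1", "supervisor": 2, "pearson": 2},
    {"candidate": "c2", "rubric": "R2", "task": "Task 1", "supervisor": 4, "pearson": 3},
    {"candidate": "c2", "rubric": "R5", "task": "Task 2", "supervisor": 3, "pearson": 3},
]

by_rubric = agreement_by(records, "rubric")
by_task = agreement_by(records, "task")

print(by_rubric)
print(by_task)
```

In this toy data, full agreement on one rubric and none on another average out to .5 for the task that contains both, so a construct-level figure can hide large rubric-level differences. A construct with more rubrics also has more opportunities for scorers to disagree.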
Support for validity.
Inter-rater agreement levels for absolute agreement between 75% and 90% are considered acceptable (Graham, Milanowski, & Miller, 2012). Two rubrics fall into this category (R12 and R14). Four rubrics miss the cut by one percentage point. The majority of the difference in rubric scores (.79) falls within one rubric scoring level (+1/-1). A review of the literature suggests that most studies of evaluations of teaching report a lower average level of absolute agreement, around seventy percent (p. 10). Against this average, the percentage of absolute agreement for TPA scores (.71) falls into an acceptable range. In addition, half of the rubrics, and the rubric averages for Task 2, Task 3, Task 4, AL, and SV, meet this benchmark level. Applying Cohen’s Kappa, which takes into account the “chance” that agreement would occur between the two scorers, the inter-rater agreement level is .65, which meets the minimum level for consequential use. This value is well above the average reported in the literature (.54), suggesting that inter-rater agreement is stronger than the average reported in studies of assessment practice in teaching observation evaluations (Graham, Milanowski, & Miller, 2012).
Threats to validity.
The more consequential the examination results, the greater the burden for high reliability levels. One would not expect to see scorer differences of more than one rubric level, yet .21 of the score differences span two or more levels (+2/-2 or more). Given that the scores will be used to determine candidate licensure, an absolute agreement in the mid-to-high .8 range would be more acceptable. Similarly, though well above the average from published studies, the Kappa should be closer to .8 than .6, which indicates that there may not be high levels of agreement in scorer rankings. It is important to note that the data suggest that .29 of those rated could have received a different score had their TPA been scored by a different rater. In this study, that means scores could have differed for seventeen of the fifty-eight students (n=17/58). This is a fairly substantial number, and it is difficult to evaluate whether high-stakes decisions should be based on these findings.
Inconsistent scorer interpretations of rubrics and TPA language may have produced inconsistent scores. Therefore, this field test may not consistently represent how the candidate actually performed. Many factors can affect inter-rater agreement. In fact, 100% agreement is not really preferable because of the high cost and time commitment that such agreement would require. However, some professional agreement is necessary, especially for high-stakes decisions made from scores. The following are some factors that can affect agreement: rater training, rater selection, accountability for accuracy in ratings, adequacy of rater compensation, rubric designs, rubric scales, pilot programs and redesigns, and technology use (Graham, Milanowski, & Miller, 2012, pp. 15-22). Validity Evidence 4 through 6 will examine some of these factors.