15. Part One: Pilot tests
15.2 Second pilot test
Results of the MFRM (analytic) analysis are presented in Figure 9. The spread of test takers on the logit scale indicates a wide range of ability levels. Raters varied in severity between -1 and 1 logits. The map shows that Task 2 was scored higher than Task 1, indicating that test takers may have found Task 2 less challenging. The length of planning time also impacted scores: the ten-minute condition resulted in higher scores than the one-minute condition. The items complexity, accuracy and fluency are mapped onto the logit scale in column six. Complexity was the most difficult category, followed by fluency and accuracy. The following section discusses the MFRM statistics at length.
Figure 9. Wright map analytic scale
4.2.4.1 Facets in the MFRM model
Test Takers. The MFRM statistics show that test taker ability measures varied. The range of fair average values was 1.28 to 3.45 on the analytic scale. The separation index was 3.91 and test takers were separated into 5.55 strata. The mean standard error was .46 (standard deviation .12). This value is high and indicates that ability estimates may have been imprecise by approximately half a logit on average. The mean infit mean-square value was .98 (standard deviation .44), with individual values ranging from .51 to 2.20. The literature indicates that values exceeding 2.0 are cause for concern (Bachman, 2004; Linacre, 2013). Test taker 3 (2.00) and test taker 8 (2.20) record unacceptable fit statistics according to this standard. In order to investigate the cause of the misfit, the test takers’ raw scores are presented in Table 10.
Table 10 Scores awarded to test takers 3 and 8 by four raters
Test taker   Rater 1 (.17)    Rater 2 (.36)    Rater 3 (-.09)   Rater 4 (-.44)
3 (2.20)     4.3.3 / 3.3.3    2.2.2 / 2.1.1    3.3.3 / 3.3.2    4.2.3 / 4.3.4
8 (2.00)     3.3.4 / 3.2.2    4.3.4 / 2.2.2    4.4.3 / 3.3.2    4.4.4 / 3.3.3
Test taker 3 received high grades from Rater 1, who was more severe than Raters 3 and 4. In addition, Rater 4 awarded level 2 for accuracy to test taker 3, which appears to be inconsistent with the level of lenience established for this rater in the model. Test taker 8 received high scores from the most severe rater (Rater 2) and relatively high scores from Rater 1. This level of inconsistency may explain the misfit of these scores to the MFRM model. However, in order to keep the number of test takers equal between the MFRM of the EBB data and the MFRM of the analytic data, these test takers were retained.
Raters. Table 11 presents the results of the MFRM of the rater data. The raters varied in severity. Rater severity measures ranged from -.44 to .36 on the logit scale, and fair average values ranged from 2.57 to 2.80. Standard errors ranged from .14 to .18 (mean = .16). The infit mean-square index values were within an acceptable range of .81 to 1.23, which indicates that the raters fit the model well. The separation index was 1.60 and raters were separated into 2.47 strata, indicating that there were broadly two levels of rater severity. The reliability statistic was .72, which indicates reliable differences between the raters’ distribution of scores. A comparison of the analytic scale raters (reliability .72) with the EBB scale raters (reliability .91) suggests that the analytic scale raters were more likely to agree about suitable scores than the EBB scale raters: variation in rater severity was less likely to impact on test scores on the analytic scale.
Table 11 Report of rater severity: analytic scale
Rater   Fair Average   Severity Estimate   Error   Infit mean-square index
4       2.80           -.44                .18     .81
3       2.70           -.09                .16     .93
1       2.63            .17                .14     .84
2       2.57            .36                .15     1.23
Mean    2.67            .00                .16     .95
SD      .09             .30                .02     .17
Reliability of difference in severity of raters: .72
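The separation index, strata and reliability reported for the raters are linked by standard Rasch relations, so the figures can be cross-checked against one another. The sketch below assumes the usual formulas (strata = (4G + 1) / 3 and reliability = G² / (1 + G²), where G is the separation index); the function names are illustrative and are not taken from the analysis software.

```python
# Sketch: relate a Rasch separation index G to strata and reliability.
# The formulas are the standard ones; the function names are hypothetical.

def strata(separation: float) -> float:
    """Number of statistically distinct levels: H = (4G + 1) / 3."""
    return (4 * separation + 1) / 3


def separation_reliability(separation: float) -> float:
    """Reliability implied by a separation index: R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


rater_separation = 1.60       # rater separation index reported in the text
test_taker_separation = 3.91  # test taker separation index reported in the text

print(round(strata(rater_separation), 2))                  # 2.47
print(round(separation_reliability(rater_separation), 2))  # 0.72
print(round(strata(test_taker_separation), 2))             # 5.55
```

Under these assumptions the reported values are mutually consistent: a separation of 1.60 reproduces both the 2.47 strata and the .72 reliability for the raters, and a separation of 3.91 reproduces the 5.55 strata for the test takers.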
Tasks. Table 12 presents the results of the MFRM of the tasks. The results show that Task 2 was scored higher than Task 1 by .32 on the fair average scale. The result of the fixed chi-square test shows that this difference was significant at p < .001. The ordering of the tasks was similar in both the analytic and EBB analyses. The difference between the tasks indicates that different picture-based narrative tasks are required for the main study.
Table 12 Tasks 1 and 2 fair average, measure and infit statistics: analytic scale
Task   Fair Average   Measure   Infit mean-square index   Fixed (all same) chi-square
2      2.83           -.54      1.03                      χ2 = 48.8, p = .00
1      2.51            .54      .89
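The significance level for the task difference can be checked from the chi-square value alone. The sketch below assumes the fixed (all same) chi-square for a two-element facet has one degree of freedom, in which case the upper-tail probability has a closed form based on the complementary error function. The function name is illustrative.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))


task_chi2 = 48.8  # fixed chi-square reported for the two tasks (Table 12)
p_value = chi2_sf_df1(task_chi2)

print(p_value < 0.001)  # True
```

Under that assumption the p-value is far below .001, which is consistent with the value of .00 reported to two decimal places in the table.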
Planning. Table 13 presents the results of the MFRM of the planning conditions. The difference in planning time impacted the scores. The ten-minute condition led to a fair average value that was .51 higher than the one-minute value. The fixed chi-square test demonstrates that this difference was significant at p < .001. The hypothesis that planning would impact the scores on the analytic scale is therefore confirmed.
Table 13 Planning fair average, measure and infit statistics: analytic scale
Time     Fair Average   Measure   Infit mean-square index   Fixed (all same) chi-square
10 min   2.90           -.87      .86                       χ2 = 126.4, p = .00
1 min    2.39            .87      1.05
Complexity, Accuracy, Fluency. The score frequencies per category are presented in Table 14. The table shows that raters clearly avoided awarding level 5 to the test takers on all categories. There also appears to be a central tendency effect in operation as the third level on the scale was clearly the most frequently used (Myford and Wolfe, 2004).
Table 14 Frequencies of scores on analytic scale categories
Score   Accuracy   Complexity   Fluency
1       26         28           29
2       59         67           56
3       82         79           83
4       21         14           19
5       0          0            0
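The central tendency pattern in Table 14 can be summarised by finding the most frequently awarded level in each category and its share of all ratings. The following is a minimal sketch using the frequencies from the table; the variable names are illustrative.

```python
# Score frequencies from Table 14; list index 0 corresponds to level 1.
frequencies = {
    "Accuracy":   [26, 59, 82, 21, 0],
    "Complexity": [28, 67, 79, 14, 0],
    "Fluency":    [29, 56, 83, 19, 0],
}

summary = {}
for category, counts in frequencies.items():
    total = sum(counts)
    modal_level = counts.index(max(counts)) + 1  # convert index to scale level
    share = counts[modal_level - 1] / total
    summary[category] = (modal_level, total)
    print(f"{category}: modal level {modal_level}, {share:.0%} of {total} ratings")
```

Level 3 is the modal score in all three categories and accounts for somewhat over 40% of the ratings in each, while level 5 is never used.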
In addition to the overall MFRM of the analytic scale data, separate MFRM analyses were run to determine the impact of planning on each category of the analytic scale. Table 15 presents the results. To begin with fluency, the ten-minute condition resulted in a fair average value that was .56 higher than the one-minute condition. The result of the fixed chi-square test demonstrates that this difference was significant at p < .001. On the accuracy category, the ten-minute planning condition recorded a fair average value that was .43 higher than the one-minute condition. The fixed chi-square test shows that this result was significant at p < .001. On the complexity category, the ten-minute condition led to a fair average value that was .55 higher than the one-minute condition. The result of the fixed chi-square test was significant at p < .001. These results demonstrate that extra planning time increased scores in each category of the analytic rating scale. The largest increases in scores occurred on the fluency and complexity categories.
Table 15 Complexity, accuracy, fluency fair average, measure and infit statistics
Category     Planning   Fair Average   Measure   Infit mean-square index   Fixed (all same) chi-square
Fluency      10 min     2.97           -1.05     1.01                      χ2 = 54.7, p = .00
             1 min      2.41            1.05     .92
Accuracy     10 min     2.94           -.86      .89                       χ2 = 37.7, p = .00
             1 min      2.51            .86      1.02
Complexity   10 min     2.87           -1.09     .74                       χ2 = 54.0, p = .00
             1 min      2.32            1.09     1.14
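The planning gains quoted for each category follow directly from the fair average values in Table 15. The short sketch below recomputes them as the ten-minute value minus the one-minute value; the dictionary layout is illustrative.

```python
# Fair average values from Table 15, by category and planning condition.
fair_averages = {
    "Fluency":    {"10 min": 2.97, "1 min": 2.41},
    "Accuracy":   {"10 min": 2.94, "1 min": 2.51},
    "Complexity": {"10 min": 2.87, "1 min": 2.32},
}

# Gain from extra planning time, rounded to two decimals as in the text.
gains = {
    category: round(values["10 min"] - values["1 min"], 2)
    for category, values in fair_averages.items()
}

for category, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: +{gain:.2f}")
```

The recomputed gains are .56 for fluency, .55 for complexity and .43 for accuracy, matching the ordering described in the text.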