

This section examines the use of the proposed Bayesian approaches to model checking and model comparison for a real mathematics performance assessment, the QUASAR Cognitive Assessment Instrument (QCAI). QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) was a national project that sought to demonstrate the feasibility of implementing instructional programs in the middle-school grades that promote the acquisition of thinking and reasoning skills in mathematics (Silver, 1991). The QCAI was a performance assessment developed for the QUASAR project to evaluate the impact of innovative instructional programs on middle-school students’ mathematical thinking and reasoning in four sub-domains: reasoning, problem solving, communication, and understanding of the features that characterize mathematical concepts and their interrelations (Lane, 1993). The QCAI includes four test forms (A, B, C, and D), each containing 9 different open-ended tasks scored at 5 levels (0-4). These four forms were randomly distributed within each sixth- and seventh-grade class in the schools participating in the QUASAR project (Lane, Stone, Ankenmann & Liu, 1995). The test was administered in both the fall and the spring during 1990, 1991, and 1992.

Several researchers have examined the extent to which the QCAI response data met the assumptions and properties underlying the GR model. Lane et al. (1995) conducted a comprehensive study to evaluate the dimensionality, speededness, and item parameter invariance for each of the four QCAI forms across three administration occasions (Spring 1991, Fall 1991, and Spring 1992). They examined dimensionality using confirmatory factor analysis and eigenvalue plots. The factor analysis results indicated that each of the four forms of the QCAI was essentially unidimensional. However, the tasks with lower factor loadings in the one-factor solution were those requiring some type of explanation, whereas the tasks with relatively high loadings generally required students only to display their mathematical solution strategies. Lane et al. (1995) therefore explored two-factor models. In one two-factor model, the first factor included all tasks except those requiring a nonprocedural explanation, and the second factor included only the tasks requiring a nonprocedural explanation. In another, the first factor included only the tasks requiring both the display of solution strategies and an explanation, and the second factor included the tasks requiring only the solution strategies. There was no substantial statistical evidence to support either two-factor model, which provided additional evidence that one dominant dimension underlies the item responses to the QCAI.

Speededness was investigated for tasks by statistically comparing hierarchical GR models using two groups of students with different administration time lengths. For two of the eight tasks examined, only the slope parameter estimates differed, and for another two tasks, both the slope and threshold parameter estimates differed. The stability of QCAI item parameter estimates over time was investigated using restricted IRT models within a multiple-group analysis in MULTILOG. The results indicated that the parameter estimates were stable for the first year, but not stable for the second year.
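The slope and threshold parameters referred to above can be made concrete with a minimal sketch of the category probabilities under a logistic graded response model for an item scored 0-4. The function name, the symbols (`theta`, `slope`, `thresholds`), the numeric values, and the choice of a logistic form without a scaling constant are assumptions made for illustration; they are not taken from the studies cited here.

```python
import math

def gr_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under a logistic graded response model.

    theta      -- examinee ability
    slope      -- item slope (discrimination) parameter
    thresholds -- increasing list of K-1 threshold parameters for K score levels
    """
    # Cumulative probabilities P(X >= k); P(X >= 0) = 1 and P(X >= K) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Probability of each score level is the difference of adjacent cumulatives.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item with 5 score levels (0-4), hence 4 thresholds.
probs = gr_category_probs(theta=0.0, slope=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Under this parameterization, a group difference in slope changes how sharply the category probabilities respond to ability, while a difference in thresholds shifts where along the ability scale each score level becomes likely.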

Notably, to select an appropriate GR model for scaling the QCAI data, Lane et al. (1995) compared two hierarchical models: a two-parameter (2P) GR model and a one-parameter (1P) GR model that restricted the slope parameters to be equal across items. The models were compared using their log-likelihood statistics. A significant difference between the statistics indicated that the 2P GR model fit the data better than the 1P GR model.
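A comparison of this kind is commonly carried out as a likelihood-ratio test, sketched below under stated assumptions. The log-likelihood values and the item count are invented purely for illustration and are not results from Lane et al. (1995); the helper names are hypothetical. The degrees of freedom assume that the only difference between the models is the equality constraint on the slopes, which removes one free parameter per item beyond the first.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from a closed form: Q(1, h) for even df, Q(1/2, h) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_restricted, loglik_full, df):
    """Return the likelihood-ratio statistic and its chi-square p-value."""
    g2 = -2.0 * (loglik_restricted - loglik_full)
    return g2, chi2_sf(g2, df)

# Hypothetical values, NOT taken from any QCAI analysis.
n_items = 8                      # assumed number of items on a form
loglik_1p, loglik_2p = -4210.0, -4195.0
g2, p_value = likelihood_ratio_test(loglik_1p, loglik_2p, df=n_items - 1)
print(g2, p_value)
```

A small p-value would favor the less restricted 2P model, which is the direction of the result reported above.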

Item-level goodness of fit for the QCAI was investigated by Stone, Ankenmann, Lane, & Liu (1993) and later reexamined by Stone (2000). Because the small number of tasks on each QCAI form leads to imprecise point estimates of ability, the researchers used Stone’s item-fit statistic G2* to assess the fit of each QCAI task to the GR model. The two studies differed in the Monte Carlo resampling approach used for hypothesis testing with the fit statistic.

Stone et al. (1993) used a Monte Carlo resampling method that required estimation of the GR model for each simulated dataset, thus accounting for uncertainty in both item and ability parameters when generating the simulated null distribution of the G2* statistic. Fit was evaluated for each of the items on the four forms (A-D) across four administration occasions (Fall 1990, Spring 1991, Fall 1991, and Spring 1992) by comparing the G2* statistic with the simulated null distributions. A few flawed items were excluded from the analyses for the earlier administrations. The total number of tasks on the four forms was 30 for the first two administrations and 33 for the last two administrations (three flawed tasks were revised and included). The results indicated that 12 tasks fit the data across all four administrations, only 1 task did not fit the data across the four administrations, 2 tasks did not fit the data across three of the four administrations, 7 tasks did not fit the data across two of the four administrations, and 9 tasks did not fit the data for one of the four administrations.

The resampling method used by Stone et al. (1993) was computationally intensive due to the requirement that item parameters be estimated for each Monte Carlo sample. To reduce the computational complexity, Stone (2000) proposed an alternative resampling method that used the item parameter estimates based on the real data for all Monte Carlo samples. Thus, the step involving re-estimation of the GR model for each sample was eliminated. Stone (2000) also proposed a procedure for estimating a scaling factor that could be used to rescale the fit statistic to approximate the null distribution for hypothesis testing. For this method, only uncertainty in ability estimation was considered in generating the sampling null distribution of the G2* statistic.

Uncertainty in item parameter estimation was considered by adjusting the derived df by the number of estimated item parameters. To compare this alternative resampling method with the previous one, the fit of 62 QCAI items from two of the four administrations used in Stone et al. (1993) was reanalyzed using the alternative resampling method, and the results were compared with those from the previous study. Although agreement between the two studies on the fit of these QCAI items was generally high, there was some disagreement. The disagreement existed primarily for items found to be significantly “misfitting” in Stone et al. (1993) but not significantly “misfitting” using the alternative resampling method.
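The contrast between the two resampling schemes can be sketched generically as follows. This is a schematic illustration only: the function and argument names are hypothetical, the toy model is a simple Bernoulli stand-in rather than the GR model, the statistic is not G2*, and the `(exceed + 1) / (n_rep + 1)` convention for the Monte Carlo p-value is an assumption rather than something stated in either study. The sketch also omits the scaling-factor step proposed by Stone (2000).

```python
import random

def monte_carlo_p_value(observed_stat, simulate_data, fit_statistic,
                        fitted_params, refit=None, n_rep=500, seed=1):
    """Monte Carlo p-value of a fit statistic under the fitted model."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_rep):
        data = simulate_data(fitted_params, rng)
        # refit given: re-estimate the model on every simulated dataset
        #   (the computationally intensive scheme).
        # refit is None: reuse the estimates from the real data
        #   (the cheaper alternative scheme).
        params = refit(data) if refit is not None else fitted_params
        if fit_statistic(data, params) >= observed_stat:
            exceed += 1
    return (exceed + 1) / (n_rep + 1)

# Toy stand-in model (NOT the GR model): 50 Bernoulli trials with probability p.
def simulate_data(p, rng):
    return [1 if rng.random() < p else 0 for _ in range(50)]

def fit_statistic(data, p):
    # Absolute deviation of the observed count from its expected value.
    return abs(sum(data) - p * len(data))

p_fixed = monte_carlo_p_value(10, simulate_data, fit_statistic, fitted_params=0.5)
print(p_fixed)
```

Skipping the re-estimation step removes the dominant cost per replication, which is the motivation given above for the alternative method.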

In the current study, the PPMC method was used to re-examine the fit of the QCAI to the two-parameter GR model in terms of unidimensionality, local independence, and item fit. All 8 discrepancy measures used in Simulation Study 1 were used with PPMC for this real application, and the results were compared with those from the previous studies. In addition, the 1P GR and 2P GR models were compared again using the proposed Bayesian model-comparison tools to see whether the 2P GR model fit the QCAI data better, as found in Lane et al. (1995). Moreover, a 2-dimensional complex-structure GR model was estimated to see whether a complex multidimensional model was preferred over the simple unidimensional GR model. In this multidimensional model, the first dimension included all items, and the second dimension included only the items requiring an explanation. It should be noted that only Yen’s Q3 statistic and the global OR measure were used with PPMC for the 2-dimensional complex-structure model, since these two measures were found to be the most effective in the simulation studies.
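For reference, Yen's Q3 for a pair of items is the correlation, across examinees, between the two items' residuals, where a residual is the observed score minus the score expected under the model. The sketch below computes Q3 for every item pair from a matrix of observed scores and a matrix of expected scores. The function names and the toy numbers are hypothetical, and the expected scores are simply passed in; in a PPMC setting they would presumably be computed from each posterior draw for both the observed and the replicated data, which is an assumption about usage rather than a description of the implementation used in this study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def yen_q3(responses, expected):
    """Q3 for each item pair; both inputs are persons-by-items lists of lists."""
    residuals = [[obs - exp for obs, exp in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(responses, expected)]
    columns = list(zip(*residuals))  # one residual vector per item
    n_items = len(columns)
    return {(j, k): pearson(columns[j], columns[k])
            for j in range(n_items) for k in range(j + 1, n_items)}

# Toy data (hypothetical): 4 examinees, 3 items scored 0-4.
responses = [[1, 1, 0], [2, 2, 4], [3, 3, 1], [0, 0, 2]]
expected = [[1.5, 1.5, 1.5]] * 4
q3 = yen_q3(responses, expected)
print(q3[(0, 1)])
```

Large positive Q3 values for a pair of items suggest dependence that the model does not account for, which is why the measure is informative about local independence and dimensionality.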

For this real data application, three QCAI forms with 8 items each were reanalyzed: Form A administered in Spring 1991 (AS91), Form A administered in Spring 1992 (AS92), and Form B administered in Spring 1992 (BS92). The sample sizes were 399, 459, and 446 for the AS91, AS92, and BS92 forms, respectively.

Table 3.18 compares the decisions regarding item fit for the items on these three forms from Stone et al. (1993) and Stone (2000). All decisions regarding item fit were made at the α = 0.05 level of significance. Misfitting items are indicated by asterisks. As seen in this table, Stone et al. (1993) identified 4 misfitting items for the AS91 form, 2 misfitting items for the AS92 form, and 5 misfitting items for the BS92 form. However, two of these items were not identified as misfitting by Stone (2000). The fit of these items was re-examined using the PPMC method, and the results were compared with the results in this table.

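Table 3.18 Misfitting Items Identified in Stone et al. (1993) and Stone (2000)

| Item | AS91: Stone et al. (1993) | AS91: Stone (2000) | AS92: Stone et al. (1993) | AS92: Stone (2000) | BS92: Stone et al. (1993) | BS92: Stone (2000) |
|---|---|---|---|---|---|---|
| 1 | * | * |  |  | * | * |
| 2 |  |  | * |  | * | * |
| 3 | * |  | * | * | * | * |
| 4 |  |  |  |  |  |  |
| 5 | * | * |  |  |  |  |
| 6 |  |  |  |  | * | * |
| 7 |  |  |  |  | * | * |
| 8 | * | * |  |  |  |  |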

When a 2-dimensional complex-structure GR model was used to analyze the AS91 or AS92 datasets, four explanation items (Items 1, 5, 7, and 8) loaded on the two dimensions, and all other items only loaded on the first dimension. For the BS92 dataset, three explanation items (Items 1, 5, and 8) loaded on both dimensions, and all other items only loaded on the first dimension.
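The loading structure described above can be written out as an indicator pattern, with one row per item and one column per dimension. This is only a restatement of the text in code form; the function and variable names are hypothetical, and a 1 marks a loading that is estimated while a 0 marks one that is fixed at zero.

```python
def loading_pattern(n_items, second_dim_items):
    """Rows are items 1..n_items; columns are [dimension 1, dimension 2]."""
    return [[1, 1 if item in second_dim_items else 0]
            for item in range(1, n_items + 1)]

# Explanation items that load on both dimensions, as listed in the text.
PATTERNS = {
    "AS91": loading_pattern(8, {1, 5, 7, 8}),
    "AS92": loading_pattern(8, {1, 5, 7, 8}),
    "BS92": loading_pattern(8, {1, 5, 8}),
}

for form, pattern in PATTERNS.items():
    print(form, pattern)
```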

With regard to the implementation of MCMC and PPMC in WinBUGS, a chain of 15000 iterations was run to estimate, test, and compare the fit of the two-parameter GR model, the one-parameter GR model, and the 2-dimensional complex-structure GR model. The first 10000 iterations were discarded as the burn-in phase, and the remaining 5000 iterations were thinned by selecting every 5th iteration, yielding posterior distributions based on 1000 draws. The implementation of PPMC and the computation of the model-comparison indices were based on this posterior sample.
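The arithmetic of the burn-in and thinning scheme can be checked with a short sketch. The function name is hypothetical, and which iteration within each block of five is retained is an assumption, since the text states only that every 5th iteration was selected.

```python
def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th remaining draw."""
    return chain[burn_in::thin]

# Stand-in chain: iteration numbers 1..15000 in place of sampled parameter values.
chain = list(range(1, 15001))
posterior_sample = burn_and_thin(chain, burn_in=10000, thin=5)
print(len(posterior_sample))
```

With 15000 iterations, a burn-in of 10000, and a thinning interval of 5, the retained sample contains 5000 / 5 = 1000 draws, matching the posterior sample size stated above.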
