The first step in addressing research question 1, which concerns the relationship between the EPT and two external measures (the self-assessment and the TOEFL iBT), is to examine whether the self-assessment, a new instrument developed in this study, possesses good psychometric properties. The self-assessment takes the form of can-do statements rated on a 6-point Likert scale. A Rasch rating scale model-based item analysis was conducted in Winsteps (Linacre, 2006) to investigate item reliability, person reliability, item difficulty, item discrimination, and scale functioning.
The Rasch rating scale model, as a member of the one-parameter item response theory (IRT) models, is capable of analyzing polytomous data such as responses to Likert-scale items.
Other polytomous models include the Rasch partial credit model, the generalized partial credit model (a 2-parameter IRT model), and the graded response model (a 2-parameter IRT model) (de Ayala, 2009). The polytomous IRT models enjoy several advantages over classical test theory (CTT), another popular approach to analyzing Likert scale-based data and developing survey instruments. CTT assumes that an observed score consists of a true score and the measurement error associated with each item in a test. In the CTT framework, item difficulty is calculated as the percentage of test-takers who answered the item correctly, and item discrimination is calculated as the item-total correlation, which reflects the degree to which an item can distinguish high-proficiency test-takers from low-proficiency test-takers. One major limitation of CTT is that these two indices are sample dependent; in other words, test-takers’ ability may be labeled as drastically different if a much more difficult or much easier test were administered (Carr, 2011).
From the perspective of polytomous IRT models for Likert data analysis, item difficulty is understood as the extent to which a respondent would endorse a certain category on the Likert scale given his or her level on the target construct, for example, motivation or self-efficacy (Sick, 2009). When the unidimensionality assumption is met, the IRT models position item difficulty on the same logit scale as the test-taker’s level on the target construct, and they conceptualize an item score as the degree to which the test-taker’s level on the target construct matches the item’s difficulty or endorsability. In this sense, the difficulty parameter in the IRT models is sample-independent. This is particularly important in scale development because item parameters calibrated on pilot data are expected to be invariant across sample groups.
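The two CTT indices described above can be illustrated with a minimal sketch. It assumes dichotomously scored items (0 = incorrect, 1 = correct) and uses the uncorrected item-total correlation; the function names and the small response matrix are hypothetical and are not data from this study.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (non-zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ctt_item_statistics(responses):
    """Return a (difficulty, discrimination) pair for each item.

    responses: one row per test-taker, each row a list of 0/1 item scores.
    difficulty: proportion of test-takers answering the item correctly.
    discrimination: correlation between the item score and the total score.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for i in range(len(responses[0])):
        item_scores = [row[i] for row in responses]
        difficulty = sum(item_scores) / len(item_scores)
        discrimination = pearson(item_scores, totals)
        stats.append((difficulty, discrimination))
    return stats

# Hypothetical responses from four test-takers on three items.
example_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
example_stats = ctt_item_statistics(example_responses)
```

Because both indices are computed directly from the observed responses, rerunning the function on a more or less able group of test-takers would change the values, which is the sample dependence noted above.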
In a CTT-based analysis of Likert scale data, the raw scores on each item are summed to form a scale score, even though Likert scale data are ordinal in nature (Sick, 2009). Another advantage of the polytomous IRT models is that they treat Likert scale data as ordinal data, rather than as interval data as in non-IRT analyses, but transform the “counts of the endorsement of these ordered Likert categories into interval scales based on actual empirical evidence” (Bond & Fox, 2007, p. 106). Therefore, a data-based threshold structure of Likert scale items can be detected empirically, and a true measurement scale with equal intervals in logit units can be established (Davidson & Henning, 1985).
Within the IRT family, both the Rasch rating scale model and the graded response model have been widely used to examine the psychometric properties of Likert scale-based instruments (de Ayala, 2009). Both models provide more psychometric information at the item level and the test level than the CTT approach. The major difference between the two is that the Rasch rating scale model estimates a single item parameter, the item difficulty or endorsability, while the graded response model estimates both an item difficulty parameter and an item discrimination parameter. The cost of estimating two parameters in the graded response model is the need for a larger sample: at least 500 respondents for stable parameter estimation (de Ayala, 2009). For this practical reason, the Rasch rating scale model was used in this study.
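To make the single-parameter structure concrete, the following minimal sketch computes response-category probabilities under the standard Andrich formulation of the Rasch rating scale model, in which each item has one difficulty and all items share one set of step thresholds. The function name and the threshold values are hypothetical illustrations; they are not the Winsteps estimates from this study.

```python
import math

def rsm_category_probabilities(theta, delta, thresholds):
    """Return P(X = k) for k = 0..m under the Rasch rating scale model.

    theta: person measure in logits
    delta: item difficulty (endorsability) in logits
    thresholds: the m step thresholds shared by all items, in logits
    """
    # Log-numerator of category k is the sum of (theta - delta - tau_j)
    # over the first k steps; category 0 is fixed at 0 by convention.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical thresholds for a 6-point scale (five steps), illustration only.
example_thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = rsm_category_probabilities(theta=0.0, delta=0.0,
                                   thresholds=example_thresholds)
expected_score = sum(k * p for k, p in enumerate(probs))
```

In this sketch only `delta` varies from item to item, which reflects why the rating scale model needs fewer parameters than the graded response model.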
Like other IRT models, the Rasch rating scale model assumes that the items measure the same unidimensional construct; this is known as the unidimensionality assumption (Bond & Fox, 2007). This assumption was checked with both exploratory factor analysis (maximum likelihood extraction with promax rotation) and the Rasch principal component analysis of residuals.
Fit statistics are used to assess whether an item functions as the Rasch model expects. Infit is a weighted fit statistic and is less sensitive to outliers than outfit, an unweighted fit statistic. The expected mean square value of both the infit and outfit statistics is 1.0. Values between 0.5 and 1.5 are generally regarded as indicating acceptable fit to the Rasch model (Green, 2013). Another relevant item quality index is the point-biserial correlation coefficient of each item, which is the discrimination index in the classical test theory (CTT) framework.
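The difference between the two fit statistics can be sketched as follows, using the standard definitions: outfit is the mean of the squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The sketch assumes that person measures, the item difficulty, and the thresholds have already been estimated; all function names and values are hypothetical, and this is not the Winsteps implementation.

```python
import math

def category_probabilities(theta, delta, thresholds):
    """Rating scale model probabilities for categories 0..m."""
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def item_fit(responses, thetas, delta, thresholds):
    """Return (infit, outfit) mean squares for one item.

    responses: observed category (0..m) for each person on this item
    thetas: the corresponding person measures in logits
    """
    squared_residual_sum = 0.0
    variance_sum = 0.0
    standardized_sum = 0.0
    for x, theta in zip(responses, thetas):
        probs = category_probabilities(theta, delta, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residual = (x - expected) ** 2
        squared_residual_sum += squared_residual
        variance_sum += variance
        standardized_sum += squared_residual / variance
    infit = squared_residual_sum / variance_sum      # information-weighted
    outfit = standardized_sum / len(responses)       # unweighted
    return infit, outfit

def is_acceptable(mean_square, low=0.5, high=1.5):
    """Apply the 0.5 to 1.5 range cited in the text."""
    return low <= mean_square <= high

# Hypothetical dichotomous item: four on-target persons plus one
# low-measure person who unexpectedly succeeds (an outlier).
infit, outfit = item_fit([1, 0, 1, 0, 1], [0.0, 0.0, 0.0, 0.0, -3.0],
                         delta=0.0, thresholds=[0.0])
```

In this hypothetical case the single unexpected response inflates outfit more than infit, because infit weights each residual by the model variance, which is small for persons far from the item's difficulty.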
As with item fit in Rasch analysis, the categories of the Likert scale should also exhibit good model fit. According to Bond and Fox (2007), four characteristics of a rating scale should be checked: category frequency, monotonicity of category average measures, threshold or step calibrations, and category fit. Category frequency is the number of times each level or response category on the Likert scale was chosen or endorsed by the respondents. Monotonicity of category average measures means that the average ability, measured in logits, increases as the level or category on the Likert scale increases. Threshold or step calibrations are the estimated difficulties of choosing one category over the adjacent one. Like item fit in Rasch modeling, category fit comprises infit and outfit, both of which indicate scale quality by showing the extent to which the response categories function as the model expects. It is recommended that the count for each category be no less than 10, that the distance between adjacent thresholds be at least 1.4 logits but less than 5 logits, and that the infit statistic for each category be acceptable.
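The first three checks above can be expressed as a short sketch. It assumes the category counts, category average measures, and threshold calibrations have already been obtained from the Rasch software output. The function name and the example values are hypothetical. Category fit is left out because it requires the model residuals.

```python
def check_rating_scale(counts, average_measures, thresholds,
                       min_count=10, min_gap=1.4, max_gap=5.0):
    """Check category frequency, monotonicity, and threshold spacing.

    counts: number of responses observed in each category
    average_measures: mean person measure (logits) per category
    thresholds: step calibrations (logits) between adjacent categories
    """
    # Distance between each pair of adjacent thresholds.
    gaps = [upper - lower for lower, upper in zip(thresholds, thresholds[1:])]
    return {
        # Every category should be chosen at least min_count times.
        "frequency_ok": all(c >= min_count for c in counts),
        # Average measures should rise strictly with the category level.
        "monotonic_ok": all(a < b for a, b in
                            zip(average_measures, average_measures[1:])),
        # Adjacent thresholds should be at least min_gap but less than
        # max_gap logits apart.
        "threshold_gaps_ok": all(min_gap <= g < max_gap for g in gaps),
        "gaps": gaps,
    }

# Hypothetical output for a 6-category scale, illustration only.
report = check_rating_scale(
    counts=[12, 30, 45, 50, 28, 9],
    average_measures=[-2.1, -1.0, -0.2, 0.6, 1.5, 2.4],
    thresholds=[-3.0, -1.5, 0.0, 1.2, 2.8],
)
```

With these made-up values the top category falls below the minimum count and one pair of thresholds is closer than 1.4 logits, so such a scale would be a candidate for collapsing adjacent categories.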
The same item analysis procedures were applied to the questionnaire items assessing motivation, academic self-efficacy, and self-regulated learning strategies.