
Each participant’s responses were scored according to task type. MC task responses were scored automatically. Each cloze test was scored by a trained rater and the researcher against an answer key, using an acceptable-response scoring method. Each cloze blank had a known, intended response based on the source text, but near-synonyms of various degrees of specificity were allowed for full or partial credit, and words that fit the context semantically but were grammatically incorrect were worth half a point. Although the cloze tests were scored discretely, with each of the 15 blanks counting as 0, .5, or 1 point, human raters were chosen over automated rating because acceptable synonyms and correct words in the wrong form could earn credit. Raters could decide whether a non-keyed response still created a coherent text segment, and the researcher and each rater conferred about non-keyed responses to reach agreement on scoring. Each fully correct response to a blank was given one point, for a maximum score of 15 points.
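The cloze scoring rule can be illustrated with a minimal sketch. The function names, the structure of the key, and the example words are hypothetical; in the study the sets of acceptable alternatives came from rater judgment rather than from a fixed list.

```python
# Credit values mirroring the 0 / .5 / 1 scheme described above.
FULL, PARTIAL, NONE = 1.0, 0.5, 0.0


def score_blank(response, keyed, full_credit=(), partial_credit=()):
    """Score one cloze blank against the keyed word and approved alternatives.

    full_credit and partial_credit are hypothetical collections of
    rater-approved responses (lowercase) for this blank.
    """
    answer = response.strip().lower()
    if answer == keyed.lower() or answer in full_credit:
        return FULL
    if answer in partial_credit:
        return PARTIAL
    return NONE


def score_cloze(responses, key):
    """Sum blank scores; key is a list of (keyed, full_set, partial_set)."""
    return sum(score_blank(r, k, f, p) for r, (k, f, p) in zip(responses, key))


# Illustrative three-blank key (a real passage had 15 blanks).
example_key = [("increase", {"rise", "grow"}, {"increasing"})] * 3
example_total = score_cloze(["Increase", "rise", "increasing"], example_key)
```

With a 15-blank key, the maximum returned by `score_cloze` is 15, matching the maximum score described above.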

Summary rating was performed by trained raters. Raters were all graduate students in an applied linguistics department and were compensated for their rating. Summaries were rated using an analytic rubric developed by the researcher (see Appendix F for the full summary rating guidelines) and informed by Taylor (2013). Although Taylor (2013) used a holistic rubric to rate gap-filling summary tasks, this study used an analytic rubric based on the constructs in Taylor’s rubric. The rubric measured summary quality on four constructs: content accuracy (whether the summary was accurate and complete with respect to the source text), level of modeling (how well the summary distinguished between main and subordinate ideas and generalized across smaller details), task completion (to what degree the summary fit the word length parameters, was organized with respect to the source text, and conveyed useful and coherent information to a hypothetical peer), and language quality (including linguistic accuracy and use of source text). Only accuracy, modeling, and task completion were used as measures of comprehension. The language score was included on the rubric only to mitigate the effect of raters’ judgments of productive language quality on their assessment of the reading comprehension components; it served as a control for productive language ability and was not intended to contribute to the overall comprehension score.

Each summary was given a separate score on a scale from 0 to 4 for each construct, and each summary was rated by at least two raters. If the ratings from the first two raters differed by more than one point in any category, a third rater provided an additional rating for the summary. The average of the two closest ratings for a given rubric construct was used as the final score, and an additional Total Comprehension score was calculated as the sum of the accuracy, modeling, and task completion ratings for each summary. Scores were analyzed for inter-rater reliability using Cohen’s Kappa, and additionally analyzed for rater fit and rubric reliability using Multifaceted Rasch Analysis (Linacre, 2002).
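The rating resolution rule described above can be sketched as follows. The function and key names are hypothetical, and the text does not say how ties were broken when three ratings were equally spaced (for example 1, 2, 3); this sketch simply takes the first closest pair it finds.

```python
from itertools import combinations

# Constructs that count toward Total Comprehension (language is excluded).
COMPREHENSION = ("accuracy", "modeling", "task_completion")


def needs_third_rater(first, second):
    """True if the two raters differ by more than one point in any category."""
    return any(abs(first[c] - second[c]) > 1 for c in first)


def final_score(ratings):
    """Average the two closest ratings for one construct (two or three ratings)."""
    a, b = min(combinations(ratings, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2


def total_comprehension(final):
    """Sum the final accuracy, modeling, and task completion scores."""
    return sum(final[c] for c in COMPREHENSION)
```

For example, ratings of 1, 4, and 3 on one construct give a final score of 3.5, because 4 and 3 are the closest pair.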

3.2 Analyses

To address each of the research questions described in the previous section, a series of statistical analyses was performed.

3.2.1 Research Question 1

3.2.1.1 Do examinees respond significantly faster to sentences inferable from a text than to unrelated sentences after reading the text and is this mediated by reading comprehension tasks?

To answer the first part of research question 1, whether inferable sentences are primed by reading the text, reaction times to items with correct responses during the sentence verification tasks were gathered and controlled for sentence length. These reaction times were modeled as the dependent variable using Linear Mixed Effects (LME) modeling, with sentence type as the single fixed effect and subject as the random effect, because subjects gave multiple responses in each category of the independent variable. The categories of the fixed effect were defined by sentence truth value (true or false) and text-relatedness (related or unrelated).

If correct responses to true/false related sentences have significantly faster reaction times than correct responses to true/false unrelated sentences, this provides evidence that generating inferences was a component of L2 expository text comprehension and played a role in participants’ interpretation of the sentences during the sentence verification task. The results, reported in the following chapter, show this relationship. Thus, a participant’s average response times to related sentences (true, false, or both), controlling for the participant’s overall response speed, can be used as a measure of inference activation. To examine whether inference activation differs across tasks, a second linear mixed effects model was constructed to predict reaction times to related sentences with correct responses. The fixed effect was task type, and subject was included as a random effect.

3.2.1.2 To what extent does inference generation predict variance in comprehension task outcomes (scores) independent of language ability and individual differences?

To understand whether inference generation differs according to individual and testing factors (question 1b), three LME models were used to predict the dependent variable of reading comprehension score in each of the three task types. The fixed effects were inferencing (average response times to related sentences), language proficiency, and individual differences in reasoning, working memory, reading fluency, and motivation; participant was included as a random effect. The inclusion of inferencing as an independent variable was contingent upon the results of question 1a. It was hypothesized that each task type would have a different model of score prediction, with inferencing contributing more predictive power for tasks with less response constraint (cloze and summary). Together, these analyses provide insight into the role of inferencing both as a mental product of reading and as a tool in understanding text comprehension.

3.2.2 Research Question 2

3.2.2.1 To what extent does real-time reading behavior, as measured by eye-tracking, differ between reading tasks?

Various statistical methods were also employed to answer the second set of research questions regarding the role of online reading behavior in reading comprehension. Eye-tracking metrics were first compared using correlations to identify any pairs of measures that were multicollinear overall and thus not measuring sufficiently distinct constructs in this dataset. Next, to address the first part of question 2, whether macrotextual reading behaviors differ between task types, eye-tracking measurements were compared for significant differences between the three tasks. Each eye-tracking metric was predicted using a linear mixed-effects (LME) regression model with a single fixed effect (Task) and two random effects (individual participant and the six text topics). This was performed using the lme4 package in R (Bates et al., 2015). R² is presented as the effect size for each prediction. Only measures with moderate effect sizes were included in the predictive model of tasks. Post-hoc pairwise tests were conducted to determine which tasks differed significantly from each other and to illustrate the magnitude of each task’s effect on eye movement. Previous eye-tracking research suggests verifying statistical results with visual evidence (Kurzhals et al., 2017; Raschke et al., 2014). Thus, in interpreting these results, visual evidence from scan-paths and heat maps is referenced to provide additional explanation. Finally, a Generalized Logistic Mixed Effects Regression (glmer; Bates et al., 2015) was constructed using eye-tracking metrics as independent variables to predict the dependent variable, task type, again controlling for random individual effects. This type of statistical analysis allows for categorical dependent variables. The ten eye-tracking measurements are transition saccades between text and task, total fixations per word on text and on task, number of fixations per line and per paragraph, average duration of fixation on text and on task, average length of saccade, and total rereading time by line and by paragraph. For the logistic regression, the data was split into a training set and a test set: the three tasks from each of 85 participants were used to build the model, and the remaining 11 participants’ data points were used as a test set to verify the model. Because the tasks differ in response complexity and in the attention to text information they require, it was hypothesized that higher levels of these measures of text-level reading would be associated with different tasks.
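The training and test split described above keeps all three tasks from a participant in the same set. A minimal sketch of such a participant-level split follows. The row layout, function name, and seed are hypothetical, and random assignment is an assumption, since the text does not state how the 85 training participants were chosen.

```python
import random


def split_by_participant(rows, n_train=85, seed=0):
    """Split rows so that each participant's tasks all land in one set.

    rows is a list of dicts with a "participant" key (assumed layout).
    Random assignment with a fixed seed is an assumption of this sketch.
    """
    ids = sorted({row["participant"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(ids)
    train_ids = set(ids[:n_train])
    train = [r for r in rows if r["participant"] in train_ids]
    test = [r for r in rows if r["participant"] not in train_ids]
    return train, test


# Synthetic layout only: 96 participants, three tasks each, no real measurements.
example_rows = [
    {"participant": p, "task": t}
    for p in range(96)
    for t in ("MC", "cloze", "summary")
]
example_train, example_test = split_by_participant(example_rows)
```

Splitting by participant rather than by row prevents the same reader from contributing data to both the model and its verification.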

3.2.2.2 To what extent do online reading behaviors predict variance in reading comprehension scores beyond that predicted by individual differences?

Lastly, to address the second part of question 2, three linear models were constructed to predict the dependent variable of comprehension score in each task type, in these cases using eye-tracking metrics as fixed factors along with the individual differences identified as predictive of score in the above-mentioned linear models. Eye-tracking data was split into three sets, one for each reading task. Correlations were calculated between each metric and task score, and further correlations were calculated between each metric and the individual differences. Eye-tracking metrics that were significantly and at least weakly correlated with score, while not being multicollinear with any other predictor measure, were included in a linear regression model to predict score.
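The screening step can be sketched roughly as below. The thresholds `min_r` and `max_collinear` are hypothetical placeholders, since the text gives no cut-off values. The significance test is omitted, and the greedy, order-dependent handling of collinear pairs is an assumption of this sketch rather than the procedure used in the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_metrics(metrics, score, min_r=0.1, max_collinear=0.8):
    """Keep metrics correlated with score and not collinear with kept ones.

    metrics maps a metric name to its values; both thresholds are
    hypothetical, and no significance test is applied here.
    """
    kept = []
    for name, values in metrics.items():
        if abs(pearson_r(values, score)) < min_r:
            continue  # too weakly related to the comprehension score
        if any(abs(pearson_r(values, metrics[k])) >= max_collinear for k in kept):
            continue  # redundant with a metric that was already retained
        kept.append(name)
    return kept
```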

3.3 Summary

In this chapter, I first reported the research questions for the present study. I then detailed the methodology of the study, including information concerning the participants, operationalization of constructs, data collection instruments and procedures, and data preparation. Finally, I provided an overview of the statistical analyses applied to answer each research question. In the next chapter, I describe the preliminary analyses focused on validating the measures for which data was collected in the above-described procedure.

4 INSTRUMENT RELIABILITY

This chapter presents the procedures used to assess the reliability and validity of the scores gathered during data collection. For measures that included discretely scored items, internal reliability was calculated using Cronbach’s alpha. These measures include the language proficiency test, the logical reasoning test, the multiple-choice scores, and the cloze scores. For working memory, because the stimuli were presented in random order and scores were reported as accuracy percentages, split-half reliability for accuracy on the first and second halves of the test was calculated instead of Cronbach’s alpha. For the motivation survey, a confirmatory factor analysis was conducted to verify that the questions asking about extrinsic and intrinsic motivation factored into two latent variables. For the more subjective summary rating, a full Multifaceted Rasch Analysis was conducted to investigate construct, scale, and intra-rater reliability, and Cohen’s Kappa was calculated to measure inter-rater reliability.
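For reference, Cohen’s Kappa for two raters can be computed as in the minimal sketch below. This is the unweighted form; the text does not say whether a weighted variant was used for the 0 to 4 rubric scale, so that choice is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, and 0 indicates agreement no better than chance.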

4.1 Morpho-syntactic Proficiency

The test used to establish basic L2 proficiency in terms of morpho-syntactic and vocabulary knowledge was scored using a key, and each item was assigned a score of 1, 0.5, or 0. Each participant received a score out of 18. The mean score on the test was 12.573 (SD = 3.399). Reliability was measured using Cronbach’s α, a measure of the internal reliability of the test. It measures the degree to which the individual items on a test correlate with the overall ability of the test-takers. The closer α is to 1, the higher the reliability. The threshold for acceptable reliability is traditionally placed at .7, although shorter tests with fewer participants may have an acceptable α below .7. For the proficiency test, Cronbach’s α was calculated to be .802.
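Cronbach’s α is conventionally computed from the item variances and the variance of the total scores, as α = k/(k − 1) × (1 − Σ item variances / total-score variance), where k is the number of items. The sketch below assumes a hypothetical data layout of one row per participant and one column per item, and uses sample variances.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores has one row per participant, one column per item."""
    k = len(item_scores[0])
    # Sample variance of each item across participants.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of participants' total scores.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Because each item here can take the values 0, 0.5, or 1, the rows can hold the partial-credit scores directly.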