As mentioned in 2.4.3, our definition of accuracy relates to “the extent to which an L2 learner’s performance (and the L2 system that underlies this performance) deviates from a norm (i.e. usually the native speaker). Thus, deviations from targetlike performance would be considered errors” (Housen et al., 2012, p. 4). Accuracy also relates to “appropriateness and acceptability” (p. 4). For the purpose of this study, deviations from targetlike performance comprise grammatical errors in, and communicatively inadequate use of, the targeted forms: the OS and OPREP RC types and 3rd person singular and plural. In other words, if a learner produced an OS RC that was grammatically correct but did not reflect the context of the storyline, it would be considered inaccurate. It was therefore necessary to use a measure that could gauge both grammatical errors relating to the OS and OPREP RC types and the accompanying use of 3rd person singular or plural, and communicatively inadequate use of these forms. To this end, Mochizuki & Ortega’s (2008) rating scale, which was designed to measure oral accuracy of simple RC types, was adapted and used. As described in 3.3.6, Mochizuki & Ortega (2008) designed a six-point rating scale in which each point represented a category of grammatical accuracy in a spoken context. For example, five points were awarded for targetlike relativization in which the relative pronoun was used correctly. Points were then deducted for grammatical errors relating to the RC type, down to a minimum of 0 points, which indicated avoidance of the form. Table 26 (section A) illustrates the rating scale used in Mochizuki & Ortega (2008), including definitions and examples for each of the six points relating to RC production.
Table 26. Relative clause and relevant morphology rating scale (adapted from Mochizuki & Ortega, 2008, p. 22)
Section B: Grammaticization: Morphological Adequacy Scale
| Descriptor | Definition | Example | Points |
|---|---|---|---|
| Target-like use of 3rd person singular & targetlike RC | Two instances of targetlike use of 3rd person singular that accompany targetlike relativization involving 3rd person singular or plural | ‘He thinks that he likes the dog which has long ears’; ‘He thinks he likes the dogs which have long ears’ | 9 |
| Target-like use of 3rd person singular & targetlike RC | One instance of targetlike use of 3rd person singular that accompanies targetlike relativization involving 3rd person singular or plural | ‘He wants the dog which has long ears’; ‘He wants the dogs which have long ears’ | 8 |
| Target-like suppliance of RC only | Use of 3rd person singular that contains errors, complementing a targetlike RC that contains no errors | ‘She thinks he like the dog which has long ears’; ‘He thinks like the dog which the woman is looking at’ | 7 |
| Target-like suppliance | A relative clause that exhibits targetlike relativization; contains no errors relating to verb tense but may contain other errors such as articles | ‘He want the dog which has long ear’; ‘He want the dogs which have long ear’ | 6 |
Section A: Syntacticization: Relative Clause scoring scheme
| Descriptor | Definition | Example | Points |
|---|---|---|---|
| Target-like suppliance | A relative clause that exhibits targetlike relativization; it may contain one or more errors that are irrelevant to the target structure, such as verb tense or the use of articles | ‘I want the dog which have long ear’ | 5 |
| Developmental suppliance | A relative clause that contains any of four error types (i.e. pronoun retention, nonadjacency, incorrect relative marker and inappropriate relative pronoun omission) described in the previous studies on relative clauses (e.g. Izumi, 2003) | 1. ‘I want the dog which many people are watching dog.’ 2. ‘The dog is friendly which has long hair.’ 3. ‘Ken likes the dog who has long ears.’ 4. ‘I like the dog has long ears.’ | 4 |
| Attempt with processing overload | Relative clause attempted but containing a breakdown such as omission of head noun or verb in the relative clause | ‘She wants which has long ears.’; ‘She wants the dog which long ears.’ | 3 |
| Least successful attempt | Relative clause where both developmental and processing load errors combine to cloud the success of the product and hinder intelligibility | ‘Kanako wants to buy which has long hair and long ear dog.’ | 2 |
| Simplification | An utterance in which the participant tried to convey meaning without attempting relativization, using alternative structures; these include either the structure derived from a direct translation from Japanese or alternative structures in English | 1. ‘long the dog that has long ear.’ 2. ‘the dog with long hair.’ | 1 |
| Avoidance of Content | Formulation of the content involved in one of the seven contexts for obligatory suppliance was not attempted | | 0 |
According to Mochizuki & Ortega, the six-point scoring system was designed because the participants in their study “not only were of low proficiency but also had little familiarity with speaking” (p. 21). Consequently, the rating scale allowed for a more sensitive measurement of RC accuracy than previous binary measures such as ‘error-free clauses’. Binary measures were considered unsuitable for the beginner-level participants of their study, who may not have been able to produce error-free instances of the form.
Despite the intermediate proficiency of the present study’s participants, the rating scale was still used as a measurement of accuracy for three reasons. First, as we saw in 4.5.1, the accuracy measures used in the pilot study were ‘error-free relative clauses per AS-unit’ and ‘error-free relative clauses per relative clause’. However, because these measures are all-or-nothing, few of the RCs produced by the B2 learners counted as error-free. The advantage of the rating scale used in Mochizuki & Ortega (2008) was that it was sensitive to incremental improvements in learners’ grammatical development of RC use rather than crediting only a wholly accurate RC. Second, Mochizuki & Ortega’s (2008) narrative consisted of seven obligatory contexts of RCs, thus their rating scale “maintains the number of contexts to be assessed constant across learners (k = 7), thus making the scores into a true interval scale that can be directly submitted to inferential analysis” (p. 21). In other words, the maximum score achievable would be 35 points (5 points multiplied by the 7 contexts) and this would remain constant for all participants. As a result, scores would not need transforming for conformity and could be directly inputted for inferential statistical analysis. Finally, the rating scale would be particularly useful when comparing pre- and post-test scores as it would allow us to see any potential gains made as a result of the treatment.
For the purpose of this study, the rating scale used in Mochizuki & Ortega (2008) was adapted to incorporate targetlike relativization of the OS and OPREP RC types, 3rd person plural and the accompanying use of 3rd person singular. This involved adding four categories at the top of the rating scale, categorized under section B (see table 26). Points six and seven relate to targetlike relativization, including correct use of RCs located next to singular or plural head nouns, thus reflecting correct use of 3rd person singular or plural respectively. Points eight and nine, however, concern target-like use of 3rd person singular within the independent clause as well as target-like use of the adjoining RC type. As a result, the rating scale designed for this study incorporates two sections of grammatical accuracy. Section A concerns syntacticization, i.e. syntactic adequacy of RCs, whilst section B relates to grammaticization, i.e. morphological adequacy of RCs and the complementary use of 3rd person singular. Ellis & Barkhuizen (2005) note that targetlike oral morphology is appropriate for syntactic accuracy when using focused tasks designed to elicit specific grammatical forms, as in the case of this study. Consequently, target-like morphology for the present study is a specific measure of accuracy. The maximum score achievable for the 7 obligatory RC contexts in each pre- and post-test using our adapted rating scale was 60 points (see figure 12 for an illustration of possible phrases to be used in the pre-test narrative and the points awarded).
Figure 12. Pre-test narrative: possible examples for maximum points for each RC context
Context 1 (OS): The mother thinks that she likes the dog which has long ears (9 points)
Context 2 (OS): She wants the dog which is next to the girl (8 points)
Context 3 (OS): Kevin thinks he likes the dogs which have long hair (9 points)
Context 4 (OPREP): He wants the dog which the family is looking at (8 points)
Context 5 (OS): Kate thinks she likes the dog which has long hair (9 points)
Context 6 (OS): She also thinks she likes the dog which has long ears (9 points)
Context 7 (OPREP): She wants the dog which the girl is smiling at (8 points)
Maximum total: 60
The uneven scoring across RC contexts reflects how many instances of 3rd person singular were required to accompany each RC. Contexts 1, 3, 5 and 6 could be awarded 9 points because they required two instances of 3rd person singular alongside targetlike use of the RC type: each involved a character thinking about something, represented by thought bubbles, for example, ‘he thinks that he likes a dog which has long hair’. The remaining contexts involved a character choosing something and therefore required only one use of 3rd person singular alongside a targetlike RC, for example, ‘he wants the dog which has long hair’; these would be awarded a maximum of 8 points according to the rating scale.
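The arithmetic behind the 60-point maximum can be sketched in Python. This is a minimal illustration only and not part of the study's procedure; the names `MAX_POINTS_BY_FRAME`, `CONTEXT_FRAMES` and `max_total` are hypothetical, and the per-context maxima are those listed in figure 12.

```python
# Hypothetical names; per-context maxima follow figure 12.
# "thinks" contexts require two instances of 3rd person singular (9 points);
# "wants" contexts require one (8 points).
MAX_POINTS_BY_FRAME = {"thinks": 9, "wants": 8}

# Frame of each of the seven obligatory contexts in the pre-test narrative.
CONTEXT_FRAMES = {
    1: "thinks",
    2: "wants",
    3: "thinks",
    4: "wants",
    5: "thinks",
    6: "thinks",
    7: "wants",
}


def max_total(frames=CONTEXT_FRAMES):
    """Sum the highest score available in each obligatory context."""
    return sum(MAX_POINTS_BY_FRAME[frame] for frame in frames.values())


print(max_total())  # 4 * 9 + 3 * 8 = 60
```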
The remaining part of our accuracy definition, communicatively adequate use of the targeted forms, was assessed in terms of the context in which learners used the forms. The learners were required to produce RCs in seven obligatory contexts that reflected the storyline. Thus, if an RC was grammatically correct but communicatively inadequate, i.e. it did not relate to the storyline, it would not receive a score. An example would be describing a picture that contains a dog with long ears as ‘the cat which has short ears’. However, if the learner immediately self-corrected their use of the RC type, then he/she would be graded on the self-correction. Thus, in line with the analysis of the pilot study in 4.4.1, repeated RCs, false starts and the attempts replaced by self-corrections were excluded from the analysis. For example, in ‘he likes the dog which has black hairs, ah, which has black hair’, only the self-corrected RC (‘which has black hair’) would be graded. In addition, if a learner described an obligatory context using a different RC type, for example, using the OS in context 7 instead of the OPREP, they would not be penalized provided its use was appropriate, in other words, it reflected the storyline. Finally, in line with Mochizuki & Ortega (2008), this study prevented ‘over-use’ of the targeted forms by setting a maximum score achievable for all participants. In Mochizuki & Ortega’s (2008) study “maximum total score in the use of relativization in the narrative task was 35 points (a maximum of 5 points by 7 contexts for obligatory suppliance)” (p. 21). In the case of the present study, it was 60 points, as outlined above. Thus, if a learner produced twelve RCs during a narration, they would only be graded on the seven contexts shown in figure 12. As in Mochizuki & Ortega (2008), this maximum would remain constant for all participants when comparing the pre- and post-test narrative scores. For instructions on how accuracy was coded and analysed using the computer software CLAN, see appendix R.
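The rule that only the seven obligatory contexts are graded can be sketched as follows. This is an illustrative Python sketch, not the CLAN procedure described in appendix R: `score_narrative` and its input format are hypothetical, the ratings are invented, and counting a missing or unscored context as 0 is an assumption made here for the arithmetic.

```python
# Hypothetical sketch: total a narrative's accuracy score over the seven
# obligatory contexts only, so that additional RCs cannot raise the score.
CONTEXT_MAX = {1: 9, 2: 8, 3: 9, 4: 8, 5: 9, 6: 9, 7: 8}  # from figure 12


def score_narrative(ratings):
    """Return the total score for one narrative.

    ratings maps a context number to the rating given to the RC produced
    for that context. Keys outside 1-7 stand for additional RCs and are
    ignored. Missing contexts count as 0 (an assumption of this sketch).
    """
    total = 0
    for context, ceiling in CONTEXT_MAX.items():
        rating = ratings.get(context, 0)
        if not 0 <= rating <= ceiling:
            raise ValueError(f"context {context}: rating {rating} out of range")
        total += rating
    return total


# Invented ratings for illustration; keys 8 and 9 represent extra RCs.
example = {1: 9, 2: 8, 3: 7, 4: 4, 5: 9, 6: 6, 7: 5, 8: 9, 9: 9}
print(score_narrative(example))  # 48
```

Because the loop runs over the fixed set of contexts rather than over everything the learner produced, the ceiling of 60 points holds however many RCs a narrative contains.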
An advantage of rating scales is that they use closed questions in which a range of responses is provided for an individual to choose from (Cohen et al. 2005). In other words, an individual is simply required to choose the most appropriate item. As a result, rating scales with closed questions “are quick to complete and straightforward to code (e.g. for computer analysis) and do not discriminate unduly on the basis of how articulate the respondents are” (p. 248). Cohen et al. (2005) note that rating scales are a popular form of measurement in research as “they combine the opportunity for a flexible response with the ability to determine frequencies, correlations and other forms of quantitative analysis” (p. 253). Furthermore, the reliability of rating scales can be tested by comparing the scores of multiple assessors. For example, if different assessors grade a student’s production of RCs consistently, then the rating scale can be considered a reliable measure for testing learners’ oral performance of the form. This procedure, known as inter-rater reliability, was carried out on the present study’s rating scale following a process similar to that used in Mochizuki & Ortega (2008). Two independent raters, both experienced EFL teachers in Japan, were asked to rate a random 10% sample of the data taken from the pre- and post-tests. This consisted of scoring eight student narratives (two from the pre-test, three from the immediate post-test and three from the delayed post-test). Overall, the raters were consistent in their assessment of the accuracy of the targeted forms: on average, their scores were the same or differed by +/- 1 point per context (see appendix S). Independent samples t-tests also showed no significant differences between the two raters’ mean scores for any of the narratives tested, as shown in table 27. As a result, the researcher was satisfied with the level of inter-rater reliability using the present study’s accuracy rating scale.
Table 27: Inter-rater reliability significance values between two independent raters
| Narrative | Rater 1 mean | Rater 2 mean | Significance value |
|---|---|---|---|
| Narrative 1 | 1 | 1 | N/A as both raters averaged (M = 1) |
| Narrative 2 | 1.57 | 1.43 | t(12) = .200, p > 0.05 |
| Narrative 3 | 6.43 | 6.57 | t(12) = -.098, p > 0.05 |
| Narrative 4 | 4.71 | 4.86 | t(12) = -.090, p > 0.05 |
| Narrative 5 | 6.57 | 6.86 | t(12) = -.195, p > 0.05 |
| Narrative 6 | 5.14 | 5.14 | t(12) = .000, p > 0.05 |
| Narrative 7 | 5.86 | 7.00 | t(12) = -.691, p > 0.05 |
| Narrative 8 | 6.00 | 5.14 | t(12) = .679, p > 0.05 |
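The kind of comparison reported in table 27 can be illustrated with a short Python sketch of an independent samples t-test using pooled variance. With seven context scores per rater, the degrees of freedom are 7 + 7 - 2 = 12, which matches the t(12) values in the table. The scores below are invented for illustration and are not the study's data; `pooled_t` is a hypothetical helper, and 2.179 is the standard two-tailed critical value of t for df = 12 at the .05 level.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Independent samples t statistic, assuming equal variances.

    Returns (t, df). t is None when both samples have zero variance,
    because the statistic is then undefined.
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    if se == 0:
        return None, df
    return (mean(a) - mean(b)) / se, df


# Invented per-context scores for one narrative (seven contexts per rater).
rater_1 = [8, 8, 7, 4, 6, 5, 4]
rater_2 = [9, 7, 7, 5, 6, 6, 5]

t, df = pooled_t(rater_1, rater_2)
T_CRIT = 2.179  # two-tailed critical value for df = 12, alpha = .05
print(f"t({df}) = {t:.3f}; significant: {abs(t) > T_CRIT}")
```

When both raters give the same constant score in every context, the pooled variance is zero and no t value can be computed. That is one plausible reading of the N/A entry for narrative 1, although the source does not state the reason beyond the identical means.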
In terms of disadvantages of rating scales, Cohen et al. (2005) point out that “there is no assumption of equal intervals between categories, hence a rating of 4 indicates neither that it is twice as powerful as 2 nor that it is twice as strongly felt” (p. 254). Thus, if student ‘a’ produced one RC that was graded ‘4’, and student ‘b’ produced one RC that was graded ‘2’, we could not infer that student ‘a’s production of RCs was twice as accurate as student ‘b’s. Cohen et al. (2005) also mention that “there is no check on whether the respondents are telling the truth” (p. 254). Finally, as with questionnaires, raters may be reluctant to use the extreme values at either end of the scale. Consequently, a six-point rating scale, as in Mochizuki & Ortega (2008), would effectively offer a choice of only four responses, which provides little range for accurate rating.
In the case of the present study, however, all these points were addressed. Firstly, this study acknowledges that the rating scale does not have equal intervals between categories, only that each successive category represents greater grammatical accuracy of the form. Thus, a grade of 4 will not be treated as twice as accurate as a grade of 2, only as more grammatically accurate. In terms of respondents telling the truth, this issue was addressed through our use of inter-rater reliability, in which both raters provided similar scores, thus indicating a true reflection of the students’ performance. Finally, the extreme values were addressed in a similar way to the questionnaire, in that additional items were added to the rating scale so that it consisted of nine responses. This allows for a seven-point variation after omitting the two extreme values, which provides sufficient range for raters to accurately assess learners’ performance. To conclude, although the accuracy rating scale measure may be limited, for the purpose of this study it does allow us to measure grammatical accuracy and communicatively adequate use of the OS and OPREP RC types and 3rd person singular and plural, in line with our definition. We now discuss the complexity measure of the present study.