According to the Jaccard analysis results, the highlighting carried out by each tutor was significantly different to that of the others. The assumption had been that tutors would share the same understanding about what makes good-quality student writing, so their highlights would be similar, and the overlap between the XIP and the tutors could be measured reliably. However, this proved not to be the case.
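For readers unfamiliar with the measure, the Jaccard index of two sets of highlighted sentences is the size of their intersection divided by the size of their union. The following is a minimal sketch only: the sentence numbers, the variable names and the convention of returning 1.0 for two empty sets are illustrative assumptions, not data or procedures from this study.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard index |A & B| / |A | B| of two sets of sentence numbers."""
    union = a | b
    if not union:
        # Assumed convention: two empty highlight sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical sentence numbers, NOT the highlights collected in the study.
tutor_a = {1, 4, 7, 9, 12}
tutor_b = {4, 7, 10, 12, 15}
xip = {1, 4, 7, 12}

print(jaccard(tutor_a, tutor_b))  # 3 shared sentences out of 7 distinct ones
print(jaccard(tutor_a, xip))      # 4 shared sentences out of 5 distinct ones
```

A value near 1 indicates that two raters highlighted almost the same sentences, and a value near 0 indicates almost no overlap.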

There could be various explanations for this result. Considering that all these participants had more than five years’ experience of marking such an EMA using the same marking scheme, one explanation could be that human marking is not reliable. This essay was marked earlier in 2013 by two ALs and a third marker. The essay grade was agreed as 92, Pass One, in coordination meetings (see GLOSSARY) and the Open University approved granting this mark to the essay. In this case, the expectation was that all the tutors would award an essay mark in the high 80s or low 90s. However, when the tutors were asked to guess the awarded mark, two tutors (T2 and T7) gave marks that were very different from those of the other tutors. Five tutors agreed with the given mark, but Tutors 2 and 7 both awarded 75, Pass Two, and reacted negatively to the actual mark (see Table 6.7). Human marking is not always reliable, which supports the assertion that using automated technologies to support educators’ essay assessment processes could be a good idea.

A second explanation could be that the nature of the highlighting exercise was not sufficiently close to the tutors’ original method of marking an essay. Although tutors were using their usual marking scheme, and were simply asked to highlight the aspects that could make them give positive credit, the results might not clearly demonstrate this. The procedures that were in place during the exercise, such as the unfamiliar process of sentence-by-sentence highlighting and marking, were different and might account for variance in marking. Therefore, it should not be assumed that experienced markers on this course are unreliable, as the university works hard to assure the reliability of the marks assigned.

To examine this further, consider Table 6.7. Tutors 2 and 7 both estimated the essay mark as 75. It might be expected that they would highlight a similar number of sentences; however, Tutor 2 highlighted 13 and Tutor 7 highlighted 28. This could mean that Tutor 7 undervalued the final mark considering the number of highlights that she thought had a positive impact on the final mark. Alternatively, it could mean that the highlights do not clearly show what she actually valued. The case of Tutor 6, who awarded a mark of 87 with only 16 highlights, supports this assertion. The value of R, the Pearson correlation coefficient, between the total number of highlights X = 37, 13, 45, 32, 25, 16, 28 and the estimated essay mark Y = 87, 75, 86, 90, 86, 87, 75 is 0.4308. This is a moderate positive correlation (p=0.345), which means there is a tendency for a higher number of highlights to be associated with a higher estimate for the essay mark (and vice versa). Although the correlation is technically positive, it is only moderate and, with seven tutors, not statistically significant, so it is too weak to support this assumption; therefore, conflating the highlighting of sentences with the assigning of a mark would not be helpful.
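The reported coefficient can be reproduced from the two series quoted in the text. The sketch below uses the X and Y values exactly as listed there; the function and variable names are illustrative, and the p-value is not recomputed here.

```python
import math

# Values as quoted in the text: highlights (X) and estimated marks (Y), Tutors 1 to 7.
highlights = [37, 13, 45, 32, 25, 16, 28]
marks = [87, 75, 86, 90, 86, 87, 75]


def pearson_r(x, y):
    """Pearson correlation: co-deviation sum over the root of the product of squared-deviation sums."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(highlights, marks)
print(round(r, 4))  # 0.4308
```

With only seven pairs of observations, a coefficient of this size is compatible with chance variation, which is consistent with the cautious reading given above.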

Table 6.7 Number of highlights, essay mark estimation and reaction to the actual mark for all tutors

| Tutor | Total number of highlights | Estimated essay mark | Reaction to the actual mark |
|---|---|---|---|
| Tutor 1 | 37 | 85-90 | DB: “Okay, so the essay was given 92.” T1: “Well, I think that’s a reasonable mark.” |
| Tutor 2 | 13 | 75 | T2: “Oh really? I wouldn’t have given it 92. No, I think that is definitely too high. Mind you, I think I’m probably quite a hard marker. If I was monitoring and it was marked by a tutor and they had given it a mark in the low 80s, I would be fine with that. If they gave it a mark of 85 or above, I would tell them they were being lenient.” |
| Tutor 3 | 45 | 85-87 | T3: “Well, I would give it a Pass One, I think, yes.” |
| Tutor 4 | 32 | 89, 90, 91 | T4: “Yes, I do agree with that, yes, obviously.” |
| Tutor 5 | 25 | 85+, late 80s | T5: “Yeah, 92, I suppose if a Pass One is 85 plus, I would have probably upped it a bit to the late 80s but I would have given it a few marks below that…” |
| Tutor 6 | 16 | 87 | DB: “Okay, so the essay was given 92.” T6: “Okay.” DB: “So you agree with that?” T6: “Yes, yeah.” |
| Tutor 7 | 28 | 75 | T7: “Bloody hell! Really? Sorry, I mean… But I wouldn’t have put it above a top end of the Pass Two anyway.” |

It is important to consider that the marks estimated during the interviews were raw marks. In normal circumstances, to standardise Open University marking, each would be balanced against the second marker’s decision and, in case of disagreement, against the third marker’s decision during the coordination meetings. Therefore, based on this sample size, it is not credible to generalise that every tutor marks completely differently and unreliably. As an illustration, consider the following excerpt from the interview with Tutor 1.

TUTOR 1: “…I suppose I am speaking here as someone who has to support ALs as well in doing this. What we try to do is to have a co-ordination session where everybody talks about what marks they are giving to, you know, we have a debate about, about how we are valuing …”

TUTOR 1: “But those things have to be discussed and there is never, it is inevitable, with the best will in the world that two very experienced tutors can give a very different mark to the same assignment.”

Yet it is important to note that human marking and assessment may vary depending on several factors, whereas automated analysis provides the same result every time. This supports the argument that there is a benefit to using an automated technology to support educators’ marking.

The Jaccard analysis results showed a high, significant similarity between the Tutor 1 and XIP highlights. Tutor 1 is a module chair for E000 who looks at the marked scripts, is responsible for guiding the ALs to mark as reliably as possible, holds coordination meetings with ALs to discuss their marking, and third-marks essays to adjudicate when two ALs disagree on the mark of an essay. Given this role, the result is promising for further evaluation of the congruency of the XIP’s analysis results with the educators’

Although there is a highly significant similarity between the highlights of the module chair, Tutor 1, and those of the XIP, in other cases the results were not significant, even with the other module chair, Tutor 2. The reason could depend on several other factors, as discussed above, but the qualitative data analysis in the rest of this chapter suggests ways in which XIP could be developed in order to yield more significant results. Since the statistical results (Jaccard analysis results) given above did not prove to be reliable, it is important to examine how tutors actually define the attributes of good student writing and how they interpreted what they highlighted. The next section therefore deals with this and describes the qualitative data analysis of the interviews.