The examination grades achieved by any one student may be influenced by many factors. Some of the key factors are grouped below as different types of variables, although the list is not definitive:
- features of the syllabuses (syllabus variables);
- features of the examinations (examination variables);
- aspects of the students’ schools (school variables);
- characteristics of the students (student variables);
- social factors both within and outside of the examination process (social ‘variables’).
The appropriateness of labelling such features, aspects, characteristics and factors as discrete variables is, however, questioned: they are more appropriately referred to as influences, as discussed in more detail later. The effect of these ‘influences’ on examination achievements is enormously complex because of their variety and the ways in which they may interact with one another.
Some of the variables may be regarded either as artefacts of assessment or as factors which interact with these artefacts of assessment. For example, mode of response is an examination variable and an assessment artefact. Students’ sex or socio-economic class (student variables) are known to interact with this artefact (Murphy, 1982) to affect the achievements that students can demonstrate, though the process by which this occurs is a social one and its exploration is not within the remit of this thesis. Sex is a variable; gender emerges in interaction between people and is a social context influence rather than a ‘variable’. Its effect can, however, be considered as influencing the interactions of students with assessment contexts and tasks and therefore as mediating outcomes (see Gipps and Murphy, 1994, and Murphy and Ivinson, 2004, for a discussion). Such assessment artefacts and interacting factors are, in my view, sources of invalidity that challenge assumptions about examination comparability. This view is explored in the rest of this section.
Defining comparable grading standards only in terms of identical grade distributions assumes that the syllabuses upon which the examinations are based define assessment domains that are attributed with the same profile of cognitive demand. It could be that assessment domains do require different cognitive demands as a default position (Gardner and Hatch, 1989). For example the domain of physics could be claimed to make more quantitative demands than other domains pertinent to this research, such as biology, which arguably may preclude comparing grading standards between the two subjects.
Another view (Christie and Forrest, 1981) is that examination standards can only be comparable if the syllabuses define assessment domains that are appropriate to the particular subject at a particular level of education. Less clear is who decides what is ‘appropriate’: the examiner drafting the syllabus, a government agency with an overview of the control of such syllabuses, or the users of these syllabuses such as teachers? Or is the decision the composite result of these different groups’ effects? This notion of value-laden judgements about the knowledge associated with assessment domains is in opposition to the default position of Gardner and Hatch (1989).
My own view is that there is a default position: different assessment domains are
associated with different cognitive demands. This view is based on my experiences of teaching and examining GCSE science subjects. As a result I judge physics, for example, to be inherently more quantitative than biology. In my view different constructs are assessed in biology, chemistry and physics. Consequently, I do not interpret different grade distributions for these assessment
domains as necessarily implying a lack of comparability in grading standards. Rather the assessed students may have interacted differently with the domains’ associated cognitive demands.
However, I also argue that there is evidence of a process of ‘valuing’ in syllabus and examination paper construction. For example, the outcomes of human judgements of value (social influences) are evidenced in:
(i) how examination syllabuses even for the same assessment domain may emphasize some cognitive demands more than others;
(ii) how examinations may contain a disproportionate representation of some cognitive demands through selection of items by the person(s) constructing the examination papers.
It is possible for example for two physics syllabuses to differ in the extent to which they emphasize the quantitative dimension of the domain. This may in part be due to the approach to criterion-referencing discussed in Chapter 2 which allows significant variation within a criterion description. The production of the same grade distributions from the examination of these two syllabuses could be interpreted as evidence of comparable grading standards, although the domain is significantly different in the attainments measured. As reported in Chapter 1 (WJEC, 1995), some science teachers view examining groups’ GCSE science syllabuses and papers as
differentially enabling their students to show what they know and can do. Such teachers may choose one examining group’s syllabus and associated examination papers in preference to those of another on the basis of their judgements about their learners.
In the early 1990s I entered my students for WJEC GCSE chemistry papers. My counterpart at a neighbouring school entered his students for the Midland Examining Group’s (MEG) GCSE chemistry papers. In his view, his students did less well with WJEC because their level of literacy skills disadvantaged them on WJEC’s papers, with their emphasis on continuous prose answers. In contrast, I viewed MEG as disadvantaging my students because the associated syllabus and examination papers used contexts that tended not to be in my students’ everyday experience. However, we were both identifying construct-irrelevant variance, that is, variation in performance due to factors other than what is assumed to be assessed in the mark scheme (Messick, 1989). In my counterpart’s view the construct being assessed was altered by the WJEC examination papers’ language demands, whereas for me it was altered by MEG’s choice of question context. As teachers we were both mediating the grade outcomes of our students by our
choice of ‘appropriate’ examinations. The production of the same grade distribution for our two different examinations could be claimed to be indicative of comparable grading standards - we would disagree.
The subject content, which is the knowledge that is specified in a syllabus (syllabus influence), may affect the perceived facility of the course and hence the motivation of the students (student influence). There may also be variation in the amount of organizational detail offered in different syllabuses: even within WJEC’s GCSE Biology, Chemistry and Physics syllabuses this was the case at the beginning of my research. For example, in the 1994 GCSE examinations pertinent to this research, the physics syllabus presents a detailed breakdown of the relationship between the assessment objectives and content in terms of mark allocations awarded to
knowledge/recall, understanding and processes. In addition, there is a breakdown of content areas and their weightings (Appendix 1). The GCSE Biology syllabus for the same year does not offer as much detail in terms of mark allocations and types of skills (ibid.).
Arguably the physics syllabus makes the subject more accessible in the degree to which it offers guidance to teachers. Similarly, differences in syllabus demand are likely to affect the motivation of the students and potentially their attained grades. Interactions of this type make it impossible to distinguish between the effects of differences between organizational features and the effects of differences in cognitive demand of syllabuses. Comparability studies involving the statistical analysis of grade distributions ignore such differences and assume that the effects of syllabuses upon teaching and learning are identical for a population/cohort of students.
The nature of examination tasks (examination influences) may influence students’ achieved examination grades. The examination conditions that Nuttall (1987) suggested were conducive to eliciting students’ best attainments included:
(i) tasks that are concrete and within the experience of the individual student;
(ii) tasks that are clearly presented;
(iii) tasks that are perceived as relevant to the current concerns of the student;
(iv) conditions that are not unduly threatening, for example sufficient time is allowed for task completion.
Arguably, for (i) and (iii) what is concrete, within the experience of the student and perceived as relevant for the student’s current concerns will vary from student to student (student influences) and illustrates the potential for examination and student influence / ‘variable’ interaction (Gipps and Murphy, 1994).
The conclusion is inescapable: ... Assessment (like learning) is highly context-specific and one generalises at one’s peril.
(Nuttall, 1987, p. 115)
Even when grade distributions from different examinations are identical for their
associated examination populations, they may not be identical for well-defined sub-groups within these (student influence). Differences between girls’ and boys’ achieved GCSE grades are well established and depend to some extent at least on differences in the assessment techniques used in the examinations. For example, Newbold and Scanlon (1981) explored the relationships between boys’ and girls’ performances in a range of subjects, including biology, offered by the University of Cambridge Local Examinations Syndicate (UCLES) in their 1979 examinations. All of the examinations studied contained multiple-choice components. They concluded that:
In line with earlier findings in individual subjects and the sciences, there seems in all five subjects to be a pattern, which associates the relative success of boys with objective and semi-objective test forms, and girls’ relative attainments with tests requiring a large degree of free response.
(Newbold and Scanlon, 1981, p. 5)
Murphy’s consideration of the examination statistics for GCE ‘O’ level for England and Wales for the period 1951 - 1977 also showed that boys were advantaged by multiple-choice type questions (Murphy, 1982). This finding was replicated in his study of the Associated Examining Board (AEB) GCE ‘O’ level science results for 1976 - 1979. Stobart (Stobart et al., 1992) has shown coursework to advantage girls. However, other studies (Quinlan, 1991) have shown that this is not necessarily the case, for example in subjects where coursework takes the form of continuous assessment within lesson time, as in science. Such differential attainment is also shown to be dependent upon the nature of the coursework task (Cresswell, 1990).
Gipps and Murphy (1994) were the first to evaluate international evidence in an attempt to examine the extent of observed group differences in assessment performance and understand what these might reflect. They asked to what extent apparent differences in achievement were created by particular approaches to subject knowledge and the way in which it is tested: can changing the structure and content of the test change the pattern of results? In the case of boys and girls in particular, Gipps and Murphy’s evaluation of the research indicates that ‘yes’, it can. These analyses provide overall differences between sex groups, but understanding gender as an influence that emerges in social interaction challenges the assumption that these gender effects are consistent across a sub-group. Rather, they emerge for some girls and some boys and depend on interacting factors within an assessment situation, including the experiences, identities and expectations that students bring to it (Murphy and Ivinson, 2004). An overall difference indicates the presence of a gender effect but not which individual students are affected.
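The point that identical overall grade distributions can conceal different sub-group distributions can be illustrated with a minimal sketch. The grade counts below are entirely hypothetical, invented for illustration and not drawn from any of the studies cited; the names `exam_a`, `exam_b` and `overall` are likewise assumptions of the sketch.

```python
from collections import Counter

# Hypothetical grade counts for two examinations, invented for illustration only.
exam_a = {
    "girls": {"A": 10, "B": 20, "C": 20},
    "boys": {"A": 20, "B": 20, "C": 10},
}
exam_b = {
    "girls": {"A": 20, "B": 20, "C": 10},
    "boys": {"A": 10, "B": 20, "C": 20},
}


def overall(exam):
    """Sum the sub-group grade counts into one whole-population distribution."""
    total = Counter()
    for counts in exam.values():
        total.update(counts)  # adds each sub-group's counts grade by grade
    return dict(total)


# The whole-population distributions match, yet the sub-group distributions do not.
same_overall = overall(exam_a) == overall(exam_b)
same_subgroups = exam_a == exam_b
print(same_overall, same_subgroups)
```

A comparison made only at the level of `overall` would treat the two examinations as equivalent, although in this invented case the pattern of girls’ and boys’ grades is reversed between them. As noted above, even such a sub-group difference would not identify which individual students are affected.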
Students’ perceptions of the ‘difficulty’ of a particular subject and its associated
examination can vary from student to student, subject to subject and examination to examination. For example students’ perceptions of the ‘difficulty’ of a subject may be affected by the associated examination / assessment arrangements:
“English language is easy because it’s coursework ... you write your essays and the teacher picks out the best ones for the exam board ... there’s no hassle of an exam like in maths.”
(Rhys Clewer, Year 11 student in 1994, personal communication)
Such perceptions may influence students’ confidence and motivation. It is widely accepted that affective factors such as these mediate students’ performance in assessments (Stobart et al., 1992). These affective factors may interact with numerous sub-group and test variables (Gipps and Murphy, 1994). Sub-groups alone may be variously defined in terms of ethnic origin, socio-economic status and gender, and all have been shown to be influential in examination performance (Smith and Tomlinson, 1989; Nuttall et al., 1989; Drew and Gray, 1990, 1991; Troyna, 1991). Such student: social: examination interactions illustrate the complexity of making comparisons of examination performances. However, these interactions have largely been ignored in the
School variables affect students’ GCSE grade outcomes. A school’s GCSE examination entry policy is recognised as being a school variable that influences students’ achievements (Cresswell,
1997, p. 73). Schools wishing to enter students who have a highly developed knowledge of science may prefer to use a particular GCSE examination because of its syllabus (syllabus variable). A grade distribution skewed towards high attainment would be a reasonable expectation of such a scenario. A lack of similarity with the grade distribution from another examination could say more about schools’ different student entry policies than about grading standards for the respective examinations.
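The entry-policy argument can be sketched in a few lines. The marks, grade boundaries and function names below are hypothetical assumptions chosen only to make the point; they do not represent any real GCSE examination.

```python
# Hypothetical grade boundaries (minimum mark, grade), invented for illustration.
BOUNDARIES = ((70, "A"), (55, "B"), (40, "C"))


def grade(mark):
    """Apply one fixed grading standard to a mark."""
    for minimum, label in BOUNDARIES:
        if mark >= minimum:
            return label
    return "U"


def distribution(marks):
    """Count how many marks fall into each grade."""
    counts = {"A": 0, "B": 0, "C": 0, "U": 0}
    for mark in marks:
        counts[grade(mark)] += 1
    return counts


# Two invented cohorts: one entered selectively, one entered across the range.
selective_entry = [82, 75, 71, 68, 60, 58]
open_entry = [82, 71, 60, 52, 45, 33]

print(distribution(selective_entry))
print(distribution(open_entry))
```

Here the grading standard is identical by construction, since both cohorts are graded by the same `grade` function, yet the two grade distributions differ. The difference arises solely from who was entered.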
Tiering, a model of differentiated examination papers discussed in Chapter 2, is another assessment artefact that mediates school entry policies. The ‘ceiling’ and ‘floor’ effects on available grades in differentiated papers make it vital that teachers enter their students for appropriate tiers. Research shows that choosing the appropriate tier of entry for students is problematic (Good and Cresswell, 1988d; IGRC, 1993; Gillborn and Youdell, 1998). Tier entry decisions are based on teachers’ knowledge of their students. The range of grades available to a student depends on both the student’s performance and their teacher’s judgement of them for tier entry (Wiliam, 1996). Stobart et al. (1992) argued that differential performance between boys and girls was influenced by tier entry schemes. They reviewed teachers’ comments from surveys and case study interviews and found that more boys than girls were entered for the foundation tier in a three-tier model used in GCSE mathematics. Disaffection with GCSE mathematics was seen by teachers as being greater for the boys than the girls placed in the foundation tier. Girls were seen as being more content than the boys to take a lower tier. The greater disaffection shown by lower attaining boys influenced teachers’ decisions about whether to enter them at all for the GCSE examination. In contrast, more girls than boys were entered for the intermediate tier with its maximum grade B. Stobart et al. (1992) suggest that the bigger female entry in the intermediate tier reflects an underestimation of girls’ mathematical abilities by their teachers, who perceived girls as lacking confidence and being anxious about failure. Teachers responded by entering proportionally more girls than boys for the intermediate tier, which avoided the risk of being unclassified if performance dropped below grade C. Able girls’ lack of confidence and boys’ abundance of confidence were seen by teachers as factors affecting performance (Stobart et al., 1992).
The research by Gillborn and Youdell (1998) also suggests that tiering introduces additional barriers to equality of opportunity for students from different ethnic origins, and in particular Black students. Black students were more likely to be entered for the foundation tier and less likely to be entered for the higher tier, and the most significant inequality of access to high grades was in those subjects which operated with a three-tier entry model. Teachers tended to be cautious when entering their students for tiers and ‘played safe’ to avoid students falling off the floor of the top tier. Such tier entry effects may result in some examinations being skewed in their grade
distributions. A foundation tier may have its results skewed away from the lower grades because students capable of higher grades have been inappropriately entered for this tier paper. It might then be assumed that the foundation paper has been inappropriately ‘easy’. Simply comparing grade distributions from different examinations ignores inappropriate tier entry effects.
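The ‘ceiling’ and ‘floor’ effects of tier entry can be illustrated with a simplified sketch. The grade ranges assigned to each tier below are hypothetical assumptions (only the intermediate tier’s maximum of grade B is taken from the text above), as is the idea of a single ‘potential’ grade for each student; real tiering arrangements differ in detail.

```python
GRADES = ["A", "B", "C", "D", "E", "F", "G"]  # ordered from highest to lowest

# Hypothetical (ceiling, floor) grade ranges for a three-tier model.
# These ranges are assumptions for illustration, not any syllabus's actual rules.
TIERS = {
    "higher": ("A", "C"),
    "intermediate": ("B", "E"),
    "foundation": ("D", "G"),
}


def awarded(potential, tier):
    """Return the grade a student receives given their potential grade and tier."""
    ceiling, floor = TIERS[tier]
    rank = GRADES.index(potential)
    if rank < GRADES.index(ceiling):
        return ceiling  # ceiling effect: the tier cannot award a higher grade
    if rank > GRADES.index(floor):
        return "U"  # floor effect: performance below the tier's range is unclassified
    return potential


print(awarded("B", "foundation"))    # capped at the foundation ceiling
print(awarded("D", "higher"))        # falls off the floor of the higher tier
print(awarded("C", "intermediate"))  # within range, so unchanged
```

Under these assumptions, students capable of higher grades who are entered for the foundation tier accumulate at its ceiling grade, which is the kind of skew described above and one that a simple comparison of grade distributions would not reveal.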
This section has highlighted the seemingly intractable nature of making valid examination comparisons. Consequently, a consideration of the methods used in previous studies is necessary to inform the methodology for my research.