Norm-referenced and Criterion-referenced Measurement

In the previous Chapter, the use of a standardised test of general reading ability was rejected for two reasons: the lack of any demonstration that the content of such a test was related to functional reading tasks, and the fact that it was founded upon a measurement model inappropriate to the present circumstances. In the present Chapter, a more appropriate model is advanced and compared with the standard model. The various approaches to and implementations of the new model are considered, and those facets most useful to the SOFRP are selected. The statistical implications and problems in relation to this model are discussed.

It is, perhaps, most useful to go into a little more detail about the standard or classical testing model. Most tests of general reading ability are concerned with assigning a score to a testee such that his ranking in ability relative to his peers can be assessed. Such a test is known as a norm-referenced test (NRT), due to the underlying assumption of the measurement model - norm-referenced measurement (NRM) - that the distribution of scores of testees can be fitted to the normal (Gaussian) distribution. Hence, a testee can be identified as below, at or above the average level for his group (a group typically being all pupils of a given age, or school year, etc.). Now, the content of such tests is devised to reflect those reading tasks that the group should be able to undertake, and therefore the test score can be said also to identify what a given testee can and cannot do. The construction of a NRT, however, involves the manipulation of the difficulty levels of the items and the content of items (including distractors) in order to increase the variability of the scores achieved by testees. This means that items appear in the test less for their representative nature vis-a-vis the content of the area being tested than because they are the ones which possess the appropriate statistical characteristics from the available stock of items. It is the contribution to the variability of test scores that is all-important in NRM, as the test is to be evaluated in terms of the ability of items to discriminate among testees (Gronlund, 1977).
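The link between item difficulty and score variability can be illustrated with a minimal sketch. It assumes dichotomously scored (0/1) items, for which the variance of an item is p(1 - p), where p is the proportion of testees answering correctly. The response matrix and the function names are hypothetical and are not drawn from the Project's data.

```python
# Hypothetical 0/1 responses: each row is a testee, each column an item.
responses = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def item_difficulty(responses, item):
    """Proportion of testees answering the item correctly (the item p-value)."""
    column = [row[item] for row in responses]
    return sum(column) / len(column)


def item_variance(responses, item):
    """Variance of a 0/1 item, which equals p * (1 - p)."""
    p = item_difficulty(responses, item)
    return p * (1 - p)


n_items = len(responses[0])
difficulties = [item_difficulty(responses, i) for i in range(n_items)]
variances = [item_variance(responses, i) for i in range(n_items)]

for i in range(n_items):
    print(f"item {i}: p = {difficulties[i]:.2f}, variance = {variances[i]:.3f}")
```

On these invented data, the item answered correctly by every testee has zero variance and so adds nothing to the spread of total scores, whereas the items with p equal to 0.5 have the largest variance. This is the statistical pressure that leads the constructor of a NRT to favour items of middling difficulty.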

Whilst it is often important to discriminate between testees, particularly in schools and colleges, it is not always the appropriate measurement to make. A NRT will allow a teacher to see who are the brightest and who the slowest in the class. What it will not usually do is indicate what each of the pupils can and cannot do. That is, it gives few or no clues on 'subject matter proficiency' (Glaser, 1963). A test that will do this may be termed a criterion-referenced test (CRT), as it provides information on the testee's achievement in relation to 'an absolute standard of quality' (Glaser, 1963) and 'the meaningfulness of an individual score is not dependent on comparison with other testees' (Popham & Husek, 1969). A much used and excellent example of a CRT is the driving-test, in which the testee has to demonstrate proficiency in a number of specific ways in relation to a set, absolute standard: acceptable driving behaviour. How other testees perform on the test is irrelevant, though how other drivers perform in the real-world task is important in relation to 'acceptable driving behaviour', the criterion to which the test is referenced.

Definitions and Approaches

Such tests are clearly linked to decision-making, e.g. whether to give remedial help to a less proficient testee, whether to promote a child to a more difficult reading book, or whether a certain course of teaching has been effective in getting some subject matter over to the students. "Criterion-referenced tests are devised to make decisions both about individuals and treatments" (Popham & Husek, 1969). Two major approaches may be distinguished, resulting directly from how different authors have seen decision-making.

By far the most common strand of CRM testing has been linked with instructional technology. Educators previously used to considering NRT scores and referencing the scores of individual pupils to the mean, standard deviation, stanines and percentile ranks, etc., looked for similar single scores on which to base their decisions. Hence the concept of 'mastery' testing has grown up, where the testee is assigned to one of two or more groups (e.g. 'pass-fail'; 'advance-retain-remediate') on the basis of his score in relation to some 'cut-off' point(s) or 'criterion-score(s)'. Such a conception may be seen as arising from definitions of CRM based on narrow, tightly-defined objectives. Glaser & Nitko (1971) have defined a CRT as 'one that is deliberately constructed so as to yield measurements that are directly interpretable in terms of specific performance standards' (p. 653). Ivens (1970) defined a CRT as being composed of 'items keyed to a set of behavioural objectives', whilst Kriewall (1972) suggested that it should contain items which are homogeneous in difficulty for each examinee. Livingston (1972) used 'criterion-referenced' to refer to 'any test for which a criterion-score is specified without reference to the distribution of scores of a group of examinees' (p. 13), for instance without reference to the group mean.
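The assignment of a testee to a mastery group by cut-off score(s) can be sketched as follows. The band labels follow the 'advance-retain-remediate' example above; the cut-off values, the 40-item test length and the function name are assumptions made purely for illustration.

```python
def classify(score, cut_offs):
    """Return the label of the highest band whose lower bound the score reaches.

    cut_offs is a list of (lower_bound, label) pairs. The values used below
    are hypothetical and carry no empirical justification.
    """
    label = None
    for lower_bound, band in sorted(cut_offs):
        if score >= lower_bound:
            label = band
    return label


# Hypothetical cut-off scores on a 40-item test.
BANDS = [(0, "remediate"), (20, "retain"), (32, "advance")]

print(classify(35, BANDS))  # advance
print(classify(20, BANDS))  # retain
print(classify(19, BANDS))  # remediate
```

A single cut-off gives the 'pass-fail' case. The sketch makes plain that the decision depends only on the testee's own score and the fixed criterion-scores, not on the scores of other testees.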

Within mastery testing, Meskauskas (1976) identified two sorts of model. 'State' models see mastery as an 'all-or-none' dichotomy; such models have been advanced by Emrick (1971) and Roudabush (1974), backed up by considerable technical discussion using Bayesian decision statistics (e.g. Hambleton & Novick (1973); Berk (1976)). On the other hand, some authors have advanced 'continuum' models, in which "mastery is viewed as a continuously-distributed ability ... (and) ... an area is identified at the upper end of this continuum, and if an individual equals or exceeds the lower bound of this area, he is termed a master" (Meskauskas, 1976, p. 134). Such models have been

The second, and much less common, approach to criterion-referenced measurement may be related to the original conception by Glaser. In his early work (1963), 'Competence is conceived as being a continuum characteristic. There are, at most, ambiguous suggestions that a single point exists at which competence becomes incompetence' (Glass & Smith, 1978, p. 13). Popham (1975) admits that his use of the term 'performance standard' (Popham & Husek, 1969) contributed to what Glass and Smith go on to call 'a case study in confusion and corruption of meaning' (loc. cit.). The essential differences here are the disinclination to use a cutting score and the narrowness of the definition of objectives. Popham defines a CRT as a test that "is used to ascertain an individual's status with respect to a well-defined behavior (sic) domain" (1975, p. 130), the stress being on 'well-defined' rather than 'narrow'. This is much nearer to the original thinking in the area, replacing NRM with measurement of what a testee can and cannot do. Ayrer (1977) has adopted this approach, using his test to 'describe instead of certify', where 'its function was to show where a student was having problems' (p. 704).

Criterion-referenced Measurement and SOFRP

The use of CRM by the SOFRP is clearly dictated. "Since a primary goal of functional literacy measures is to assess achievement with respect to specified reading tasks, domain referenced tests which require generating a representative sample from a well-defined population (domain) of tasks seems most appropriate" (Kirsch & Guthrie, 1977-8). In terms of the SOFRP, both approaches to CRM have something to offer. It will be useful, first, to consider the domain with which the Project is concerned. It follows from the definition advanced in Chapter 2 above (occupational functional reading ability is 'the level to which an individual possesses those reading skills needed to perform successfully those reading tasks required of him in seeking a job, at work and in related training') that the domain consists of the skills and knowledge needed by 16+ year old school leavers to perform the reading tasks encountered in their first six weeks in the above circumstances in the City of Sheffield. It is not the tasks which one is concerned to assess but the level of skills and knowledge needed to deal with them. This is obviously a broad domain, although well-defined in the context of the Project. It is highly unlikely that narrow, tightly-defined objectives would have any place in the development of the functional reading tests under discussion here: there would quite simply be far too many of them. For instance, to analyse each reading task for each job in each industry in the city would not only be an enormous task but would also generate an enormous number of highly specific objectives. This being far beyond the resources available and outside the level of acceptability to test users (who would be confronted either with a small set of huge tests or a huge set of small tests), it would be necessary to use less specific, more general objectives, encompassing, say, certain classes of job, or broadly defined types of reading task. This being the case, one is basically using the Popham definition.

In fact, the need for formally stated objectives becomes an irrelevance. If one is concerned with the identification of reading tasks undertaken by school-leavers, and the subsequent development of a test containing items based on materials representative of those tasks, the specification of objectives comes at the time of test assembly and not at the time of item construction. This being so, the objectives are only descriptions arrived at a posteriori and are irrelevant if the processes of development have been adequately reported.

Yet the mastery approach has something to offer from the continuum model. Whilst it has already been made clear (Chapter 2) that to call someone literate or illiterate is impossible or invalid on the basis of a single score, the use of a range of acceptable scores, or identifying a single score for further action by the tester rather than 'failure' by the testee, offers a number of useful features. Teachers will more often welcome a test which can identify both who is having problems and the area into which some of those problems fall. The emphasis in a test based on a broad domain is on 'looking more closely' at those with lower scores, rather than on 'fail'. Further, relating some score to high performance ratings on actual job reading tasks offers an opportunity to reduce the arbitrary nature of a priori standard setting. Also, 'above' and 'below' a cut-off score allows a dichotomy to be introduced for certain statistical measures, which will be particularly useful as will be demonstrated below.

To summarize so far then, norm-referenced measurement is inappropriate to assessing levels of occupational functional reading ability for, whilst it may order testees relative to one another, it gives no clue to what each testee can and cannot do. Criterion-referenced measurement offers this type of information by referencing performance on a test to an external criterion behaviour, in this case job-related reading tasks. The range of skills and knowledge required in the performance of such tasks is too broad to allow the use of specific, objective-based mastery testing, although some features of that paradigm are useful. The type of CRT under development for the SOFRP conforms more closely to the view advanced by Popham.

Statistical Considerations

CRM differs in another way from NRM, and that is in the statistical formulation of development and analysis procedures. As has been previously indicated, NRTs are designed to yield wide variations in scores in order to discriminate between testees. In CRM, however, the variability in scores is irrelevant. "The meaning of the score is not dependent on comparison with other scores; it flows directly from the connection between the items and the criterion. It is, of course, true that one almost always gets variant scores on any psychological test: but that variability is not a necessary condition for a good criterion-referenced test" (Popham & Husek, 1969, p. 3). Items which all testees get right or all testees get wrong are invariably deleted in NRTs because they do not discriminate. In CRM, such an item might well be retained, for it provides information about each testee in relation to the domain under consideration. It is conceivable in CRM to have a test in which all testees get all items correct. It is indeed desirable in the context of a post-instruction test. The effect of the irrelevance of variability is to alter the whole statistical pattern associated with NRM.
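Why such items fail the classical criterion can be shown with a minimal sketch of one common index, the upper-lower discrimination index: the proportion correct among the higher-scoring testees minus that among the lower-scoring testees. Splitting the group into halves is a simplification assumed here, and the response data and function name are hypothetical.

```python
def discrimination_index(responses, item):
    """Upper-lower index: p(correct) in the top-scoring half minus p(correct)
    in the bottom-scoring half, with testees ranked by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(row[item] for row in upper) / half
    p_lower = sum(row[item] for row in lower) / half
    return p_upper - p_lower


# Hypothetical 0/1 responses: each row is a testee, each column an item.
responses = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]

indices = [discrimination_index(responses, i) for i in range(3)]
print(indices)
```

The first item, answered correctly by every testee, has an index of zero and would be discarded from a NRT. In a CRT it could stand, since it still records that each testee can perform that element of the domain.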

Classical testing theory, as it has come to be called, uses certain common statistical procedures to describe and evaluate tests and the performance of testees upon them. To describe the scores of testees, it is usual to report the arithmetic mean score and the standard deviation, implicitly accepting that such measures are given meaning by reference to the performance of others on the test. Various significance tests are also based on the assumption of a normal distribution of scores. In test evaluation, correlational techniques are used to relate two sets of measures, particularly in the study of the validity and reliability of the test, and to assess item discrimination.

If the test is not constructed to yield a widely spread set of scores, however, and can in fact yield a set of scores with little or no spread and still be a good test, it follows that the classical measures may be invalid and inappropriate. The mean and standard deviation may be highly misleading with a multimodal or highly skewed distribution: the mean score reported may not have been achieved by anyone at all. Other descriptors such as the median, mode and range may give a more accurate picture. Significance tests based on approximations to normality will have their basic assumptions violated, and recourse to distribution-free statistics will be required.
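A small worked example may make the point concrete. The scores below are invented (eleven testees on a 20-item test, falling into two clusters) and serve only to show how the descriptors behave with a bimodal distribution.

```python
from statistics import mean, median, multimode

# Hypothetical scores out of 20, forming two distinct clusters.
scores = [2, 3, 3, 3, 3, 15, 16, 16, 16, 16, 17]

average = mean(scores)
middle = median(scores)
modes = multimode(scores)
score_range = max(scores) - min(scores)

print("mean:", average)
print("mean achieved by any testee:", average in scores)
print("median:", middle)
print("modes:", modes)
print("range:", score_range)
```

Here the mean is 10, a score which no testee obtained and which lies in the empty region between the two clusters. The two modes (3 and 16) and the range (15) describe the shape of the distribution rather more faithfully.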

It is in correlational techniques that variability becomes of crucial importance, however. The effect of reducing the variability of scores on one or both of the measures being correlated is to reduce the size of the correlation coefficient. Lord & Novick (1968) have demonstrated that the correlation of a set of data is always smaller if its standard deviation is smaller (pp. 129-131). This is not to say that the interrelationship that the coefficient is attempting to measure is smaller if variability is low, but that the coefficient may underestimate the size of the relationship. Hence, a low coefficient may arise either because restricted variability has led to an underestimate or because the relationship itself is weak. There appears to be no way round this problem except to say that the relationship behind a high coefficient will be no lower, whereas the relationship behind a low coefficient may be higher.
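The effect of restricted variability on the correlation coefficient can be demonstrated with a minimal sketch. The data are constructed, not observed: a hypothetical criterion measure equals the test score plus a fixed alternating disturbance of +3 or -3, so that the underlying relationship is identical across the whole score range. The Pearson coefficient is then computed for all testees and for the high scorers only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical test scores 0..19 and a criterion measure that follows them
# with a disturbance of +3 (even scores) or -3 (odd scores).
test_scores = list(range(20))
criterion = [x + (3 if x % 2 == 0 else -3) for x in test_scores]

r_full = pearson(test_scores, criterion)

# Restrict the group to high scorers, as might occur after instruction.
pairs = [(x, y) for x, y in zip(test_scores, criterion) if x >= 14]
r_restricted = pearson([x for x, _ in pairs], [y for _, y in pairs])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

With these constructed values the coefficient falls from roughly 0.88 over the full range to roughly 0.28 among the high scorers, although the rule linking criterion to test score has not changed at all. The low coefficient reflects the narrow spread of scores, not a weaker relationship.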

Clearly, these statistical implications pose quite a number of problems in relation to the applicability of classical techniques in test construction. The use of a CRM model suggests either that modifications are made to existing techniques or that new techniques replace them. In particular, correlational methods need careful attention in the fields of item analysis, scrutiny and the estimation of reliability and validity, for one still needs such tools in CRM to assess the quality and usefulness of the test developed. The specific problems and proposed solutions are dealt with in ensuing chapters, in order to present them in the contexts in which they occur. Two other points should be raised, however. The question of unidimensional scaling was mentioned in the last chapter and needs to be settled here. The model (Rasch, 1960) is based on the assumption that all items are testing along the same dimension. It is very unlikely that this will be the case in SOFRP except at a very general level. It is more likely that a functional reading test will be multidimensional at the level required by latent-trait models. Further, current implementations of the Rasch model (e.g. Choppin, 1974) require the deletion of items answered correctly or incorrectly by all testees. This is not seen as appropriate to SOFRP, where the performance of pupils in relation to job-related reading materials is to be assessed independently of one another. For these reasons, the adoption of a latent-trait item analysis model is not considered correct.
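The reason such items must be deleted can be sketched under the usual statement of the Rasch model, in which the probability of a correct response depends only on the difference between the testee's ability and the item's difficulty. The ability values and candidate difficulties below are hypothetical, and the sketch is not a reconstruction of any particular implementation.

```python
from math import exp, log


def rasch_probability(theta, b):
    """P(correct) under the Rasch (one-parameter logistic) model for a testee
    of ability theta and an item of difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def log_likelihood_all_correct(thetas, b):
    """Log-likelihood of an item which every testee answers correctly."""
    return sum(log(rasch_probability(theta, b)) for theta in thetas)


thetas = [-1.0, 0.0, 1.0]             # hypothetical abilities
candidates = [0.0, -2.0, -4.0, -8.0]  # ever easier candidate difficulties

log_likelihoods = [log_likelihood_all_correct(thetas, b) for b in candidates]
for b, ll in zip(candidates, log_likelihoods):
    print(f"b = {b:5.1f}  log-likelihood = {ll:.4f}")
```

The log-likelihood keeps rising as the candidate difficulty is made lower, so no finite difficulty value fits the data best. An item answered correctly (or, by the same argument, incorrectly) by all testees therefore cannot be placed on the scale, which is why it is dropped in calibration.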

Finally, the whole construction of a criterion-referenced test hangs upon the demonstration of an adequate relationship between the criterion domain (job-related reading tasks) and the items making up the test (Dahl (1971); Rovinelli & Hambleton (1976)). It is essential, therefore, that the test has content validity, and this is discussed in Chapter 9. Further, the relationship of test scores to the criterion domain, predictive validity, is also essential to ensure a useful product, and this forms the substance of Chapters 13 to 15.

In conclusion, one must point out that the development of criterion-referenced tests has once again been almost exclusively an American affair, and as such is a product of the needs and solutions of that country's educational system. The wholesale testing of pupils for minimal competency at graduation is inextricably linked with the design of dichotomous, state model, mastery tests in all but a few cases. The emphasis of inquiry has been upon ways of making as few incorrect decisions as possible and upon the design of test instruments of controlled content, objectives and desired outcomes.

Much of the American work contrasts strongly with the needs and purposes of SOFRP, whilst still containing the germs of useful ideas. The content of any test produced is controlled, not by curriculum assessment needs or the prescriptive commands of a committee, but by the nature of discovered reading tasks. It is unlikely that any useful purpose can be served by a rigidly imposed dichotomy of 'acceptable-unacceptable'. It may, however, be the case that certain scores or ranges of scores can be empirically identified with levels of acceptable performance by job incumbents or with other indicators of criterion behaviour. As such a consideration is not linked with gaining a graduation diploma but with a teacher looking more or less closely at the reading skills of the testee, there is less emphasis needed on correct or incorrect decisions. It is surely better for the weight to be on the side of giving more help than is strictly necessary. Hence, scores used in a 'cut-off' sense need to be aimed higher than might be suggested by the empirical identification.

Concluding Remarks to the Introduction

The Sheffield Occupational Functional Reading Project was established to develop a criterion-referenced test to assess levels
