Obstacles That Keep Us from Benefiting from the Word

In this equation, the total variance in an observed distribution of test scores ($\sigma^2$) equals the sum of the true variance ($\sigma_{tr}^2$) plus the error variance ($\sigma_e^2$). The term reliability refers to the proportion of the total variance attributed to true variance.

The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests. Because error variance may increase or decrease a test score by varying amounts, consistency of the test score, and thus the reliability, can be affected.
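The proportion described above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the text: the scores, the errors, and all variable names are made up, and the errors are deliberately chosen to be uncorrelated with the true scores so that the two variances add exactly.

```python
from statistics import pvariance

# Hypothetical data: true scores and random errors for four testtakers.
# The errors average zero and are uncorrelated with the true scores,
# so true variance and error variance sum exactly to the total variance.
true_scores = [40.0, 40.0, 60.0, 60.0]
errors = [-2.0, 2.0, -2.0, 2.0]
observed = [t + e for t, e in zip(true_scores, errors)]

true_var = pvariance(true_scores)   # true variance
error_var = pvariance(errors)       # error variance
total_var = pvariance(observed)     # total variance of observed scores

# Reliability: proportion of total variance attributed to true variance.
reliability = true_var / total_var
print(true_var, error_var, total_var, round(reliability, 3))
# prints: 100.0 4.0 104.0 0.962
```

With real test data the true scores are never observed directly, so this ratio can only be estimated; the sketch simply makes the definition concrete.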

Let’s emphasize here that a systematic source of error would not affect score consistency. If a measuring instrument such as a weight scale consistently underweighed everyone who stepped on it by 5 pounds, then the relative standings of the people would remain unchanged. Of course, the recorded weights themselves would consistently vary from the true weight by 5 pounds. A scale underweighing all comers by 5 pounds is analogous to a constant being subtracted from (or added to) every test score. A systematic error source does not change the variability of the distribution or affect reliability.
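The weight-scale analogy can be checked numerically. The following is a small sketch under assumed values: the four weights and the helper name `rank_order` are hypothetical and serve only to show that subtracting a constant leaves both the variance and the relative standings unchanged.

```python
import math
from statistics import pvariance

# Hypothetical true weights (in pounds) for four people.
true_weights = [120.0, 150.0, 180.0, 200.0]
# A scale that underweighs everyone by a constant 5 pounds.
recorded = [w - 5.0 for w in true_weights]

def rank_order(values):
    """Return the indices of the values sorted from lightest to heaviest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# The variability of the distribution is unchanged by the constant error...
same_variance = math.isclose(pvariance(true_weights), pvariance(recorded))
# ...and so are the relative standings of the people.
same_ranks = rank_order(true_weights) == rank_order(recorded)
print(same_variance, same_ranks)
# prints: True True
```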

Sources of Error Variance

Sources of error variance include test construction, administration, scoring, and/or interpretation.

Test construction: One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. Consider two or more tests designed to measure a specific skill, personality attribute, or body of knowledge. Differences are sure to be found in the way the items are worded and in the exact content sampled. Each of us has probably walked into an achievement test setting thinking, “I hope they ask this question” or “I hope they don’t ask that question.” With luck, only the questions we wanted to be asked appeared on the examination. In such situations, a testtaker would achieve a higher score on one as opposed to another test purporting to measure the same thing. The higher score would be due to the specific content sampled, the way the items were worded, and so on. The extent to which a testtaker’s score is affected by the content sampled on the test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance. From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

JUST THINK . . .

What might be a source of systematic error inherent in all the tests an assessor administers in his or her private office?

Chapter 5: Reliability 141

Test administration: Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance. Examples of untoward influences during administration of a test include factors related to the test environment: the room temperature, the level of lighting, and the amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee’s face. A wad of gum on the seat of the chair may make itself known only after the testtaker sits down on it. Other environment-related variables include the instrument used to enter responses and even the writing surface on which responses are entered.

A pencil with a dull or broken point can hamper the blackening of little grids. The writing surface on a school desk may be riddled with heart carvings, the legacy of past years’ students who felt compelled to express their eternal devotion to someone now long forgotten.

Other potential sources of error variance during test administration are testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. A testtaker may, for whatever reason, make a mistake in entering a test response. For example, the examinee might blacken a “b” grid when he or she meant to blacken the “d” grid. An examinee may simply misread a test item. For example, an examinee might read the question “Which is not a source of error variance?” as “Which is a source of error variance?”

Other simple mistakes can have score-depleting consequences. Responding to the fifth item on a multiple-choice test, for example, the testtaker might blacken the grid for the sixth item. Just one skipped question will result in every subsequent test response being out of sync. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance.
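The cost of a single misaligned response can be illustrated with a short sketch. Everything here is hypothetical: the eight-item answer key, the assumption that the testtaker knew every answer, and the helper names `score` and `shift_from` are invented for the illustration.

```python
# Hypothetical answer key for an eight-item multiple-choice test.
answer_key = ["a", "c", "b", "d", "a", "b", "c", "d"]
# Responses the testtaker intended to give (all correct in this example).
intended = list(answer_key)

def score(sheet, key):
    """Count the grids whose marked response matches the key."""
    return sum(1 for marked, correct in zip(sheet, key) if marked == correct)

def shift_from(responses, item_number):
    """Simulate entering the response to `item_number` (1-based) in the
    next item's grid: one grid is left blank, every later response lands
    one grid too far down, and the final response falls off the sheet."""
    i = item_number - 1
    return responses[:i] + [None] + responses[i:-1]

misaligned = shift_from(intended, 5)
print(score(intended, answer_key), score(misaligned, answer_key))
# prints: 8 4
```

In this made-up case the testtaker's knowledge is identical in both runs; the four lost points come entirely from the entry mistake.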

Examiner-related variables are potential sources of error variance. The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test.

On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. They might convey information about the correctness of a response through head nodding, eye movements, or other nonverbal gestures. Clearly, the level of professionalism exhibited by examiners is a source of error variance.

Test scoring and interpretation: The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences in many tests. However, not all tests can be scored from grids blackened by No. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, and countless other tests still require hand scoring by trained personnel.

Manuals for individual intelligence tests tend to be very explicit about scoring criteria lest examinees’ measured intelligence vary as a function of who is doing the testing and scoring. In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses. In one test of creativity, examinees might be given the task of creating as many things as they can out of a set of blocks. Here, it is the examiner’s task to determine which block constructions will be awarded credit and which will not. For a behavioral measure of social skills in an inpatient psychiatric service, the scorers or raters might be asked to rate patients with respect to the variable “social relatedness.” Such a behavioral measure might require the rater to check yes or no to items like Patient says “Good morning” to at least two staff members.

Cohen−Swerdlik:

142 Part 2: The Science of Psychological Measurement

Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch contaminating the data remains possible. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally still are confronted by situations where an examinee’s response lies in a gray area. The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations). Subjectivity in scoring can even enter into behavioral assessment. Consider the case of two behavior observers given the task of rating one psychiatric inpatient on the variable of “social relatedness.” On an item that asks simply whether two staff members were greeted in the morning, one rater might judge the patient’s eye contact and mumbling of something to two staff members to qualify as a yes response. The other observer might feel strongly that a no response to the item is appropriate. Such problems in scoring agreement can be addressed through rigorous training designed to make the consistency (or reliability) of various scorers as nearly perfect as can be.
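One way to put a number on scoring agreement between two raters is sketched below. The text does not name a particular index; simple percent agreement and Cohen's kappa (agreement corrected for chance) are used here as common choices, and the yes/no ratings and function names are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same rating."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance.
    Undefined (division by zero) when chance agreement equals 1."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no ratings by two observers on eight checklist items.
rater_a = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

print(percent_agreement(rater_a, rater_b), round(cohens_kappa(rater_a, rater_b), 3))
# prints: 0.75 0.467
```

The two observers agree on six of eight items, but because each says yes most of the time, a fair share of that agreement would be expected by chance alone, which is why kappa is noticeably lower than the raw percentage.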

Other sources of error: Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship. As Moffitt et al. (1997) observed, “Because partner abuse usually occurs in private, there are only two persons who ‘really’ know what goes on behind closed doors: the two members of the couple” (p. 47).

Potential sources of nonsystematic error in such an assessment situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting. A number of studies (O’Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested underreporting or overreporting of perpetration of abuse also may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors and overreport abuse if they are seeking help. Males may underreport abuse because of embarrassment and social desirability factors and overreport abuse if they are attempting to justify the report.

JUST THINK . . .

Can you conceive of a test item on a rating scale requiring human judgment that all raters will score the same 100% of the time?

JUST THINK . . .

Recall your score on the most recent test you took. What percentage of that score do you think represented your “true” ability, and what percentage of that score was represented by error? Now, hazard a guess as to what type(s) of error were involved.

Just as the amount of abuse one partner suffers at the hands of the other may never be known, so the amount of test variance that is true relative to error may never be known. A so-called true score, as Stanley (1971, p. 361) put it, is “not the ultimate fact in the book of the recording angel.” Further, the utility of the methods used for estimating true versus error variance is a hotly debated matter (see, e.g., Collins, 1996; Humphreys, 1996; Williams & Zimmerman, 1996a, 1996b). Let’s take a closer look at such estimates and how they are derived.
