In addition to the Spearman-Brown formula, other methods used to obtain estimates of internal consistency reliability include formulas developed by Kuder and Richardson (1937) and Cronbach (1951). Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait. As a term used to describe test items, homogeneity (derived from the Greek words homos, meaning “same,” and genos, meaning “kind”) is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.
In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that assesses knowledge only of color television repair skills could be expected to be more homogeneous in content than a test of electronic repair. The former test assesses only one area while the latter assesses several, such as knowledge not only of televisions but also of digital video recorders, Blu-Ray players, MP3 players, satellite radio receivers, and so forth.
The more homogeneous a test is, the more inter-item consistency it can be expected to have: a homogeneous test samples a relatively narrow content area, so its items tend to correlate more strongly with one another than the items of a heterogeneous test do. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation. Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested. Testtakers with the same score on a more heterogeneous test may have quite different abilities.
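One simple way to make the idea of inter-item consistency concrete is to average the Pearson correlations over every pair of items. The Python sketch below does this for two small data sets. The function names and both data sets are hypothetical and chosen only for illustration; the text does not prescribe this particular index.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def mean_inter_item_r(items):
    """Average Pearson r over every pair of items.

    items[i][j] is testtaker j's score on item i.
    """
    pairs = list(combinations(items, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: three items, five testtakers each (not from the text).
homogeneous = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [1, 3, 3, 5, 5],
]
heterogeneous = [
    [1, 2, 3, 4, 5],
    [3, 1, 5, 2, 4],
    [4, 2, 1, 5, 3],
]

print(round(mean_inter_item_r(homogeneous), 3))
print(round(mean_inter_item_r(heterogeneous), 3))
```

In this toy example the items of the first set rise and fall together, so their average correlation is high, whereas the items of the second set are nearly unrelated on average.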
Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential
source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable. 5
The Kuder-Richardson formulas
Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability. The most widely known of the many formulas they collaborated on is their Kuder-Richardson formula 20, or KR-20, so named because it was the twentieth formula developed in a series. Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Table 5–2 summarizes items on a sample heterogeneous test (the HERT), and Table 5–3 summarizes HERT performance for 20 testtakers. Assuming the difficulty level of all the items on the test to be about the same, would you expect a split-half (odd-even) estimate of reliability to be fairly high or low? How would the KR-20 reliability estimate compare with the odd-even estimate of reliability? Would it be higher or lower?
We might guess that, because the content areas sampled for the 18 items from this “Hypothetical Electronics Repair Test” are ordered in a manner whereby odd and even items tap the same content area, the odd-even reliability estimate will probably be quite high. Because of the great heterogeneity of content areas when taken as a whole, it could reasonably be predicted that the KR-20 estimate of reliability will be lower than the odd-even one. How is KR-20 computed? The following formula may be used:
5. As we will see elsewhere throughout this textbook, important decisions are seldom made on the basis of one test only. Psychologists frequently rely on a test battery (a selected assortment of tests and assessment procedures) in the process of evaluation. A test battery is typically composed of tests designed to measure different variables.
Table 5–2
Content Areas Sampled for 18 Items of the Hypothetical Electronics Repair Test (HERT)

Table 5–3
Performance on the 18-Item HERT by Item for 20 Testtakers

Item Number    Number of Testtakers Correct
1              14
$$ r_{KR20} = \left(\frac{k}{k-1}\right)\left(1 - \frac{\sum pq}{\sigma^2}\right) $$

where $r_{KR20}$ stands for the Kuder-Richardson formula 20 reliability coefficient, $k$ is the number of test items, $\sigma^2$ is the variance of total test scores, $p$ is the proportion of testtakers who pass the item, $q$ is the proportion of people who fail the item, and $\sum pq$ is the sum of the $pq$ products over all items. For this particular example, $k$ equals 18. Based on the data in Table 5–3, $\sum pq$ can be computed to be 3.975. The variance of total test scores is 5.26. Thus, $r_{KR20} = (18/17)(1 - 3.975/5.26) = .259$.
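The arithmetic of the HERT example can be checked with a short Python sketch. The first function works from the three summary values reported in the text; the second works from a matrix of right/wrong (1/0) responses. The function names and the small response matrix are hypothetical, and the second function assumes the total-score variance is computed by dividing by the number of testtakers, which the text does not specify.

```python
def kr20_from_summary(k, sum_pq, total_variance):
    """KR-20 from the item count, the sum of p*q, and the total-score variance."""
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr20_from_responses(responses):
    """KR-20 from a 0/1 matrix; responses[j][i] is testtaker j's score on item i."""
    n = len(responses)
    k = len(responses[0])
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion passing item i
        sum_pq += p * (1 - p)                     # q = 1 - p
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Assumption of this sketch: variance computed with divisor n.
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n
    return kr20_from_summary(k, sum_pq, total_variance)


# Summary values reported in the text for the HERT example.
print(round(kr20_from_summary(18, 3.975, 5.26), 3))  # 0.259

# Hypothetical 3-item test taken by four testtakers (not from the text).
toy_responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20_from_responses(toy_responses), 3))  # 0.75
```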
An approximation of KR-20 can be obtained by the use of the twenty-first formula in the series developed by Kuder and Richardson, a formula known as, you guessed it, KR-21. The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty. Let’s add that this assumption is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.
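The text does not reproduce the KR-21 formula. The Python sketch below assumes its commonly cited form, which replaces the item-by-item sum of $pq$ with a quantity based only on the mean $M$ of the total scores: $(k/(k-1))\,(1 - M(k-M)/(k\sigma^2))$. The function names and all numeric values are hypothetical and serve only to show why fewer calculations are needed and what happens when item difficulties differ.

```python
def kr20(k, sum_pq, total_variance):
    """KR-20 from the item count, the sum of p*q, and the total-score variance."""
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr21(k, mean_total, total_variance):
    """KR-21 (assumed standard form): needs only k and the mean and variance of totals."""
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * total_variance))


# Hypothetical 3-item test whose item difficulties differ (p = .75, .50, .25):
# sum of p*q = 0.625, mean total score = 1.5, total-score variance = 1.25.
print(round(kr20(3, 0.625, 1.25), 3))  # 0.75
print(round(kr21(3, 1.5, 1.25), 3))    # 0.6

# Hypothetical 4-item test in which every item has p = .50:
# sum of p*q = 1.0, mean total score = 2.0; the variance of 2.0 is arbitrary.
print(round(kr20(4, 1.0, 2.0), 3))  # 0.667
print(round(kr21(4, 2.0, 2.0), 3))  # 0.667
```

Under these assumptions the two formulas agree when all items are equally difficult, and KR-21 falls below KR-20 when difficulties vary.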
Numerous modifications of Kuder-Richardson formulas have been proposed through the years. The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α-20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.
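Coefficient alpha
Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967), coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items. The formula for coefficient alpha is

$$ r_{\alpha} = \left(\frac{k}{k-1}\right)\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right) $$

where $r_{\alpha}$ is coefficient alpha, $k$ is the number of items, $\sigma_i^2$ is the variance of one item, $\sum \sigma_i^2$ is the sum of the variances of the items, and $\sigma^2$ is the variance of the total test scores.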
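A minimal Python sketch of coefficient alpha follows, assuming the standard form of the formula: the sum of the item variances divided by the variance of the total scores, subtracted from 1 and scaled by $k/(k-1)$. The function names and both data sets are hypothetical. The second data set uses right/wrong items to illustrate that, for dichotomous items, each item variance equals $pq$, so alpha and KR-20 give the same value.

```python
def variance(values):
    """Variance with divisor n. The same divisor is used for items and totals,
    so the choice of n versus n - 1 cancels in the ratio below."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def coefficient_alpha(scores):
    """Coefficient alpha; scores[j][i] is testtaker j's score on item i."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings on a 1-to-5 scale: four testtakers, three items.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
]
print(round(coefficient_alpha(ratings), 3))  # 0.989

# Hypothetical right/wrong items: alpha equals the KR-20 value for these data.
binary = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(coefficient_alpha(binary), 3))  # 0.75
```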
Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003). Essentially, this formula yields an estimate of the mean of all possible test-retest, split-half coefficients. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Cohen−Swerdlik: Psychological Testing and Assessment: An Introduction to Tests and Measurement, Seventh Edition. II. The Science of Psychological Measurement. 5. Reliability. © The McGraw−Hill Companies, 2010
Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are. Here, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). It is possible, however, to conceive of data sets that would yield a negative value of alpha (Streiner, 2003b). Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001). Also, a myth about alpha is that “bigger is always better.” As Streiner (2003b) pointed out, a value of alpha above .90 may be “too high” and indicate redundancy in the items.
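To see how a data set can produce a negative alpha, consider two items whose scores run in opposite directions. The Python sketch below computes alpha for such a case and then applies the reporting rule described above by flooring the value at zero. The function names and the data are hypothetical, and alpha is computed with the standard formula, which is an assumption of this sketch.

```python
def variance(values):
    """Variance with divisor n (the divisor cancels in the alpha ratio)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def coefficient_alpha(scores):
    """Coefficient alpha; scores[j][i] is testtaker j's score on item i."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def reported_alpha(scores):
    """Alpha floored at zero, following the recommendation cited in the text."""
    return max(0.0, coefficient_alpha(scores))


# Hypothetical data: as scores on the first item rise, scores on the second fall.
opposed = [[1, 4], [2, 3], [3, 2], [4, 2]]

print(coefficient_alpha(opposed) < 0)  # True
print(reported_alpha(opposed))         # 0.0
```

Because the two items offset each other, the total scores vary far less than the individual items do, which drives the ratio in the formula above 1 and the coefficient below zero.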
In contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating “perfect dissimilarity.” In practice, most reliability coefficients, regardless of the specific type of reliability they are measuring, range in value from 0 to 1. This is generally true, although it is possible to conceive of exceptional cases in which data sets yield an r with a negative value.
Before proceeding, let’s emphasize that all indexes of reliability, coefficient alpha among them, provide an index that is a characteristic of a particular group of test scores, not of the test itself (Caruso, 2000; Yin & Fan, 2000). Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with the sample of testtakers from which the data were drawn. A reliability index published in a test manual might be very impressive. However, it should be kept in mind that the reported reliability was achieved with a particular group of testtakers. If a new group of testtakers is sufficiently different from the group of testtakers on whom the reliability studies were done, the reliability coefficient may not be as impressive and may even be unacceptable.