Gill et al., 2007), intervention strategies (Hwang, 2006; Podymow et al., 2006), and costs associated with alcohol consumption (Farrell, 1998). Regardless of whether it is beer, wine, liquor, or any other alcoholic beverage, reference to a “standard drink” immediately conveys information to the knowledgeable researcher about the amount of alcohol in the beverage.
The verb “to standardize” refers to making or transforming something into something that can serve as a basis of comparison or judgment. One may speak, for example, of the efforts of researchers to standardize an alcoholic beverage that contains 15 milliliters of alcohol as a “standard drink.” For many of the variables commonly used in assessment studies, there is an attempt to standardize a definition. As an example, Anderson (2007) sought to standardize exactly what is meant by “creative thinking.”
Well known to any student who has ever taken a nationally administered achievement test or college admission examination is the standardizing of tests. But what does it mean to say that a test is “standardized”? Some “food for thought” regarding an answer to this deceptively simple question can be found in Figure 1.
Test developers standardize tests by developing replicable procedures for administering the test and for scoring and interpreting the test. Also part of standardizing a test is developing norms for the test. Well, not necessarily . . . whether or not norms for the test must be developed in order for the test to be deemed “standardized” is debatable. It is true that almost any “test” that has clearly specified procedures for administration, scoring, and interpretation can be considered “standardized.” So even Ben the deli guy’s CCPT (described in Figure 1) might be deemed a “standardized test” according to some. This is so because the test is “standardized” to the extent that the “test items” are clearly specified (presumably along with “rules” for “administering” them and rules for “scoring and interpretation”). Still, many assessment professionals would hesitate to refer to Ben’s CCPT as a “standardized test.” Why?
Traditionally, assessment professionals have reserved the term standardized test for those tests that have clearly specified procedures for administration, scoring, and interpretation in addition to norms. Such tests also come with manuals that are as much a part of the test package as the test’s items. The test manual, which may be published in one or more booklets, will ideally provide potential test users with all of the information they need to use the test in a responsible fashion. The test manual enables the test user to administer the test in the “standardized” manner in which it was designed to be administered; all test users should be able to replicate the test administration as prescribed by the test developer. Ideally, there will be little deviation from examiner to examiner in the way that a standardized test is administered, owing to the rigorous preparation and training that all potential users of the test have undergone prior to administering the test to testtakers.
If a standardized test is designed for scoring by the test user (in contrast to computer scoring), the test manual will ideally contain detailed scoring guidelines. If the test is one of ability that has correct and incorrect answers, the manual will ideally contain an ample number of examples of correct, incorrect, or partially correct responses, complete with scoring guidelines. In like fashion, if it is a test that measures personality, interest, or any other variable that is not scored as correct or incorrect, then ample examples of potential responses will be provided along with complete scoring guidelines. We would also expect the test manual to contain detailed guidelines for interpreting the test results, including samples of both appropriate and inappropriate generalizations from the findings.
Also from a traditional perspective, we think of standardized tests as having undergone a standardization process. Conceivably, the term standardization could be applied to “standardizing” all the elements of a standardized test that need to be standardized. Thus, for a standardized test of leadership, we might speak of standardizing the definition of leadership, standardizing test administration instructions, standardizing test scoring, standardizing test interpretation, and so forth. Indeed, one definition of standardization as applied to tests is “the process employed to introduce objectivity and uniformity into test administration, scoring and interpretation” (Robertson, 1990, p. 75). Another and perhaps more typical use of standardization, however, is reserved for that part of the test development process during which norms are developed. It is for this very reason that the term test standardization has been used interchangeably by many test professionals with the term test norming.
Assessment professionals develop and use standardized tests to benefit testtakers, test users, and/or society at large. Although there is conceivably some benefit to Ben in gathering data on the frequency of orders for a pound or two of bratwurst, this type of data gathering does not require a “standardized test.” So, getting back to Ben’s CCPT . . . although there are some writers who would staunchly defend the CCPT as a “standardized test” (simply because any two questions with clearly specified guidelines for administration and scoring would make the “cut”), practically speaking this is simply not the case from the perspective of most assessment professionals.
There are a number of other ambiguities in psychological testing and assessment when it comes to the use of the word standard and its derivatives. Consider, for example, the term standard score. Some test manuals and books reserve the term standard score for use with reference to z scores. Raw scores (as well as z scores) linearly transformed to any other type of standard scoring system (that is, transformed to a scale with an arbitrarily set mean and standard deviation) are differentiated from z scores by the term standardized. For these authors, a z score would still be referred to as a “standard score” whereas a T score, for example, would be referred to as a “standardized score.”
Certainly with regard to this word’s use in the context of psychological testing and assessment, what is presented as “standard” usually turns out to be not as standard as we might expect.
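The difference between a z score and a linearly transformed score such as a T score can be illustrated with a short sketch. The Python below is an illustration only: the raw scores are invented, the function names are hypothetical, and it assumes the usual conventions of a z scale (mean 0, standard deviation 1) and a T scale (mean 50, standard deviation 10).

```python
from statistics import mean, pstdev

def z_scores(raw_scores):
    """Convert raw scores to z scores relative to this group of scores."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: the group is treated as the whole population
    return [(x - m) / sd for x in raw_scores]

def rescale(z, new_mean, new_sd):
    """Linearly transform z scores to a scale with an arbitrarily set mean and SD."""
    return [new_mean + new_sd * value for value in z]

raw = [40, 45, 50, 55, 60]               # hypothetical raw scores
z = z_scores(raw)                        # "standard scores" in the narrow sense
t = rescale(z, new_mean=50, new_sd=10)   # T scores, "standardized scores" for some authors

for r, zi, ti in zip(raw, z, t):
    print(f"raw={r}  z={zi:+.2f}  T={ti:.1f}")
```

Because the transformation is linear, each testtaker keeps the same relative position on both scales; only the mean and standard deviation of the scale change.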
For the purpose of tackling another “nonstandard” use of the word standard, let’s digress for just a moment to images of the great American pastime of baseball. Imagine, for a moment, all of the different ways that players can be charged with an error. There really isn’t one type of error that could be characterized as standard in the game of baseball. Now, back to psychological testing and assessment, where there also isn’t just one variety of error that could be characterized as “standard.” No, there isn’t one . . . there are lots of them! One speaks, for example, of the standard error of measurement (also known as the standard error of a score), the standard error of estimate (also known as the standard error of prediction), the standard error of the mean, and the standard error of the difference. A table briefly summarizing the main differences between these terms is presented here, although they are discussed in greater detail elsewhere in this book.
| Type of “Standard Error” | What Is It? |
| --- | --- |
| Standard error of measurement | A statistic used to estimate the extent to which an observed score deviates from a true score |
| Standard error of estimate | In regression, an estimate of the degree of error involved in predicting the value of one variable from another |
| Standard error of the mean | A measure of sampling error |
| Standard error of the difference | A statistic used to estimate how large a difference between two scores should be before the difference is considered statistically significant |
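The table gives only verbal definitions. As a rough sketch of how each statistic is conventionally computed, the Python below uses the formulas commonly given in measurement texts; the formulas, function names, and numeric values are supplied here for illustration and are not taken from the table itself. The standard error of estimate is shown in its simple large-sample form.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SD of the test scores times the square root of (1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

def standard_error_of_estimate(sd_y, r_xy):
    """SD of the predicted variable times the square root of (1 - r squared)."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

def standard_error_of_the_mean(sd, n):
    """SD of the scores divided by the square root of the sample size."""
    return sd / math.sqrt(n)

def standard_error_of_the_difference(sem_1, sem_2):
    """Square root of the sum of the two squared standard errors of measurement."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)

# Hypothetical values chosen only to make the arithmetic easy to follow.
print(standard_error_of_measurement(sd=15, reliability=0.91))
print(standard_error_of_estimate(sd_y=10, r_xy=0.6))
print(standard_error_of_the_mean(sd=15, n=100))
print(standard_error_of_the_difference(sem_1=3, sem_2=4))
```

All four share the word standard, yet each answers a different question and is computed from different inputs.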
We conclude by encouraging the exercise of critical thinking upon encountering the word standard. The next time you encounter the word standard in any context, give some thought to how standard that “standard” really is.
Manufacturers of products frequently use purposive sampling when they test the appeal of a new product in one city or market and then make assumptions about how that product would sell nationally. For example, the manufacturer might test a product in a market such as Cleveland because, on the basis of experience with this particular product, “how goes Cleveland, so goes the nation.” The danger in using such a purposive sample is that the sample, in this case Cleveland residents, may no longer be
Cohen−Swerdlik:
Psychological Testing and Assessment: An Introduction to Tests and Measurement, Seventh Edition
II. The Science of Psychological Measurement
4. Of Tests and Testing
128 © The McGraw−Hill
Companies, 2010
representative of the nation. Alternatively, this sample may simply not be representative of national preferences with regard to the particular product being test-marketed.
Often, a test user’s decisions regarding sampling wind up pitting what is ideal against what is practical. It may be ideal, for example, to use 50 chief executive officers from any of the Fortune 500 companies (that is, the top 500 companies in terms of revenue) as a sample in an experiment. However, conditions may dictate that it is practical for the experimenter only to use 50 volunteers recruited from the local Chamber of Commerce. This important distinction between what is ideal and what is practical in sampling brings us to a discussion of what has been referred to variously as an incidental sample or a convenience sample.
Ever hear the old joke about a drunk searching for money he lost under the lamppost? He may not have lost his money there, but that is where the light is. Like the drunk searching for money under the lamppost, a researcher may sometimes employ a sample that is not necessarily the most appropriate but is rather the most convenient.
Unlike the drunk, the researcher employing this type of sample is not doing so as a result of poor judgment but because of budgetary limitations or other constraints. An incidental sample or convenience sample is one that is convenient or available for use.
You may have been a party to incidental sampling if you have ever been placed in a subject pool for experimentation with introductory psychology students. It’s not that the students in such subject pools are necessarily the most appropriate subjects for the experiments; it’s just that they are the most available. Generalization of findings from incidental samples must be made with caution.
If incidental or convenience samples were clubs, they would not be considered very exclusive clubs. By contrast, there are many samples that are exclusive, in a sense, since they contain many exclusionary criteria. Consider, for example, the group of children and adolescents who served as the normative sample for one well-known children’s intelligence test. The sample was selected to reflect key demographic variables representative of the U.S. population according to the latest available census data. Still, some groups were deliberately excluded from participation. Who?
■ persons tested on any intelligence measure in the six months prior to the testing
■ persons not fluent in English or primarily nonverbal
■ persons with uncorrected visual impairment or hearing loss
■ persons with upper-extremity disability that affects motor performance
■ persons currently admitted to a hospital or mental or psychiatric facility
■ persons currently taking medication that might depress test performance
■ persons previously diagnosed with any physical condition or illness that might depress test performance (such as stroke, epilepsy, or meningitis)
Our general description of the norming process for a standardized test continues in what follows and, to varying degrees, in subsequent chapters. A highly recommended way to supplement this study and gain a great deal of firsthand knowledge about norms for intelligence tests, personality tests, and other tests is to peruse the technical manuals of major standardized instruments. By going to the library and consulting a few of these manuals, you will discover not only the “real life” way that normative samples are described but also the many varied ways that normative data can be presented.
JUST THINK . . .
Why do you think each of these groups of people was excluded from the standardization sample of a nationally standardized intelligence test?
Developing norms for a standardized test
Having obtained a sample, the test developer administers the test according to the standard set of instructions that will be used with the test. The test developer also describes the recommended setting for giving the test. This may be as simple as making sure that the room is quiet and well lit or as complex as providing a specific set of toys to test an infant’s cognitive skills. Establishing a standard set of instructions and conditions under which the test is given makes the test scores of the normative sample more comparable with the scores of future testtakers. For example, if a test of concentration ability is given to a normative sample in the summer with the windows open near people mowing the grass and arguing about whether the hedges need trimming, then the normative sample probably won’t concentrate well. If a testtaker then completes the concentration test under quiet, comfortable conditions, that person may well do much better than the normative group, resulting in a high standard score. That high score would not be very helpful in understanding the testtaker’s concentration abilities because it would reflect the differing conditions under which the tests were taken. This example illustrates how important it is that the normative sample take the test under a standard set of conditions, which are then replicated (to the extent possible) on each occasion the test is administered.
After all the test data have been collected and analyzed, the test developer will summarize the data using descriptive statistics, including measures of central tendency and variability. In addition, it is incumbent on the test developer to provide a precise description of the standardization sample itself. Good practice dictates that the norms be developed with data derived from a group of people who are presumed to be representative of the people who will take the test in the future. In order to best assist future users of the test, test developers are encouraged to “describe the population(s) represented by any norms or comparison group(s), the dates the data were gathered, and the process used to select the samples of testtakers” (Code of Fair Testing Practices in Education, 1988, p. 3).
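As a minimal sketch of what summarizing normative data with descriptive statistics might look like, the Python below computes measures of central tendency and variability for a small, invented set of raw scores and then locates one raw score within that sample. The scores and function names are hypothetical, and percentile rank is defined here as the percentage of the sample scoring below the given score, which is only one of several conventions in use.

```python
from statistics import mean, median, stdev

def describe(scores):
    """Summarize a normative sample: size, central tendency, and variability."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "sd": stdev(scores),  # sample standard deviation (n - 1 in the denominator)
    }

def percentile_rank(scores, raw_score):
    """Percentage of the normative sample scoring below raw_score (one convention)."""
    below = sum(1 for s in scores if s < raw_score)
    return 100 * below / len(scores)

norm_sample = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]  # hypothetical raw scores
summary = describe(norm_sample)
print(summary)
print(percentile_rank(norm_sample, 23))
```

A real normative sample would be far larger, and the numbers would mean little without the accompanying description of who was tested, when, and how they were selected.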
In practice, descriptions of normative samples vary widely in detail. Not surprisingly, test authors wish to present their tests in the most favorable light possible. Accordingly, shortcomings in the standardization procedure or elsewhere in the process of the test’s development may be given short shrift or totally overlooked in a test’s manual. Sometimes, although the sample is scrupulously defined, the generalizability of the norms to a particular group or individual is questionable. For example, a test carefully normed on school-age children who reside within the Los Angeles school district may be relevant only to a lesser degree to school-age children who reside within the Dubuque, Iowa, school district. How many children in the standardization sample were English speaking? How many were of Hispanic origin? How does the elementary school curriculum in Los Angeles differ from the curriculum in Dubuque? These are the types of questions that must be raised before the Los Angeles norms are judged to be generalizable to the children of Dubuque. Test manuals sometimes supply prospective test users with guidelines for establishing local norms (discussed shortly), one of many different ways norms can be categorized.
One note on terminology is in order before moving on. When the people in the normative sample are the same people on whom the test was standardized, the phrases normative sample and standardization sample are often used interchangeably. Increasingly, however, new norms for standardized tests for specific groups of testtakers are developed some time after the original standardization. That is, the test remains standardized based on data from the original standardization sample; it’s just that new normative data are developed based on an administration of the test to a new normative sample. Included in this new normative sample may be groups of people who were
underrepresented in the original standardization sample data. For example, if there had been a large influx of potential testtakers from the Czech Republic since original standardization, the new normative sample might well include a sample of Czech Republic nationals. In such a scenario, the normative sample for the new norms clearly would not be identical to the standardization sample, so it would be inaccurate to use the terms standardization sample and normative sample interchangeably.