Quality and Improvement of Educational Outcomes


How can you tell if a procedure for measuring behavior is any good? What accounts for the confidence with which psychologists use preferential looking, reaction time, IQ tests, surveys of burnout, blood pressure tests, and so on? Answering the question requires a discussion of two key factors: reliability and validity.

Reliability

In general, a measure of behavior is said to be reliable if its results are repeatable when the behaviors are remeasured. Reaction time is a good example; its high reliability is one reason for its popularity over the years. Someone responding to a red light in 0.18 second (18 hundredths of a second, or about one-fifth of a second) on one trial will almost certainly respond with just about the same speed on other trials, and practically all of that person’s trials will be in the general vicinity of 0.18 second. Similarly, scores on the SAT are reasonably reliable. Someone with a combined score of 1100 would probably score close to that a second time and would be unlikely to reach a score like 1800.

From these two examples, you can see why reliability is essential in any measure. Without it, there is no way to determine what a score on a particular measure means. Presumably, in reaction time you’re trying to determine how fast someone is. If the reaction times vary wildly, there is no way to determine whether the person is fast or slow. Likewise, if GRE scores bounced 400 or 500 points from one testing session to another, the numbers would be of no use whatsoever to graduate schools because they would have no way of estimating the student’s true score.

Figure 4.3 Reaction time study in progress at Clark University, circa 1892. The response will be made by releasing a telegraph key with the right hand as quickly as possible when the stimulus is seen through the tube.

Courtesy Clark University Archives


A behavioral measure’s reliability is a direct function of the amount of measurement error present. If there is a great deal of error, reliability is low, and vice versa. No behavioral measure is perfectly reliable, so some degree of measurement error occurs with all measurement. That is, every measure is a combination of a hypothetical true score plus some measurement error. Ideally, measurement error is low enough so the observed score is close to the true score.
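The relation between observed score, true score, and error can be written compactly. The symbols below come from classical test theory and are an assumption added here, not notation used in the text: X is the observed score, T the hypothetical true score, and E the measurement error. The variance-ratio expression for reliability additionally assumes that error is uncorrelated with the true score.

```latex
X = T + E,
\qquad
\text{reliability} = \frac{\sigma_T^{2}}{\sigma_X^{2}}
                   = \frac{\sigma_T^{2}}{\sigma_T^{2} + \sigma_E^{2}}
```

Read this way, "a great deal of error" means a large error variance, which pushes the ratio toward 0, while little error pushes it toward 1.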

The reaction time procedure provides a good illustration of how measurement error works and how it affects reliability. As in the earlier example, suppose a person takes 0.18 second on a reaction time trial. Is this the true measure of speed? Probably not, a conclusion easily reached when you notice that for the following five hypothetical trials this same person’s reaction times are:

0.16 sec 0.17 sec 0.19 sec 0.17 sec 0.19 sec

These scores vary (slightly) because some degree of measurement error contributes to each trial. This error is caused by several possible factors, some of which operate randomly from trial to trial. For example, on a particular trial the person might respond faster than the true score by guessing the stimulus was about to be presented or slower because of a momentary lapse of attention. Also, a systematic amount of error could occur if, for example, the experimenter signaled the participants to get ready just before turning on the stimulus, and the amount of time between the ready signal and the stimulus was the same from trial to trial. Then the participants could learn to anticipate the stimulus and produce reaction times systematically faster than true ones.
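The difference between random and systematic error can be made concrete with a small simulation. This is only a sketch under stated assumptions: the true reaction time of 0.18 second echoes the running example, but the size of the random error, the size of the anticipation effect, and the helper name `simulate_trials` are all invented for illustration.

```python
import random
import statistics

TRUE_RT = 0.18        # hypothetical true reaction time, in seconds
RANDOM_SD = 0.01      # assumed spread of trial-to-trial random error
ANTICIPATION = -0.03  # assumed constant shift from a predictable ready signal


def simulate_trials(n_trials, systematic_error=0.0, seed=0):
    """Return observed scores: true score + random error + systematic error."""
    rng = random.Random(seed)
    return [
        TRUE_RT + rng.gauss(0.0, RANDOM_SD) + systematic_error
        for _ in range(n_trials)
    ]


random_only = simulate_trials(1000)
with_bias = simulate_trials(1000, systematic_error=ANTICIPATION)

# Random error tends to cancel out across many trials, so the mean stays
# near the true score; systematic error shifts every trial the same way.
print(f"random error only:    mean = {statistics.mean(random_only):.3f} s")
print(f"with anticipation:    mean = {statistics.mean(with_bias):.3f} s")
```

Averaging over many trials helps with random error but does nothing about the systematic shift, which is why a fixed interval between the ready signal and the stimulus is a problem no matter how many trials are run.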

Despite the presence of a small degree of measurement error, the above scores do cluster together pretty well, and the reaction times certainly would be judged more reliable than if the scores following the 0.18-second one were these:

0.11 sec 0.21 sec 0.19 sec 0.08 sec 0.33 sec

With scores ranging from less than one-tenth of a second to one-third of a second, it is difficult to say what the person’s real speed is.

When scores are reliable, therefore, the researcher can assign some meaning to their magnitude. Reliability also allows the researcher to make more meaningful comparisons with other sets of scores. For example, comparing the first set of scores above (0.16, 0.17, etc.) with the ones below reveals a clear difference in basic speed of response:

0.23 sec 0.25 sec 0.21 sec 0.22 sec 0.24 sec

It is fair to say the true reaction time of this second person is slower than that of the person described earlier.
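The three sets of trials above can be summarized numerically. The sketch below uses the reaction times given in the text; treating the standard deviation as a rough index of consistency is an assumption made here for illustration, not a formal reliability coefficient, and the labels are invented.

```python
import statistics

# Reaction times (in seconds) copied from the three sets of trials in the text.
trials = {
    "person 1, consistent": [0.16, 0.17, 0.19, 0.17, 0.19],
    "person 1, erratic": [0.11, 0.21, 0.19, 0.08, 0.33],
    "person 2, consistent": [0.23, 0.25, 0.21, 0.22, 0.24],
}

# Mean estimates the person's typical speed; the sample standard deviation
# shows how tightly the trials cluster around that mean.
summary = {
    label: (statistics.mean(times), statistics.stdev(times))
    for label, times in trials.items()
}

for label, (mean, sd) in summary.items():
    print(f"{label}: mean = {mean:.3f} s, SD = {sd:.3f} s")
```

The erratic set has a mean close to that of the first set but a spread several times larger, which is why its mean says little about the person’s real speed. The two consistent sets have similarly small spreads, so the gap between their means can reasonably be read as a difference in basic speed.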

There are ways of calculating reliability, but this is seldom done in experimental research. Rather, confidence in the reliability of a measure develops over time, a benefit of the replication process. For example, the habituation and reaction time procedures have been used often enough and yielded consistent enough results for researchers to be highly confident about their reliability.

Reliability is assessed more formally in research that evaluates the adequacy of any type of psychological test. These are instruments designed to measure such constructs as personality factors (e.g., extroversion), abilities (e.g., intelligence), and attitudes (e.g., political beliefs). They are usually paper-and-pencil tests in which a person responds to questions or statements. In the study mentioned earlier on burnout and vacations, participants filled out several self-report measures, including one called the BI, or Burnout Index (Westman & Eden, 1997). Analyses designed to establish the reliability of this kind of test require correlational statistical procedures. For example, the test could be given on two occasions and the similarity of the two sets of results could be determined. Unless dramatic changes are taking place in the participant’s life, the scores on two measurements with the BI should be similar. The degree of similarity is expressed in terms of a correlation (high similarity = strong correlation). The specifics of this kind of statistical analysis, especially as it relates to psychological testing, will be explained more fully in Chapter 9.
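A test given on two occasions can be checked with a correlation coefficient. The sketch below computes a Pearson correlation by hand; the scores are invented for eight hypothetical participants and are not data from Westman and Eden (1997), and the choice of Pearson's r is an assumption, since the text does not name a specific coefficient here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_x * var_y)


# Invented burnout-style scores: same eight people, two testing occasions.
first_session = [12, 25, 31, 18, 40, 22, 35, 28]
second_session = [14, 23, 33, 17, 38, 24, 36, 26]

test_retest_r = pearson_r(first_session, second_session)
print(f"test-retest r = {test_retest_r:.2f}")
```

Each person’s two scores differ by only a point or two while people differ widely from one another, so the coefficient comes out close to 1, the pattern expected of a reliable test.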

Validity

A behavioral measure is said to be valid if it measures what it is designed to measure. A measure of burnout should truly measure the phenomenon of burnout and not some other construct. A test of intelligence should measure intelligence and not something else.

Conceptually, the simplest level of validity is called content validity. This type of validity concerns whether or not the actual content of the items on a test makes sense in terms of the construct being measured. It comes into play at the start of the process of creating a test because it concerns the precise wording of the test items. Burnout, for example, is more likely to be reflected by a measure of perceived job stress than by a measure of vocabulary, and the opposite is true of intelligence. With a complex construct of many attributes, such as intelligence, content validity also concerns whether the measure includes items that assess each of the attributes. Content validity is sometimes confused with face validity, which is not actually a “valid” form of validity at all (Anastasi & Urbina, 1997). Face validity concerns whether the measure seems valid to those who are taking it, and it is important only in the sense that we want those taking our tests and filling out our surveys to treat the task seriously. Of course, a test can seem to make sense to those taking it and still not be a valid test. Most of the surveys found in popular magazines (“What’s Your Sex Appeal Index?”) fit into this category.

A more critical test of validity is called criterion validity, which concerns whether the measure (a) can accurately forecast some future behavior or (b) is meaningfully related to some other measure of behavior. For a test to be useful as an IQ test, for example, it should (a) do a reasonably good job of predicting how well a child will do in school and (b) produce results similar to those produced by other known measures of intelligent behavior. The term criterion validity is used because the measure in question is related to some outcome or criterion. In the examples above, the criterion variables are (a) future grades in school and (b) scores on an already established test for intelligence. As with reliability estimates, criterion validity research is correlational in nature and occurs primarily in research on psychological testing. You’ll see criterion validity again in Chapter 9.

A third form of validity, construct validity, concerns whether a test adequately measures some construct, and it connects directly with what is now a familiar concept to you: the operational definition. As you recall from Chapter 3, a construct (e.g., cognitive dissonance) is a hypothetical factor developed as part of a theory to help explain a phenomenon (e.g., decision making) or created as a shorthand term for a cluster of related behaviors (e.g., self-esteem). Constructs are never observed directly, so we develop operational definitions for them as a way of investigating them empirically, and then develop measures for them. For example, aggression is a construct that in a particular study might be operationally defined as the number of shocks subjects believe they are delivering to another subject. Another example: Emotional intelligence is a construct operationally defined as a score on a paper-and-pencil test with items designed to identify people skilled at reading the emotions of others. Construct validity relates to whether a particular measurement truly measures the construct as a whole; it is similar to theory in the sense that it is never established or destroyed with a single study, and it is never proven for the same reason theories are never proven. Rather, confidence in construct validity accumulates gradually and inductively as research produces supportive results.

Research establishing criterion validity helps establish construct validity as well, but construct validity research includes additional procedures said to establish convergent and discriminant validity. Scores on a test measuring some construct should relate to scores on other tests that are theoretically related to the construct (convergent validity) but not to scores on other tests that are theoretically unrelated to the construct (discriminant validity). Consider, for example, the construct of self-efficacy. This construct, first developed by Bandura (1986), refers to “judgments of [our] capabilities to organize and execute courses of action required to attain designated types of performances” (p. 391). Students with a high degree of self-efficacy about their schoolwork, for instance, believe they have good academic skills, know what to do to get good grades, and tend to get good grades. To increase confidence in the construct validity of a test designed to measure self-efficacy, one might compare self-efficacy scores with those on already established tests of locus of control and self-confidence. Locus of control (LOC) concerns our personal beliefs about the causes of what happens to us. Those with an internal LOC believe they control what happens to them (by working hard, for instance), while those with an external LOC believe outside forces (luck, for instance) determine what happens to them. You can see that someone with a high level of self-efficacy should also be someone with an internal LOC. Thus, research showing a strong relationship between the two would strengthen the construct validity of the self-efficacy measure because convergent validity would have been demonstrated. On the other hand, self-confidence is not necessarily related to self-efficacy. Someone with high self-efficacy might indeed be self-confident, but lots of people with high self-confidence might be confident for the wrong reasons and not be high in self-efficacy. You have probably met people who put on a display of self-confidence but don’t have much substance to back it up. So research showing that measures of self-efficacy and self-confidence are not related or only weakly related would establish discriminant validity for the measure of self-efficacy.
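The pattern that convergent and discriminant validity look for can be sketched with simulated scores. Everything in this example is an assumption made for illustration: the scores are randomly generated rather than collected, the strength of the built-in relationship between self-efficacy and internal locus of control (weights of 0.8 and 0.6) is arbitrary, and self-confidence is generated as completely independent, which is stronger than the "weakly related" case the text allows.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_people = 500

# Simulated standardized scores on the new self-efficacy test.
self_efficacy = [rng.gauss(0, 1) for _ in range(n_people)]

# Assumption: internal LOC shares much of its variance with self-efficacy.
internal_loc = [0.8 * s + 0.6 * rng.gauss(0, 1) for s in self_efficacy]

# Assumption: self-confidence is generated independently of self-efficacy.
self_confidence = [rng.gauss(0, 1) for _ in range(n_people)]

convergent_r = pearson_r(self_efficacy, internal_loc)
discriminant_r = pearson_r(self_efficacy, self_confidence)

print(f"self-efficacy vs. internal LOC:    r = {convergent_r:.2f}")
print(f"self-efficacy vs. self-confidence: r = {discriminant_r:.2f}")
```

A strong first correlation alongside a near-zero second one is the combination that would support construct validity. In real research the scores would come from established tests rather than a random number generator, and no single pair of correlations would settle the matter.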

Research Example 4—Construct Validity

As a concrete example of construct validity research, consider this study by Mayer and Frantz (2004). They developed a test called the Connectedness to Nature Scale (CNS), designed to measure individual differences in “levels of feeling emotionally connected with the natural world” (p. 503). They hoped the scale would be useful in predicting environmentally friendly behavior (e.g., recycling, not littering). Here are a few items from the scale:

• I think of the natural world as a community to which I belong.
• I have a deep understanding of how my actions affect the natural world.
• Like a tree can be part of a forest, I feel embedded within the broader natural world.
• My personal welfare is independent of the welfare of the natural world.

Those scoring high and therefore having a large amount of the construct connectedness to nature would agree with the first three statements and disagree with the fourth one.

Mayer and Frantz (2004) completed a series of studies on their new scale, examining both reliability and validity. To evaluate the construct validity of the CNS, they gave the test to a wide range of adults (i.e., not just college students), along with another test called the New Ecological Paradigm (NEP) scale and a survey about “their lifestyle patterns and time spent outdoors” (p. 505). The NEP scale (Dunlap, Van Liere, Mertig, & Jones, 2000) measures cognitive beliefs about ecological issues (sample item: “We are approaching the limit of the number of people the earth can support.”). Mayer and Frantz also gathered data on participants’ ecological behaviors (e.g., how often they turned off the lights in vacant rooms), scholastic aptitude (this part of the research used college students and collected SAT data), and a measure of social desirability (those scoring high on this test want to look good to others). They expected the CNS and NEP to be related, and they were; people scoring high on one were likely to score high on the other. This outcome supported convergent validity, as did the finding that high CNS scores predicted ecological behavior and outdoor lifestyle patterns. They also expected CNS scores would not be related to SAT scores or to the measure of social desirability, and this also happened (discriminant validity).

This brief description only scratches the surface of Mayer and Frantz’s (2004) article, which included five different experiments, but it should give you a sense of what is involved in trying to establish the construct validity of a new measuring tool, in this case the Connectedness to Nature Scale.
