Display 3.50 shows a hypothetical data set with the height of younger sisters plotted against the height of their older sisters. There is a moderate positive association: r = 0.337. For both younger and older sisters, the mean height is 65 in. and the standard deviation is 2.5 in. The line drawn on the first plot, y = x, indicates the location of points representing the same height for both sisters. If you rotate your book and sight down the line, you can see that the points are scattered symmetrically about it.
In the second plot, look at the vertical strip for the older sisters with heights between 62 in. and 63 in. The X is at the mean height of the younger sisters with older sisters in this height range. It falls at about 64 in., not between 62 in. and 63 in. as you would expect. Looking at the vertical strip on the right, the mean height of younger sisters with older sisters between 68 in. and 69 in. is only about 66 in. If you were to use the line y = x to predict the height of the younger sister, you would tend to predict a height that is too small if the older sister is shorter than average and a height that is too large if the older sister is taller than average.
The flatter line through the third scatterplot in Display 3.50 is the least squares regression line. Notice that this line gets as close as it can to the center of each vertical strip. Thus, the least squares line is sometimes called the line of means. The predicted value of y at a given value of x, using the regression line as the model, is the estimated mean of all responses that can be produced at that particular value of x.
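The "line of means" idea can be checked with a small simulation. The sketch below does not use the data in Display 3.50; it generates hypothetical heights under the assumption of a bivariate normal model with the mean (65 in.), standard deviation (2.5 in.), and correlation (0.337) given in the text. The function names are made up for illustration.

```python
import random
import statistics

def simulate_sisters(n, r, mean=65.0, sd=2.5, seed=1):
    """Generate n hypothetical (older, younger) height pairs with correlation near r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z_old = rng.gauss(0, 1)
        # Mix the older sister's z-score with independent noise so the two
        # z-scores have correlation r and each has variance 1 (an assumption).
        z_young = r * z_old + (1 - r ** 2) ** 0.5 * rng.gauss(0, 1)
        pairs.append((mean + sd * z_old, mean + sd * z_young))
    return pairs

def strip_mean(pairs, lo, hi):
    """Mean younger-sister height for older sisters with lo <= height < hi."""
    return statistics.mean(y for x, y in pairs if lo <= x < hi)

def regression_prediction(x, r, mean=65.0):
    """Least squares prediction when both variables share the same mean and SD."""
    return mean + r * (x - mean)

pairs = simulate_sisters(n=20000, r=0.337)
for lo, hi in [(62, 63), (68, 69)]:
    mid = (lo + hi) / 2
    print(f"older sister {lo}-{hi} in.: "
          f"strip mean = {strip_mean(pairs, lo, hi):.2f}, "
          f"regression line at {mid} = {regression_prediction(mid, 0.337):.2f}, "
          f"line y = x at {mid} = {mid:.2f}")
```

In runs of this kind, the strip means should land near the regression line's predictions rather than near the line y = x, which is what the X marks in the second plot suggest.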
Notice that the regression line has a smaller slope than the major axis ( y = x ) of the ellipse. This means that the predicted values are closer to the mean than you might expect, which will always be the case for positively correlated data following a linear trend. The difference between these two lines is sometimes called the regression effect. If the correlation is near +1 or −1, the two lines will be nearly on top of each other and the regression effect will be minimal. For a moderate correlation such as that for the sisters’ heights, the regression effect will be quite large.
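To see how the size of the regression effect depends on the correlation, the following sketch compares predictions from the least squares line with the values the line y = x would give. It relies on the standard fact that the least squares prediction in standardized units is r times the z-score of x, and it assumes, as in the sisters example, that both variables have mean 65 in. and standard deviation 2.5 in. The function name is hypothetical.

```python
MEAN, SD = 65.0, 2.5  # from the text: the same for older and younger sisters

def predicted_height(older, r):
    """Least squares prediction when both variables share MEAN and SD."""
    z_x = (older - MEAN) / SD
    return MEAN + SD * (r * z_x)  # the predicted z-score is r * z_x

for r in (0.337, 0.95):
    for older in (60.0, 62.5, 67.5, 70.0):
        pred = predicted_height(older, r)
        print(f"r = {r}: older = {older} in., "
              f"predicted younger = {pred:.2f} in., "
              f"y = x would give {older:.2f} in.")
```

With r = 0.337, an older sister at 70 in. leads to a prediction of about 66.7 in., well short of 70 in.; with r = 0.95, the prediction is 69.75 in., so the two lines nearly coincide.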
The regression effect was first noticed by British scientist Francis Galton around 1877. Galton noticed that the largest sweet-pea seeds tended to produce daughter seeds that were large but smaller than their parent. The smallest sweet-pea seeds tended to produce daughter seeds that were also small but larger than their parent. There was, in Galton’s words, a regression toward the mean.
This is the origin of the term regression line. [Source: D. W. Forrest, Francis Galton: The Life and Work of a Victorian Genius (Taplinger, 1974).]
The regression effect is with us in everyday life whenever some element of chance is involved in a person’s score. For example, athletes are said to experience a “sophomore slump.” That is, athletes who have the best rookie seasons do not tend to be the same athletes who have the best second year. The top students on the second exam in your class probably did not do as well, relative to the rest of the class, on the first exam. The children of extremely tall or short parents do not tend to be as extreme in height as their parents. There does, indeed, seem to be a phenomenon at work that pulls us back toward the average. As Galton noticed, this prevents the spread in human height, for example, from increasing. Look for this effect as you work on regression analyses of data.
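A small simulation can illustrate why chance alone produces something like a "sophomore slump." The model here is an assumption made for illustration, not something taken from real athletic data: each observed score is a stable skill level plus independent luck, both drawn from a standard normal distribution.

```python
import random
import statistics

rng = random.Random(2)
n = 10000

# Hypothetical model: observed score = stable skill + independent luck.
skill = [rng.gauss(0, 1) for _ in range(n)]
season1 = [s + rng.gauss(0, 1) for s in skill]
season2 = [s + rng.gauss(0, 1) for s in skill]

# Select roughly the top 10% of performers in the first season.
cutoff = sorted(season1)[int(0.9 * n)]
top = [i for i in range(n) if season1[i] >= cutoff]

top_s1 = statistics.mean(season1[i] for i in top)
top_s2 = statistics.mean(season2[i] for i in top)
overall_s2 = statistics.mean(season2)

print(f"top group, season 1 mean: {top_s1:.2f}")
print(f"top group, season 2 mean: {top_s2:.2f}")
print(f"everyone,  season 2 mean: {overall_s2:.2f}")
```

Under this model the top group is still above average in the second season, but less extreme than in the first, because part of what put them on top was luck that does not repeat.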
The regression line is a line of means.
“Regression toward the mean” is another term for regression effect.
© 2008 Key Curriculum Press
Lesson http://acr.keypress.com/KeyPressPortalV3.0/Viewer/Lesson.htm
3.3 Correlation: The Strength of a Linear Trend 153
Display 3.50 Scatterplots showing the regression effect.
154 Chapter 3 Relationships Between Two Quantitative Variables
Regression Toward the Mean
D24. Why is the regression line sometimes called the “line of means”?
D25. The equation of the regression line for the scatterplot in Display 3.50 is y = 43.102 + 0.337x. Interpret the slope of this line in the context of the situation and compare it to the interpretation of the slope of the line y = x.
Summary 3.3: Correlation
In your study of normal distributions in Chapter 2, you used the mean to tell the center and then used the standard deviation as the overall measure of how much the values deviated from that center. For “well-behaved” quantitative relationships—that is, those whose scatterplots look elliptical—you use the regression line as the center and then measure the overall amount of variation from the line using the correlation, r. You can think of the correlation, r, as the average product of the z-scores.
Geometrically, the correlation measures how tightly packed the points of the scatterplot are about the regression line.
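As a minimal sketch of the "average product of the z-scores" description, the function below computes r from standardized values. It assumes the usual convention of sample standard deviations and a divisor of n − 1; the data are made up for illustration. The last lines check that r does not change under a linear change of scale or when x and y are interchanged.

```python
import statistics

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 4))  # 0.7746

# A change of units (here, feet to inches) leaves r unchanged.
xs_inches = [12 * x for x in xs]
print(round(correlation(xs_inches, ys), 4))

# Interchanging x and y also leaves r unchanged.
print(round(correlation(ys, xs), 4))
```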
• The correlation has no units and ranges from −1 to +1. It is unchanged if you interchange x and y or if you make a linear change of scale in x or y, such as from feet to inches or from pounds to kilograms.
• In assessing correlation, begin by making a scatterplot and then follow these steps:
1. Shape: Is the plot linear, shaped roughly like an elliptical cloud, rather than curved, fan-shaped, or formed of separate clusters? If so, draw an ellipse to enclose the cloud of points. The data should be spread throughout the ellipse; otherwise, the pattern might not be linear or might have unusual features that require special handling. You should not calculate the correlation for patterns that are not linear.
2. Trend: If your ellipse tilts upward to the right, the correlation is positive; if it tilts downward to the right, the correlation is negative. The relationship between the correlation and the slope, b1, of the regression line is given by $b_1 = r \cdot \frac{s_y}{s_x}$, where $s_x$ and $s_y$ are the standard deviations of x and y.
3. Strength: If your ellipse is almost a circle or is horizontal, the relationship is weak and the correlation is near zero. If your ellipse is so thin that it looks like a line, the relationship is very strong and the correlation is near +1 or −1.
• Correlation is not the same as causation. Two variables may be highly correlated without one having any causal relationship with the other. The value of r tells nothing about why x and y are related. In particular, a strong relationship between x and y might be due to a lurking variable.
• You can interpret the value $r^2$ as the proportion of the total variation in y that can be accounted for by using x in the prediction model: $r^2 = \frac{SST - SSE}{SST}$, where SST is the total sum of squared deviations of y from its mean and SSE is the sum of squared residuals from the regression line.
• The regression effect (or regression toward the mean) is the tendency of y-values to be closer to their mean than you might expect. That is, the regression line is flatter than the major axis of the ellipse surrounding the data.
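Two of the relationships in this summary, the link between the slope and the correlation and the interpretation of the squared correlation as a proportion of variation, can be checked numerically. The sketch below uses made-up data and a hypothetical helper, `least_squares`; it assumes sample standard deviations (divisor n − 1).

```python
import statistics

def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)

# Correlation computed directly from sums of squares and products.
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / (sxx * syy) ** 0.5

# Slope from the correlation and the two standard deviations.
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
slope_from_r = r * s_y / s_x

# Proportion of variation in y accounted for by the line.
sst = syy                                                    # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # error variation
r_squared = (sst - sse) / sst

print(f"b1 = {b1:.3f}, r * s_y / s_x = {slope_from_r:.3f}")
print(f"(SST - SSE) / SST = {r_squared:.3f}, r squared = {r ** 2:.3f}")
```

For these data the slope is 0.6 both ways, and the proportion of variation accounted for is 0.6, which matches the square of r (about 0.7746).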
Practice
Estimating the Correlation
P9. By comparing to the plots in Display 3.41 on page 140, match each of the five scatterplots in Display 3.51 with its correlation, choosing from −0.95, −0.5, 0, 0.5, and 0.95.
Display 3.51 Five scatterplots, labeled a through e.
P10. The table in E12 (Display 3.31 on page 134) gives the amount of fat and number of calories in various pizzas.
a. Guess a value for the correlation, r.
b. Calculate r using your calculator.
A Formula for the Correlation, r
P11. Eight artificial “data sets” are shown here. For each one, find the value of r, without computing if possible. Drawing a quick sketch might be helpful.
[Data sets a through h]
P12. The table in E12 (Display 3.31 on page 134) gives the amount of fat and number of calories in various pizzas. In P10, you used your calculator to find the correlation, r. This time, make a table like that in Display 3.42 on page 142, and use the formula to find r. What do you notice about the products $z_x \cdot z_y$?
P13. The scatterplot in Display 3.52 is divided into quadrants by vertical and horizontal lines that pass through the point of averages, ($\bar{x}$, $\bar{y}$).
Display 3.52 Scatterplot divided into quadrants at the point of averages, ($\bar{x}$, $\bar{y}$).
a. Is the correlation positive or negative?
b. Give the coordinates of the point that will contribute the most to the correlation, r.
c. Consider the product $z_x \cdot z_y$. Where are the points that have a positive product? How many of the 30 points have a positive product?
d. Where are the points that have a negative product? How many of the 30 points have a negative product?
Correlation and the Appropriateness of a Linear Model
P14. Both plots in Display 3.53 have a correlation of 0.26. For each plot, is fitting a regression line (as shown on the plot) an appropriate thing to do? Why or why not?
Display 3.53 Two scatterplots with the same correlation.
The Relationship Between the Correlation and the Slope
P15. Imagine a scatterplot of two sets of exam scores for students in a statistics class. The score for a student on Exam 1 is graphed on the x-axis, and his or her score on Exam 2 is graphed on the y-axis. The slope of the regression line is 0.368. The mean of the Exam 1 scores is 72.99, and the standard deviation is 12.37. The mean of the Exam 2 scores is 75.80, and the standard deviation is 7.00.
a. Find the correlation of these scores.
b. Find the equation of the regression line for predicting an Exam 2 score from an Exam 1 score. Predict the Exam 2 score for a student who got a score of 80 on Exam 1.
c. Find the equation of the regression line for predicting an Exam 1 score from an Exam 2 score.
d. Sketch a scatterplot that could represent the situation described.
Correlation Does Not Imply Causation
P16. If you take a random sample of U.S. cities and measure the number of fast-food franchises in each city and the number of cases of stomach cancer per year in the city, you find a high correlation.
a. What is the lurking variable?
b. How would you adjust the data for the lurking variable to get a more meaningful comparison?
P17. If you take a random sample of public school students in grades K–12 and measure weekly allowance and size of vocabulary, you will find a strong relationship. Explain in terms of a lurking variable why you should not conclude that raising a student’s allowance will tend to increase his or her vocabulary.
P18. For the countries of the United Nations, there is a strong negative relationship between the number of TV sets per thousand people and the birthrate. What would be a careless conclusion about cause and effect? What is the lurking variable?
Interpreting $r^2$
P19. Data on the association between high school graduation rates and the percentage of families living in poverty for the 50 U.S. states were presented in E26. Display 3.54 contains the scatterplot and a standard computer output of the regression analysis.
Display 3.54 Poverty rates versus high school graduation rates.
a. Under “SOURCE,” the “Total” variation is the SST, and the “Error” variation is the SSE. From this information, find r, the correlation.
b. Write an interpretation for $r^2$ in the context of these data.
c. Does the presence of a linear relationship here imply that a state that raises its graduation rate will cause its poverty rate to go down? Explain your reasoning.
d. What are the units for each of the values x, y, b1, and r?
Regression Toward the Mean
P20. The plot in Display 3.55 shows the heights of older sisters plotted against the heights of their younger sisters. On a copy of this scatterplot, draw vertical lines to divide the points into six groups. Mark the approximate location of the mean of the y-values of each vertical strip. Sketch the regression line, y = 43 + 0.337x. Note that the regression line comes as close as possible to the mean of each vertical strip. Now draw an ellipse around the data and connect the two ends of the ellipse. Is the regression line “flatter” than this line? Does this plot show the regression effect?
Display 3.55 The heights of older sisters versus the heights of their younger sisters.
P21. Display 3.56 shows the first two exam scores for 29 college students enrolled in an introductory statistics course. Do you see any evidence of regression to the mean? If so, explain the nature of the evidence.
Display 3.56 Exam scores.
Exercises
E27. Each scatterplot in Display 3.57 was made on the same set of axes. Match each scatterplot with its correlation, choosing from −0.06, 0.25, 0.40, 0.52, 0.66, 0.74, 0.85, and 0.90.
Display 3.57 Eight scatterplots with various correlations.
E28. Estimate the correlation between the variables in these scatterplots.
a. The proportion of the state population living in dorms versus the proportion living in cities in Display 3.4 on page 109.
b. The graduation rate versus the 75th percentile of SAT scores in E5 on page 113.
c. The college graduation rate versus the percentage of students in the top 10% of their high school graduating class in E5 on page 113.
E29. For each set of pairs, ( x, y ), compute the correlation by hand, standardizing and finding the average product.
a. (−2, −1), (−1, 1), (0, 0), (1, 1), (2, 1)
b. (−2, 2), (0, 2), (0, 3), (0, 4), (2, 4)
E30. For each artificial data set in P11 on page 155, compute the correlation by hand, standardizing and finding the average product.
E31. The scatterplot in Display 3.58 shows part of the hat size data of E6 on page 113. The plot is divided into quadrants by vertical and horizontal lines that pass through the point of averages, ($\bar{x}$, $\bar{y}$).
Display 3.58 Head circumference, in inches, versus hat size.
a. Estimate the value of the correlation.
b. Using the idea of standardized scores, explain why the correlation is positive.
c. Identify the point that contributes the most to the correlation. Explain why the contribution it makes is large.
d. Identify a point that contributes little to the correlation. Explain why the contribution it makes is small.
E32. The ellipses in Display 3.59 represent scatterplots that have a basic elliptical shape.
Display 3.59 Three pairs of elliptical scatterplots.
a. Match these conditions with the corresponding pair of ellipses.
I. One is larger than the other, the are equal, and the correlations are strong.
II. One of the correlations is stronger than the other, the are equal, and the are equal.
III. One is larger than the other, the are equal, and the correlations are weak.
b. Draw a pair of elliptical scatterplots to illustrate each comparison.
i. One is larger than the other, the are equal, and the correlations are weak.
ii. One is larger than the other, the are equal, and the correlations are strong.
E33. Several biology students are working together to calculate the correlation for the relationship between air temperature and how fast a cricket chirps. They all use the same crickets and temperatures, but some measure temperature in degrees Celsius and others measure it in degrees Fahrenheit. Some measure chirps per second, and others measure chirps per minute. Some use x for temperature and y for chirp rate, while others have it the other way around.
a. Will all the students get the same value for the slope of the least squares line? Explain why or why not.
b. Will they all get the same value for the correlation? Explain why or why not.
E34. For the sample of top-rated universities in E5 on page 113, the graduation rate has mean 82.7% and standard deviation 8.3%. The student/faculty ratio has mean 11.7 and standard deviation 4.3. The correlation is −0.5.
a. Find the equation of the least squares line for predicting graduation rate from student/faculty ratio.
b. Find the equation of the least squares line for predicting student/faculty ratio from graduation rate.
E35. These questions concern the relationship between the correlation, r, and the slope, b1, of the regression line.
a. If y is more variable than x, will the slope of the least squares line be greater (in absolute value) than the correlation? Justify your answer.
b. For a list of pairs (x, y), r = 0.8, b1 = 1.6, and the standard deviations of x and y are 25 and 50 (not necessarily in that order). Which is the standard deviation for x? Justify your answer.
c. Students in a statistics class estimated and then measured their head circumferences in inches. The actual circumferences had SD 0.93, and the estimates had SD 4.12. The equation of the least squares line for predicting estimated values from actual values was $\hat{y}$ = 11.97 + 0.36x. What was the correlation?
d. What would be the slope of the least squares line for predicting actual head circumferences from the estimated values?
E36. Lost final exam. After teaching the same history course for about a hundred years, an instructor has found that the correlation, r, between the students’ total number of points before the final examination and the number of points scored on their final examination is 0.8. The pre-final-exam point totals for all students in this year’s course have mean 280 and SD 30. The points on the final exam have mean 75 and SD 8. The instructor’s dog ate Julie’s final exam, but the instructor knows that her total number of points before the exam was 300. He decides to predict her final exam score from her pre-final-exam total. What value will he get?
E37. Lurking variables. For each scenario, state a careless conclusion assuming cause and effect, and then identify a possible lurking variable.
a. For a large sample of different animal species, there is a strong positive correlation between average brain weight and average life span.
b. Over the last 30 years, there has been a strong positive correlation between the average price of a cheeseburger and the average tuition at private liberal arts colleges.
c. Over the last decade, there has been a strong positive correlation between the price of an average share of stock, as measured by the S&P 500, and the number of Web sites on the Internet.
E38. Manufacturers of low-fat foods often increase the salt content in order to keep the flavor acceptable to consumers. For a sample of different kinds and brands of cheeses, Consumer Reports measured several variables, including calorie content, fat content, saturated fat content, and sodium content. Using these four variables, you can form six pairs of variables, so there are six different correlations. These correlations turned out to be either about 0.95 or about −0.5.
a. List all six pairs of variables, and for each pair decide from the context whether the correlation is close to 0.95 or to −0.5.
b. State a careless conclusion based on taking the negative correlations as evidence of cause and effect.
c. Explain the negative correlation using the idea of a lurking variable.
E39. A study to determine whether ice cream consumption depends on the outside temperature gave the results shown in Display 3.60.
a. Use the values of SST and SSE in the