and the data is concentrated into
opposite quadrants.
To produce a numeric summary that can be used to compare data sets, this sum is scaled by a term related to the product of the sample standard deviations. With this scaling, the
correlation only involves the respective z-scores, and the quantity is always between −1 and 1.
When there is a linear relationship between x and y, values of r² close to 1 indicate
a strong linear relationship, and values close to 0 a weak linear relationship. (Sometimes r may be close to 0 even though a different type of relationship holds.)
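The parenthetical remark can be illustrated with a small sketch. The data below are made up for illustration: y is an exact function of x, yet the relationship is quadratic rather than linear, so the Pearson correlation (computed here with R's cor() function) is zero.

```r
# Illustrative data (hypothetical, not from the text):
# y depends exactly on x, but through a parabola, not a line
x = -5:5
y = x^2

# The Pearson correlation is zero because the deviations of x
# cancel by symmetry, even though y is determined by x
r = cor(x, y)
r
```

A scatterplot, plot(x, y), makes the relationship obvious even though r does not.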
The Pearson correlation coefficient
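The Pearson correlation coefficient, r, of two data vectors x and y is defined by
$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}. \tag{3.1} $$
The value of r is between −1 and 1.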
In R this is found with the cor() function, as in cor(x, y).
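As a check on the definition, the following sketch computes r two ways by hand and compares the results with cor(). The first way uses the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The second uses z-scores. The vectors x and y are made-up illustrative values, not one of the data sets used in the text.

```r
# Illustrative data (hypothetical values, not from the text)
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
n = length(x)

# Directly from the definition
r.def = sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# From z-scores; sd() uses the n - 1 divisor, so we divide by n - 1
zx = (x - mean(x)) / sd(x)
zy = (y - mean(y)) / sd(y)
r.z = sum(zx * zy) / (n - 1)

# The built-in function should agree with both
r.builtin = cor(x, y)
c(r.def, r.z, r.builtin)
```

All three values should agree up to floating-point rounding.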
We look at the correlations for the three data sets just discussed. First we attach the variable names, as they have been previously detached.
> attach(homedata); attach(maydow); attach(kid.weights)
In Example 3.2, on Maplewood home values, we saw a nearly linear relationship between the 1970 assessed values and the 2000 ones. The correlation in this case is
> cor(y1970,y2000)
[1] 0.9111
In Example 3.3, where the temperature’s influence on the Dow Jones average was considered, no trend was discernible. The correlation in this example is
> cor(max.temp[-1],diff(DJA))
[1] 0.01029
In the height-and-weight example, the correlation is
> cor(height,weight)
[1] 0.8238
The number is close to 1, but we have our doubts that a linear relationship is a correct description.
The Spearman rank correlation
If the relationship between the variables is not linear but is increasing, such as the apparent curve for the height-and-weight data set, we can still use the correlation coefficient to understand the strength of the relationship. Rather than use the raw data for the calculation, we use the ranked data. That is, the data is ordered from smallest to largest, and a data point’s rank is its position after sorting, with 1 being the smallest and n
the largest. Ties are averaged. The Spearman rank correlation is the Pearson correlation coefficient computed with the ranked data.
The rank() function will rank the data.
> x = c(30,20,7,42,50,20)
> rank(x) # ties are averaged
[1] 4.0 2.5 1.0 5.0 6.0 2.5
The first three numbers are interpreted as: 30 is the fourth smallest value, 20 is tied for second and third, and 7 is the smallest.
Computing the Spearman correlation is done with cor() using the argument method="spearman" (which can be abbreviated). It can also be done directly by combining cor() with rank().
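The equivalence of the two approaches can be checked with a short sketch. The vector x is the one from the rank() example. The vector y is a hypothetical companion vector introduced only for illustration.

```r
x = c(30, 20, 7, 42, 50, 20)   # from the rank() example
y = c(12, 9, 3, 20, 18, 10)    # hypothetical values for illustration

# Spearman correlation through the method= argument
s.method = cor(x, y, method = "spearman")

# Pearson correlation applied to the ranks (ties are averaged by rank())
s.ranks = cor(rank(x), rank(y))

c(s.method, s.ranks)
```

Both calls should return the same number.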
For our examples, the correlations are as follows:
## homedata example, r = 0.9111
> cor(rank(y1970), rank(y2000))
[1] 0.907
## Dow Jones example, r = 0.01029
> cor(max.temp[-1], diff(DJA), method="spearman") # slight?
[1] 0.1316
## height and weight example, r = 0.8238
> cor(height,weight, m="s") # abbreviated
[1] 0.8822
> detach(homedata); detach(maydow); detach(kid.weights)
The data on home values is basically linear, and there the Spearman correlation actually went down. For the height-versus-weight data, the Spearman correlation coefficient increases as expected, as the trend there appears to be more quadratic than linear.
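Why the Spearman coefficient can exceed the Pearson coefficient for a curved but increasing trend can be seen in an idealized sketch. The data here are simulated for illustration only and are not the kid.weights data: y increases with x along an exponential curve, so the ranks of x and y match exactly while the raw values do not fall on a line.

```r
# Hypothetical data: a strictly increasing, curved relationship
x = 1:20
y = exp(x / 4)

rho.p = cor(x, y)                       # Pearson: penalized by the curvature
rho.s = cor(x, y, method = "spearman")  # Spearman: uses only the ranks

c(pearson = rho.p, spearman = rho.s)
```

Because the ranks agree perfectly, the Spearman coefficient equals 1, while the Pearson coefficient is below 1.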
3.3.3 Problems
3.13 For the homedata (UsingR) data set, make a histogram and density estimate of the multiplicative change in values (the variable y2000/y1970). Describe the shape, and explain why it is shaped thus. (Hint: There are two sides to the tracks.)
3.14 The galton (UsingR) data set contains measurements of a child’s height and an average of his or her parents’ heights (analyzed by Francis Galton in 1885). Find the Pearson and Spearman correlation coefficients.
3.15 The data set normtemp (UsingR) contains body measurements for 130 healthy, randomly selected individuals. The variable temperature measures normal body temperature, and the variable hr measures resting heart rate. Make a scatterplot of the two variables and find the Pearson correlation coefficient.
3.16 The data set fat (UsingR) contains several measurements of 252 men. The variable body.fat contains body-fat percentage, and the variable BMI records the body mass index (weight divided by height squared). Make a scatterplot of the two variables and then find the correlation coefficient.
3.17 The data set twins (UsingR) contains IQ scores for pairs of identical twins who were separated at birth. Make a scatterplot of the variables Foster and Biological. Based on the scatterplot, predict what the Pearson correlation coefficient will be and whether the Pearson and Spearman coefficients will be similar. Check your guesses.
3.18 The state.x77 data set contains various information for each of the fifty United States. We wish to explore possible relationships among the variables. First, we make the data set easier to work with by turning it into a data frame.
> x77 = data.frame(state.x77)
> attach(x77)
Now, make scatterplots of Population and Frost; Population and Murder; Population and Area; and Income and HS.Grad. Do any relationships appear linear? Are there any surprising correlations?
3.19 The data set nym.2002 (UsingR) contains information about the 2002 New York City Marathon. What do you expect the correlation between age and finishing time to be? Find it and see whether you were close.
3.20 For the data set state.center do this plot:
> with(state.center,plot(x,y))
Can you tell from the shape of the points what the data set is?
3.21 The batting (UsingR) data set contains baseball statistics for the 2002 major league baseball season. Make a scatterplot to see whether there is any trend. What is the correlation between the number of strikeouts (SO) and the number of home runs (HR)? Does the data suggest that in order to hit a lot of home runs one should strike out a lot?
3.22 The galton (UsingR) data set contains data recorded by Galton in 1885 on the heights of children and their parents. The data is discrete, so a simple scatterplot does not show all the data points. In this case, it is useful to “jitter” the points a little when plotting by adding a bit of noise to each point. The jitter() function will do this. An optional argument, factor=, allows us to adjust the amount of jitter. Plot the data as below and find a value for factor= that shows the data better.
> attach(galton)
> plot(jitter(parent,factor=1),jitter(child,factor=1))
3.4 Simple linear regression
In this section, we introduce the simple linear regression model for describing paired data sets that are related in a linear manner. When we say that variables x and y have a linear relationship in a mathematical sense we mean that y=mx+b, where m is the slope of the line and b the intercept. We call x the independent variable and y the dependent one.
In statistics, we don’t assume these variables have an exact linear relationship: rather, the possibility for noise or error is taken into account.
In the simple linear regression model for describing the relationship between xi and yi, an error term is added to the linear relationship:
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. \tag{3.2} $$
The value εi is an error term, and the coefficients β0 and β1 are the regression coefficients.† The data vector x is called the predictor variable and y the response variable. The error terms are unknown, as are the regression coefficients. The goal of linear regression is to estimate the regression coefficients in a reasonable manner from the data.
†These are Greek letters: ε is epsilon and β is beta.
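To make the model concrete, the sketch below simulates data from it and then estimates the coefficients. Everything here is an assumption chosen for illustration: the "true" coefficients beta0 and beta1, the normal distribution of the errors, the sample size, and the seed. The estimates are obtained with R's lm() function, which is used here only as one standard way to fit such a line.

```r
set.seed(1)                        # arbitrary seed so the run is repeatable
n = 100
beta0 = 2                          # hypothetical true intercept
beta1 = 0.5                        # hypothetical true slope

x = runif(n, 0, 10)                # predictor values
eps = rnorm(n, mean = 0, sd = 1)   # error terms (unknown in practice)
y = beta0 + beta1 * x + eps        # response generated by the model

fit = lm(y ~ x)                    # estimate the regression coefficients
coef(fit)                          # estimated intercept and slope
```

With real data only x and y are observed. The simulation shows that the estimates land near, but not exactly on, the values used to generate the data.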
The term "linear" applies to the way the regression coefficients are used. A model such as yi=β0+β1xi+β2xi²+εi would also be considered a linear model, because it is linear in the coefficients. The term “simple” is used to emphasize that only one predictor variable is used, in contrast with the multiple regression model, which is discussed in Chapter 10.
Estimating the intercept β0 and the slope β1 gives an estimate for the underlying linear relationship. We use "hats" to denote the estimates. The estimated regression line is then written
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. $$
For each data point xi we have a corresponding value, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, with $(x_i, \hat{y}_i)$ being a point on the estimated regression line.
We refer to $\hat{y}_i$ as the predicted value for yi, and to the estimated regression line as the prediction line. The difference between the true value yi and this predicted value is the residual, ei: