
A scientist records the pH level of a reactive solution every 10 minutes. She records the data on a graph.

A graph such as this displaying the measurements of two quantities – here pH level and time – is called a scatter diagram.

A scatter diagram can show whether there seems to be some relationship between the two quantities. In this example, it looks as though there is a fairly good linear relationship of positive slope between pH and time.

The following scatter diagram between IQ levels and shoe size suggests no relationship between these quantities:

If a scatter diagram suggests a linear relationship of positive slope, then we say that the two quantities depicted are positively correlated. If the relationship seems to be linear with negative slope, then we say that they are negatively correlated.

EXERCISE: Find a group of friends. Draw a scatter diagram for shoe size and height. Any correlation?

[Warning: Men’s and women’s shoe sizes are computed differently. Perhaps use a yard stick to measure the lengths of people’s feet in inches?]

LINES OF BEST FIT

Suppose we have some data, for x- and y-values, that looks as though it is linearly correlated.

We want to determine an equation for the line that fits the data well.

There are two approaches:

1. Just “eyeball” one.

2. Use mathematics to derive the equation of the line that fits the data well in some sense.

Notice: In conducting an experiment, one usually has complete control of one variable, the x variable. For example, in measuring pH levels, one has control of the times that the measurements are taken, but not of the pH levels one reads.

Thus, deviations of data points from a line of best fit should be measured as vertical segments (variations of the y-values) with no deviation horizontally. For this reason, people look for lines that minimize vertical deviations only (or, to avoid absolute values, the squares of the vertical deviations).

Here’s one method for doing this, the least squares method. We’ll explain it with an example.

EXAMPLE: Here are three data points: (1, 2), (2, 5), (6, 8).

Choose a line that minimizes the sum of the squares of the vertical deviations.

Answer: One thing that seems reasonable (and turns out to be a true property of the general theory) is that a line of best fit should properly represent the data and go through the “most average” data point.

Let:

$$\bar{x} = \text{average of the } x\text{-values} = \frac{1 + 2 + 6}{3} = 3$$

$$\bar{y} = \text{average of the } y\text{-values} = \frac{2 + 5 + 8}{3} = 5$$

So the line should go through the point (3, 5).

Now the question is: What should the slope of this line be?

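Call the slope $m$, so that the line through (3, 5) is $y = m(x - 3) + 5$. Let’s work out the y-values of this line for the given x-values and compare them to the actual y-values of the data points:

At $x = 1$: the line gives $y = 5 - 2m$; the data point has $y = 2$; the difference is $3 - 2m$.

At $x = 2$: the line gives $y = 5 - m$; the data point has $y = 5$; the difference is $-m$.

At $x = 6$: the line gives $y = 5 + 3m$; the data point has $y = 8$; the difference is $3m - 3$.

The sum of the differences squared is:

$$(3 - 2m)^2 + (-m)^2 + (3m - 3)^2 = 14m^2 - 30m + 18$$

This is a quadratic in $m$, and it has smallest value at its vertex: $m = \frac{30}{2 \cdot 14} = \frac{15}{14}$.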

So the line of best fit (by minimizing squared differences) is:

$$y = \frac{15}{14}(x - 3) + 5$$
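The worked example can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `sum_squared_deviations` is made up for illustration, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

# The three data points from the example.
points = [(1, 2), (2, 5), (6, 8)]

def sum_squared_deviations(m, points, x_bar=3, y_bar=5):
    """Sum of squared vertical deviations from the line y = m(x - x_bar) + y_bar."""
    return sum((m * (x - x_bar) + y_bar - y) ** 2 for x, y in points)

# Expanding the three squares gives 14m^2 - 30m + 18,
# a parabola whose vertex is at m = 30 / (2 * 14).
best_m = Fraction(30, 2 * 14)

print(best_m)                                        # 15/14
print(sum_squared_deviations(best_m, points))        # 27/14
# Slopes on either side of 15/14 give a larger sum.
print(sum_squared_deviations(Fraction(1), points))     # 2
print(sum_squared_deviations(Fraction(8, 7), points))  # 2
```

The two neighbouring slopes, 14/14 and 16/14, give the same sum because the parabola is symmetric about its vertex.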

Definition: The process of finding a line of best fit is called regression.

The method of choosing a line of best fit by minimizing squares of differences is called the least squares method.

For completeness, here are the general formulas for the least squares method:

LEAST SQUARES METHOD

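Suppose we have N data points in a scatter diagram: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.

Let:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}, \qquad \bar{y} = \frac{y_1 + y_2 + \cdots + y_N}{N}$$

$$S_{xx} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_N - \bar{x})^2}{N - 1}$$

(called the variance of the x-values).

$$S_{yy} = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2}{N - 1}$$

(called the variance of the y-values).

$$S_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \cdots + (x_N - \bar{x})(y_N - \bar{y})}{N - 1}$$

(called the covariance of the x- and y-values).

Then the line of best fit goes through the point $(\bar{x}, \bar{y})$ and has slope $\frac{S_{xy}}{S_{xx}}$.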
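The general recipe can be sketched in a few lines of Python. This is an illustration only: the function name `least_squares_line` is hypothetical, and the divisor N − 1 is an assumption chosen to agree with the values 7, 9 and 7.5 used for the three-point example in this section. The divisor cancels in the slope, so it does not affect the line itself.

```python
from fractions import Fraction

def least_squares_line(points):
    """Return (slope, intercept) of the least squares line for (x, y) pairs."""
    n = len(points)
    xs = [Fraction(x) for x, _ in points]
    ys = [Fraction(y) for _, y in points]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Variance of x and covariance of x and y (divisor n - 1 is an assumption).
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return slope, intercept

slope, intercept = least_squares_line([(1, 2), (2, 5), (6, 8)])
print(slope, intercept)   # 15/14 25/14
```

The result y = (15/14)x + 25/14 is the same line as y = (15/14)(x − 3) + 5 from the worked example.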

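Example: For our three data points:

$$S_{xx} = \frac{(-2)^2 + (-1)^2 + 3^2}{2} = 7, \qquad S_{yy} = \frac{(-3)^2 + 0^2 + 3^2}{2} = 9, \qquad S_{xy} = \frac{(-2)(-3) + (-1)(0) + (3)(3)}{2} = 7.5$$

so the slope is $\frac{S_{xy}}{S_{xx}} = \frac{7.5}{7} = \frac{15}{14}$, as before.

Question: The slope uses only $S_{xx}$ and $S_{xy}$. What is $S_{yy}$ for?

Answer: It is involved in answering the question: How good is the fit really?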

MEASURING THE DEGREE OF FIT: THE CORRELATION COEFFICIENT

Here are some data values: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.

We chose a line $y = mx + b$ that made the sum of deviations squared:

$$D = \bigl(y_1 - (mx_1 + b)\bigr)^2 + \bigl(y_2 - (mx_2 + b)\bigr)^2 + \cdots + \bigl(y_N - (mx_N + b)\bigr)^2$$

the smallest.

This quantity reflects the amount of variation of the points about the regression line.

Now:

$$T = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2$$

represents the amount of variation of the y-values in general, in the sense of measuring the amount of variation about the mean given by the horizontal line $y = \bar{y}$.

Since the regression line is designed to be better than any other line, including the horizontal line $y = \bar{y}$, we necessarily have:

$$D \le T$$

This prompts one to think of the proportion:

$$\frac{T - D}{T}$$

This is a number guaranteed to be between 0 and 1.

If $\frac{T - D}{T}$ equals 1, then this is saying that $D = 0$, which means that there is no scatter about the regression line. That is, all data points lie exactly on a line.

If $\frac{T - D}{T}$ equals 0, then this is saying that $T = D$. That is, the amount of scatter about the regression line is no different than the amount of scatter in general.

That is, computing a regression line has no effect on scatter, and so there is no relationship between the x- and y-values of any significance.
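For the three data points (1, 2), (2, 5), (6, 8) used earlier, the two amounts of scatter and their proportion can be computed directly. This is a small sketch with exact fractions. It assumes the regression line y = (15/14)(x − 3) + 5 found in the worked example, rewritten as y = mx + b with b = 25/14.

```python
from fractions import Fraction

points = [(1, 2), (2, 5), (6, 8)]
# Regression line from the worked example: y = (15/14)(x - 3) + 5 = (15/14)x + 25/14.
m, b = Fraction(15, 14), Fraction(25, 14)

y_bar = Fraction(sum(y for _, y in points), len(points))

D = sum((y - (m * x + b)) ** 2 for x, y in points)   # scatter about the regression line
T = sum((y - y_bar) ** 2 for _, y in points)         # scatter about the line y = y_bar

print(D, T, (T - D) / T)   # 27/14 18 25/28
```

Here D is far smaller than T, so the proportion 25/28 ≈ 0.89 is close to 1.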

Since $\frac{T - D}{T}$ is never negative, we give it a name that is never negative either:

Definition: $$R^2 = \frac{T - D}{T}$$

A tedious (but not difficult) exercise in algebra shows that this quantity is given by the formula:

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}$$

Usually people take the square root of this quantity:

$$R = \pm\sqrt{\frac{S_{xy}^2}{S_{xx} S_{yy}}}$$

choosing the + sign to indicate data has a positive slope and the – sign to indicate negative slope.

The number R is called the correlation coefficient of the data.
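The algebra behind the formula for $R^2$ can be sketched as follows. The sketch assumes that the variances and the covariance use the divisor $N - 1$ (any common divisor gives the same final result, since it cancels), and it writes the regression line as $y = \bar{y} + m(x - \bar{x})$ with $m = S_{xy}/S_{xx}$.

```latex
\begin{align*}
D &= \sum_{i=1}^{N} \bigl( (y_i - \bar{y}) - m (x_i - \bar{x}) \bigr)^2 \\
  &= (N-1)\bigl( S_{yy} - 2m\,S_{xy} + m^2 S_{xx} \bigr) \\
  &= (N-1)\Bigl( S_{yy} - \frac{S_{xy}^2}{S_{xx}} \Bigr)
     \qquad \text{using } m = \frac{S_{xy}}{S_{xx}}, \\[4pt]
T &= \sum_{i=1}^{N} (y_i - \bar{y})^2 = (N-1)\,S_{yy}, \\[4pt]
\frac{T - D}{T} &= \frac{S_{xy}^2 / S_{xx}}{S_{yy}}
                 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}}.
\end{align*}
```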

Example: Let’s compute the correlation coefficient of our data:

Since the data has positive slope:

$$R = +\sqrt{\frac{S_{xy}^2}{S_{xx} S_{yy}}} = \sqrt{\frac{(7.5)^2}{7 \cdot 9}} \approx 0.945$$

This is very good. [Of course, with just three data points there is little information to go on. CHALLENGE: Explain why the correlation coefficient will have value R = 1 or R = -1, indicating perfect fit, if we work with a data set of just two data points.]

One wants a correlation coefficient pretty close to 1 or to -1.

A value around 0.85 or higher (or -0.85 and lower) is usually deemed “good.”

One wouldn’t want to make predictions, whether by interpolation or extrapolation, with poorly fitting lines.
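A correlation coefficient can be computed and compared against the 0.85 rule of thumb as in the sketch below. The function name `correlation_coefficient` is hypothetical and the divisor N − 1 is an assumption (it cancels in R). Dividing the covariance by the square root directly carries the sign of the slope, so no separate ± choice is needed.

```python
import math

def correlation_coefficient(points):
    """Correlation coefficient R of a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - y_bar) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)
    # s_xy has the same sign as the slope of the regression line.
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation_coefficient([(1, 2), (2, 5), (6, 8)])
print(round(r, 3))      # 0.945
print(abs(r) >= 0.85)   # True
```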

EXERCISE: Consider the following data.

a) Use the least squares method to find a line of best fit.

b) Find the correlation coefficient for this line.

c) Does it seem reasonable to use this line of best fit for general analysis?

d) Make a prediction as to the y-value of the data when x = 1.7. (Interpolation)

e) Make a prediction for the y-value when x = 13.2. (Extrapolation)

A WORD OF WARNING

It’s always wise to LOOK at a data set before diving in and completing a linear regression. For example, although we can certainly find a line of best fit to the data shown, it would have little meaning. (We might wish to find a quadratic or an exponential curve to fit the data.)

If you suspect data fits a curve of the form $y = ac^x$, taking logarithms gives $\log y = x \log c + \log a$, a straight-line relationship between $x$ and $\log y$. Perform a linear regression (via the methods of this section) to the table of data values shown …

Suppose we obtain a line of best fit $\log y = mx + b$ with:

$$m = 1.3, \qquad b = -0.2$$

This gives: $\log y = 1.3x - 0.2$ and so $y = \left(10^{1.3}\right)^x \cdot 10^{-0.2} \approx 0.63 \cdot 19.95^x$.
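The conversion from the fitted line back to the curve can be sketched as follows, assuming base-10 logarithms as in the computation above. The helper `fit_exponential` is hypothetical, and the four data points lie exactly on y = 2·3^x; they are made up solely to check the round trip and are not data from these notes.

```python
import math

def fit_exponential(points):
    """Fit y = a * c**x by least squares on (x, log10 y). Requires every y > 0."""
    logs = [(x, math.log10(y)) for x, y in points]
    n = len(logs)
    x_bar = sum(x for x, _ in logs) / n
    l_bar = sum(l for _, l in logs) / n
    m = (sum((x - x_bar) * (l - l_bar) for x, l in logs)
         / sum((x - x_bar) ** 2 for x, _ in logs))
    b = l_bar - m * x_bar
    return 10 ** b, 10 ** m   # a = 10^b, c = 10^m

# The fitted line from the text: log y = 1.3x - 0.2.
print(round(10 ** -0.2, 2), round(10 ** 1.3, 2))   # 0.63 19.95

# Made-up points lying exactly on y = 2 * 3**x.
a, c = fit_exponential([(0, 2), (1, 6), (2, 18), (3, 54)])
print(round(a, 6), round(c, 6))   # 2.0 3.0
```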

If you suspect data follows a curve of the form $y = ax^2$, take square roots and fit a line to the data $\sqrt{y}$ and $x$.
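The same idea in code: since $\sqrt{y} = \sqrt{a}\,x$ when $x \ge 0$, the slope of the line fitted to $\sqrt{y}$ against $x$ estimates $\sqrt{a}$. The sketch below assumes non-negative x- and y-values. The name `fit_square_law` is hypothetical, and the sample points lie exactly on y = 2x², made up purely as a check.

```python
import math

def fit_square_law(points):
    """Estimate a in y = a * x**2 by fitting a line to (x, sqrt(y))."""
    roots = [(x, math.sqrt(y)) for x, y in points]
    n = len(roots)
    x_bar = sum(x for x, _ in roots) / n
    r_bar = sum(r for _, r in roots) / n
    slope = (sum((x - x_bar) * (r - r_bar) for x, r in roots)
             / sum((x - x_bar) ** 2 for x, _ in roots))
    return slope ** 2   # sqrt(y) = sqrt(a) * x, so a is the slope squared

# Made-up points lying exactly on y = 2 * x**2.
a = fit_square_law([(1, 2), (2, 8), (3, 18), (4, 32)])
print(round(a, 6))   # 2.0
```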

And so forth.
