Chapter 5: Linear Regression and Correlation, Simple and Multiple (2020)

(1)



(2)

Introduction



Relationship between education spending and test scores

The correlation is negative (-0.2). The United States spends the second most on education of any country, yet has below-average test scores. Ethnically homogeneous Japan, South Korea and Finland spend at average rates and have the best test scores. Tiny, ethnically homogeneous and "hungry" Estonia spends less than half as much on education as the United States and Norway but has far better test scores. Source: Economy Industry USA View

The Organization for Economic Co-operation and Development (OECD) released the results of its 2009 global rankings on student performance in mathematics, reading, and science, on the Program for

(11)

Introduction



Relationship between per pupil spending and mean math scores in PISA 2012, by country

The figure shows the simple correlation between the mean scores in mathematics and the expenditure per pupil in secondary education for each of the countries that participated in PISA 2012. It is easy to see that countries like Qatar and Singapore spend similar amounts of dollars per student, yet achieve very different PISA math scores.

The Organization for Economic Co-operation and Development (OECD) released the results of its 2012 global rankings on student performance in mathematics, reading, and science, on the Program for International Student Assessment,

(12)

Ranking of top countries in math, reading, and science is out — and the US didn't crack the top 10



Source: OECD. China is represented by the provinces of Beijing, Shanghai, Jiangsu, and Guangdong.

The PISA is a worldwide exam administered every three years that measures 15-year-olds in 72 countries. About 540,000 students took the exam in 2015.

Asian countries topped the rankings across all

(13)

PISA tests: Singapore top in global education rankings/2015



"If you think maths is a hard subject you won't succeed," 10-year-old Hai Yang tells me. striking feature of Singapore's education:

*The whole class has just been working on a problem, taking it in turns to stand up and explain how they worked it out. And they do this in English, one of several languages spoken in Singapore. It turns out there is more than one way to reach the right solution.

*What is impressive is their commitment to understanding exactly how to do it.

"If we just blindly look at the teacher's answer, when we grow up we might not know how to do it any more."

Building blocks

*This is an approach known as maths mastery which some schools in the UK have begun using in an adapted form.

*"We believe in Singapore in the fundamentals, that in order for a child to be well educated you need to give them the fundamental language and grammar in various disciplines, a language where you can read, a language where you can understand numbers."

*Singapore has also thought a lot about how to make teaching a rewarding profession.

*Teachers can follow a career path that takes them towards being a principal, a researcher into education or a master classroom teacher. They get time to deepen their knowledge and prepare lessons.

*In Montfort Secondary school they are encouraging the teenage boys to make prototype products, ranging from a smart garden watering system to an electronic keyboard.

*Using your science and maths skills to solve real world problems is exactly the kind of ability the PISA tests are intended to measure. An empty room at the school is being turned into what they call a "makers lab".

*Simple tools and materials will be available for the pupils to use in their spare time to make things to take home. If they want to work out how to light up their guitar with LED lights, this is where they can do it.

*Another striking feature of Singapore's education is that head teachers are rotated between schools every six to eight years. There is also an increasing emphasis on collaboration.

*"Today teachers work in teams, they grow together, they research together, they work together."

High stakes

(14)

The Objective of Correlation and Regression



The objective of correlation is to establish the relationship between two or more quantitative variables, without inferring causal relationships. The objective of regression analysis is to establish a mathematical model that estimates the value of one variable from the values of the other variables. This technique is appropriate when:

A mathematical function or equation linking two metric-scaled (interval or ratio) variables is to be constructed, under the assumption that the values of one of the two variables depend on the values of the other.

Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal; the independent variables may be nominal, ordinal, interval, or some mixture thereof.

Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job", clearly a nominal variable. One could then use several independent variables such as job training, post-secondary education and the like to predict the odds of getting a job.

Multiple Regression Analysis Technique: this technique is appropriate

(15)

Methodology



To perform a regression and correlation analysis, it is advisable to follow these steps:

1. Collect data from sources such as questionnaires, forms, databases, texts, brochures, magazines, the internet, direct measurements, etc.

2. Draw the scatter diagram, a graph showing the intensity and direction of the relationship between two variables, which suggests the model that could be used. Suggested models can only be seen well in up to three dimensions. This question is important: Does the relationship appear to be linear or curved?

3. Calculate the values of the correlation coefficient and the coefficient of determination (note: the correlation coefficient measures the strength and direction of the linear association between the variables, and the coefficient of determination measures the percentage of variability of the dependent variable explained by the independent variable).

4. Set up the model suggested by the scatter diagram or by the experience of the investigator.

5. Estimate the regression line using a processing program with statistical applications (Excel, SPSS, Statgraphics, Minitab, SAS, Statistics, etc.) or by formulas.

(16)



Techniques for Examining Associations

Spearman Correlation

The technique is appropriate when: The degree of association between two sets of ranks (pertaining to two variables) is to be examined.

Illustrative research question(s) this technique can answer: "Is there a significant relationship between motivation levels of teachers and the quality of their performance?" Assume that the data on motivation and quality of performance are in the form of ranks, say, 1 through 50, for 50 teachers who were evaluated subjectively by their administrators on each variable.

Pearson Correlation

This technique is appropriate when: The degree of association between two metric-scaled (interval or ratio) variables is to be examined.

Illustrative research question(s) this technique can answer: "Is there a significant relationship between parents' age (measured in actual years) and their perceptions of the school's

(17)

Spearman Rank Coefficient (r_s)

• Used for monotonic relationships, including non-linear ones.
• It is a non-parametric measure of correlation.
• This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.
• The Spearman rank correlation coefficient can be computed in the following cases:

Both variables are quantitative.

Both variables are qualitative ordinal.

One variable is quantitative and the other is qualitative ordinal.
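
For reference, a common way to write the coefficient when no ranks are tied is sketched here. The symbols d_i (difference between the two ranks of unit i) and n (number of pairs) are notation introduced for this note, not taken from the slides.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)},
\qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)
```

When ranks are tied, the usual approach is to assign average ranks and compute the Pearson correlation of the two rank series instead of using this shortcut.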

(18)

Spearman Correlation



Example: Quality of life

Fourteen cities have been rated on an index that measures the quality of life. Also, the percentage of the population that has moved into each city over the past year has been determined. Have cities with higher quality of life scores attracted more new residents?

Association between quality of life and percentage of new residents

City   Quality of life   Percentage of New Residents
A      25                5
B      10                4
C      15                3
D      30                6
E      20                3
F      25                9
G      10                5
H      15                3
I      30                7
J      20                8
K      15                5
L      17                6
M      20                7
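
As a rough cross-check outside SPSS, the rank correlation for this example can be sketched in plain Python. This is a minimal sketch, not the SPSS procedure: it assigns average ranks to ties and then takes the Pearson correlation of the ranks. It uses only the 13 cities (A to M) that appear in the table, so its result need not equal a coefficient computed on all 14 cities. The function names are illustrative.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# The 13 cities listed in the table (A to M).
quality = [25, 10, 15, 30, 20, 25, 10, 15, 30, 20, 15, 17, 20]
new_residents = [5, 4, 3, 6, 3, 9, 5, 3, 7, 8, 5, 6, 7]

rho = spearman(quality, new_residents)
print(f"Spearman rho for the 13 listed cities: {rho:.3f}")
```

Ranking ties by their mean rank is the convention most statistical packages follow, which is why the Pearson-on-ranks form is used here rather than the rank-difference shortcut.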

(19)

Steps in SPSS for Spearman correlation



(20)

OUTPUT DATA – Spearman correlation



Correlations (Spearman's rho)

                                                        Quality of Life   Percentage of New Residents
Quality of Life               Correlation Coefficient   1.000             .586*
                              Sig. (2-tailed)           .                 .028
                              N                         14                14
Percentage of New Residents   Correlation Coefficient   .586*             1.000
                              Sig. (2-tailed)           .028              .
                              N                         14                14

*. Correlation is significant at the 0.05 level (2-tailed).

(21)

Simple Correlation (r) Pearson



It is also called Pearson's correlation or the product-moment correlation coefficient. It measures the direction (given by the sign) and the strength (given by the value of r) of the association between two quantitative variables.

Direct or positive: the values of the two variables deviate in the same direction, i.e. an increase (or decrease) in the values of one variable corresponds, on average, to an increase (or decrease) in the values of the other variable. Examples:

•Student's performance and number of hours studied
•Satisfaction and loyalty at work.

Inverse or negative: the variables deviate in opposite directions, i.e. an increase in the values of one variable corresponds, on average, to a decrease in the values of the other variable. Examples:

•TV viewing and class grades: students who spend more time watching TV tend to have lower grades (or, phrased differently, students with higher grades tend to spend less time watching TV)

(22)

Pearson Correlation



The value of "r" ranges between (-1) and (+1).

The value of "r" denotes the strength of the association, as illustrated by the following scale. If r = 0 or close to zero, this means no association or correlation between the two variables.

r = -1: perfect inverse correlation
-1 to -0.75: strong inverse
-0.75 to -0.25: moderate inverse
-0.25 to 0.25: weak (r = 0: no relation)
0.25 to 0.75: moderate direct
0.75 to 1: strong direct
r = 1: perfect direct correlation

(23)

Example



A sample of 12 students was selected, and data about their performance and the time at which they usually wake up were recorded, as shown in the following table. It is required to find the correlation between performance and the time at which the student usually wakes up.

Student     Wake-up Time   Academic Performance
Kalisa      5.30           13.0
Seraphine   10.00          9.0
Manasse     8.00           13.0
Odette      9.00           11.0
Laurence    6.00           16.0
Pascal      7.00           10.0
Gallican    7.30           13.0
Marcel      6.00           11.0
Sandrine    5.00           14.0
Acqueline   9.30           10.0
Judith      5.30           16.5
Innoncent   7.30           12.0

Hypothesis

H0: ρ = 0 (there is no association between performance and the time at which students usually wake up)

Ha: ρ ≠ 0 (there is an association between them)

(24)

Steps in SPSS



Again, to perform a correlation and regression analysis, it is advisable to follow these steps:

Step 1: Scatter Diagram (after collecting the data, draw the scatter diagram)

The starting point is to draw a scatter of points on a graph, with one variable on the X-axis and the other variable on the Y-axis; it is customary to represent the dependent variable on the vertical axis and the independent variable on the horizontal axis. When studying the relationship between two variables, one can be considered the cause and the other the result or effect. The variable that causes is called the exogenous or independent variable; the effect is the endogenous (dependent) variable. The scatter plot gives an idea of the relationship (if any) between the variables as suggested by the data. The closer the points lie to a straight line, the stronger the linear relationship between the two variables.

(25)

Steps and Output of scatter dot



(26)

Step 2. Correlation



(27)



OUTPUT - Correlation

Correlations

                                             Wake up-Time   Academic performance
Wake up-Time           Pearson Correlation   1              -.720**
                       Sig. (2-tailed)                      .008
                       N                     12             12
Academic performance   Pearson Correlation   -.720**        1
                       Sig. (2-tailed)       .008
                       N                     12             12

**. Correlation is significant at the 0.01 level (2-tailed).

These variables have a fairly strong inverse association (r = -.720): wake-up time is related to academic performance. Sig. = .008 is below 0.01, so the null hypothesis of no association is rejected; the inverse relationship between the time students wake up and their performance is statistically significant (meaning: the later students get up, the lower their score).

The coefficient of determination is the percentage of variation in the dependent variable 'Y' explained by the independent variable 'X'.

How well does this line fit the data?

The value of r² = (-.720)² = 0.5184, i.e. 51.84% ≈ 52%.

The 'goodness of fit' indicates the percentage of the variation in performance that is accounted for by the variation in wake-up time; in other words, 52% of the variance in performance is explained by the time at which students wake up.
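
The correlation and the coefficient of determination for this example can be sketched in plain Python as a cross-check. This sketch assumes the wake-up times are entered exactly as the decimal values printed in the table (for instance 5.30, not converted to 5.5 hours); under that assumption the result should agree with the SPSS value of about -.720. The t statistic for testing ρ = 0 is an addition for illustration; the slides report only the significance value.

```python
from math import sqrt

# Values as printed in the example table (assumption: used as plain decimals).
wake_up = [5.30, 10.00, 8.00, 9.00, 6.00, 7.00, 7.30, 6.00, 5.00, 9.30, 5.30, 7.30]
performance = [13.0, 9.0, 13.0, 11.0, 16.0, 10.0, 13.0, 11.0, 14.0, 10.0, 16.5, 12.0]

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y over the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

n = len(wake_up)
r = pearson_r(wake_up, performance)
r_squared = r ** 2                            # coefficient of determination
t_stat = r * sqrt((n - 2) / (1 - r_squared))  # test statistic for H0: rho = 0, df = n - 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, t = {t_stat:.2f} with {n - 2} df")
```

Comparing |t| with the two-tailed critical value for 10 degrees of freedom (about 2.23 at the 0.05 level) leads to the same decision as the Sig. column.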

(28)

Example



Country % Immunization Mortality_rate

Bolivia 77 118

Brasil 69 65

Cambodia 32 184

Canadá 85 8

China 94 43

Czech_Republic 99 12

Egypt 89 55

Ethiopia 13 208

Finland 95 7

France 95 9

Greece 54 9

India 89 124

Italy 95 10

Japan 87 6

México 91 33

Poland 98 16

Russian_federation 73 32

Senegal 47 145

Turkey 76 87

A study was conducted to find whether there is any relationship between the mortality rate and the percentage of immunization in some countries of the world. The following set of data was found on the page "http://www.unicef.org/statistics/". Let us determine whether there is a relationship in this set of data. The first column lists the countries, and the second and third columns give the % of immunization and the mortality rate of each country.

(29)

Steps in SPSS to draw the scatter diagram

Graphs > Chart builder > OK > from the variable box, take the variable immunization to "x-axis" and Rate_mortality to "y-axis", then click Group Point ID > take the variable country to the Point ID > OK

(30)

Step 3. Regression Analysis



Scatter diagram of the mortality rate by % immunization with regression line inserted in some countries in the world

(31)

Steps in SPSS for Regression



Analyze > Regression > Linear >

(32)

Interpretation from outcome of SPSS



•Checking the Model Fit

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .791a   .626       .605                40.13931

a. Predictors: (Constant), Immunization %

The model summary table reports the strength of the relationship between the model and the dependent variable. R = .791, the correlation coefficient, is the linear correlation between the observed and model-predicted values of the dependent variable. Its large value indicates a strong relationship; R is reported without a sign, so the direction of the relationship must be read from the slope of the regression line.

R Square = .626, the coefficient of determination, is the squared value of the correlation coefficient. It shows that about 62.6% of the variation in mortality is explained by the model.

ANOVAa

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   48497.050        1    48497.05      30.101   .000b
  Residual     29000.950        18   1611.16
  Total        77498.000        19

a. Dependent Variable: Mortality_rate
b. Predictors: (Constant), Immunization %

The significance value of the F statistic is less than 0.05, which means that the variation explained by the model is unlikely to be due to chance.
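
The quantities in the Model Summary follow arithmetically from the ANOVA table. The sketch below recomputes them from the reported sums of squares and degrees of freedom; the variable names are illustrative, and the adjusted R² formula used is the standard one, which the slides do not state explicitly.

```python
from math import sqrt

# Sums of squares and degrees of freedom as reported in the ANOVA table.
ss_regression = 48497.050
ss_residual = 29000.950
df_regression = 1
df_residual = 18

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1          # number of observations implied by the df

r_squared = ss_regression / ss_total         # share of variation explained
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual         # F = explained / unexplained mean square
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
std_error_estimate = sqrt(ms_residual)       # "Std. Error of the Estimate"

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adjusted_r_squared:.3f}")
print(f"F = {f_stat:.3f}, std. error of the estimate = {std_error_estimate:.3f}")
```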

(33)

Checking the coefficients of the regression line

(parameter estimates)



This table shows the coefficients of the regression line:

•The first row (Constant) is the intercept, the point where the regression line crosses the Y axis. In other words, this is the predicted value of mortality when the independent variable is 0.

•The second row gives the slope, the value used in the regression equation for predicting the dependent variable from the independent variable.

Coefficientsa

Model              Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                   B          Std. Error         Beta
1 (Constant)       224.316    31.440                                         7.135    .000
  Immunization %   -2.136     .389               -.791                       -5.486   .000

a. Dependent Variable: Mortality_rate

The regression equation can be presented in many different ways, for example:

Mortality predicted = 224.316 - 2.136 * % of immunization

β₀ = 224.316: the average mortality rate without any influence of the % of immunization (constant).

β₁ = -2.136: the decrease in mortality rate for each additional % of immunization; the slope of the line is nonzero, as the correlation indicates.

(34)

Prediction of Mortality Rate



What rate of mortality could be predicted for the group of countries with

80% immunization?

The best estimate of the mortality is obtained by substituting the value of

80% for that of the independent variable, x, and calculating the

corresponding value of the Mortality.

Estimated Mortality:

\[
\hat{Y} = 224.316 - 2.136\,X = 224.316 - 2.136 \times 80 = 53.436 \approx 53
\]

where X is the % of immunization and \hat{Y} is the predicted rate of mortality.

The expected mortality rate would be about 53.

With these results we conclude:

1. The variables are linearly associated in the population from which the sample comes (the probability that the relationship found is due to chance is very small, less than one per thousand).

2. The relationship is strong (r = -.791); the independent variable (% of immunization) explains 62.6% (R² = .626) of the variability of the dependent variable (mortality).

3. The relationship is inverse or negative: the mortality rate decreases on average by 2.136 for each 1% increase in immunization in the countries under study.

(35)

The Multiple Linear Regression Model



Multiple linear regression is an extension of the simple model that incorporates two or more independent variables. Multiple regression analysis produces an equation with several coefficients, depending on the number of independent variables X introduced into the model, thus generating hyperplanes.

\[
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i
\]

β₀ is the intercept and β_j determines the contribution of the independent variable X_j.

X₁ᵢ, X₂ᵢ, …, X_kᵢ are the values of the independent variables for unit i, and ε_i is the error term for unit i.

Why is this important?

The outcome is rarely a function of just one variable, but is instead influenced by many variables. So the idea is that we should be able to obtain a more accurate predicted score by using multiple variables to predict our outcome.

(36)

Example for Multiple Regression



The following table presents information on three variables for a small sample of eight nations. We will take the abortion rate as the dependent variable and examine the relationship with two variables: one measures the status and power of women and the other measures religiosity.

Nation    Abortion Rate (Y)   Women's Status (x₁)   Religiosity (x₂)
Canada    165                 0.5                   74
Chile     100                 0.45                  93
Denmark   400                 0.8                   48
Germany   208                 0.54                  67
Italy     389                 0.7                   70
Japan     379                 0.52                  55
UK        207                 0.58                  67
US        428                 0.84                  35

The research question might be: "How much does an independent variable contribute to explaining dependent variable after the effect of another independent

(37)

Output from SPSS (simple correlation)



Correlations

                                            Abortion Rate (Y)   Women's Status (x₁)   Religiosity (x₂)
Abortion Rate (Y)     Pearson Correlation   1                   .817*                 -.842**
                      Sig. (2-tailed)                           0.013                 0.009
                      N                     8                   8                     8
Women's Status (x₁)   Pearson Correlation   .817*               1                     -.801*
                      Sig. (2-tailed)       0.013                                     0.017
                      N                     8                   8                     8
Religiosity (x₂)      Pearson Correlation   -.842**             -.801*                1
                      Sig. (2-tailed)       0.009               0.017
                      N                     8                   8                     8

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).

(38)

Scatter Diagram



The regression line is the best straight-line description of the plotted points, and you can use it to describe the association between the variables.

If all the points fall exactly on the line, then the error is 0 and you have a perfect relationship.

In the first chart we can see that nations where women have a higher status tend to have higher abortion rates.

(39)

Steps for Multiple Regressions in SPSS



1. Click ANALYZE
2. Select REGRESSION
3. Click LINEAR
4. Move "Abortion Rate" to the DEPENDENT box
5. Move "Women status and religiosity" to the INDEPENDENT(S) box
6. Click OK
7. Continue the

(40)

Output Multiple Correlation and Coefficient of Determination

Model Summaryb

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .875a   0.765      0.671               73.19844                     1.569

a. Predictors: (Constant), Religiosity (x₂), Women's Status (x₁)
b. Dependent Variable: Abortion Rate (Y)

Interpret the multiple correlation coefficient (R) and the coefficient of multiple determination (R²). How much of the variance in abortion rate is explained by the two independent variables?

R = .875 (the fit improves when the two independent variables are combined, compared with either simple correlation); in other words, there is a strong correlation between religiosity and women's status, taken together, and the abortion rate.

R² = .765: 76.5% of the variation in abortion rate can be explained by variation in religiosity and women's status.
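
The Adjusted R Square column in the Model Summary can be checked by hand. The formula below is the standard adjustment for the number of predictors; it is not stated on the slides, and the symbols n (number of nations) and k (number of independent variables) are introduced here for illustration.

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
                 = 1 - (1 - 0.765)\,\frac{8 - 1}{8 - 2 - 1}
                 = 1 - 0.235 \times 1.4
                 = 0.671
```

With only eight nations and two predictors, the adjustment is noticeable: the adjusted value (0.671) is well below the unadjusted 0.765.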

(41)

Assumption of Autocorrelation



Checking that the errors are not correlated, and checking multicollinearity

The Durbin-Watson statistic tests for serial correlation between adjacent error terms (residuals). The statistic ranges from 0 to 4. A value around 2 means that the errors are not correlated, a value well below 2 that the errors are positively correlated, and a value well above 2 that they are negatively correlated. In the example, Durbin-Watson = 1.569 is only slightly less than 2, suggesting that the error terms are not autocorrelated.

Multicollinearity exists when the independent variables in a regression equation are highly correlated among themselves. When multicollinearity is too high, the individual parameter estimates become difficult to interpret. Most regression programs can compute variance inflation factors (VIF) for each variable. As a rule of thumb, a VIF above 5.0 suggests problems with multicollinearity.
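
Both diagnostics can be sketched in a few lines of plain Python. The VIF shortcut below holds only for a model with exactly two predictors, where it depends solely on their mutual correlation (r = -.801 between women's status and religiosity in the correlation table). The residuals passed to durbin_watson are made-up values for illustration, not the residuals of the abortion-rate model; the function names are likewise illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences of the residuals over their sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def vif_two_predictors(r12):
    """VIF when the model has exactly two predictors with mutual correlation r12."""
    return 1.0 / (1.0 - r12 ** 2)

# Correlation between the two predictors, taken from the correlation table.
vif = vif_two_predictors(-0.801)
print(f"VIF for each predictor: {vif:.2f}")

# Hypothetical residuals that alternate in sign (negative autocorrelation).
demo_residuals = [1.0, -1.0, 1.0, -1.0]
demo_dw = durbin_watson(demo_residuals)
print(f"Durbin-Watson for the made-up residuals: {demo_dw:.1f}")
```

Under the rule of thumb quoted above, a VIF of this size stays below the 5.0 threshold even though the two predictors are strongly correlated.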

(42)

ANOVA Regression



Hypothesis of Slope

Approach of the hypothesis:

H0: β₁ = β₂ = 0 (all the regression coefficients are simultaneously equal to zero)

H1: at least one βj ≠ 0 (at least one regression coefficient is not equal to zero)

ANOVAa

Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   87171.94         2    43585.971     8.135   .027b
  Residual     26790.06         5    5358.012
  Total        113962           7

a. Dependent Variable: Abortion Rate (Y)
b. Predictors: (Constant), Religiosity (x₂), Women's Status (x₁)

Interpretation: As Sig. < 0.05, we reject the null hypothesis, indicating that at least one of the explanatory variables is related to the abortion rate. We conclude that the model is useful for predicting.

(43)

Regression Equation



Coefficientsa

Model                   Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                        B          Std. Error         Beta
1 (Constant)            310.885    345.19                                         0.901    0.409
  Women's Status (x1)   348.413    317.472            0.398                       1.097    0.322
  Religiosity (x2)      -3.789     2.624              -0.523                      -1.444   0.208

a. Dependent Variable: Abortion Rate (Y)

•Find the multiple regression equation with Women's Status (x₁) and religiosity (x₂).

The model has the following equation:

\[
\hat{y} = 310.885 + 348.413\,x_1 - 3.789\,x_2
\]

that is, Abortion rate = 310.885 + 348.413 * status - 3.789 * religiosity.

Religiosity is negatively related to abortion rate and women's status is positively related to abortion rate.

•What abortion rate would be expected for a Women's Status of 0.75 and a religiosity of 90?

\[
\text{Abortion rate} = 310.885 + 348.413 \times 0.75 - 3.789 \times 90 = 231.18
\]

The predicted abortion rate is 231.18.
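
The coefficients of this two-predictor model can be reproduced from the eight nations in the example table. The sketch below solves the least-squares normal equations in closed form, which is practical only because there are exactly two predictors; it is a hand-rolled illustration rather than the routine SPSS uses, and the helper names are hypothetical.

```python
# Data from the example table of eight nations.
abortion_rate = [165, 100, 400, 208, 389, 379, 207, 428]          # Y
womens_status = [0.50, 0.45, 0.80, 0.54, 0.70, 0.52, 0.58, 0.84]  # x1
religiosity = [74, 93, 48, 67, 70, 55, 67, 35]                    # x2

def mean(values):
    return sum(values) / len(values)

def centered_cross_product(u, v):
    """Sum of products of deviations from the means (a sum of squares when u is v)."""
    mean_u, mean_v = mean(u), mean(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))

s11 = centered_cross_product(womens_status, womens_status)
s22 = centered_cross_product(religiosity, religiosity)
s12 = centered_cross_product(womens_status, religiosity)
s1y = centered_cross_product(womens_status, abortion_rate)
s2y = centered_cross_product(religiosity, abortion_rate)
syy = centered_cross_product(abortion_rate, abortion_rate)

# Solve the 2x2 normal equations for the two slopes, then recover the intercept.
determinant = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / determinant
b2 = (s11 * s2y - s12 * s1y) / determinant
b0 = mean(abortion_rate) - b1 * mean(womens_status) - b2 * mean(religiosity)

r_squared = (b1 * s1y + b2 * s2y) / syy   # explained share of the variation in Y
predicted = b0 + b1 * 0.75 + b2 * 90      # women's status 0.75, religiosity 90

print(f"y-hat = {b0:.3f} + {b1:.3f} * x1 + ({b2:.3f}) * x2")
print(f"R^2 = {r_squared:.3f}, predicted abortion rate = {predicted:.2f}")
```

For models with more than two predictors, a general linear-algebra solver is the usual choice; the closed form is used here only to keep each step visible.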

(44)

Assignment 5



1. Find and interpret the relationship between Anxiety and Test Scores (follow all steps)

Anxiety (x)      10   8   2   1   5   6
Test score (Y)    2   3   9   7   6   5

2. In a study of the relationship between level of education and income the following data were obtained. Find the relationship between them and comment.

Sample   Level of Education (x)   Income (y)
A        Preparatory              25
B        Primary                  10
C        Master's degree          8
D        Secondary                10
E        Bachelor degree          15
F        Illiterate               50
G        Postgraduate diploma     60

Compute the Spearman rank correlation coefficient and test it for significance at the .05 level. What conclusion may be reached?

3. A psychologist believes that those who score high on a need-achievement test will likely have a high salary to match. To test this theory, the psychologist has given questionnaires to a random sample of 17 subjects and has ranked the data so that the highest value in each category has been assigned a 1.

Subject                    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q
Rank - Need Achievement    1   8   4  10  12   2  13   6  16  11  14   3   9   7  15  17   5
Salary Rank                3   7   2  12   9   1  11   6  17  13  15   5  10   8  14  16   4

(46)

Assignment 5



5. Open the SPSS data file "survey_sample.sav". This data file contains survey data, including demographic data and various attitude measures. It is based on a subset of variables from the 1998 NORC General Social Survey.

With this data, calculate and interpret:

a. Compute and interpret the coefficient of simple correlation (hours per day watching TV and highest year of school completed).

b. Draw a scatter diagram of the variables from the previous item and interpret it.

c. Compute and interpret the multiple coefficient of correlation. (From here on, use the variables indicated below.)

Data:

Dependent variable: Total family income

Independent variables: Age of respondent; Highest year of school completed; Highest year school completed, father; Highest year school completed, mother; Highest year school completed, spouse.

d. Compute and interpret the multiple coefficient of determination within the context of this problem.

e. Compute and interpret the multiple regression equation. Is the model significant? (Perform the hypothesis test for multiple regression analysis.)

f. From the analysis performed, would you recommend removing any variable(s) that do not contribute significantly to the model?

g. Check whether the assumption of no autocorrelation holds (Durbin-Watson).

h. What total family income would be expected for a 50-year-old participant who has 15 years of study, whose spouse completed 14 years of study and the

(47)



Assignment 5

6. For a hypothetical sample of 20 patients, the following data have been collected: cholesterol level in blood plasma (in mg/100 ml), age (in years), saturated fat intake (in g/week) and level of exercise (quantified as 0: no exercise, 1: moderate exercise and 2: intense exercise). Fit a linear model relating cholesterol level to the other variables.

Develop the analysis in statistical software and interpret the output. Note: answer the same questions as in the previous exercise (items 'a' to 'h').

h. What cholesterol level would be expected for a 60-year-old patient who consumes 40 grams of fat and does not do any type of exercise?

Patient Cholesterol Age Fat Exercise

1 350 80 35 0

2 190 30 40 2

3 263 42 15 1

4 320 50 20 0

5 280 45 35 0

6 198 35 50 1

7 232 18 70 1

8 320 32 40 0

9 303 49 45 0

10 220 35 35 0

11 405 50 50 0

12 190 20 15 2

13 230 40 20 1

14 227 30 35 0

15 440 30 80 1

16 318 23 40 2

17 212 35 40 1

18 340 18 80 0

19 195 22 15 0