What cannot be measured cannot be managed.
Timely forecast prevents distress sale, investment and trade
Introduction
Reliable and timely forecasts of business output are needed for various strategic decisions relating to storage, distribution, pricing, marketing, import-export, etc. However, prior estimates are often only subjective guesswork rather than objective estimates. A statistically sound, objective forecast of business output is therefore of paramount interest to any business organisation. Regression is a statistically sound description of the relationship between a dependent variable and a set of explanatory variables of interest. It puts together a set of formal expressions of these relationships until the behaviour of the model adequately mimics the behaviour of the system. It may also be defined as the conditional expectation of a dependent variable for given values of the explanatory variables.
The British biometrician Sir Francis Galton (1822-1911) first noticed that the offspring of exceptionally tall or short parents regress, or return, towards the average height of the population, and he introduced the word regression. Since then, regression has been used as a statistical technique to estimate and test functional relationships in a set of interrelated variables. It is a powerful statistical technique. The purpose of regression analysis is:
- To estimate the functional relationship (conditional expectation) between a dependent variable (character) and a set of interrelated variables.
- To forecast the dependent variable from the independent variables.
Linear regression models
Y = β0 + β1X1 + β2X2 + ... + e
where Y and Xi are output and business characters respectively. These may be used in the original scale, or suitably transformed variables may be used instead. β0 and βi are constants to be estimated and e is the random error.
Y: response/dependent/outcome/endogenous variable.
X: predictor/independent/explanatory/exogenous/prognostic/predisposing variable.
X and Y may be quantitative or qualitative. A quantitative variable may be either discrete or continuous. A qualitative variable is also called a categorical variable. All variables are represented numerically in regression analysis, so categorical variables should be converted into numerical form by the dummy variable technique.
15-24th September 2014
These models may be improved by taking principal components of business inputs or growth indices as regressors. The growth indices may be any index that is used as an economic or business barometer. The performance of a regression model is judged by the coefficient of determination (R²). It indicates the percentage of variation in Y explained by the independent variables. Since it is the shared variance, it is also sometimes interpreted as the degree of association.
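As a minimal sketch of the linear model and of R², the Python fragment below fits Y = β0 + β1X1 + β2X2 + e by least squares and computes the share of variation explained. The data values and the helper names fit_ols and r_squared are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares estimates (b0, b1, ...) for y = b0 + b1*X1 + ... + e."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    """Coefficient of determination: 1 - SSE / SST."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.column_stack([np.ones(len(y)), X]) @ beta
    sse = np.sum((y - fitted) ** 2)        # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)      # total variation
    return 1.0 - sse / sst

# Hypothetical data: two business inputs and one output.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]], dtype=float)
y = np.array([6.1, 6.4, 9.2, 10.3, 14.1, 15.4])

beta = fit_ols(X, y)
r2 = r_squared(X, y, beta)
print("estimates:", beta)
print("R squared:", r2)
```

Multiplying R² by 100 gives the percentage of variation in Y explained by the regressors.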
Use of Dummy Variable
A dummy variable is also called a dichotomous/binary/contrast variable. It takes only two values, usually 0 and 1. For example, a categorical variable with two categories can be represented by a single dummy variable. The variable place of residence, with categories urban and rural, is an example: U = 1 if urban, 0 otherwise (rural). The category assigned the value 0 is called the reference category. The choice of reference category is arbitrary.
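A small Python sketch of this coding is given here, using made-up profit figures for six units (the data and variable names are assumptions for illustration only). With rural as the reference category, the fitted intercept equals the rural mean and the fitted slope equals the urban-minus-rural difference in means.

```python
import numpy as np

# Hypothetical data: place of residence and an outcome (profit) for six units.
residence = ["urban", "rural", "urban", "rural", "urban", "rural"]
profit = np.array([12.0, 8.0, 14.0, 10.0, 16.0, 9.0])

# Dummy coding: 1 if urban, 0 otherwise (rural is the reference category).
x = np.array([1.0 if r == "urban" else 0.0 for r in residence])

# Fit profit = a + b * x by least squares.
design = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(design, profit, rcond=None)

print("intercept a (rural mean):", a)
print("slope b (urban minus rural):", b)
```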
Why do we assign 1 and 0 in defining a dummy variable? It is simple and convenient for the interpretation of a and b.
e.g.: Y = a + b X
where Y: profit, X: 1 if urban, 0 otherwise
If X = 0, Y = a. Here a is the profit of the reference category (i.e. rural).
If X = 1, Y = a + b. Here b is the difference in profit between urban and rural.
Fundamental Assumption:
The error terms ei play an important role in investigating the adequacy of the fitted model and in detecting departures from the fundamental assumptions. The assumptions are
- mean zero, E(ei) = 0 (linearity)
- constant variance, Var(ei) = σ² (homoscedasticity)
- uncorrelated, E(ei ej) = 0 for i ≠ j (independence)
Managing a model
While undertaking regression and prediction, one should be aware of the following hurdles and how to overcome them:
- Outliers (shocking values)
- Interaction
- Multicollinearity
- Autocorrelation
Outlier: In Mr Hadlum vs Mrs Hadlum (1949), adultery was alleged on the ground that a child was delivered in August 1945, 349 days after the husband's departure for military service abroad. The 349-day gestation period was an outlier; the question was whether it was discordant, and the appeal failed. The House of Lords later fixed the credibility limit for the gestation period at 360 days (Preston Jones vs Preston Jones 1952). An outlier may be mean-shifted or variance-shifted in the model. The Cook statistic, the A P statistic and the Qi statistic are useful in identifying outliers. Outliers lie far away from the other observations (they are often due to data errors) and exert a strong influence on b0 and b1.
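As an illustration of the Cook statistic, the sketch below computes Cook's distance for a simple regression using the usual textbook form D_i = (e_i² / (p s²)) · h_ii / (1 − h_ii)², where h_ii are the diagonal elements of the hat matrix. The data are hypothetical, with the last observation deliberately displaced; the function name cooks_distance is an assumption.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])   # intercept + one predictor
    n, p = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
    h = np.diag(hat)                            # leverages h_ii
    resid = y - hat @ y                         # ordinary residuals e_i
    s2 = resid @ resid / (n - p)                # residual mean square
    return (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

# Hypothetical data: roughly y = 2x + 1, with the last point far off the line.
x = np.arange(1, 11)
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.0, 0.1, 19.0])

d = cooks_distance(x, y)
print("most influential observation (0-based index):", int(np.argmax(d)))
```

An observation with a Cook's distance much larger than the rest deserves a check for data error before it is retained or dropped.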
Outliers may be detected through:
(i) Scatter plot,
(ii) Minimum and maximum values of the variable,
(iii) Frequency distribution,
(iv) Box plot,
(v) Case-wise diagnostics for outliers (through a statistical package)
Interaction:
It implies that the effect of one variable, say X1, on Y depends on the level of another variable X2. e.g. 1. The effect of education X1 on fertility Y is found to be larger at higher income X2. 2. The effect of age X1 on the number of visits to a doctor Y is found to be higher in women than in men (sex X2).
How to deal with interaction:
- Use a model that includes interaction effects:
Y = b0 + b1 X1 + b2 X2 + b12 X1 X2
= b0 + b1 X1 + b2 X2 + b3 X3, where b3 = b12 and X3 = X1 X2
If X2 is increased by 1 unit, then
Y* = b0 + b1 X1 + b2 (X2 + 1) + b3 (X2 + 1) X1
= b0 + b1 X1 + b2 X2 + b3 X1 X2 + b2 + b3 X1
= Y + (b2 + b3 X1)
The effect of a 1-unit increase in X2 is to increase Y by b2 + b3 X1. That is, the effect on Y of increasing X2 by 1 unit depends on the level of X1 (which is interaction), where b2 is the main effect and b3 X1 is the interaction effect (sometimes b3 itself is called the interaction effect).
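The algebra above can be checked numerically. The sketch below uses arbitrary, hypothetical coefficient values and shows that the change in Y for a 1-unit increase in X2 equals b2 + b3 X1, so it differs with the level of X1.

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Model with interaction: Y = b0 + b1*X1 + b2*X2 + b3*X1*X2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Hypothetical coefficients, for illustration only.
b0, b1, b2, b3 = 1.0, 0.5, 2.0, 0.3

def effect_of_unit_increase_in_x2(x1, x2=3.0):
    """Change in Y when X2 goes from x2 to x2 + 1, holding X1 fixed."""
    return predict(x1, x2 + 1.0, b0, b1, b2, b3) - predict(x1, x2, b0, b1, b2, b3)

for x1 in (0.0, 5.0, 10.0):
    print("X1 =", x1, "effect of +1 in X2:", effect_of_unit_increase_in_x2(x1),
          "b2 + b3*X1:", b2 + b3 * x1)
```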
Multicollinearity
- Two predictor variables are said to be multicollinear if they are highly correlated (r > 0.8). If X1 and X2 are perfectly correlated, the model reduces exactly to a straight line (instead of a plane), Y = m + nX. In this situation b0, b1 and b2 cannot be solved uniquely. However, if X1 and X2 are not perfectly correlated, the coefficients are solvable.
How to deal
- Collinearity statistic, e.g. Variance Inflation Factor, VIF = 1/(1 - R²), where R² is obtained by regressing one predictor on the remaining predictors. If VIF > 5 or 10, there is multicollinearity. In real-life observational data a certain amount of multicollinearity is inevitable. When two predictor variables are correlated and both are important, one should not eliminate either of them to reduce multicollinearity unless r > 0.8. Using a computer, calculate the correlation matrix and scan it for correlations > 0.8 between pairs of predictor variables. Alternatively, ridge regression or regression on principal component scores may be used.
- Insert higher polynomial terms if the predictor is not linearly related to the response variable, e.g. Y = b0 + b1 X1 + b2 X2 + b3 X2². If b3 is significant, retain the term; if not, discard it. Higher-order terms may be introduced in the same way, Y = b0 + b1 X1 + b2 X2 + b3 X2² + b4 X2³ ..., for as long as the respective b is significant.
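The VIF check can be sketched in Python as follows. Each predictor is regressed on the remaining predictors and VIF = 1/(1 − R²) is computed from that auxiliary regression. The function name vif and both data sets are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Hypothetical predictors: X2 is almost a copy of X1 (strong multicollinearity).
collinear = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
])
# Hypothetical predictors that are exactly uncorrelated.
orthogonal = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])

print("VIF, collinear predictors:", vif(collinear))
print("VIF, uncorrelated predictors:", vif(orthogonal))
print("correlation matrix:")
print(np.corrcoef(collinear, rowvar=False))
```

With only two predictors the VIF reduces to 1/(1 − r²), so scanning the correlation matrix and computing the VIF give the same verdict; with more predictors the VIF can flag collinearity that no single pairwise correlation reveals.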
Autocorrelation: In time series data, the assumption of uncorrelated/independent errors is often violated and the errors exhibit serial correlation, i.e. E(ei ei+j) ≠ 0 for some j ≠ 0. The error terms are then said to be autocorrelated. A primary cause of autocorrelation is the failure to include one or more important predictor variables. e.g. Y: annual sales of a soft drink, X1: annual advertising expenditure. If X2: population size, which influences Y, is not included in the model, this causes autocorrelation.
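The soft-drink example can be imitated with simulated data. In the sketch below every number is invented: sales depend on advertising and on a steadily growing population, and the lag-1 correlation of the residuals is used as a simple (assumed) diagnostic of serial correlation. Leaving population out of the model pushes its trend into the residuals.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def lag1_autocorr(resid):
    """Lag-1 sample autocorrelation of a residual series."""
    resid = resid - resid.mean()
    return float(resid[:-1] @ resid[1:] / (resid @ resid))

rng = np.random.default_rng(0)                  # fixed seed for repeatability
years = np.arange(20)
advertising = np.array([(7 * t) % 5 for t in years], dtype=float)  # X1, no trend
population = 50.0 + years                        # X2, grows every year
sales = 2.0 + 1.5 * advertising + 0.8 * population + rng.normal(0.0, 0.5, size=20)

r_omitted = lag1_autocorr(ols_residuals(advertising, sales))
r_full = lag1_autocorr(
    ols_residuals(np.column_stack([advertising, population]), sales)
)
print("lag-1 residual autocorrelation, X2 omitted:", r_omitted)
print("lag-1 residual autocorrelation, X2 included:", r_full)
```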
Dealing with Autocorrelation
There are three approaches to dealing with autocorrelation.
- If autocorrelation is present, identify the omitted predictor variable(s) and include them in the model.
- If the problem cannot be resolved by including omitted factors, turn to a model that specifically incorporates the autocorrelation structure (this needs special parameter estimation techniques).
- The weighted/generalised least squares method can be used if there is sufficient knowledge of the autocorrelation structure.
Example 1: Data for Linear Regression
popcorn   oil amt   batch   yield   trial
plain     little    large    8.2    1
gourmet   little    large    8.6    1
plain     lots      large   10.4    1
gourmet   lots      large    9.2    1
plain     little    small    9.9    1
gourmet   little    small   12.1    1
plain     lots      small   10.6    1
gourmet   lots      small   18      1
plain     little    large    8.8    2
gourmet   little    large    8.2    2
plain     lots      large    8.8    2
gourmet   lots      large    9.8    2
plain     little    small   10.1    2
gourmet   little    small   15.9    2
plain     lots      small    7.4    2
gourmet   lots      small   16      2
Parameter Estimates (Response: yield)
Term                                            Estimate   Std Error   t Ratio   Prob>|t|
Intercept                                       10.75      0.354436    30.33     <.0001*
popcorn[gourmet]                                1.475      0.354436    4.16      0.0032*
oil amt[little]                                 -0.525     0.354436    -1.48     0.1768
popcorn[gourmet]*oil amt[little]                -0.5       0.354436    -1.41     0.1960
batch[large]                                    -1.75      0.354436    -4.94     0.0011*
popcorn[gourmet]*batch[large]                   -1.525     0.354436    -4.30     0.0026*
oil amt[little]*batch[large]                    -0.025     0.354436    -0.07     0.9455
popcorn[gourmet]*oil amt[little]*batch[large]   0.5        0.354436    1.41      0.1960
R² 0.892456
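The estimates in Example 1 can be reproduced with a short Python sketch. It assumes that each two-level factor is coded +1 for the level named in the term label (gourmet, little, large) and -1 for the other level; this coding is inferred from the labels rather than stated in the output, but it does return the tabulated values. Variable names are illustrative.

```python
import numpy as np

# (popcorn, oil amt, batch, yield) rows from the Example 1 data, trials 1 and 2.
rows = [
    ("plain", "little", "large", 8.2), ("gourmet", "little", "large", 8.6),
    ("plain", "lots", "large", 10.4), ("gourmet", "lots", "large", 9.2),
    ("plain", "little", "small", 9.9), ("gourmet", "little", "small", 12.1),
    ("plain", "lots", "small", 10.6), ("gourmet", "lots", "small", 18.0),
    ("plain", "little", "large", 8.8), ("gourmet", "little", "large", 8.2),
    ("plain", "lots", "large", 8.8), ("gourmet", "lots", "large", 9.8),
    ("plain", "little", "small", 10.1), ("gourmet", "little", "small", 15.9),
    ("plain", "lots", "small", 7.4), ("gourmet", "lots", "small", 16.0),
]

# Assumed effect coding: +1 for the labelled level, -1 for the other level.
p = np.array([1.0 if r[0] == "gourmet" else -1.0 for r in rows])
o = np.array([1.0 if r[1] == "little" else -1.0 for r in rows])
b = np.array([1.0 if r[2] == "large" else -1.0 for r in rows])
y = np.array([r[3] for r in rows])

# Columns in the same order as the parameter table.
design = np.column_stack([np.ones(len(y)), p, o, p * o, b, p * b, o * b, p * o * b])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

resid = y - design @ beta
n, k = design.shape
mse = resid @ resid / (n - k)            # residual mean square on n - k = 8 d.f.
se = np.sqrt(mse / n)                    # common SE: the +/-1 columns are orthogonal
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print("estimates:", np.round(beta, 3))
print("std error:", round(float(se), 6))
print("t ratios:", np.round(beta / se, 2))
print("R squared:", round(float(r2), 6))
```

Because the design is balanced and the coded columns are mutually orthogonal, every coefficient has the same standard error, which is why a single value is repeated down the Std Error column.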