A regression analysis rests on several assumptions. This section sets out each of these assumptions and discusses the effect that a violation can have on the results obtained. Several procedures for detecting these violations, and the corresponding corrections, are also detailed.
The most obvious assumption upon which a regression analysis is based is that of linearity, which refers to the dependent variable being a linear function of a specific set of independent variables and an error term (Asteriou and Hall 2011). In other words, it is assumed that an equation of the correct functional form is being used and that all the variables have been carefully selected. Several assumptions are also made concerning the error term, namely homoskedasticity, independence of the error term and normality of the error term (Hair et al 1995). Other assumptions made throughout the regression procedure are that the variables are measured accurately and that the explanatory variables are independent of each other. The data must also constitute a random sample. These assumptions are explained in turn below.
form of the model. Omitting relevant variables from the model would lead to ‘biased’ (and therefore incorrect) estimates of the regression coefficients. Similarly, including variables which are not relevant, although it would still yield unbiased coefficient estimates, would inflate the standard errors of the relevant variables, particularly if a relevant and an irrelevant variable are correlated, leading to incorrect conclusions being drawn from the analysis (Schroeder et al 1986).
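The effect of omitting a relevant variable can be illustrated with a small simulation. The sketch below is not taken from the cited sources: the data-generating process, the coefficient values and the variable names are all assumptions chosen for illustration. Because the omitted variable x2 is constructed to be correlated with x1, the slope on x1 in the short regression should settle near 2 + 3 × 0.5 = 3.5 rather than the true value of 2.

```python
import numpy as np

# Hypothetical data-generating process (all values are illustrative assumptions):
#   y = 1 + 2*x1 + 3*x2 + e,  with x2 = 0.5*x1 + v
rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def ols(y, X):
    """Least squares coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


const = np.ones(n)
b_full = ols(y, np.column_stack([const, x1, x2]))   # correctly specified model
b_short = ols(y, np.column_stack([const, x1]))      # x2 wrongly omitted

print("slope on x1, full model: ", b_full[1])
print("slope on x1, x2 omitted: ", b_short[1])
```

The short regression attributes part of the effect of x2 to x1, which is the bias described above.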
Another assumption made with regard to the data is that the independent variables present in the model are not related to each other. Correlation between independent variables is referred to as multicollinearity. While regression coefficients estimated using correlated independent variables are unbiased, the presence of multicollinearity means that the standard errors of these coefficients are larger than they would be in the absence of correlation between independent variables (Schroeder et al 1986). This increase in standard error ultimately means that t-ratios will be smaller, which increases the likelihood that incorrect conclusions are drawn in the hypothesis testing procedure.
There are several ways in which multicollinearity can be detected and hence accounted for; however, they all tend to be difficult to interpret and are often prone to misuse (Wooldridge 2009). The simplest way to deal with multicollinearity is to increase the sample size. By increasing the sample size, standard errors can be reduced and the R² between two independent variables can be reduced, thereby attacking the multicollinearity problem directly (Thomas 1997). In the absence of more data, however, there are two other methods available to a researcher for detecting and dealing with multicollinearity. These methods involve the use of the variance inflation factor (VIF) or tolerance statistics.
The VIF is a measure of the inflation of the variance of regression coefficients caused by the presence of multicollinearity. In other words, the VIF statistic indicates the extent to which an independent variable is correlated with the others. The interpretation of the VIF statistic is considered to be somewhat problematic, as it is read against a scale running from 1 (the smallest value the statistic can take) through to 10 and above. The closer the VIF statistic is to 1, the weaker the correlation between independent variables is assumed to be. If the VIF statistic exceeds the ‘cutoff’ of 10, however, the presence of multicollinearity is deemed to be extreme. Wooldridge (2009, p99) expresses concern regarding the generally accepted ‘cutoff’ point of 10 and describes it as an ‘arbitrary’ value, whilst Schroeder et al (1986) also state that the method of looking for high correlations between variables is far from foolproof.
The tolerance statistic for detecting multicollinearity is similar to the VIF statistic in the sense that both statistics reveal the extent to which an independent variable is explained by the other independent variables. The tolerance statistic reveals the amount of variability of the selected independent variable that is not explained by the other independent variables (Hair et al 1995). A very small tolerance statistic represents high collinearity, and the commonly used cutoff threshold for tolerance is 0.10 (tolerance values below this correspond to VIF values above 10). Hair et al (1995) state that, depending upon the degree of collinearity present, a researcher has several options, including simply dropping or substituting the highly correlated predictor variables or using more complex regression procedures such as ridge regression (see footnote 23).
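The relationship between the two statistics can be made concrete with a short sketch. It assumes the standard definitions, tolerance = 1 − R² and VIF = 1 / tolerance, where R² comes from regressing one independent variable on all the others. The function names and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X must contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / sst


def vif_and_tolerance(X):
    """Return (vif, tolerance), one entry per column of X.

    X holds the independent variables only (no constant column).
    """
    n, k = X.shape
    vif = np.empty(k)
    tol = np.empty(k)
    for j in range(k):
        # Regress variable j on a constant and all remaining variables.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        tol[j] = 1.0 - r_squared(X[:, j], design)
        vif[j] = 1.0 / tol[j]
    return vif, tol


# Illustrative data (an assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
vif, tol = vif_and_tolerance(np.column_stack([x1, x2, x3]))

for name, v, t in zip(["x1", "x2", "x3"], vif, tol):
    print(f"{name}: VIF = {v:.2f}, tolerance = {t:.3f}")
```

In this construction x1 and x2 should fall well beyond the cutoff of 10, while x3 should sit close to the minimum VIF of 1.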
23. According to Hair et al (1995), ridge regressions can be employed as a means to overcome the problems multicollinearity presents. In the presence of multicollinearity least squares estimates will be unbiased, but their
Throughout the regression procedure several important assumptions are also made regarding the error term in the equation. The simplest of these assumptions is that the error terms are normally distributed with a mean of zero. A second assumption regarding the error term is that there is no autocorrelation present, meaning that the error term for one particular observation is in no way correlated with the error terms from other observations. The presence of autocorrelation will ultimately lead to underestimation of the standard errors and inflated t-ratio statistics, causing incorrect conclusions to be drawn from the hypothesis testing procedure (Schroeder et al 1986). A common method for the detection of autocorrelation is the Durbin-Watson statistic, which is valid when a constant term is present in the regression model, there are no lagged dependent variables and serial correlation is assumed to be first order only (Asteriou and Hall 2011). When autocorrelation has been detected, a technique commonly used to overcome the problem is generalised least squares (GLS), which is based on OLS regression but uses transformed variables (Schroeder et al 1986). Alternatively, robust estimation methods are also appropriate for fairly arbitrary forms of serial correlation and heteroskedasticity (Wooldridge 2009).
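As a minimal sketch of the Durbin-Watson statistic, the function below computes the sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation. The simulated residual series and the autoregressive coefficient of 0.8 are assumptions made purely for illustration.

```python
import numpy as np


def durbin_watson(resid):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


rng = np.random.default_rng(1)
n = 2000

# Residuals with no autocorrelation.
white = rng.normal(size=n)

# Residuals following a first-order autoregressive process (rho = 0.8, assumed).
shocks = rng.normal(size=n)
ar1 = np.empty(n)
ar1[0] = shocks[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print("DW, uncorrelated residuals:", dw_white)
print("DW, AR(1) residuals:       ", dw_ar1)
```

For a first-order process the statistic is approximately 2(1 − ρ), so the second series should give a value in the region of 0.4.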
Yet another assumption regarding the error term is that of homoskedastic errors. Several methods are available to a researcher for the detection of heteroskedasticity, such as a series of Lagrange multiplier (LM) tests like the Breusch-Pagan, Harvey-Godfrey and Park tests, or the more complex Goldfeld-Quandt test. Generalised least squares can also be used to circumvent the effects of heteroskedasticity (Asteriou and Hall 2011).
The assumption of homoskedasticity states that the variance of the error term is constant and is not related to any factor (observation) present in the analysis. When the variance of the error term varies across observations, this assumption is violated and heteroskedasticity is present (Asteriou and Hall 2011). As with autocorrelation, violation of this assumption leads to inaccurate standard errors of the regression coefficients, therefore biasing hypothesis test results (Schroeder et al 1986).
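One way to see how an LM test for heteroskedasticity operates is the sketch below. It uses the n·R² form of the Breusch-Pagan test (the studentised variant usually attributed to Koenker), which is an assumption about which version to illustrate: the squared OLS residuals are regressed on the original regressors and the sample size is multiplied by the R² of that auxiliary regression. The simulated data and function names are hypothetical. With one regressor the statistic is compared with a chi-square distribution with one degree of freedom, whose 5% critical value is 3.841.

```python
import numpy as np


def ols_residuals(y, X):
    """Residuals from an OLS regression of y on X (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def breusch_pagan_lm(y, X):
    """LM = n * R^2 from regressing squared residuals on the regressors in X."""
    n = len(y)
    e2 = ols_residuals(y, X) ** 2
    u = ols_residuals(e2, X)                      # auxiliary regression residuals
    r2 = 1.0 - np.sum(u ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2


# Illustrative data (assumed): one regressor, two alternative error structures.
rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

y_homo = 1.0 + 2.0 * x + rng.normal(size=n)        # constant error variance
y_hetero = 1.0 + 2.0 * x + x * rng.normal(size=n)  # error std. dev. grows with x

lm_homo = breusch_pagan_lm(y_homo, X)
lm_hetero = breusch_pagan_lm(y_hetero, X)
print("LM, homoskedastic errors:  ", lm_homo)
print("LM, heteroskedastic errors:", lm_hetero)
```

A statistic far above the critical value, as expected for the second series, would lead to rejection of the null hypothesis of homoskedasticity.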
Heteroskedasticity can be encountered when using dummy variables. Dummy variables are sometimes employed in regression analyses and, unlike continuous variables, can only assume a limited number of values. They can usually only assume the values 0 or 1, and are therefore referred to as dichotomous or binary variables, and they can be specified within an econometric model as either an independent variable or a dependent variable. The use of dummy variables is deemed appropriate in situations where the theory implies that behaviour differs between different time periods, or between two groups within a cross section (Schroeder et al 1986).
In order to overcome the difficulties heteroskedasticity presents, robust estimation methods can be used, which adjust standard errors so that they remain valid even in the presence of heteroskedasticity of unknown form (Wooldridge 2009). This means that, whatever the level of heteroskedasticity present in the population, robust estimation will still report valid statistics. Although robust estimation is not equivalent to OLS under the classical assumptions, it is less sensitive to violations of these assumptions, in particular the assumption that the disturbances are normally distributed (Thomas 1997).
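A minimal sketch of one such adjustment is given below. It computes White's heteroskedasticity-consistent standard errors (the basic HC0 form, with no small-sample correction) alongside the classical OLS standard errors. The choice of HC0, the function name and the simulated data are assumptions made for illustration and are not drawn from the cited texts.

```python
import numpy as np


def ols_with_robust_se(y, X):
    """Return (beta, classical_se, robust_se) for an OLS fit of y on X.

    X must include a constant column. The robust errors use the HC0
    sandwich form (X'X)^-1 X' diag(e^2) X (X'X)^-1.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    e = y - X @ beta

    # Classical standard errors: s^2 (X'X)^-1.
    s2 = (e @ e) / (n - k)
    classical_se = np.sqrt(np.diag(s2 * xtx_inv))

    # Robust standard errors: each observation is weighted by its own e_i^2.
    meat = X.T @ (X * (e ** 2)[:, None])
    robust_se = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, classical_se, robust_se


# Illustrative data (assumed): the error spread increases with |x|.
rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_hetero = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)
y_homo = 1.0 + 2.0 * x + rng.normal(size=n)

beta_h, cse_h, rse_h = ols_with_robust_se(y_hetero, X)
beta_c, cse_c, rse_c = ols_with_robust_se(y_homo, X)
print("heteroskedastic data, slope SE (classical, robust):", cse_h[1], rse_h[1])
print("homoskedastic data,   slope SE (classical, robust):", cse_c[1], rse_c[1])
```

The coefficient estimates are the same under both approaches; only the standard errors change. With homoskedastic errors the two sets of standard errors should be close, whereas in the heteroskedastic case the classical slope standard error understates the uncertainty.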