One method that may be used to control for endogeneity caused by omitted variable bias or reverse causality is the panel data model, which I used in Chapter 6. Panel data models require a longitudinal element in the data, as repeated measurements of each individual are needed at different time points. Compared to cross-sectional regression models, panel data models split the unobserved error term (𝜀) into two components, an individual-specific unobservable effect (𝑣𝑖) and a random error term (𝑢𝑖𝑡):
𝜀𝑖𝑡 = 𝑣𝑖 + 𝑢𝑖𝑡 (3.1)
While the random error term represents idiosyncratic shocks, the individual-specific unobservable effect refers to the unobserved characteristics of the individual that remain
constant over time. Additionally, both 𝑣𝑖 and 𝑢𝑖𝑡 are assumed to be random variables from a normal distribution:
𝑣𝑖 ~ 𝑁(0, 𝜎𝑣²) (3.2)
𝑢𝑖𝑡 ~ 𝑁(0, 𝜎𝑢²) (3.3)
Given the presence of an individual specific effect, it is extremely likely that the values of the dependent variable will cluster together for each individual. Such clustering can be
accounted for by using the generalised least squares estimator, which allows for the fact that the error term for a particular individual will be correlated over the waves of a panel. The critical issue for panel data analysis is whether the individual-specific unobservable effect is correlated with the set of observed regressors. Failure to correctly account for the correlation between the two factors when estimating such models may lead to inconsistent estimates of the slope coefficients (Jones et al., 2013).
One panel data model that may be used is the random effects GLS model (GLS). Unlike the pooled estimator, which applies a cross sectional regression model to a panel data structure, the GLS model takes into account the fact that there are repeated observations for each individual, and adjusts the error term for autocorrelation. For each observation i in time period t, the GLS model can be given by:
𝑦𝑖𝑡 = 𝛽0 + 𝛽1𝑥1𝑖𝑡 + 𝛽2𝑥2𝑖𝑡 + ⋯ + 𝛽𝑛𝑥𝑛𝑖𝑡 + 𝑣𝑖 + 𝑢𝑖𝑡 , (3.4)
where 𝑖 = 1, 2, …, 𝑁 and 𝑡 = 1, 2, …, 𝑇
Let 𝑦𝑖𝑡 represent the value of the dependent variable 𝑦 for individual 𝑖 at time 𝑡. 𝑥1𝑖𝑡 represents the value of the first explanatory variable for individual 𝑖 at time 𝑡, with the associated coefficient 𝛽1. 𝛽0 is the constant coefficient, the predicted value of 𝑦 when all explanatory variables equal 0. In addition, 𝑣𝑖 represents the time-invariant individual-specific error term and 𝑢𝑖𝑡 represents the assumed random error term for individual 𝑖 at time 𝑡, with 𝐶𝑜𝑣[𝑣𝑖, 𝑢𝑖𝑡|𝑋] = 0 for all 𝑖, 𝑡.
The time-invariant individual-specific error term (𝑣𝑖) is seen to capture the between-subject variation, the cross-sectional variation in the outcome and explanatory variables across individuals. It is also possible to estimate a between effects regression; however, this will always be less efficient than a random effects model, as it ignores the within variation (Cameron and Trivedi 2009). The random error term (𝑢𝑖𝑡) is seen to capture the within-subject variation, the variation in the outcome and explanatory variables over time for each individual.
The GLS model can therefore be seen as a weighted average of the within and between estimators, with the weights determined by the proportion of the between variance compared to the overall variance. Thus, the estimates from the GLS model will approach the pooled estimates when the variance of the individual-specific effect is small relative to the variance of the random error term, and will approach the within estimates when it is relatively large.
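The weighting can be made explicit with a short worked equation. As a sketch only, assuming a balanced panel with 𝑇 waves and writing σ_v² and σ_u² for the variances of 𝑣𝑖 and 𝑢𝑖𝑡 (notation introduced here for illustration), the random effects GLS estimator is equivalent to OLS on quasi-demeaned data:

```latex
\[
y_{it} - \theta \bar{y}_i
  = \beta_0 (1 - \theta)
  + \beta_1 (x_{1it} - \theta \bar{x}_{1i}) + \cdots
  + \beta_n (x_{nit} - \theta \bar{x}_{ni})
  + (\varepsilon_{it} - \theta \bar{\varepsilon}_i),
\qquad
\theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T \sigma_v^2}}
\]
```

When σ_v² = 0, θ = 0 and the transformation leaves the data unchanged, which gives the pooled estimator. As 𝑇σ_v² grows relative to σ_u², θ approaches 1 and the transformation approaches full mean-differencing, which gives the within estimator.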
However, a key aspect of the GLS model is that it explicitly assumes that the unobserved individual level heterogeneity is unrelated to the vector of explanatory variables (Greene 2003). In reality this is usually an extremely strong assumption, and therefore a model specification which removes unobserved individual level heterogeneity completely may be more appropriate in the majority of situations.
Unlike the GLS model, the fixed effects (FE) model does not require the assumption that the individual-specific error term is uncorrelated with the explanatory variables. There are three ways in which to control for these time-invariant unobservable individual effects: first differencing, the least squares dummy variable (LSDV) estimator and the within estimator. In this thesis I used the within estimator, as it is more efficient than first differencing when the error term is homoskedastic and serially uncorrelated, and yields the same slope estimates as the LSDV estimator without having to estimate a separate dummy variable coefficient for every individual.
The within estimator removes the individual specific error terms by mean-differencing the data, and then estimating an OLS regression on the mean-differenced data.
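\[ (y_{it} - \bar{y}_i) = (\beta_0 - \bar{\beta}_0) + \beta_1 (x_{1it} - \bar{x}_{1i}) + \cdots + \beta_n (x_{nit} - \bar{x}_{ni}) + (v_i - \bar{v}_i) + (u_{it} - \bar{u}_i) \tag{3.5} \]
and therefore:
\[ \ddot{y}_{it} = \ddot{\beta}_0 + \beta_1 \ddot{x}_{1it} + \cdots + \beta_n \ddot{x}_{nit} + \ddot{u}_{it}, \tag{3.6} \]
where 𝑖 = 1, 2, …, 𝑁 and 𝑡 = 1, 2, …, 𝑇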
Let 𝑦̈𝑖𝑡 represent the demeaned value of the dependent variable 𝑦 for individual 𝑖 at time 𝑡. 𝑥̈1𝑖𝑡 represents the demeaned value of the first explanatory variable for individual 𝑖 at time 𝑡, with the associated coefficient 𝛽1. 𝛽̈0 represents the constant coefficient. By definition, the individual-specific error term 𝑣𝑖 is constant across time, so demeaning removes it from the regression model. 𝑢̈𝑖𝑡 represents the demeaned idiosyncratic error term for individual 𝑖 at time 𝑡. Therefore, estimating an OLS model on the demeaned data leads to consistent estimates of the coefficients on the explanatory variables, even if the unobserved individual-specific error term is correlated with one or more of the explanatory variables.
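The mean-differencing step can be illustrated with a small simulation. The following is a minimal sketch rather than the estimation code used in this thesis: the data-generating process, the parameter values and the function names (`simulate_panel`, `ols_slope`, `within_slope`) are all hypothetical, and the model has a single explanatory variable that is deliberately correlated with the individual-specific effect.

```python
import random


def simulate_panel(n_individuals=500, n_waves=5, beta=2.0, seed=0):
    """Simulate y_it = beta * x_it + v_i + u_it, with x_it correlated with v_i."""
    rng = random.Random(seed)
    rows = []  # each row is (individual id, x_it, y_it)
    for i in range(n_individuals):
        v_i = rng.gauss(0.0, 1.0)  # time-invariant individual-specific effect
        for _ in range(n_waves):
            x_it = 0.8 * v_i + rng.gauss(0.0, 1.0)  # regressor depends on v_i
            u_it = rng.gauss(0.0, 1.0)  # idiosyncratic error
            rows.append((i, x_it, beta * x_it + v_i + u_it))
    return rows


def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs with a constant."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def within_slope(rows):
    """Mean-difference x and y within each individual, then apply OLS."""
    sums = {}
    for i, x, y in rows:
        s = sums.setdefault(i, [0.0, 0.0, 0])
        s[0] += x
        s[1] += y
        s[2] += 1
    x_dd = [x - sums[i][0] / sums[i][2] for i, x, _ in rows]
    y_dd = [y - sums[i][1] / sums[i][2] for i, _, y in rows]
    return ols_slope(x_dd, y_dd)


rows = simulate_panel()
pooled = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])
within = within_slope(rows)
print(f"pooled OLS slope: {pooled:.3f}")  # picks up the correlation with v_i
print(f"within slope:     {within:.3f}")  # should lie close to the true beta of 2
```

In this set-up the pooled slope is pushed away from the true coefficient because 𝑥𝑖𝑡 and 𝑣𝑖 move together, whereas the within slope is not, since 𝑣𝑖 drops out once each variable is expressed as a deviation from its individual mean.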
Although the FE estimator is consistent, there are several problems associated with the FE model. Firstly, as discussed by Lancaster (2000), when the number of waves or the number of observations is small, the estimates from FE models may be biased, poorly estimated and inconsistent due to the incidental parameters problem (Neyman and Scott 1948). This is because the 𝑁 incidental parameters cannot be estimated consistently if 𝑇𝑖 is small, as there are only 𝑇𝑖 observations for each individual. This inconsistent estimation of the individual, time-invariant fixed effect can spill over to inconsistent estimation of the model parameters (Cameron and Trivedi 2009).
A second problem with the FE model is that although mean differencing the data will remove the individual specific fixed effect from the model and render the empirical estimates consistent, it will also remove time invariant variables of potential interest from the model, for example gender and ethnicity. In order to account for this, other empirical strategies have been suggested.
One approach that has been suggested is the Mundlak methodology (Mundlak 1978), which parametrises 𝑣𝑖 by including group means of the time varying explanatory variables as additional explanatory variables in the GLS model, and acts as a proxy fixed effects model:
𝑦𝑖𝑡 = 𝛽0 + 𝛽1𝑥1𝑖𝑡 + ⋯ + 𝛽𝑛𝑥𝑛𝑖𝑡 + 𝜑1𝑥̅1𝑖 + ⋯ + 𝜑𝑛𝑥̅𝑛𝑖 + 𝑣𝑖 + 𝑢𝑖𝑡 , (3.7)
where 𝑖 = 1, 2, …, 𝑁 and 𝑡 = 1, 2, …, 𝑇
Let 𝑦𝑖𝑡 represent the value of the dependent variable 𝑦 for individual 𝑖 at time 𝑡. 𝑥𝑖𝑡 represents the value of an explanatory variable 𝑥 for individual 𝑖 at time 𝑡, with the associated coefficient 𝛽. 𝑥̅𝑖 represents the time-averaged value of an explanatory variable 𝑥 for individual 𝑖, with its associated coefficient 𝜑. Once more, 𝑣𝑖 represents the time-invariant individual-specific error term and 𝑢𝑖𝑡 represents the idiosyncratic error term for individual 𝑖 at time 𝑡; as in equation (3.1), their sum 𝜀𝑖𝑡 is the random disturbance term for individual 𝑖 at time 𝑡, with a mean value of 0.
This approach ensures consistent estimation of all within effects, as the deviations from the cluster means should be uncorrelated with the means themselves, the idiosyncratic error term (𝑢𝑖𝑡) and any time-varying covariates. However, the cluster means themselves can still be correlated with the time-invariant individual-specific error term (𝑣𝑖), and this may once more produce inconsistent estimates of the between effects (Cameron and Trivedi 2009).
In order to establish which empirical strategy is preferred, two specification tests can be performed. Firstly, in order to test whether pooled analysis or panel data models are more appropriate, the Breusch-Pagan Lagrange multiplier test (Breusch and Pagan 1979, 1980) can be implemented, which uses the residuals of the pooled OLS model to test whether the error term contains an individual-level variance component. Under the null hypothesis that the individual-level variance component of the error term is zero, a rejection of the null hypothesis implies that a panel data model is needed. Secondly, in order to test whether the GLS model is consistent, the Hausman test (Hausman 1978) can be implemented, which tests the assumption that the unobserved individual-level heterogeneity is uncorrelated with the set of explanatory variables. Under the null hypothesis that the individual-level heterogeneity is uncorrelated with the explanatory variables, a rejection of the null hypothesis implies that the FE model should be used rather than GLS, as the GLS estimates are then inconsistent. If the null hypothesis is not rejected, the GLS model is preferred, as it is more efficient.
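Returning to the Mundlak specification in equation (3.7), the way in which the group means act as a proxy for fixed effects can be sketched in a few lines. This is an illustrative simulation under assumed parameter values, not the code used in this thesis. For brevity it estimates the augmented model by pooled OLS rather than GLS, and the helper names (`solve`, `ols`, `group_means`) are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


def ols(columns, y):
    """OLS coefficients of y on a constant plus the given regressor columns."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(p * q for p, q in zip(c, y)) for c in cols]
    return solve(xtx, xty)


def group_means(ids, values):
    """Return, for every row, the mean of `values` within that row's individual."""
    totals, counts = {}, {}
    for i, v in zip(ids, values):
        totals[i] = totals.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [totals[i] / counts[i] for i in ids]


# Simulated panel: x_it is correlated with the individual-specific effect v_i.
rng = random.Random(0)
ids, x, y = [], [], []
for i in range(300):
    v_i = rng.gauss(0.0, 1.0)
    for _ in range(4):
        x_it = 0.8 * v_i + rng.gauss(0.0, 1.0)
        ids.append(i)
        x.append(x_it)
        y.append(1.0 + 2.0 * x_it + v_i + rng.gauss(0.0, 1.0))

x_bar = group_means(ids, x)
y_bar = group_means(ids, y)

# Mundlak-style regression: y on x_it and the individual mean of x.
_, beta_mundlak, phi = ols([x, x_bar], y)

# Within estimator: OLS on mean-differenced data.
x_dd = [a - b for a, b in zip(x, x_bar)]
y_dd = [a - b for a, b in zip(y, y_bar)]
beta_within = ols([x_dd], y_dd)[1]

print(f"Mundlak coefficient on x:      {beta_mundlak:.6f}")
print(f"within coefficient on x:       {beta_within:.6f}")
print(f"coefficient on the group mean: {phi:.3f}")
```

In this sketch the coefficient on 𝑥𝑖𝑡 from the augmented regression coincides with the within estimate, while the coefficient on the group mean absorbs the correlation between the regressor and 𝑣𝑖. A test of whether the group-mean coefficients are jointly zero is sometimes used as a regression-based counterpart to the Hausman test.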
3.1.2 Instrumental variables
Although panel data models may be able to account for endogeneity caused by omitted variable bias or reverse causality by controlling for unobserved time-invariant individual-level heterogeneity, multiple waves of data are not always available for use, and even if they are, panel data methods are still unable to control for time-varying individual-level heterogeneity. A number of alternative methods have also been developed in order to estimate causal effects by accounting for both observable and unobservable characteristics, including difference-in-differences (DiD) estimators, regression discontinuity designs (RDD) and IV methods. Whereas DiD and RDD estimators require a natural experiment or policy change in order to achieve identification, IV methods exploit random variation in the explanatory variable of interest caused by a variable that is plausibly exogenous to the main equation. I used IV methods in both Chapter 5 and Chapter 6.
To be an appropriate IV in a linear model, an IV, 𝑧, must satisfy two main conditions. Firstly, the IV, 𝑧, must be significantly correlated with the suspected endogenous variable 𝑥:
Corr (𝑧, 𝑥) ≠ 0 (3.8)
Secondly, the IV, 𝑧, must be uncorrelated with the error term, 𝜀 , of the econometric model:
Cov (𝑧, 𝜀 ) = 0 (3.9)
This can once more be displayed intuitively using a DAG. As previously shown in Figure 3.2, the error term, 𝜀, may be associated with the key explanatory variable 𝑥 as well as the dependent variable 𝑦, most commonly through omitted variable bias or reverse causality, potentially causing the estimates to be biased. However, a valid IV, 𝑧, offers a solution to this problem, as this variable is correlated significantly with 𝑥 but not with 𝜀, and is related to 𝑦 only through 𝑥, as shown in Figure 3.3. Therefore, if the IV, 𝑧, is truly uncorrelated with the error term, 𝜀, the endogeneity problem should be eliminated.
Figure 3.3: DAG (nodes 𝑧, 𝑥, 𝑦 and 𝜀) showing an instrumental variable acting as an exogenous form of variation for an explanatory variable in the presence of endogeneity
The simplest IV estimator is the Wald estimator (Wald 1940, Durbin 1954), which uses a single dummy instrument to estimate a model with one endogenous regressor and no covariates (Angrist and Pischke 2009). With no covariates, the regression model can be shown through two equations:
𝑥ᵢ = 𝛽0 + 𝛽1𝑧1ᵢ + 𝜀ᵢ , (3.10)
𝑦ᵢ = 𝜓0 + 𝜓1𝑥̂1ᵢ + 𝜂ᵢ , (3.11)
where 𝑖 = 1, 2, …, 𝑛
In the first stage model, let 𝑥𝑖 be the ith value of an explanatory variable assumed to be endogenous. 𝑧1𝑖 is a binary IV significantly correlated with 𝑥𝑖, with its associated coefficient 𝛽1. 𝛽0 represents the constant coefficient, and 𝜀ᵢ represents the error term, which is assumed to be randomly distributed.
In the second stage of the model, let 𝑦ᵢ be the ith value of the dependent variable 𝑦. 𝑥̂1ᵢ is the prediction of 𝑥𝑖 from the first stage equation, with its associated coefficient 𝜓1. 𝜓0 represents the constant coefficient, and 𝜂ᵢ represents the error term, which is assumed to be randomly distributed.
Given the fact that 𝑧₁ᵢ is a dummy variable that equals 1 with probability 𝑝, it can be shown that the relationship between the IV and the outcome variable can be given by:
and therefore the estimate of 𝜓1, the coefficient on 𝑥̂1ᵢ, can be shown as:
\[ \psi_1 = \frac{E[y_i \mid z_i = 1] - E[y_i \mid z_i = 0]}{E[x_i \mid z_i = 1] - E[x_i \mid z_i = 0]} = \frac{A}{B} \tag{3.13} \]
The numerator, given by 𝐴, is the difference in the mean of 𝑦 between the group of individuals for which 𝑧𝑖 = 1 and the group for which 𝑧𝑖 = 0, which measures the causal effect of 𝑧𝑖 on 𝑦𝑖. The denominator, given by 𝐵, is the difference in the mean of 𝑥 between the group of individuals for which 𝑧𝑖 = 1 and the group for which 𝑧𝑖 = 0, which measures the causal effect of 𝑧𝑖 on 𝑥𝑖. The causal parameter is therefore the ratio of the two differences, known as indirect least squares.
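The ratio of the two mean differences can be computed directly. The sketch below is illustrative only: the simulated data, the true effect of 1.5 and the helper `mean_by_z` are assumptions made for the example, with an unobserved confounder entering both 𝑥 and 𝑦 so that OLS is biased.

```python
import random
from statistics import mean

rng = random.Random(1)
true_effect = 1.5  # hypothetical causal effect of x on y

z, x, y = [], [], []
for _ in range(20000):
    z_i = 1 if rng.random() < 0.5 else 0  # binary instrument
    e_i = rng.gauss(0.0, 1.0)  # unobserved confounder of x and y
    x_i = 0.5 + 1.0 * z_i + e_i + rng.gauss(0.0, 1.0)
    y_i = 2.0 + true_effect * x_i + 2.0 * e_i + rng.gauss(0.0, 1.0)
    z.append(z_i)
    x.append(x_i)
    y.append(y_i)


def mean_by_z(values, level):
    """Mean of `values` among observations whose instrument equals `level`."""
    return mean(v for v, z_i in zip(values, z) if z_i == level)


a = mean_by_z(y, 1) - mean_by_z(y, 0)  # numerator A: effect of z on y
b = mean_by_z(x, 1) - mean_by_z(x, 0)  # denominator B: effect of z on x
wald = a / b

# Simple OLS slope of y on x for comparison (biased by the confounder).
x_bar, y_bar = mean(x), mean(y)
ols_slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)

print(f"Wald estimate: {wald:.3f}")
print(f"OLS estimate:  {ols_slope:.3f}")
```

Under these assumptions the Wald estimate should lie close to the true effect, whereas the OLS slope is inflated by the confounder.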
If additional covariates are included in the model specification, the simplest and most commonly used technique is the 2SLS model (Angrist et al., 1995). This model is made up of two consecutive OLS regressions, with the additional exogenous covariates included in both the first stage and second stage equations. Whereas a ‘just-identified’ model has the same number of endogenous variables and IVs, the 2SLS model also allows for ‘over-identified’ models, where there are more IVs than endogenous variables.
However, despite the appealing nature of IV estimators such as 2SLS, there are several problems associated with this method. Firstly, in practice it can be extremely difficult to identify a valid IV strategy, as the criteria for validity discussed previously are extremely strict. Secondly, as discussed in detail by Bound et al. (1995), having a ‘weak’ instrument (an instrument that is not sufficiently correlated with the endogenous variable) may significantly impact the consistency and efficiency of the estimates from 2SLS models. Amongst others, Cragg and Donald (1993) and Stock and Yogo (2002) have proposed formal tests for the weakness of IVs, both with critical values for the first stage F-statistic of the two stage models. In application, a partial first stage F-statistic of less than 10 is generally considered the rule of thumb cut-off point for a weak instrument. Due to the potential weakness of IVs, in certain cases it may in fact be better to use a biased OLS estimate rather than a consistent estimate using IV with weak instruments (Cerulli 2015).
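The two consecutive regressions and the partial first stage F-statistic can be sketched as follows. This is a minimal illustration under assumed data, with one instrument 𝑧, one exogenous covariate 𝑤 and hypothetical helper functions. Running the second stage by hand reproduces the 2SLS coefficients but not the correct standard errors, which dedicated IV routines adjust for.

```python
import random


def solve(a, b):
    """Solve a @ coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


def ols(columns, y):
    """OLS coefficients of y on a constant plus the given regressor columns."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(p * q for p, q in zip(c, y)) for c in cols]
    return solve(xtx, xty)


def fitted(columns, coefs):
    """Fitted values implied by coefficients returned from ols()."""
    return [
        coefs[0] + sum(c * col[k] for c, col in zip(coefs[1:], columns))
        for k in range(len(columns[0]))
    ]


def rss(y, y_hat):
    """Residual sum of squares."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


# Simulated data (all values hypothetical): e is an unobserved confounder.
rng = random.Random(2)
n = 5000
w = [rng.gauss(0.0, 1.0) for _ in range(n)]  # exogenous covariate
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # instrument
e = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [0.6 * z[k] + 0.5 * w[k] + e[k] + rng.gauss(0.0, 1.0) for k in range(n)]
y = [1.0 + 1.5 * x[k] + 0.7 * w[k] + 2.0 * e[k] + rng.gauss(0.0, 1.0) for k in range(n)]

# First stage: regress x on the instrument and the exogenous covariate.
first_stage = ols([z, w], x)
x_hat = fitted([z, w], first_stage)

# Partial F-statistic for the excluded instrument (one restriction,
# three estimated parameters in the unrestricted first stage).
rss_unrestricted = rss(x, x_hat)
rss_restricted = rss(x, fitted([w], ols([w], x)))
f_stat = (rss_restricted - rss_unrestricted) / (rss_unrestricted / (n - 3))

# Second stage: regress y on the predicted x and the same covariate.
second_stage = ols([x_hat, w], y)
beta_2sls = second_stage[1]
beta_ols = ols([x, w], y)[1]

print(f"first stage partial F: {f_stat:.1f}")
print(f"2SLS estimate:         {beta_2sls:.3f}")
print(f"OLS estimate:          {beta_ols:.3f}")
```

With a strong instrument, as assumed here, the F-statistic is far above the rule of thumb value of 10 and the 2SLS estimate should lie close to the assumed effect of 1.5, while OLS remains biased by the confounder.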
Thirdly, as argued by Nelson and Startz (1990) and Staiger and Stock (1997), IV models will be biased in finite samples. Staiger and Stock (1997) have compared the finite sample bias of IV estimators to the relative bias of the OLS estimator, concluding that the inverse of the first stage partial F-statistic can be used as an estimate of the relative bias of IV estimators. For instance, in the case that the F-statistic is equal to 10 (the previously discussed rule of thumb cut-off point), the finite sample bias of a correctly specified IV estimator will be roughly 10% of the bias from the OLS model.
Finally, testing the relationship between the instrument (𝑧) and the error term (𝜀) is notoriously difficult in practice. Formally, testing this condition requires an over-identified setting (where there is access to more than one IV for the endogenous variable), a relatively rare occurrence given the problems in finding a single IV for an endogenous variable. Furthermore, even in an over-identified setting, statistical tests for exogeneity (such as those developed by Sargan 1958 and Hansen 1982) can only test the joint exogeneity of all the available IVs, not the exogeneity of each individual IV.
The parameters identified from IV models should be interpreted as the local average treatment effect (LATE) (Angrist et al., 1996) rather than an average treatment effect (ATE) for the whole population or the average treatment effect on the treated (ATET). This distinction is essential, as the effects estimated using different instruments and sub-populations are specific to those instruments and sub-populations, and should not be extrapolated to the whole population.
To clarify the theory underpinning the LATE, assume a simplified model with a binary outcome variable 𝑦, a binary, endogenous treatment variable 𝑥 and a binary IV 𝑧, which is significantly associated with 𝑥. In this context, the LATE framework partitions the population into four potential statuses, as shown in Figure 3.4:
                      𝑧 = 0
                      𝑥 = 0           𝑥 = 1
𝑧 = 1     𝑥 = 0       Never-taker     Defier
          𝑥 = 1       Complier        Always-taker
Never-taker: an individual who, independent of 𝑧, does not take the treatment 𝑥
Defier: an individual who takes the treatment 𝑥 when 𝑧 = 0, but does not take the treatment 𝑥 when 𝑧 = 1
Complier: an individual who takes the treatment 𝑥 when 𝑧 = 1, but does not take the treatment 𝑥 when 𝑧 = 0
Always-taker: an individual who, independent of 𝑧, takes the treatment 𝑥
As it is not possible to know if a given individual in the sample is a never-taker, defier, complier or always-taker, there is a missing observation issue. Under the assumption that the effect of the treatment is heterogeneous across the sample, and provided that there are no defiers (the monotonicity assumption), it can be proved that the Wald estimator is equal to the ATE in the sub-group of compliers only, and therefore the LATE can be shown as:
\[ LATE = \frac{y_1 - y_0}{z_1 - z_0} = \psi_1 \tag{3.14} \]
The numerator represents the difference between the averages of 𝑦 in the sub-sample of compliers. The denominator represents the difference between the frequency of treated individuals amongst the compliers having 𝑧 = 1 and the frequency of the untreated individuals amongst the compliers having 𝑧 = 0.
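The link between the Wald estimator and the compliers can be checked with a simulation in which each individual's type is known, something that is never the case in real data. The sketch below rests on several assumptions made purely for illustration: there are no defiers, the outcome is continuous rather than binary, and the type shares, baselines and treatment effects are hypothetical values.

```python
import random
from statistics import mean

rng = random.Random(3)

# Hypothetical type-specific baseline outcomes and treatment effects.
baseline = {"never-taker": 0.0, "complier": 1.0, "always-taker": 2.0}
effect = {"never-taker": 0.0, "complier": 2.0, "always-taker": 4.0}

z, x, y = [], [], []
for _ in range(50000):
    draw = rng.random()
    if draw < 0.3:
        kind = "never-taker"
    elif draw < 0.8:
        kind = "complier"
    else:
        kind = "always-taker"
    z_i = 1 if rng.random() < 0.5 else 0  # randomly assigned binary instrument
    # Always-takers are always treated; compliers are treated only when z = 1.
    x_i = 1 if kind == "always-taker" or (kind == "complier" and z_i == 1) else 0
    y_i = baseline[kind] + effect[kind] * x_i + rng.gauss(0.0, 1.0)
    z.append(z_i)
    x.append(x_i)
    y.append(y_i)


def mean_by_z(values, level):
    """Mean of `values` among observations whose instrument equals `level`."""
    return mean(v for v, z_i in zip(values, z) if z_i == level)


wald = (mean_by_z(y, 1) - mean_by_z(y, 0)) / (mean_by_z(x, 1) - mean_by_z(x, 0))

# Naive comparison of treated and untreated individuals, ignoring z.
naive = mean(yi for yi, xi in zip(y, x) if xi == 1) - mean(
    yi for yi, xi in zip(y, x) if xi == 0
)

print(f"Wald estimate:            {wald:.3f}")
print(f"complier effect (assumed): {effect['complier']:.3f}")
print(f"naive treated-untreated:  {naive:.3f}")
```

Under these assumptions the Wald estimate should recover the effect assigned to the compliers, not the effect for always-takers or an average over all types, while the naive comparison mixes treatment effects with differences in baseline outcomes across types.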
However, the use of the LATE calculated from IV models has some disadvantages. Firstly, the LATE is the ATE for the non-observable compliant sub-population, and therefore is not generalisable to the whole population. Although this non-observable sub-population can often be regarded as the population of interest, it means that generating policy-relevant conclusions using IV methods can be challenging.
Secondly, as the LATE calculates the ATE for the compliant sub-population, this effect will be different depending on the instrument being used. Although this means that the estimates from two different instruments are not directly comparable, as argued by Angrist and Fernandez-Val (2010), differences in estimates from different IV strategies need not signal a failure of the exclusion restriction. Instead, these differences may be attributable to
differences in the types of people who are affected by the underlying experiments implicit in any IV identification strategy.
3.2 Missing Data
The vast majority of secondary datasets, especially those with longitudinal designs, have a certain degree of missing data, most commonly due to attrition or non-response. Attrition refers to the loss of sample members over time. Sample members may drop out of surveys for a number of reasons, including moving house, lack of availability or a lack of interest. Non-response refers to individuals not answering certain questions in the survey. Sample members may not respond to certain questions for a number of reasons, for example not